
Welcome to the new website of sysGen. Please use our contact form if you have any questions about our content.

Due to the widening chip crisis and the resulting significant price increases from the major IT manufacturers, online price calculations are currently not possible. Please note that prices quoted via our website may therefore differ from the final offer!

Superior Effectiveness of Liquid Cooling with Proven Deployments at Scale

Reduces Costs and Environmental Impact

  • Liquid cooling reduces power usage and lowers carbon emissions from fossil fuel power plants. Reducing the environmental impact of today's data centers is becoming a corporate responsibility.

Switching from Air Conditioning to More Effective Liquid Cooling Reduces OPEX by More Than 40%

  • Switching from air conditioners to liquid cooling technology saves energy
  • Additional power is saved by reducing system fan operation
  • An average one-year payback on the facility investment increases ROI

Liquid Cooling Efficiency Dramatically Improves the PUE of Data Centers with High-Performance, High-Power CPUs and GPUs

  • Liquid is fundamentally more efficient than air at removing heat, by up to 1000x
  • Future generations of CPUs and GPUs may require liquid cooling as air cooling capacity is exceeded
  • The highest-performance, highest-density servers can be supported, increasing computing capacity per sq. ft.

Benefits

Why Liquid Cooling is Becoming Necessary

Costs to Cool Data Centers and Computer Rooms Are Growing

  • Latest-generation CPUs: 280 watts
  • Latest-generation GPUs: 500 watts

Liquid Cooling Solutions

Liquid to Liquid Cooling

Air to Liquid Cooling

Immersion Cooling

Liquid cooling protects against performance drops, letting you make full use of your IT infrastructure.

What you need to pay attention to, and what we want to protect you from:
The performance of a GPU can be affected by the operating temperature. Although NVIDIA GPUs have a maximum temperature below which their use is supported, certification testing has shown that operating at a lower temperature can significantly improve performance in some cases.
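
One way to check whether temperature is actually costing you performance is to watch the GPU's clocks and throttle flags under load. The minimal Python sketch below polls nvidia-smi for each GPU's temperature, SM clock, and active throttle-reason bitmask; the query fields are standard nvidia-smi fields, but the script itself is only an illustration, not part of any certification tooling.

```python
import subprocess

# Per-GPU fields to poll; all are standard nvidia-smi query fields
# (see `nvidia-smi --help-query-gpu`).
FIELDS = "temperature.gpu,clocks.sm,clocks_throttle_reasons.active"

def gpu_thermal_status():
    """Print temperature, SM clock, and throttle-reason bitmask per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    for idx, line in enumerate(out.strip().splitlines()):
        temp, sm_clock, reasons = (f.strip() for f in line.split(","))
        # A nonzero bitmask means the GPU is currently being slowed
        # down (thermal, power, or other limits).
        print(f"GPU {idx}: {temp} C, SM clock {sm_clock} MHz, "
              f"throttle reasons {reasons}")

if __name__ == "__main__":
    gpu_thermal_status()
```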

A typical system has multiple fans for air cooling, but the amount of cooling each device in the enclosure receives depends heavily on the physical layout of all components, especially the position of the GPUs in relation to fans, baffles, dividers, risers, etc. Many enterprise systems have programmable fan curves that set each fan's speed based on GPU temperature. Often, the default fan curve is based on a generic base system and does not take into account the presence of GPUs and similar devices that can generate a lot of heat.

In one example of a system with four GPUs, certification testing showed that one of the GPUs was operating at a much higher temperature than the other three. This was simply due to the specific internal layout of the components and the airflow characteristics in this particular model. There was no way to anticipate this. Adjusting the fan curve eliminated the hot spot and improved the overall performance of the system.  
Because systems can vary widely in design, there is no universal fan curve profile that can be recommended. Instead, the certification process is invaluable in identifying potential performance issues due to temperature and verifying which fan curves produce the best results for each server tested. These profiles are documented for each certified system. 
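
To make the idea of a fan curve concrete, here is a minimal sketch: a table of (temperature, duty cycle) points with linear interpolation between them, driven by the hottest GPU. Reading temperatures via nvidia-smi is standard; actually applying a duty cycle is BMC- and vendor-specific, so `set_fan_duty` below is a hypothetical placeholder, and the curve points are illustrative rather than a recommended profile.

```python
import subprocess

# Illustrative fan curve: (GPU temperature in °C, fan duty cycle in %).
# Real curves are tuned per chassis during certification testing.
FAN_CURVE = [(30, 30), (50, 45), (65, 65), (80, 100)]

def max_gpu_temperature():
    """Hottest GPU temperature in the system, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return max(int(t) for t in out.split())

def duty_for(temp):
    """Linearly interpolate the duty cycle for a given temperature."""
    if temp <= FAN_CURVE[0][0]:
        return FAN_CURVE[0][1]
    for (t0, d0), (t1, d1) in zip(FAN_CURVE, FAN_CURVE[1:]):
        if temp <= t1:
            return d0 + (d1 - d0) * (temp - t0) / (t1 - t0)
    return FAN_CURVE[-1][1]

def set_fan_duty(duty_percent):
    # Hypothetical hook: actually applying a duty cycle is BMC- and
    # vendor-specific (often raw IPMI commands that differ per board).
    print(f"would set fans to {duty_percent:.0f}%")

if __name__ == "__main__":
    set_fan_duty(duty_for(max_gpu_temperature()))
```
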
Both BIOS settings and firmware versions can affect performance and functionality. This is especially the case with NUMA-based systems. The certification process determines the optimal BIOS settings for best performance and identifies the best values for other configurations, such as NIC PCI settings and boot grub settings. Multi-node testing also determines the optimal settings for the network switch. In one example, a system initially achieved RDMA communications at nearly 300 Gb/s and TCP at 120 Gb/s. Once the settings were properly configured, performance increased to 360 Gb/s for RDMA and 180 Gb/s for TCP, both of which were near line rate.
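
Some of these settings can be sanity-checked from a running Linux system. As one small illustration, the sketch below verifies that expected kernel boot parameters actually made it onto the command line; which parameters matter is entirely system-specific, so the two listed here are assumptions for the example, not recommendations.

```python
from pathlib import Path

# Kernel parameters one might expect after tuning. Which flags actually
# matter is system-specific; these two are illustrative assumptions.
EXPECTED_PARAMS = ["iommu=pt", "numa_balancing=disable"]

def check_kernel_cmdline(expected=EXPECTED_PARAMS):
    """Report whether each expected boot parameter is present."""
    cmdline = Path("/proc/cmdline").read_text().split()
    for param in expected:
        status = "ok" if param in cmdline else "MISSING"
        print(f"{param}: {status}")

if __name__ == "__main__":
    check_kernel_cmdline()
```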

Fast transfer of data to and from the GPU is critical for optimal performance in accelerated workloads. In addition to the need to transfer large amounts of data to the GPU for both training and inferencing described above, transferring data between GPUs can become a bottleneck during the so-called all-reduce phase of multi-GPU training. This is also true for the network interface, as data is often loaded from remote memory or transferred between systems in the case of multi-node algorithms. Since GPUs and NICs are installed in a system via the PCI bus, improper placement can result in suboptimal performance. 
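
As a crude way to see whether host-to-device transfers are anywhere near the expected PCIe bandwidth, the sketch below times repeated copies of a pinned buffer to the GPU. It assumes PyTorch with CUDA support is installed; it is a rough check, not the measurement methodology of the certification suite.

```python
import time

import torch  # assumes PyTorch built with CUDA support

def h2d_bandwidth_gb_s(size_mb=256, repeats=10):
    """Time repeated pinned host-to-device copies and return GB/s."""
    host = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8,
                       pin_memory=True)
    dev = torch.empty_like(host, device="cuda")
    dev.copy_(host, non_blocking=True)  # warm-up copy
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        dev.copy_(host, non_blocking=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return size_mb * repeats / 1024 / elapsed

if __name__ == "__main__":
    print(f"host-to-device: {h2d_bandwidth_gb_s():.1f} GB/s")
```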

NVIDIA GPUs use 16 PCIe lanes (referred to as x16), which allow 16 parallel channels of data transfer. NVIDIA NICs can use 8, 16, or 32 lanes depending on the model. In a typical server or workstation, the PCI bus is divided into slots with different numbers of lanes to accommodate the needs of different peripherals. In some cases, this is further affected by the use of a PCI riser card, and the number of lanes assigned to a slot can also be set in the BIOS. If a GPU or NIC is installed without taking these factors into account, the full capacity of the device may not be used. For example, an x16 device could be installed in an x8 slot, or the slot could be limited to x8 or less by a BIOS setting. The certification process uncovers these issues, and the optimal PCI slot configuration is documented when a system is certified.
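
On Linux, a slot misconfiguration like this is easy to detect from sysfs: each PCI device exposes its negotiated and maximum link width. The minimal sketch below flags any device running below its capability, e.g. an x16 GPU negotiated down to x8.

```python
from pathlib import Path

def link_width_report():
    """Flag PCI devices whose negotiated link width is below their maximum,
    e.g. an x16 GPU sitting in (or limited to) an x8 slot."""
    for dev in Path("/sys/bus/pci/devices").iterdir():
        try:
            cur = int((dev / "current_link_width").read_text())
            mx = int((dev / "max_link_width").read_text())
        except (FileNotFoundError, ValueError, OSError):
            continue  # device exposes no PCIe link attributes
        if 0 < cur < mx:
            print(f"{dev.name}: running x{cur}, capable of x{mx}")

if __name__ == "__main__":
    link_width_report()
```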

NUMA (Non-Uniform Memory Access) is a design for configuring microprocessors in a multi-CPU system, used by certain chip architectures. In such systems, devices such as GPUs and NICs have an affinity with a particular CPU because they are connected to the bus belonging to that CPU (in a so-called NUMA node). When running applications that involve communication between GPUs or between a GPU and a NIC, performance can be severely degraded if the devices are not paired optimally. In the worst case, data has to be transferred between NUMA nodes, which leads to high latency.
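
A quick way to inspect these affinities is `nvidia-smi topo -m`, which prints the GPU/NIC/CPU topology matrix. The small sketch below does a narrower check directly from sysfs: it compares the NUMA node of one GPU and one NIC. The two PCI addresses are hypothetical examples you would replace with real ones from lspci.

```python
from pathlib import Path

def numa_node_of(pci_addr):
    """NUMA node a PCI device is attached to (-1 if none is reported)."""
    path = Path("/sys/bus/pci/devices") / pci_addr / "numa_node"
    return int(path.read_text())

# Hypothetical PCI addresses; find real ones with `lspci | grep -i nvidia`
# (GPU) and `lspci | grep -iE 'ethernet|infiniband'` (NIC).
GPU_ADDR, NIC_ADDR = "0000:3b:00.0", "0000:5e:00.0"

if __name__ == "__main__":
    if numa_node_of(GPU_ADDR) == numa_node_of(NIC_ADDR):
        print("GPU and NIC share a NUMA node: good pairing")
    else:
        print("GPU and NIC are on different NUMA nodes: expect added latency")
```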

Insufficient memory is a cause of poor performance in many applications, especially in machine learning, both training and inference. In training, an algorithm typically analyzes large amounts of data, and the system should be able to hold enough data in memory to keep the training algorithm running. In inferencing, memory requirements depend on the use case. For batch inferencing, the more data that can be held in memory, the faster it can be processed. However, with streaming inferencing, data is typically analyzed as it comes in, so the amount of memory required may not be as great. By analyzing the results of numerous certification tests, NVIDIA was able to establish memory size guidelines based on the number of GPUs and the amount of GPU memory. In one case, a system with four GPUs and 128 GB of RAM failed the certification test. When the memory was increased to 384 GB, overall performance increased by 45% and the system was able to pass certification.
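
As a minimal sketch of such a sizing guideline, the helper below recommends system RAM as a multiple of total GPU memory. The 2x multiplier is our assumption for illustration, not an official certification figure; with 40 GB GPUs (also an assumption), it would flag the failing 128 GB configuration in the example above.

```python
def recommended_system_ram_gb(num_gpus, gpu_mem_gb, multiplier=2.0):
    """Rule-of-thumb sizing: system RAM as a multiple of total GPU memory.
    The 2x default is an illustrative assumption, not an official figure."""
    return num_gpus * gpu_mem_gb * multiplier

# Assuming four 40 GB GPUs: 320 GB recommended, so the 128 GB
# configuration from the example above would be flagged as too small.
print(recommended_system_ram_gb(4, 40))  # -> 320.0
```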

Real World Examples

  • 1,520 general-purpose CPU nodes (3rd Gen Intel® Xeon® Scalable Processors)
  • 42 GPU nodes (NVIDIA A100)
  • Supermicro SuperBlades

Cooling Type – Liquid to Liquid (Direct Liquid Cooling)
Read More About Osaka University and Supermicro Servers

Cooling Type – Liquid to Liquid (Direct Liquid Cooling)
Press Release: Supermicro Scalable Liquid-Cooled Supercomputing Cluster Deployed at Lawrence Livermore National Laboratory for COVID-19 Research
ITRI and KDDI worked together to design and build an immersion-cooled edge data center in cooperation with several global IT companies, including our partner Supermicro


Cooling Type – Immersion

sysGen Offers a Range of Integrated Liquid Cooling Solutions
In Cooperation with Supermicro

  • Direct-to-Chip Solutions with Rack Integration, Testing, Burn-In, and Onsite Installation
  • Experience with Active RDHx and Onsite Integration
  • Abundant Experience with Multiple Partners

Systems Available with Liquid Cooling On Request

2U HGX A100 4-GPU System

4U HGX A100 8-GPU Systems


2U 4-Node BigTwin® Servers
(pictured: single node)


SuperBlade

Ultra Servers


Request your quote from sysGen

To the inquiry:
Quiet and cool turnkey systems and clusters for the performance-hungry workloads of our time, at affordable prices.

We look forward to your inquiry, whether from medium-sized businesses, the arts, or any area of industry and research.

Learn More About How We Can Help You

  • Reduce Your Data Center Costs and Lower the PUE
  • Achieve Higher Performance from your CPUs and GPUs
  • Get Higher Density with Innovative Solutions