Superior Effectiveness of Liquid Cooling with Proven Deployments at Scale

Reduces Costs and Environmental Impact

  • Liquid cooling reduces power usage and lowers carbon emissions from fossil fuel power plants. Reducing the environmental impact of today's data centers is becoming a corporate responsibility.

Switching from Air Conditioning to More Effective Liquid Cooling Reduces OPEX by more than 40%

  • Switching from air conditioners to liquid cooling technology saves energy
  • Additional power is saved by reducing system fan operation
  • A one-year average payback on the facility investment increases the ROI

Liquid Cooling Efficiency Dramatically Improves the PUE of Data Centers for High-Performance, High-Power CPUs and GPUs

  • Liquid is fundamentally more efficient at removing heat, by up to 1000x (see the rough comparison below)
  • Future generations of CPUs and GPUs may require liquid cooling as air cooling capacity is exceeded
  • The highest-performance and highest-density servers can be supported, increasing computing capacity per square foot
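
As a rough back-of-the-envelope comparison (a simplified sketch using textbook coolant properties at room temperature, not a vendor figure), the heat a coolant stream carries away per unit of flow is:

```latex
% Heat removed by a coolant stream:
\[
  Q = \rho \,\dot{V}\, c_p \,\Delta T
\]
% Volumetric heat capacity (rho * c_p) at the same flow rate and temperature rise:
%   water: 1000 kg/m^3 x 4.18 kJ/(kg K)  ->  about 4180 kJ/(m^3 K)
%   air:   1.2  kg/m^3 x 1.0  kJ/(kg K)  ->  about 1.2  kJ/(m^3 K)
\[
  \frac{Q_\text{water}}{Q_\text{air}} \approx \frac{1000 \times 4.18}{1.2 \times 1.0} \approx 3500
\]
```

The realizable advantage in a data center is lower because flow rates, pumping power, and heat-exchanger design all play a role, which is why figures such as "up to 1000x" are typically quoted.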

Benefits

Liquid Cooling Benefits Image

Why Liquid Cooling is Becoming Necessary

Costs to cool data centers and computer rooms are growing:
  • Latest-generation CPUs: 280 watts
  • Latest-generation GPUs: 500 watts

Liquid Cooling Solutions

Liquid to Liquid Cooling

LIQUID TO LIQUID COOLING Image

Air to Liquid Cooling

AIR TO LIQUID COOLING Image

Immersion Cooling

IMMERSION COOLING Image

Liquid Cooling Protects Against Performance Drops and Lets You Make Full Use of Your IT Infrastructure

What to watch out for, and what we want to protect you from:
High operating temperature
The performance of a GPU can be affected by the operating temperature. Although NVIDIA GPUs have a maximum temperature below which their use is supported, certification testing has shown that operating at a lower temperature can significantly improve performance in some cases.

A typical system has multiple fans for air cooling, but the amount of cooling for each device in the enclosure depends heavily on the physical layout of all components, especially the position of the GPUs in relation to fans, baffles, dividers, risers, etc. Many enterprise systems have programmable fan curves that set the fan speed based on the GPU temperature for each fan. Often, the default fan curve is based on a generic base system and does not take into account the presence of GPUs and similar devices that can generate a lot of heat. 

In one example of a system with four GPUs, certification testing showed that one of the GPUs was operating at a much higher temperature than the other three. This was simply due to the specific internal layout of the components and the airflow characteristics in this particular model. There was no way to anticipate this. Adjusting the fan curve eliminated the hot spot and improved the overall performance of the system.  
Because systems can vary widely in design, there is no universal fan curve profile that can be recommended. Instead, the certification process is invaluable in identifying potential performance issues due to temperature and verifying which fan curves produce the best results for each server tested. These profiles are documented for each certified system. 
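
As a purely illustrative sketch (the temperature breakpoints and the control mechanism below are assumptions, not the documented certification profile for any server), a simple GPU-temperature-driven fan curve could look like this:

```python
import subprocess

# Illustrative fan curve: GPU temperature (deg C) -> fan duty cycle (%).
# The breakpoints are placeholders; real values come from the certification
# results documented for each server model.
FAN_CURVE = [(40, 30), (55, 45), (70, 65), (80, 85), (90, 100)]

def gpu_temperatures():
    """Read per-GPU temperatures via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]

def target_duty(temp_c):
    """Map a temperature to a fan duty using the piecewise curve."""
    duty = FAN_CURVE[0][1]
    for threshold, value in FAN_CURVE:
        if temp_c >= threshold:
            duty = value
    return duty

if __name__ == "__main__":
    for idx, temp in enumerate(gpu_temperatures()):
        # Applying the duty is BMC/vendor specific (e.g. a raw IPMI command
        # or a vendor utility), so this sketch only reports the suggestion.
        print(f"GPU {idx}: {temp} C -> suggested fan duty {target_duty(temp)}%")
```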
Non-optimal BIOS and firmware settings
Both BIOS settings and firmware versions can affect performance and functionality. This is especially the case with NUMA-based systems. The certification process determines the optimal BIOS settings for best performance and identifies the best values for other configurations, such as NIC PCI settings and boot grub settings. Multi-node testing also determines the optimal settings for the network switch. In one example, a system initially achieved RDMA communications at nearly 300 Gb/s and TCP at 120 Gb/s. Once the settings were properly configured, performance increased to 360 Gb/s for RDMA and 180 Gb/s for TCP, both of which were near line rate.
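
To see whether tuned BIOS, NIC, and switch settings actually bring the fabric close to line rate, throughput can be re-measured after each change. A minimal sketch, assuming the standard iperf3 and perftest (ib_write_bw) utilities are installed and a peer node is already running their server sides:

```python
import subprocess

PEER = "10.0.0.2"  # placeholder address of the peer node

def run(cmd):
    """Run a benchmark command and show its raw output."""
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)

# TCP throughput with 8 parallel streams (requires `iperf3 -s` on the peer).
run(["iperf3", "-c", PEER, "-P", "8", "-t", "10"])

# RDMA write bandwidth (requires ib_write_bw listening on the peer).
run(["ib_write_bw", PEER])
```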
Improper PCI slot configuration
Fast transfer of data to and from the GPU is critical for optimal performance in accelerated workloads. In addition to the large amounts of data that must be moved to the GPU for both training and inferencing, transferring data between GPUs can become a bottleneck during the so-called all-reduce phase of multi-GPU training. The same applies to the network interface, as data is often loaded from remote memory or transferred between systems in the case of multi-node algorithms. Since GPUs and NICs are installed in a system via the PCI bus, improper placement can result in suboptimal performance.

NVIDIA GPUs use 16 PCIe lanes (referred to as x16), which allow 16 parallel channels of data transfer. NVIDIA NICs can use 8, 16, or 32 lanes depending on the model. In a typical server or workstation, the PCI bus is divided into slots with different numbers of lanes to accommodate the needs of different peripherals. In some cases, this is further affected by the use of a PCI riser card, and the number of lanes assigned to a slot can also be set in the BIOS. If a GPU or NIC is installed in a slot without taking these factors into account, the full capacity of the device may not be used. For example, an x16 device could be installed in an x8 slot, or the slot could be limited to x8 or less by a BIOS setting. The certification process uncovers these issues, and the optimal PCI slot configuration is documented when a system is certified.
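
One quick way to spot an x16 GPU that has only negotiated x8, or a downgraded PCIe generation, is to compare the maximum and current link parameters reported by nvidia-smi. A minimal sketch:

```python
import subprocess

# Query maximum vs. currently negotiated PCIe generation and link width.
FIELDS = ("name,pcie.link.gen.max,pcie.link.gen.current,"
          "pcie.link.width.max,pcie.link.width.current")

out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    name, gen_max, gen_cur, width_max, width_cur = [f.strip() for f in line.split(",")]
    # Note: gen.current can drop at idle due to power management,
    # so check under load before suspecting the slot or BIOS.
    flag = "" if (gen_max, width_max) == (gen_cur, width_cur) else "  <-- check slot/BIOS"
    print(f"{name}: gen {gen_cur}/{gen_max}, width x{width_cur}/x{width_max}{flag}")
```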
Lack of awareness of the NUMA topology
NUMA (Non-Uniform Memory Access) is a design used by certain chip architectures for configuring microprocessors in a multi-CPU system. In such systems, devices such as GPUs and NICs have an affinity with a particular CPU because they are connected to the bus belonging to that CPU (in a so-called NUMA node). When running applications that involve communication between GPUs or between a GPU and a NIC, performance can be severely degraded if the devices are not paired optimally. In the worst case, data has to be transferred between NUMA nodes, which leads to high latency.
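
On Linux, the NUMA node each PCI device belongs to can be read from sysfs, which makes it easy to check whether a GPU and the NIC it exchanges data with share a node (nvidia-smi topo -m gives a similar, more detailed picture). A minimal sketch:

```python
from pathlib import Path

# PCI class codes: 0x0302xx = 3D controller (data center GPU),
# 0x0200xx = Ethernet controller, 0x0207xx = InfiniBand controller.
CLASSES = {"0x030200": "GPU", "0x020000": "NIC", "0x020700": "NIC"}

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    cls = (dev / "class").read_text().strip()
    kind = CLASSES.get(cls)
    if kind:
        # numa_node is -1 when the platform reports no NUMA affinity.
        node = (dev / "numa_node").read_text().strip()
        print(f"{kind}  {dev.name}  NUMA node {node}")
```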
Effects of insufficient system memory (RAM)
Insufficient memory is a cause of poor performance in many applications, especially in machine learning, both training and inference. In training, an algorithm typically analyzes large amounts of data, and the system should be able to hold enough data in memory to keep the training algorithm running. In inferencing, memory requirements depend on the use case. For batch inferencing, the more data that can be held in memory, the faster it can be processed. However, with streaming inferencing, data is typically analyzed as it comes in, so the amount of memory required may not be as great. By analyzing the results of numerous certification tests, NVIDIA was able to establish memory size guidelines based on the number of GPUs and the amount of GPU memory. In one case, a system with four GPUs and 128 GB of RAM failed the certification test. When the memory was increased to 384 GB, overall performance increased by 45% and the system was able to pass certification.
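
The exact guidelines come from the certification data, but a commonly cited rule of thumb, used here purely as an illustrative assumption, is to provision system RAM of roughly twice the aggregate GPU memory:

```python
def recommended_system_ram_gb(num_gpus: int, gpu_mem_gb: int, factor: float = 2.0) -> float:
    """Rule-of-thumb system RAM sizing: factor x total GPU memory.

    The factor of 2.0 is an assumption for illustration,
    not NVIDIA's certification formula.
    """
    return factor * num_gpus * gpu_mem_gb

# Example: four 80 GB GPUs -> about 640 GB of system RAM by this rule of thumb.
print(recommended_system_ram_gb(4, 80))
```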

Real World Examples

Osaka University – SQUID
SQUID Supercomputer Image
  • 1,520 general-purpose CPU nodes (3rd Gen Intel® Xeon® Scalable Processors)
  • 42 GPU nodes (NVIDIA A100)
  • Supermicro SuperBlades

Cooling Type – Liquid to Liquid (Direct Liquid Cooling)
Read More About Osaka University and Supermicro Servers
Lawrence Livermore National Laboratory "Ruby"

Cooling Type – Liquid to Liquid (Direct Liquid Cooling)
Press Release: Supermicro Scalable Liquid-Cooled Supercomputing Cluster Deployed at Lawrence Livermore National Laboratory for COVID-19 Research
Ruby Supercomputer Image
ITRI x KDDI: Immersion Cooling Edge Data Center
ITRI and KDDI worked together to design and build the immersion cooling edge data center in cooperation with several global IT companies, including our partner Supermicro.

Cooling Type – Immersion

sysGen Offers a Range of Integrated Liquid Cooling Solutions
In Cooperation with Supermicro

  • Direct To Chip Solutions with Rack Integration, Test, Burn-in and Onsite Installation
  • Experience with Active RDHx and Onsite Integration
  • Extensive Experience with Multiple Partners

Systems Available with Liquid Cooling On Request

2U HGX A100 Image

2U HGX A100 4-GPU System

4U HGX A100 Image

4U HGX A100 8-GPU Systems


2U 4-Node BigTwin Image

2U 4-Node BigTwin® Servers
(pictured: single node)


Superblade Image

SuperBlade

Ultraservers Image

Ultra Servers


Request your quote from sysGen

To the inquiry form:
Quiet and COOL turnkey systems and clusters for today's power-hungry workloads at affordable prices.
Ready-to-use systems including racks with InfiniBand and/or Ethernet networking. Also available as turnkey solutions.
With sysGen, you overcome the most important challenges for enterprise AI workloads:
Assembling an end-to-end AI solution from various products and integrating it into existing infrastructures.
High performance is critical for AI, machine learning and data analytics workloads. This also includes fast deployment.
Moving from proof-of-concept to enterprise-wide deployment requires effective scaling through efficient use of resources. This is how you ensure manageability, availability of systems and management of infrastructure costs.
We look forward to your inquiry, whether from medium-sized businesses, the arts, or any area of industry and research.

To Learn More About How We Can Help You

  • Reduce Your Data Center Costs and Lower the PUE
  • Achieve Higher Performance from your CPUs and GPUs
  • Get Higher Density with Innovative Solutions