SUPERIOR LIQUID COOLING EFFECTIVENESS WITH
PROVEN LARGE-SCALE APPLICATION

REDUCES COSTS AND ENVIRONMENTAL IMPACT

  • Liquid cooling reduces power consumption and, with it, carbon dioxide emissions from fossil fuel power plants. Reducing the environmental impact of today's data centers is becoming a corporate responsibility.

CONVERSION FROM AIR CONDITIONING TO MORE EFFECTIVE LIQUID COOLING
REDUCES OPERATING COSTS BY MORE THAN 40%

  • Switching from air conditioning to liquid cooling saves energy.
  • Reduced fan operation in the system saves additional energy.
  • An average payback period of one year improves ROI.

  • Liquid is up to 1,000 times more efficient than air at dissipating heat.
  • Future CPU and GPU generations may require liquid cooling as air cooling capacity is exceeded.
  • The highest performance, highest density servers can be supported, increasing compute capacity per square foot.

ADVANTAGES

WHY LIQUID COOLING BECOMES NECESSARY

Data center/computer room cooling costs are increasing

Latest generation of CPUs: 280 watts

Latest generation of GPUs: 500 watts

Liquid Cooling Solutions

LIQUID-LIQUID COOLING

AIR-LIQUID COOLING

IMMERSION COOLING

LIQUID COOLING PROTECTS AGAINST PERFORMANCE DEGRADATION,
SO YOU CAN TAKE FULL ADVANTAGE OF YOUR IT INFRASTRUCTURE

What you need to watch out for, and what we want to protect you from:
High operating temperature
The performance of a GPU can be affected by the operating temperature. Although NVIDIA GPUs have a maximum temperature below which their use is supported, certification testing has shown that operating at a lower temperature can significantly improve performance in some cases.

A typical system has multiple fans for air cooling, but the amount of cooling for each device in the chassis depends heavily on the physical layout of all components, especially the location of GPUs in relation to fans, baffles, dividers, risers, etc. Many enterprise systems have programmable fan curves that set the fan speed based on the GPU temperature for each fan. Often, the default fan curve is based on a generic base system and does not take into account the presence of GPUs and similar devices that can generate a lot of heat.

In an example of a system with four GPUs, certification tests showed that one of the GPUs operated at a much higher temperature than the other three. This was simply due to the specific internal arrangement of the components and the airflow characteristics in this particular model. There was no way to anticipate this. Adjusting the fan curve eliminated the hot spot and improved the overall performance of the system.

Because systems can vary widely in design, there is no universal fan curve profile that can be recommended. Instead, the certification process is invaluable in identifying potential performance issues due to temperature and verifying which fan curves produce the best results for each server tested. These profiles are documented for each certified system.
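
To make the idea of a programmable fan curve concrete, here is a minimal sketch that maps a GPU temperature to a fan duty cycle by linear interpolation. The curve points and the name `fan_duty` are hypothetical illustrations, not values from any certified profile; the right curve for a given server has to come from testing that specific system.

```python
from bisect import bisect_right

# Hypothetical curve points: (GPU temperature in °C, fan duty in %).
# These are illustrative only and not taken from any certified system.
FAN_CURVE = [(30, 20), (50, 35), (70, 60), (85, 100)]


def fan_duty(temp_c, curve=FAN_CURVE):
    """Linearly interpolate the fan duty (%) for a GPU temperature.

    Temperatures outside the curve are clamped to the first or last point.
    """
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return float(curve[0][1])
    if temp_c >= temps[-1]:
        return float(curve[-1][1])
    # Index of the first curve point strictly above temp_c.
    i = bisect_right(temps, temp_c)
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
```

Steepening the curve for the fan nearest a hot GPU (for example, raising the duty at 70 °C) is the kind of adjustment the paragraph above describes.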

Non-optimal BIOS and firmware settings
Both BIOS settings and firmware versions can affect performance and functionality. This is especially the case with NUMA-based systems. The certification process determines the optimal BIOS settings for best performance and identifies the best values for other configurations, such as NIC PCI settings and GRUB boot settings. Multi-node testing also determined the optimal settings for the network switch. In one example, a system initially achieved RDMA communications at nearly 300 Gb/s and TCP at 120 Gb/s. Once the settings were properly configured, performance increased to 360 Gb/s for RDMA and 180 Gb/s for TCP, both of which were near line rate.
Improper PCI slot configuration
Fast transfer of data to and from the GPU is critical for optimal performance in accelerated workloads. In addition to the need described above to transfer large amounts of data to the GPU for both training and inferencing, transferring data between GPUs can become a bottleneck during the so-called all-reduce phase of multi-GPU training. This is also true for the network interface, as data is often loaded from remote memory or transferred between systems in the case of multi-node algorithms. Since GPUs and NICs are installed in a system via the PCI bus, incorrect placement can lead to suboptimal performance.

NVIDIA GPUs use 16 PCIe lanes (referred to as x16), which allow 16 parallel channels of data transfer. NVIDIA NICs can use 8, 16, or 32 lanes depending on the model. In a typical server or workstation, the PCI bus is divided into slots with different numbers of lanes to accommodate the needs of different peripherals. In some cases, this is further affected by the use of a PCI riser card, and the number of lanes per slot can also be set in the BIOS. If a GPU or NIC is installed in the motherboard without taking these factors into account, the full capacity of the device may not be used. For example, an x16 device could be installed in an x8 slot, or the slot could be limited to x8 or less by a BIOS setting. The certification process uncovers these issues, and the optimal PCI slot configuration is documented when a system is certified.
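
As a rough illustration of how such a mismatch can be spotted on a Linux host, the sketch below compares the negotiated and maximum PCIe link width that the kernel exposes in sysfs (`current_link_width` and `max_link_width`). The function names are hypothetical, the check assumes a Linux system with these attributes present, and it is not part of the certification process described above.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs attribute, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def degraded_links(root="/sys/bus/pci/devices"):
    """List (device, current_width, max_width) for devices whose
    negotiated link width is below what the device itself supports."""
    results = []
    base = Path(root)
    if not base.is_dir():
        return results
    for dev in sorted(base.iterdir()):
        cur = read_int(dev / "current_link_width")
        mx = read_int(dev / "max_link_width")
        # Skip devices without link information (missing files or width 0).
        if cur and mx and cur < mx:
            results.append((dev.name, cur, mx))
    return results


if __name__ == "__main__":
    for name, cur, mx in degraded_links():
        print(f"{name}: running at x{cur}, device supports x{mx}")
```

An x16 GPU sitting in an x8 slot would show up here as running at x8 while supporting x16. Some devices legitimately negotiate a narrower link, so a hit is a prompt to check the slot and BIOS settings rather than proof of a fault.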
Lack of awareness of NUMA topology
NUMA (Non-Uniform Memory Access) is a memory design used by certain chip architectures in multi-CPU systems. In such systems, devices such as GPUs and NICs have an affinity with a particular CPU because they are connected to the bus belonging to that CPU (in a so-called NUMA node). When running applications that involve communication between GPUs or between a GPU and a NIC, performance can be severely degraded if the devices are not paired optimally. In the worst case, data has to be transferred between NUMA nodes, which leads to high latency.
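
A minimal sketch of how device pairing could be checked on Linux: the kernel reports each PCI device's NUMA node in the sysfs attribute `numa_node` (with -1 meaning unknown). The helper names and the PCI addresses in the usage note are hypothetical placeholders.

```python
from pathlib import Path


def numa_node(pci_addr, root="/sys/bus/pci/devices"):
    """Return the NUMA node of a PCI device, or None if it is unknown."""
    try:
        text = (Path(root) / pci_addr / "numa_node").read_text()
        node = int(text.strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when no NUMA affinity is known.
    return node if node >= 0 else None


def same_numa_node(addr_a, addr_b, root="/sys/bus/pci/devices"):
    """True only if both devices report the same, known NUMA node."""
    a = numa_node(addr_a, root)
    b = numa_node(addr_b, root)
    return a is not None and a == b
```

For example, `same_numa_node("0000:17:00.0", "0000:18:00.0")` (placeholder addresses for a GPU and a NIC) returning `False` would suggest that traffic between the two has to cross NUMA nodes.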
Effects of insufficient system memory (RAM)
Insufficient memory is a cause of poor performance in many applications, especially in machine learning, both training and inference. In training, an algorithm typically analyzes large amounts of data, and the system should be able to hold enough data in memory to keep the training algorithm running. In inferencing, memory requirements depend on the use case. For batch inferencing, the more data that can be held in memory, the faster it can be processed. However, with streaming inferencing, data is typically analyzed as it comes in, so the amount of memory required may not be as large. By analyzing the results of numerous certification tests, NVIDIA was able to establish memory size guidelines based on the number of GPUs and the amount of GPU memory. In one case, a system with four GPUs and 128 GB of RAM failed the certification test. When the memory was increased to 384 GB, overall performance increased by 45% and the system was able to pass certification.

REAL-WORLD EXAMPLES

Osaka University – SQUID
  • 1,520 general purpose CPU nodes (3rd generation Intel® Xeon® Scalable processors)
  • 42 GPU nodes (NVIDIA A100)
  • Supermicro SuperBlades

Cooling Type - Liquid to Liquid (Direct Liquid Cooling)
Read more about Osaka University and Supermicro servers

Lawrence Livermore National Laboratory "Ruby"

Cooling Type - Liquid to Liquid (Direct Liquid Cooling)
Press Release: Supermicro Scalable Liquid-Cooled Supercomputing Cluster to be Deployed at Lawrence Livermore National Laboratory for COVID-19 Research

ITRI x KDDI: Immersion Cooling for Edge Data Center
ITRI and KDDI, in collaboration with several global IT companies, including our partner Supermicro, designed and built an edge data center with immersion cooling.

Cooling Type - Immersion

SYSGEN OFFERS A RANGE OF INTEGRATED LIQUID COOLING SOLUTIONS
IN COLLABORATION WITH SUPERMICRO

  • Direct-to-chip solutions with rack integration, test, burn-in and onsite installation
  • Experience with Active RDHx and onsite integration
  • Diverse experience with multiple partners

SYSTEMS WITH LIQUID COOLING AVAILABLE ON REQUEST

2U HGX A100
4-GPU SYSTEM

4U HGX A100
8-GPU SYSTEMS

2U 4-NODE
BIGTWIN® SERVERS
(IN PICTURE: SINGLE NODE)

SUPERBLADE

ULTRA SERVERS

REQUEST YOUR QUOTE FROM SYSGEN

With sysGen, you overcome the most important challenges for AI workloads in your company:

Assembling an end-to-end AI solution from various products and integrating it into existing infrastructure.

High performance is critical for AI, machine learning, and data analytics workloads. This also includes fast deployment.

Moving from proof-of-concept to enterprise-wide deployment requires effective scaling through efficient resource utilization. This is how you ensure manageability, system availability, and infrastructure cost management.

Quiet and COOL turnkey systems and clusters for today's power-hungry workloads at affordable prices.

Ready-to-use systems including racks with Infiniband and/or Ethernet networking. Also available as turnkey solutions.

LEARN MORE ABOUT HOW WE CAN HELP YOU

  • Reduce the cost of your data center and lower the PUE value
  • Achieve higher performance from your CPUs and GPUs
  • Achieve higher density with innovative solutions

CONTACT US

Please use our contact form for your inquiry.
Thank you in advance for your interest in our products, services and solutions.

Alternatively, you can use our server / workstation / PC inquiry form for detailed system inquiries.
We look forward to hearing from you.