NVIDIA H100 Tensor Core GPU Overview
The NVIDIA® H100 Tensor Core GPU, based on the NVIDIA Hopper GPU architecture, represents the next major leap in accelerated computing performance for NVIDIA's data center platforms. The H100 accelerates diverse workloads, from small enterprise jobs to exascale HPC and trillion-parameter AI models. It is the world's most advanced chip, manufactured on TSMC's custom 4N process with 80 billion transistors and numerous architectural enhancements.
BUILT FOR THE CONVERGENCE OF SIMULATION, DATA ANALYTICS, AND AI.
HGX H100 8-GPU
The fully connected NVSwitch topology allows each H100 to communicate simultaneously with every other H100. This communication runs at the full NVLink bidirectional speed of 900 gigabytes per second (GB/s), more than 14 times the bandwidth of the current PCIe Gen4 x16 bus.
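The quoted ratio can be sanity-checked with back-of-envelope arithmetic. The figure below assumes roughly 64 GB/s bidirectional for a PCIe Gen4 x16 link (a common approximation that ignores protocol overhead), which is not stated in the text above:

```python
# Back-of-envelope check of the NVLink vs. PCIe Gen4 comparison.
# PCIe Gen4 x16: ~16 GT/s * 16 lanes ~= 32 GB/s per direction,
# ~64 GB/s bidirectional (approximate; ignores encoding overhead).
pcie_gen4_x16_bidir_gbs = 64
nvlink4_bidir_gbs = 900  # fourth-generation NVLink, per H100

ratio = nvlink4_bidir_gbs / pcie_gen4_x16_bidir_gbs
print(f"NVLink is ~{ratio:.1f}x PCIe Gen4 x16")  # ~14.1x
```

The result (about 14.1x) matches the "more than 14 times" claim.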
The third-generation NVSwitch also features new hardware acceleration for collective operations with multicast and NVIDIA SHARP in-network reductions. Combined with the faster NVLink speed, the effective bandwidth for common AI collective operations such as All-Reduce increases by three times compared to the HGX A100. NVSwitch acceleration of collective operations also significantly reduces the load on the GPU.
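One way to see why in-switch reduction helps: in a classic ring All-Reduce, each GPU must send roughly twice its payload over its links, whereas with an in-network (SHARP-style) reduction each GPU sends its payload once and the switch multicasts the reduced result back. The following is a rough textbook traffic model, not NVIDIA's internal accounting:

```python
# Rough per-GPU traffic model for an All-Reduce of payload S (in GB).
# Standard textbook estimates; illustrative only.

def ring_allreduce_sent_per_gpu(s_gb: float, n: int) -> float:
    """Ring All-Reduce: each GPU sends 2*(N-1)/N * S over its links."""
    return 2 * (n - 1) / n * s_gb

def inswitch_allreduce_sent_per_gpu(s_gb: float) -> float:
    """In-switch (SHARP-style) reduction: each GPU sends S once;
    the switch reduces and multicasts the result."""
    return s_gb

n, s = 8, 1.0  # 8 GPUs, 1 GB payload per GPU
print(ring_allreduce_sent_per_gpu(s, n))   # 1.75 GB sent per GPU
print(inswitch_allreduce_sent_per_gpu(s))  # 1.0 GB sent per GPU
```

Halving per-GPU traffic, combined with the 1.5x faster links, is consistent with the roughly 3x effective All-Reduce bandwidth gain quoted above.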
| | HGX A100 8-GPU | HGX H100 8-GPU | Improvement |
|---|---|---|---|
| FP8 | - | 32,000 TFLOPS | 6X (vs. A100 FP16) |
| FP16 | 4,992 TFLOPS | 16,000 TFLOPS | 3X |
| FP64 | 156 TFLOPS | 480 TFLOPS | 3X |
| In-network compute | 0 | 3.6 TFLOPS | Infinite |
| Host CPU interface | 8x PCIe Gen4 x16 | 8x PCIe Gen5 x16 | 2X |
| Bisection bandwidth | 2.4 TB/s | 3.6 TB/s | 1.5X |
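The improvement factors in the table can be reproduced directly from its throughput columns; a quick consistency check using only the numbers quoted above:

```python
# Consistency check on the HGX A100 vs. HGX H100 8-GPU table.
a100_fp16_tflops = 4992
h100_fp16_tflops = 16000
h100_fp8_tflops = 32000
a100_fp64_tflops = 156
h100_fp64_tflops = 480

print(h100_fp16_tflops / a100_fp16_tflops)  # ~3.2  -> quoted "3X"
print(h100_fp64_tflops / a100_fp64_tflops)  # ~3.08 -> quoted "3X"
print(h100_fp8_tflops / a100_fp16_tflops)   # ~6.4  -> quoted "6X (vs. A100 FP16)"
print(3.6 / 2.4)                            # 1.5   -> bisection bandwidth
```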
HGX H100 8-GPU with NVLink network support
The emerging class of exascale HPC and trillion-parameter AI models, for tasks such as accurate conversational AI, can take months to train even on supercomputers. Compressing this to business speed and completing training within hours requires seamless high-speed communication between every GPU in a server cluster.
To address these large-scale use cases, the new NVLink and NVSwitch are designed to let the HGX H100 8-GPU scale up to a much larger NVLink domain via the new NVLink Network. A second version of the HGX H100 8-GPU ships with this NVLink Network support.
| | 256-GPU A100 pod | 256-GPU H100 pod | Improvement |
|---|---|---|---|
| NVLink domain | 8 GPUs | 256 GPUs | 32X |
| FP8 | - | 1,024 PFLOPS | 6X (vs. A100 FP16) |
| FP16 | 160 PFLOPS | 512 PFLOPS | 3X |
| FP64 | 5 PFLOPS | 15 PFLOPS | 3X |
| In-network compute | 0 | 192 TFLOPS | Infinite |
| Bisection bandwidth | 6.4 TB/s | 70 TB/s | 11X |
TARGET USE CASES AND PERFORMANCE BENEFITS
With the dramatic increase in HGX H100 compute and networking capabilities, AI and HPC application performance improves substantially. Today's mainstream AI and HPC models fit comfortably within the aggregate GPU memory of a single node; for models such as BERT-Large and Mask R-CNN, the HGX H100 is the most performance-efficient training solution. More advanced and larger models must span the aggregate GPU memory of multiple nodes. For a deep learning recommendation model (DLRM) with terabytes of embedding tables, or a large mixture-of-experts (MoE) model for natural language processing, the HGX H100 with NVLink Network accelerates the key communication bottleneck and is the best solution for this class of workload. Figure 4 of the NVIDIA H100 GPU architecture white paper shows the additional performance gains enabled by the NVLink Network.
HGX H100 4-GPU
The HGX H100 4-GPU baseboard hosts four H100 GPUs interconnected directly with fourth-generation NVLink.
The point-to-point peer NVLink bandwidth between any two H100s is 300 GB/s bidirectional, roughly five times the bandwidth of today's PCIe Gen4 x16 bus.
The HGX H100 4-GPU form factor is optimized for dense HPC deployment:
- Multiple HGX H100 4-GPU baseboards can be packed into a 1U-high liquid-cooled system to maximize GPU density per rack.
- The HGX H100 4-GPU features a fully PCIe-switchless architecture that connects directly to the CPU, lowering the system bill of materials and saving power.
- For CPU-intensive workloads, the HGX H100 4-GPU can be paired with two CPU sockets to raise the CPU-to-GPU ratio for a more balanced system configuration.