NVIDIA H100 Tensor Core GPU Overview

The complexity of artificial intelligence (AI), high-performance computing (HPC), and data analytics is increasing exponentially, requiring scientists and engineers to deploy state-of-the-art computing platforms. The NVIDIA Hopper GPU architecture delivers the highest compute performance at low latency and integrates a wide range of features for data center-scale computing.
The NVIDIA® H100 Tensor Core GPU, based on the NVIDIA Hopper GPU architecture, represents the next major leap in accelerated computing performance for NVIDIA's data center platforms. The H100 accelerates diverse workloads, from small enterprise workloads to exascale HPC and trillion-parameter AI models. It is the world's most advanced chip ever built, manufactured on TSMC's custom 4N process with 80 billion transistors and numerous architectural enhancements.

PURPOSE-BUILT FOR THE CONVERGENCE OF SIMULATION, DATA ANALYTICS, AND AI.

With the NVIDIA H100 Tensor Core GPU, you benefit from unprecedented performance, scalability, and security for every workload. With the NVIDIA NVLink Switch System, up to 256 H100 GPUs can be connected to accelerate exascale workloads, while the dedicated Transformer Engine supports trillion-parameter language models. The H100 builds on innovations in the NVIDIA Hopper™ architecture to deliver industry-leading conversational AI and accelerate large language models by up to 30x over the previous generation.
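As a rough illustration of how the Transformer Engine's FP8 path is typically exercised from PyTorch, the sketch below uses NVIDIA's open-source transformer_engine package; the layer size, batch size, and scaling-recipe settings are illustrative assumptions rather than anything mandated by the H100 platform.

```python
# Minimal sketch of driving H100 FP8 Tensor Cores from PyTorch through the
# open-source Transformer Engine library. Layer sizes are illustrative only.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling FP8 recipe; HYBRID uses E4M3 for forward and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

# Inside the autocast region, supported matmuls execute on FP8 Tensor Cores.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape)  # torch.Size([8, 4096])
```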

HGX H100 8-GPU

The HGX H100 8-GPU is the key building block of the new Hopper generation of GPU servers. It hosts eight H100 Tensor Core GPUs and four third-generation NVSwitches. Each H100 GPU has multiple fourth-generation NVLink ports and connects to all four NVSwitches. Each NVSwitch is a fully non-blocking switch that fully connects all eight H100 Tensor Core GPUs.
Figure 1. HGX H100 8-GPU with NVSwitch-connected topology
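One hedged way to observe this all-to-all connectivity from software, assuming a CUDA-enabled PyTorch build running on such an 8-GPU node, is to query peer access between every pair of devices; on a fully NVSwitch-connected node, every pair should report peer access.

```python
# Sketch: check that every GPU pair in the node reports CUDA peer access,
# a software-visible proxy for the fully connected NVSwitch fabric.
# Assumes all 8 GPUs of the node are visible to this process.
import torch

num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src == dst:
            continue
        ok = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU{src} -> GPU{dst}: peer access {'yes' if ok else 'no'}")
```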

This fully connected NVSwitch topology allows every H100 to communicate with every other H100 simultaneously. This communication runs at the NVLink bidirectional speed of 900 gigabytes per second (GB/s), more than 14 times the bandwidth of the current PCIe Gen4 x16 bus.
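The roughly 14x figure follows from simple arithmetic: fourth-generation NVLink offers 900 GB/s of total bidirectional bandwidth per GPU, while a PCIe Gen4 x16 link provides about 64 GB/s bidirectional (around 32 GB/s per direction, nominal). A small worked check:

```python
# Worked check of the NVLink vs. PCIe Gen4 x16 bandwidth ratio quoted above.
nvlink_bidir_gb_s = 900        # GB/s, total bidirectional 4th-gen NVLink per GPU
pcie_gen4_x16_gb_s = 64        # GB/s bidirectional (nominal, ~32 GB/s per direction)

ratio = nvlink_bidir_gb_s / pcie_gen4_x16_gb_s
print(f"NVLink / PCIe Gen4 x16 = {ratio:.1f}x")   # about 14x
```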

The third-generation NVSwitch also provides new hardware acceleration for collective operations, with multicast and NVIDIA SHARP in-network reductions. Combined with the faster NVLink speed, the effective bandwidth for common AI collective operations such as All-Reduce triples compared to the HGX A100. NVSwitch acceleration of collectives also significantly reduces the load on the GPUs.
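All-Reduce is the collective that most data-parallel training loops rely on for gradient averaging, and when it is issued through NCCL it uses the NVLink/NVSwitch fabric (including SHARP in-network reductions, where available) without any application changes. A minimal sketch, assuming one process per GPU launched with torchrun:

```python
# Minimal sketch: an NCCL All-Reduce across the GPUs of one node.
# Assumed launch: torchrun --nproc_per_node=8 allreduce_sketch.py
# NCCL selects the NVLink/NVSwitch path automatically when it is available.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes its own tensor; All-Reduce sums them in place.
    x = torch.full((1024, 1024), float(rank), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if rank == 0:
        # With 8 ranks, every element holds the sum 0 + 1 + ... + 7 = 28.0.
        print("after all-reduce:", x[0, 0].item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```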

|   | HGX A100 8-GPU | HGX H100 8-GPU | Improvement rate |
| --- | --- | --- | --- |
| FP8 | - | 32,000 TFLOPS | 6X (compared to A100 FP16) |
| FP16 | 4,992 TFLOPS | 16,000 TFLOPS | 3X |
| FP64 | 156 TFLOPS | 480 TFLOPS | 3X |
| In-network compute | 0 | 3.6 TFLOPS | Infinite |
| Interface to the host CPU | 8x PCIe Gen4 x16 | 8x PCIe Gen5 x16 | 2X |
| Bisection bandwidth | 2.4 TB/s | 3.6 TB/s | 1.5X |

Table 1. Comparison between the HGX A100 8-GPU and the new HGX H100 8-GPU
*Note: FP performance includes sparsity.

HGX H100 8-GPU with NVLink network support

The emerging class of exascale HPC and trillion-parameter AI models, such as those behind accurate conversational AI, takes months to train even on supercomputers. Compressing this training time to business speed and finishing within hours requires seamless, high-speed communication between every GPU in a server cluster.

To handle these large use cases, the new NVLink and NVSwitch are designed to let the HGX H100 8-GPU scale up and support a much larger NVLink domain through the new NVLink Network. A second version of the HGX H100 8-GPU provides this NVLink Network support.

Figure 2. High-level block diagram of the HGX H100 8-GPU with NVLink network support
System nodes built with the HGX H100 8-GPU with NVLink Network support can be fully connected to other systems through Octal Small Form Factor Pluggable (OSFP) LinkX cables and the new external NVLink Switch. This connection enables NVLink domains of up to 256 GPUs. Figure 3 shows the cluster topology.
Figure 3. 256 H100 GPU pod
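From the application's perspective, a 256-GPU NVLink domain still looks like a single NCCL job; the difference is that cross-node collectives can traverse the external NVLink switches rather than InfiniBand. A hedged sketch of how such a job might be initialized (the node count, launcher flags, and environment variables follow standard torchrun conventions and are not specific to NVLink Network):

```python
# Sketch: initializing a 32-node x 8-GPU (256-rank) NCCL job.
# Assumed launch on every node (values are illustrative):
#   torchrun --nnodes=32 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 32 nodes x 8 GPUs: one communicator spanning the whole pod.
assert dist.get_world_size() == 256

# Collectives issued here span all 256 GPUs; whether a given hop travels over
# in-node NVSwitch or the external NVLink switches is decided below NCCL.
grad = torch.randn(4096, 4096, device="cuda")
dist.all_reduce(grad)
```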
|   | 256 A100 GPU pod | 256 H100 GPU pod | Improvement rate |
| --- | --- | --- | --- |
| NVLink domain | 8 GPUs | 256 GPUs | 32X |
| FP8 | - | 1,024 PFLOPS | 6X (compared to A100 FP16) |
| FP16 | 160 PFLOPS | 512 PFLOPS | 3X |
| FP64 | 5 PFLOPS | 15 PFLOPS | 3X |
| In-network compute | 0 | 192 TFLOPS | Infinite |
| Bisection bandwidth | 6.4 TB/s | 70 TB/s | 11X |

Table 2. Comparison between a 256 A100 GPU pod and a 256 H100 GPU pod
*Note: FP performance includes sparsity.

TARGETED USE CASES AND PERFORMANCE BENEFITS

With the dramatic increase in HGX H100 compute and networking capabilities, the performance of AI and HPC applications improves greatly. Today's mainstream AI and HPC models can reside entirely within the aggregate GPU memory of a single node; for models such as BERT-Large and Mask R-CNN, the HGX H100 is the most performance-efficient training solution. More advanced and larger AI and HPC models need the aggregate GPU memory of multiple nodes to fit. For example, for a deep learning recommendation model (DLRM) with terabyte-scale embedding tables or a large mixture-of-experts (MoE) natural language processing model, the HGX H100 with NVLink Network accelerates the key communication bottleneck and is the best solution for this class of workload. Figure 4, taken from the NVIDIA H100 GPU architecture whitepaper, shows the additional performance gains enabled by the NVLink Network.

Figure 4. Application performance gains when comparing A100, H100, and H100 + NVLink Network system configurations
All performance data is preliminary and based on current expectations and subject to change as products ship. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink network where indicated.
# GPUs: Climate Modeling 1K, LQCD 1K, Genomics 8, 3D-FFT 256, MT-NLG 32 (batch sizes: 4 for A100, 60 for H100 at 1 second, 8 for A100 and 64 for H100 at 1.5 and 2 seconds), MRCNN 8 (batch 32), GPT-3 16B 512 (batch 256), DLRM 128 (batch 64K), GPT-3 16K (batch 512), MoE 8K (batch 512, one expert per GPU)
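Whether a model "fits within the aggregate GPU memory of a single node" comes down to a rough byte count of weights, gradients, and optimizer state against the node's combined HBM. The back-of-the-envelope sketch below assumes mixed-precision Adam training (2 bytes per parameter each for bf16 weights and gradients, plus 12 bytes for FP32 master weights and optimizer moments) and 80 GB of HBM per H100; activations and framework overhead are ignored, so real footprints are larger.

```python
# Back-of-the-envelope check: does a model's training state fit in one node's
# aggregate HBM? Assumptions as stated above; activations are ignored.
def training_state_gb(params_billion: float) -> float:
    bytes_per_param = 2 + 2 + 12   # bf16 weights + bf16 grads + fp32 master/Adam state
    return params_billion * 1e9 * bytes_per_param / 1e9


node_hbm_gb = 8 * 80               # assumed aggregate HBM of an HGX H100 8-GPU node

for params_b in (0.34, 13, 175):   # e.g. BERT-Large-scale, 13B, and 175B-class models
    need = training_state_gb(params_b)
    verdict = "fits in one node" if need <= node_hbm_gb else "needs multiple nodes"
    print(f"{params_b:>6.2f}B params -> ~{need:,.0f} GB of state ({verdict})")
```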

HGX H100 4-GPU

In addition to the 8-GPU version, the HGX family also offers a 4-GPU version in which the GPUs are directly connected with fourth-generation NVLink.
Figure 5. HGX H100 4-GPU NVLink topology

The H100-to-H100 point-to-point peer NVLink bandwidth is 300 GB/s bidirectional, which is about 5 times the bandwidth of today's PCIe Gen4 x16 bus.
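A hedged way to observe GPU-to-GPU transfer bandwidth from user code, assuming at least two CUDA GPUs visible to PyTorch, is to time a large device-to-device copy; the result reflects whatever path the driver selects (typically NVLink when peer access is enabled) and should be read as an approximation rather than a specification measurement.

```python
# Sketch: rough one-directional GPU0 -> GPU1 copy bandwidth.
# Assumes at least two CUDA GPUs are visible to this process.
import time

import torch

size_bytes = 1 << 30                           # 1 GiB payload
src = torch.empty(size_bytes, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(size_bytes, dtype=torch.uint8, device="cuda:1")

dst.copy_(src)                                 # warm-up copy
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
for _ in range(10):
    dst.copy_(src)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - t0

print(f"~{10 * size_bytes / elapsed / 1e9:.0f} GB/s one-directional")
```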

The HGX H100 4-GPU form factor is optimized for dense HPC deployment:

  • Multiple HGX H100 4-GPU systems can be packed into a 1U-high liquid-cooled chassis to maximize GPU density per rack.
  • The HGX H100 4-GPU uses a fully PCIe-switchless architecture that connects directly to the CPU, lowering system bill-of-materials cost and saving power.
  • For CPU-intensive workloads, the HGX H100 4-GPU can be paired with two CPU sockets to increase the CPU-to-GPU ratio for a more balanced system configuration.