Purpose-built for the convergence of simulation, data analytics, and AI

Massive datasets, huge deep learning models, and complex simulations require multiple GPUs with extremely fast interconnects and a fully accelerated software stack. The NVIDIA HGX™ AI supercomputing platform combines the full power of NVIDIA GPUs, NVIDIA® NVLink®, NVIDIA InfiniBand networking, and a fully optimized NVIDIA AI and HPC software stack from the NVIDIA NGC™ catalog for the highest application performance. With end-to-end performance and flexibility, NVIDIA HGX enables researchers and scientists to combine simulation, data analytics, and AI to drive scientific progress.

UNMATCHED END-TO-END PLATFORM FOR ACCELERATED COMPUTING

NVIDIA HGX combines NVIDIA A100 Tensor Core GPUs with high-speed interconnects to form the world's most powerful servers. With 16 A100 GPUs, HGX A100 delivers up to 1.3 terabytes (TB) of GPU memory and over 2 terabytes per second (TB/s) of memory bandwidth for unprecedented acceleration.

Compared to previous generations, HGX delivers up to a 20x AI speedup with Tensor Float 32 (TF32) and a 2.5x HPC speedup with FP64. Delivering a staggering 10 petaFLOPS, NVIDIA HGX forms the most powerful accelerated, vertically scalable server platform for AI and HPC.
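As an illustration of how applications tap the TF32 Tensor Cores on A100-class GPUs, the sketch below uses PyTorch; the flags shown are PyTorch's own, and the matrix sizes are arbitrary placeholders.

```python
import torch

# Illustrative: opt in to TF32 Tensor Core math for matrix multiplies and
# cuDNN convolutions. On A100-class (Ampere) GPUs these paths run on Tensor Cores.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # executed in TF32 on Tensor Cores when the flags above are set
print(c.shape)
```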

HGX has been extensively tested and is easy to deploy, integrating into partner servers for guaranteed performance. The HGX platform is available as 4-GPU and 8-GPU HGX motherboards with SXM GPUs, and as PCIe GPUs for a modular deployment option that delivers the highest compute performance in mainstream servers.

[Image: HGX application and software stack]

With 144 cores and 1 TB/s of memory bandwidth, the NVIDIA Grace CPU Superchip delivers unprecedented performance for CPU-based high-performance computing applications. HPC applications are compute-intensive and require the most powerful cores, the highest memory bandwidth, and the right amount of memory per core to accelerate results.

The Grace CPU Superchip and Grace Hopper Superchip are expected to be available in the first half of 2023.

DEEP LEARNING PERFORMANCE

UP TO 3X FASTER AI TRAINING ON THE LARGEST MODELS

DLRM Training
[Image: HGX DLRM training performance]
DLRM on HugeCTR framework, precision = FP16 | NVIDIA A100 80 GB, batch size = 48 | NVIDIA A100 40 GB, batch size = 32 | NVIDIA V100 32 GB, batch size = 32.

The size and complexity of deep learning models have exploded, requiring systems with large amounts of memory, massive compute, and fast interconnects for scalability. With ultra-fast all-to-all GPU communication through NVIDIA® NVSwitch™, HGX A100 provides enough power to handle even the most advanced AI models. A100 80 GB GPUs double the GPU memory, so a single HGX A100 provides up to 1.3 TB of memory. Steadily growing workloads on the very largest models, such as deep learning recommendation models (DLRM) with their massive data tables, are accelerated by up to 3x over HGX systems with A100 40 GB GPUs.
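For context on the FP16 precision used in the DLRM benchmark above, here is a minimal mixed-precision training sketch in PyTorch. It is not the HugeCTR DLRM benchmark itself; the model, data, and sizes are placeholders.

```python
import torch
import torch.nn as nn

device = "cuda"
# Placeholder model standing in for a recommender network (not HugeCTR DLRM).
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # keeps FP16 gradients numerically stable

features = torch.randn(48, 64, device=device)  # batch size 48, as in the A100 80 GB run
labels = torch.rand(48, 1, device=device)

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward pass runs in mixed FP16/FP32 precision
        loss = nn.functional.binary_cross_entropy_with_logits(model(features), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```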

MACHINE LEARNING PERFORMANCE

2x faster than A100 40 GB on the big data analytics benchmark

Big data analytics benchmark
[Image: HGX big data analytics performance]
Big data analytics benchmark | 30 analytical retail queries, ETL, ML, NLP on a 10 TB dataset | V100 32 GB, RAPIDS/Dask | A100 40 GB and A100 80 GB, RAPIDS/Dask/BlazingSQL

Machine learning models require loading, transforming, and processing very large datasets to extract insights. With up to 1.3 TB of unified memory and all-to-all GPU communication via NVSwitch, HGX A100 with A100 80 GB GPUs has the power to load and compute on huge datasets and deliver actionable insights quickly.

On a big data analytics benchmark, the A100 80 GB delivered insights with 2x the throughput of the A100 40 GB, making it well suited to workloads with ever-growing datasets.
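As a rough sketch of the RAPIDS/Dask style of GPU data analytics referenced in the benchmark, the snippet below spreads a simple query across all local GPUs. The file path and column names are hypothetical.

```python
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per local GPU; each worker holds cuDF partitions in GPU memory.
cluster = LocalCUDACluster()
client = Client(cluster)

# Hypothetical dataset and columns; read_parquet partitions the data across GPUs.
ddf = dask_cudf.read_parquet("retail_transactions.parquet")

# Simple ETL + aggregation executed on the GPUs, then gathered to the client.
result = (
    ddf[ddf["quantity"] > 0]
    .groupby("store_id")["sales"]
    .sum()
    .compute()
)
print(result.head())
```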

HPC PERFORMANCE

HPC applications need to perform enormous amounts of computation every second. Dramatically increasing the compute density of each server node significantly reduces the number of servers required, which translates into large cost savings and lower space and energy requirements in data centers. For HPC simulations and their high-dimensional matrix multiplications, a processor must fetch data from many neighbors, making GPUs connected by NVLink ideal. HPC applications can also leverage TF32 on the A100 to achieve up to 11x higher throughput for single-precision dense matrix-multiply operations.
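To illustrate the kind of GPU-to-GPU exchange that benefits from NVLink, here is a minimal PyTorch all-reduce sketch using the NCCL backend, which uses NVLink/NVSwitch paths where available. It would be launched with torchrun (one process per GPU); the script name and tensor size are placeholders.

```python
import torch
import torch.distributed as dist

def main():
    # NCCL transparently uses NVLink/NVSwitch links between GPUs when present.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes a tensor; all_reduce sums it across every GPU.
    x = torch.ones(1 << 20, device="cuda") * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("all_reduce done, first element:", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```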

An HGX A100 with A100 80 GB GPUs provides a two-fold throughput increase over A100 40 GB GPUs in Quantum Espresso, a materials simulation, shortening the time to insight.

11x more HPC performance in four years

Leading HPC applications
[Image: HGX application throughput vs. P100]
Geometric mean of application speedups vs. P100. Benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT Fast Fine Tuning], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64:10)], TensorFlow [ResNet-50], VASP 6 [Si Huge] | GPU nodes with dual-socket CPUs and 4x NVIDIA P100, V100, or A100 GPUs.

Up to 1.8x faster performance for HPC applications

Quantum Espresso
[Image: HGX Quantum Espresso performance]
Quantum Espresso measurement with CNT10POR8 dataset, precision = FP64.

HGX A100 SPECIFICATIONS

NVIDIA HGX is available as a single motherboard with four or eight A100 GPUs, each with 40 GB or 80 GB of GPU memory. The 4-GPU configuration is fully interconnected with NVIDIA NVLink®, and the 8-GPU configuration is interconnected via NVSwitch. Two NVIDIA HGX A100 8-GPU motherboards can be combined using the NVSwitch interconnect to create a single high-performance 16-GPU node.
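As an illustrative way to inspect such a topology from software, the sketch below enumerates the visible GPUs and queries peer-to-peer access with PyTorch; it only reports what the driver exposes and assumes a CUDA-capable node.

```python
import torch

n = torch.cuda.device_count()
print(f"{n} CUDA devices visible")

for i in range(n):
    print(i, torch.cuda.get_device_name(i))

# On NVLink/NVSwitch-connected SXM motherboards, peer access is typically
# reported between every pair of GPUs.
for i in range(n):
    peers = [j for j in range(n) if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i} has peer access to: {peers}")
```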

HGX is also available in a PCIe form factor as an easy-to-deploy option that brings the highest compute performance to mainstream servers, with 40 GB or 80 GB of GPU memory per GPU.

This powerful combination of hardware and software lays the foundation for the ultimate AI supercomputing platform.

|  | A100 PCIe | 4-GPU | 8-GPU | 16-GPU |
| --- | --- | --- | --- | --- |
| GPUs | 1x NVIDIA A100 PCIe | HGX A100 4-GPU | HGX A100 8-GPU | 2x HGX A100 8-GPU |
| Form factor | PCIe | 4x NVIDIA A100 SXM | 8x NVIDIA A100 SXM | 16x NVIDIA A100 SXM |
| HPC and AI compute (FP64/TF32*/FP16*/INT8*) | 19.5 TF / 312 TF* / 624 TF* / 1.2 POPS* | 78 TF / 1.25 PF* / 2.5 PF* / 5 POPS* | 156 TF / 2.5 PF* / 5 PF* / 10 POPS* | 312 TF / 5 PF* / 10 PF* / 20 POPS* |
| GPU memory | 40 or 80 GB per GPU | Up to 320 GB | Up to 640 GB | Up to 1,280 GB |
| NVLink | Third generation | Third generation | Third generation | Third generation |
| NVSwitch | N/A | N/A | Second generation | Second generation |
| NVSwitch GPU-to-GPU bandwidth | N/A | N/A | 600 GB/s | 600 GB/s |
| Total aggregate bandwidth | 600 GB/s | 2.4 TB/s | 4.8 TB/s | 9.6 TB/s |

* With sparsity

INSIGHT INTO THE NVIDIA AMPERE ARCHITECTURE

Read this technical paper to learn what's new with the NVIDIA Ampere architecture and its implementation in the NVIDIA A100 GPU.