ACCELERATING THE MOST IMPORTANT WORK OF OUR TIME

The NVIDIA A100 Tensor Core GPU delivers unprecedented acceleration at any scale for AI, data analytics and high-performance computing (HPC) to tackle the world's toughest computing challenges. As the engine of the NVIDIA data centre platform, the A100 can efficiently scale to thousands of GPUs or be partitioned into seven GPU instances with NVIDIA Multi-Instance GPU (MIG) technology to accelerate workloads of any size. And third-generation Tensor Cores accelerate any precision for a wide range of workloads, reducing time to insight and time to market.

THE MOST POWERFUL END-TO-END AI AND HPC DATA CENTRE PLATFORM

A100 is part of NVIDIA's complete data centre solution, incorporating building blocks across hardware, networking, software, libraries, and optimised AI models and applications from NGC™. As the most powerful end-to-end AI and HPC platform for data centres, it enables researchers to deliver real-world results and deploy solutions into production at scale.

SPECIFICATION

NVIDIA A100 for NVLink

Peak FP64: 9.7 TF
Peak FP64 Tensor Core: 19.5 TF
Peak FP32: 19.5 TF
Peak TF32 Tensor Core: 156 TF | 312 TF*
Peak BFLOAT16 Tensor Core: 312 TF | 624 TF*
Peak FP16 Tensor Core: 312 TF | 624 TF*
Peak INT8 Tensor Core: 624 TOPS | 1,248 TOPS*
Peak INT4 Tensor Core: 1,248 TOPS | 2,496 TOPS*
GPU Memory: 40 GB
GPU Memory Bandwidth: 1,555 GB/s
Interconnect: NVIDIA NVLink 600 GB/s | PCIe Gen4 64 GB/s
Multi-Instance GPU: Various instance sizes with up to 7 MIGs @ 5 GB
Form Factor: 4/8 SXM on NVIDIA HGX™ A100
Max TDP Power: 400 W

* With sparsity

UP TO 6X HIGHER PERFORMANCE WITH TF32 FOR AI TRAINING

BERT pre-training throughput with PyTorch, including (2/3) Phase 1 and (1/3) Phase 2 | Phase 1 Seq Len = 128, Phase 2 Seq Len = 512; V100: NVIDIA DGX-1™ server with 8x V100 using FP32 precision; A100: DGX A100 server with 8x A100 using TF32 precision.

DEEP LEARNING TRAINING

AI models are becoming more complex as they face the next challenges, such as accurate conversational AI and deep recommendation systems. Training these models requires massive computing power and scalability.

The third-generation NVIDIA A100 Tensor Cores with Tensor Float 32 (TF32) precision deliver up to 20x higher performance over the previous generation with no code changes, and a further 2x with automatic mixed precision and FP16. Combined with third-generation NVIDIA® NVLink®, NVIDIA NVSwitch™, PCIe Gen4, NVIDIA Mellanox InfiniBand and the NVIDIA Magnum IO™ software SDK, scaling to thousands of A100 GPUs is possible. This means that large AI models such as BERT can be trained on a cluster of 1,024 A100s in as little as 37 minutes, offering unprecedented performance and scalability.
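
As an illustration of the "no code changes" and automatic mixed precision points above, the following is a minimal PyTorch training sketch, assuming a recent PyTorch build on an A100-equipped system; the model, data and hyperparameters are illustrative placeholders rather than NVIDIA's benchmark configuration.

    # Minimal mixed-precision training sketch (assumed setup: recent PyTorch on an A100).
    # Model, data and hyperparameters are placeholders for illustration only.
    import torch
    from torch import nn

    device = "cuda"
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps FP16 gradients from underflowing

    for step in range(100):
        x = torch.randn(256, 1024, device=device)             # synthetic input batch
        y = torch.randint(0, 10, (256,), device=device)       # synthetic labels

        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():                        # eligible ops run in FP16 on Tensor Cores
            loss = nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

Depending on the PyTorch version, TF32 for FP32 matrix multiplies may be on by default or may need to be enabled explicitly; the HPC section below shows the explicit switch.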

NVIDIA's leadership in training was demonstrated in MLPerf 0.6, the first industry-wide benchmark for AI training.


DEEP LEARNING INFERENCE

The A100 offers groundbreaking new features to optimise inference workloads. It provides unprecedented versatility by accelerating a full range of precisions, from FP32 through FP16 and INT8 down to INT4. Multi-Instance GPU (MIG) technology enables multiple networks to run simultaneously on a single A100 GPU for optimal utilisation of compute resources. And structural sparsity support delivers up to 2x more performance on top of the A100's other inference gains.
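
To illustrate how several networks can share one A100, each inference process can be pinned to a single MIG instance before CUDA initialises. The following is a minimal sketch, assuming MIG has already been enabled and instances created; the UUID shown is a hypothetical placeholder, with real values listed by nvidia-smi -L.

    # Sketch: pinning one inference process to a single MIG instance.
    # The UUID below is a placeholder, not a real device identifier.
    import os

    # Must be set before CUDA is initialised (i.e. before the first CUDA call in torch).
    os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

    import torch

    print(torch.cuda.device_count())      # 1: only the selected MIG instance is visible
    print(torch.cuda.get_device_name(0))  # typically reported as an A100 MIG device

Running one such process per MIG instance is how multiple networks share a single A100 with isolated memory and compute.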

NVIDIA already delivers market-leading inference performance, as demonstrated in MLPerf Inference 0.5, the first industry-wide benchmark for inference. The A100 delivers 20x more performance to extend that lead even further.

MORE ABOUT DEEP LEARNING INFERENCE

UP TO 7X HIGHER PERFORMANCE WITH MULTI-INSTANCE GPU (MIG) FOR AI INFERENCE

BERT Large Inference | NVIDIA T4 Tensor Core GPU: NVIDIA TensorRT™ (TRT) 7.1, precision = INT8, batch size = 256 | V100: TRT 7.1, precision = FP16, batch size = 256 | A100 with 7 MIG instances of 1g.5gb: pre-production TRT, batch size = 94, precision = INT8 with sparsity.

9X MORE HPC PERFORMANCE IN 4 YEARS

Geometric mean of application speed-ups compared to P100 | Benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT Large Fine Tuning], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge] | GPU nodes with dual-socket CPUs and 4x NVIDIA P100, V100 or A100 GPUs.

HIGH PERFORMANCE COMPUTING (HPC)

To enable next-generation discoveries, scientists are looking to simulations to better understand complex molecules for drug discovery, physics for potential new energy sources and atmospheric data to better predict and prepare for extreme weather patterns.

The A100 introduces double precision tensor cores, marking the biggest milestone since the introduction of double precision computing in GPUs for HPC. It allows researchers to reduce a 10-hour double-precision simulation running on NVIDIA V100 Tensor Core GPUs to just four hours on the A100. HPC applications can also take advantage of the TF32 precision in the A100's Tensor Cores to achieve up to 10x higher throughput for single-precision dense matrix multiplication operations.
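
A minimal sketch of the TF32 path, assuming PyTorch on an A100: FP32 inputs stay as they are and only the execution path of dense matrix multiplies changes, which is why no application code needs to be rewritten. The matrix sizes are arbitrary.

    # TF32 sketch (assumed setup: PyTorch on an A100).
    import torch

    torch.backends.cuda.matmul.allow_tf32 = True   # route FP32 GEMMs through TF32 Tensor Cores
    torch.backends.cudnn.allow_tf32 = True         # same for cuDNN convolutions

    a = torch.randn(8192, 8192, device="cuda")     # ordinary FP32 tensors, arbitrary sizes
    b = torch.randn(8192, 8192, device="cuda")
    c = a @ b                                      # dense single-precision matrix multiply

Double-precision work benefits from the A100's FP64 Tensor Cores transparently through the CUDA math libraries, with no comparable switch required.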

MORE ABOUT HPC

HIGH PERFORMANCE DATA ANALYTICS

Customers need to be able to analyse, visualise and turn huge amounts of data into insights. But scale-out solutions often get bogged down because these data sets are scattered across multiple servers.

Accelerated servers with A100 provide the needed compute power, along with 1.6 terabytes per second (TB/s) of memory bandwidth and scalability through third-generation NVLink and NVSwitch, to handle these massive workloads. Combined with NVIDIA Mellanox InfiniBand, the Magnum IO SDK and the RAPIDS suite of open-source software libraries, including the RAPIDS Accelerator for Apache Spark for GPU-accelerated data analytics, the NVIDIA data centre platform is uniquely able to accelerate them with unprecedented performance and efficiency.
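
As a small illustration of GPU-accelerated analytics with RAPIDS, the following cuDF sketch mirrors a typical pandas workflow; the file and column names are hypothetical, and Spark acceleration via the RAPIDS Accelerator is configured at the cluster level rather than shown here.

    # RAPIDS cuDF sketch (file and column names are hypothetical placeholders).
    # cuDF keeps the data and the computation on the GPU while mirroring much of the pandas API.
    import cudf

    df = cudf.read_csv("transactions.csv")     # loaded straight into GPU memory
    per_customer = (
        df.groupby("customer_id")["amount"]
          .sum()                               # aggregation runs on the GPU
          .sort_values(ascending=False)
    )
    print(per_customer.head())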

MORE ABOUT DATA ANALYTICS

7X HIGHER INFERENCE THROUGHPUT WITH MULTI-INSTANCE GPU (MIG)

BERT Large Inference | NVIDIA TensorRT™ (TRT) 7.1 | NVIDIA T4 Tensor Core GPU: TRT 7.1, precision = INT8, batch size = 256 | V100: TRT 7.1, precision = FP16, batch size = 256 | A100 with 1 or 7 MIG instances of 1g.5gb: batch size = 94, precision = INT8 with sparsity.

ENTERPRISE UTILISATION

A100 with MIG maximises the use of GPU-accelerated infrastructure like never before. With MIG, an A100 GPU can be partitioned into up to seven independent instances, giving multiple users access to GPU acceleration for their applications and development projects. MIG works with Kubernetes, containers and hypervisor-based server virtualisation with NVIDIA Virtual Compute Server (vComputeServer). MIG enables infrastructure managers to offer a right-sized GPU for each job with guaranteed quality of service (QoS), optimising utilisation and extending the reach of accelerated compute resources to each user.
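
As a sketch of how a right-sized MIG slice might be requested in a Kubernetes environment, the following uses the Kubernetes Python client and assumes the NVIDIA device plugin exposes MIG devices as nvidia.com/mig-1g.5gb resources; the pod name, image tag and namespace are placeholders.

    # Sketch: requesting one 1g.5gb MIG slice for a pod via the Kubernetes Python client.
    # Assumes the NVIDIA device plugin advertises MIG devices as "nvidia.com/mig-1g.5gb";
    # the pod name, container image and namespace are placeholders.
    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="mig-inference-demo"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="nvcr.io/nvidia/pytorch:24.01-py3",  # placeholder image tag
                    command=["python", "-c",
                             "import torch; print(torch.cuda.get_device_name(0))"],
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/mig-1g.5gb": "1"}  # one MIG instance, guaranteed QoS
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

Scheduling one pod per MIG slice is one way an infrastructure manager can hand each user an isolated, right-sized share of a single A100.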