The Universal System for AI Infrastructure

NVIDIA-certified systems from Supermicro

The NVIDIA A100 Tensor Core GPU delivers unprecedented acceleration at any scale for the world's most powerful elastic data centers in AI, data analytics, and HPC. Built on the NVIDIA Ampere architecture, A100 is the engine of NVIDIA's data center platform. A100 delivers up to 20x the performance of the previous generation and can be partitioned into seven GPU instances to dynamically adapt to changing demands. The A100 80 GB debuts the world's fastest memory bandwidth at over 2 terabytes per second (TB/s) to handle the largest models and data sets.

A100 PCIE Product Brief (PDF 332 KB)

NVIDIA A100 datasheet (PDF 867 KB)

The most powerful end-to-end platform for AI and HPC in the data center


A100 is part of NVIDIA's complete data center solution stack, which spans hardware, networking, software, libraries, and optimized AI models and applications from NGC™. It represents the most powerful end-to-end AI and HPC platform for data centers, enabling researchers to deliver real-world results and deploy solutions at scale.


Up to 3 times faster AI training for the largest models

DLRM Training
DLRM on HugeCTR framework, precision = FP16 | NVIDIA A100 80 GB batch size = 48 | NVIDIA A100 40 GB batch size = 32 | NVIDIA V100 32 GB batch size = 32.

The complexity of AI models is rapidly increasing to meet new challenges such as conversational AI. Training them requires tremendous computational power and scalability.

NVIDIA A100 Tensor Cores with Tensor Float 32 (TF32) precision deliver up to 20x higher performance over NVIDIA Volta with zero code changes, plus an additional 2x boost with automatic mixed precision and FP16. Scaling to thousands of A100 GPUs is possible in combination with NVIDIA® NVLink®, NVIDIA NVSwitch™, PCIe Gen4, NVIDIA® Mellanox® InfiniBand®, and the NVIDIA Magnum IO™ SDK.
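The headline figures can be sanity-checked against the peak throughputs in the spec table further down the page (the V100 FP32 peak of 15.7 TFLOPS is an assumption taken from NVIDIA's published V100 specifications, not from this page):

```python
# Back-of-the-envelope check of the quoted training speedups.
# Peak throughputs in TFLOPS; the V100 FP32 figure is assumed from
# NVIDIA's public V100 datasheet (not stated on this page).
V100_FP32 = 15.7
A100_TF32 = 156.0          # Tensor Float 32
A100_TF32_SPARSE = 312.0   # TF32 with structural sparsity
A100_FP16 = 312.0          # FP16 Tensor Core

# "Up to 20x over Volta" pairs A100 TF32 (with sparsity) against V100 FP32:
print(round(A100_TF32_SPARSE / V100_FP32))   # -> 20

# "An additional 2x with automatic mixed precision and FP16":
print(round(A100_FP16 / A100_TF32))          # -> 2
```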

Training workloads like BERT can be solved at scale with 2,048 A100 GPUs in under a minute, setting a world record for solution time.

For the largest models with massive data tables, such as Deep Learning Recommendation Models (DLRM), the A100 80 GB reaches up to 1.3 TB of unified memory per node and delivers up to 3x the throughput of the A100 40 GB.
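The 1.3 TB figure is consistent with a fully populated 16-GPU HGX A100 node (the node configuration is an assumption; it is not stated in the text):

```python
# Unified memory per node, assuming a 16-GPU HGX A100 node
# (node size is an assumption, not stated on this page).
gpus_per_node = 16
gb_per_gpu = 80
total_tb = gpus_per_node * gb_per_gpu / 1000
print(total_tb)  # -> 1.28, i.e. roughly the quoted 1.3 TB
```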

NVIDIA has cemented its MLPerf leadership with multiple performance records in the industry-wide AI training benchmark.



The A100 introduces breakthrough features to optimize inference workloads. It accelerates a full range of precisions, from FP32 to INT4, and Multi-Instance GPU (MIG) technology allows multiple networks to run simultaneously on a single A100 GPU for optimal use of compute resources. On top of the A100's other inference performance gains, structural sparsity provides up to 2x more performance.

For state-of-the-art conversational AI models such as BERT, the A100 provides up to 249x faster inference throughput over CPUs.

For the most complex models with constrained batch sizes, such as RNN-T for automatic speech recognition, the increased memory capacity of the A100 80 GB doubles the size of each MIG instance, delivering up to 1.25x higher throughput than the A100 40 GB.

NVIDIA demonstrated market-leading performance for inference in MLPerf, and the A100 extends that lead with 20x more performance.

Up to 249 times higher performance in AI inference vs. CPUs
BERT-Large Inference
BERT-Large Inference | CPU only: Dual Xeon Gold 6240 at 2.60 GHz, precision = FP32, batch size = 128 | V100: NVIDIA TensorRT™ (TRT) 7.2, precision = INT8, batch size = 256 | A100 40 GB and 80 GB: batch size = 256, precision = INT8 with sparsity.
Up to 1.25 times higher performance in AI inference vs. A100 40 GB
RNN-T Inference: Single Stream
MLPerf 0.7 RNN-T measured with (1/7) MIG instances. Framework: TensorRT 7.2, dataset = librispeech, precision = FP16.


To unlock the next generation of discovery, scientists are turning to simulations to better understand the world around us.

NVIDIA A100 introduces double-precision Tensor Cores, the biggest leap in HPC performance since the introduction of GPUs. Combined with 80 GB of the fastest GPU memory, researchers can reduce a 10-hour double-precision simulation to under four hours on A100. HPC applications can also leverage TF32 to achieve up to 11x higher throughput on dense single-precision matrix multiplication tasks.
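The 10-hour-to-under-four-hours claim lines up with the ratio of peak FP64 throughputs (the V100 FP64 peak of 7.8 TFLOPS is assumed from NVIDIA's published V100 specifications, not from this page):

```python
# Double-precision speedup implied by peak FP64 Tensor Core throughput.
V100_FP64 = 7.8        # TFLOPS, assumed from NVIDIA's V100 datasheet
A100_FP64_TC = 19.5    # TFLOPS, FP64 Tensor Core (from the spec table below)

speedup = A100_FP64_TC / V100_FP64
print(round(speedup, 1))        # -> 2.5
print(round(10 / speedup, 1))   # -> 4.0 hours, consistent with the quoted figure
```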

For HPC applications with the largest data sets, the additional memory of the A100 80 GB provides up to a 2x increase in throughput in Quantum Espresso, a materials simulation. This massive memory and unmatched memory bandwidth make the A100 80 GB the ideal platform for next-generation workloads.

11 times more HPC performance in four years
Leading HPC applications
Geometric mean of application acceleration vs. P100; benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT Large Fine Tuning], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64:10)], TensorFlow [ResNet-50], VASP 6 [Si Huge] | GPU nodes with dual-socket CPUs and 4x NVIDIA P100, V100, or A100 GPUs.
Up to 1.8 times higher performance for HPC applications
Quantum Espresso
Quantum Espresso measurement with CNT10POR8 dataset, precision = FP64.


Up to 83 times faster than CPU and 2 times faster than A100 40 GB on a big data analytics benchmark

Big Data Analytics benchmark | 30 Analytical Retail Queries, ETL, ML, NLP on 10 TB dataset | CPU: Intel Xeon Gold 6252 2.10 GHz, Hadoop | V100 32 GB, RAPIDS/Dask | A100 40 GB and A100 80 GB, RAPIDS/Dask/BlazingSQL

Data scientists need to be able to analyze, visualize, and gain insights from large data sets. However, scaling solutions are often held back by the fact that data sets are spread across multiple servers.

Accelerated servers with A100 deliver the compute power - along with massive memory, 2 terabytes per second (TB/s) of memory bandwidth, and scalability via NVIDIA® NVLink® and NVSwitch™ - to handle these massive workloads. Combined with InfiniBand, NVIDIA Magnum IO™, and the RAPIDS™ suite of open-source libraries, including the RAPIDS Accelerator for Apache Spark for GPU-accelerated data analytics, NVIDIA's data center platform accelerates these workloads with unmatched performance and efficiency.

In a big data analytics benchmark, the A100 80 GB delivered insights with 83x the throughput of CPUs and 2x the performance of the A100 40 GB, making it ideal for emerging workloads with ever-growing data sets.



7 times higher inference throughput with multi-instance GPU (MIG)

BERT Large Inference

BERT-Large Inference | NVIDIA TensorRT™ (TRT) 7.1 | NVIDIA T4 Tensor Core GPU: TRT 7.1, precision = INT8, batch size = 256 | V100: TRT 7.1, precision = FP16, batch size = 256 | A100 with 1 or 7 MIG instances of 1g.5gb: batch size = 94, precision = INT8 with sparsity.

A100 with MIG maximizes the utilization of GPU-accelerated infrastructure. With MIG, an A100 GPU can be partitioned into up to seven independent instances, allowing multiple users to take advantage of GPU acceleration simultaneously. On the A100 40 GB, each MIG instance can be allocated up to 5 GB of memory; the increased capacity of the A100 80 GB doubles this to 10 GB.
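As a rough sketch, MIG partitioning can be modeled as carving the GPU into up to seven compute slices, each with a fixed memory share. The profile names below follow NVIDIA's 1g.5gb / 1g.10gb naming convention, but the allocation logic is illustrative, not the driver's; real larger profiles get disproportionately more memory (e.g. 3g.20gb on the 40 GB card), which this linear model does not capture:

```python
# Illustrative model of MIG partitioning on A100 (not the actual
# driver logic). Each GPU offers 7 compute slices; per-slice memory
# is 5 GB on the 40 GB card and 10 GB on the 80 GB card.
MAX_SLICES = 7

def mig_instances(total_mem_gb, requested):
    """Allocate MIG instances from requested slice counts.

    `requested` is a list of slice counts, e.g. [1, 1, 1] for three
    1g instances. Returns the resulting profile names.
    """
    mem_per_slice = total_mem_gb // 8  # 40 GB -> 5 GB, 80 GB -> 10 GB
    if sum(requested) > MAX_SLICES:
        raise ValueError("A100 supports at most 7 MIG compute slices")
    return [f"{n}g.{n * mem_per_slice}gb" for n in requested]

# Seven independent 1g instances on the 80 GB card:
print(mig_instances(80, [1] * 7))    # -> seven '1g.10gb' instances
# Mixed sizes on the 40 GB card:
print(mig_instances(40, [2, 2, 2, 1]))  # -> ['2g.10gb', '2g.10gb', '2g.10gb', '1g.5gb']
```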

MIG works with Kubernetes, containers, and hypervisor-based server virtualization. MIG lets infrastructure managers assign a right-sized GPU to each task with guaranteed quality of service (QoS), extending access to accelerated computing resources to every user.


Get the best out of your systems

An NVIDIA-Certified System, consisting of A100 together with NVIDIA Mellanox SmartNICs and DPUs, is validated for performance, functionality, scalability, and security, so organizations can easily deploy complete AI solutions from the NVIDIA NGC catalog.

Get the most out of your systems with recommendations from NVIDIA and Supermicro.
There's something here for every project and every budget!


The universal system for AI infrastructure


NVIDIA DGX™ A100 is the universal system for all AI workloads, delivering unprecedented compute density, performance and flexibility in the world's first 5 petaFLOPS AI system. NVIDIA DGX A100 features the world's most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling organizations to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure with direct access to NVIDIA AI experts.

Price on demand


Workgroup appliance for the AI era


Data science teams at the leading edge of innovation are often left searching for available AI compute resources to finish their projects. They need a dedicated resource that can be plugged into a standard power outlet anywhere and deliver maximum performance for multiple simultaneous users. NVIDIA DGX Station™ A100 brings AI supercomputing to data science teams, offering data center performance without a data center or additional IT infrastructure. Powerful performance, a fully optimized software stack, and direct access to NVIDIA DGXperts deliver faster insights.

Price on demand
Processor: Dual AMD EPYC™ 7002/7003 Series Processors
Processor: Dual Socket P+ (LGA-4189) 3rd Gen Intel® Xeon® Scalable Processors

Your direct line to the experts at sysGen!



                            NVIDIA A100 for NVLink      NVIDIA A100 for PCIe
Peak FP64                   9.7 TF                      9.7 TF
Peak FP64 Tensor Core       19.5 TF                     19.5 TF
Peak FP32                   19.5 TF                     19.5 TF
Tensor Float 32 (TF32)      156 TF | 312 TF*            156 TF | 312 TF*
Peak BFLOAT16 Tensor Core   312 TF | 624 TF*            312 TF | 624 TF*
Peak FP16 Tensor Core       312 TF | 624 TF*            312 TF | 624 TF*
Peak INT8 Tensor Core       624 TOPS | 1,248 TOPS*      624 TOPS | 1,248 TOPS*
Peak INT4 Tensor Core       1,248 TOPS | 2,496 TOPS*    1,248 TOPS | 2,496 TOPS*
GPU memory                  40 GB / 80 GB               40 GB
GPU memory bandwidth        1,555 GB/s / 2,039 GB/s     1,555 GB/s
Interconnect                NVLink 600 GB/s**           NVLink 600 GB/s**
                            PCIe Gen4 64 GB/s           PCIe Gen4 64 GB/s
Multi-Instance GPU (MIG)    Up to 7 MIGs at 10 GB       Up to 7 MIGs at 5 GB
Form factor                 4/8 SXM on NVIDIA HGX™ A100 PCIe
Max. TDP power              400 W                       250 W

* With structural sparsity
** SXM GPUs via HGX A100 server boards; PCIe GPUs via NVLink Bridge for up to 2 GPUs

NVIDIA RTX Workstation GPUs

NVIDIA Ampere Architecture Insights

Learn what's new with the NVIDIA Ampere architecture and its implementation in the NVIDIA A100 GPU.
Read whitepaper