THE EXPLOSION OF AI
Demand for personalized services has led to a dramatic increase in the complexity, number, and variety of AI-powered applications and products. Applications use AI inference to recognize images, understand speech, or make recommendations. To be useful, AI inference has to be fast, accurate, and easy to deploy.
UNDERSTANDING INFERENCE PERFORMANCE
With inference, speed is just the beginning of performance. To get a complete picture of inference performance, there are seven factors to consider, ranging from programmability to rate of learning.
The NVIDIA TensorRT Hyperscale Inference Platform delivers on all fronts. It delivers the best inference performance at scale with the versatility to handle the growing diversity of today's networks.
NVIDIA T4 POWERED BY TURING TENSOR CORES
Efficient, high-throughput inference depends on a world-class platform. The NVIDIA® Tesla® T4 GPU is the world's most advanced accelerator for all AI inference workloads. Powered by NVIDIA Turing™ Tensor Cores, T4 provides revolutionary multi-precision inference performance to accelerate the diverse applications of modern AI.
NVIDIA DATA CENTER COMPUTE SOFTWARE
NVIDIA TensorRT
NVIDIA TensorRT is a high-performance deep learning inference platform that can speed up applications such as recommenders, speech recognition, and machine translation by 40X compared to CPU-only architectures.
NVIDIA TensorRT Inference Server
NVIDIA TensorRT Inference Server is a microservice that simplifies deploying AI inference in data center production. TensorRT Inference Server supports popular AI models and leverages Docker and Kubernetes to integrate seamlessly into DevOps architectures. It is available as a ready-to-deploy container from the NGC container registry and as an open source project.
Kubernetes on NVIDIA GPUs
Kubernetes on NVIDIA GPUs enables enterprises to seamlessly scale training and inference deployment across multi-cloud GPU clusters. With Kubernetes, GPU-accelerated deep learning and high performance computing (HPC) applications can be deployed instantly.
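To illustrate how a GPU workload is scheduled under Kubernetes, a pod requests GPUs through the `nvidia.com/gpu` resource exposed by the NVIDIA device plugin. This is a minimal sketch; the pod name and image tag below are illustrative placeholders, not a prescribed configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trt-inference-pod          # hypothetical name
spec:
  containers:
  - name: inference
    image: nvcr.io/nvidia/tensorrt:latest   # illustrative tag; choose a real NGC tag
    resources:
      limits:
        nvidia.com/gpu: 1          # request one GPU; the device plugin handles assignment
```

Because the GPU is declared as a schedulable resource, Kubernetes places the pod only on nodes with a free GPU, which is what makes scaling inference across multi-cloud GPU clusters possible.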
NVIDIA DeepStream SDK
NVIDIA DeepStream is an application framework for the most complex Intelligent Video Analytics (IVA) applications. With its modular framework and hardware-accelerated building blocks, developers can focus on building core deep learning networks rather than designing end-to-end applications from scratch.
THE POWER OF NVIDIA TensorRT
NVIDIA TensorRT™ is a high-performance inference platform that includes an optimizer, runtime engines, and an inference server for deploying applications in production. TensorRT speeds up applications by as much as 40X over CPU-only systems for video streaming, recommendation, and natural language processing.
Features and Benefits
The Most Advanced AI Inference Platform
NVIDIA Tesla T4 has the world's highest inference efficiency, up to 40X higher than CPUs. T4 can analyze up to 39 simultaneous HD video streams in real time using dedicated hardware-accelerated video transcode engines. Delivering all of this performance in just 70 watts (W) makes NVIDIA T4 the ideal inference solution for mainstream servers at the edge.
24X Higher Throughput to Keep Up with Expanding Workloads
Tesla V100 GPUs powered by NVIDIA Volta™ give data centers a dramatic boost in throughput for deep learning workloads to extract intelligence from today’s tsunami of data. A server with a single Tesla V100 can replace up to 50 CPU-only servers for deep learning inference workloads, so you get dramatically higher throughput with lower acquisition cost.
Maximize Performance with NVIDIA TensorRT and DeepStream SDK
NVIDIA TensorRT optimizer and runtime engines deliver high throughput at low latency for applications such as recommender systems, speech recognition, and image classification. With TensorRT, models trained in 32-bit or 16-bit floating point precision can be optimized for INT8 operations on Tesla T4 and P4, or FP16 on Tesla V100. NVIDIA DeepStream SDK taps into the power of Tesla GPUs to simultaneously decode and analyze video streams.
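The multi-precision path above rests on mapping FP32 values onto 8-bit integers with a calibration scale. The following is a minimal sketch of that symmetric linear quantization in plain Python; it is illustrative arithmetic only, not the TensorRT API, and the max-absolute calibration shown is simpler than the entropy calibration TensorRT actually performs:

```python
# Illustrative sketch of symmetric linear INT8 quantization, the kind of
# mapping INT8 inference relies on. Not the TensorRT API.

def calibration_scale(values):
    """Simplest possible calibration: derive the scale from the maximum
    absolute value so the full range maps into [-127, 127]."""
    return max(abs(v) for v in values) / 127.0

def quantize_int8(values, scale):
    """Map FP32 values to INT8 codes using a per-tensor scale, clamping to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return [c * scale for c in codes]

# Example: quantize a small set of hypothetical activations.
activations = [0.02, -1.5, 0.75, 3.1, -2.4]
scale = calibration_scale(activations)     # 3.1 / 127
codes = quantize_int8(activations, scale)  # e.g. the max value maps to 127
approx = dequantize(codes, scale)          # within half a scale step of the originals
```

Each 8-bit code stands in for a floating-point value to within half a quantization step, which is why well-calibrated INT8 inference can run far faster while losing little accuracy.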
Deliver High Throughput Inference that Maximizes GPU Utilization
NVIDIA TensorRT Inference Server delivers high-throughput data center inference and helps you get the most from your GPUs. Delivered in a ready-to-run container, NVIDIA TensorRT Inference Server is a microservice that concurrently runs models from Caffe2, NVIDIA TensorRT, TensorFlow, and any framework that supports the ONNX standard on one or more GPUs.
PRODUCTION-READY DATA CENTER INFERENCE
The NVIDIA TensorRT Inference Server is a containerized microservice that enables applications to use AI models in data center production. It maximizes GPU utilization, supports all popular AI frameworks, and integrates with Kubernetes and Docker.
Optimize your deep learning inference solution today