The world's largest and most profitable companies are data-driven. Become a data-driven Real-Time Enterprise!

Data scientists spend a lot of time evaluating data over repeated Machine Learning (ML) experiments. Each hour required to study data sets, extract features, and customize ML algorithms extends the time it takes to get robust results.

Why are data analysis and machine learning important?

Organizations are increasingly data-driven: they capture market and environmental data and use analysis and machine learning to recognize complex patterns, detect changes, and make predictions that directly impact performance. Managing a business through data-driven processes has become essential to staying at the forefront of the industry. Data-driven organizations must manage a wide variety of data.

NVIDIA shows how much faster RAPIDS is on NVIDIA GPU-based systems

Why now?

The availability of open-source software for large-scale data analysis and machine learning, such as Hadoop, NumPy, scikit-learn, pandas and Spark, has triggered the Big Data revolution. Large companies in huge industries such as retail, finance, healthcare and logistics have adopted data analysis to improve their competitiveness, responsiveness and efficiency. Improvements of even a few percent can affect their bottom line by billions. Data analysis and machine learning are the largest HPC segment today.

The current situation

For businesses trying to stay competitive, it’s not easy to learn from increasingly vast volumes of data, cope with the complexity of analysis, or keep up with siloed analytics solutions while running on legacy infrastructure. What use is valuable data if analyzing it takes far too long? Results delivered quickly could have avoided losses in value, secured possible profits, and prevented fraud damage through faster reactions.

What is the real problem?

Today’s data science problems demand a dramatic increase in the scale of data as well as the computational power required to process it.

A day in the life of a Data Scientist

What is the problem that RAPIDS is solving?

Don't take a kid for a strong man's job, don't take a CPU for a fast GPU's job! While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are areas where GPU acceleration is ideal.

What is RAPIDS?

NVIDIA created RAPIDS, an open-source data analytics and machine learning acceleration platform. It is built on more than 15 years of NVIDIA® CUDA® development and machine learning expertise. RAPIDS executes end-to-end data science training pipelines completely on the GPU, reducing training time from days to minutes. It is based on Python, has pandas-like and scikit-learn-like interfaces, is built on the Apache Arrow in-memory data format, and can scale from a single GPU to multiple GPUs to multiple nodes.

RAPIDS Science Pipeline (Frameworks, Libraries and other Layers)

RAPIDS integrates easily into the world’s most popular Python-based data science workflows. It accelerates data science end to end, from data preparation to machine learning to deep learning. Through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.

Machine Learning to Deep Learning: All on GPU


Is there a solution that speeds up the processing significantly?

Yes. As part of NVIDIA's effort to bring GPU acceleration to machine learning and high-performance data analytics (ML/HPDA), the company reports that the RAPIDS platform delivers speed-ups of up to 50x when training with the XGBoost machine learning algorithm on an NVIDIA DGX-2 supercomputer, compared with CPU-only systems. RAPIDS can therefore reduce computing times for data science from days to minutes.
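As a rough back-of-the-envelope check of the "days to minutes" statement, the following Python sketch converts a CPU-only runtime into the runtime after a given speed-up. It assumes, as an idealization, that the reported 50x factor applies uniformly to the whole job; the function name and the sample runtimes are illustrative and not taken from NVIDIA's benchmark.

```python
def accelerated_minutes(cpu_days: float, speedup: float = 50.0) -> float:
    """Return the runtime in minutes after dividing a CPU-only runtime by `speedup`."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    cpu_minutes = cpu_days * 24 * 60  # days -> minutes
    return cpu_minutes / speedup


# Hypothetical CPU-only runtimes of one, two and seven days.
for days in (1, 2, 7):
    print(f"{days} day(s) on CPU -> {accelerated_minutes(days):.1f} minutes at 50x")
```

Under this assumption a one-day job shrinks to about half an hour. Real pipelines rarely accelerate uniformly, so the actual gain depends on how much of the workload runs on the GPU.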

These applications benefit from using RAPIDS:
  • Big Data
  • Forecasting, Trends, Prediction
  • Pattern Recognition
  • Credit Card Fraud
  • Risk Management
Best used with these frameworks:
  • Apache Arrow
  • Python
  • Pandas
  • scikit-learn
GPU Applications

Recommended Hardware

RAPIDS Recommended Configurations

RAPIDS Deployment Stage | Recommended GPU Configuration | Minimum CPU Cores | Minimum Main Memory | Boot Drive | Local Data Storage | Networking Connections
Development | 2x Quadro GV100 & NVLink | 10 | 128 GB | 500 GB SSD | 2 TB SSD | 1GbE / 10GbE
Development & Production | 4x V100 & NVLink | 20 | 256 GB | 500 GB SSD | 4 TB SSD | 1GbE / 10GbE
Production | 4x V100 SXM2 & NVLink | 20 | 256 GB | 500 GB SSD | 4 TB SSD | 10GbE / 100GbE/IB
Production | 8x V100 SXM2 & NVLink | 40 | 512 GB | 500 GB SSD | 4 TB SSD / NVMe | 10GbE / 100GbE/IB
Production | 16x V100 SXM3 & NVSwitch | 56 | 1 TB | 500 GB SSD | 10 TB SSD / NVMe | 40GbE / 100GbE/IB

Development Systems

For development systems, sysGen offers the devCube or NVIDIA's DGX Station. The devCube is a well-tested, proven system that many of our customers use for deep learning tasks. In the following table you will find the general and recommended specs for our systems.

Production Systems

Production servers are designed for the high demands of continuous operation and constant utilization. Redundant power supplies and enterprise-class components are part of our service.

Optimized Software Stack³


NVIDIA RAPIDS includes cuDF, cuML and cuGraph as its core tools. With cuDF you prepare and wrangle your raw data. cuML then trains machine learning models on the prepared data using optimized algorithms. Finally, the results are visualized and displayed.

Apache Arrow

Apache Arrow is a columnar, in-memory data structure that delivers efficient and fast data interchange with flexibility to support complex data models.
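To illustrate what "columnar" means here, the following sketch contrasts a row-oriented layout with a column-oriented one using only the Python standard library. It does not use the Apache Arrow API, and the real Arrow format adds more on top (for example validity bitmaps and schema metadata). The field names and values are made up for illustration.

```python
from array import array

# Row-oriented layout: one Python object per record.
rows = [
    {"customer_id": 1, "amount": 19.99},
    {"customer_id": 2, "amount": 5.00},
    {"customer_id": 3, "amount": 42.50},
]

# Column-oriented layout: one contiguous, typed buffer per field.
columns = {
    "customer_id": array("q", [r["customer_id"] for r in rows]),  # signed 64-bit ints
    "amount": array("d", [r["amount"] for r in rows]),            # 64-bit floats
}


def total_amount_rows(records):
    """Aggregate by visiting every record and picking out one field."""
    return sum(r["amount"] for r in records)


def total_amount_columns(cols):
    """Aggregate by scanning a single contiguous buffer."""
    return sum(cols["amount"])


# The column buffer can be shared without copying through the buffer protocol.
amount_view = memoryview(columns["amount"])

print(total_amount_rows(rows), total_amount_columns(columns))
print(amount_view.contiguous, amount_view.nbytes)
```

Because each column sits in one contiguous block of a single type, an aggregation touches only the bytes it needs, and the buffer can be handed to another library without conversion. That property is what makes a columnar format attractive for data interchange and for GPU processing.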


The RAPIDS cuDF library is a DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulating data in preparation for model training. The Python bindings of the core CUDA-accelerated DataFrame manipulation primitives mirror the pandas interface for seamless onboarding of pandas users.
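The pandas-like interface can be sketched as follows. The example imports cuDF when it is installed (which requires a supported NVIDIA GPU) and otherwise falls back to pandas, so the same filtering and grouping code runs either way. The column names and values are invented for illustration, and only the small subset of calls shown here is assumed to behave identically in both libraries.

```python
# Use cuDF when available; otherwise fall back to pandas on the CPU.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical transaction data.
transactions = xdf.DataFrame({
    "customer": ["a", "b", "a", "c", "b", "a"],
    "amount": [20.0, 5.0, 35.0, 12.5, 7.5, 80.0],
})

# Keep only transactions above 10.0, then sum the amounts per customer.
large = transactions[transactions["amount"] > 10.0]
per_customer = large.groupby("customer")["amount"].sum().sort_index()

# cuDF results live in GPU memory; copy them to the host for printing.
result = per_customer.to_pandas() if hasattr(per_customer, "to_pandas") else per_customer
print(result)
```

Keeping the library behind a single alias (`xdf` here, an arbitrary name) is one way to prototype on a laptop and run the same script on a GPU system later.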


RAPIDS cuML is a collection of GPU-accelerated machine learning libraries that will provide GPU versions of all machine learning algorithms available in scikit-learn.
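A minimal sketch of the scikit-learn-like interface, assuming either cuML or scikit-learn is installed: the estimator is imported from cuML when possible and from scikit-learn otherwise, and the `fit`/`predict` calls stay the same. The synthetic data (y = 2x + 1) is made up for illustration. Not every scikit-learn estimator or parameter necessarily has a cuML counterpart.

```python
import numpy as np

# Prefer the GPU implementation; fall back to scikit-learn on CPU-only machines.
try:
    from cuml.linear_model import LinearRegression
except ImportError:
    from sklearn.linear_model import LinearRegression

# Synthetic training data following y = 2x + 1 (float32 suits GPU libraries well).
X = np.arange(10, dtype=np.float32).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0

model = LinearRegression()
model.fit(X, y)

# Predict for an unseen input x = 10; the exact line gives 21.
prediction = model.predict(np.array([[10.0]], dtype=np.float32))
print(float(prediction[0]))
```

Only the import line differs between the CPU and GPU variants, which is the practical meaning of a scikit-learn-like interface.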


cuGraph is a framework and collection of graph analytics libraries that integrate seamlessly into the RAPIDS data science platform.

Deep Learning Libraries

RAPIDS provides native array_interface support. This means data stored in Apache Arrow can be seamlessly pushed to deep learning frameworks that accept array_interface, such as PyTorch and Chainer.

Visualization Libraries (Coming Soon)

RAPIDS will include tightly integrated data visualization libraries based on Apache Arrow. The native GPU in-memory data format provides high-performance, high-FPS data visualization, even with very large datasets.

³ This information is based on NVIDIA's publicly available information.

Introducing RAPIDS

During the GPU Technology Conference in Munich, the graphics card manufacturer NVIDIA presented the open-source platform RAPIDS. It is aimed primarily at users in data science and machine learning and is a collection of libraries intended to enable GPU-accelerated data analysis. In addition to NVIDIA, companies such as IBM, HPE, Oracle and Databricks have announced their support for the project.

NVIDIA explains that RAPIDS is based on CUDA, its in-house platform for parallel programming. The new platform will enable developers to create end-to-end pipelines for data analysis. NVIDIA reports results up to 50 times faster on the DGX-2 supercomputer than on systems that rely only on CPUs. The platform builds on well-known open-source projects such as Apache Arrow, pandas and scikit-learn, and is designed to bring GPU acceleration to popular Python toolchains. Integration with Apache Spark is also planned.

NVIDIA has been working with members of the Python community for two years to create RAPIDS. Currently, the collection consists of a Python GPU DataFrame library, a C GPU DataFrame library, and alpha versions of the cuML and cuDF libraries. According to NVIDIA founder Jensen Huang, the complete package will advance work in data analysis and machine learning.

The entire RAPIDS project can be found on GitHub. Further information, including installation instructions, can be found on the official website. Companies like Walmart are already using the new platform.
