The availability of open-source large-scale data analysis and machine learning software such as Hadoop, NumPy, scikit-learn, pandas, and Spark has triggered the Big Data revolution. Large companies in major industries such as retail, finance, healthcare, and logistics have adopted data analysis to improve their competitiveness, responsiveness, and efficiency. Improvements of a few percent can affect their bottom line by billions. Data analysis and machine learning are the largest HPC segment today.
The current situation
For businesses trying to stay competitive, it’s not easy to learn from increasingly vast volumes of data, cope with the complexity of analysis, or keep up with siloed analytics solutions on legacy infrastructure. What use is valuable data if analyzing it takes far too long? Results delivered quickly can prevent losses in value, capture achievable profits, and limit fraud damage through faster reactions.
What is the real problem?
Today’s data science problems demand a dramatic increase in the scale of data as well as the computational power required to process it.
What is the problem that RAPIDS is solving?
Don't send a kid to do a strong man's job, and don't use a CPU for a fast GPU's job! While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are workloads where GPU acceleration is ideal.
What is RAPIDS?
RAPIDS is built on more than 15 years of NVIDIA® CUDA® development and machine learning expertise. NVIDIA created RAPIDS, an open-source data analytics and machine learning acceleration platform: powerful new software for executing end-to-end data science training pipelines completely on the GPU, reducing training time from days to minutes. RAPIDS is based on Python, offers pandas-like and scikit-learn-like interfaces, is built on the Apache Arrow in-memory data format, and scales from a single GPU to multi-GPU and multi-node configurations.
RAPIDS integrates easily into the world’s most popular Python-based data science workflows. It accelerates data science end to end, from data preparation to machine learning to deep learning. And through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.
Machine Learning to Deep Learning: All on GPU
Is there a solution that speeds up the processing significantly?
Yes. With NVIDIA's push of GPU acceleration into machine learning and high-performance data analytics (ML/HPDA), the company reports that the RAPIDS platform delivers 50x speedups using the XGBoost machine learning algorithm for training on an NVIDIA DGX-2 supercomputer, compared with CPU-only systems. RAPIDS for data science can thus reduce computing times from days to minutes.
These applications profit from using RAPIDS:
- Big Data
- Forecasting, Trends, Prediction
- Pattern Recognition
- Credit Card Fraud
- Risk Management
Best used with these frameworks:
- Apache Arrow
Boosting Data Science Performance with RAPIDS
RAPIDS achieves speedup factors of 50x or more on typical end-to-end data science workflows. RAPIDS uses NVIDIA CUDA for high-performance GPU execution, exposing that GPU parallelism and high memory bandwidth through user-friendly Python interfaces. RAPIDS focuses on common data preparation tasks for analytics and data science, offering a powerful and familiar DataFrame API. This API integrates with a variety of machine learning algorithms without paying typical serialization costs, enabling acceleration for end-to-end pipelines. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling scaling up and out on much larger dataset sizes.
The RAPIDS container includes a notebook and code that demonstrates a typical end-to-end ETL and ML workflow. The example trains a model to perform home loan risk assessment using all of the loan data for the years 2000 to 2016 in the Fannie Mae loan performance dataset, consisting of roughly 400GB of data in memory. The following figure shows geographical visualization of the loan risk analysis.
The example loads the data into GPU memory using the RAPIDS CSV reader. The ETL in this example performs a number of operations including extracting months and years from datetime fields, joins of multiple columns between DataFrames, and groupby aggregations for feature engineering. The resulting feature data is then converted and used to train a gradient boosted decision tree model on the GPU using XGBoost.
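Because cuDF mirrors pandas, the ETL steps described above can be sketched with pandas itself on a toy dataset. The column names below are illustrative stand-ins, not the actual Fannie Mae schema, and the data is invented for demonstration.

```python
import pandas as pd

# Toy stand-ins for the loan performance and acquisition tables
# (hypothetical columns; the real Fannie Mae schema differs).
perf = pd.DataFrame({
    "loan_id": [1, 1, 2, 2],
    "monthly_reporting_period": pd.to_datetime(
        ["2000-01-01", "2000-02-01", "2001-03-01", "2001-04-01"]),
    "current_actual_upb": [100000.0, 99500.0, 200000.0, 199000.0],
})
acq = pd.DataFrame({"loan_id": [1, 2], "orig_interest_rate": [7.5, 6.8]})

# 1. Extract months and years from the datetime field.
perf["year"] = perf["monthly_reporting_period"].dt.year
perf["month"] = perf["monthly_reporting_period"].dt.month

# 2. Join performance records with acquisition data on a shared key.
merged = perf.merge(acq, on="loan_id", how="left")

# 3. Groupby aggregation for feature engineering.
features = merged.groupby("loan_id").agg(
    mean_upb=("current_actual_upb", "mean"),
    n_reports=("loan_id", "size"),
).reset_index()
print(features)
```

In the RAPIDS example, the same style of calls runs on cuDF DataFrames in GPU memory, and the resulting feature table is handed to XGBoost for training.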
This workflow runs end-to-end on a single NVIDIA DGX-2 server with 16 Tesla V100 GPUs 10x faster than 100 AWS r4.2xlarge instances, as the following chart shows. Comparing GPU to CPU performance one-to-one, this equates to well over a 50x speedup.
One of the biggest competitive advantages NVIDIA enjoys in this space is the huge ecosystem built around CUDA. Hardware vendors support NVIDIA GPUs, while software application vendors and open source communities support NVIDIA CUDA and the GPUs they rely on. As a result, the company has a sizable advantage in the deep learning training and inferencing markets and is pushing that advantage with RAPIDS. As mentioned earlier, data analyses whose runtimes were previously far too long can be drastically shortened with GPU processors. RAPIDS enables the use of lightning-fast GPU processors instead of slow x86 processors, the latter remaining responsible for general OS tasks.
Optimized Software Stack³
NVIDIA RAPIDS includes cuDF, cuML, and cuGraph as its core tools.
With cuDF you can prepare and wrangle your raw data. cuML then trains optimized machine learning models on the prepared data. Finally, the results can be visualized and displayed.
Apache Arrow is a columnar, in-memory data structure that delivers efficient and fast data interchange with flexibility to support complex data models.
The RAPIDS cuDF library is a DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The Python bindings of the CUDA-accelerated core DataFrame manipulation primitives mirror the pandas interface for seamless onboarding of pandas users.
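Because cuDF mirrors the pandas interface, a workflow can be written once and run on either library. The sketch below is a minimal illustration of that idea, assuming the pandas-compatible subset of the cuDF API; it falls back to pandas when no GPU build is installed.

```python
# cuDF mirrors the pandas API, so this sketch tries the GPU library
# first and falls back to pandas on machines without RAPIDS installed
# (assumption: identical call signatures for the subset used here).
try:
    import cudf as xdf  # GPU path
except ImportError:
    import pandas as xdf  # CPU fallback

df = xdf.DataFrame({"city": ["Berlin", "Munich", "Berlin"],
                    "sales": [10, 20, 30]})
filtered = df[df["sales"] > 15]        # boolean filtering
totals = df.groupby("city").sum()      # aggregation
print(totals)
```

The same filtering and groupby calls run on the GPU when cuDF is present, which is exactly the "seamless onboarding" the pandas-style bindings aim for.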
RAPIDS cuML is a collection of GPU-accelerated machine learning libraries that will provide GPU versions of all machine learning algorithms available in scikit-learn.
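Since cuML follows scikit-learn's estimator API, the familiar fit/predict pattern carries over. The sketch below uses scikit-learn itself to stay runnable on a CPU; with cuML, swapping the import (e.g. to `cuml.cluster.KMeans`) is the intended drop-in change — an assumption here, not verified against every estimator.

```python
# scikit-learn stand-in for the cuML estimator API: two well-separated
# point clouds clustered with KMeans via the standard fit/predict calls.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.1],   # cloud near the origin
              [5.0, 5.0], [5.1, 4.9]])  # cloud near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.predict(X)
print(labels)  # points in the same cloud share a cluster label
```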
RAPIDS cuGraph is a framework and collection of graph analytics libraries that seamlessly integrate into the RAPIDS data science platform.
Deep Learning Libraries
RAPIDS provides native __array_interface__ support. This means data stored in Apache Arrow can be seamlessly pushed to deep learning frameworks that support __array_interface__, such as PyTorch and Chainer.
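The array interface protocol lets one library hand its buffer to another without copying. The sketch below demonstrates the CPU variant with NumPy; the `Exposes` class is an invented minimal producer, and on the GPU RAPIDS uses the analogous __cuda_array_interface__ protocol.

```python
import numpy as np

src = np.arange(6, dtype=np.float32)

class Exposes:
    """Minimal (hypothetical) producer exposing __array_interface__."""
    def __init__(self, arr):
        # Re-export the source buffer's interface dict: shape, dtype,
        # and a raw pointer to the data, but no copy of the data itself.
        self.__array_interface__ = arr.__array_interface__
        self._keepalive = arr  # keep the underlying buffer alive

# np.asarray consumes __array_interface__ and wraps the same memory.
view = np.asarray(Exposes(src))
view[0] = 99.0    # writes through to the original buffer
print(src[0])     # 99.0 — the consumer shares memory with the producer
```

This zero-copy handoff is what lets Arrow-resident data flow into deep learning frameworks without serialization.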
Visualization Libraries Coming Soon
RAPIDS will include tightly integrated data visualization libraries based on Apache Arrow. The native GPU in-memory data format provides high-performance, high-FPS data visualization, even with very large datasets.
³ This information is based on NVIDIA's publicly available material.
At the GPU Technology Conference in Munich, the graphics card manufacturer NVIDIA presented the open-source platform RAPIDS. It is aimed primarily at users in the fields of data science and machine learning and consists of a collection of libraries intended to enable GPU-accelerated data analysis. In addition to NVIDIA, companies such as IBM, HPE, Oracle, and Databricks have announced their support for the project.
The graphics card manufacturer explains that RAPIDS is based on CUDA, its in-house platform for parallel programming. The platform enables developers to create end-to-end pipelines for data analysis. NVIDIA reports results up to 50 times faster on the DGX-2 supercomputer compared with systems that rely only on CPUs. RAPIDS builds on well-known open-source projects such as Apache Arrow, pandas, and scikit-learn, and is designed to bring GPU acceleration to popular Python toolchains. Integration with Apache Spark is also planned.
NVIDIA has been working with members of the Python community for two years to create RAPIDS. Currently, the collection consists of a Python GPU DataFrame library, a C GPU DataFrame library, and alpha versions of the cuML and cuDF libraries. According to NVIDIA founder Jensen Huang, the complete package will advance work in data analysis and machine learning.
The entire RAPIDS project can be found on GitHub. Further information, including installation instructions, can be found on the official website. Companies like Walmart are already using the new platform.