Building an Enterprise-Grade HPC and DL Environment

DEEP LEARNING IS FUELING ALL AREAS OF BUSINESS:

  • Healthcare, e.g. Diagnostics
  • Big Data Analyses
  • Risk minimization in financial transactions
  • Robotics, Manufacturing, Production
  • Autonomous driving: Cars, Airplanes, Drones, Rockets, Ships
  • Closing security gaps in systems
  • Quality Assurance
  • Language and sentiment analysis in retail, sales and after-sales
  • AI Cities

Definition: Artificial Intelligence, Machine Learning, Deep Learning

Artificial Intelligence (AI) is the branch of computer science that deals with the development of intelligent machines which work and react like humans. It is the grand project of creating a non-human intelligence. Its most important parts are:

  • The main tasks of Machine Learning (e.g. traditional computer vision) are data preparation, feature engineering, model architecture and numerical optimization. Feature engineering alone takes up almost 80 percent of the preparatory work.
  • Deep Learning, a part of Machine Learning, is a collection of easy-to-train mathematical units that are organized in layers and work together to solve complicated tasks. What is new is the layered network architecture combined with a scalable training method. DL learns features directly from the data; explicit feature engineering is not required. It has achieved an extremely high degree of accuracy, surpasses human performance in image classification, never tires and delivers results in a fraction of the time (see the code sketch after this list).
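
The layered structure described above can be made concrete in a few lines of code. This is a minimal sketch, assuming PyTorch as the framework; the layer sizes and the dummy input are purely illustrative:

  # A small stack of easy-to-train units organized in layers (PyTorch assumed).
  # Features are learned from the data itself instead of being hand-engineered.
  import torch
  import torch.nn as nn

  model = nn.Sequential(
      nn.Linear(784, 256), nn.ReLU(),   # first layer: learns low-level features
      nn.Linear(256, 64), nn.ReLU(),    # second layer: combines them into higher-level features
      nn.Linear(64, 10),                # output layer: class scores
  )

  x = torch.randn(32, 784)              # dummy batch of 32 flattened images
  loss = nn.CrossEntropyLoss()(model(x), torch.randint(0, 10, (32,)))
  loss.backward()                       # one training step computes gradients for all layers at once

Training the whole stack in one backward pass is what is meant above by a scalable training method; no hand-crafted features are needed.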

How to build an Enterprise Deep Learning / HPC environment

Today, deep learning and high-performance computing tasks are performed on the same hardware platform with shared cluster management software and other shared or separately used software. Before we deal with the hardware and software required for this, we first want to take a closer look at the different workflows.

AI Workflow and Sizing: it all starts with the data

The most important asset for Deep Learning is your data. The volume can grow to petabytes; the more, the better.
A typical AI/deep learning development Workflow:

The workflow is detailed as follows (a minimal training sketch follows the list):

  • data factory collects raw data and includes tools used to pre-process, index, label, and manage data
  • AI models are trained with labeled data using a DL framework from the NVIDIA GPU Cloud (NGC) container repository running on servers with Volta Tensor Core GPUs
  • AI model testing and validation adjusts model parameters as needed and repeats training until the desired accuracy is reached
  • AI model optimization for production deployment (inference) is completed using the NVIDIA TensorRT optimizing inference accelerator
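
The four steps above can be sketched in code. The following minimal example assumes PyTorch, uses dummy tensors in place of the data factory output, and a placeholder target accuracy; the exported ONNX file stands in for the hand-off to the TensorRT optimizer:

  # Train, validate until the desired accuracy is reached, then export for inference.
  # Data, model and the 90 % target are illustrative placeholders.
  import torch
  import torch.nn as nn
  from torch.utils.data import DataLoader, TensorDataset

  # "Data factory" output: pre-processed, labeled data (dummy tensors here)
  train_ds = TensorDataset(torch.randn(1024, 3 * 32 * 32), torch.randint(0, 10, (1024,)))
  val_ds   = TensorDataset(torch.randn(256, 3 * 32 * 32), torch.randint(0, 10, (256,)))

  model = nn.Sequential(nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))
  opt = torch.optim.SGD(model.parameters(), lr=0.01)
  loss_fn = nn.CrossEntropyLoss()

  def accuracy(ds):
      x, y = ds.tensors
      with torch.no_grad():
          return (model(x).argmax(1) == y).float().mean().item()

  # Training and validation, repeated until the desired accuracy is reached
  for epoch in range(10):
      for x, y in DataLoader(train_ds, batch_size=64, shuffle=True):
          opt.zero_grad()
          loss_fn(model(x), y).backward()
          opt.step()
      if accuracy(val_ds) >= 0.90:                  # placeholder target accuracy
          break

  # Export the trained model; the ONNX file can then be optimized with TensorRT
  torch.onnx.export(model, torch.randn(1, 3 * 32 * 32), "model.onnx")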

Sizing DL training is highly dependent on data size and model complexity. A single Tesla NVLink server (e.g. DGX-1) can complete a training experiment on a wide variety of AI models in one day. For example, the autonomous vehicle software team at NVIDIA developing NVIDIA DriveNet uses a custom ResNet-18 backbone detection network with a 960x480x3 image size and trains at 480 images per second on such servers, allowing 120 epochs with 300k images to be trained in 21 hours. Internal experience at NVIDIA has shown that five developers collaborating on one AI model yields the optimal development time. Each developer typically works on two models in parallel, so the infrastructure needs to support ten model training experiments within the desired TAT (turn-around time). A DGX POD with nine DGX-1 systems can provide a one-day TAT for model training for a five-developer workgroup. During schedule-critical times, multi-node scaling across eight DGX-1 servers can reduce the turnaround time from one day to four hours. Once in production, additional DGX-1 systems are needed to support on-going model refinement and regression testing.
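
The quoted DriveNet figures can be verified with simple arithmetic; the short sketch below reproduces the roughly 21-hour turnaround time and the ten concurrent experiments for a five-developer workgroup:

  # Back-of-the-envelope check of the sizing figures quoted above
  images_per_epoch = 300_000
  epochs = 120
  images_per_second = 480                              # DriveNet on a single NVLink server

  total_images = images_per_epoch * epochs             # 36,000,000 images
  hours = total_images / images_per_second / 3600
  print(f"training time: {hours:.1f} h")               # ~20.8 h, i.e. roughly 21 hours

  developers, models_per_developer = 5, 2
  print("concurrent experiments:", developers * models_per_developer)   # 10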

HPC Workflow and Sizing: it all starts with the Data Model

The most important assets for HPC are your data and your data model. The volume can grow to petabytes.
A typical HPC development Workflow:

The workflow is detailed as follows:

  • Research and model development
  • Data Collection and Cleaning
  • Programming and Testing
  • Run your sets of independent experiments (see the sketch after this list)
  • Visualization: display your results
  • Check your results; if you detect errors, go back to step 1, otherwise archive your results
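
The last three steps can be sketched in a few lines: run the independent experiments in parallel, check the results, and archive them only if no errors were detected. run_experiment() and the parameter list are hypothetical placeholders for the real simulation code:

  # Steps 4-6 of the HPC workflow as a minimal sketch (placeholders, not real solver code)
  from concurrent.futures import ProcessPoolExecutor
  import json

  def run_experiment(param: float) -> dict:
      # Hypothetical stand-in for an independent HPC job (simulation, solver, ...)
      return {"param": param, "result": param ** 2, "error": False}

  if __name__ == "__main__":
      params = [0.1, 0.2, 0.5, 1.0]                    # one experiment per parameter set
      with ProcessPoolExecutor() as pool:              # experiments are independent, so run them in parallel
          results = list(pool.map(run_experiment, params))

      if any(r["error"] for r in results):
          print("Errors detected: go back to step 1 (model development)")
      else:
          with open("results_archive.json", "w") as f:   # archive the verified results
              json.dump(results, f)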

What are the building blocks of an Enterprise Deep Learning / HPC environment?

Following the workflow, we need the following system modules:

  1. Storage Systems
    1. Storage that is equally well suited for Deep Learning and HPC
      Because supercomputing workloads are write-intensive with sequential access, while AI workloads are read-intensive with random access, the storage systems we offer are designed to support both HPC and AI in the best possible way (see the sketch after this list). This way you avoid duplicate investments in hardware, storage software and training.
    2. Storage that can easily be expanded during operation by adding additional servers, JBODs (HDD) or JBOFs (NVMe/SAS/SATA SSD).
    3. Storage that supports fast operational data, archived data, and extremely fast data stores temporarily mounted on the local NVMe SSDs of converged compute and storage servers.
    4. Fault-tolerant storage that compensates for the loss of data, of complete HDDs or SSDs, or of complete servers including all data on their RAID volumes - and does so with commodity servers and shared-nothing hardware.
    5. Further information can be found in the storage part of this paper or in our web offer.
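
    The contrast between the two access patterns in point 1 can be illustrated with a short script. This is a toy sketch only: it ignores caching and direct I/O, real measurements should use a dedicated benchmark tool, and the file name and sizes are arbitrary assumptions:

      # Sequential, write-intensive I/O (HPC-style) vs. random, read-intensive I/O (AI-style)
      import os, random, time

      PATH, BLOCK, BLOCKS = "io_demo.bin", 1024 * 1024, 256     # 256 x 1 MiB test file

      t0 = time.time()
      with open(PATH, "wb") as f:                    # HPC checkpointing: large sequential writes
          for _ in range(BLOCKS):
              f.write(os.urandom(BLOCK))
      print(f"sequential write: {time.time() - t0:.2f} s")

      t0 = time.time()
      with open(PATH, "rb") as f:                    # DL training: many reads at random offsets
          for _ in range(BLOCKS):
              f.seek(random.randrange(BLOCKS) * BLOCK)
              f.read(BLOCK)
      print(f"random read:      {time.time() - t0:.2f} s")

      os.remove(PATH)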

  2. GPU-Compute Server
    1. GPU-Computing that is equally well suited for Deep Learning and HPC
      Since DL models often require an extreme amount of memory, it is very important that the GPU cards have as much local memory as possible. Therefore, current V100 GPU cards have 32 GB HBM2 memory.
    2. HPC is extremely compute-intensive and requires state-of-the-art graphics processors. The Tensor Cores of the V100 can also be used for HPC applications (see the sketch after this list).
    3. Further information can be found in the GPU Computing part of this paper or in our web offer.
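
    A short sketch, assuming PyTorch, of the two points above: it queries the local GPU memory and runs a half-precision matrix multiplication, the kind of operation the V100 Tensor Cores accelerate for DL and HPC workloads alike; the matrix size is an arbitrary assumption:

      # Check available GPU memory and run an FP16 GEMM, the classic Tensor Core workload
      import torch

      if torch.cuda.is_available():
          props = torch.cuda.get_device_properties(0)
          print(f"{props.name}: {props.total_memory / 1024**3:.0f} GB of GPU memory")

          a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
          b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
          c = a @ b                                   # executed on Tensor Cores on V100-class GPUs
          torch.cuda.synchronize()
          print("FP16 matmul result:", tuple(c.shape))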

  3. Cluster Network
    1. Deploy the most cost-effective network solution with the most advanced interconnect technology
      The network connects everything together and enables communication between all servers, administrators and developers, turning components into a system.
    2. Today's DL/HPC applications depend heavily on high-bandwidth, low-latency connections. In most cases, cluster fabrics are equipped with 56/100 Gbps FDR/EDR InfiniBand or 40/100 Gbps Ethernet (see the sketch after this list).
    3. Further information can be found in the networking part of this paper or in our web offer.
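
    A back-of-the-envelope helper for the bandwidth argument in point 2; the 1 TB dataset size is an assumption and the figures ignore protocol overhead:

      # How long does it take to move a dataset across the fabric at wire speed?
      def transfer_minutes(gigabytes: float, link_gbps: float) -> float:
          return gigabytes * 8 / link_gbps / 60        # GB -> Gbit, divided by Gbit/s, in minutes

      for link in (40, 100):                           # typical fabric speeds mentioned above
          print(f"1 TB over {link} Gbps: {transfer_minutes(1000, link):.1f} min")
      # 40 Gbps -> ~3.3 min, 100 Gbps -> ~1.3 min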

  4. Cluster Management Software
    1. Cluster management software extends your data center and lets you harness the power of the cloud.

  5. Optimized software stack for GPU server
    1. NVIDIA uses one optimized software stack for the whole DGX family.
    2. For non-DGX servers with V100 GPUs, sysGen will provide a similar software stack.
    3. Further information can be found in the Software Management tab of this page.

  6. Available Deep Learning Solutions and Frameworks
    1. There are already several viable solutions such as DIGITS from NVIDIA.
    2. Many different frameworks for various programming languages (C, C++, Python, Java, Scala, MATLAB) are available for your own applications, such as TensorFlow, Caffe, PyTorch, Theano and Deeplearning4j.
    3. Further information can be found on the Internet.

