Building an HPC and Artificial Intelligence Environment
Artificial Intelligence (AI) is a branch of computer science concerned with developing intelligent machines that work and react like humans. It is the grand project of creating a non-human intelligence.
Today, deep learning and high-performance computing tasks are performed on the same hardware platform, with shared cluster-management software and other components that are either shared or used separately. Before we turn to the hardware and software required for this, we first want to take a closer look at the different workflows.
The most important asset for deep learning is your data, and volumes can reach petabytes. The more, the better.
A typical AI/deep learning development workflow:
Sizing DL training depends strongly on data size and model complexity. A single Tesla NVLink server (e.g. DGX-1) can complete a training experiment on a wide variety of AI models in one day. For example, the autonomous-vehicle software team at NVIDIA developing NVIDIA DriveNet uses a custom ResNet-18 backbone detection network with a 960x480x3 image size and trains at 480 images per second on such servers, allowing 120 epochs over 300k images to be trained in 21 hours.

Internal experience at NVIDIA has shown that five developers collaborating on the development of one AI model provides the optimal development time. Each developer typically works on two models in parallel, so the infrastructure needs to support ten model-training experiments within the desired TAT (turn-around time). A DGX POD with nine DGX-1 systems can provide a one-day TAT for model training for a five-developer workgroup. During schedule-critical times, multi-node scaling with eight DGX-1 servers can reduce the turnaround time from one day to four hours. Once in production, additional DGX-1 systems will be needed to support ongoing model refinement and regression testing.
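The turnaround-time arithmetic above can be checked with a few lines (all figures are the ones quoted in the text; the eight-server estimate assumes ideal linear scaling, which real multi-node training does not quite reach):

```python
# Sketch of the training turnaround-time (TAT) arithmetic quoted above.
images = 300_000          # dataset size (from the text)
epochs = 120              # training epochs (from the text)
images_per_sec = 480      # DriveNet throughput on one DGX-1 (from the text)

hours = images * epochs / images_per_sec / 3600
print(f"single-server training time: {hours:.1f} h")   # ~20.8 h, i.e. the "21 hours"

# With eight servers and (idealized) linear multi-node scaling:
print(f"eight-server training time: {hours / 8:.1f} h")
# the text's four-hour figure reflects real-world scaling losses
```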
The most important things for HPC are your data and your data model. Volumes can reach petabytes.
A typical HPC development workflow:
From these workflows, the following system modules are needed:
sysGen GPU-accelerated servers offer the highest possible performance. Reduce the runtimes of your high-performance computing applications, and cut deep learning training and inference times, with the NVIDIA Volta architecture. Get access to over 500 HPC applications and all NVIDIA TensorRT™ deep learning frameworks.
Deep Learning and HPC are extremely compute-intensive and require state-of-the-art graphics processors.
Since DL models often require an extreme amount of memory, it is very important that the GPU cards carry as much local memory as possible. Current V100 GPU cards therefore come with 32 GB of HBM2 memory.
In the vast majority of cases, model calculations are distributed over 4 to 16 GPU cards. Frequent retrieval of new data from mass storage slows such calculations considerably, so in many cases the individual GPU cards exchange data directly with each other. If this exchange takes place over the PCIe bus, the calculation still slows down, though to a lesser extent. It is important that all GPU cards are connected through a single CPU; this is referred to as a single root complex. Up to 10 GPU cards can be connected via a single root, but 8 GPU cards is a more reasonable number. Direct NVLink connections between the GPU cards offer a strong improvement over a PCIe connection: a single NVIDIA Tesla® V100 GPU supports up to six NVLink connections with a total bandwidth of 300 GB/s, 10X the bandwidth of PCIe Gen 3.
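The bandwidth figures above can be sanity-checked with simple arithmetic (the per-link and PCIe numbers are assumptions based on commonly quoted bidirectional aggregates, not taken from this text):

```python
# Back-of-the-envelope check on the NVLink vs. PCIe figures quoted above.
nvlink_links = 6       # NVLink connections per V100 (from the text)
gbs_per_link = 50      # GB/s per NVLink 2.0 link, bidirectional (assumption)
nvlink_total = nvlink_links * gbs_per_link
print(f"NVLink total: {nvlink_total} GB/s")        # 300 GB/s, as quoted

pcie_gen3_x16 = 32     # ~16 GB/s per direction for PCIe Gen 3 x16 (assumption)
print(f"speed-up over PCIe: {nvlink_total / pcie_gen3_x16:.1f}x")
# ~9.4x, consistent with the "10X" in the text
```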
NVIDIA DGX-2 is the fastest server available today, with superior technology and the best price/performance ratio. It dramatically reduces runtimes and thus development time, and its performance per square foot is the best we have ever seen. On top of that, you receive regular software updates with optimized Docker containers that keep your systems and clusters up to date.
sysGen supports a number of leading file systems, but for enterprise clusters we prefer BeeGFS or offer solutions from our partners Pure Storage and DDN Storage. Because supercomputing is write-intensive with sequential access, while AI is read-intensive with random access, these storage systems are designed to support both HPC and AI in the best possible way.
BeeGFS transparently spreads user data across multiple servers. By increasing the number of servers and disks in the system, you can simply scale performance and capacity of the file system to the level that you need, seamlessly from small clusters up to enterprise-class systems with thousands of nodes.
Take a look at what BeeGFS can do for you:
Shorten the time to get an insight into your data and information on site and in the cloud
Cluster storage systems for High Performance Computing (HPC) and Deep Learning (DL)
Modern cluster storage must handle workloads for HPC and DL equally well
HPC computing is write-intensive and sequential - AI is more read-intensive and randomized
sysGen storage systems are designed to support both HPC and AI with maximum performance
BeeGFS runs on various platforms, such as X86, OpenPower, ARM and more
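The striping idea described above — spreading a file's data across multiple storage servers so that bandwidth and capacity scale with the server count — can be illustrated with a toy round-robin model. This is an illustration of the concept only, not BeeGFS's actual chunk-placement logic:

```python
# Toy model of file striping: chunks are distributed round-robin
# across storage targets, so reads and writes hit all servers in parallel.
def stripe(file_size: int, chunk_size: int, num_targets: int) -> dict:
    """Return a mapping target -> list of chunk indices stored there."""
    layout = {t: [] for t in range(num_targets)}
    num_chunks = -(-file_size // chunk_size)   # ceiling division
    for chunk in range(num_chunks):
        layout[chunk % num_targets].append(chunk)
    return layout

# Example: a 10 MiB file in 1 MiB chunks over 4 storage targets.
# Chunks 0, 4, 8 land on target 0, chunks 1, 5, 9 on target 1, and so on.
print(stripe(10 * 2**20, 2**20, 4))
```

Doubling the number of targets halves the number of chunks each server must serve, which is the intuition behind the "simply scale performance by adding servers" claim.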
Pure Storage FlashBlade is an extremely high-performance, highly secure scale-out architecture for unstructured data.
Elastic performance that grows with data – up to 17 GB/s
Always-fast, from small, metadata-heavy workloads to large, streaming files
All-flash performance with no caching or tiering
Petabytes of capacity
10s of billions of objects and files
“Tuned for Everything” design, no manual optimizations required
Scale-out everything instantly by simply adding blades
DDN A³I storage solutions are fully-optimized to accelerate machine learning and artificial intelligence (AI) applications, streamlining deep learning (DL) workflows for greater productivity.
Streamlined Workflows
Flexible Scaling
Deep Learning Accelerator
Augmented Data Discovery
Fully-Optimized and Integrated
From the At-Scale Experts
System developers often underestimate the installation and maintenance effort for complex software systems. Time lost until a system is ready for use, and performance losses due to poor tuning, cause high costs and may delay the introduction of new products. The maintenance effort needed to keep pace with rapidly evolving software is also widely underestimated. We therefore deliver our systems and clusters with pre-installed software on request, so that our HPC and DL systems and clusters can be used immediately.
For all systems with TESLA GPU cards we install the NVIDIA GPU Cloud Software for free. For NVIDIA DGX systems, we offer NVIDIA support contracts for one, two, or three years. For all other systems we offer a sysGen update service for the NVIDIA GPU Cloud Software. However, you can also carry out the updates yourself.
One software stack for the whole family:
For HPC and DL cluster systems for parallel multiuser operation, we offer the following solutions:
High-Performance Computing (HPC) is used for scientific, technical and commercial tasks in the calculation, modeling and simulation of complex systems and the processing of large amounts of data. sysGen has been a successful solution provider for HPC clusters for more than 20 years and has supplied the most powerful HPC clusters with GPU coprocessors in the EMEA region. Now traditional HPC and Artificial Intelligence problems run in the same cluster environment and must be handled on the same HPC systems.
Our software automates the process of building and managing Linux clusters in your data center and in the cloud:
READ more: Deep learning and high-performance computing are converging, and the required infrastructure and cluster software are virtually identical for both applications. Take a look at our solutions pages and get an idea of the extreme performance of Tesla V100 solutions.
https://www.sysgen.de/nvidia-dgx-2-the-fastest-path-to-ai-scale-on-a-whole-new-level.html
READ more: You should pay special attention to the world's most powerful HPC/DL server, the DGX-2. The DGX-2 has 16 V100 GPUs connected bidirectionally via 12 NVSwitches at 2.4 TB/s, and they work like a single virtual GPU with 512 GB of memory. Thus, complex tasks are solved in a fraction of the previous computing time.
https://www.sysgen.de/dgx-2.html
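The DGX-2 figures quoted above follow from simple arithmetic (the per-GPU memory and NVLink numbers are the ones given elsewhere in this text; the bisection reading of the 2.4 TB/s figure is one plausible interpretation, not a claim from the source):

```python
# Arithmetic behind the quoted DGX-2 figures.
gpus = 16
hbm2_per_gpu_gb = 32             # V100 with 32 GB HBM2 (from the text)
print(f"pooled GPU memory: {gpus * hbm2_per_gpu_gb} GB")         # 512 GB

# One plausible reading of the 2.4 TB/s figure: bisection bandwidth,
# i.e. 8 GPUs on each side of the NVSwitch fabric x 300 GB/s per GPU.
print(f"bisection bandwidth: {(gpus // 2) * 300 / 1000} TB/s")   # 2.4 TB/s
```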
We deliver all systems, whether single deep learning systems or complete cluster systems, with pre-installed software. This means that, regardless of the requirement (HPC/DL), each system can be used directly without delaying productivity. We usually use open-source packages, but depending on budget and project size, professional management software is also used.