In science and business applications, in-memory computing is a key technology for processing data stored in an "in-memory database". Older systems have been based on traditional storage media (HDDs and SSDs) and relational databases using the SQL query language, but these are increasingly regarded as inadequate for today's science and business intelligence needs.
Stored data is accessed much more quickly when it is placed in random-access memory (RAM) or flash memory rather than on an HDD or SSD. "In-memory processing" therefore allows data to be analysed in real time, enabling faster reporting and decision-making in business.
IT department adoption of in-memory computing (IMC) is on the rise, alongside specific use cases of in-memory technology such as stream analytics.
In-memory computing is the latest paradigm for performance computing, but scaling memory-centric architectures and mega-platforms is becoming a serious problem for SaaS vendors, cloud providers, and enterprises that need to harness massive datasets.
Architecting DRAM-based computing clusters should be a straightforward exercise: provision enough DRAM capacity across your clusters for your data sets to fit. In practice, however, real-time data can easily grow beyond the limits of physical DRAM pools.
Scaling in-memory compute infrastructure is a real problem in the era of up-to-the-second decision making for larger and larger data sets.
When data sets do exceed physical DRAM pools, application performance suffers: swapping or paging data to a lower-performance tier inhibits performance in an era when the competition may have larger DRAM pools and better performance than you. Further architectural complications arise when software sharding across numerous nodes and LIFO cache-purging techniques put unique stresses on scaling in-memory computing.
As data sets grow, scaling in-memory computing infrastructure faces real barriers:
Cloud and IT architects face a Sisyphean task in scaling compute clusters to accommodate growth. Just as you scale your in-memory compute clusters, data sets grow to exceed DRAM pools again, or you incur immense CAPEX and OPEX in order to keep scaling.
Ultrastar® DC ME200 Memory Extension Drive with Memcached enables scalable in-memory caching for a better TCO
Mobile and web applications succeed or fail by their responsiveness. A shopping website whose product pages take half a second to generate is in danger of losing impatient customers to a quicker site. A mobile gaming world whose state updates take too long will suffer from “lag” and may be dropped by gamers seeking a more immersive and responsive gameplay experience.

Memcached has been used to speed up these kinds of applications for over 15 years. It provides a simple, high-performance means of updating and storing transient user state or caching results of heavyweight database processes. This has the double benefit of reducing latency for the end user while also minimizing the load on the backend database.
Memcached Server Sprawl
Memcached is a cache whose size can grow to the size of a system’s memory. The larger the cache, the higher its hit rate and effectiveness. Because the amount of DRAM that can economically be added to conventional servers is limited, large arrays of servers are often used to increase the total usable cache for an application. This poses two problems for data center architects: First, while it is really only the extra memory that Memcached needs, each additional server has significant non-memory costs such as the processors, power supplies, and motherboards. Second, the space, power, and cooling required by these additional servers and their DRAM can become a significant operational expense over time.
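The relationship between cache size and hit rate can be illustrated with a toy LRU cache in Python. This is a hypothetical sketch with made-up key counts and a synthetic skewed access pattern; Memcached's real eviction and slab allocation are more sophisticated, but the trend (more usable cache, higher hit rate) is the same:

```python
from collections import OrderedDict
import random

class LRUCache:
    """Toy LRU cache for illustrating hit rate vs. capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)       # mark as most recently used
            self.hits += 1
            return self.store[key]
        self.misses += 1
        self.store[key] = f"value-{key}"      # simulate fetching from the backend DB
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict the least recently used entry
        return self.store[key]

def hit_rate(capacity, n_keys=10_000, n_requests=100_000, seed=42):
    rng = random.Random(seed)
    cache = LRUCache(capacity)
    for _ in range(n_requests):
        # Skewed traffic: a small set of hot keys dominates requests
        key = int(n_keys * rng.random() ** 3)
        cache.get(key)
    return cache.hits / n_requests

small = hit_rate(capacity=500)
large = hit_rate(capacity=4_000)   # 8x the cache, as with memory extension
print(f"hit rate with 500-entry cache:  {small:.2%}")
print(f"hit rate with 4000-entry cache: {large:.2%}")
```

Under this access pattern, the eight-times-larger cache absorbs a substantially larger share of requests, which is exactly the effect that motivates growing the Memcached pool.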
Avoiding Memcached Sprawl
By increasing the economically reasonable amount of effective RAM available per-server up to eight times, the Ultrastar DC ME200 Memory Extension Drive can help data center architects keep Memcached server sprawl to a minimum. This memory extension not only reduces the number of servers required, it also helps make better use of the remaining servers, by allowing the CPUs and other overhead components in any one server to be amortized over a larger amount of Memcached cache space.
Replacing DRAM with Ultrastar DC ME200 Memory Extension Drive
Western Digital has benchmarked the Memcached performance of Ultrastar memory in order to validate that it provides near-DRAM performance while expanding RAM capacity four or eight times at significant cost savings. Mirroring a typical Memcached use case, the testing consisted of high concurrency, small requests (1KB) with a 10:90 SET to GET ratio from several testing clients to the Memcached server.
The Memcached server was configured as a baseline using only physical DRAM to provide 768 GiB of system RAM. This baseline system then had its DRAM reduced by three-quarters, to only 192 GiB, while using Ultrastar memory to provide a combined total of 768 GiB of effective system RAM. Finally, the server had its physical memory reduced to a mere 96 GiB of DRAM while using Ultrastar memory to provide the remainder of the total 768 GiB of system RAM, a reduction in DRAM usage of 87.5%.
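The DRAM savings in these two test configurations follow from a one-line calculation:

```python
def dram_reduction(total_gib, dram_gib):
    """Fraction of system RAM moved off DRAM onto the extension drive."""
    return 1 - dram_gib / total_gib

print(dram_reduction(768, 192))  # the four-times expansion configuration: 75% less DRAM
print(dram_reduction(768, 96))   # the eight-times expansion configuration: 87.5% less DRAM
```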
As shown in the graph below, the Ultrastar memory-enabled system was able to provide 85% of the performance of a full 768 GiB DRAM solution while requiring only 96 GiB of DRAM backed by the Ultrastar memory devices. Enabling Memcached servers with such capacities and such economical DRAM costs can minimize the number of Memcached servers required in a cluster.
Ultrastar® DC ME200 Memory Extension Drive for databases allows for better memory management for improved server utilization
Relational databases are foundational to business structures across all industries. Organizations and enterprises of all sizes use databases to store everything from contact information, online and offline transactions, warehouse inventories, and payroll to other transactions of record. Individual organizations can have hundreds or thousands of these databases. There are even multiple service providers whose primary focus is managing and running these database platforms-as-a-service as cloud offerings.

Often these databases are not large or performance-critical enough to warrant the deployment of a server per instance. Database administrators can make better use of their server infrastructure by running multiple database instances on a single server. This multi-instance deployment architecture lets them more fully employ the compute, storage, and networking of modern servers.
Multi-Instance Databases Limited by Memory
The problem is that each database instance uses a portion of main system memory to cache data and updates. In many cases, the larger this set-aside memory, the faster the database, so database administrators try to make these RAM allocations as large as possible. For service providers or internal data center architects who are trying to maximize the number of databases hosted on each server, these pools often consume all of system memory before processing or storage limits are reached. Administrators are forced to choose between increasing operating costs, by running servers at lower load than they would like in order to allocate enough RAM, and increasing initial costs, by outfitting servers with expensive high-capacity memory subsystems.
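The packing trade-off can be made concrete with a back-of-the-envelope calculation. The buffer-pool size and OS reserve below are hypothetical figures chosen only for illustration:

```python
def max_instances(system_ram_gib, buffer_pool_gib, os_reserve_gib=16):
    """How many database instances fit before RAM, not CPU, runs out.

    Assumes each instance is given a fixed buffer pool and the OS
    keeps a small reserve; both sizes are illustrative assumptions.
    """
    return (system_ram_gib - os_reserve_gib) // buffer_pool_gib

# A 768 GiB DRAM-only server vs. the same server at an 8x effective
# memory expansion (768 GiB x 8 = 6144 GiB), with 64 GiB buffer pools.
print(max_instances(768, 64))    # instances on the DRAM-only server
print(max_instances(6144, 64))   # instances with memory extension
```

RAM is the binding constraint in both cases, which is why expanding effective memory raises instance density without touching the CPU budget.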
Maximizing Database Consolidation
The Ultrastar DC ME200 Memory Extension Drive gives architects and administrators another choice by expanding system memory up to 8 times the capacity of its physical DRAM, at a fraction of the cost of DRAM, using a custom NVMe-based solution. Such memory extension allows database administrators to offer appropriately sized per-database buffer memories while simultaneously allowing data center architects to maximize the number of database instances with near-memory performance per server and increase server utilization.
Replacing DRAM with Ultrastar DC ME200 Memory Extension Drive
To validate the value of the Ultrastar memory drive when consolidating multiple OLTP database instances, Western Digital ran a series of tests comparing the performance of a 768 GiB DRAM-only configuration versus that of the same server with only 192 GiB of DRAM combined with the Ultrastar DC ME200. Multiple instances of Oracle® MySQL™, a popular open-source database, were run in parallel while using an OLTP (TPC-C-like) test generator. The aggregate performance of the full DRAM configuration was compared to the aggregate performance of the four-times and eight-times extended configurations. As shown in the graph, the Ultrastar DC ME200 system with only 192 GiB of DRAM and a four-times expansion factor provided 70% of the performance of the 768 GiB DRAM system, showing how multiple instances can be run with significantly less DRAM while still performing acceptably.
Ultrastar® DC ME200 Memory Extension Drive with Redis enables scalable in-memory caching and data stores for a better TCO
Redis in-memory data stores and caching engines improve application performance by storing or caching frequently accessed data items in main memory for faster data retrieval. To achieve the highest performance, the entire dataset must be stored in memory. To future-proof a growing dataset, or when the data being handled is larger than the available memory in a single server, Redis engines allow scale-out configurations across multiple nodes using sharding. This scale-out approach requires careful balancing of the amount of memory and processing power available on a per-node basis within the cluster.
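Sharding works by deterministically mapping each key to a node, so every client independently agrees on where a key lives. The minimal Python sketch below shows the idea; real Redis Cluster instead assigns keys to 16,384 hash slots via CRC16, with slots distributed across nodes:

```python
import hashlib

def shard_for(key: str, n_shards: int) -> int:
    """Map a key to a shard by hashing (illustrative only; Redis Cluster
    uses CRC16 over 16384 hash slots, handled by client libraries)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_shards

# A 6-node cluster: every client routes a given key to the same node.
for key in ("user:1001", "cart:1001", "session:abc"):
    print(key, "-> shard", shard_for(key, 6))
```

Because the mapping is fixed per key, adding nodes to gain memory also redistributes keys, which is part of why per-node memory capacity matters so much in sizing a cluster.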
Scalability and the Need for a High Memory-to-CPU Ratio
Redis can quickly become bottlenecked on memory rather than CPU when message sizes are small and CPU utilization is low. Modern server architectures are limited to 12 DIMMs per socket, and in the case of high-density servers, only 6-8 DIMMs per socket. Meanwhile, processor core counts continue to increase with every generation, reducing the overall memory-to-core ratio even further.

As a result, the number of nodes in a large-scale cluster is determined by the amount of memory in each node, not by the compute capacity per node. This is evident in typical deployments, which see CPU utilization on the nodes in the range of 10%-20% for most common scenarios. The result of introducing additional servers just to house extra memory is unnecessarily high CAPEX and OPEX for the infrastructure: more hardware, more processors, more networking, more cooling, and a larger data center footprint.
Ultrastar DC ME200 Memory Extension Drive Overview
The Ultrastar memory drive combines one or more custom NVMe drives, tuned for performance, with a software layer that expands system DRAM onto them. Unmodified Linux operating systems using this technology can address system memory up to eight times the capacity of the DRAM installed in a server, at near-DRAM speeds. Memory-intensive Redis caching and memory stores can utilize this extra system RAM without any changes. For example, a 1U server with 256 GiB installed can make use of up to 2 TiB of Ultrastar memory.
Ultrastar Memory Drive Benchmarked Performance
The Redis benchmark uses high-concurrency SET/GET operations on small (1kB) and large (100kB) messages, where the Redis server and client-load systems are connected over a 10GbE network connection. Ultrastar memory drives allow for multiple terabytes of main memory per server.
However, since it is impossible to set up a comparable 6TB DRAM-only server, an apples-to-apples comparison was performed using a DRAM-only server with 768GB of DDR4 versus a server with 192GB of DDR4 plus Ultrastar memory for a total of 768GB. The graphs below show that the Ultrastar memory drive delivered Redis performance at 86%-94% of DRAM-only performance with a dataset 4x larger than DRAM.
This chart compares two Ultrastar memory configurations to a baseline configuration of the same dual-socket system with 1.5TiB of DDR4.
- The first Ultrastar memory configuration shows that reducing DRAM from 1.5TiB to 256GiB and adding 2 Ultrastar memory devices reduces total system cost by 35% and provides applications with 33% more memory.
- The second Ultrastar memory configuration shows that by reducing DRAM from 1.5TiB to 384GiB and adding 2 Ultrastar memory devices, applications benefit from 100% more memory per node while reducing the system cost by 10%.
The ability to configure a server with twice as much memory for a similar cost per server enables users to reduce the number of servers in a Redis cluster by up to 50% while supporting the desired data-set size. This directly reduces CAPEX for the entire infrastructure (servers, networking) by at least 50%, and shrinks the data-center footprint and power consumption for further OPEX savings.
Ultrastar® DC ME200 Memory Extension Drives enable better memory management for scientific computing
Datasets are growing at an exponential pace. This is great for scientific workloads, since more detailed observations can be made, confidence intervals can shrink, and calculations can arrive at more meaningful answers. Groundbreaking discoveries in medicine, astrophysics, meteorology, fundamental physics, computational fluid dynamics, and more are now being made thanks to this new data.
In scientific computing, many codes can be broken down to work on small subsets of data with their results coalesced into a single, final result. Such divide-and-conquer methods have enabled the explosion of clusters of discrete servers as used in modern HPC clusters. In these clusters, problems are segmented into chunks that fit into the working memory of the individual servers in the cluster, usually somewhere between 64 and 256 GiB. As the problem grows, either the number of servers required to fit it into DRAM is increased, or it simply takes longer to get an answer as more jobs of the available DRAM-size are queued.
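The queueing effect described above can be sketched numerically. The dataset and node-memory sizes below are illustrative assumptions, not benchmark figures:

```python
import math

def jobs_needed(dataset_gib, node_ram_gib):
    """Number of DRAM-sized chunks (and hence queued jobs) for a problem
    that must be segmented to fit each node's working memory."""
    return math.ceil(dataset_gib / node_ram_gib)

# A hypothetical 6 TiB problem on 256 GiB nodes vs. on nodes whose
# effective RAM has been extended to 2 TiB per node.
print(jobs_needed(6144, 256))   # jobs on 256 GiB nodes
print(jobs_needed(6144, 2048))  # jobs on 2 TiB effective-RAM nodes
```

Fewer, larger jobs mean less queueing and less coalescing overhead, which is the equation the memory extension drive changes.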
The Ultrastar DC ME200 Memory Extension drive from Western Digital can change that equation by expanding server RAM up to eight times the capacity of its physical DRAM through the use of one or more custom NVMe devices. This can allow certain classes of memory-constrained codes to run on fewer servers, reducing needed cluster time and allowing an HPC cluster to be more effective.
Providing Terabytes of RAM in a Single Server
Up to 24 TiB of effective RAM space can be provided in a single 2-socket, 2U server using 3 TiB of DRAM, multiple Ultrastar DC ME200 drives and provided software. More modest memory expansion configurations are supported, for example 1 TiB of effective RAM using only 128 GiB of DRAM and 1 TiB of Ultrastar DC ME200 drives.
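Both stated configurations follow directly from the maximum 8x expansion factor:

```python
def effective_ram_tib(dram_tib, expansion=8):
    """Effective RAM ceiling at the stated maximum 8x expansion factor."""
    return dram_tib * expansion

print(effective_ram_tib(3))      # 24 TiB from 3 TiB of DRAM, the maximum config
print(effective_ram_tib(0.125))  # 1 TiB from 128 GiB of DRAM, the modest config
```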
Standard Linux operating systems are supported, with the memory extension technology loading before, and operating below, the level of the OS. The operating system simply sees the combined DRAM and Ultrastar DC ME200 capacity. Applications can then expand their memory allocations and process larger chunks of datasets per instance.
Code Classes that Can Benefit
Many codes can benefit from the Ultrastar DC ME200 with little to no performance impact, such as:
- Very large matrix codes with high-locality operations
- Highly parallelized, block-based memory accesses
- Streaming/striding memory-bound operations
- Codes which are complicated to split across multiple system images
Codes which are heavily CPU-limited, single-threaded, or which have a completely random access pattern may not benefit fully from the Ultrastar memory drive. However, when the cluster is running multiple single-threaded jobs on a single node, the extra RAM available with the Ultrastar memory drive may still improve node utilization by allowing multiple instances of single-threaded jobs.
SGEMM Example: Very Large Matrix Multiplication
Large matrix operations are very commonly a major part of scientific codes. A standard kernel used by many of these codes is the Single-precision GEneral Matrix Multiply, or SGEMM. The following test results were obtained using a segmented SGEMM on an in-memory matrix of increasing size, from 768 GiB (DRAM) up to almost 5 TiB (DRAM plus Ultrastar memory drive). The matrix multiply was performed in segments, much as it would be broken up to execute across a small cluster of nodes, and the effective FLOPS were calculated. The results are shown in the graph: a matrix seven times larger than DRAM was operated on by a single node with less than a 10% performance degradation versus one fully fitting within DRAM.
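The segmented approach corresponds to a classic blocked matrix multiply. The toy pure-Python version below (real codes call optimized BLAS for this) shows the tiling pattern that keeps each segment's working set small and local:

```python
def blocked_matmul(a, b, n, block=2):
    """Blocked (segmented) n x n matrix multiply: the access pattern behind
    SGEMM-style kernels. Each block triple touches only a small tile of
    A, B, and C, which is what makes the segments memory-friendly."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # accumulate the contribution of one tile pair into C's tile
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        s = c[i][j]
                        for k in range(kk, min(kk + block, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] = s
    return c

n = 4
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
print(blocked_matmul(a, identity, n) == a)  # True: A x I equals A, tile by tile
```

In the benchmark above, the same tiling idea is applied at a much larger scale, with tiles sized to the DRAM tier while the full matrix resides in extended memory.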