I hope you’re as excited as I am and looking forward to seeing the first 3D XPoint™ based products in the market. Intel® Optane® SSDs have already been publicly demonstrated at IDF’15 and Oracle Open World 2015. Not every performance detail has been disclosed (keep in mind these were prototypes), but some key benchmarks, especially small random I/O at low queue depth, were shown. This brings the SSD closer to memory than ever. But how close? Can we actually use it as an extension of system memory? Short answer: yes, we can. There are different ways to do so, ranging from simple swapping/paging, through application changes to use mmap()/dimmap()/SSDAlloc(), to some very special products like the ScaleMP technologies discussed below.
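To make the mmap() route mentioned above concrete, here is a minimal sketch of mapping a file into a process and touching it like ordinary memory. It is not how vSMP Foundation works (that layer sits below the OS and needs no application changes). The file name, the 16 MiB size and the use of a temporary directory in place of a file system on an NVMe SSD are all assumptions made for illustration.

```python
import mmap
import os
import tempfile

SIZE = 16 * 1024 * 1024  # 16 MiB, hypothetical size for illustration


def demo():
    """Map a file into memory, write through the mapping, read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical backing file; on a real system this would live on
        # a file system backed by the SSD.
        path = os.path.join(tmp, "scm_backing.bin")
        with open(path, "wb") as f:
            f.truncate(SIZE)  # size the file before mapping it

        with open(path, "r+b") as f, mmap.mmap(f.fileno(), SIZE) as buf:
            buf[0:5] = b"hello"   # byte-addressable writes
            buf[SIZE - 1] = 0x7F  # touch the last byte of the region
            buf.flush()           # push dirty pages to the backing file
            return bytes(buf[0:5]), buf[SIZE - 1]


if __name__ == "__main__":
    print(demo())
```

The application has to be written around the mapping, which is exactly the kind of code change the transparent approach discussed below avoids.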
You may have heard of ScaleMP for their SMP virtualization technology, which turns a cluster of x86 systems into a single system (SMP). ScaleMP’s software, vSMP Foundation, runs below the OS layer and handles all the cache coherency and remote IO over the cluster fabric transparently to the OS. That allows the OS and applications to use the entire cluster’s resources (compute, memory and IO) for a single application.
Well, ScaleMP has introduced new extensions. Just as their “Memory over Fabric” capability uses algorithms to optimize access patterns and deliver high performance, the new extensions let you use NVM as if it were DRAM. It is as simple and transparent as it sounds! vSMP requires NVMe-based SSDs and supports only the Intel® SSD Data Center Family for PCIe.
For the examples below, consider a dual-socket system in early 2016. Using commodity DRAM you could reach 768 GB of DRAM (24 x 32GB DDR4 DIMMs). The memory subsystem alone would cost ~ $6,000 (32GB DIMMs retail online for about $250 these days). With ScaleMP we are targeting two key use-cases for Storage Class Memory (SCM) used as main system memory:
1. Replacing most of the DRAM – using ScaleMP’s technology, you could reduce DRAM to 128GB, using 4 x 32GB DDR4 DIMMs only, and use 2 x Intel® SSD DC P3700 of 400GB each. The benefits?
a. A CAPEX saving, as the hybrid memory (DRAM+NVM) costs at least 33% less.
b. An OPEX saving of 96 Watts, and similar savings in cooling (20 DIMMs removed at 6W each vs. 2 x 400GB NVMe devices added at 12W each).
c. Performance at 75% ~ 80% of DRAM performance for demanding workloads such as a multi-tenant DBMS running TPC-C.
2. Expanding on top of DRAM – using ScaleMP’s technology, you could easily increase the total system memory of the dual-socket server to ~ 8 TB.
a. To reach 8TB of RAM using only DRAM, one would need one of the highest-end servers that supports 192 DIMMs, populated with 128 DIMMs of 32GB and 64 DIMMs of 64GB. Such servers are power-hungry and require lots of space in the rack.
The alternative, using the dual-socket system described above, simply requires adding 4 NVMe devices of 2TB each, saving over 50% of the memory cost and rack space.
b. On the OPEX side, the difference is dazzling. A high-end system would require 1,152W just for its 192 DIMMs (at 6W per DIMM), while the alternative would require ~ 75% less power. I’ll skip describing the additional advantages of improved server density and datacenter standardization.
c. This setup allows the user to run 10x the number of memory-demanding workloads on a single server, with the overall throughput only marginally affected.
d. This allows the user to run massive in-memory DBMS in the most economical manner.
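The arithmetic behind the two use cases can be checked with a short script. This is only a sketch built from the figures quoted above ($250 and 6W per 32GB DIMM, 12W per 400GB NVMe device). The SSD budget and the alternative power figure are values derived from the stated percentages, not numbers given in this post, and the function names are made up for illustration.

```python
# Figures taken from the text (early-2016 values).
DIMM_32GB_PRICE_USD = 250
DIMM_POWER_W = 6
NVME_400GB_POWER_W = 12


def replace_most_dram():
    """Use case 1: 24 DIMMs (768 GB) vs. 4 DIMMs (128 GB) + 2 x 400GB NVMe."""
    baseline_cost = 24 * DIMM_32GB_PRICE_USD
    hybrid_dram_cost = 4 * DIMM_32GB_PRICE_USD
    # Derived, not stated in the text: a saving of at least 33% caps the
    # hybrid total at 67% of the baseline, which bounds the SSD spend.
    max_hybrid_cost = baseline_cost * 67 // 100
    max_ssd_budget = max_hybrid_cost - hybrid_dram_cost
    # 20 DIMMs removed, 2 NVMe devices added.
    power_saved_w = 20 * DIMM_POWER_W - 2 * NVME_400GB_POWER_W
    return {
        "baseline_cost": baseline_cost,
        "max_hybrid_cost": max_hybrid_cost,
        "max_ssd_budget": max_ssd_budget,
        "power_saved_w": power_saved_w,
    }


def expand_beyond_dram():
    """Use case 2: reaching 8 TB with 192 DIMMs in a high-end server."""
    capacity_gb = 128 * 32 + 64 * 64
    dimm_power_w = 192 * DIMM_POWER_W
    # Derived: "~ 75% less power" leaves roughly a quarter of the DIMM power.
    alternative_power_w = dimm_power_w * 25 // 100
    return {
        "capacity_gb": capacity_gb,
        "dimm_power_w": dimm_power_w,
        "alternative_power_w": alternative_power_w,
    }


if __name__ == "__main__":
    print(replace_most_dram())
    print(expand_beyond_dram())
```

Integer arithmetic is used throughout so that the percentages do not pick up floating-point rounding noise.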
By this point, I am sure you are wondering: “the $$$ savings look great, but what about performance?” Well, performance test results using the Intel® SSD DC P3700 are fresh from the oven. First, some details of the benchmark and configurations used:
The selected benchmark was an OLTP load: 5 instances of the MySQL DBMS (Percona distribution) concurrently running the TPC-C benchmark, each instance using 25 warehouses with 128 connections, totaling 330GB of memory (all data loaded into main memory) plus 160GB of buffer cache.
• Warmup – TPC-C runs for a period of 6,000 seconds.
• Measurement – TPC-C runs for a period of 7,200 seconds.
The hardware used was a dual-socket E5-v3 system, with one of two configurations:
• DRAM-only: 512 GB RAM (DDR4) – baseline server configuration (no ScaleMP software used for this setup)
• Hybrid DRAM-NVM: using the same server, but keeping only 64GB RAM (DDR4) and adding 2 x Intel DC P3700 NVMe SSDs to provide the missing 448GB of system memory. ScaleMP’s software was used to make the system look the same as the above to the OS.
When running the Linux command ‘free’, the result was the same on both configurations (see below). Clearly the ScaleMP software did its job by hiding from the OS the fact that it is using a hybrid DRAM-NVM memory subsystem.
[root@s2600wt-0 ~]# free -gh
              total        used        free      shared  buff/cache   available
Mem:           503G        3.4G        316G        9.6M        184G        499G
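If you want to compare the two configurations from a script rather than by eye, the ‘Mem:’ row can be turned into a dictionary. This is a minimal sketch that assumes the column layout shown in the output above; the function name is hypothetical, and the sample text is simply that output pasted in.

```python
def parse_free_mem_row(output):
    """Return the 'Mem:' row of `free` output as a {column: value} dict."""
    rows = [line.split() for line in output.splitlines() if line.strip()]
    header = next(row for row in rows if row[0] == "total")
    mem = next(row for row in rows if row[0] == "Mem:")
    return dict(zip(header, mem[1:]))  # drop the leading "Mem:" label


SAMPLE = """\
total used free shared buff/cache available
Mem: 503G 3.4G 316G 9.6M 184G 499G
"""

if __name__ == "__main__":
    print(parse_free_mem_row(SAMPLE))
```

Running the same parser on both servers and comparing the "total" entry is one way to confirm that the OS sees the same memory size in either setup.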
Now, for the benchmark results. We summed the results of the 5 TPC-C instances, measured in tpmC:
• For the “DRAM-only” configuration we got: 217,757
• For the hybrid DRAM-NVM configuration we got: 166,782
In other words, the Intel® SSD DC P3700 used as a memory replacement reached 75%~80% of DRAM performance (76.6% to be precise)! Keep in mind that this number may vary from one application to another, but we consider TPC-C representative of a basic datacenter workload, so it is a good reference point.
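The 76.6% figure follows directly from the two summed tpmC results. As a quick sketch (the function and constant names are made up for illustration):

```python
DRAM_ONLY_TPMC = 217_757  # summed tpmC, DRAM-only configuration
HYBRID_TPMC = 166_782     # summed tpmC, hybrid DRAM-NVM configuration


def relative_performance(hybrid, baseline):
    """Hybrid throughput as a percentage of the baseline throughput."""
    return 100.0 * hybrid / baseline


if __name__ == "__main__":
    pct = relative_performance(HYBRID_TPMC, DRAM_ONLY_TPMC)
    print(f"Hybrid reaches {pct:.1f}% of DRAM-only throughput")
```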
The pricing and performance information above is valid for early 2016 and is based only on the Intel® SSD Data Center Family for PCIe. The upcoming Intel® Optane® SSDs, based on 3D XPoint™ technology, will likely enable Intel and ScaleMP to push performance even closer to that of DRAM.
If Intel and ScaleMP deliver on the promise of improved performance with Optane SSDs, they will arguably eliminate the border between Main Memory and Storage Class Memory (SCM). It will allow SCM to be used for OS and application memory transparently, without any code changes. While the Intel Optane SSDs will reduce the latency to storage, ScaleMP software already makes it byte addressable from the application’s perspective and uses smart caching technology to bring the average latency very close to that of DRAM. The TCO story looks great even when licensing for the vSMP software is taken into account; licensing is not covered here, so I should direct you to ScaleMP’s web site for the details.
If your application is limited by the amount of DRAM in a box, now we can easily say that the sky is the limit for that application!
Andrey Kudryavtsev, Intel Corp.
Benzi Galili, ScaleMP.com