Samsung's KV-SSD: An NVMe SSD with In-storage Key-value Store

In the previous post, we talked about NVMe and saw how it lets us exploit the true capabilities of SSDs. As noted there, although NVMe drives have significantly better performance, storage systems designed with much slower drives in mind cannot fully utilize them. That's why we need to revise our assumptions about storage and adjust our systems accordingly. Samsung's prototype, KV-SSD, is an effort in that direction. KV-SSD is an NVMe SSD with a key-value API that allows us to store and retrieve variable-size key-value pairs directly to and from the device. This can remarkably simplify the software stack of existing key-value stores, letting us fully utilize NVMe drives and scale linearly by adding more NVMe devices.

Conventional Key-value Stores

Let's first see what the overall system looks like when we run a key-value store like RocksDB on top of an SSD. Figure 1 shows the main components. At the very top, we have the key-value application that uses the key-value API. Next, we have a key-value store such as RocksDB. The key-value store uses the file system managed by the OS to read and write its data. The OS uses the SSD driver to communicate with the SSD. In a conventional SSD, the unit of communication between the OS and the SSD is the block, i.e. the OS reads and writes blocks, each of which may contain multiple key-value pairs.

Figure 1. The software stack on top of SSDs in conventional key-value stores

Let's consider an LSM key-value store such as RocksDB. Its two main data structures are the memtable and the SSTable. The memtable lives in memory, and SSTables live in persistent storage. When we put keys into the key-value store, they are added to the memtable. Once the memtable reaches a certain size, it is written to storage as an SSTable. Of course, as always, we need a Write-Ahead Log (WAL) that contains the updates we put into the memtable. The WAL is on persistent storage, so in case of a crash we can restore the memtable from the WAL. SSTables are immutable: whenever the memtable reaches a certain size, we create a new SSTable, store it, and never change it.
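To make the write path concrete, here is a minimal sketch of it in Python. It is a toy model, not how RocksDB is implemented: the class name `TinyLSM`, the `memtable_limit` threshold, and the use of in-memory lists as stand-ins for the on-disk WAL and SSTables are all assumptions made for illustration.

```python
class TinyLSM:
    """Toy LSM write path: WAL append, memtable insert, flush to an immutable SSTable."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold (number of keys)
        self.wal = []        # stand-in for the on-disk write-ahead log
        self.memtable = {}   # in-memory table of recent writes
        self.sstables = []   # flushed tables, oldest first; never modified after creation

    def put(self, key, value):
        self.wal.append((key, value))   # log first, so a crash cannot lose the update
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable SSTable (a tuple here).
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.wal = []   # the flushed data is now durable, so its log entries can go

    def recover(self):
        # After a crash the memtable is gone; rebuild it by replaying the WAL.
        self.memtable = dict(self.wal)


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)      # reaches the limit and triggers a flush
db.put("c", 3)
db.memtable = {}    # simulate a crash that wipes memory
db.recover()
print(db.sstables)  # [(('a', 1), ('b', 2))]
print(db.memtable)  # {'c': 3}
```

Note that every `put` touches the WAL before the memtable, which is one of the extra writes that show up later as write amplification.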

But if we keep creating SSTables and never update them, then our database keeps growing forever!
Yes. That's why we need a background process that merges these SSTables and drops stale versions of keys. This is called compaction.

So until then, when we want to read a key, where can we find it?
To read a key, we first check the memtable. If the key is not there, we check the most recent SSTable. If it is not there either, we check the second most recent SSTable, and we continue this until we find the key. 
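The read path and the compaction step described above can be sketched in a few lines. This is a simplified illustration: plain dictionaries stand in for the memtable and SSTables, and the function names `get` and `compact` are made up for this example. Real LSM stores also have to handle deletions and usually organize SSTables into levels, which this sketch ignores.

```python
def get(key, memtable, sstables):
    """Look up a key: memtable first, then SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):   # sstables is ordered oldest -> newest
        if key in table:
            return table[key]
    return None


def compact(sstables):
    """Merge all SSTables into one, keeping only the newest value of each key."""
    merged = {}
    for table in sstables:             # later (newer) tables overwrite older entries
        merged.update(table)
    return [dict(sorted(merged.items()))]


memtable = {"c": 30}
sstables = [{"a": 1, "b": 2}, {"a": 10}]   # the second table is newer

print(get("c", memtable, sstables))  # 30
print(get("a", memtable, sstables))  # 10
print(get("b", memtable, sstables))  # 2
print(compact(sstables))             # [{'a': 10, 'b': 2}]
```

A read for a key that only exists in an old SSTable has to walk through every newer table first, which is one source of read amplification.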

Can conventional key-value stores fully utilize NVMe SSDs?

So, as you can imagine, a key-value store like RocksDB does a non-negligible amount of computation in the background, which requires CPU and memory resources. Other aspects of conventional key-value stores, such as having multiple layers, resource contention, maintaining the WAL, and read/write amplification, further reduce the ability of the CPU to issue more I/O requests to the device. Usually, with slow storage devices like disks, the CPU doesn't become the bottleneck. But here is the question: is that still true with much faster storage such as NVMe SSDs?

The researchers at Samsung have presented some calculations and experiments to answer this question, which we review here. You can refer to the full paper to read more [1].

Consider a modern NVMe SSD that can process 4 KB requests at a rate of 600 MB/s. To saturate such an SSD, the system needs to generate 150,000 requests of size 4 KB per second (600 MB/s / 4 KB). That means the CPU must process each request in less than 7 μs (1 s / 150,000 ≈ 6.7 μs). If our CPU cannot do that and instead takes 30 μs per request, we would need at least 4 such CPUs (30 μs / 6.7 μs ≈ 4.5) to saturate our NVMe SSD.
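The arithmetic above is easy to reproduce. The short script below uses only the numbers from the text (600 MB/s, 4 KB requests, 30 μs per request) and assumes decimal units, since that is what makes 600 MB / 4 KB come out to exactly 150,000. Note that the exact ratio is 4.5, so the figure of 4 is best read as a lower bound; covering the full request rate with whole CPUs takes five.

```python
import math

bandwidth_bytes_per_s = 600 * 1000 * 1000   # 600 MB/s, decimal units
request_bytes = 4 * 1000                    # 4 KB, decimal units to match the text

requests_per_s = bandwidth_bytes_per_s / request_bytes
budget_us = 1_000_000 / requests_per_s      # CPU time available per request

cpu_cost_us = 30                            # example per-request CPU cost from the text
# CPU-seconds of work generated per second of wall-clock time
cpus_ratio = cpu_cost_us * requests_per_s / 1_000_000
cpus_needed = math.ceil(cpus_ratio)

print(requests_per_s)        # 150000.0
print(round(budget_us, 2))   # 6.67
print(cpus_ratio)            # 4.5
print(cpus_needed)           # 5
```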

Now, let's look at some experiments. The authors tried to saturate an NVMe SSD using a system with a 48-core Xeon E5 running at 2.10 GHz. Figure 2 shows the results for various block sizes, numbers of threads, and communication modes (sync/async). They also compare the results with and without a file system. The left diagram shows the CPU utilization. Note that this is the total utilization of all 48 cores, so each 2.08% corresponds to one full CPU core. Generally, with asynchronous I/O and larger blocks, we can saturate the device more easily. Pay attention to the two points marked by arrows. They show the least amount of CPU that we need to saturate the SSD with and without a file system. From these two points we can see that, in this experiment:
  1. At least one CPU core must be dedicated to saturating the SSD.
  2. We need more CPU resources to saturate the SSD when we have a file system (i.e. more layers).
Figure 2. Sequential I/O benchmark with and without file system [1]
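When reading the CPU utilization plot, it helps to convert the machine-wide percentage into a number of busy cores. The helper below is a small sketch of that conversion for the 48-core machine; the function name `cores_used` and the 6.25% reading are hypothetical and not taken from the paper.

```python
TOTAL_CORES = 48   # core count of the test machine in the text


def cores_used(total_utilization_percent, total_cores=TOTAL_CORES):
    """Convert machine-wide CPU utilization (0-100%) into an equivalent number of busy cores."""
    return total_utilization_percent / 100 * total_cores


print(round(100 / TOTAL_CORES, 2))   # 2.08, the share of the machine that one core represents
print(round(cores_used(2.08), 2))    # 1.0
print(round(cores_used(6.25), 2))    # 3.0 (6.25% is a made-up reading)
```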

Simplifying the Storage Stack with KV-SSD 

KV-SSD lets us get rid of the software components needed in conventional key-value stores by directly providing a variable-size key-value API to user applications. KV-SSD takes care of all the processing required to store and retrieve key-value pairs, leaving the CPU and memory of the host machine free. In a conventional key-value store, the storage stack must translate key-value pairs into storage blocks before talking to the storage device. With KV-SSD, we can talk to the device in key-value pairs.
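The difference between the two interfaces can be illustrated with a small mock. The classes `BlockDevice` and `KVDevice` below, and their method names, are hypothetical in-memory stand-ins invented for this sketch; they are not Samsung's actual API. The point is only to show what the host has to do in each case.

```python
BLOCK_SIZE = 4096   # bytes; a typical block size, chosen for illustration


class BlockDevice:
    """Conventional SSD stand-in: the host can only read or write whole blocks."""

    def __init__(self):
        self.blocks = {}

    def write_block(self, lba, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("a block device only accepts whole blocks")
        self.blocks[lba] = data

    def read_block(self, lba):
        return self.blocks[lba]


class KVDevice:
    """KV-SSD stand-in: the host hands over variable-size pairs as they are."""

    def __init__(self):
        self.pairs = {}

    def store(self, key, value):
        self.pairs[key] = value

    def retrieve(self, key):
        return self.pairs.get(key)


key, value = b"user:42", b"alice"

# Conventional path: host software packs the pair into a block and remembers where it went.
block_dev = BlockDevice()
index = {}                                     # host-side map: key -> (lba, offset, length)
record = key + value
block_dev.write_block(0, record.ljust(BLOCK_SIZE, b"\x00"))
index[key] = (0, len(key), len(value))
lba, off, length = index[key]
from_block = block_dev.read_block(lba)[off:off + length]
print(from_block)    # b'alice'

# KV-SSD path: no packing and no host-side index.
kv_dev = KVDevice()
kv_dev.store(key, value)
from_kv = kv_dev.retrieve(key)
print(from_kv)       # b'alice'
```

In the conventional path, the packing logic and the `index` live in host memory and consume host CPU; in the key-value path, that work moves into the device.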

Figure 3. The software stack on top of KV-SSD

This results in a significantly smaller memory footprint. Regardless of the size of our keyspace, with KV-SSD the memory requirement of the host machine is O(1). We can estimate it by multiplying the I/O queue depth by the size of a key-value pair. Another important benefit is that when we remove all software components between the application and the SSD, we also remove possible sources of synchronization. Suppose we have a key-value store that uses two SSDs in a node. In a conventional key-value store, there are data structures on top of the SSDs that are shared between the two. These shared data structures increase contention and are an obstacle to the scalability that we wish to achieve by adding more SSDs. KV-SSDs, on the other hand, share nothing, so we can scale easily by adding more KV-SSDs.
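Both points can be sketched briefly. The snippet below reads "the size of the I/Q" as the I/O queue depth, which is an assumption, and the queue depth of 64 and pair size of 4096 bytes are made-up example values. The `pick_device` function is a hypothetical way to spread keys over devices using nothing but the key itself, so no state is shared between devices.

```python
import zlib


def host_memory_bytes(queue_depth, pair_bytes):
    """Rough host-side buffer estimate: one in-flight pair per I/O queue slot."""
    return queue_depth * pair_bytes


def pick_device(key, num_devices):
    """Route a key to a device using only the key itself, so devices share no state."""
    return zlib.crc32(key) % num_devices


# The estimate does not depend on how many keys are stored, hence O(1).
print(host_memory_bytes(queue_depth=64, pair_bytes=4096))   # 262144

devices = [dict() for _ in range(4)]   # four independent stand-in devices
for i in range(1000):
    key = f"key-{i}".encode()
    devices[pick_device(key, len(devices))][key] = i

print(sum(len(d) for d in devices))    # 1000
```

Because routing needs no shared index or lock, adding a fifth device only changes `num_devices`; a real deployment would also need a plan for moving existing keys, which this sketch leaves out.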

What do we get with KV-SSD?

Various experimental results are provided in [1], and we won't cover all of them in this post. Figure 4 is an interesting result that shows how we can scale in our storage node by adding more NVMe devices. It shows the CPU utilization and the throughput of a single node with various numbers of NVMe devices for three conventional key-value stores, namely RocksDB, RocksDB-SPDK, and Aerospike, along with KV-SSD. As the number of devices increases, the authors run more key-value store instances to saturate the devices. As expected, the host CPU usage is significantly lower with KV-SSDs than with conventional key-value stores, since KV-SSDs use their internal resources and require less computation on the host. As the number of devices increases, the CPU becomes saturated for the conventional key-value stores, which limits the increase in throughput. With KV-SSDs, on the other hand, we can scale linearly by adding more devices.

Figure 4. KV-SSD allows us to scale linearly by adding more devices. [1]

Figure 5 shows the write amplification while running the benchmarks. With KV-SSD, the host-side write amplification is 1, i.e. when we want to write a key-value pair, the host system needs only one write. This is expected, as everything is managed inside the KV-SSD. In conventional key-value stores, on the other hand, the write amplification is higher due to compaction and writing to the WAL. Note that KV-SSD does not write to a WAL but, thanks to its battery-backed DRAM, achieves the same durability as RocksDB.

Figure 5. The host-side write amplification using KV-SSD is the optimal value of 1. [1]
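Host-side write amplification is simply the number of bytes the host sends to the device divided by the number of bytes the application asked to write. The sketch below shows the calculation with invented numbers: the 1 MB of user data and the compaction volume of three times that are assumptions for illustration, not measurements from [1].

```python
def write_amplification(bytes_written_to_device, bytes_written_by_app):
    """Host-side write amplification: device-bound bytes per byte of user data."""
    return bytes_written_to_device / bytes_written_by_app


user_bytes = 1_000_000   # hypothetical amount of key-value data the application writes

# Conventional LSM store: the same data goes to the WAL, then to an SSTable on flush,
# and is rewritten again by compaction. The compaction volume here is made up.
wal_bytes = user_bytes
flush_bytes = user_bytes
compaction_bytes = 3 * user_bytes
lsm = write_amplification(wal_bytes + flush_bytes + compaction_bytes, user_bytes)

# KV-SSD: the host sends each pair to the device exactly once.
kv = write_amplification(user_bytes, user_bytes)

print(lsm)   # 5.0
print(kv)    # 1.0
```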

In summary, the benefits of KV-SSD come from the following facts:
  1. Key-value management is done by the storage device itself and consumes the resources (CPU and DRAM) of the storage device. Thus, we can scale in by adding more devices.
  2. Using KV-SSD, we have fewer layers, so we have less overhead.
  3. KV-SSDs in a node share nothing, so there is no contention between them. Thus, we can scale easily.
  4. KV-SSD removes the need for a WAL while achieving the same durability guarantees as key-value stores with a WAL, thanks to its battery-backed DRAM.
We didn't cover any details of KV-SSD internals. I will try to cover them in another post.

[1] Kang, Yangwook, Rekha Pitchumani, Pratik Mishra, Yang-suk Kee, Francisco Londono, Sangyoon Oh, Jongyeol Lee, and Daniel DG Lee. "Towards building a high-performance, scale-in key-value storage system." In Proceedings of the 12th ACM International Conference on Systems and Storage, pp. 144-154. 2019.

