DynamoDB, Ten Years Later

Ten years after the public availability of DynamoDB in 2012, and fifteen years after the publication of the original Dynamo paper in 2007, the DynamoDB team published a paper at USENIX ATC 2022 on how DynamoDB has evolved over its ten years of operation. The paper covers several aspects of DynamoDB such as adaptation to customer access patterns, smarter capacity allocation, durability, correctness, and availability. In this post, I share some of the points that I found interesting in the paper.

I also compare the replication of DynamoDB [1] with the leaderless approach of the original Dynamo paper. Note that I don't necessarily follow the order of the sections of the paper.

Design Principles

These are the six design principles that DynamoDB has tried to be faithful to, throughout its evolution.
  1. DynamoDB is a fully managed cloud service.
  2. DynamoDB employs a multi-tenant architecture. The key challenge here is to give a single-tenant experience to customers while enjoying hardware utilization and cost reduction of multi-tenancy.
  3. DynamoDB achieves boundless scale for tables.
  4. DynamoDB provides predictable performance. Predictable performance is of utmost importance to DynamoDB, and it is an important distinction from earlier solutions such as SimpleDB, which was the first AWS database-as-a-service. Adding new features while keeping performance predictable is challenging. In a previous post, we saw how DynamoDB added support for ACID transactions while holding on to this design principle.
  5. DynamoDB is highly available. Four 9s for single-region, and five 9s for multi-region tables. Availability is calculated for each 5-minute interval as the percentage of requests processed by DynamoDB that succeed. Availability is measured both from the service and client perspective (via AWS internal clients or canary applications).
  6. DynamoDB supports flexible use cases. Multiple data, consistency, and isolation models. 

Multi-Paxos instead of Leaderless Replication

Although [1] focuses on the evolution of DynamoDB since its public availability in 2012, I thought, regarding replication, it might be interesting to compare DynamoDB with the original Dynamo [2] introduced in 2007. Unlike the original Dynamo [2] which uses a leaderless replication approach, DynamoDB [1] uses a leader-based approach.

Replication in the Original Dynamo

Nodes form a ring. Data items are distributed among these nodes according to the hash value of their keys. Each node is assigned a range of the hash space. Given a key, we compute its hash and find its coordinator node based on the hash value and the hash ranges. Each key is hosted on its coordinator node and replicated on the next N-1 nodes on the ring, so it is stored on N nodes in total.
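
As a rough sketch of this placement scheme, here is a toy consistent-hashing ring in Python. The node names, the MD5-based hash, and the helper names are illustrative assumptions, not Dynamo's actual implementation.

```python
import hashlib
from bisect import bisect_right

# Toy consistent-hashing ring: each node owns the range ending at its position.
# Node names and the use of MD5 here are illustrative, not Dynamo's actual scheme.
N = 3  # replication factor

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = sorted((ring_hash(n), n) for n in nodes)

def preference_list(key: str):
    """Coordinator = first node clockwise from the key's hash; replicas = next N-1 nodes."""
    h = ring_hash(key)
    positions = [pos for pos, _ in ring]
    start = bisect_right(positions, h) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(N)]

print(preference_list("user#42"))  # e.g. ['node-c', 'node-d', 'node-e']
```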


Dynamo uses a leaderless quorum-based approach. To write a value, the client sends its request to the coordinator node of the key. The coordinator applies the write locally and forwards the write request to the other N-1 replicas. It then waits for W-1 of them to acknowledge the write before returning to the client. Similarly, to read, the coordinator sends the request to the other N-1 replicas and waits for R-1 responses. It then returns the newest version among the R-1 responses heard from other replicas and its own version. With leaderless replication, we may have conflicts. To determine the newest version or detect conflicts, Dynamo uses vector clocks. I have talked about conflict resolution using vector clocks previously in this blog. In case of a conflict, Dynamo returns the conflicting versions, letting the application decide which version to use.
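
Below is a minimal in-process sketch of this write/read path and of vector-clock comparison. Everything here (the Replica class, treating the first replica as the coordinator, the dominance check) is a simplified illustration under my own assumptions, not Dynamo's code.

```python
# Minimal in-process sketch of leaderless quorum replication with vector clocks.
N, W, R = 3, 2, 2

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}                      # key -> (vector_clock, value)

    def put(self, key, clock, value):
        self.store[key] = (clock, value)
        return True                          # ack

    def get(self, key):
        return self.store.get(key)

def dominates(a, b):
    """True if vector clock a is causally newer than (or equal to) b."""
    return all(a.get(n, 0) >= b.get(n, 0) for n in set(a) | set(b))

replicas = [Replica(f"r{i}") for i in range(N)]

def coordinator_write(key, value, context=None):
    coordinator = replicas[0]                # pretend replicas[0] is the key's coordinator
    clock = dict(context or {})
    clock[coordinator.name] = clock.get(coordinator.name, 0) + 1
    acks = sum(r.put(key, clock, value) for r in replicas)
    assert acks >= W, "not enough replicas acknowledged the write"
    return clock                             # handed back to the client as write context

def coordinator_read(key):
    versions = [v for r in replicas[:R] if (v := r.get(key)) is not None]
    unique = []
    for v in versions:
        if v not in unique:
            unique.append(v)
    # Keep versions not dominated by another distinct version; >1 survivor = conflict.
    return [v for v in unique
            if not any(dominates(o[0], v[0]) and o[0] != v[0] for o in unique)]

ctx = coordinator_write("k", "v1")
coordinator_write("k", "v2", context=ctx)
print(coordinator_read("k"))                 # one (clock, value) pair, or conflicting siblings
```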


R and W are configurable parameters. To make sure our read operations don't return stale data, we can set R and W such that R + W > N. We usually pick an odd number for N and use R = W = (N+1)/2. Note that although setting R and W such that R + W > N hopefully results in reading the most recent version, that is NOT guaranteed, i.e., there are corner cases where we read stale data. That can happen due to sloppy quorums (used to tolerate replica failures), conflict resolution, or writes that fail on some replicas while succeeding on others.

Even when none of these cases happens, we may have situations where linearizability (the strong consistency in the CAP theorem [3]) is not guaranteed, because reading the most recent version is NOT enough for linearizability. Linearizability requires the datastore to respect the real-time order of operations, i.e., if operation B starts after operation A has returned to the client, we should never see the state of the datastore as if B executed before A. With this definition, linearizability may still be violated even when we guarantee readers always find the most recent version written by a write request that has returned to the client. An example (shown in Figure 1) of this case is provided in [4]. In this example, reader B reads the value of x after reader A's read has returned to the client. However, as shown in the figure, even with R + W > N (e.g., N=3, R=W=2), reader B may read the old value of x while A reads the new value, i.e., the system behaves as if A executed after B, and that is not acceptable under linearizability.

Figure 1. An example of how linearizability may be violated even when read and write quorums intersect [4]. We have N=3, R=W=2. Reader B starts after reader A's request has returned, but the system behaves as if reader B's read took effect before A's; reader A reads the new value 1, while reader B reads the old value 0.

Replication in DynamoDB

To prevent the problems of leaderless replication and provide strong consistency, DynamoDB uses a leader-based replication approach. The replicas of each partition form a replication group. DynamoDB uses Multi-Paxos for leader election and consensus (learn more from this blog: Paxos, Raft). To write a value, the client sends its request to the leader replica. The leader replica returns to the client after receiving acknowledgments from a quorum of replicas. To read, the client may access the leader or any of the followers, depending on the level of consistency it wants. For strong consistency, the client must read from the leader, as reading from followers may return stale data. With this replication, linearizability is guaranteed, and we don't need to deal with conflicts.
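
A toy sketch of how reads with different consistency levels are routed in such a leader-based group is shown below. The classes and the quorum handling are hypothetical simplifications; the real system runs Multi-Paxos, which is not reproduced here.

```python
import random

# Illustrative sketch of routing reads by consistency level in a leader-based group.
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

class ReplicationGroup:
    def __init__(self, n=3):
        self.leader = Replica()
        self.followers = [Replica() for _ in range(n - 1)]
        self.quorum = n // 2 + 1

    def write(self, key, value):
        # The leader sequences the write; we ack once a quorum (leader included) applied it.
        self.leader.apply(key, value)
        acks = 1
        for f in self.followers:
            f.apply(key, value)
            acks += 1
            if acks >= self.quorum:
                break                                    # remaining followers catch up later
        return "committed"

    def read(self, key, consistency="eventual"):
        if consistency == "strong":
            return self.leader.get(key)                  # linearizable: served by the leader
        return random.choice(self.followers).get(key)    # eventually consistent: may be stale

group = ReplicationGroup()
group.write("x", 1)
print(group.read("x", consistency="strong"))    # 1
print(group.read("x"))                          # 1, or None if we hit a lagging follower
```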

It seems we deal with quorums in Multi-Paxos as well, so how is it different from the leaderless quorum approach?

Yea, the idea of waiting for a quorum before returning to the client is shared in both approaches. However, there is a fundamental difference, and that is the leaderless vs. leader-based nature of these methods. Scenarios like the one shown in Figure 1 never happen when using a leader; both readers A and B have to refer to the leader to read x, and the leader linearizes the operations. Leslie Lamport's original Paxos uses majorities for the two phases of Paxos (see this). However, majorities are not necessary; similar to the R + W > N idea, intersecting quorums are enough for Paxos to work [5].

Better Failure Detection

To avoid split brain during leader changes, DynamoDB uses leases. Specifically, a newly elected leader does not accept writes until the lease of the previous leader has expired, which can take a few seconds. That means false leader failure detections are costly and must be avoided. Sometimes the leader is healthy and can communicate perfectly with a quorum of replicas, yet a single replica cannot reach the leader. This situation is called a gray network failure in [1]. To prevent the affected replica from disrupting the system by starting a new leader election, DynamoDB uses the following approach: the affected replica first asks the other replicas whether they can still communicate with the leader. Only if they also consider the leader dead does it start a leader election. With this simple change, DynamoDB significantly reduced false leader failure detections.
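
A small sketch of this confirmation step, under my own naming assumptions (the paper describes the idea, not an interface):

```python
# Sketch of the gray-failure check: before triggering an election, the replica that
# lost contact with the leader asks its peers whether they still hear from the leader.
# PeerStub and the function names are illustrative assumptions.
class PeerStub:
    def __init__(self, sees_leader: bool):
        self.sees_leader = sees_leader

    def reports_healthy_leader(self) -> bool:
        return self.sees_leader

def should_start_election(peers, leader_timeout_expired: bool) -> bool:
    if not leader_timeout_expired:
        return False
    # Ask the other replicas in the group whether the leader looks healthy to them.
    healthy_reports = sum(1 for p in peers if p.reports_healthy_leader())
    # Only start an election if nobody else can reach the leader either; a gray
    # network failure affecting just this replica should not trigger a failover.
    return healthy_reports == 0

print(should_start_election([PeerStub(True), PeerStub(True)], True))    # False: no election
print(should_start_election([PeerStub(False), PeerStub(False)], True))  # True: leader seems dead
```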


In cases where the existing leader is intentionally going to become unavailable, e.g., for a deployment, it steps down as the leader by letting the others know before the deployment starts. This way, the new leader does not need to wait out the lease period, so the interruption is shorter.

Caching Metadata

To serve a request, a request router must find the storage node that hosts the requested key. Initially, DynamoDB stored this mapping information in DynamoDB itself. Whenever a request router did not know the mapping for a requested key, it downloaded the mapping for the entire table that the key belongs to. This simple caching approach resulted in a 99.75% cache hit ratio. However, when the cache of a request router is empty, it needs to call DynamoDB to get the mapping information for almost every request, causing traffic spikes to the metadata store (DynamoDB itself) that made the system unstable.

To solve this issue, DynamoDB developed MemDS, an in-memory distributed datastore. The storage nodes push mapping information to MemDS. The request routers still have their local caches. However, they use the local caches with two changes compared with the original design. Firstly, the request routers only cache mappings for the requested partitions instead of the entire table. This reduces traffic, especially for tables with many partitions. Secondly, request routers call MemDS on both cache hits and cache misses. In the case of a cache hit, the call to MemDS is made asynchronously.

But what is the benefit of having a local cache if we are going to make calls to MemDS anyway, irrespective of a cache hit or miss?

Note that in the case of a cache hit, the call to MemDS is made asynchronously, so the router can respond to the client without waiting for MemDS. The benefit of the local cache is therefore to respond quickly on cache hits.


But why do we need this call to MemDS for a cache hit anyway?

The calls to MemDS, irrespective of a cache hit or miss, result in constant traffic to MemDS. This constant traffic prevents bimodal behavior and spikes in the system. With this design, it does not matter whether the caches are cold or warm; MemDS always receives steady traffic from the request routers.
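
Here is a rough sketch of this constant-work lookup pattern. The class names (RequestRouter, FakeMemDS) and the asyncio-based structure are my own illustrative assumptions.

```python
import asyncio

# Sketch of the constant-work lookup pattern: the router answers from its local cache
# when it can, but contacts MemDS on every request, so MemDS traffic stays steady
# whether caches are warm or cold. RequestRouter and FakeMemDS are illustrative names.
class FakeMemDS:
    async def lookup(self, partition):
        await asyncio.sleep(0.01)                        # pretend network call
        return f"storage-node-for-{partition}"

class RequestRouter:
    def __init__(self, memds):
        self.memds = memds
        self.cache = {}                                  # partition -> storage node

    async def refresh(self, partition):
        node = await self.memds.lookup(partition)
        self.cache[partition] = node
        return node

    async def route(self, partition):
        if partition in self.cache:
            # Cache hit: refresh from MemDS in the background, answer immediately.
            asyncio.create_task(self.refresh(partition))
            return self.cache[partition]
        # Cache miss: we must wait for MemDS before we can route the request.
        return await self.refresh(partition)

async def main():
    router = RequestRouter(FakeMemDS())
    print(await router.route("table1#p0"))               # miss: waits for MemDS
    print(await router.route("table1#p0"))               # hit: served from local cache
    await asyncio.sleep(0.05)                            # let the background refresh finish

asyncio.run(main())
```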


The idea of constant work and its importance for avoiding cascading failures and instability is discussed by AWS engineers in several blog posts. In [6], Colm MacCárthaigh compares systems that do constant work to big coffee urns that don't scale up or down in response to traffic changes.


"This is why many of our most reliable systems use very simple, very dumb, very reliable constant work patterns. Just like coffee urns.[6]

In [7], Matt Brinkley and Jas Chhabra explain the addiction to cache and its problems:


"After a while, no one can remember life before the cache. Dependencies reduce their fleet sizes accordingly, and the database is scaled down. Just when everything appears to be going well, the service could be poised for disaster."


The following are my takeaways from this discussion.
  • Use cache to speed things up, but don't use it to hide work. Don't get addicted to it; always be prepared for situations where your cache is not effective for some reason.
  • Don't try to be smart by developing sophisticated services that quickly scale up and down in response to traffic. Be dumb! Make your services do constant work.
That could be seen as another application of the midwit meme: in many cases, the genius approach is very close to the dumb approach, while the midwit approach is unnecessarily complex or inefficient.


Although adopting simpler approaches is, in many cases, a sign of maturity, it does NOT mean that we must avoid sophisticated approaches in all situations. As Albert Einstein said, "Everything should be made as simple as possible, but not simpler." Sometimes we have to take smarter approaches to solve more complicated problems. One example is how DynamoDB has evolved regarding capacity allocation, as we will see in the next section.

Capacity Allocation

Like many systems, DynamoDB started with a very simple, static, and local admission control mechanism and gradually adopted more sophisticated methods, all the way to a global admission control system and on-demand provisioning. Initially, capacity allocation was done explicitly by clients in terms of Read Capacity Units (RCUs) and Write Capacity Units (WCUs) per table, with the following guarantees to customers:
  • 1 RCU: 1 strongly consistent read per second for items up to 4 KB in size.
  • 1 WCU: 1 write per second for items up to 1 KB in size. 

Note that both writes and strongly consistent reads must be served by the leader replica, as explained in the previous section. The RCU and WCU values set for a table are called its provisioned throughput.
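
As a quick back-of-the-envelope helper based only on the two unit definitions above (a sketch; it ignores eventually consistent and transactional reads, which are priced differently):

```python
import math

# Back-of-the-envelope helper based on the unit definitions above.
# Only strongly consistent reads are considered, matching the text.
def required_rcus(reads_per_second: int, item_size_kb: float) -> int:
    units_per_read = math.ceil(item_size_kb / 4)    # 1 RCU covers up to 4 KB
    return reads_per_second * units_per_read

def required_wcus(writes_per_second: int, item_size_kb: float) -> int:
    units_per_write = math.ceil(item_size_kb / 1)   # 1 WCU covers up to 1 KB
    return writes_per_second * units_per_write

print(required_rcus(reads_per_second=100, item_size_kb=6))   # 200 RCUs
print(required_wcus(writes_per_second=50, item_size_kb=2.5)) # 150 WCUs
```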


DynamoDB may partition a table for two reasons: 1) the table is too big to be stored on a single machine, and 2) the desired throughput of the table cannot be provided by a single machine. The goal of the first type of partitioning is data scalability, while the goal of the second type is computation scalability.


Since DynamoDB is a multi-tenant system, it is crucial to isolate clients. To do so, DynamoDB used per-partition capacity allocation. For example, if the provisioned throughput of a table with two partitions is 1000 RCUs, each partition gets 500 RCUs, and the storage nodes hosting these partitions never allow the client to consume more than 500 RCUs on each node.


With per-partition capacity allocation, whenever DynamoDB split a partition into two or more child partitions, the provisioned throughput of the parent was divided equally among the children. This decision was based on the assumption that clients access the items of a table uniformly, so when the data is split equally, the required throughput should also split equally. However, this assumption proved to be wrong; clients frequently have non-uniform access patterns. For example, at a certain time, a client may access one key range much more than the rest of the table.


But shouldn't the load be balanced when we use hash partitioning?
Hash partitioning with a good hash function should result in a more uniform load across partitions, but that is not guaranteed; even with hash partitioning, we may get unlucky and a set of hot keys may land on the same partition. In addition, note that DynamoDB uses composite keys in the form of <partition key + sort key>. Thus, a range of keys can share the same partition key, causing them to be hosted on the same partition.

Throughput Dilution and Unexpected Throttling

Per-partition capacity allocation may result in unexpected throttling. To understand how, consider this example: suppose we have a table with 8 items, but due to a non-uniform access pattern, the client mostly accesses only 4 of these 8 items. Suppose the provisioned throughput of this table is 10 WCUs, so currently the client can write to these 4 items 10 times per second. Now, we split the table into two partitions, each receiving an equal share of 5 WCUs. Suppose these 4 hot items end up on the same child partition. Now the client can write to them only 5 times per second, as opposed to 10 times per second before the split. The remaining 5 writes per second will be rejected, even though the client is not exceeding its provisioned throughput.
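
The same arithmetic as a tiny calculation, using the numbers from the example above:

```python
# The throughput-dilution example above, as a tiny calculation.
table_wcus = 10
partitions_after_split = 2
per_partition_wcus = table_wcus / partitions_after_split   # 5 WCUs each

hot_items_writes_per_sec = 10      # the client keeps writing its 4 hot items 10 times/s
# All hot items landed on the same child partition, so that partition's cap applies:
accepted = min(hot_items_writes_per_sec, per_partition_wcus)
throttled = hot_items_writes_per_sec - accepted
print(accepted, throttled)         # 5.0 accepted, 5.0 throttled, even though the table-level
                                   # provisioned throughput of 10 WCUs is never exceeded
```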

The paper seems to say this situation may happen only due to a size split. However, I think it may also happen after a split due to the throughput limitations of a storage node. For example, in the scenario explained above, suppose the client increases its provisioned throughput from 10 to 12 WCUs, but the node cannot support 12 WCUs, so we split the table into two partitions, each with 6 WCUs. Again, some of the client's writes will be rejected due to per-partition allocation. In this case, the throttling is in fact more annoying to the client, as the client perceives more throttling right after increasing its provisioned throughput, which is counterintuitive.

Smarter Capacity Allocation

The following are approaches DynamoDB used to improve capacity management. 


Bursting: Let partitions tap into the unused capacity available on their node. With bursting, DynamoDB lets a partition consume more capacity than its allocation, as long as enough unused capacity is available on the node. The problem with bursting is that it only helps with short-term bursts.


Adaptive Capacity: When a table is throttled even though its consumed throughput has not exceeded its provisioned throughput, adjust the capacity allocation of the hot partitions to let them spike for a longer time. With adaptive capacity, a partition can sustain longer bursts than with the unused-capacity bursting approach. When a client's consumption exceeds the table's provisioned throughput, DynamoDB reduces the increased allocation again, because once the client exceeds its provisioned throughput, it is acceptable to throttle its requests. A big disadvantage of this method is that it is reactive, i.e., unwanted throttling must first happen before the mechanism is triggered.


Why does adaptive capacity make longer bursts possible compared with the bursting approach?
With bursting using unused capacity, a hot partition can only burst if the other partitions hosted on its node are not using their capacity, and that cannot be guaranteed for long. With adaptive capacity, on the other hand, the capacity allocation of the partition is officially adjusted, so it is not at the mercy of the other partitions hosted on the node. Note that because of this officially larger allocation, we may need to move the partition to a node with more available capacity.

Global Admission Control (GAC): Instead of admission control that is tightly coupled to the storage nodes and decides based on partition-level data, DynamoDB replaced adaptive capacity with GAC, which makes admission decisions based on table-level consumption. This way, without changing the per-partition allocation, GAC allows a partition to burst as long as the table is not exceeding its provisioned capacity.
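
The paper describes GAC in terms of tokens that request routers track and replenish from GAC servers. The sketch below collapses that into a single in-process token bucket per table, just to illustrate table-level admission; the class and its parameters are my own assumptions.

```python
import time

# Very simplified sketch of table-level admission with a token bucket, in the spirit
# of GAC. In the real system, request routers keep local buckets and periodically ask
# GAC servers for more tokens; here everything is collapsed into one in-process bucket.
class TableTokenBucket:
    def __init__(self, provisioned_units_per_sec: float):
        self.rate = provisioned_units_per_sec
        self.tokens = provisioned_units_per_sec
        self.last_refill = time.monotonic()

    def try_consume(self, units: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at one second of provisioned rate.
        self.tokens = min(self.rate, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= units:
            self.tokens -= units
            return True
        return False            # admission denied: table-level capacity exceeded

bucket = TableTokenBucket(provisioned_units_per_sec=1000)
print(bucket.try_consume(600))   # True: any partition may burst against table capacity
print(bucket.try_consume(600))   # False: the table as a whole is now over its allocation
```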

But the per-partition allocation is still important, right? Suppose a table has not exceeded its allocation and needs more capacity for one of its partitions. Even though the table is not exceeding its allocation, the node hosting the partition may not be able to afford more capacity for that hot partition. Isn't looking only at the big picture and ignoring per-partition capacity a problem in this case?

DynamoDB still prevents a single partition from consuming all or a significant portion of the resources on a storage node, so per-partition usage is capped. However, this cap is not the same as the static per-partition allocations obtained by simply dividing the table capacity by the number of partitions. It is more of a defense mechanism than an admission control mechanism, because admission control is now done at the global level by GAC.


But the main point of adaptive capacity was the relocation of hot partitions after officially adjusting their capacity allocation. If we replace adaptive capacity with GAC, we may now get unlucky and hot partitions of multiple tables may end up on a single node while none of them exceeds its table-level allocation. So, while they should be able to burst, they may not be able to due to the resource limitations of the node.

Yea, exactly. That's why DynamoDB uses the next approach (balancing) in conjunction with GAC.


Balancing: To avoid packing hot partitions on a single storage node, DynamoDB proactively moves partitions and balances the load on storage nodes.

Splitting: Split a hot partition to make it more manageable. In this case, DynamoDB picks the split point based on the observed key usage distribution instead of simply cutting the key range in the middle. DynamoDB avoids splitting in two cases: 1) a single key is hot, or 2) the client is accessing keys sequentially within a key range. In these cases, splitting won't help.
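
A sketch of picking a split point from observed per-key traffic rather than from the middle of the key range. The input format (key -> requests per second) and this heuristic are illustrative assumptions, not DynamoDB's actual algorithm.

```python
# Sketch: choose a split point that balances observed traffic, not the key range.
def pick_split_key(key_traffic):
    keys = sorted(key_traffic)                   # items within a partition are sorted by key
    total = sum(key_traffic.values())
    if len(keys) < 2 or total == 0:
        return None                              # nothing useful to split
    running = 0.0
    for key in keys:
        running += key_traffic[key]
        if running >= total / 2:
            return key                           # keys up to and including this one go to
                                                 # the first child partition
    return None

traffic = {"a": 1, "b": 1, "c": 1, "d": 300, "e": 300, "f": 1}   # most traffic hits "d" and "e"
print(pick_split_key(traffic))   # 'd': splits traffic ~303/301, while a mid-range split
                                 # (after 'c') would give a 3/601 imbalance
```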


On-demand provisioning: Since customers usually either underprovision or overprovision, on-demand tables let DynamoDB adjust capacity automatically based on actual consumption.

Durability

The following are some of the methods that DynamoDB uses to guarantee that data is never lost or corrupted after being committed. 


Write Ahead Log (WAL): The WAL is replicated and periodically archived to S3.


Checksums: Every log entry, message, and log file is protected by checksums, and these checksums are verified when archiving to S3. Checksums are not used only for data in transit; data at rest is also continuously verified using these checksums.
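
A minimal illustration of sealing a log entry with a checksum and verifying it before use; zlib.crc32 is just an example, as the paper does not specify the algorithm DynamoDB uses.

```python
import zlib

# Minimal illustration of protecting a log entry with a checksum and verifying it
# before use. zlib.crc32 is only an example algorithm.
def seal(payload: bytes) -> bytes:
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload

def unseal(entry: bytes) -> bytes:
    stored, payload = int.from_bytes(entry[:4], "big"), entry[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: entry is corrupted")
    return payload

entry = seal(b'{"op":"put","key":"k","value":"v"}')
print(unseal(entry))                       # original payload
corrupted = entry[:-1] + b"X"              # flip the last byte
# unseal(corrupted)                        # would raise ValueError
```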


Backup and restore: The goal of backups is to protect customers from their own logical errors. DynamoDB periodically snapshots tables. Together with archived logs, these snapshots can be used to provide point-in-time restores.

Conclusion

In this post, I shared some of my takeaways from [1]. I compared the replication methods of DynamoDB and the original Dynamo paper. We saw how, using Multi-Paxos, DynamoDB can provide strongly consistent reads, which is not possible with the leaderless replication of the original Dynamo. DynamoDB developed MemDS to manage mapping metadata. Following the constant work pattern, request routers call MemDS on both cache hits and cache misses, keeping the downstream MemDS service always ready to deal with cases where the caches are ineffective. With smarter capacity allocation mechanisms, DynamoDB prevents hot partitions and throughput dilution. I briefly listed several approaches DynamoDB uses for durability, but didn't cover them in depth; you can learn more about the durability and correctness of DynamoDB in the paper.

I enjoyed reading this paper and I am sure you will find it interesting too. Many thanks to the DynamoDB team for sharing their knowledge and experience by publishing this paper.

References

[1] Mostafa Elhemali, Niall Gallagher, Bin Tang, Nick Gordon, Hao Huang, Haibo Chen, Joseph Idziorek et al. "Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service." In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pp. 1037-1048. 2022.

[2] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. "Dynamo: Amazon's highly available key-value store." ACM SIGOPS operating systems review 41, no. 6 (2007): 205-220.

[3] Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51-59. DOI: https://doi.org/10.1145/564585.564601.

[4] Martin Kleppmann. Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. " O'Reilly Media, Inc.", 2017.

[5] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. "Flexible paxos: Quorum intersection revisited." arXiv preprint arXiv:1608.06696 (2016).

[6] Colm MacCárthaigh, "Reliability, constant work, and a good cup of coffee". https://aws.amazon.com/builders-library/reliability-and-constant-work/

[7] Matt Brinkley and Jas Chhabra, "Caching challenges and strategies". https://aws.amazon.com/builders-library/caching-challenges-and-strategies/

