Eventual Consistency and Conflict Resolution - Part 1

Eventual consistency is one of the defining characteristics of many modern data stores. By getting rid of strong consistency and embracing eventual consistency, service providers can keep their services available even when replicas cannot talk to each other. Eventual consistency also removes the performance bottlenecks of strong consistency and lets us scale the system much more easily. Due to its significant benefits, eventual consistency has been widely adopted in the industry. Every day, when you are using social networks, shopping online, or even when you are doing online banking, most likely you are using services that rely on eventual consistency.

This is the first part of a two-part post on eventual consistency and the problem of conflict resolution in eventually consistent data stores. In this part, we focus on the definition of eventual consistency, its sufficient conditions, and conflict resolution using physical and logical clocks. Although eventual consistency can be discussed for any data model, in this post we focus on the key-value data model with two basic operations: PUT(key, value) and GET(key).

Why eventual consistency?

Replication is a common technique to provide higher availability and performance. However, as soon as we create replicas, we have to deal with inconsistencies between them. An old idea for keeping replicas consistent is the leader-follower scheme: one of the replicas becomes the leader and all writes go to it. The other replicas later receive updates from the leader and update their state. With this approach, once the leader replica fails, the service is unavailable until we pick a new leader using a leader election algorithm such as Paxos or Raft. When we want strong consistency (a.k.a. linearizability), the failure of the leader is not the only case where our service becomes unavailable. Strong consistency requires replicas to talk to each other to serve requests. Now, if a network partition disconnects the replicas so they cannot talk to each other, our system becomes unavailable. This is known as the CAP theorem [1], which says that in the presence of a network partition preventing replicas from talking to each other, you can pick either availability or strong consistency, but not both.


In addition to availability issues, strong consistency causes performance issues as well. Each write is more expensive, especially with cross-datacenter replication, as it needs the acknowledgment of multiple replicas. Moreover, clients with large network delays to the leader experience high latency. For example, when we use geo-replication, we create replicas in different geographical locations. With the leader-follower approach, all clients, no matter where they are, must talk to the leader to write, and this leader may be located in a remote datacenter.


As you see, we have many problems with strong consistency. At some point, large internet companies with massive scale, dealing with customers from all around the world, started to say "You know what? Let's give up on strong consistency and let clients write to any replica freely. This way, we can keep our services available all the time, achieve higher throughput, and earn more money!". Fortunately, for many applications, clients are tolerant of inconsistencies, i.e., they don't care if things are inconsistent for a short time. For some use cases, yes, we may have problems. For example, due to eventual consistency, a shopping website may sell a single remaining item to two buyers. For these cases, however, we usually have "business solutions". For example, we can ask the supplier to restock the item. It may delay the delivery, or in the worst case, the restock may not be possible. In these cases, we can apologize and even compensate the customer for the inconvenience. Today, apologies and compensation are very common, even for things like hotel or flight bookings. The thing is, by using eventual consistency, many businesses earn far more money than they lose compensating customers for inconsistencies.

Figure 1. An example apology email. The delay can happen for many reasons. Selling the last remaining item in a nearby facility to more than one user due to eventual consistency can be one possible reason.

Definition

Although most developers have an intuition about eventual consistency, the exact definition may differ from person to person. A common test for eventual consistency is this question: "If we stop writing to your system, do your replicas converge to the same state or not?" If they do, your system is eventually consistent. In practice, of course, we don't expect writes to stop forever. However, a system with this property behaves according to our intuition about eventual consistency, even when writes don't stop.

As you see, the above description of eventual consistency is not very precise. Marc Shapiro et al. provided a more formal definition of eventual consistency in [2]. They defined eventual consistency as a combination of three properties:
  1. Eventual delivery: All updates must be eventually delivered to all correct replicas.  
  2. Convergence: Correct replicas that have received the same updates must be in equal states. 
  3. Termination: All read and write executions must terminate. 

[optional read] Eventual Consistency vs. Strong Eventual Consistency 


Shapiro et al. provided two definitions to formalize eventual consistency, namely "Eventual Consistency" and "Strong Eventual Consistency" in [2]. The difference is in the definition of convergence. 

  • Eventual Consistency: Correct replicas that have delivered the same updates eventually reach an equivalent state. 
    • ∀ i, j: □ ci = cj ⟹ ♢ si ≡ sj
  • Strong Eventual Consistency: Correct replicas that have delivered the same updates are immediately in equivalent states.
    • ∀ i, j: ci = cj ⟹ si ≡ sj

The c in the above definitions is the causal history of a replica, i.e., the set of all updates that the replica has received. Eventual consistency says: if the causal histories remain the same (□ ci = cj), the states will eventually be equivalent (♢ si ≡ sj). Thus, reaching the same state may take some time and may require additional steps, e.g., rollback or consensus. On the other hand, strong eventual consistency is immediate, i.e., as soon as all updates are applied at two replicas, they are in equivalent states.

            

However, note that usually when people talk about eventual consistency, especially in the industry, they actually mean strong eventual consistency, i.e., the delivery may take some time, but once all updates are delivered, replicas must automatically be in a consistent state. I have not seen any real system that uses rollback or consensus to achieve eventual consistency, as the definition in [2] permits. For this reason, I think it is better to just use Eventual Consistency with the second convergence definition.

Sufficient Conditions for Eventual Consistency

Theorem 2.1 in [2] provides sufficient conditions for eventual consistency. The implication of this theorem for a key-value store is as follows: to have an eventually consistent key-value store, it is sufficient to have eventual delivery (i.e., every update must be delivered to each replica) together with the following conditions:


  • Condition 1: We have a partial order ≤ over values for a key. (Note that any total order is also a partial order.)
  • Condition 2: The values we set for a key are monotonically non-decreasing according to ≤.
  • Condition 3: When we receive an update from a different replica, we merge it with the current value using a merge function that returns the Least Upper Bound (LUB) of the two values according to our partial order. The LUB is defined as follows:
    m = v1 ⊔ v2 is the LUB of {v1, v2} under ≤ iff
      • v1 ≤ m and v2 ≤ m, and
      • ∀ m': v1 ≤ m' and v2 ≤ m' ⇒ m ≤ m'

Let's see what these conditions mean. The first condition says you have to define an order over the values of a key. Your order must be a partial order, i.e., it must have these properties:

  • reflexive: for any value v1, v1 ≤ v1
  • transitive: if v1 ≤ v2 and v2 ≤ v3, then v1 ≤ v3
  • antisymmetric: if v1 ≤ v2 and v2 ≤ v1, then v1 = v2


The second condition simply means when we set a new value for a key, we never go back in the order. For example, suppose the current value for key k1 is v1, and we update it to v2. We should have v1 ≤ v2. 


The third condition says your key-value store must be equipped with a merge function that, given two values for a key, returns their LUB under the partial order. That means when the merge function returns m for two given values v1 and v2,

  • v1 and v2 are both at or before m in the order, and
  • m is the least such value, i.e., any m' that is at or after both v1 and v2 is also at or after m.

If it is not clear, don't get stuck here; continue reading. Hopefully, things become clearer after reading the next section. 
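Before moving on, here is a minimal sketch, in Java (the language this post uses for its clock examples), of how Conditions 1 and 3 can be read as a contract on the value type stored for each key. The names MergeableValue, leq, and merge are illustrative and not taken from [2]; Condition 2 does not appear in the interface because it constrains how the store assigns new values, not the values themselves.

interface MergeableValue<V extends MergeableValue<V>> {
    // Condition 1: leq encodes the partial order ≤ over values; it must be
    // reflexive, transitive, and antisymmetric.
    boolean leq(V other);

    // Condition 3: merge must return the least upper bound of this value
    // and other under the order defined by leq.
    V merge(V other);
}

// Condition 2 is a constraint on writes rather than on the value type:
// whenever the store replaces the current value v with a new value w,
// v.leq(w) must hold.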

Conflict Resolution using Physical Clocks 

Let's see if we can use timestamps to define the order and merge values. We use the value of the system clock (e.g., System.currentTimeMillis() in Java) to timestamp each new value written for a key. We refer to these timestamps as physical clock timestamps. Every time a client requests to update the value of a key, we read the system clock, assign its value as the timestamp of the new value, and write the value for the key. Now, we define our order according to these timestamps, i.e., a value with a higher timestamp comes after a value for the same key with a smaller timestamp. Also, let's define our merge function as a simple Last-Write-Wins (LWW) according to the timestamps, i.e., given two values v1 and v2, we define


Merge(v1,v2) = if (v1.timestamp < v2.timestamp) then v2 else v1. 


Note that if the timestamps of v1 and v2 are equal, we can break the tie according to the IDs of the nodes where the values were written.
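Below is a minimal sketch of this physical-clock LWW scheme. The class and method names (TimestampedValue, write, merge) are illustrative rather than taken from any particular store; a real key-value store would keep such a record per key and call merge whenever it receives a replicated update for that key.

final class TimestampedValue {
    final String value;
    final long timestamp;  // physical clock reading, e.g., System.currentTimeMillis()
    final int nodeId;      // used only to break ties between equal timestamps

    TimestampedValue(String value, long timestamp, int nodeId) {
        this.value = value;
        this.timestamp = timestamp;
        this.nodeId = nodeId;
    }

    // Write path on a replica: stamp the new value with the local system clock.
    static TimestampedValue write(String value, int nodeId) {
        return new TimestampedValue(value, System.currentTimeMillis(), nodeId);
    }

    // Merge(v1, v2): last-write-wins on the timestamp, ties broken by node ID.
    static TimestampedValue merge(TimestampedValue v1, TimestampedValue v2) {
        if (v1.timestamp != v2.timestamp) {
            return v1.timestamp < v2.timestamp ? v2 : v1;
        }
        return v1.nodeId < v2.nodeId ? v2 : v1;
    }
}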


Now, let's see if the simple system explained above satisfies the sufficient conditions for eventual consistency or not.


Condition 1 is clearly satisfied because timestamps define a partial order over values, i.e., the order is reflexive, transitive, and antisymmetric. Condition 3 is also satisfied because our merge function picks one of v1 and v2, and the picked value has the higher timestamp, so it is the LUB of the two given values. What about Condition 2? Do you think our protocol satisfies it?


Unfortunately, this protocol does not satisfy Condition 2. To understand why, consider the following example: Suppose we have two replicas A and B. We set value v1 with timestamp 10 on replica A. The update is replicated and applied at replica B. Now, a client requests to update the value to v2 on replica B. According to our protocol, we read the value of the system clock, assign it as the timestamp of the value, and write the value for the key. However, due to the imperfection of clock synchronization, the clock of B can be behind the clock of A. Thus, we may read a time less than 10 on B. Suppose we read the time as 9. As you can see, the updates to our key do not satisfy Condition 2, because we moved backward from 10 to 9. Now, when replica A receives the update from replica B, it applies the merge function, which returns v1 as the winner, because v1 has the higher timestamp. Thus, replica A considers v1 the final value, but replica B considers v2 the final value. Due to this issue, replicas A and B will never become consistent, i.e., our protocol violates eventual consistency.


Figure 2. Decreasing updates due to clock skew of physical clocks. 

How can we solve this problem? We have two options to satisfy Condition 2, thereby guaranteeing eventual consistency:

  • Reject the second update and have the client try again, or similarly block the request until the physical clock catches up.
  • Accept the update, but merge the client's value with the current value. Since v1 has the higher timestamp, we still keep v1 after the client's update, so although we return success to the client, we effectively ignore its request to update the value to v2.

The first option is bad, because the main point of choosing eventual consistency is to remain available as long as the replica is up. If there are frequent updates to a key on the replica with larger physical clock values, most requests to update the key on the replica with smaller physical clock values will be rejected or blocked. The second option is not great either, because it causes a lost update, i.e., we returned success to the client but ignored v2! Note that when two clients concurrently update the same key, it is acceptable to pick one of them as the winner, because the outcome is the same as if the winning update had overwritten the losing update. The problem arises when the losing update was written based on the winning update; in that case, the losing update should have been the final value. For example, in the scenario above, suppose the second client reads v1, increments it, and writes v2 = v1 + 1 back to the store. In this case, the increment by the client is ignored due to the clock skew. When we say "lost update" in this post, we mean this scenario, i.e., the scenario where, due to causality between updates, we cannot consider the updates concurrent.
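To make this lost-update scenario concrete, here is a small walk-through using the hypothetical TimestampedValue sketch from the previous section, with the clock readings from the example (10 on A, 9 on B) hard-coded:

public class LostUpdateDemo {
    public static void main(String[] args) {
        // Replica A writes v1; its physical clock reads 10.
        TimestampedValue v1 = new TimestampedValue("v1", 10, 0);

        // A client on replica B reads v1, derives v2 from it (e.g., v2 = v1 + 1),
        // and writes it back; B's clock is behind and reads 9.
        TimestampedValue v2 = new TimestampedValue("v2", 9, 1);

        // When A receives B's update, LWW keeps the value with the higher
        // timestamp, so the causally later update v2 is silently discarded.
        TimestampedValue winner = TimestampedValue.merge(v1, v2);
        System.out.println(winner.value); // prints "v1": the increment is lost
    }
}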

Conflict Resolution using Logical Clocks

As we saw above, the issue with physical clocks is that they violate the monotonically non-decreasing condition (Condition 2) due to clock skew between nodes. We can avoid that problem by using Lamport's logical clocks [3]. Lamport's logical clocks (simply logical clocks hereafter) are simple integer counters. Specifically, each node maintains an integer counter and updates it as follows:

  • The counter starts at 0.
  • Before each local update, the node increments the counter.
  • Whenever the node receives a remote update carrying logical timestamp lc, it sets the counter to max(lc, current counter value).


Now, consider this protocol: We timestamp each update with lc, where lc is the value of the logical clock, and, similar to physical timestamps, define our order according to these logical timestamps. Also, we define our merge function to pick whichever of the given values is later according to the order defined above, i.e., Merge(v1, v2) = if v1 < v2 then v2 else v1 (again breaking ties with replica IDs). Let's see if this protocol satisfies the sufficient conditions for eventual consistency.
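As a rough sketch of this protocol (again with illustrative names, VersionedValue and Replica, that are not from any specific system), a single-key replica could look like this:

final class VersionedValue {
    final String value;
    final long lc;        // logical (Lamport) timestamp of this version
    final int replicaId;  // used only to break ties between equal timestamps

    VersionedValue(String value, long lc, int replicaId) {
        this.value = value;
        this.lc = lc;
        this.replicaId = replicaId;
    }
}

final class Replica {
    private final int replicaId;
    private long clock = 0;          // Lamport clock, starts at 0
    private VersionedValue current;  // current version stored for the key

    Replica(int replicaId) { this.replicaId = replicaId; }

    // Local write: increment the clock, then stamp the new value with it.
    void put(String value) {
        clock = clock + 1;
        current = new VersionedValue(value, clock, replicaId);
    }

    // Remote update: advance the clock to max(lc, clock), then merge by
    // keeping whichever version is later in the order (ties broken by replica ID).
    void applyRemote(VersionedValue remote) {
        clock = Math.max(clock, remote.lc);
        if (current == null
                || current.lc < remote.lc
                || (current.lc == remote.lc && current.replicaId < remote.replicaId)) {
            current = remote;
        }
    }

    VersionedValue get() { return current; }
}

Because the clock is never behind the lc of the currently stored version, each local put stamps its value with a strictly larger lc, which is exactly what Condition 2 requires.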


Condition 1 is satisfied, because our order is a proper partial order. Condition 2 is satisfied according to the logical clock algorithm and the definition of our order, i.e., the lc of the new update is guaranteed to be larger than the lc of the existing version; thus, our updates are monotonically non-decreasing. Condition 3 is also satisfied because we pick one of the given values, and the picked value has a timestamp greater than or equal to that of both given values; thus, it is the LUB of the two given values. Using this protocol, we can guarantee eventual consistency while avoiding the lost-update scenario explained above.


Figure 3. Conflict resolution using logical clocks. 

The Skew of Logical Clocks

Although using logical clocks we can provide eventual consistency and avoid lost updates when there is causality between updates, logical clocks have a big issue: they have no meaningful relationship with physical time. As a result, the system may behave in ways that are unexpected given the real-time order of events.


To understand how logical clocks may result in unexpected behavior, consider this example: Suppose we have two replicas A and B. These two replicas serve different groups of clients. For example, A and B are geo-replicas serving clients in different geographical locations. The clients using replica A are currently much more active than those using B; maybe it is currently daytime in A's region and nighttime in B's. So the frequency of updates at replica A is much higher than at B. Although logical clocks are updated when receiving remote updates, they may drift apart if communication between the replicas is interrupted for some reason, causing the value of the logical clock at A to become much larger than that of B. Now suppose a client writes value v1 for key k at replica A. Five seconds later, another client writes value v2 for key k at replica B. If we used physical clocks, v2 would be the winner, which is the expected outcome. But since the value of the logical clock at replica A is larger (due to the higher frequency of updates at replica A), v1 will be the winner. Thus, although we return success to the client writing to B, its update is lost, this time due to the skew of logical clocks.


Note that in the example above, the system is eventually consistent, so we have no issue from the consistency point of view. However, the behavior is unexpected according to the real-time order of events. The root cause of this issue is the skew between logical clocks that occurs due to update frequency mismatch between replicas. 

Conclusion

In this part, we talked about eventual consistency, its definition and sufficient conditions, and the problem of conflict resolution. We reviewed two ways of resolving conflicts, using physical and logical clocks, and saw that each method has disadvantages. Specifically, physical clocks may cause lost updates due to clock skew between replicas, and logical clocks may violate our real-time expectations due to their lack of relation to physical time.

In part 2, we will see alternative ways of conflict resolution that avoid the issues caused by physical and logical clocks.

References

[1] Seth Gilbert and Nancy Lynch. "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services." Acm Sigact News 33, no. 2 (2002): 51-59.

[2] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. "Conflict-free replicated data types." In Symposium on Self-Stabilizing Systems, pp. 386-400. Springer, Berlin, Heidelberg, 2011.

[3] Leslie Lamport. "Time, clocks, and the ordering of events in a distributed system." Commun. ACM 21, 7 (July 1978), 558–565.
