Spanner, 2PC+2PL over Paxos

Spanner provides strictly serializable distributed transactions on a global scale. Spanner was introduced in 2012, and today, it is available as a managed service called Cloud Spanner via Google Cloud Platform. CockroachDB and YugabyteDB are two open-source databases inspired by Spanner. In this post, we see how Spanner provides strictly serializable distributed transactions while being a replicated and partitioned database.

In my previous post, we saw why achieving serializability (and strict serializability, a.k.a. external consistency) for partitioned and replicated databases is a challenging problem. Spanner leverages 2PC, 2PL, and MVCC to provide strict serializability for both read-write and read-only transactions. To achieve that, Spanner uses something that is usually considered unreliable in distributed systems: time. With tightly synchronized clocks, Spanner uses time to reason about the order of events in the system and provide strictly serializable transactions.

2PC+2PL over Paxos Leaders

When replicas apply the exact same updates in the exact same order, they will end up in the exact same state. That is known as state machine replication. The agreement on which update to apply next is, in fact, the consensus problem. Spanner uses Paxos for state machine replication. We don't want to cover Paxos in this post. For now, it is enough to know that Paxos lets a group of replicas agree on the order of a sequence of updates. 

Ok, now consider a system where data is partitioned, and each partition is replicated. The replicas of each partition form a Paxos group. Spanner provides serializable transactions by simply running 2PC+2PL on the Paxos leaders. One of the leaders is the coordinator, and the rest of the leaders are the participants. All 2PC state, for both the coordinator and the participants, is recorded in their Paxos state machines. Thus, if any of them crashes in the middle of 2PC, the new leader has all the information needed to continue and complete the 2PC. That increases fault tolerance, especially for the coordinator, as we know the crash of the coordinator is the main weakness of 2PC.

Figure 1. The main idea of Spanner to achieve serializability over a replicated and partitioned system is to run 2PC and 2PL on the Paxos leaders of the partitions. In this figure, we have three partitions and three replicas per partition. 

So far so good. If we access only the leaders, for both reads and writes, serializability is guaranteed, and we can rest assured that we read the most recent versions, as the leaders have the most recent updates. In that case, the replicas are used only for fault tolerance, and the clients cannot go to them even for reads. By accessing only the leaders and running 2PL, we achieve a serialization order that respects the real-time order of the transactions, i.e. if transaction T1 commits before transaction T2 starts, then to the system it should appear that T2 executed after T1. Thus, clients should never see a state that includes the effect of T2 but not the effect of T1. This is called external consistency, also known as strict serializability (see this post).

If you only care about read-write transactions and you are OK with running all reads and writes only on the leaders, using the replicas only for fault tolerance, you can stop here; running 2PC+2PL over Paxos leaders is the only thing you need to know about Spanner.

Read-only Transactions and Snapshot Reads

We usually want to scale our read operations by allowing clients to go to any replica that hosts the object, not just the leaders. Also, no matter where we run our reads, we prefer that reads not block writes. Note that when reads are part of a read-write transaction, they have to block writes during the 2PL. However, when a client just wants to read a set of items without writing anything, we don't want those reads to block the writes. For this kind of pure read, Spanner provides another abstraction called read-only transactions. The reads of read-only transactions do not acquire any locks, so they never block write operations.

Spanner serves read-only transactions via snapshot reads. Using a snapshot, we can read the values of a set of items at a given time, as if the system were frozen at that time. Spanner maintains multiple versions of each item and assigns a timestamp to each written version. When it wants to read an item in a snapshot at time t, it reads the most recent version of that item with a timestamp not greater than t.

For example, suppose we have the following versions for item x:
<value=5, timestamp=6>
<value=3, timestamp=4> 
<value=10, timestamp=1> 

and we want to read x in a snapshot at time 5. Spanner returns 3 as this version is the version with the highest timestamp no greater than 5. 
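This version-selection rule can be written down in a minimal Python sketch (the list-of-pairs representation is purely illustrative; Spanner's storage is of course not a Python list):

```python
# Read an item's value at snapshot time t: pick the version with the highest
# timestamp that is not greater than t.
def snapshot_read(versions, t):
    """versions: list of (value, timestamp) pairs, in any order."""
    visible = [(ts, value) for (value, ts) in versions if ts <= t]
    if not visible:
        return None  # no version of the item existed at time t
    return max(visible)[1]  # value of the highest qualifying timestamp

x_versions = [(5, 6), (3, 4), (10, 1)]  # the versions of x from the example
print(snapshot_read(x_versions, 5))  # -> 3
```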

But how can we be sure that there does not exist a version with timestamp t' <= t that has not yet been received by the replica we are reading from? In the example above, there could be a version at time 5 that has not been received by the replica the client is accessing.
To make sure that a replica is up-to-date enough, the replica first calculates a variable called t_safe. As we will explain later, Spanner guarantees that a replica has received all versions with timestamps less than or equal to its calculated t_safe. Now, if t_safe is smaller than t, the read blocks until t_safe becomes greater than or equal to t. This way, Spanner makes sure the scenario above does not occur.
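For illustration, the blocking rule can be sketched as a simple wait loop (hypothetical; a real implementation would use notifications rather than polling):

```python
import time

# Hypothetical sketch: a snapshot read at time t must wait until t_safe >= t,
# because only then is the replica guaranteed to have every version with a
# timestamp <= t.
def wait_until_safe(get_t_safe, t, poll=0.001):
    while get_t_safe() < t:  # the replica is not provably up to date yet
        time.sleep(poll)
```

Here `get_t_safe` stands for whatever returns the replica's current t_safe; once it catches up to t, the snapshot read can proceed.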

To serve a read-only transaction, Spanner picks a timestamp (as we will explain below) and runs a snapshot read at that time. Thus, a read-only transaction is basically a snapshot read at a timestamp picked by Spanner.

Ok, we now understand how read-only transactions and snapshot reads generally work, given the timestamps. The following are the details that we will focus on next.
  • How should we assign a timestamp to a transaction, i.e. to the version written by a transaction? 
  • How should we pick a timestamp for a read-only transaction?
  • How can we calculate t_safe in a replica? 


Spanner assigns timestamps such that any version written by a transaction T1 has a timestamp smaller than the timestamp assigned to any version written by a transaction T2 that starts after T1 commits. For example, suppose transaction T1 writes value v1 for item x, and Spanner assigns timestamp 5 to this version. T1 commits. Then, transaction T2 starts and writes value v2 for the same item x. Spanner guarantees that v2 has a timestamp greater than 5. We call a versioning that respects the real-time order of the transactions a proper versioning [1].

Isn't proper versioning easy? If we just use the system clock for timestamps, won't the timestamp of v2 always be greater than the timestamp of v1?
In a perfect world where clocks are perfectly synchronized, yes, but with current technology, no. Due to the clock skew between machines, we may violate proper versioning for two transactions executed by two different nodes if we simply use system clocks.

Here is where Spanner uses TrueTime. TrueTime is a distributed clock system whose API returns time as an interval instead of a number. The real value of time is guaranteed to be within the returned interval. 

What do you mean by the "real" value of time? 
We mean the time with respect to a universal reference without any error.

Using these intervals, we may be able to reason about the order of events. For example, when the TrueTime timestamps of events e1 and e2 are [5,7] and [10,12], respectively, we can be sure e2 occurred after e1, because 10 > 7. However, when the two intervals overlap, e.g. [5,7] and [6,8], we can't conclude any order.
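The interval-based ordering argument fits in a few lines (a sketch; the tuples stand in for TrueTime's [earliest, latest] intervals):

```python
# Two events can be ordered only when their uncertainty intervals do not overlap.
def definitely_before(e1, e2):
    """e1, e2: (earliest, latest) intervals bracketing the real event times."""
    return e1[1] < e2[0]  # e1 certainly ended before e2 could have started

print(definitely_before((5, 7), (10, 12)))  # True: 7 < 10, so e1 before e2
print(definitely_before((5, 7), (6, 8)))    # False: overlapping, order unknown
```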

When Spanner wants to commit and assign a timestamp to a transaction, the coordinator asks TrueTime for the time. TrueTime returns an interval [earliest, latest]. The coordinator assigns a timestamp t not less than latest (we will explain how exactly Spanner does that below). By choosing a timestamp not less than latest, we guarantee that the timestamp of a transaction is greater than the real start time of the transaction. We call this invariant start.

The coordinator then waits before actually committing the transaction, repeatedly asking TrueTime for fresh [earliest, latest] intervals. The coordinator finally commits the transaction when it finds earliest > t. This guarantees that the timestamp of a transaction is smaller than the real commit time of the transaction. We call this invariant commit wait.

Figure 2. start and commit wait together guarantee that start time < timestamp < commit time.
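The two invariants can be sketched as a small loop (a hedged sketch: `truetime_now` is a stand-in for the TrueTime API, faked here with a fixed uncertainty window):

```python
import time

UNCERTAINTY = 0.007  # roughly 7 ms, as reported for TrueTime

def truetime_now():
    # Fake TrueTime: bracket the wall clock with a fixed uncertainty.
    now = time.time()
    return (now - UNCERTAINTY, now + UNCERTAINTY)  # [earliest, latest]

def commit_with_wait():
    t = truetime_now()[1]          # start: t >= latest, so t > real start time
    while truetime_now()[0] <= t:  # commit wait: loop until earliest > t
        time.sleep(0.001)
    return t                       # now t < real commit time
```

Note that the commit wait lasts roughly the width of the uncertainty window, which is why keeping that window small matters so much.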

Let's go back to external consistency. As we said above, we want this: when T2 starts after T1 commits, the timestamp of T2 must be greater than that of T1. Let's see how this is guaranteed by Spanner:
  • timestamp of T1 < real commit time of T1 (commit wait)
  • real commit time of T1 < real start time of T2 (assumption)
  • real start time of T2 < timestamp of T2 (start)
=> timestamp of T1 < timestamp of T2

Figure 3. start and commit wait guarantee that when T2 starts after T1 commit, its timestamp is greater than that of T1. 

Time synchronization protocols such as NTP also provide uncertainty intervals. Can't we do the same with them?
Yes, you can. What is special about TrueTime is that it relies on special hardware that keeps the uncertainty window very small, around 7 ms (or even less than 1 ms according to this talk), while it is 100-250 ms for NTP. So if we wanted to do the same with NTP, we might have to deliberately delay each transaction for up to 250 ms; not good.

The start and commit wait invariants are what Spanner has to respect, but we have not talked about how Spanner actually assigns timestamps so far. Now, let's see how Spanner assigns timestamps to the versions written in a read-write transaction and picks a timestamp for a read-only transaction.

Read-write Transactions 

For both reads and writes in a read-write transaction, we have to go to the leaders and get the locks as part of 2PL. Once the client has performed all of its reads and buffered all of its writes, it is ready to commit the transaction. The client picks one of the leaders as the coordinator and requests to commit. The coordinator then runs 2PC to commit the transaction. Each participant, which is the leader of its own replica group, picks a prepare timestamp that must be greater than any timestamp it has assigned to previous transactions (let's call this invariant prepare_timestamp), logs a prepare message in its Paxos state, and responds to the coordinator. Once the coordinator has received all the responses, it picks a timestamp that satisfies the following constraints:
  1. It is not less than any of the prepare timestamps picked by the participants (let's call this invariant greater_than_all_prepare_timestamps). 
  2. It is greater than any timestamp picked by itself.
  3. It is greater than the current TrueTime latest. (start)
Once the coordinator has picked a proper timestamp, it waits to satisfy the commit wait invariant explained above. Then, it writes a commit record to its Paxos state and sends the commit timestamp to the client and all participants. Once a participant receives the outcome of the transaction, it applies the outcome to its Paxos state at the given timestamp and releases the locks.
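The three constraints amount to taking a maximum; here is a hypothetical sketch (assuming integer timestamps so "greater than" can be expressed as "+ 1"; `pick_commit_timestamp` is not Spanner's actual API):

```python
# Sketch of the coordinator's commit-timestamp choice over the three constraints.
def pick_commit_timestamp(prepare_timestamps, last_assigned, truetime_latest):
    return max(
        max(prepare_timestamps),  # 1. not less than any participant's prepare timestamp
        last_assigned + 1,        # 2. greater than any timestamp it picked before
        truetime_latest + 1,      # 3. greater than the current TrueTime latest (start)
    )

print(pick_commit_timestamp([8, 11], last_assigned=9, truetime_latest=10))  # -> 11
```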

Read-only Transactions

For read-only transactions that span more than one partition, Spanner gets the current [earliest, latest] from TrueTime and picks latest as the timestamp. This guarantees that the real start time of the read-only transaction is smaller than the snapshot read timestamp we assign to it. Let's call this invariant ro_start; it is similar to start in read-write transactions.

Let's see why this selection of snapshot read timestamp will guarantee external consistency:
To prove external consistency for snapshot reads, we have to show that the effect of any transaction committed before a read-only transaction starts is reflected in it. Suppose transaction T1 is committed before the start of the read-only transaction. Now, we have:
  • timestamp of T1 < real commit time of T1 (commit wait)
  • real commit time of T1 < real start time of the read-only transaction (assumption)
  • real start time of the read-only transaction < snapshot read timestamp (ro_start)
=> timestamp of T1 < snapshot read timestamp 

Ok, since the read proceeds only when snapshot read timestamp <= t_safe (see above), we have: timestamp of T1 < t_safe, which guarantees T1 has been received by the replica. And since timestamp of T1 < snapshot read timestamp, the effect of T1 will be reflected in the snapshot, i.e. external consistency. (end of proof)

For read-only transactions that read items in a single partition, if there are no prepared-but-not-committed transactions, we can use the timestamp of the last committed Paxos write as the snapshot timestamp instead of latest. Since there is no prepared transaction, picking Paxos' highest timestamp guarantees that any transaction committed before the snapshot has a smaller timestamp.

But picking latest is always correct, whether the read-only transaction is single-partition or not, right? Then why not always pick latest? What is the benefit of picking the highest Paxos write timestamp instead of latest?
That is true; picking latest always results in external consistency. However, note that when the snapshot read timestamp is greater than t_safe, the snapshot will be blocked. Thus, to reduce the risk of being blocked, we want to choose the minimum snapshot read timestamp that satisfies external consistency. At any given time, the highest Paxos timestamp is guaranteed to be smaller than latest.

It is important to note that we have to get the last Paxos timestamp from a node with the most recent Paxos log. The Paxos leader is guaranteed to have the most recent Paxos log. Thus, to get the timestamp for a read-only transaction that spans only one partition, the client must go to the leader. The actual reading via a snapshot at that timestamp, however, can be executed on any of the replicas. Note that if the replica is behind the leader, the snapshot read might get blocked.
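The timestamp choice described in this section can be summarized in a small sketch (all names are illustrative, not Spanner's API; `has_prepared` covers the prepared-but-not-committed caveat):

```python
# Sketch of how the snapshot timestamp of a read-only transaction is picked.
def pick_ro_timestamp(spans_multiple_partitions, has_prepared,
                      last_paxos_commit_ts, truetime_latest):
    if spans_multiple_partitions or has_prepared:
        return truetime_latest  # always correct (ro_start)
    # Single partition, no prepared transactions: the leader's last committed
    # Paxos timestamp is smaller, so there is less risk of blocking on t_safe.
    return last_paxos_commit_ts

print(pick_ro_timestamp(False, False, last_paxos_commit_ts=6, truetime_latest=9))  # -> 6
```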

Calculating t_safe

As we said above, to execute a snapshot at a given time, the executing replica needs to make sure it has received all transactions committed with a timestamp smaller than the snapshot time. Otherwise, the snapshot may miss a version that must have been included. The replica first computes the t_safe and blocks the snapshot at time t, if t_safe < t. Again note that, as we said above, the client is free to contact any of the replicas (not just the Paxos leaders) to execute the snapshot at its desired timestamp. 

Let's first consider the easier case where the replica is not aware of any prepared-but-not-committed transaction. Let t_p be the timestamp of the most recent applied Paxos write. In this case, we set t_safe to t_p. Why is this correct? Because we know that any future Paxos write will have a timestamp higher than t_p, as a result of the invariants prepare_timestamp and greater_than_all_prepare_timestamps introduced above.

Now, let's consider the case where the replica is aware of a prepared-but-not-committed transaction. In this case, a future Paxos record may carry a smaller timestamp than the most recent Paxos write, due to older but pending transactions. Thus, we have to take prepared transactions into account when we compute t_safe. Let t_m be the minimum of the timestamps assigned to prepared transactions, minus 1. Picking t_safe = min(t_p, t_m) guarantees that we will never apply a transaction with a timestamp smaller than or equal to t_safe on this replica, which is the guarantee we need for external consistency.

For example, suppose a replica receives <prepare T2, 4> and then <commit T1, 6>. Now, t_p is 6, and t_m is 4 - 1 = 3. Thus, t_safe will be min(6, 3) = 3.
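The rule above can be sketched as follows (a hypothetical helper, assuming integer timestamps):

```python
# Sketch of the t_safe computation described above (names illustrative).
def compute_t_safe(t_paxos, prepared_timestamps):
    if not prepared_timestamps:          # easier case: no prepared transactions
        return t_paxos
    t_m = min(prepared_timestamps) - 1   # a prepared txn may still commit at its
    return min(t_paxos, t_m)             # prepare timestamp or later

# The example above: <prepare T2, 4> then <commit T1, 6>
print(compute_t_safe(6, [4]))  # -> 3
```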

But in a Paxos group, not all replicas are guaranteed to know the most recent Paxos write. Thus, it is possible that there is a prepared transaction, but this particular replica is not aware of it. In the example above, the replica might not have received <prepare T2, 4>, and it may incorrectly choose 6.
It is true that a replica may not be aware of the most recent Paxos write. However, what you describe cannot happen, as writes in a Paxos group are guaranteed to be applied in order. Thus, it is impossible for the replica to see <commit T1, 6> but not <prepare T2, 4>. Not being aware of the most recent Paxos writes and prepare records may result in a smaller t_safe, which in turn results in a higher risk of blocking due to t_safe < snapshot timestamp, but it does not violate external consistency.

What if a snapshot picks time t, but there is no activity in the system? Then t_safe won't advance, and the snapshot will block forever. How can we avoid that?
We don't want to go into the details of the Paxos algorithm used by Spanner here, but basically, Spanner guarantees that t_p advances periodically even when there are no new transactions for the partition.

In summary, 

  • Spanner uses 2PC+2PL over Paxos leaders to support serializability in a replicated and partitioned system. 
  • It uses MVCC to provide consistent snapshots. 
  • It uses TrueTime and waits out the uncertainty window to guarantee external consistency. 


[1] Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 1-22.

