Posts

NuRaftOnTheRocks: A Replicated Key-value Store using RocksDB and NuRaft

Image
NuRaft is an open source C++ Raft implementation. NuRaft is highly plugable; it allows you to provide your own persistent log store and state manager and integrate it with your state machine. That allows you to integrate NuRaft with any application whenever you need log replication. In this post, I show how you can integrate NuRaft with RocksDB to build a replicated key-value store.  Raft Raft is a consensus and log replication protocol. Raft allows multiple processes in a distributed system to agree on the entries of a shared log. Using such a guarantee, we can achieve state machine replication as follows: having a set of state machines starting at the same initial state, we record operations on the state machine on a shared log, and have each state machine perform operations specified in the shared log. This way all state machines remain consistent with each other. State machine replication is very useful. We can create replicated data stores. In this post, we w...

Log Structured Storage

Image
Log-structured storage is a powerful approach to organizing data that prioritizes sequential writes for efficiency and durability. In this post, I delve into the inner workings of log-structured storage, focusing on LSM-trees. We'll explore their write and read paths, the mechanics of compaction, and compare leveled and tiered compaction strategies. Additionally, I'll discuss the differences between LSM-trees and B-trees, and talk about variations like Jungle and Bitcask, highlighting their unique optimizations and trade-offs.  Basics Write Path Write are first appended to a Write-Ahead-Log (WAL) for durability, then they are written to a MemTable which is basically an in-memory sorted data structure such as a balanced tree (e.g. red-black or AVL tree) or a skip list (I have a detailed post about skip list on this blog). After writing a write to WAL and the MemTable, we can return success to the client. We can have multiple MemTables, but at any given time, ...

Ray: A Framework for Scaling Python Applications

Image
Ray is a framework for scaling Python applications. Ray allows you to execute the same Python program that you would run on your laptop on a cluster of computers with minimum effort. All you need is to tell Ray what part of your program needs to be scalable by decorating your Python functions and classes, and Ray takes care of the rest for you. Ray has been used by several companies including OpenAI  for building ChatGPT.

ByteGraph: A Graph Database for TikTok

Image
ByteGraph is a distributed graph database developed by ByteDance, the company behind the popular social network TikTok. As a large social media platform, ByteDance handles an enormous amount of graph data, and using a specialized graph database like ByteGraph is essential for efficiently managing and processing this data. In 2022, ByteDance published a paper on ByteGraph at the VLDB conference. In this post, I will provide an overview of my understanding of ByteGraph and highlight some of the key points from the paper that I found interesting.

DynamoDB, Ten Years Later

Image
Ten years after the public availability of DynamoDB in 2012, and fifteen years after the publication of the original Dynamo paper in 2007, the DynamoDB team published a paper in Usenix ATC 2022 on how DynamoDB has evolved over its ten years of operation. The paper covers several aspects of DynamoDB such as adaptation to customer access patterns, smarter capacity allocation, durability, correctness, and availability. In this post, I share some of points that I found interesting in the paper. 

Dual Data Structures

Image
Dual data structures are concurrent data structures that not only hold data, but also keep track of read requests using reservations. By holding both data and anti-data, dual data structures achieve better performance and fairness compared with other blocking and non-blocking concurrent data structures.

Eventual Consistency and Conflict Resolution - Part 2

Image
This is the second part of a two-part post on eventual consistency and the problem of conflict resolution. In the  first part , we focused on the definition of the eventual consistency, its sufficient conditions, and conflict resolution using physical and logical clocks. We saw each method has its problems. In this part, we will see how we can solve these issues and provide eventual consistency with better conflict resolution. Listen to the Audio Blog Conflict Resolution using Vector Clocks The good thing about physical clock timestamps is that they carry information related to the real-time order of events, e.g., if we write v2, one minute after writing v1, the physical timestamp of v2 will be most likely larger than that of v1, assuming we have reasonable clock synchronization in place. However, physical clocks are not 100% accurate. Thus, due to clock synchronization errors, the timestamps may not reflect ...