Data Replication

To achieve reliability, we replicate data on multiple servers. In the hash ring approach (see System Design ch 5 - Consistent Hashing), we simply walk clockwise from the key's position and select the required number of distinct servers.
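As a minimal sketch of that selection, assuming each server occupies a single position on the ring (no virtual nodes); `hash_value` and `replicas_for` are illustrative names, not anything from the chapter:

```python
import hashlib
from bisect import bisect_right

def hash_value(key: str) -> int:
    """Map a key or server id onto the ring (here, the MD5 integer space)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas_for(key: str, ring: list[tuple[int, str]], n: int) -> list[str]:
    """Walk clockwise from the key's position and collect the next n distinct servers.

    `ring` is a list of (position, server_id) tuples sorted by position.
    """
    start = bisect_right(ring, (hash_value(key),))
    chosen: list[str] = []
    for i in range(len(ring)):
        server = ring[(start + i) % len(ring)][1]
        if server not in chosen:
            chosen.append(server)
        if len(chosen) == n:
            break
    return chosen

# Example: four servers, replication factor 3.
servers = ["s0", "s1", "s2", "s3"]
ring = sorted((hash_value(s), s) for s in servers)
print(replicas_for("user:42", ring, n=3))   # three distinct servers, in clockwise ring order
```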

Consistency

Since the data is replicated on multiple servers, these servers must be kept in sync. Reads and writes use quorum consensus: an operation only succeeds once a quorum of replicas has responded.

  • N: Number of replicas
  • W: Write quorum of size W. A write must be acknowledged by W replicas to be considered successful.
  • R: Read quorum of size R. A read operation waits for responses from R replicas.

The relationship between N, W, and R is the tradeoff between consistency and availability/performance. W + R > N gives us strong consistency: in every read, at least one server in the read quorum overlaps with the servers that acknowledged the last write, so the read serves the latest data.

For strong consistency, we can vary the W and R parameters for different needs:

  1. R = 1, W = N for fast reads
  2. W = 1, R = N for fast writes

When W + R <= N, we call it weak consistency. Eventual consistency is a form of weak consistency where, given enough time, updates propagate to all replicas. It is the design adopted by highly available systems.
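As a sketch of how a coordinator might enforce these quorums; `QuorumConfig`, `InMemoryReplica`, `quorum_write`, and `quorum_read` are illustrative names, not a real client API:

```python
from dataclasses import dataclass

@dataclass
class QuorumConfig:
    n: int   # number of replicas
    w: int   # write quorum
    r: int   # read quorum

    def is_strongly_consistent(self) -> bool:
        # W + R > N: the read quorum always overlaps the last write quorum.
        return self.w + self.r > self.n

class InMemoryReplica:
    """Stand-in replica so the sketch runs locally; a real one lives on another server."""
    def __init__(self):
        self.store = {}
    def put(self, key, value) -> bool:
        self.store[key] = value
        return True                       # acknowledge the write
    def get(self, key):
        return self.store.get(key)

def quorum_write(cfg: QuorumConfig, replicas, key, value) -> bool:
    """Succeeds only once at least W of the N replicas acknowledge the write."""
    acks = sum(1 for rep in replicas[:cfg.n] if rep.put(key, value))
    return acks >= cfg.w

def quorum_read(cfg: QuorumConfig, replicas, key):
    """Fails unless at least R of the N replicas respond (missing keys count as no response here)."""
    responses = [v for rep in replicas[:cfg.n] if (v := rep.get(key)) is not None]
    if len(responses) < cfg.r:
        raise RuntimeError("read quorum not met")
    # A real coordinator reconciles divergent versions here (e.g. with vector clocks).
    return responses[0]

cfg = QuorumConfig(n=3, w=2, r=2)                       # W + R > N: strong consistency
replicas = [InMemoryReplica() for _ in range(cfg.n)]
quorum_write(cfg, replicas, "k", "v")
print(quorum_read(cfg, replicas, "k"))                        # "v"
print(QuorumConfig(n=3, w=3, r=1).is_strongly_consistent())   # R = 1, W = N: True
print(QuorumConfig(n=3, w=1, r=1).is_strongly_consistent())   # W + R <= N: False
```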

Inconsistency resolution

Eventually consistent systems have high availability, but they will also accept inconsistent writes. Concurrent writes to different replicas can both be accepted; how do we reconcile them? One solution is to use vector clocks.

Each data item is associated with an array of versions, called a vector clock. The version at index i is associated with server i. The vector clock starts at version 0 for every server. When server i performs a write, it increments the version at index i.

To compare two clocks, we compare the versions at each index. When the version at every index in Clock_1 is greater than or equal to the version at the corresponding index in Clock_2, we say that Clock_1 >= Clock_2.

A larger vector clock means the value is newer. Incomparable clocks indicate a conflict, which must be resolved by the client. Observe that vector clocks form a partial order.

For efficiency, instead of keeping the entire vector, we can keep a map from server index to version; indices not present in the map are implicitly at version 0. Even this map can grow very large, in which case we can evict the oldest pairs. This can hurt conflict detection, but the Amazon Dynamo paper reports that it has not been a problem in production.
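A sketch of that map-based variant, with string server ids standing in for server indices (the helper names are illustrative):

```python
def increment(clock: dict, server_id: str) -> dict:
    """Return a copy of the clock with this server's version bumped (missing entries count as 0)."""
    new = dict(clock)
    new[server_id] = new.get(server_id, 0) + 1
    return new

def dominates(a: dict, b: dict) -> bool:
    """True if a >= b at every index, i.e. a is at least as new as b."""
    return all(a.get(server, 0) >= version for server, version in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a is newer"
    if dominates(b, a):
        return "b is newer"
    return "conflict"   # incomparable: concurrent writes, the client must reconcile

# Two servers accept concurrent writes starting from the same ancestor:
base = increment({}, "s0")      # {"s0": 1}
v1 = increment(base, "s0")      # {"s0": 2}
v2 = increment(base, "s1")      # {"s0": 1, "s1": 1}
print(compare(v1, base))        # "a is newer"
print(compare(v1, v2))          # "conflict"
```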

Failure detection

We need a decentralised way to detect node failures, and the gossip protocol is one. Each node maintains heartbeat counters for all members, periodically increments its own counter, and every interval sends its heartbeat list to a few random members. When a server goes down, its heartbeat counter stops increasing, and this information eventually propagates to the entire cluster.
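A rough sketch of the bookkeeping each gossiping node might do; the fan-out, failure timeout, and the `send` network call are assumptions made for illustration:

```python
import random
import time

FANOUT = 3         # how many random peers to gossip to per round (illustrative)
FAIL_AFTER = 10.0  # seconds without a heartbeat increase before suspecting a node

def send(peer: str, table: dict) -> None:
    """Placeholder for the real network call that ships our table to a peer."""

class GossipNode:
    def __init__(self, node_id: str, peers: list[str]):
        self.node_id = node_id
        self.peers = peers
        # member id -> (heartbeat counter, local time we last saw it increase)
        self.table = {node_id: (0, time.time())}

    def tick(self) -> None:
        """Bump our own heartbeat and gossip the table to a few random peers."""
        counter, _ = self.table[self.node_id]
        self.table[self.node_id] = (counter + 1, time.time())
        for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
            send(peer, self.table)

    def merge(self, remote_table: dict) -> None:
        """Adopt any heartbeat counters newer than what we already know."""
        now = time.time()
        for member, (counter, _) in remote_table.items():
            known = self.table.get(member)
            if known is None or counter > known[0]:
                self.table[member] = (counter, now)

    def suspected_failures(self) -> list[str]:
        """Members whose heartbeat hasn't increased within FAIL_AFTER seconds."""
        now = time.time()
        return [m for m, (_, seen) in self.table.items() if now - seen > FAIL_AFTER]
```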

Handling Failures

Temporary Failure

When a node goes down temporarily, we can do this:

  • Ignore offline servers and pick the first W and R healthy servers on the hash ring (a "sloppy quorum").
  • If a server doesn't respond, send its data to another server, which stores a hint and hands the data back once the original server comes back up. This is called hinted handoff (see the sketch below).
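Here is a sketch of the hinted-handoff idea under those assumptions; `send_write` and the `is_up` health check stand in for the real networking layer and are not real APIs:

```python
from collections import defaultdict

def send_write(replica: str, key, value) -> None:
    """Placeholder for the real RPC that applies a write on a replica."""

class HintedHandoff:
    """A substitute node stores writes ("hints") meant for an offline replica
    and replays them once that replica becomes reachable again."""

    def __init__(self):
        # intended replica -> list of (key, value) writes it missed
        self.hints = defaultdict(list)

    def write(self, intended_replica: str, key, value, is_up) -> None:
        if is_up(intended_replica):
            send_write(intended_replica, key, value)
        else:
            self.hints[intended_replica].append((key, value))   # keep a hint locally

    def on_recovery(self, replica: str) -> None:
        """Replay every hinted write to the recovered replica, then drop the hints."""
        for key, value in self.hints.pop(replica, []):
            send_write(replica, key, value)
```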

Permanent Failure

To handle permanent failures, we implement an anti-entropy protocol to keep replicas in sync. A Merkle tree is used to detect inconsistencies and minimise the amount of data transferred.
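A toy illustration of the Merkle-tree comparison, assuming both replicas split their key space into the same fixed buckets; only buckets under mismatching hashes need to be exchanged:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Node:
    def __init__(self, hash_: str, children=(), bucket=None):
        self.hash = hash_
        self.children = children
        self.bucket = bucket      # for leaves: index of the key-range bucket they cover

def build(buckets: list[bytes]) -> Node:
    """Leaves hash a bucket of keys; parents hash the concatenation of their children's hashes."""
    nodes = [Node(h(b), bucket=i) for i, b in enumerate(buckets)]
    while len(nodes) > 1:
        paired = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            paired.append(Node(h("".join(n.hash for n in pair).encode()), children=pair))
        nodes = paired
    return nodes[0]

def diff(a: Node, b: Node) -> list[int]:
    """Descend only into subtrees whose hashes differ; return the bucket indices to re-sync."""
    if a.hash == b.hash:
        return []                 # whole subtree already in sync, nothing to transfer
    if not a.children:
        return [a.bucket]         # mismatched leaf: this bucket must be exchanged
    return [i for ca, cb in zip(a.children, b.children) for i in diff(ca, cb)]

# Two replicas that differ only in bucket 2:
replica_a = [b"k1:v1", b"k2:v2", b"k3:v3", b"k4:v4"]
replica_b = [b"k1:v1", b"k2:v2", b"k3:OLD", b"k4:v4"]
print(diff(build(replica_a), build(replica_b)))   # [2]
```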

Data center outage

To survive a data center outage, replicate data across multiple data centers.

System Architecture Diagram

  • A coordinator is a node that communicates with the client directly, acting as a proxy between the client and the cluster.

Write Path

A write is first persisted in a commit log on disk. It is then written to a memory cache (a memtable), which is flushed to SSTables on disk once it reaches a threshold.
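A minimal, in-process sketch of that write path; the commit-log file name, the flush threshold, and keeping "SSTables" in a Python list are simplifications rather than a real storage engine:

```python
import json

MEMTABLE_LIMIT = 4   # flush threshold, kept tiny for illustration

class WritePath:
    def __init__(self, commit_log_path: str = "commit.log"):
        self.commit_log = open(commit_log_path, "a")
        self.memtable = {}     # in-memory cache of recent writes
        self.sstables = []     # flushed, sorted, immutable tables (newest last)

    def put(self, key, value) -> None:
        # 1. Persist the write to the commit log first, for crash recovery.
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # 2. Apply it to the memory cache (memtable).
        self.memtable[key] = value
        # 3. When the memtable is full, flush it to an SSTable.
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self) -> None:
        sstable = dict(sorted(self.memtable.items()))   # sorted string table
        self.sstables.append(sstable)                   # a real store writes this to disk
        self.memtable.clear()

wp = WritePath()
for i in range(5):
    wp.put(f"k{i}", i)            # the fourth put triggers a flush
print(wp.sstables, wp.memtable)   # one SSTable with k0..k3; k4 still in the memtable
```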

Read Path

The read request first checks the memory cache; if the key isn't there, it is read from the SSTables on disk. A Bloom filter is used to figure out which SSTables might contain the key.
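A sketch of that read path with a hand-rolled Bloom filter per SSTable; the filter size and hash count are arbitrary choices for illustration:

```python
import hashlib

class BloomFilter:
    """May answer "maybe present" wrongly, but never answers "absent" wrongly."""
    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._positions(key))

def read(key, memtable, sstables):
    """`sstables` is a list of (bloom_filter, table) pairs, newest first."""
    if key in memtable:                 # 1. check the memory cache first
        return memtable[key]
    for bloom, table in sstables:       # 2. skip SSTables that cannot hold the key
        if bloom.might_contain(key) and key in table:
            return table[key]
    return None

# Build the per-SSTable filter at flush time, then serve a read:
table = {"k1": "v1", "k2": "v2"}
bloom = BloomFilter()
for k in table:
    bloom.add(k)
print(read("k2", memtable={}, sstables=[(bloom, table)]))   # "v2"
```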