Replication adds redundancy by maintaining multiple copies of data in the system. Since updating multiple copies of data atomically is equivalent to solving consensus, doing so for every database operation can be expensive. This chapter looks at more cost-effective ways to make the data look consistent to the user while still allowing replicas to diverge somewhat.
Infamous CAP
Availability measures the ability of the system to respond to every request successfully, and is considered in the context of node failures.
Ideally, we’d like every operation to be consistent. Here consistency means linearizable consistency.
We would like to achieve both consistency and availability while tolerating network partitions. A network partition occurs when the network splits into parts such that nodes in different parts cannot communicate with each other.
The CAP conjecture, formulated by Eric Brewer, says that in the presence of network partitions we cannot have both consistency and availability. This gives us a spectrum of choices whose two ends are:
- Consistent and partition tolerant: Prefer failing requests to serving inconsistent data. Example: A consensus protocol
- Available and partition tolerant: Serve potentially inconsistent values rather than fail the request. Example: A database system that keeps accepting writes and serving reads as long as even a single replica is up.
The PACELC conjecture is an extension of CAP. It states that in the presence of network partitions (P) there is a choice between availability (A) and consistency (C); else (E), even when the system is running normally, we still have to choose between latency (L) and consistency (C).
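The two ends of this spectrum can be sketched roughly as two read strategies. The sketch below is a toy illustration, not a complete protocol: `cpRead` and `apRead` are hypothetical helpers, and a real system would also need quorum writes, versioning, and failure detection.

```go
package main

import (
	"errors"
	"fmt"
)

// replica is a hypothetical copy of a single value with a version number.
type replica struct {
	value   int
	version int
	alive   bool
}

// cpRead prefers failing the request to serving inconsistent data: it
// refuses to answer unless a majority of replicas can be reached.
func cpRead(replicas []replica) (int, error) {
	reachable := 0
	best := replica{}
	for _, r := range replicas {
		if !r.alive {
			continue
		}
		reachable++
		if r.version >= best.version {
			best = r
		}
	}
	if reachable <= len(replicas)/2 {
		return 0, errors.New("no majority reachable, failing the read")
	}
	return best.value, nil
}

// apRead answers as long as any replica is up, even if its copy is stale.
func apRead(replicas []replica) (int, error) {
	for _, r := range replicas {
		if r.alive {
			return r.value, nil
		}
	}
	return 0, errors.New("no replica reachable")
}

func main() {
	// Two of three replicas are cut off by a partition; the surviving one
	// holds an older version of the value.
	replicas := []replica{
		{value: 1, version: 1, alive: true},
		{value: 2, version: 2, alive: false},
		{value: 2, version: 2, alive: false},
	}
	fmt.Println(cpRead(replicas)) // error: only 1 of 3 replicas reachable
	fmt.Println(apRead(replicas)) // returns 1, possibly stale, but available
}
```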
CAP vs ACID
Note that consistency in CAP is different from consistency in ACID. In the latter, it means transactional consistency: the database moves from one valid state to another atomically.
Similarly, availability in CAP is different from high availability as usually understood: CAP requires every nonfailed node to be able to respond to every request, and it puts no bound on execution latency.
Harvest and Yield
The CAP theorem defines consistency as linearizability and availability as the ability of the system to eventually respond to every request, forcing a hard choice between the two.
Instead, we can think of a system in these terms:
- Harvest: Defines how complete a query result is. If a query should return 100 rows but could only fetch 99 because a database partition was not accessible, it may be better to return the incomplete result than to fail the request entirely.
- Yield: The number of requests that completed successfully, compared to the total number of requests. This is different from uptime, since a busy node might be unable to serve some requests even though it is not down.
These shift the consideration from absolute to relative terms. We can improve the Yield by allowing some requests to return incomplete data, reducing the Harvest. Conversely, for a mission-critical component we can choose to always return complete data, improving the Harvest at the cost of the Yield.
When we build a system, it helps to think of what Harvest and Yield it will support.
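As a toy illustration of trading Harvest for Yield, the sketch below fans a query out to several hypothetical partitions and skips the unreachable ones instead of failing the whole request; `queryPartition` is a stand-in for a real partition read, not an API from any particular system.

```go
package main

import (
	"errors"
	"fmt"
)

// queryPartition is a stand-in for a real partition read; partition 2 is
// pretended to be unreachable.
func queryPartition(id int) ([]string, error) {
	if id == 2 {
		return nil, errors.New("partition unreachable")
	}
	return []string{fmt.Sprintf("row-from-partition-%d", id)}, nil
}

func main() {
	partitions := []int{0, 1, 2, 3}
	var rows []string
	reachable := 0

	for _, p := range partitions {
		res, err := queryPartition(p)
		if err != nil {
			continue // degrade Harvest instead of failing the request, keeping Yield up
		}
		rows = append(rows, res...)
		reachable++
	}

	harvest := float64(reachable) / float64(len(partitions))
	fmt.Printf("returned %d rows, harvest = %.2f\n", len(rows), harvest)
}
```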
Shared Memory
The distributed system storing the data appears to the outside world as if it had a single shared memory; all inter-node communication is abstracted away. A single unit in this shared memory that a user can read or write is called a register. We can view shared memory in a distributed system as an array of such registers.
Operations on these registers are characterised by their invocation and completion times. We consider an operation failed if the calling process crashes between invocation and completion.
Based on how registers behave in the presence of concurrent reads and writes, we classify them into the following types (a sketch of an atomic register follows the list):
- Safe: While a concurrent write is happening, reads to a safe register may return arbitrary values within the range of the register.
- Regular: A read operation can return only the value written by the most recently completed write or the value written by a write that is happening concurrently.
- Atomic: Guarantees linearizability. Every write operation has a single moment before which every read operation returns the old value and after which every read operation returns the new one.
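A minimal single-process sketch of an atomic register, assuming a mutex-guarded value: every read and write takes effect at one indivisible point, so concurrent readers never observe the arbitrary or partial values that safe and regular registers allow.

```go
package main

import (
	"fmt"
	"sync"
)

// AtomicRegister is a mutex-guarded single value: each Read and Write takes
// effect at one indivisible point in time.
type AtomicRegister struct {
	mu    sync.Mutex
	value int
}

func (r *AtomicRegister) Read() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.value
}

func (r *AtomicRegister) Write(v int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.value = v
}

func main() {
	var r AtomicRegister
	var wg sync.WaitGroup
	// Concurrent writers: a reader can only ever observe one of the
	// written values, never a partial or arbitrary one.
	for i := 1; i <= 4; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			r.Write(v)
		}(i)
	}
	wg.Wait()
	fmt.Println("final value:", r.Read())
}
```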
Ordering
We discuss distributed systems using the vocabulary of shared memory and concurrent systems, because most of the terms used there apply here as well. However, we can't reuse most concurrent algorithms directly because of large differences in communication patterns, latency, and failure scenarios.
Consistency Models
Two perspectives on consistency:
- state: describes which state invariants hold and defines allowable relationships between copies of the data placed on different replicas
- operation: puts constraints on the order in which operations can occur
Strict consistency
Any write by any process is instantly available for subsequent reads by any process. This involves the concept of a global clock, and hence it is only a theoretical model.
Linearizability
The effects of a write become visible to all readers exactly once at some point between its start and its end, and no client can observe state transitions or side effects of partial (in-flight) or incomplete (interrupted before completion) writes.
This defines a total order on the operations. The order is consistent, in that a read always returns the value written by an operation preceding it or a concurrent one.
Operations do not actually happen instantaneously, yet they appear to be atomic.
Linearization Point
The linearization point is a moment during an operation at which its effects become visible to all participants. It cuts the history into a before and an after, making the operation appear atomic.
After the linearization point, any read operation will observe this write or any other write ordered after this one.
Cost of Linearizability
Linearizability is expensive to achieve, so many databases avoid it. In single-node concurrent systems, we can use atomic compare-and-swap (CAS) operations to implement it; in distributed systems, we have to fall back to consensus.
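A minimal single-node sketch of using CAS as a linearization point, using Go's sync/atomic package: the successful compare-and-swap is the instant at which the new value becomes visible to every other goroutine.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var value int64 // shared register, initially 0
	var wg sync.WaitGroup

	// Goroutines race to move the register from 0 to their own id. Exactly
	// one CompareAndSwap succeeds, and the moment it succeeds is the
	// linearization point of that write.
	for id := int64(1); id <= 8; id++ {
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			if atomic.CompareAndSwapInt64(&value, 0, id) {
				fmt.Println("goroutine", id, "performed the visible write")
			}
		}(id)
	}
	wg.Wait()
	fmt.Println("final value:", atomic.LoadInt64(&value))
}
```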
Linearizability is a property of a single object. Since combined linearizable histories are themselves linearizable, a system in which the operations on every object are linearizable is also linearizable. However, operations involving multiple objects need additional synchronization.
Reusable Infrastructure for Linearizability
TODO: Skipped
Sequential Consistency
In this model, which is weaker than Linearizability:
- A process observes its operations to happen in the same order it has performed them
- Operations performed by all processes become visible to every process in the same order
- However, there is no restriction on when an operation must become visible to a process
- It just guarantees that the local history is consistent with the global history
Some observations about the figure:
- While P1's write completes before P2's, it can be ordered after P2's
- Operations propagating from different sources may be ordered arbitrarily
- Both P3 and P4 observe writes in the same order
- At a point in time, P3 observes the value as 1 whereas P4 observes it as 2. This would not be permitted under linearizability (see the sketch below)
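The sketch below is a toy model of this scenario (my own illustration, not from the source): a sequencer picks one global order of writes, and each reader's replica applies a prefix of that order at its own pace. All readers see writes in the same order, yet a snapshot taken at one instant shows P3 and P4 with different values, which sequential consistency allows and linearizability forbids.

```go
// Toy model: value 1 is written by P1, value 2 by P2; the sequencer fixes
// one global order that every replica applies.
package main

import "fmt"

func main() {
	// Global order chosen by the sequencer. Note that it does not have to
	// respect real time: even if P1's write completed first, it could
	// have been ordered after P2's.
	globalOrder := []int{1, 2}

	// Each reader's replica has applied some prefix of the global order.
	applyPrefix := func(n int) int {
		x := 0 // initial value of the register
		for _, v := range globalOrder[:n] {
			x = v
		}
		return x
	}

	p3 := applyPrefix(1) // P3 lags: has applied only the first write, sees 1
	p4 := applyPrefix(2) // P4 has applied both writes, sees 2

	// Same order everywhere, but different values at this instant:
	// sequentially consistent, not linearizable.
	fmt.Println("P3 reads:", p3, "P4 reads:", p4)
}
```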
Sequential Consistency != Linearizability
- Linearizability provides a total order on all writes that respects real time; SC only orders writes coming from the same process, and the agreed order need not respect real time
- Linearizable histories compose into a linearizable history; SC histories are not composable
Causal Consistency
A global order might not be needed; we might only need ordering between some writes. In this model, every client sees causally related operations in the same order. Concurrent operations with no causal relation between them can be seen in different orders by different clients.
Here, order is established through a logical clock. The write W(x, ∅, 1) -> t1 carries no dependency and receives logical clock timestamp t1 in response. The write W(x, t1, 2) -> t2 names t1 as its dependency, causally ordering it after the previous write. All clients will see these two writes in the same order.
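A minimal sketch of the W(x, dependency, value) pattern above, assuming a single hypothetical store that assigns logical timestamps: a write names the timestamp it depends on, and the store applies it only once that dependency is visible, so causally related writes are observed in the same order everywhere.

```go
package main

import "fmt"

// write records a value together with the timestamp it depends on and the
// timestamp the store assigned to it.
type write struct {
	value int
	dep   int // timestamp of the write this one causally depends on; 0 = none
	ts    int // logical timestamp assigned by the store
}

// store is a hypothetical single-node stand-in for a causally consistent store.
type store struct {
	clock   int
	visible map[int]bool // timestamps whose writes are already visible
	history []write
}

func newStore() *store { return &store{visible: map[int]bool{0: true}} }

// Write applies a value once its causal dependency is visible and returns
// the logical timestamp assigned to the write.
func (s *store) Write(value, dep int) (int, error) {
	if !s.visible[dep] {
		return 0, fmt.Errorf("dependency %d not yet visible", dep)
	}
	s.clock++
	w := write{value: value, dep: dep, ts: s.clock}
	s.history = append(s.history, w)
	s.visible[w.ts] = true
	return w.ts, nil
}

func main() {
	s := newStore()
	t1, _ := s.Write(1, 0)  // W(x, ∅, 1) -> t1: no dependency
	t2, _ := s.Write(2, t1) // W(x, t1, 2) -> t2: causally ordered after t1
	fmt.Println("t1 =", t1, "t2 =", t2, "history:", s.history)
}
```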