Wednesday, May 22, 2013

On the Design of Distributed Systems: Part-1

Perhaps any study of distributed systems must begin with the famous CAP theorem. It states that it is impossible for a distributed system to simultaneously provide all of the three following guarantees.
  1. Consistency (all the nodes see the same data at any given time)
  2. Availability (Property that every request returns a response whether the request was successful or it failed)
  3. Partition Tolerance (or fault tolerance - the system continues to operate despite message losses, partition, network outage or failure of a part of a system)

Now, there is a whole rigorous proof for this theorem but intuitively here is how it can be explained. Suppose you want your system to be highly available then you would want that the system keeps working despite some nodes going down. Just by the virtue of this we are allowing inconsistency, because if those dead nodes became live again they would be in an inconsistent state for some window.

Further, each of these properties can have various sub-types. As an instance we often hear the term eventual consistency and strong consistency. By point 1, we actually mean strong consistency. Similarly a system cannot be indefinitely resilient to failure. As a simple thought experiment think how a system with N copies of data on separate physical machines is resilient to N-1 failures.

The next building block is a Vector clock. Simply stated, it is a kind of versioning mechanism for data. It is built on the  Principle of Causality, the cause must precede its effect. Just marrying this simple idea with the sending of messages you could actually order events in a distributed system. Here is a nice video lecture on vector clocks and logical clocks.

Well when I said order events in a distributed system, I implicitly made it nebulous. You cannot order events which are not causally related in some manner. To summarize, For any two events a and b in a system, the following two relations hold. (Here VC() is the vector clock timestamp of the event)
  1. If a happened before b then VC(a) < VC(b)
  2. If VC(a) < VC(b) then a happened before b

Events which are not causally related are termed  as concurrent events. For such events the happened before order is resolved manually. A simple way to do it is to use the absolute physical time. But it is always to be remembered that physical time cannot be perfectly synchronized. (unless you have a geo-synchronized atomic clock )

So far, so good. In the next post I will describe some of the NoSQL Key/Value stores and how they use these ideas to provide a highly available, robust, fault tolerant data storage.