Kavya G

Understanding Apache Zookeeper and Zab

Zookeeper

Understanding zookeeper by reading and understanding the following paper:

  1. Hunt, Patrick, et al. “ZooKeeper: Wait-free Coordination for Internet-scale Systems.” USENIX annual technical conference. Vol. 8. No. 9. 2010.

1.What is the research problem the paper attempts to address?

Distributed systems need to coordinate among themselves for multiple purposes. These include but are not limited to leader election (which nodes perform which duties), distributed locks, distributed queues, group membership (which nodes are a part of the system), dynamic configuration etc.

NOTE: Even though distributed locks etc. are synchronous primitives, they can be implemented using asynchronous primitives.

All distributed systems must thus have an implementation of each of the coordination protocols that they require.

And why is this a problem ???

2.What are the claimed contributions of the paper?

They claim to create a skeleton coordination protocol core (coordination kernel) with the following properties:

  1. It is wait-free / non-blocking / asynchronous:
  2. It ensures a FIFO ordering of requests from a particular client:
    • Requests sent by the client are executed in the order in which they are issued.
    • Clients can fire a series of operations and be guaranteed that they are processed in the order they were given
  3. It guarantees that writes are linearly serializable.
    • This means that all systems agree on the same order of write operations.
  4. It is fast, highly available and can process requests with low latency and high througput.
  5. Can be used to create higher level coordination protocols using coordination recipes.

3.How do the authors substantiate their claims?

ZooKeeper Architecture

  1. ZooKeeper Data Model
    • ZooKeeper provides a file system abstraction - a hierarchical namespace with data nodes (called znodes).
    • The znodes can have data and children.
    • The znodes service read and write requests from the client.
    • Znodes can be ephemeral or regular.
    • Sequential flag appends a znode with an increasing counter.
    • Data is completely replicated across the different systems (no sharding).
  2. ZooKeeper Sessions
    • Clients can create sessions for sending requests to the znodes.
    • Ephemeral znodes last only for a session.
  3. ZooKeeper Watches
    • The clients can request the znode to asynchronously watch a file for changes and give a one-time signal in case the file is modified.
    • NOTE: Watch only triggers only once after the change.
  4. Consistency Guarantees
    • Sequential consistency: Requests sent by client are executed in the order they are sent.

ZooKeeper Client API

  1. Create
  2. Delete
  3. Exist
  4. getData
  5. setData
  6. getChildren
  7. sync

Guarantees on properties of ZooKeeper

How to build higher constructs using ZooKeeper

All the higher constructs use the znodes (ephemeral and regular) and the watch functionality.

  1. Configuration
    1. There exists a znode called “Config” with the configuration details.
    2. Processes being created read from Config.
    3. They also watch for changes in the configuration.
  2. Group membership
    1. There exists a znode called “Group”.
    2. When a member joins the group (starts), it creates a child node of Group with name as unique process name or with sequential flag set and with data as the process data.
    3. To check membership, others can do getChildren(Group).
    4. Other processes can monitor changes in this process configuration by watching.
    5. To exit the group, simply delete the created child node.
  3. Barrier
    1. Allows a group of processes to synchronize at the beginning and end of a computation.
    2. There exists a znode called “Barrier”
    3. To enter the barrier, each client creates a child node of Barrier.
    4. Use getChildren(Barrier) to check if enough processes have entered the Barrier.
    5. If yes, continue execution of code - barrier check passed.
    6. If no, set a watch on Barrier and on triggering goto step 4.
    7. For exiting, delete the created child node and watch the number of children.
  4. Locks
    1. There exists a znode called “Lock”
    2. Create a child nodes (ephemeral) of Lock using sequential flags (using create).
    3. Check if you are the lowest numbered znode, if yes -> you get the lock (by calling getChildren(Lock)).
    4. Else, request a watch on the znode closest (and lower) than you.
    5. If watch triggers, goto 3 - could be just because the znode was deleted for various reasons without actually getting the lock.
    6. SPECIALITY: Prevents all clients from requesting for lock after someone unlocks by introducing a sequence of locks - apparently called the herd effect.
    7. Only try to obtain lock, after the client before you finishes. If they drop out, then since ephemeral node, the znode will be deleted, so no issues of hanging.
  5. Queues
  6. Leader Election
  7. 2 Phase Commit
    • It is a atomic commitment protocol (basically a consensus protocol) where all nodes agree on whether a transaction should be committed or aborted.

Zookeeper Atomic Broadcast

Zookeeper requires an atomic broadcast algorithm to ensure FIFO ordering of client write requests while maintaining consistency.

It is to Zookeeper what Paxos is to Chubby (Is this correct??? - Yes, verified. Also it is what raft is to etcd).

An atomic broadcast protocol ensures total order consistency, wherein the entire system appears to be a single machine (instead of being distributed).

The protocol should satisfy the following conditions:

  1. Fault tolerance and efficient recovery: Protocol should work even if the primary crashes.
  2. Reliable Delivery: If a message is delivered by one process, all processes should deliver it.
  3. Total ordering:
    • If a message m1 is delivered before message m2 in one server, it must be delivered in the same order in other servers.
    • m1 can be delivered either before or after m2.
  4. Multiple outstanding transactions: Multiple Zookeeper client operations can be outstanding (proposed but not yet delivered) and they must be executed in FIFO for better performance.
  5. Highly performant:
    • All Zookeeper applications will require ZAB, so it must have high througput and low latency.
  6. Primary order:
    1. Local primary order If a primary broadcasts m1 before m2, other processes must commit m1 before m2
    2. Global primary order If a primary broadcasts m1 and fails and another primary is elected which broadcasts m2. A process that commits both m1 and m2 must commit m1 before m2.
      • Differs from Causal Order: HOW???
  7. Consistency of processes:
    1. Integrity: If a process commits a transaction m1, then some process must have broadcast it.
    2. Agreement: All processes commit

Similar yet dissimilar to

Transaction

Phases of the algorithm

Phase 0: Leader Election

Phase 1: Discovery across the cluster

Phase 2: Synchronization of transactions

Phase 3: Broadcast of relevant information

Useful information

  1. Zab uses TCP to ensure a FIFO channel between the servers which guarantees delivery and order of the messages.
  2. Deliver is used in the meaning of commit.
  3. Server and process are used interchangably to denote a Zookeeper running program.
  4. Transaction and message are used interchangably to signify the communication between server and client.
  5. Quorum (as used by Zookeeper) signifies a majority of processes.
    • Thus, we assume 2f + 1 total processes, with minimum f processes alive.
    • TODO: Could quorum mean anything else as well???

References

ZooKeeper

  1. The paper
  2. Morning Paper’s summary of the paper
  3. Tom Wheeler’s Slides
  4. Guide to creating higher level constructs using ZooKeeper

    ZAB

  5. Morning Paper on Zab
  6. Morning Paper on Zab: Theory and Practice
  7. Official Zookeeper Website
  8. Some course’s slides
  9. Zab vs Paxos