ZooKeeper is a high available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates key configuration information. ZooKeeper can be used for leader election, group membership, and configuration maintenance. In addition ZooKeeper can be used for event notification, locking, and as a priority queue mechanism. (source Zookeeper homepage)
Data storage in Zookeeper
To make Zookeeper a highly available and reliable coordination system, clients can store data in a hierarchical data structure in the same way a file system tree is built. This tree is only kept in memory and can handle up to 50,000 request per second. Zookeeper itself runs in a distributed fashion and the data is available in every running instance. As soon as a write request comes in the receiving Zookeeper will do an atomic broadcast to all the other Zookeepers. Data is guaranteed to never diverge.
Znodes, normal, ephemeral and sequence
Nodes in a Zookeeper tree are called znodes. A znode is a data (stat) struct with statistical data and a byte array to store user data. Zookeeper is not meant to be a database so there are checks to ensure znoeds have less than 1M of data. Here is an overview of the stats:
- czxid – The zxid of the change that caused this znode to be created.
- mzxid – The zxid of the change that last modified this znode.
- crime – The time in milliseconds from epoch when this znode was created.
- mime – The time in milliseconds from epoch when this znode was last modified.
- version – The number of changes to the data of this znode.
- coercion – The number of changes to the children of this znode.
- aversion – The number of changes to the ACL of this znode.
- ephemeralOwner – The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.
- dataLength – The length of the data field of this znode.
- numChildren – The number of children of this znode.
Besides the normal znode there is also another type of znode called ephemeral node. This node only exists as long as there is a connection to the client that created it. It disappears as soon as the connection is closed or lost. Ephemeral nodes can not have child nodes.
Every type of node can be created as a sequence node. A sequence node will have a number appended to the node name, every time a child node gets added the number increases by one. With this you can keep track which child nodes under a parent where created first.
Operations on znodes
Clients can create, modify, delete and ask information about nodes in the Zookeeper tree, the operations which can be done are very similar to the ones you can do on a file system.
Here is a list:
- create – creates a node at a location in the tree
- delete – deletes a node
- exists – tests if a node exists at a location
- get data – reads the data from a node
- set data – writes data to a node
- get children – retrieves a list of children of a node
- sync – waits for data to be propagated
Once the client is connected and nodes are created the client can set a watch on specific nodes in the tree. The moment these nodes change the Zookeeper server will send an event to the subscribed clients. Watches can be set with the following operations: getData(), getChildren() and exists. One thing to note: events will always arrive in order at the client.
There are five event types, all self explanatory, except the last one:
- Node children changed
- Node created
- Node data changed
- Node deleted
The first four all have to do with events happening related to a watched node. The last one is special, it indicates a change in the state of the connection. When you get a None event check the keeper state to so whats going on. Possible values are:
- AuthFailed – Auth failed state
- Disconnected – The client is in the disconnected state – it is not connected to any server in the ensemble
- Expired – The serving cluster has expired this session
- SyncConnected – The client is in the connected state – it is connected to a server in the ensemble (one of the servers specified in the host connection parameter during ZooKeeper client creation)
As you can guess, this first thing your watcher will get once it connects is a None event with a SyncConnected state.
So far an overview of the main workings of Zookeeper. There is a lot more to it, I haven’t spoken about ACLs, master election, two phase commits, but I hope this overview will you get started.