Zookeeper Single File Leader Election with Retry Logic

In a previous post I explained how to implement leader election as suggested on the Zookeeper website. I posted my solution on the Zookeeper mailing list and got some useful tips. The first one was on how to do leader election. The method suggested is a single file leader election. Works like this:

  1. Try creating ephemeral ZNode
    1. Succes, become leader
    2. Fail, stay inactive
  2. Set watch on ZNode
  3. If ZNode disappears goto 1.

All clients try create the same ephemeral node, which has to be unique, so only one client will be able to create the node. The client who creates the node first becomes the leader, the rest  of the clients stay inactive and wait for the node to disappear before trying to create a node again.

The second suggestion I was given is to keep the connection to Zookeeper in some sort of retry logic. Assuming things will go wrong we have to make sure there is a system in place which can recover from a bad situation.
Here a class diagram of all objects involved:
Class diagram single file leader election
The following sequence diagram explains the flow of the program in case we have a successful election, followed by a signal indicating the node has disappeared.

The ZnodeMonitor implements the Watcher interface which has the process method. As soon as the ZNodeMonitor is set as  the Watcher the Zookeeper can talk back to it in case something changes.

First the client connects to the zookeeper by constructing a RecoverableZookeeper. This happens in the start method, called by the SpeakerServer:

public void start() throws IOException {
    this.zk = new RecoverableZookeeper(connectionString, this);

The ZNodeMonitor sets itself as a Watcher (this). This gives Zookeeper the opportunity to call back the ZNodeMonitor as soon as it is connected to the server. It does this by sending a None event type with a SyncConnected state.

public void process(WatchedEvent watchedEvent) {
    switch (watchedEvent.getType()) {
        case None:
        case NodeDeleted:
    try {
        zk.exists(ROOT, this);
    } catch (Exception e) {

public void processNoneEvent(WatchedEvent event) {
    switch (event.getState()) {
        case SyncConnected:
        case AuthFailed:
        case Disconnected:
        case Expired:

Next the client tries to create a ZNode:

public void createZNode() {
    try {
        zk.create(ROOT, listener.getProcessName().getBytes());
    } catch (Exception e) {
        // Something went wrong, lets try set a watch first before
        // we take any action

After this (successful or not) it tries setting a watch via the exist() method (also in the process method, see above).

The retry logic is encapsulated in the RecoverableZookeeper. The create() method wraps the original Zookeeper create method in the following retry logic:

public String create(String path, byte[] data) throws KeeperException, InterruptedException {
    RetryCounter retryCounter = retryCounterFactory.create();
    while (true) {
        try {
            return zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        } catch (KeeperException e) {
            logger.debug("Error code: " + e.code());
            switch (e.code()) {
                case NODEEXISTS:
                    if (retryCounter.shouldRetry()) {
                        byte[] currentData = zk.getData(path, false, null);
                        if (currentData != null && Arrays.equals(currentData, data)) {
                            return path;
                        throw e;
                    throw e;
                case CONNECTIONLOSS:
                case OPERATIONTIMEOUT:
                    if (!retryCounter.shouldRetry()) {
                    throw e;

If the creation of the Znode fails it retries until the reties get exhausted. In case the node already exists it first checks if it created the node itself, otherwise it will throw an exception. In case of a loss of connection or an operation timeout, it keeps retrying.

The complete code is available in my github account. Any questions or suggestions please feel free to send me an email or post a comment. This code also include some simple Ruby scripts to create configuration files and to start and stop multiple Zookeeper servers on a single machine. Handy if you want to test different scenarios.

For a full implementation of retry logic with Zookeeper I recommend the Netflix Zookeeper Library (Curator) which implements all of this and much more and is tested in a large scale environment.


3 thoughts on “Zookeeper Single File Leader Election with Retry Logic

  1. Pingback: Implementing Leader Election with Zookeeper | Olivier's Technology Blog

  2. I would recommend elaborating your leader to have three states that include active, inactive and paused. You have the first two states and encode the transitions between them with startSpeaking and stopSpeaking, but you don’t handle the third state. That third state should be entered from the active state on connection loss. From paused state you can return to active if you get a connection event or you can go to inactive if you get an expiration event.

    This is important because when you get a disconnection, you don’t know the state of who is the leader. Some applications want to at most one leader and rarely zero, some want to have at least one leader but rarely more and almost never none. The first kind of application will be happy with your setup, but the second kind will not. With a paused state, a leader could choose to stay a leader until told to stop.

    The paused state is also very helpful even in the at-most-one kind of application if becoming a leader is expensive. The idea is that a leader in the paused state could refrain from tearing down their service until they are notified that they are no longer leader or an implausibly long time has passed. This avoids the cost of rebuilding or restarting the service if the connection is re-established very quickly as is common.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s