Monthly Archives: October 2012

Handling Cassandra concurrency issues, using Apache Zookeeper

One of the big problems with CassaFS when I released the code a week and a half ago was the number of potential write races: multiple nodes trying to create the same file or directory at the same time, or writing to the same block at the same time, to name just two of the concurrency scenarios that could play out.

This is because Cassandra doesn’t have the ability to provide atomic writes when updating multiple rows. It can provide atomic writes across multiple columns in a single row, but I would need to redesign the schema of CassaFS to take advantage of this, and even then, there are still going to be a number of operations that need to alter multiple rows, so this is unlikely to help in the long run.

The upshot is that locking needs some sort of external mechanism, preferably one that can fail over to one or more other hosts.

After a bit of testing, Apache Zookeeper, described as a “Distributed Coordination Service for Distributed Applications”, seems like the perfect candidate for this. It’s easy to configure, the documentation (at least, for the Java interface) is excellent, and there are plenty of examples to learn from. Best of all, being distributed means that it isn’t a single point of failure.

Configuring Zookeeper to work across multiple servers was very simple – it was just a matter of adding the IP addresses and ports of all the servers to the Zookeeper configuration files.
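As a rough sketch of what that configuration looks like, the helper below renders a zoo.cfg for a three-server ensemble. The function name, the IP addresses and the data directory are hypothetical; the keys (tickTime, initLimit, syncLimit, dataDir, clientPort, server.N) are standard Zookeeper settings, and 2888/3888 are the conventional peer and leader-election ports.

```python
def render_zoo_cfg(hosts, data_dir="/var/lib/zookeeper", client_port=2181):
    """Return zoo.cfg text for an ensemble made of the given hosts."""
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        "dataDir=%s" % data_dir,
        "clientPort=%d" % client_port,
    ]
    # One server.N line per ensemble member: host:peer-port:election-port
    for server_id, host in enumerate(hosts, start=1):
        lines.append("server.%d=%s:2888:3888" % (server_id, host))
    return "\n".join(lines) + "\n"


# Hypothetical addresses, for illustration only
cfg = render_zoo_cfg(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(cfg)
```

The same file goes on every server. Each server also needs a myid file in its dataDir containing its own N from the matching server.N line.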

Zookeeper also has a Python interface, but other than the inline pydoc documentation, there’s not a lot of explanation of how to use it. I’ve muddled through and put together code to allow locking, based upon the example given on the Zookeeper webpages, here.

The Zookeeper namespace works rather like an in-memory filesystem; it’s a tree of directories/files (nodes). Watches can be set on nodes, which send a notification when a node has changed; I’ve used this facility in the locking code to look for the removal of nodes, when a process is releasing a lock.
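To make the tree-plus-watches idea concrete, here is a toy in-memory model. It is not how Zookeeper is implemented and it is not the zkpython API; the class and method names are made up for illustration. It only mimics two behaviours: nodes live in a path hierarchy, and a watch fires once when the watched node is deleted.

```python
class ToyZnodeTree(object):
    """In-memory toy model of a znode tree with one-shot watches."""

    def __init__(self):
        self.nodes = {"/": b""}   # path -> data
        self.watches = {}         # path -> list of callbacks

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("no parent node: " + parent)
        if path in self.nodes:
            raise KeyError("node exists: " + path)
        self.nodes[path] = data

    def exists(self, path, watch=None):
        # Register the watch (if any) and report whether the node is there
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return path in self.nodes

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        names = (p[len(prefix):] for p in self.nodes if p.startswith(prefix))
        return sorted(n for n in names if n and "/" not in n)

    def delete(self, path):
        del self.nodes[path]
        # Watches are one-shot: remove them as they fire
        for callback in self.watches.pop(path, []):
            callback("deleted", path)


tree = ToyZnodeTree()
tree.create("/locks")
tree.create("/locks/guid-lock-0000000000")

events = []
tree.exists("/locks/guid-lock-0000000000",
            watch=lambda event, path: events.append((event, path)))
tree.delete("/locks/guid-lock-0000000000")
print(events)
```

A process waiting for a lock does the equivalent of the exists() call with a watch, then sleeps until the callback tells it the node has gone.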

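import zookeeper
from os.path import basename
from threading import Condition

# placeholder connection string (an assumption); list your own Zookeeper servers here
servers = 'localhost:2181'

# world/anyone ACL with all permissions, defined by hand as in the zkpython examples
ZOO_OPEN_ACL_UNSAFE = {'perms': 0x1f, 'scheme': 'world', 'id': 'anyone'}

cv = Condition()
zh = zookeeper.init(servers)

# zkpython calls a watcher with (handle, event type, connection state, path)
def notify(handle, event_type, state, lockfile):
    # wake up the process waiting in get_lock()
    with cv:
        cv.notify()

def get_lock(path):
    lockfile = zookeeper.create(zh, path + '/guid-lock-', 'lock', [ZOO_OPEN_ACL_UNSAFE], zookeeper.EPHEMERAL | zookeeper.SEQUENCE)
    mynode = basename(lockfile)

    while True:
        children = zookeeper.get_children(zh, path)

        # obviously the code below can be done more efficiently, without sorting and reversing
        if children is not None:
            children.sort()
            children.reverse()

        # walking from highest to lowest, the first node below ours is the one to watch
        found = 0
        for child in children:
            if child < mynode:
                found = 1
                break

        # no lower-numbered node exists, so we hold the lock
        if not found:
            return lockfile

        with cv:
            if zookeeper.exists(zh, path + '/' + child, notify):
                # Process will wait here until notify() wakes it
                cv.wait()

def drop_lock(lockfile):
    # deleting our node fires the watch set by the next process in the queue
    zookeeper.delete(zh, lockfile)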
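The heart of this lock recipe is deciding which node to watch: the lock belongs to whoever created the lowest-numbered node, and everyone else waits on the node immediately before their own. A minimal sketch of that selection, without any sorting, might look like the following. The function name is hypothetical, and it relies on Zookeeper's sequence suffix being zero-padded to ten digits, so that plain string comparison matches numeric order.

```python
def node_to_watch(children, own_name):
    """Return the child that sorts immediately before own_name, or None."""
    lower = [child for child in children if child < own_name]
    return max(lower) if lower else None


children = [
    "guid-lock-0000000003",
    "guid-lock-0000000001",
    "guid-lock-0000000002",
]

print(node_to_watch(children, "guid-lock-0000000003"))  # guid-lock-0000000002
print(node_to_watch(children, "guid-lock-0000000001"))  # None: this node holds the lock
```

Watching only the immediate predecessor, rather than the lowest node, means a released lock wakes exactly one waiter instead of all of them.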

Using it is straightforward; just call get_lock() before the critical section of code, and then drop_lock() at the end:

def create(path):
    lockfile = get_lock(path)
    try:
        # critical code here
        pass
    finally:
        # always release the lock, even if the critical code raises
        drop_lock(lockfile)

In CassaFS, I’ve implemented this as a class, and then created subclasses to allow locking based upon path name, inode and individual blocks. It all works nicely, although as one would expect, it has slowed everything down quite a bit.
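The CassaFS classes themselves aren't shown here, so the following is only a guess at the general shape: a base class that wraps a pair of acquire/release functions as a context manager, with subclasses that differ only in where their lock nodes live. Every name in it (ZkLock, PathLock, InodeLock, BlockLock, the /locks prefixes, the hashing of path names) is an assumption made for illustration, and the fake functions stand in for get_lock() and drop_lock() so the example runs without a Zookeeper server.

```python
import hashlib


class ZkLock(object):
    """Context manager around a pair of acquire/release callables."""

    prefix = "/locks"  # hypothetical parent node for lock nodes

    def __init__(self, key, acquire, release):
        self.lock_path = "%s/%s" % (self.prefix, key)
        self._acquire = acquire
        self._release = release
        self._lockfile = None

    def __enter__(self):
        self._lockfile = self._acquire(self.lock_path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._release(self._lockfile)
        self._lockfile = None
        return False  # never swallow exceptions from the critical section


class InodeLock(ZkLock):
    prefix = "/locks/inode"


class BlockLock(ZkLock):
    prefix = "/locks/block"


class PathLock(ZkLock):
    prefix = "/locks/path"

    def __init__(self, path, acquire, release):
        # Node names cannot contain "/", so hash the file path (an assumption)
        key = hashlib.sha1(path.encode("utf-8")).hexdigest()
        ZkLock.__init__(self, key, acquire, release)


# Stand-ins for get_lock() and drop_lock(), recording what was called
log = []

def fake_get_lock(path):
    log.append(("get", path))
    return path + "/guid-lock-0000000000"

def fake_drop_lock(lockfile):
    log.append(("drop", lockfile))

with InodeLock(42, fake_get_lock, fake_drop_lock) as lock:
    log.append(("work", lock.lock_path))

print(log)
```

Using a context manager means the lock is dropped even when the critical section raises, which matters less with ephemeral nodes but avoids holding a lock until the session ends.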

I used cluster-ssh to test CassaFS before and after I added the locks; beforehand, creating a single directory on four separate servers simultaneously would succeed without error; now, with locking, one server will create the directory, and it will fail on the remaining three.
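The behaviour with locking can be imitated in a single process. In this simulation, four threads stand in for the four servers, a dictionary stands in for Cassandra, and a threading.Lock stands in for the Zookeeper lock; none of it touches CassaFS or Zookeeper, and all the names are made up for the example.

```python
import threading


class DirectoryExists(Exception):
    pass


store = {}               # stands in for the Cassandra data
lock = threading.Lock()  # stands in for the Zookeeper lock
results = []


def mkdir(path, owner):
    # The check and the write happen under the lock, so they cannot interleave
    with lock:
        if path in store:
            raise DirectoryExists(path)
        store[path] = owner


def worker(server):
    try:
        mkdir("/testdir", server)
        results.append((server, "created"))
    except DirectoryExists:
        results.append((server, "failed"))


threads = [threading.Thread(target=worker, args=("server%d" % i,))
           for i in range(1, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

outcomes = sorted(outcome for _, outcome in results)
print(outcomes)  # ['created', 'failed', 'failed', 'failed']
```

Which server wins varies from run to run, but exactly one creates the directory and the other three fail.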

For anyone on Ubuntu or Debian wanting a quickstart guide to getting Zookeeper up and running, and then testing it a bit, it’s just a matter of:

apt-get install zookeeper
/usr/share/zookeeper/bin/ start
/usr/share/zookeeper/bin/ -server
# now we're in the Zookeeper CLI, try creating and deleting a few nodes
ls /
create /node1 foo
get /node1
create /node1/node2 bar
create /node1/node3 foobar
ls /node1
delete /node1/node2
ls /node1