One of the big problems with CassaFS, when I released the code a week and a half ago, was the potential for write races – whether multiple nodes trying to create the same file or directory at the same time, or writing to the same block at the same time, to name just a few of the concurrency scenarios that could play out.
This is because Cassandra can’t provide atomic writes when updating multiple rows. It can provide atomic writes across multiple columns in a single row, but I would need to redesign the CassaFS schema to take advantage of this, and even then a number of operations would still need to alter multiple rows, so this is unlikely to help in the long run.
The upshot of this is that some sort of external locking mechanism was going to be needed – preferably one with the ability to fail over to one or more other hosts.
After a bit of testing, Apache Zookeeper, described as a “Distributed Coordination Service for Distributed Applications”, seems like the perfect candidate. It’s easy to configure, the documentation (at least for the Java interface) is excellent, and there are plenty of examples to learn from. Best of all, being distributed means that it isn’t a single point of failure.
Configuring Zookeeper to work across multiple servers was very simple – it was just a matter of adding the IP addresses and ports of all the servers to the Zookeeper configuration files.
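As a rough illustration, the ensemble section of a three-server zoo.cfg might look like this (the hostnames and data directory are placeholders, not my actual setup):

```ini
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# one line per member of the ensemble; the two ports are used for
# quorum traffic and leader election respectively
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each server also needs a myid file in its dataDir containing just its server number, so that it knows which server.N line it is.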
Zookeeper also has a Python interface, but other than the inline pydoc documentation, there’s not a lot of explanation of how to use it. I’ve muddled through and put together code to allow locking, based upon the example given on the Zookeeper webpages, here.
The Zookeeper namespace works rather like an in-memory filesystem; it’s a tree of directories/files (nodes). Watches can be set on nodes, which send notifications when a node has changed; I’ve used this facility in the locking code to watch for the removal of nodes when a process releases a lock.
```python
import zookeeper

from os.path import basename
from threading import Condition

cv = Condition()
servers = "127.0.0.1:2181"
zh = zookeeper.init(servers)

# the standard world-readable/writable ACL
ZOO_OPEN_ACL_UNSAFE = {"perms": 0x1f, "scheme": "world", "id": "anyone"}

# Watcher callback: Zookeeper passes in the session handle, the event
# type, the connection state, and the path of the node that changed
def notify(handle, event_type, state, path):
    cv.acquire()
    cv.notify()
    cv.release()

def get_lock(path):
    lockfile = zookeeper.create(zh, path + '/guid-lock-', 'lock',
                                [ZOO_OPEN_ACL_UNSAFE],
                                zookeeper.EPHEMERAL | zookeeper.SEQUENCE)
    while True:
        children = zookeeper.get_children(zh, path)
        # obviously the code below can be done more efficiently,
        # without sorting and reversing
        found = False
        if children is not None:
            children.sort()
            children.reverse()
            for child in children:
                if child < basename(lockfile):
                    found = True
                    break
        if not found:
            # our node has the lowest sequence number, so we hold the lock
            return lockfile
        cv.acquire()
        # watch the next-lowest node, and sleep until notify() wakes us
        if zookeeper.exists(zh, path + '/' + child, notify):
            cv.wait()
        cv.release()

def drop_lock(lockfile):
    zookeeper.delete(zh, lockfile)
```
Using it is straightforward; just call get_lock() before the critical section of code, and then drop_lock() at the end:
```python
def create(path):
    ...
    lockfile = get_lock(path)
    # critical code here
    drop_lock(lockfile)
```
In CassaFS, I’ve implemented this as a class, and then created subclasses to allow locking based upon path name, inode and individual blocks. It all works nicely, although as one would expect, it has slowed everything down quite a bit.
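For illustration, here is a minimal sketch of how such a class structure could look; all of the names here (ZkLock, the lock root, the subclass names) are my own invention rather than the actual CassaFS code, and it assumes the get_lock()/drop_lock() functions defined above:

```python
# Sketch of a class-based lock: the base class wraps the
# get_lock()/drop_lock() pattern, and each subclass only decides
# where in the Zookeeper tree its lock node lives.

class ZkLock(object):
    root = '/cassafs/locks'   # assumed lock root in the Zookeeper tree
    kind = None               # set by subclasses

    def __init__(self, key):
        self.key = key
        self.lockfile = None

    def lock_node(self):
        # the znode under which the ephemeral sequence children are created
        return '%s/%s/%s' % (self.root, self.kind, self.key)

    def acquire(self):
        # get_lock() as defined earlier
        self.lockfile = get_lock(self.lock_node())

    def release(self):
        drop_lock(self.lockfile)
        self.lockfile = None

class PathLock(ZkLock):
    kind = 'path'

class InodeLock(ZkLock):
    kind = 'inode'

class BlockLock(ZkLock):
    kind = 'block'
```

Each lock type then only differs in the key it locks on (a path name, an inode number, or a block identifier), while the acquire/release logic lives in one place.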
I used cluster-ssh to test CassaFS before and after I added the locks; beforehand, creating a single directory on four separate servers simultaneously would succeed without error; now, with locking, one server will create the directory, and it will fail on the remaining three.
For anyone on Ubuntu or Debian wanting a quickstart guide to getting Zookeeper up and running, and then testing it a bit, it’s just a matter of:
```shell
apt-get install zookeeper
/usr/share/zookeeper/bin/zkServer.sh start
/usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181

# now we're in the Zookeeper CLI, try creating and deleting a few nodes
ls /
create /node1 foo
get /node1
create /node1/node2 bar
create /node1/node3 foobar
ls /node1
delete /node1/node2
ls /node1
quit
```