
Handling Cassandra concurrency issues, using Apache Zookeeper

One of the big problems with CassaFS, when I released the code a week and a half ago, was all the potential write races that could occur – whether it be multiple nodes trying to create the same file or directory at the same time, or writing to the same block at the same time, to name just a few of the potential concurrency scenarios that could play out.

This is because Cassandra doesn’t have the ability to provide atomic writes when updating multiple rows. It can provide atomic writes across multiple columns in a single row, but I would need to redesign the schema of CassaFS to take advantage of this, and even then, there are still going to be a number of operations that need to alter multiple rows, so this is unlikely to help in the long run.

The upshot is that in order to do locking, some sort of external mechanism was going to be needed – preferably one with the ability to fail over to one or more other hosts.

After a bit of testing, Apache Zookeeper, described as a “Distributed Coordination Service for Distributed Applications”, seems like the perfect candidate for this. It’s easy to configure, the documentation (at least for the Java interface) is excellent, and there are plenty of examples to learn from. And the best part: being distributed, it isn’t a single point of failure.

Configuring Zookeeper to work across multiple servers was very simple – it was just a matter of adding the IP addresses and ports of all the servers to the Zookeeper configuration files.
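For reference, a small ensemble just needs a server.N line per host in every node’s zoo.cfg, roughly like the sketch below (the hostnames are placeholders, and each server also needs a matching myid file in its data directory):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888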

Zookeeper also has a python interface, but other than the inline pydoc documentation, there’s not a lot of explanation of how to use it. I’ve muddled through and put together code to allow locking, based upon the example given on the Zookeeper webpages, here.

The Zookeeper namespace works rather like an in-memory filesystem; it’s a tree of directories/files (nodes). Watches can be set on nodes, which send notifications when a node has changed; I’ve used this facility in the locking code to watch for the removal of nodes when a process releases a lock.

import zookeeper
from os.path import basename
from threading import Condition

# World-readable/writable ACL, as used in the Zookeeper examples
ZOO_OPEN_ACL_UNSAFE = {"perms": 0x1f, "scheme": "world", "id": "anyone"}

cv = Condition()
servers = "127.0.0.1:2181"
zh = zookeeper.init(servers)

# Watcher callback: Zookeeper calls this with the handle, the event type,
# the connection state and the path of the node that triggered the watch
def notify(handle, event_type, state, path):
    cv.acquire()
    cv.notify()
    cv.release()

def get_lock(path):
    # Create an ephemeral, sequentially-numbered node under the lock path;
    # it disappears automatically if this client dies
    lockfile = zookeeper.create(zh, path + '/guid-lock-', 'lock',
                                [ZOO_OPEN_ACL_UNSAFE],
                                zookeeper.EPHEMERAL | zookeeper.SEQUENCE)

    while True:
        children = zookeeper.get_children(zh, path)

        # obviously the code below can be done more efficiently, without
        # sorting and reversing
        if children:
            children.sort()
            children.reverse()

        # Look for the node with the next-lowest sequence number below our
        # own; if there isn't one, we hold the lock
        found = None
        for child in children:
            if child < basename(lockfile):
                found = child
                break

        if found is None:
            return lockfile

        cv.acquire()
        # Watch the node ahead of us; when it is deleted, notify() wakes us
        # and we re-check whether we are now at the head of the queue
        if zookeeper.exists(zh, path + '/' + found, notify):
            # Process will wait here until notify() wakes it
            cv.wait()
        cv.release()

def drop_lock(lockfile):
    zookeeper.delete(zh, lockfile)

Using it is straightforward; just call get_lock() before the critical section of code, and then drop_lock() at the end:

def create(path):
    ...
    lockfile = get_lock(path)

    # critical code here

    drop_lock(lockfile)

In CassaFS, I’ve implemented this as a class, and then created subclasses to allow locking based upon path name, inode and individual blocks. It all works nicely, although as one would expect, it has slowed everything down quite a bit.
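For the curious, the basic shape is something like the sketch below. The class and subclass names are just made up for illustration (the real CassaFS classes may be organised quite differently); the idea is simply to wrap get_lock()/drop_lock() in a context manager, with each subclass deciding which part of the Zookeeper namespace it locks on:

# Illustrative only - not the actual CassaFS classes
class ZkLock(object):
    root = '/cassafs-locks'

    def __init__(self, name):
        # Each lock gets its own node; the parent node is assumed to
        # already exist, since get_lock() creates its sequential
        # children underneath it
        self.path = self.root + '/' + name
        self.lockfile = None

    def __enter__(self):
        self.lockfile = get_lock(self.path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        drop_lock(self.lockfile)

class InodeLock(ZkLock):
    def __init__(self, inode_uuid):
        ZkLock.__init__(self, 'inode-' + inode_uuid)

class BlockLock(ZkLock):
    def __init__(self, inode_uuid, block_number):
        ZkLock.__init__(self, 'block-%s-%d' % (inode_uuid, block_number))

# usage:
#     with InodeLock(uuid):
#         ... critical section ...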

I used cluster-ssh to test CassaFS before and after I added the locks. Beforehand, creating the same directory on four separate servers simultaneously would (incorrectly) succeed on all four without error; now, with locking, one server creates the directory and the operation fails on the remaining three, as it should.

For anyone on Ubuntu or Debian wanting a quickstart guide to getting Zookeeper up and running, and then testing it a bit, it’s just a matter of:

apt-get install zookeeper
/usr/share/zookeeper/bin/zkServer.sh start
/usr/share/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181
# now we're in the Zookeeper CLI, try creating and deleting a few nodes
ls /
create /node1 foo
get /node1
create /node1/node2 bar
create /node1/node3 foobar
ls /node1
delete /node1/node2
ls /node1
quit

CassaFS – a FUSE-based filesystem using Apache Cassandra as a backend.

A couple of weeks ago, I decided that I wanted to learn how to develop FUSE filesystems. The result of this is CassaFS, a network filesystem that uses the Apache Cassandra database as a backend.

For those who haven’t looked at Cassandra before, it’s a very cool concept. The data it holds can be distributed across multiple nodes automatically (“it just works!”), so to expand a system, you just throw more machines at it. Naturally, to expand a system properly you need to add extra nodes in the correct numbers, or retune your existing systems; but even adding extra nodes without thinking too hard about it will work, just not efficiently. The trade-off, however, is consistency – when the system is configured to replicate data to multiple nodes, it can take time for writes to propagate to all of them.
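As an aside, that trade-off is tunable per request. With pycassa, for instance, you can ask for QUORUM reads and writes, which wait for a majority of replicas at the cost of a little latency; the keyspace and column family names below are just placeholders, not anything CassaFS actually uses:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.cassandra.ttypes import ConsistencyLevel

# 'MyKeyspace' and 'MyColumnFamily' are placeholder names
pool = ConnectionPool('MyKeyspace', ['127.0.0.1:9160'])
cf = ColumnFamily(pool, 'MyColumnFamily',
                  read_consistency_level=ConsistencyLevel.QUORUM,
                  write_consistency_level=ConsistencyLevel.QUORUM)

# Reads and writes through cf now wait for a majority of replicas
cf.insert('somekey', {'somecolumn': 'somevalue'})
value = cf.get('somekey')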

Now, I realise I am not the first person to try writing a Cassandra-based filesystem; there’s at least one other that I know of, but it hasn’t been worked on for a couple of years, and Cassandra has changed quite a bit in that time, so I have no idea whether it still works or not.

Getting your mind around Cassandra’s data model is rather tricky, especially if you’re from an RDBMS background. Cassandra is a NoSQL database system, essentially a key-value system, and only the keys are indexed. This means you need to get used to denormalising data (i.e. duplicating it in various parts of the database) in order to read it efficiently. The best way to design a database for Cassandra is to look carefully at what queries your software is going to need to make, because you’re going to need a column family for each of those.

I hadn’t done any filesystem design before I started working on CassaFS, so I naively thought that I could use the file path as an index. This actually worked for a while – I had three column families: one for inodes, which contained stat(2) data; one for directories; and one containing all the blocks of data:

Inode column family:

Key          Data
/            uid: 0, gid: 0, mode: 0755, … etc
/testfile    uid: 0, gid: 0, mode: 0644, … etc
/testdir     uid: 0, gid: 0, mode: 0755, … etc

Directory column family:

Key          Data
/            [('.', '/'), ('..', '/'), ('testfile', '/testfile'), ('testdir', '/testdir')]
/testdir     [('.', '/testdir'), ('..', '/')]

Block column family:

Key          Data
/testfile    [(0, BLOCK0DATA), (1, BLOCK1DATA), …]
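For anyone playing along at home, setting up a schema like that with pycassa’s SystemManager only takes a few lines; the keyspace name and replication settings below are just an example, not necessarily what CassaFS itself uses:

from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY

sys_mgr = SystemManager('127.0.0.1:9160')

# One keyspace for the filesystem; a replication factor of 1 is only
# sensible for a single-node test setup
sys_mgr.create_keyspace('CassaFS', SIMPLE_STRATEGY,
                        {'replication_factor': '1'})

# One column family per query the filesystem needs to make
for cf_name in ('Inode', 'Directory', 'Block'):
    sys_mgr.create_column_family('CassaFS', cf_name)

sys_mgr.close()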

Of course, this model failed as soon as I thought about implementing hard links, because there’s no way to have multiple directory entries pointing at a single inode if you’re indexing inodes by path name. So I replaced the pathname indexes with random UUIDs, and then (naively, again) created a new Pathmap column family to map paths to UUIDs:

Inode column family:

Key                                     Data
9d194247-ac93-40ea-baa7-17a4c0c35cdf    uid: 0, gid: 0, mode: 0755, … etc
fc2fc152-9526-4e33-9df2-dba070e39c63    uid: 0, gid: 0, mode: 0644, … etc
74efdba6-57d4-4b73-94cc-74b34d452194    uid: 0, gid: 0, mode: 0755, … etc

Directory column family:

Key          Data
/            [('.', 9d194247-ac93-40ea-baa7-17a4c0c35cdf), ('..', 9d194247-ac93-40ea-baa7-17a4c0c35cdf), ('testfile', fc2fc152-9526-4e33-9df2-dba070e39c63), ('testdir', 74efdba6-57d4-4b73-94cc-74b34d452194)]
/testdir     [('.', 74efdba6-57d4-4b73-94cc-74b34d452194), ('..', 9d194247-ac93-40ea-baa7-17a4c0c35cdf)]

Block column family:

Key                                     Data
fc2fc152-9526-4e33-9df2-dba070e39c63    [(0, BLOCK0DATA), (1, BLOCK1DATA), …]

Pathmap column family:

Key          Data
/            9d194247-ac93-40ea-baa7-17a4c0c35cdf
/testfile    fc2fc152-9526-4e33-9df2-dba070e39c63

This enabled me to get hard links working very easily, just by adding extra directory and pathmap entries for them, pointing at existing inodes. I used this model for quite a while and hadn’t noticed any problem with it, because I had forgotten to implement the rename() function (i.e. what mv uses). It wasn’t until I tried building a Debian package from source on a CassaFS mount that this failed, and when I tried implementing rename(), I realised that mapping pathnames wasn’t going to work when renaming a directory, because every file underneath that directory would need to have its pathmap entry updated.

At that point, I saw it would be necessary to traverse the whole directory tree on every file lookup to find a file’s inode, and to give the root inode a fixed UUID of 00000000-0000-0000-0000-000000000000 so that it can always be found. This way, I could use UUIDs as the Directory column family index and do away with the Pathmap column family entirely.

Inode column family:

Key                                     Data
00000000-0000-0000-0000-000000000000    uid: 0, gid: 0, mode: 0755, … etc
fc2fc152-9526-4e33-9df2-dba070e39c63    uid: 0, gid: 0, mode: 0644, … etc
74efdba6-57d4-4b73-94cc-74b34d452194    uid: 0, gid: 0, mode: 0755, … etc

Directory column family:

Key                                     Data
00000000-0000-0000-0000-000000000000    [('.', 00000000-0000-0000-0000-000000000000), ('..', 00000000-0000-0000-0000-000000000000), ('testfile', fc2fc152-9526-4e33-9df2-dba070e39c63), ('testdir', 74efdba6-57d4-4b73-94cc-74b34d452194)]
74efdba6-57d4-4b73-94cc-74b34d452194    [('.', 74efdba6-57d4-4b73-94cc-74b34d452194), ('..', 00000000-0000-0000-0000-000000000000)]

Block column family:

Key                                     Data
fc2fc152-9526-4e33-9df2-dba070e39c63    [(0, BLOCK0DATA), (1, BLOCK1DATA), …]
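With the root inode pinned at the all-zeroes UUID, a path lookup is just a walk down the Directory column family, one component at a time. Roughly speaking (using pycassa, with the error handling stripped right back; this is a sketch of the idea, not the actual CassaFS code):

import errno

ROOT_UUID = '00000000-0000-0000-0000-000000000000'

def lookup(directory_cf, path):
    # Start at the root inode and follow one directory entry per path
    # component until we reach the target's inode UUID
    uuid = ROOT_UUID
    for component in path.strip('/').split('/'):
        if not component:                  # handles path == '/'
            continue
        entries = directory_cf.get(uuid)   # {name: child UUID, ...}
        if component not in entries:
            raise OSError(errno.ENOENT, 'No such file or directory', path)
        uuid = entries[component]
    return uuid

# e.g. lookup(directory_cf, '/testdir') -> '74efdba6-57d4-4b73-94cc-74b34d452194'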

Yesterday, I discovered the Tuxera POSIX Test Suite and tried it on CassaFS. At a rough estimate, it’s failing at least 25% of the tests, so there’s still plenty of work to do. At this stage, CassaFS is not useful for anything more than testing out Cassandra: it’s a quick way of getting a lot of data into it, and of trying out Cassandra’s distributed database abilities (although, since I have currently hardcoded 127.0.0.1:9160 into CassaFS, that will require some slight adjustment to actually work). You can even mount a single filesystem onto multiple servers and use them simultaneously – but I haven’t even begun to think about how I might implement file locking, so expect corruption if you have multiple processes working on a single file. Nor have I done any exception handling – this is software at a very, very early stage of development.

It’s all written in Python at present, so don’t expect it to be fast – although, that said, given that it’s talking to Cassandra, I’m not entirely sure how much of a performance boost will be gained from rewriting it in C. I’m still using the Cassandra Thrift interface (via Pycassa), despite Cassandra moving towards using CQL these days. I’m not sure what state Python CQL drivers are in, so for the moment, it was easier to continue using Pycassa, which is well tested.

For Debian and Ubuntu users, I have provided packages (currently i386 only because of python-thrift – I’ll get amd64 packages out next week), and it should be fairly simple to set up – quickstart documentation is here. Just beware of the many caveats that I’ve spelt out on that page. I’m hoping to get packages for RHEL6 working sometime soon, too.