A couple of weeks ago, I decided that I wanted to learn how to develop FUSE filesystems. The result of this is CassaFS, a network filesystem that uses the Apache Cassandra database as a backend.
For those who haven’t looked at Cassandra before, it’s a very cool concept. The data it holds can be distributed across multiple nodes automatically (“it just works!”), so to expand a system, it just needs more machines thrown at it. Naturally, to expand a system properly, you need to add extra nodes in the correct numbers, or retune your existing systems; but even just adding extra nodes, without thinking too hard about it, will work, just not efficiently. The trade-off, however, is consistency – in situations where the system is configured to replicate data to multiple nodes, it can take time to propagate through.
Now, I realise I am not the first person to try writing a Cassandra-based filesystem; there’s at least one other that I know of, but it hasn’t been worked on for a couple of years, and Cassandra has changed quite a bit in that time, so I have no idea whether it still works or not.
Getting your mind around Cassandra’s data model is rather tricky, especially if you’re from an RDBMS background. Cassandra is a NoSQL database system, essentially a key-value system, and only the keys are indexed. This means you need get used to denormalising data (ie, duplicating it in various parts of the database), in order to read it efficiently. The best way to design a database for Cassandra is to look carefully at what queries your software is going to need to make, because you’re going to need a column family for each of those.
I hadn’t done any filesystem design before, when I started working on CassaFS, so I naively thought that I could use a file-path as an index. This actually worked, for a while – I had three column families: one for inodes, which contained stat(2) data, one for directories and one containing all the blocks of data:
Inode column family:
Key | Data |
---|---|
/ | uid: 0, gid: 0, mode: 0755, … etc |
/testfile | uid: 0, gid: 0, mode: 0644, … etc |
/testdir | uid: 0, gid: 0, mode: 0755, … etc |
Directory column family:
Key | Data |
---|---|
/ | [ (‘.’, ‘/’), (‘..’, ‘/’), (‘testfile’, ‘/testfile’), (‘testdir’, ‘/testdir’)] |
/testdir | [(‘.’, ‘/testdir’), (‘..’, ‘/’)] |
Block column family:
Key | Data |
---|---|
/testfile | [(0,BLOCK0DATA), (1,BLOCK1DATA)…] |
Of course, this model failed as soon as I thought about implementing hard links, because there’s no way to have multiple directory entries pointing at a single inode, if you’re indexing them by path name. So I replaced pathname indexes with random uuids, and then (naively, again) created a new Pathmap column family, to map paths to UUIDs:
Inode column family:
Key | Data |
---|---|
9d194247-ac93-40ea-baa7-17a4c0c35cdf | uid: 0, gid: 0, mode: 0755, … etc |
fc2fc152-9526-4e33-9df2-dba070e39c63 | uid: 0, gid: 0, mode: 0644, … etc |
74efdba6-57d4-4b73-94cc-74b34d452194 | uid: 0, gid: 0, mode: 0755, … etc |
Directory column family:
Key | Data |
---|---|
/ | [ (‘.’, 9d194247-ac93-40ea-baa7-17a4c0c35cdf ), (‘..’, 9d194247-ac93-40ea-baa7-17a4c0c35cdf), (‘testfile’, fc2fc152-9526-4e33-9df2-dba070e39c63), (‘testdir’, 74efdba6-57d4-4b73-94cc-74b34d452194)] |
/testdir | [(‘.’, 74efdba6-57d4-4b73-94cc-74b34d452194), (‘..’, 9d194247-ac93-40ea-baa7-17a4c0c35cdf)] |
Block column family:
Key | Data |
---|---|
fc2fc152-9526-4e33-9df2-dba070e39c63 | [(0,BLOCK0DATA), (1,BLOCK1DATA)…] |
Pathmap column family:
Key | Data |
---|---|
/ | 9d194247-ac93-40ea-baa7-17a4c0c35cdf |
/testfile | fc2fc152-9526-4e33-9df2-dba070e39c63 |
… |
This enabled me to get hard links working very easily, just by adding extra directory and pathmap entries for them, pointing at existing inodes. I used this model for quite a while, and hadn’t noticed any problem with it because I had forgotten to implement the rename() function (ie, for mv). It wasn’t until I tried building a debian package from source on CassaFS that it failed, and when I tried implementing this, I realised that mapping pathnames wasn’t going to work when renaming a directory, because every file underneath that directory would need to have its pathmap updated.
At that point, I saw it would be necessary to traverse the whole directory tree on every file lookup, to find its inode, and then just give the root inode a UUID of 00000000-0000-0000-0000-000000000000, so that it can be found easily. This way, I could use UUIDs as the Directory column family index, and do away with the Pathmap column family entirely.
Inode column family:
Key | Data |
---|---|
00000000-0000-0000-0000-000000000000 | uid: 0, gid: 0, mode: 0755, … etc |
fc2fc152-9526-4e33-9df2-dba070e39c63 | uid: 0, gid: 0, mode: 0644, … etc |
74efdba6-57d4-4b73-94cc-74b34d452194 | uid: 0, gid: 0, mode: 0755, … etc |
Directory column family:
Key | Data |
---|---|
00000000-0000-0000-0000-000000000000 | [ (‘.’, 00000000-0000-0000-0000-000000000000 ), (‘..’, 00000000-0000-0000-0000-000000000000), (‘testfile’, fc2fc152-9526-4e33-9df2-dba070e39c63), (‘testdir’, 74efdba6-57d4-4b73-94cc-74b34d452194)] |
74efdba6-57d4-4b73-94cc-74b34d452194 | [(‘.’, 74efdba6-57d4-4b73-94cc-74b34d452194), (‘..’, 900000000-0000-0000-0000-000000000000)] |
Block column family:
Key | Data |
---|---|
fc2fc152-9526-4e33-9df2-dba070e39c63 | [(0,BLOCK0DATA), (1,BLOCK1DATA)…] |
Yesterday, I discovered the Tuxera POSIX Test Suite, and tried it on CassaFS. At a rough estimate, it’s failing at least 25% of the tests, so there’s still plenty of work to do. At this stage, CassaFS is not useful for anything more than testing out Cassandra, as a way of getting a lot of data into it quickly, and trying out Cassandra’s distributed database abilities (except, since I have currently hardcoded 127.0.0.1:9160 into CassaFS, it will require some slight adjustment for this to actually work). You can even mount a single filesystem onto multiple servers and use them simultaneously – but I haven’t even begun to think about how I might implement file locking, so expect corruption if you have multiple processes working on a single file. Nor have I done any exception handling – this is software that is in a very, very early stage of development.
It’s all written in Python at present, so don’t expect it to be fast – although, that said, given that it’s talking to Cassandra, I’m not entirely sure how much of a performance boost will be gained from rewriting it in C. I’m still using the Cassandra Thrift interface (via Pycassa), despite Cassandra moving towards using CQL these days. I’m not sure what state Python CQL drivers are in, so for the moment, it was easier to continue using Pycassa, which is well tested.
For Debian and Ubuntu users, I have provided packages (currently i386 only because of python-thrift – I’ll get amd64 packages out next week) and it should be fairly simple to set up – quickstarter documentation here. Just beware of the many caveats that I’ve spelt out on that page. I’m hoping to get packages for RHEL6 working sometime soon, too.