A couple of weeks ago, I decided that I wanted to learn how to develop FUSE filesystems. The result of this is CassaFS, a network filesystem that uses the Apache Cassandra database as a backend.
For those who haven’t looked at Cassandra before, it’s a very cool concept. The data it holds can be distributed across multiple nodes automatically (“it just works!”), so to expand a system, it just needs more machines thrown at it. Naturally, to expand a system properly, you need to add extra nodes in the correct numbers, or retune your existing systems; but even just adding extra nodes, without thinking too hard about it, will work, just not efficiently. The trade-off, however, is consistency – in situations where the system is configured to replicate data to multiple nodes, it can take time to propagate through.
Now, I realise I am not the first person to try writing a Cassandra-based filesystem; there’s at least one other that I know of, but it hasn’t been worked on for a couple of years, and Cassandra has changed quite a bit in that time, so I have no idea whether it still works or not.
Getting your mind around Cassandra’s data model is rather tricky, especially if you’re from an RDBMS background. Cassandra is a NoSQL database system, essentially a key-value system, and only the keys are indexed. This means you need get used to denormalising data (ie, duplicating it in various parts of the database), in order to read it efficiently. The best way to design a database for Cassandra is to look carefully at what queries your software is going to need to make, because you’re going to need a column family for each of those.
I hadn’t done any filesystem design before, when I started working on CassaFS, so I naively thought that I could use a file-path as an index. This actually worked, for a while – I had three column families: one for inodes, which contained stat(2) data, one for directories and one containing all the blocks of data:
Inode column family:
Key | Data |
---|---|
/ | uid: 0, gid: 0, mode: 0755, … etc |
/testfile | uid: 0, gid: 0, mode: 0644, … etc |
/testdir | uid: 0, gid: 0, mode: 0755, … etc |
Directory column family:
Key | Data |
---|---|
/ | [ (‘.’, ‘/’), (‘..’, ‘/’), (‘testfile’, ‘/testfile’), (‘testdir’, ‘/testdir’)] |
/testdir | [(‘.’, ‘/testdir’), (‘..’, ‘/’)] |
Block column family:
Key | Data |
---|---|
/testfile | [(0,BLOCK0DATA), (1,BLOCK1DATA)…] |
Of course, this model failed as soon as I thought about implementing hard links, because there’s no way to have multiple directory entries pointing at a single inode, if you’re indexing them by path name. So I replaced pathname indexes with random uuids, and then (naively, again) created a new Pathmap column family, to map paths to UUIDs:
Inode column family:
Key | Data |
---|---|
9d194247-ac93-40ea-baa7-17a4c0c35cdf | uid: 0, gid: 0, mode: 0755, … etc |
fc2fc152-9526-4e33-9df2-dba070e39c63 | uid: 0, gid: 0, mode: 0644, … etc |
74efdba6-57d4-4b73-94cc-74b34d452194 | uid: 0, gid: 0, mode: 0755, … etc |
Directory column family:
Key | Data |
---|---|
/ | [ (‘.’, 9d194247-ac93-40ea-baa7-17a4c0c35cdf ), (‘..’, 9d194247-ac93-40ea-baa7-17a4c0c35cdf), (‘testfile’, fc2fc152-9526-4e33-9df2-dba070e39c63), (‘testdir’, 74efdba6-57d4-4b73-94cc-74b34d452194)] |
/testdir | [(‘.’, 74efdba6-57d4-4b73-94cc-74b34d452194), (‘..’, 9d194247-ac93-40ea-baa7-17a4c0c35cdf)] |
Block column family:
Key | Data |
---|---|
fc2fc152-9526-4e33-9df2-dba070e39c63 | [(0,BLOCK0DATA), (1,BLOCK1DATA)…] |
Pathmap column family:
Key | Data |
---|---|
/ | 9d194247-ac93-40ea-baa7-17a4c0c35cdf |
/testfile | fc2fc152-9526-4e33-9df2-dba070e39c63 |
… |
This enabled me to get hard links working very easily, just by adding extra directory and pathmap entries for them, pointing at existing inodes. I used this model for quite a while, and hadn’t noticed any problem with it because I had forgotten to implement the rename() function (ie, for mv). It wasn’t until I tried building a debian package from source on CassaFS that it failed, and when I tried implementing this, I realised that mapping pathnames wasn’t going to work when renaming a directory, because every file underneath that directory would need to have its pathmap updated.
At that point, I saw it would be necessary to traverse the whole directory tree on every file lookup, to find its inode, and then just give the root inode a UUID of 00000000-0000-0000-0000-000000000000, so that it can be found easily. This way, I could use UUIDs as the Directory column family index, and do away with the Pathmap column family entirely.
Inode column family:
Key | Data |
---|---|
00000000-0000-0000-0000-000000000000 | uid: 0, gid: 0, mode: 0755, … etc |
fc2fc152-9526-4e33-9df2-dba070e39c63 | uid: 0, gid: 0, mode: 0644, … etc |
74efdba6-57d4-4b73-94cc-74b34d452194 | uid: 0, gid: 0, mode: 0755, … etc |
Directory column family:
Key | Data |
---|---|
00000000-0000-0000-0000-000000000000 | [ (‘.’, 00000000-0000-0000-0000-000000000000 ), (‘..’, 00000000-0000-0000-0000-000000000000), (‘testfile’, fc2fc152-9526-4e33-9df2-dba070e39c63), (‘testdir’, 74efdba6-57d4-4b73-94cc-74b34d452194)] |
74efdba6-57d4-4b73-94cc-74b34d452194 | [(‘.’, 74efdba6-57d4-4b73-94cc-74b34d452194), (‘..’, 900000000-0000-0000-0000-000000000000)] |
Block column family:
Key | Data |
---|---|
fc2fc152-9526-4e33-9df2-dba070e39c63 | [(0,BLOCK0DATA), (1,BLOCK1DATA)…] |
Yesterday, I discovered the Tuxera POSIX Test Suite, and tried it on CassaFS. At a rough estimate, it’s failing at least 25% of the tests, so there’s still plenty of work to do. At this stage, CassaFS is not useful for anything more than testing out Cassandra, as a way of getting a lot of data into it quickly, and trying out Cassandra’s distributed database abilities (except, since I have currently hardcoded 127.0.0.1:9160 into CassaFS, it will require some slight adjustment for this to actually work). You can even mount a single filesystem onto multiple servers and use them simultaneously – but I haven’t even begun to think about how I might implement file locking, so expect corruption if you have multiple processes working on a single file. Nor have I done any exception handling – this is software that is in a very, very early stage of development.
It’s all written in Python at present, so don’t expect it to be fast – although, that said, given that it’s talking to Cassandra, I’m not entirely sure how much of a performance boost will be gained from rewriting it in C. I’m still using the Cassandra Thrift interface (via Pycassa), despite Cassandra moving towards using CQL these days. I’m not sure what state Python CQL drivers are in, so for the moment, it was easier to continue using Pycassa, which is well tested.
For Debian and Ubuntu users, I have provided packages (currently i386 only because of python-thrift – I’ll get amd64 packages out next week) and it should be fairly simple to set up – quickstarter documentation here. Just beware of the many caveats that I’ve spelt out on that page. I’m hoping to get packages for RHEL6 working sometime soon, too.
Hi, I’m the author of CassFS, which might be what you were referring to in your third paragraph. Yes, it’s essentially an abandoned project, first in favor of my own VoldFS and then my day job hacking on GlusterFS. Both CassFS and VoldFS were useful as explorations of certain ideas I had at the time, so maybe they have some value to fellow filesystem developers, but neither ever reached the point where I’d recommend their actual use.
CassaFS looks great, and I’m glad to see others thinking along these lines. Maybe if I ever have any spare time (ha!) I might actually try to contribute some code to it. Meanwhile, here are some things you might want to think about. These aren’t meant to be criticisms, just things that turned out to be hard problems while I was hacking on something similar.
(1) Cross-directory renames. Yes, they’re evil because they require atomic modification of both source and destination directories. Many users are just as well served by treating a cross-directory as separate link-to-new and unlink-from-old operations, or for that matter just by returning EXDEV and letting higher levels handle it. Still, if you did want to support them, you might want a “rename-to” field in each directory entry so that you could recover (somewhat) after a client crash in the middle.
(2) Write races. With the structure above, how would you handle concurrent writes within a single block (or concurrent appends)? One of the main reasons I switched my own efforts from CassFS to VoldFS was to take advantage of Voldemort’s conditional-put functionality which Cassandra lacked at the time. This allowed me to handle such races without locking. If Cassandra has added this (haven’t been keeping track enough to know) then you could at least do a simple “detect conflict and retry” kind of optimistic concurrency control.
(3) O_CREAT races. How do you avoid having two concurrent opens with O_CREAT result in both creating new files, the directory entry for the second effectively overwriting that for the first (and thus leaving it unlinked anywhere)?
(4) Unlinked files. Can you still write to a file that exists in the inode and block column families but has no entry anywhere in the directory column family? If so, how are such entries garbage-collected when the file is closed?
It’s perfectly reasonable to say that these are unsupported POSIX behaviors, but they do tend to guide a lot of filesystem design so I figured I’d bring them up. Please don’t take these as criticisms, and do feel free to ping me for further brainstorming.
Hi Jeff! Thanks for so much for the comments. I’m more than happy to have advice/criticism/whatever. As I said, I haven’t done any filesystem development before, so essentially don’t have much idea what I’m doing and am probably reinventing the wheel several times over.
1) Hmm, so is this the case where you have a process with a PWD of /1/2/3/4/5/6, and then, in a second process, move /1/2/3 to /a/b/c? I’ve just tested that, and it seems to cope with it. Maybe I’m misunderstanding the scenario?
2 & 3) Yep, I’ve had it in the back of my mind that there are likely to be write/create-races everywhere, but really haven’t thought about what I can do about these, yet
4) Ah, thanks. I’d tested the case when reading, but forgot about writing to a file in such a situation. But it seems that FUSE automatically handles this – it detects that the file is in use, and on an unlink, it renames it to a temporary file, which is finally unlinked when all processes accessing the file finish.
Not sure if you’re going to be notified that I’ve responded to your comments, so I might follow up with an email, too.
Cheers!
Might as well continue here, for the time being.
(1) The issue is what happens if a crash happens in the middle of a cross-directory rename. With the “link to new then unlink from old” approach, what happens if the client – the only one who knows this is really a rename – crashes after the link but before the unlink? You end up with a file that’s linked both places. That might actually be OK, but can also be avoided by adding some information before the link so that somebody who comes along later can recognize the interrupted rename for what it is and complete it.
(4) This might be a feature of the higher-level (path-based) FUSE interface, because it doesn’t happen with the lower-level (inode-based) interface that GlusterFS uses. In any case, how would this work with multiple clients? How does FUSE on client X know to “follow” when client Y renames an open file to the temp location, and how does either know when it’s safe to delete the file from there? Multiple clients accessing an unlinked file is a very strange case, I know, so it’s probably OK to punt on this and let I/O errors occur so long as their nature is properly noted.
Pingback: Handling Cassandra concurrency issues, using Apache Zookeeper | Contempt
Pingback: CassaFS – a FUSE-based filesystem using Apache Cassandra as a backend. | Contempt | Cloud Storage, Distributed File System | Scoop.it
Pingback: Links December 2012 | etbe - Russell Coker
Pingback: CassaFS – a FUSE-based filesystem using Apache Cassandra as a backend. | Contempt | Cloud & Big Data Platform | Scoop.it