Building a tiny FUSE filesystem

Lately I have been working on sandboxing, storage, and networking, and a lot of that work keeps coming back to files. That makes sense, since Unix has organized itself around "everything is a file" for over fifty years. Your terminal and random number generator are device files you can open and read (/dev/tty, /dev/urandom). Even network sockets, which are created with their own system call rather than opened by path, are read and written through the same interface afterwards.

For this post, I built a small filesystem with a real backing store, enough metadata to behave like a filesystem, and a few deliberate omissions so the code is still readable.

magicfs mounts at /magic and keeps its own local backing store in a separate store directory, with names and inode numbers in metadata.json and file contents stored as plain local files under blobs/. Calling that directory a blob store is a little grandiose, because the blobs are just files with allocated names like blob-000000000001. Still, keeping metadata separate from file contents lets the example cover name lookup, inode stability, write ordering, kernel caching, and what fsync() is asking the filesystem to do.

The full sample code is at github.com/shayonj/magicfs, and if you have Docker, you can run the filesystem with FUSE enabled.

Try it first

docker run -it --rm --device /dev/fuse --cap-add SYS_ADMIN shayonj/magicfs

$ ls /magic
hello.txt  notes.txt

$ cat /magic/hello.txt
Hello from a tiny FUSE filesystem.

$ echo "remember the milk" > /magic/notes.txt
$ cat /magic/notes.txt
remember the milk

Inside that shell, the mount point is the interface applications use, while the store directory is private state owned by the filesystem process. The shell sees an ordinary directory even though the data behind it is a metadata file plus a couple of local blobs.

$ find /tmp/magicfs-store -type f
/tmp/magicfs-store/metadata.json
/tmp/magicfs-store/blobs/blob-000000000001
/tmp/magicfs-store/blobs/blob-000000000002

In the store directory, the metadata file stands in for a tiny inode table and a tiny directory tree. It records the name, inode number, size, mode bits, and blob IDs for each file. The mode is stored as a decimal number, so 420 is octal 0644 (rw-r--r--).

{
  "next_inode": 4,
  "entries": {
    "hello.txt": {
      "ino": 2,
      "mode": 420,
      "size": 35,
      "blobs": [
        {
          "blob": "blob-000000000001",
          "offset": 0,
          "len": 35
        }
      ]
    },
    "notes.txt": {
      "ino": 3,
      "mode": 420,
      "size": 18,
      "blobs": [
        {
          "blob": "blob-000000000002",
          "offset": 0,
          "len": 18
        }
      ]
    }
  }
}

The path notes.txt is not where the bytes live. It is the name that gets you to inode 3, and the metadata for inode 3 points at a blob file under blobs/. Renaming notes.txt therefore changes only the directory metadata, while rewriting it creates a new blob and updates the metadata pointer.
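To make the name, inode, and blob relationship concrete, here is a minimal sketch of an in-memory form of metadata.json. The type and method names (Metadata, Entry, BlobRef, rename, repoint) are hypothetical and are not taken from the magicfs source, and the meaning of offset is assumed to be the piece's position within the file.

```rust
use std::collections::BTreeMap;

/// One contiguous piece of a file's contents, stored in a blob file.
#[derive(Debug, Clone, PartialEq)]
pub struct BlobRef {
    pub blob: String, // e.g. "blob-000000000002"
    pub offset: u64,  // assumed: where this piece starts within the file
    pub len: u64,     // number of bytes in this piece
}

/// Directory entry and inode metadata combined, mirroring one JSON entry.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub ino: u64,
    pub mode: u32, // permission bits; decimal 420 is octal 0o644
    pub size: u64,
    pub blobs: Vec<BlobRef>,
}

/// Hypothetical in-memory form of metadata.json.
#[derive(Debug, Default)]
pub struct Metadata {
    pub next_inode: u64,
    pub entries: BTreeMap<String, Entry>,
}

impl Metadata {
    /// Rename only moves the entry to a new key; inode and blobs are untouched.
    pub fn rename(&mut self, old: &str, new: &str) -> bool {
        match self.entries.remove(old) {
            Some(entry) => {
                self.entries.insert(new.to_string(), entry);
                true
            }
            None => false,
        }
    }

    /// A rewrite points the entry at a fresh blob; name and inode stay the same.
    pub fn repoint(&mut self, name: &str, blob: &str, size: u64) -> bool {
        match self.entries.get_mut(name) {
            Some(entry) => {
                entry.size = size;
                entry.blobs = vec![BlobRef { blob: blob.to_string(), offset: 0, len: size }];
                true
            }
            None => false,
        }
    }
}
```

In this sketch, rename never touches a blob ID and repoint never touches an inode number, which is the separation the paragraph above describes.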

Filesystems as a request loop

When you run cat /magic/hello.txt, cat does not know that JSON metadata and blob files are involved. All it does is call open() and read(). The kernel resolves the path through the VFS, and the operation eventually lands on the filesystem mounted at /magic.

With FUSE, the code that answers those filesystem requests runs in userspace. The kernel driver sends request messages over /dev/fuse, the userspace process replies, and the application that made the system call keeps waiting until the kernel has an answer. The kernel FUSE documentation covers the protocol, and the fuser crate exposes the same operations as Rust trait methods.

The path for a read looks roughly like this:

flowchart LR
    A["cat /magic/hello.txt"] --> B["Linux VFS"]
    B --> C["FUSE kernel driver"]
    C --> D["magicfs userspace process"]
    D --> E["metadata.json + local blobs"]
    E --> D
    D --> C
    C --> B
    B --> A

In the request log, LOOKUP asks whether a name exists in a directory and which inode it maps to, GETATTR asks for the metadata associated with an inode, READ asks for bytes at an offset, and WRITE sends bytes at an offset. Later in the lifetime of an open file, FLUSH, FSYNC, and RELEASE show up, and they make the write path more than a simple callback that copies bytes.
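As a rough model of that request and reply shape, the sketch below dispatches four of those operations against an in-memory map. It is a toy and does not use the fuser crate or the real FUSE wire protocol. Request, Reply, and ToyFs are hypothetical names, and NotFound stands in for an ENOENT error reply.

```rust
use std::collections::HashMap;

/// A few of the request kinds named above, as a toy enum.
pub enum Request<'a> {
    Lookup { parent: u64, name: &'a str },
    Getattr { ino: u64 },
    Read { ino: u64, offset: usize, size: usize },
    Write { ino: u64, offset: usize, data: &'a [u8] },
}

#[derive(Debug, PartialEq)]
pub enum Reply {
    Ino(u64),
    Size(usize),
    Data(Vec<u8>),
    Written(usize),
    NotFound, // stands in for an ENOENT error reply
}

#[derive(Default)]
pub struct ToyFs {
    names: HashMap<String, u64>,  // name -> inode number (root directory only)
    files: HashMap<u64, Vec<u8>>, // inode number -> contents
}

impl ToyFs {
    pub fn create(&mut self, name: &str, ino: u64, data: &[u8]) {
        self.names.insert(name.to_string(), ino);
        self.files.insert(ino, data.to_vec());
    }

    /// Answer one request. Inode 1 is the only directory.
    pub fn handle(&mut self, req: Request<'_>) -> Reply {
        match req {
            Request::Lookup { parent: 1, name } => match self.names.get(name) {
                Some(&ino) => Reply::Ino(ino),
                None => Reply::NotFound,
            },
            Request::Lookup { .. } => Reply::NotFound,
            Request::Getattr { ino } => match self.files.get(&ino) {
                Some(data) => Reply::Size(data.len()),
                None => Reply::NotFound,
            },
            Request::Read { ino, offset, size } => match self.files.get(&ino) {
                Some(data) => {
                    // Clamp the requested range to the end of the file.
                    let start = offset.min(data.len());
                    let end = start.saturating_add(size).min(data.len());
                    Reply::Data(data[start..end].to_vec())
                }
                None => Reply::NotFound,
            },
            Request::Write { ino, offset, data } => match self.files.get_mut(&ino) {
                Some(contents) => {
                    let end = offset + data.len();
                    if contents.len() < end {
                        contents.resize(end, 0);
                    }
                    contents[offset..end].copy_from_slice(data);
                    Reply::Written(data.len())
                }
                None => Reply::NotFound,
            },
        }
    }
}
```

The name only appears in Lookup. Every other request is addressed by inode number.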

Here is the log from writing notes.txt, trimmed to the requests involved in opening, truncating, writing, flushing, and releasing the file:

[magicfs] READDIR ino=1
[magicfs] LOOKUP notes.txt -> ino=3
[magicfs] OPEN notes.txt ino=3 flags=0x8001
[magicfs] SETATTR ino=3 size=0 staged=true
[magicfs] WRITE notes.txt ino=3 offset=0 len=18 staged=true
[magicfs] FLUSH notes.txt ino=3
[magicfs] COMMIT notes.txt ino=3 size=18 blobs=1
[magicfs] COMMIT metadata entries=2
[magicfs] RELEASE notes.txt ino=3 flags=0x8001 flush=true

In this log, READDIR is what ls triggers, whereas a direct cat /magic/hello.txt can walk the path without listing the directory first. Shell redirection with > opens the file for writing and truncation, so the kernel sends a size change (the SETATTR with size=0) before it sends the bytes. The WRITE handler only stages the new contents in memory, and the backing store does not change until the file is flushed or synced.

A filesystem usually has to answer a question about a name before it can answer anything about bytes, namely whether this name exists in this directory, and if it does, which file it refers to.

Linux mostly stops caring about filenames once path lookup is done, because internally it refers to files by inode number. On a disk filesystem, an inode is a record with metadata and pointers to data blocks, while a directory entry maps a name to an inode. That is why a rename can change a path without moving file data, and also why hard links can make the same inode appear under more than one name.

magicfs keeps the directory entry and inode metadata in metadata.json:

"notes.txt": {
  "ino": 3,
  "mode": 420,
  "size": 18,
  "blobs": [
    {
      "blob": "blob-000000000002",
      "offset": 0,
      "len": 18
    }
  ]
}

The LOOKUP handler for notes.txt reads that map and returns inode 3, while the GETATTR handler turns the entry into a FileAttr, which is what makes stat and ls -l work. The root directory uses inode 1, which is the conventional root inode for FUSE filesystems.
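A sketch of that entry-to-attributes step, using a simplified stand-in for fuser's FileAttr. The Attr struct here is hypothetical and much smaller than the real one, and the root directory's 0o755 permissions are an assumption.

```rust
/// Hypothetical subset of what a GETATTR reply carries.
pub struct Attr {
    pub ino: u64,
    pub size: u64,
    pub blocks: u64, // size in 512-byte units, as stat reports it
    pub perm: u16,
    pub is_dir: bool,
}

/// FUSE's conventional root inode.
pub const ROOT_INO: u64 = 1;

/// Build attributes for a regular file from its metadata entry fields.
pub fn file_attr(ino: u64, mode: u32, size: u64) -> Attr {
    Attr {
        ino,
        size,
        blocks: (size + 511) / 512,
        perm: (mode & 0o7777) as u16,
        is_dir: false,
    }
}

/// Attributes for the single root directory (permissions assumed).
pub fn root_attr() -> Attr {
    Attr { ino: ROOT_INO, size: 0, blocks: 0, perm: 0o755, is_dir: true }
}

/// Render the permission column the way `ls -l` would show it.
pub fn mode_string(attr: &Attr) -> String {
    let mut out = String::with_capacity(10);
    out.push(if attr.is_dir { 'd' } else { '-' });
    // Owner, group, and other each take three bits.
    for &shift in &[6u16, 3, 0] {
        let bits = (attr.perm >> shift) & 0o7;
        out.push(if bits & 0o4 != 0 { 'r' } else { '-' });
        out.push(if bits & 0o2 != 0 { 'w' } else { '-' });
        out.push(if bits & 0o1 != 0 { 'x' } else { '-' });
    }
    out
}
```

With the notes.txt entry (ino 3, mode 420, size 18), mode_string produces -rw-r--r--, since decimal 420 is octal 644.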

The ordering problem shows up before the read and write handlers do anything with file contents. If a new blob reaches the backing store but metadata.json still points at the old blob, readers keep seeing the old file. If metadata.json points at a blob that never made it to disk, readers see a broken file. magicfs handles the simple case by writing the blob first and then replacing metadata. The metadata replacement follows the usual local-filesystem pattern: the code writes a temporary file, syncs it, renames it over metadata.json, and then syncs the containing directory.

The temp-file-and-rename pattern avoids half-written JSON, but it is not a journal, and without a recovery pass or a transaction log, the filesystem cannot determine after a crash whether every in-flight metadata update had committed.
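The write, sync, rename, sync-directory sequence can be sketched with the standard library alone. This is not the magicfs code: replace_metadata and the metadata.json.tmp name are hypothetical, and syncing a directory by opening it as a File assumes a Unix-like system such as Linux.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replace `<store>/metadata.json` so readers see either the old or the new
/// file, never a partial one. `store` must already exist.
pub fn replace_metadata(store: &Path, json: &[u8]) -> io::Result<()> {
    let tmp = store.join("metadata.json.tmp"); // hypothetical temp name
    let dst = store.join("metadata.json");

    // 1. Write the complete new contents to a temporary file in the same
    //    directory, so the later rename stays within one filesystem.
    let mut file = File::create(&tmp)?;
    file.write_all(json)?;

    // 2. Flush the temporary file's data and metadata to the device.
    file.sync_all()?;
    drop(file);

    // 3. Atomically swap the name over to the new file.
    fs::rename(&tmp, &dst)?;

    // 4. Sync the directory so the rename itself survives a crash.
    File::open(store)?.sync_all()?;
    Ok(())
}
```

If the process dies between steps 2 and 3, the old metadata.json is still intact and a stray temporary file is left behind, which is the kind of leftover a recovery pass would clean up.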

File contents as local blobs

For the data path, magicfs stores each committed file version as one immutable blob with an allocated ID. A more complete filesystem would split larger files into chunks and let metadata point at a list of chunks, but one blob per file keeps the code short.

For reads, metadata comes first, so given inode 3, the filesystem finds the entry for notes.txt, reads the blob ID from that entry, opens the corresponding file under blobs/, and returns the byte range the kernel requested.

inode 3
  -> metadata entry for notes.txt
  -> blob ID blob-000000000002
  -> blobs/blob-000000000002
  -> bytes returned to READ
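The last two steps of that chain, opening the blob and returning the requested range, might look like the following. read_range is a hypothetical helper rather than the magicfs function, and it relies on the Unix-only FileExt::read_at.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

/// Return up to `size` bytes of a blob starting at `offset`.
/// A range that runs past the end of the blob is truncated, and a range
/// that starts past the end yields an empty buffer.
pub fn read_range(blobs_dir: &Path, blob_id: &str, offset: u64, size: usize) -> io::Result<Vec<u8>> {
    let file = File::open(blobs_dir.join(blob_id))?;
    let mut buf = vec![0u8; size];
    let mut filled = 0;
    // read_at may return fewer bytes than requested, so loop until the
    // buffer is full or the file reports end-of-file with a zero-length read.
    while filled < size {
        let n = file.read_at(&mut buf[filled..], offset + filled as u64)?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    buf.truncate(filled);
    Ok(buf)
}
```

The kernel often asks for more bytes than the file holds, so returning a short buffer at end-of-file is the normal case rather than an error.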

For writes, the data moves in the other direction, but magicfs does not mutate the blob in place. When the kernel sends WRITE, the filesystem stages the new file contents in memory, and later, when FLUSH or FSYNC arrives, it writes a new blob and updates metadata to point at it.
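A minimal sketch of that staging step, assuming the whole file fits in memory. Staged and blob_name are hypothetical names, and the 12-digit zero padding is inferred from blob names like blob-000000000001 rather than taken from the source.

```rust
/// Hypothetical staged contents for one open file.
#[derive(Default)]
pub struct Staged {
    pub data: Vec<u8>,
    pub dirty: bool,
}

impl Staged {
    /// SETATTR with a new size: shrink, or grow with zero bytes.
    pub fn set_size(&mut self, size: usize) {
        self.data.resize(size, 0);
        self.dirty = true;
    }

    /// WRITE at an offset. Writing past the current end zero-fills the gap.
    /// Nothing touches the backing store here.
    pub fn write_at(&mut self, offset: usize, bytes: &[u8]) -> usize {
        let end = offset + bytes.len();
        if self.data.len() < end {
            self.data.resize(end, 0);
        }
        self.data[offset..end].copy_from_slice(bytes);
        self.dirty = true;
        bytes.len()
    }
}

/// Name for a newly allocated blob, e.g. 1 -> "blob-000000000001".
pub fn blob_name(id: u64) -> String {
    format!("blob-{:012}", id)
}
```

On commit, the staged bytes would be written to a file named by blob_name and only then referenced from metadata, so the previous blob is never modified.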

The example ends up with a small copy-on-write data path. Rewriting one byte of a large file should not require rewriting the whole file, so a more complete implementation would chunk the file, track dirty chunks, write only the changed chunks, and then commit a metadata update that points at the new chunk list. magicfs skips that complexity by assuming the files are small enough to rewrite as a unit.
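magicfs does not do this, but the dirty-chunk bookkeeping a chunked design would need starts with a small calculation: which fixed-size chunks does a write touch? The function name and the fixed chunk size are assumptions for illustration.

```rust
use std::ops::Range;

/// Indexes of the fixed-size chunks touched by a write of `len` bytes at
/// `offset`. The range is half-open, so `0..2` means chunks 0 and 1.
pub fn dirty_chunks(offset: u64, len: u64, chunk_size: u64) -> Range<u64> {
    assert!(chunk_size > 0, "chunk size must be non-zero");
    if len == 0 {
        return 0..0; // an empty write dirties nothing
    }
    let first = offset / chunk_size;
    let last = (offset + len - 1) / chunk_size; // chunk holding the final byte
    first..last + 1
}
```

With 4096-byte chunks, a one-byte write at offset 0 dirties only chunk 0, while a two-byte write at offset 4095 straddles chunks 0 and 1.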

Write is not sync

A shell command like this looks simpler than the filesystem work behind it:

$ echo "remember the milk" > /magic/notes.txt

Inside magicfs, the work is closer to this:

OPEN notes.txt for writing
SETATTR notes.txt size=0
WRITE bytes at offset 0
FLUSH because a file descriptor is closing
write content blob
replace metadata.json
RELEASE the open file

On a normal Linux filesystem, write(2) usually means the kernel accepted the bytes into memory, not that the bytes necessarily reached stable storage. fsync(2) is the call an application uses when it wants the file data, along with the metadata needed to retrieve that data, flushed to the storage device, while fdatasync(2) is similar but can skip metadata that is not needed for a later read.
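From the application side, the difference looks like this in Rust, where File::sync_all corresponds to fsync and File::sync_data to fdatasync on Linux. append_durably is a hypothetical helper written for this illustration.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append bytes, then ask for durability.
/// `data_only = true` uses sync_data; otherwise sync_all.
pub fn append_durably(path: &Path, bytes: &[u8], data_only: bool) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;

    // A successful write_all means the kernel accepted the bytes, typically
    // into memory. It says nothing about stable storage yet.
    file.write_all(bytes)?;

    // Only the sync call asks for the bytes to reach the storage device.
    if data_only {
        file.sync_data()
    } else {
        file.sync_all()
    }
}
```

On a FUSE mount, the write_all ends up as WRITE requests and the sync call ends up as FSYNC, so the filesystem sees the same distinction the application made.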

FUSE also calls the filesystem when a file descriptor closes. Flush is called on every close, so duplicated file descriptors mean one open file can see more than one flush. A filesystem can use flush to report delayed write errors, but flush does not mean the same thing as fsync, and release happens later still, when the kernel is done with the open file handle.

For the shell demo, magicfs commits staged bytes on both FLUSH and FSYNC, which makes echo hello > /magic/notes.txt behave the way a person expects, while the code still treats fsync as the explicit request for durable file data and metadata. A database that calls fsync is asking a more specific question than a shell that happened to close a redirected file. If the backing blob write fails after WRITE already returned success, the filesystem still has to decide where that error can be reported: either through a later fsync or through a close-time error from flush. Plenty of programs are not careful about checking close errors.
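One way to model "commit on FLUSH or FSYNC, report errors there, and tolerate repeated flushes" is a dirty flag around a shared commit routine. This is a sketch under assumptions, not the magicfs implementation: OpenFile is a hypothetical type, and the backing store is a closure so that a failure can be simulated.

```rust
use std::io;

/// Hypothetical per-open-file state.
pub struct OpenFile {
    staged: Vec<u8>,
    dirty: bool,
}

impl OpenFile {
    pub fn new() -> Self {
        OpenFile { staged: Vec::new(), dirty: false }
    }

    /// WRITE only stages bytes, so it cannot report a backing-store failure.
    pub fn write(&mut self, bytes: &[u8]) {
        self.staged.extend_from_slice(bytes);
        self.dirty = true;
    }

    /// Shared by FLUSH and FSYNC. Returns Ok(true) if a commit happened and
    /// Ok(false) if there was nothing to do, which covers a second flush
    /// from a duplicated file descriptor. A failed commit leaves the file
    /// dirty, so the error surfaces here and a later call tries again.
    pub fn commit<F>(&mut self, mut store: F) -> io::Result<bool>
    where
        F: FnMut(&[u8]) -> io::Result<()>,
    {
        if !self.dirty {
            return Ok(false);
        }
        store(&self.staged)?;
        self.dirty = false;
        Ok(true)
    }
}
```

Whether a failed commit should stay dirty and retry, or drop the data and report the error once, is a policy choice this sketch makes arbitrarily.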

For metadata, replacing a file with rename is atomic for readers, but atomic replacement is not the same thing as durability after power loss. If you care that the new metadata.json survives a crash, you need to sync the new file contents and the directory entry that points at it. magicfs handles this for its local store by syncing the temporary metadata file before the rename, then syncing the store directory after it.

In code, those rules show up in the order of blob writes, metadata replacement, flush, and fsync, because the filesystem has to decide which bytes exist, which names point at them, and what an application is allowed to assume after a successful sync.

FUSE replies can include time-to-live values for names and attributes, and until those TTLs expire, the kernel can answer repeated lookups and getattr calls without asking the userspace process again, which matters because crossing from the kernel into a userspace filesystem on every stat would be expensive.

The same TTL also affects correctness. magicfs uses a one-second TTL, which is fine for a single-process demo, but if another process or another machine can update the same backing store, a reader may see an old file size or an old blob ID until the cache expires, unless the filesystem actively invalidates the kernel's cached state.
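The staleness window can be modeled in a few lines. This is a toy model of the caching behavior, not of the kernel's actual data structures. AttrCache is a hypothetical name, and time is passed in explicitly so the behavior is deterministic.

```rust
use std::time::{Duration, Instant};

struct CachedSize {
    size: u64,
    expires: Instant,
}

/// Toy stand-in for the kernel's attribute cache for a single inode.
pub struct AttrCache {
    ttl: Duration,
    entry: Option<CachedSize>,
}

impl AttrCache {
    pub fn new(ttl: Duration) -> Self {
        AttrCache { ttl, entry: None }
    }

    /// Return the file size as of `now`. `fetch` models a GETATTR round trip
    /// to the userspace filesystem and only runs when no unexpired entry exists.
    pub fn size_at<F: FnOnce() -> u64>(&mut self, now: Instant, fetch: F) -> u64 {
        if let Some(cached) = &self.entry {
            if now < cached.expires {
                return cached.size; // served from cache, possibly stale
            }
        }
        let size = fetch();
        self.entry = Some(CachedSize { size, expires: now + self.ttl });
        size
    }

    /// Models an explicit invalidation from the filesystem.
    pub fn invalidate(&mut self) {
        self.entry = None;
    }
}
```

If the backing file grows from 18 to 25 bytes half a second after the first lookup, a one-second TTL keeps answering 18 until the entry expires or is invalidated.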

For file contents, magicfs opens files with FUSE direct I/O so reads come back to the userspace filesystem instead of being served from the page cache. That keeps the example easier to reason about but gives up the caching and read-ahead that a real filesystem would probably want. The cache policy matters because it changes which file size, inode attributes, and file contents callers are able to observe.

Shortcomings I kept

The implementation only supports one directory, and each file is stored as one local blob, so rewriting a byte rewrites the whole file. There is no journal, recovery scan, or cleanup for orphaned blobs left behind by rewrites or unlinks. It also does not implement locking, mmap, extended attributes, a real permission model, sparse files, hard links, symlinks, or multi-client cache invalidation.

The filesystem also does not model the problems that show up when the backing layer is remote. Network failures, remote consistency rules, retries, and authentication all change when reads can succeed, when writes can be retried, and what fsync can honestly report. This example stays on local disk so the post can focus on filesystem calls.

A journal or transaction log would let recovery decide whether a metadata update committed, chunking would avoid rewriting whole files, a garbage collector would find blobs no metadata entry can reach, and better cache invalidation would keep multiple readers from seeing stale metadata for too long.
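The garbage-collection piece is the simplest of these to sketch: the orphans are whatever exists under blobs/ but is not referenced by any metadata entry. The function below is a hypothetical pure helper that takes both sets as input and leaves the directory listing and deletion to the caller.

```rust
use std::collections::BTreeSet;

/// Blob IDs present on disk that no metadata entry references,
/// returned in sorted order.
pub fn orphaned_blobs(on_disk: &BTreeSet<String>, referenced: &BTreeSet<String>) -> Vec<String> {
    on_disk
        .iter()
        .filter(|blob| !referenced.contains(*blob))
        .cloned()
        .collect()
}
```

A real collector would also have to avoid racing with an in-flight commit, because a blob that has been written but not yet referenced from metadata.json looks exactly like an orphan.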

With FUSE, Linux asks the filesystem a fixed collection of questions, and the implementation can answer from whatever backing store it owns. The implementation still has to define what lookup, write, flush, fsync, and rename mean when metadata and file contents are stored somewhere else.

I am working on these filesystem, sandboxing, and storage problems at Tines, along with plenty of adjacent systems work that gets deeper than a blog post can. If that sounds interesting, we are hiring.
