File System Modules

To support a particular file system (FS), a kernel module implementing a special interface (file_system_module_info, defined in <fs_interface.h>) has to be provided. As for any other module, the std_ops() hook is invoked with B_MODULE_INIT directly after the FS module has been loaded by the kernel and with B_MODULE_UNINIT before it is unloaded, thus providing a simple mechanism for one-time module initialization. The same module is used for accessing any volume of that FS type.
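The init/uninit pairing described above can be sketched as follows. This is only an illustration: the typedef and the constant values are stand-ins so the fragment compiles outside the kernel (a real module takes them from the kernel headers), and the function name and the flag are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles on its own; the numeric values are
   assumptions, a real module uses the kernel's definitions. */
typedef int32_t status_t;
enum { B_OK = 0, B_ERROR = -1 };
enum { B_MODULE_INIT = 1, B_MODULE_UNINIT = 2 };

/* Hypothetical module-wide state, shared by all volumes of this FS type. */
static bool sInitialized = false;

/* Hypothetical std_ops() hook of an example FS module. */
static status_t
example_fs_std_ops(int32_t op, ...)
{
	switch (op) {
		case B_MODULE_INIT:
			/* invoked once, directly after the module has been loaded */
			assert(!sInitialized);
			sInitialized = true;
			return B_OK;
		case B_MODULE_UNINIT:
			/* invoked once, before the module is unloaded */
			assert(sInitialized);
			sInitialized = false;
			return B_OK;
		default:
			return B_ERROR;
	}
}
```

Per-volume state does not belong here; it would be allocated in the mount() hook instead, since the same module serves every volume of its type.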

File System Objects

There are several types of objects a FS module has to deal with directly or indirectly:

  • A volume is an instance of a file system. For a disk-based file system it corresponds to a disk, partition, or disk image file. When mounting a volume, the virtual file system layer (VFS) assigns a unique number (ID, of type dev_t) to it, and the file system provides a handle (type void*). The VFS creates an instance of struct fs_volume that stores these two, an operation vector (fs_volume_ops), and other volume-related items. Whenever the FS is asked to perform an operation, the fs_volume object is supplied, and whenever the FS requests a volume-related service from the kernel, it also has to pass the fs_volume object or, in some cases, just the volume ID. Normally the handle is a pointer to a data structure the FS allocates to associate data with the volume.
  • A node is contained by a volume. It can be of type file, directory, or symbolic link (symlink). Just like volumes, nodes are associated with an ID (type ino_t) and, if in use, also with a handle (type void*). As for volumes, the VFS creates an instance of a structure (fs_vnode) for each node in use, storing the FS's handle for the node and an operation vector (fs_vnode_ops). Unlike the volume ID, the node ID is defined by the FS. It often has a meaning to the FS, e.g. file systems using inodes might choose the inode number corresponding to the node. As long as the volume is mounted and the node is known to the VFS, its node ID must not change. The node handle is again a pointer to a data structure allocated by the FS.
  • A vnode (VFS node) is the VFS representation of a node. A volume may contain a great number of nodes, but at any given time only a few are represented by vnodes, usually only those that are currently in use (sometimes a few more).
  • An entry (directory entry) belongs to a directory, has a name, and refers to a node. It is important to understand the difference between entries and nodes: A node doesn't have a name; only the entries that refer to it do. If a FS allows more than one entry to refer to a single node, it is said to support "hard links". It is possible that no entry refers to a node. This happens when a node (e.g. a file) is still open, but the last entry referring to it has been removed (the node will be deleted when it is closed). While entries are to be understood as independent entities, the FS interface does not use IDs or handles to refer to them; it always uses directory and entry name pairs to do that.
  • An attribute is a named and typed data container belonging to a node. A node may have any number of attributes; they are organized in a (depending on the FS, virtual or actually existing) attribute directory, through which one can iterate.
  • An index is supposed to provide fast searching capabilities for attributes with a certain name. A volume's index directory allows for iterating through the indices.
  • A query is a fully virtual object for searching for entries via an expression matching entry name, node size, node modification date, and/or node attributes. The mechanism for retrieving the entries found by a query is similar to that for reading a directory's contents. A query can be live, in which case the creator of the query is notified by the FS whenever an entry starts or stops matching the query expression.
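The relationship between entries and nodes can be made concrete with a small model. The sketch below is not part of the FS interface; all names are hypothetical. It only illustrates that a node carries no name, that several entries may refer to it, and that a node without entries lives on until it is closed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-memory node. Note that it has no name field: names
   belong to the directory entries referring to the node. */
struct toy_node {
	int link_count;  /* number of entries currently referring to the node */
	int open_count;  /* number of times the node is currently open */
	bool deleted;    /* set once the node's storage has been released */
};

/* Delete the node once no entry refers to it and nobody has it open. */
static void
toy_maybe_delete(struct toy_node* node)
{
	if (node->link_count == 0 && node->open_count == 0)
		node->deleted = true;
}

/* A new entry (the first one or an additional hard link) refers to the node. */
static void
toy_add_entry(struct toy_node* node)
{
	node->link_count++;
}

/* An entry referring to the node has been removed. */
static void
toy_remove_entry(struct toy_node* node)
{
	assert(node->link_count > 0);
	node->link_count--;
	toy_maybe_delete(node);
}

static void
toy_open(struct toy_node* node)
{
	node->open_count++;
}

static void
toy_close(struct toy_node* node)
{
	assert(node->open_count > 0);
	node->open_count--;
	toy_maybe_delete(node);
}
```

Removing the last entry of an open node leaves it intact in this model; only the final toy_close() marks it deleted.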

Generic Concepts

A FS module has to (or can) provide quite a lot of hook functions. There are a few concepts that apply to several groups of them:

  • Opening, Closing, and Cookies: Many FS objects can be opened and closed, namely nodes in general, directories, attribute directories, attributes, the index directory, and queries. In each case there are three hook functions: open*(), close*(), and free*_cookie(). The open*() hook is passed all that is needed to identify the object to be opened and, in some cases, additional parameters, e.g. specifying a particular opening mode. The implementation is required to return a cookie (type void*), usually a pointer to a data structure the FS allocates. In some cases (e.g. when an iteration state is associated with the cookie) a new cookie must be allocated for each instance of opening the object. The cookie is passed to all hooks that operate on an object opened this way. The close*() hook is invoked to signal that the cookie is to be closed. At this point the cookie might still be in use. Blocking FS hooks (e.g. blocking read/write operations) using the same cookie have to be unblocked. When the cookie is no longer in use, the free*_cookie() hook is called; it has to free the cookie.
  • Entry Iteration: For the FS objects serving as containers for other objects, i.e. directories, attribute directories, the index directory, and queries, the cookie mechanism is used for a stateful iteration through the contained objects. The read_*() hook reads the next one or more entries into a struct dirent buffer. The rewind_*() hook resets the iteration state to the first entry.
  • Stat Information: In the case of nodes, attributes, and indices, detailed information about an object is requested via a read*_stat() hook and must be written into a struct stat buffer.
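The cookie and iteration concepts above can be sketched with a toy directory. The function names and signatures below are hypothetical and much simpler than the real hooks (which receive the volume and vnode objects and fill a struct dirent buffer); the sketch only shows why each opening needs its own cookie when iteration state is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical directory with a fixed list of entry names. */
struct toy_dir {
	const char* const* names;
	size_t count;
};

/* Hypothetical cookie: holds the iteration state of one opening. */
struct toy_dir_cookie {
	size_t next;  /* index of the next entry to return */
};

/* open_dir(): allocate a fresh cookie for this opening. */
static void*
toy_open_dir(const struct toy_dir* dir)
{
	struct toy_dir_cookie* cookie;

	assert(dir != NULL);
	cookie = malloc(sizeof(*cookie));
	if (cookie != NULL)
		cookie->next = 0;
	return cookie;
}

/* read_dir(): return the next entry name, or NULL at the end. */
static const char*
toy_read_dir(const struct toy_dir* dir, void* _cookie)
{
	struct toy_dir_cookie* cookie = _cookie;

	if (cookie->next >= dir->count)
		return NULL;
	return dir->names[cookie->next++];
}

/* rewind_dir(): reset the iteration state to the first entry. */
static void
toy_rewind_dir(void* _cookie)
{
	struct toy_dir_cookie* cookie = _cookie;
	cookie->next = 0;
}

/* close_dir(): nothing blocks in this sketch, so there is nothing to do. */
static void
toy_close_dir(void* cookie)
{
	(void)cookie;
}

/* free_dir_cookie(): the cookie is no longer in use; release it. */
static void
toy_free_dir_cookie(void* cookie)
{
	free(cookie);
}
```

Two cookies obtained from toy_open_dir() for the same directory iterate independently, which is exactly the reason a shared cookie would not work here.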

VNodes

A vnode is the VFS representation of a node. As soon as an access to a node is requested, the VFS creates a corresponding vnode. The requesting entity gets a reference to the vnode for the time it works with the vnode and releases the reference when done. When the last reference to a vnode has been surrendered, the vnode is unused and the VFS can decide to destroy it (usually it is cached for a while longer).

When the VFS creates a vnode, it invokes the volume's get_vnode() hook to let it create the respective node handle (unless the FS requests the creation of the vnode explicitly by calling publish_vnode()). That's the only hook that specifies a node by ID; all other node-related hooks are defined in the respective node's operation vector and they are passed the respective fs_vnode object. When the VFS deletes the vnode, it invokes the node's put_vnode() hook or, if the node was marked removed, remove_vnode().

There are only four FS hooks through which the VFS gains knowledge of the existence of a node. The first one is the mount() hook. It is supposed to call publish_vnode() for the root node of the volume and return its ID. The second one is the lookup() hook. Given a fs_vnode object of a directory and an entry name, it is supposed to call get_vnode() for the node the entry refers to and return the node ID. The remaining two hooks, read_dir() and read_query(), both return entries in a struct dirent structure, which also contains the ID of the node the entry refers to.
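The vnode life cycle described in this section can be modeled in a few lines. The following is a simplified sketch, not the VFS implementation: all names are hypothetical, the counters stand in for the FS's get_vnode(), put_vnode(), and remove_vnode() hooks, and unused vnodes are destroyed immediately, whereas the real VFS usually caches them for a while.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TOY_MAX_VNODES = 8 };

/* Hypothetical, heavily simplified vnode. */
struct toy_vnode {
	long id;        /* node ID, defined by the FS */
	int ref_count;  /* references currently held */
	bool in_use;    /* slot occupied */
	bool removed;   /* node was marked removed */
};

static struct toy_vnode sVnodes[TOY_MAX_VNODES];

/* Counters standing in for the FS hooks the VFS would invoke. */
static int sGetCalls, sPutCalls, sRemoveCalls;

static struct toy_vnode*
toy_find(long id)
{
	for (size_t i = 0; i < TOY_MAX_VNODES; i++) {
		if (sVnodes[i].in_use && sVnodes[i].id == id)
			return &sVnodes[i];
	}
	return NULL;
}

/* Get a reference to the vnode for a node, creating the vnode if needed. */
static struct toy_vnode*
toy_acquire(long id)
{
	struct toy_vnode* vnode = toy_find(id);

	if (vnode == NULL) {
		for (size_t i = 0; i < TOY_MAX_VNODES && vnode == NULL; i++) {
			if (!sVnodes[i].in_use)
				vnode = &sVnodes[i];
		}
		if (vnode == NULL)
			return NULL;  /* table full */
		vnode->id = id;
		vnode->ref_count = 0;
		vnode->in_use = true;
		vnode->removed = false;
		sGetCalls++;  /* the FS's get_vnode() hook would run here */
	}
	vnode->ref_count++;
	return vnode;
}

/* Surrender a reference; the last one destroys the vnode in this sketch. */
static void
toy_release(struct toy_vnode* vnode)
{
	assert(vnode->ref_count > 0);
	if (--vnode->ref_count > 0)
		return;
	if (vnode->removed)
		sRemoveCalls++;  /* remove_vnode() would run here */
	else
		sPutCalls++;     /* put_vnode() would run here */
	vnode->in_use = false;
}
```

Acquiring the same ID twice yields the same vnode and only one get_vnode() call; the put_vnode() or remove_vnode() call happens when the last reference goes away.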

Mandatory Hooks

Which hooks a FS module should provide mainly depends on what functionality it features. E.g. a FS without support for attributes, indices, and/or queries can omit the respective hooks (i.e. set them to NULL in the module, fs_volume_ops, and fs_vnode_ops structures). Some hooks are mandatory, though. A minimal read-only FS module must implement:

  • mount() and unmount(): Mounting and unmounting a volume is required for pretty obvious reasons.
  • lookup(): The VFS uses this hook to resolve path names. It is probably one of the most frequently invoked hooks.
  • get_vnode() and put_vnode(): Create and destroy, respectively, the FS's private node handle when the VFS creates/deletes the vnode for a particular node.
  • read_stat(): Return a struct stat info for the given node, consisting of the type and size of the node, its owner and access permissions, as well as certain access times.
  • open(), close(), and free_cookie(): Open and close a node as explained in Generic Concepts.
  • read(): Read data from an opened node (file). Even if the FS does not feature files, the hook has to be present anyway; it should return an error in this case.
  • open_dir(), close_dir(), and free_dir_cookie(): Open and close a directory for entry iteration as explained in Generic Concepts.
  • read_dir() and rewind_dir(): Read the next entry/entries from a directory and reset the iterator to the first entry, respectively, as explained in Generic Concepts.

Although not strictly mandatory, a FS should additionally implement the following hooks:

  • read_fs_info(): Return general information about the volume, e.g. total and free size, and what special features (attributes, MIME types, queries) the volume/FS supports.
  • read_symlink(): Read the value of a symbolic link. Needed only if the FS and volume support symbolic links at all. If absent, symbolic links stored on the volume won't be interpreted.
  • access(): Return whether the current user has the given access permissions for a node. If the hook is absent, the user is considered to have all permissions.
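How a caller can treat an omitted (NULL) hook is sketched below for access(). The structure, the hook signatures, and the constants are hypothetical and far smaller than the real fs_vnode_ops; the sketch only illustrates the rule that an absent access() hook means all permissions are granted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical permission bits and status codes. */
enum { TOY_READ = 4, TOY_WRITE = 2 };
enum { TOY_OK = 0, TOY_NOT_ALLOWED = -1 };

/* Hypothetical, tiny operation vector; optional hooks may be NULL. */
struct toy_vnode_ops {
	int (*read_symlink)(char* buffer, size_t size);
	int (*access)(int mode);
};

/* Caller side: an absent access() hook grants all permissions. */
static int
toy_check_access(const struct toy_vnode_ops* ops, int mode)
{
	assert(ops != NULL);
	if (ops->access == NULL)
		return TOY_OK;
	return ops->access(mode);
}

/* Example hook of a read-only FS: deny any request that includes writing. */
static int
toy_read_only_access(int mode)
{
	return (mode & TOY_WRITE) != 0 ? TOY_NOT_ALLOWED : TOY_OK;
}
```

An FS that never stores symlinks and does not restrict access would leave both members NULL; one that restricts writing would fill in only access().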

Checking Access Permission

While there is the access() hook that explicitly checks access permission for a node, it is not used by the VFS to check access permissions for the other hooks. This has two reasons: It could be cheaper for the FS to do that in the respective hook (at least it's not more expensive), and the FS can make sure that there are no race conditions between the check and the start of the operation for the hook. The downside is that in most hooks the FS has to check those permissions. It is possible to simplify things a bit, though:

  • For operations that require the file system object in question (node, directory, index, attribute, attribute directory, query) to be open, most of the checks can already be done in the respective open*() hook. E.g. in fs_vnode_ops::read() or fs_vnode_ops::write() one only has to check whether the file has been opened for reading/writing, not whether the current process has the respective permissions.
  • The core of the fs_vnode_ops::access() hook can be moved into a private function that can be easily reused in other hooks to check the permissions for the respective operations. In most cases this will reduce permission checking to one or two additional "if"s in the hooks where it is required.
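Both simplifications can be sketched together. The types, constants, and function names below are hypothetical and the hook signatures are reduced to the bare minimum; the point is only that one private helper backs both access() and open(), while write() merely inspects the open mode stored in the cookie.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical permission bits and status codes. */
enum { TOY_READ = 4, TOY_WRITE = 2 };
enum { TOY_OK = 0, TOY_NOT_ALLOWED = -1 };

/* Hypothetical node: the bits the current user is granted. */
struct toy_node {
	int granted;
};

/* Hypothetical cookie: remembers the mode the node was opened with. */
struct toy_cookie {
	int open_mode;
};

/* Private helper holding the core of the permission check. */
static int
toy_check_permissions(const struct toy_node* node, int wanted)
{
	return (node->granted & wanted) == wanted ? TOY_OK : TOY_NOT_ALLOWED;
}

/* access() hook: just forwards to the helper. */
static int
toy_access(const struct toy_node* node, int mode)
{
	return toy_check_permissions(node, mode);
}

/* open() hook: the full permission check happens once, here. */
static int
toy_open(const struct toy_node* node, int open_mode,
	struct toy_cookie* cookie)
{
	int status = toy_check_permissions(node, open_mode);

	assert(cookie != NULL);
	if (status != TOY_OK)
		return status;
	cookie->open_mode = open_mode;
	return TOY_OK;
}

/* write() hook: only checks whether the node was opened for writing. */
static int
toy_write(const struct toy_cookie* cookie)
{
	return (cookie->open_mode & TOY_WRITE) != 0 ? TOY_OK : TOY_NOT_ALLOWED;
}
```

Because the check in toy_open() and the recording of the open mode happen in the same hook, there is no window between checking and starting the operation.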

Node Monitoring

One of the nice features of Haiku's API is an easy way to monitor directories or nodes for changes. That is, one can register for watching a given node for certain modification events and will get a notification message whenever one of those events occurs. While other parts of the operating system do the actual notification message delivery, it is the responsibility of each file system to announce changes. It has to use the following functions to do that:

  • notify_entry_created(): A directory entry has been created.
  • notify_entry_removed(): A directory entry has been removed.
  • notify_entry_moved(): A directory entry has been renamed and/or moved to another directory.
  • notify_stat_changed(): One or more members of the stat data for a node have changed. E.g. the st_size member changes when the file is truncated or data have been written to it beyond its former size. The modification time (st_mtime) changes whenever a node is write-accessed. To avoid a flood of messages for small and frequent write operations on an open file, the file system can limit the number of notifications and mark them with the B_WATCH_INTERIM_STAT flag. When closing a modified file, a notification without that flag should be issued.
  • notify_attribute_changed(): An attribute of a node has been added, removed, or changed.

If the file system supports queries, it needs to call the following functions to make live queries work:

Caches

The Haiku kernel provides three kinds of caches that can be used by a file system implementation to speed up file system operations:

  • Block cache: Interesting for disk-based file systems. The device the file system volume is located on is considered to be divided into equally sized blocks of data that can be accessed via the block cache API (e.g. block_cache_get() and block_cache_put()). As long as the system has enough memory, the block cache will keep all blocks that have been accessed in memory, thus allowing further accesses to be very fast. The block cache also has transaction support, which is of interest for journaled file systems.
  • File cache: Stores file contents. The FS can decide to create a file cache for any of its files. The fs_vnode_ops::read() and fs_vnode_ops::write() hooks can then simply be implemented by calling the file_cache_read() and file_cache_write() functions, respectively, which will read the data from/write the data to the file cache. For reading uncached data or writing back cached data to the file, the file cache will invoke the fs_vnode_ops::io() hook. Only files for which the file cache is used can be memory mapped (cf. mmap()).
  • Entry cache: Can be used to speed up resolving paths. Normally the VFS will call the fs_vnode_ops::lookup() hook for each element of the path to be resolved, which, depending on the file system, can be more or less expensive. When the FS uses the entry cache, those calls will be avoided most of the time. All the file system has to do is invoke the entry_cache_add() function when it encounters an entry that might not yet be known to the entry cache and entry_cache_remove() when a directory entry has been removed. The entry cache can also be used for negative caching. If the file system determines that the requested entry is not present during a lookup, it can cache this lookup failure by calling entry_cache_add_missing(). Further calls to fs_vnode_ops::lookup() for the missing entry will then be avoided. Note that it is safe to call entry_cache_add() and entry_cache_add_missing() with the same directory/name pair previously given to either function to update a cache entry, without needing to call entry_cache_remove() first. It is also safe to call entry_cache_remove() for pairs that have never been added to the cache.
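The entry cache semantics described above (positive and negative entries, updating without prior removal, harmless removal of unknown pairs) can be illustrated with a toy model. This is not the kernel's entry cache and does not use its functions; every name below is hypothetical, and the fixed-size table is an arbitrary simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { TOY_CACHE_SIZE = 16, TOY_NAME_MAX = 32 };
enum toy_result { TOY_UNKNOWN, TOY_FOUND, TOY_MISSING };

/* Hypothetical cache slot, keyed by directory ID and entry name. */
struct toy_entry {
	bool used;
	long dir;
	char name[TOY_NAME_MAX];
	bool missing;  /* true: negative entry, the name does not exist */
	long node;     /* valid only if !missing */
};

static struct toy_entry sCache[TOY_CACHE_SIZE];

static struct toy_entry*
toy_find(long dir, const char* name)
{
	for (size_t i = 0; i < TOY_CACHE_SIZE; i++) {
		if (sCache[i].used && sCache[i].dir == dir
			&& strcmp(sCache[i].name, name) == 0)
			return &sCache[i];
	}
	return NULL;
}

/* Insert or overwrite the slot for a directory/name pair. */
static void
toy_store(long dir, const char* name, bool missing, long node)
{
	struct toy_entry* entry;

	if (strlen(name) >= TOY_NAME_MAX)
		return;  /* caching is optional, so oversized names are skipped */
	entry = toy_find(dir, name);
	for (size_t i = 0; entry == NULL && i < TOY_CACHE_SIZE; i++) {
		if (!sCache[i].used)
			entry = &sCache[i];
	}
	if (entry == NULL)
		return;  /* cache full; the lookup hook simply gets called again */
	entry->used = true;
	entry->dir = dir;
	strcpy(entry->name, name);
	entry->missing = missing;
	entry->node = node;
}

static void
toy_cache_add(long dir, const char* name, long node)
{
	toy_store(dir, name, false, node);
}

static void
toy_cache_add_missing(long dir, const char* name)
{
	toy_store(dir, name, true, 0);
}

/* Removing a pair that was never added is harmless. */
static void
toy_cache_remove(long dir, const char* name)
{
	struct toy_entry* entry = toy_find(dir, name);
	if (entry != NULL)
		entry->used = false;
}

/* TOY_UNKNOWN means the FS's lookup hook has to be asked. */
static enum toy_result
toy_cache_lookup(long dir, const char* name, long* node)
{
	const struct toy_entry* entry = toy_find(dir, name);

	assert(node != NULL);
	if (entry == NULL)
		return TOY_UNKNOWN;
	if (entry->missing)
		return TOY_MISSING;
	*node = entry->node;
	return TOY_FOUND;
}
```

Only a TOY_UNKNOWN result would make the caller fall back to the FS's lookup hook; both TOY_FOUND and TOY_MISSING answer the question from the cache.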