Ext3 Indexed Directory Lookup

Blog post by jvff on Mon, 2010-06-14 15:58

The first milestone of the ext3 implementation was to have read support. Since ext2 read support is already implemented, the only missing feature (as far as I can tell) for ext3 read implementation was support for indexed directories. In ext3, indexed directories use a tree structured called HTree. This tree has a fixed depth and its keys are file name hashes. Each node of the tree is a file system block inside the directory file (ie. linked by the directory i-node).

To allow backwards compatibility with ext2 tools, it uses a clever scheme to hide itself from ext2 mounts, representing each node as a fake and empty directory entry. This way the directory’s entries can be read normally (sequentially) if mounted as ext2, and can be read in indexed mode by ext3. Because it uses hash values, there is a possibility that a hash collision occurs. Therefore, another clever trick is used: the least significant bit of the hash key in the tree is used to flag if the node contains entries that were supposed to be in the previous node. However, this limits the hash value to 31 bits.

The leaf nodes contain the directory’s entries with the full file names of the directory’s files. They are arranged the same way as in ext2. Therefore, once we find the leaf node where the entry is supposed to be located, we simply traverse it sequentially until we find it or reach the end of the leaf node. If we do reach the end of the node, we proceed to the next leaf node if its parent node indicates it holds collision entries from the previous leaf.

One design consideration is to reuse most of the already existent sequential lookup code. Therefore, after traversing through the nodes and finding the leaf node, a DirectoryIterator already placed at that leaf node’s offset is returned, so it can continue performing the normal sequential search. If somehow the indexed lookup fails, a normal DirectoryIterator started at offset zero is returned, and the search fallbacks to the previous implementation without any more changes.

To isolate the new code, it was placed inside a new class: HTree. Initially, this class performed all the required operations: generate the hash key, lookup inside the nodes, and find the leaf node. However, this approach suffered from one major drawback. It didn’t stop searching until it searched through all of the directory’s entries, even if the index structure indicated that the entry wasn’t located anywhere else. Although it still works normally, it’s inefficient. The first attempt for a solution was to find a stop offset. This approach required some complicated recursive code to traverse the tree until no more collisions were probable, and that would mostly degrade performance, since most of the time the extra overhead would be useless.

The current approach offloads a lot of the work to outside of the main HTree class. Now the class is only responsible for generating the hash value (from the 3 hash functions that could be used), and initiating the index traversal. A class responsible for iterating through node entries was created, called HTreeEntryIterator. It handles most of the lookup code. It first does a binary search inside the node to find the child node which should contain the key. Once found, it instantiates a child iterator for it. This recursivness isn’t too detrimental to performance, because in ext3 the maximum tree depth is 2, which means at most two entry iterators are created (one for the root, and one for the optional indirect node). Currently, since the index is only read, no locking is necessary. However, when adding new directory entries, it will be necessary to lock the directory iterator, and possibly lock the entry iterators if there exists a need to split a node.

Once all the indirect blocks of the tree have been traversed (we reached a leaf node), a directory entry iterator is created and returned. However, this directory iterator is of a new class, derived from DirectoryIterator and called IndexedDirectoryIterator. This class handles having an initial and limit offset, and has access to its parent index node. When it reaches its maximum offset, it checks to see if the next leaf contains collisions from this leaf. This is performed by calling its parent, and asking for the next leaf node offset. The HTreeEntryIterator iterates to the next entry in the node, and returns the leaf’s starting offset. If the end of the node is reached, its parent entry iterator is called to iterate, in analogous manner. If the end of all entries is reached, the looked up file name wasn’t found. After reaching the next leaf node, the parent is asked if the entry has collisions from the previous entry, and we continue the search if it does, or return that the file name wasn’t found if it doesn’t.

The current design probably isn’t the fastest. And I don’t expect it to be near the performance of the implementation in Linux, which is more optimized. However, I hope the implementation is simple to understand, and the code is clean to read.

The code still needs to be tested. I’m also not sure if the hash generation functions work on big endian machines. Initially, I plan on testing the hash functions separately, in userland as an external test program. I also have a test ext3 image with a large indexed directory. It will be a nice initial test, to see if it’s possible to lookup all the files through the index. After write support is added (which consists only of adding entries in the index, and splitting the index entry nodes if necessary, if I recall correctly), a complete test will be written, populating a directory and then looking up the entries one-by-one. I’m still a little behind schedule, mostly because I still have a lot of graduation assignments and exams to finish next week, but I think I can get ahead of myself, unless something unexpected happens. Next milestone is initial journal support, which I have already some code to begin with from before the coding period. Hopefully I should be closer to the schedule by the next milestone.