Anatomy of the Linux virtual file system switch
Abstractions and high-level concepts
The flexibility and extensibility of support for Linux file systems is a direct result of an abstracted set of interfaces. At the core of that set of interfaces is the virtual file system switch (VFS).
The VFS provides a set of standard interfaces for upper-layer applications to perform file I/O over a diverse set of file systems. And it does it in a way that supports multiple concurrent file systems over one or more underlying devices. Additionally, these file systems need not be static but may come and go with the transient nature of the storage devices.
For example, a typical Linux desktop supports an ext3 file system on the available hard disk, as well as the ISO 9660 file system on an available CD-ROM (otherwise called the CD-ROM file system, or CDFS). As CD-ROMs are inserted and removed, the Linux kernel must adapt to these new file systems with different contents and structure. A remote file system can be accessed through the Network File System (NFS). At the same time, Linux can mount the NT File System (NTFS) partition of a Windows®/Linux dual-boot system from the local hard disk and read and write from it.
Finally, a removable USB flash drive (UFD) can be hot-plugged, providing yet another file system. All the while, the same set of file I/O interfaces can be used over these devices, permitting the underlying file system and physical device to be abstracted away from the user (see Figure 1).
Figure 1. An abstraction layer provides a uniform interface over different file systems and storage devices
Now, let’s add some concrete architecture to the abstract features that the Linux VFS provides. Figure 2 shows a high-level view of the Linux stack from the point of view of the VFS. Above the VFS is the standard kernel system-call interface (SCI). This interface allows calls from user-space to transition to the kernel (in different address spaces). In this domain, a user-space application invoking the POSIX
open call passes through the GNU C library (
glibc) into the kernel and into system call de-multiplexing. Eventually, the VFS is invoked using the call
Figure 2. The layered architecture of the VFS
The VFS provides the abstraction layer, separating the POSIX API from the details of how a particular file system implements that behavior. The key here is that Open, Read, Write, or Close API system calls work the same regardless of whether the underlying file system is ext3 or Btrfs. VFS provides a common file model that the underlying file systems inherit (they must implement behaviors for the various POSIX API functions). A further abstraction, outside of the VFS, hides the underlying physical device (which could be a disk, partition of a disk, networked storage entity, memory, or any other medium able to store information—even transiently).
In addition to abstracting the details of file operations from the underlying file systems, VFS ties the underlying block devices to the available file systems. Let’s now look at the internals of the VFS to see how this works.
Before looking at the overall architecture of the VFS subsystem, let’s have a look at the major objects that are used. This section explores the superblock, the index node (or inode), the directory entry (or dentry), and finally, the file object. Some additional elements, such as caches, are also important here, and I explore these later in the overall architecture.
The superblock is the container for high-level metadata about a file system. The superblock is a structure that exists on disk (actually, multiple places on disk for redundancy) and also in memory. It provides the basis for dealing with the on-disk file system, as it defines the file system’s managing parameters (for example, total number of blocks, free blocks, root index node).
On disk, the superblock provides information to the kernel on the structure of the file system on disk. In memory, the superblock provides the necessary information and state to manage the active (mounted) file system. Because Linux supports multiple concurrent file systems mounted at the same time, each
super_block structure is maintained in a list (
super_blocks, defined in ./linux/fs/super.c, with the structure defined in /linux/include/fs/fs.h).
Figure 3 provides a simplified view of the superblock and its elements. The
super_block structure refers to a number of other structures that encapsulate other information. The
file_system_type structure, for example, maintains the name of the file system (such as ext3) as well as various locks and functions to get and remove the
file_system_type object is managed through the well-known
register_file system and
unregister_file system functions (see ./linux/fs/file systems.c). The
super_operations structure defines a number of functions for reading and writing inodes as well as higher-level operations (such as remounting). The root directory entry (
dentry) object is cached here also, as is the block device on which this file system resides. Finally, a number of lists are provided for managing inodes, including
s_inodes (a list of all inodes),
s_dirty (a list of all dirty inodes),
s_more_io (parked for writeback), and
s_files (the list of all opened files for a given file system).
Figure 3. Simplified view of the super_block structure and its related elements
Note that within the kernel, another management object called
vfsmount provides information on mounted file systems. The list of these objects refers to the superblock and defines the mount point, name of the /dev device on which this file system resides, and other higher-level attachment information.
The index node (inode)
Linux manages all objects in a file system through an object called an inode (short for index node). An inode can refer to a file or a directory or a symbolic link to another object. Note that because files are used to represent other types of objects, such as devices or memory, inodes are used to represent them also.
Note that the inode I refer to here is the VFS layer inode (in-memory inode). Each file system also includes an inode that lives on disk and provides details about the object specific to the particular file system.
VFS inodes are allocated using the slab allocator (from the
inode_cache; see resources on the right for a link to more information on the slab allocator). The inode consists of data and operations that describe the inode, its contents, and the variety of operations that are possible on it. Figure 4 is a simple illustration of a VFS inode consisting of a number of lists, one of which refers to the dentries that refer to this inode. Object-level metadata is included here, consisting of the familiar manipulation times (create time, access time, modify time), as are the owner and permission data (group-id, user-id, and permissions). The inode refers to the file operations that are possible on it, most of which map directly to the system-call interfaces (for example,
flush). There is also a reference to inode-specific operations (
mkdir, and so on). Finally, there’s a structure to manage the actual data for the object that is represented by an address space object. An address space object is an object that manages the various pages for the inode within the page cache. The address space object is used to manage the pages for a file and also for mapping file sections into individual process address spaces. The address space object comes with its own set of operations (
releasepage, and so on).
Figure 4. Simplified representation of the VFS inode
Note that all of this information can be found in ./linux/include/linux/fs.h.
Directory entry (dentry)
The hierarchical nature of a file system is managed by another object in VFS called a dentry object. A file system will have one root dentry (referenced in the superblock), this being the only dentry without a parent. All other dentries have parents, and some have children. For example, if a file is opened that’s made up of /home/user/name, four dentry objects are created: one for the root
/, one for the
home entry of the root directory, one for the
name entry of the
user directory, and finally, one dentry for the
name entry of the user directory. In this way, dentries map cleanly into the hierarchical file systems in use today.
The dentry object is defined by the dentry structure (in ./linux/include/fs/dcache.h). It consists of a number of elements that track the relationship of the entry to other entries in the file system as well as physical data (such as the file name). A simplified view of the dentry object is shown in Figure 5. The dentry refers to the
super_block, which defines the particular file system instance in which this object is contained. Next is the parent dentry (parent directory) of the object, followed by the children dentries contained within a list (if the object happens to be a directory). The operations for a dentry are then defined (consisting of operations such as
release, and so on). The name of the object is then defined, which is kept here in the dentry instead of the inode itself. Finally, a reference is provided to the VFS inode.
Figure 5. Simplified representation of the dentry object
Note that the dentry objects exist only in file system memory and are not stored on disk. Only file system inodes are stored permanently, where dentry objects are used to improve performance. You can see the full description of the dentry structure in ./linux/include/dcache.h.
For each opened file in a Linux system, a
file object exists. This object contains information specific to the open instance for a given user. A very simplified view of the file object is provided in Figure 6. As shown, a
path structure provides reference to both the
vfsmount. A set of file operations is defined for each file, which are the well-known file operations (
flush, and so on). A set of flags and permissions is defined (including group and owner). Finally, stateful data is defined for the particular file instance, such as the current offset into the file.
Figure 6. Simplified representation of the file object
Now that I’ve reviewed the various important objects in the VFS layer, let’s look at how they relate in a single diagram. Because I’ve explored the object in a bottom-up fashion so far in this article, I now look at the reverse from the user perspective (see Figure 7).
At the top is the open
file object, which is referenced by a process’s file descriptor list. The
file object refers to a
dentry object, which refers to an
inode. Both the
dentry objects refer to the underlying
super_block object. Multiple file objects may refer to the same dentry (as in the case of two users sharing the same file). Note also in Figure 7 that a
dentry object refers to another
dentry object. In this case, a directory refers to file, which in turn refers to the inode for the particular file.
Figure 7. Relationships of major objects in the VFS
The VFS architecture
The internal architecture of the VFS is made up of a dispatching layer that provides the file system abstraction and a number of caches to improve the performance of file system operations. This section explores the internal architecture and how the major objects interact (see Figure 8).
Figure 8. High-level view of the VFS layer
The two major objects that are dynamically managed in the VFS include the
inode objects. These are cached to improve the performance of accesses to the underlying file systems. When a file is opened, the dentry cache is populated with entries representing the directory levels representing the path. An inode for the object is also created representing the file. The dentry cache is built using a hash table and is hashed by the name of the object. Entries for the dentry cache are allocated from the
dentry_cache slab allocator and use a least-recently-used (LRU) algorithm to prune entries when memory pressure exists. You can find the functions associated with the dentry cache in ./linux/fs/dcache.c (and ./linux/include/linux/dcache.h).
The inode cache is implemented as two lists and a hash table for faster lookup. The first list defines the inodes that are currently in use; the second list defines the inodes that are unused. Those inodes in use are also stored in the hash table. Individual inode cache objects are allocated from the
inode_cache slab allocator. You can find the functions associated with the inode cache in ./linux/fs/inode.c (and ./linux/include/fs.h). From the implementation today, the dentry cache is the master of the inode cache. When a
dentry object exists, an
inode object will also exist in the inode cache. Lookups are performed on the dentry cache, which result in an object in the inode cache.
This article has scratched the surface of the VFS, its approach, and objects used to provide uniform access to differing file systems. Linux is scalable, flexible, and extensible from subsystems such as this. The resources section provides details on where you can learn more.