Next Previous Contents
--------------------------------------------------------------------------------
3. Virtual Filesystem (VFS)
3.1 Inode Caches and Interaction with Dcache
In order to support multiple filesystems, Linux contains a special kernel interface level called VFS (Virtual Filesystem Switch). This is similar to the vnode/vfs interface found in SVR4 derivatives (originally it came from BSD and Sun original implementations).
Linux inode cache is implemented in a single file, fs/inode.c, which consists of 977 lines of code. It is interesting to note that not many changes have been made to it for the last 5-7 years: one can still recognise some of the code comparing the latest version with, say, 1.3.42.
The structure of Linux inode cache is as follows:
A global hashtable, inode_hashtable, where each inode is hashed by the value of the superblock pointer and 32bit inode number. Inodes without a superblock (inode->i_sb == NULL) are added to a doubly linked list headed by anon_hash_chain instead. Examples of anonymous inodes are sockets created by net/socket.c:sock_alloc(), by calling fs/inode.c:get_empty_inode().
A global type in_use list (inode_in_use), which contains valid inodes with i_count>0 and i_nlink>0. Inodes newly allocated by get_empty_inode() and get_new_inode() are added to the inode_in_use list.
A global type unused list (inode_unused), which contains valid inodes with i_count = 0.
A per-superblock type dirty list (sb->s_dirty) which contains valid inodes with i_count>0, i_nlink>0 and i_state & I_DIRTY. When inode is marked dirty, it is added to the sb->s_dirty list if it is also hashed. Maintaining a per-superblock dirty list of inodes allows to quickly sync inodes.
Inode cache proper - a SLAB cache called inode_cachep. As inode objects are allocated and freed, they are taken from and returned to this SLAB cache.
The type lists are anchored from inode->i_list, the hashtable from inode->i_hash. Each inode can be on a hashtable and one and only one type (in_use, unused or dirty) list.
All these lists are protected by a single spinlock: inode_lock.
The inode cache subsystem is initialised when inode_init() function is called from init/main.c:start_kernel(). The function is marked as __init, which means its code is thrown away later on. It is passed a single argument - the number of physical pages on the system. This is so that the inode cache can configure itself depending on how much memory is available, i.e. create a larger hashtable if there is enough memory.
The only stats information about inode cache is the number of unused inodes, stored in inodes_stat.nr_unused and accessible to user programs via files /proc/sys/fs/inode-nr and /proc/sys/fs/inode-state.
We can examine one of the lists from gdb running on a live kernel thus:
--------------------------------------------------------------------------------
(gdb) printf "%d ", (unsigned long)(&((struct inode *)0)->i_list)
8
(gdb) p inode_unused
$34 = 0xdfa992a8
(gdb) p (struct list_head)inode_unused
$35 = {next = 0xdfa992a8, prev = 0xdfcdd5a8}
(gdb) p ((struct list_head)inode_unused).prev
$36 = (struct list_head *) 0xdfcdd5a8
(gdb) p (((struct list_head)inode_unused).prev)->prev
$37 = (struct list_head *) 0xdfb5a2e8
(gdb) set $i = (struct inode *)0xdfb5a2e0
(gdb) p $i->i_ino
$38 = 0x3bec7
(gdb) p $i->i_count
$39 = {counter = 0x0}
--------------------------------------------------------------------------------
Note that we deducted 8 from the address 0xdfb5a2e8 to obtain the address of the struct inode (0xdfb5a2e0) according to the definition of list_entry() macro from include/linux/list.h.
To understand how inode cache works, let us trace a lifetime of an inode of a regular file on ext2 filesystem as it is opened and closed:
--------------------------------------------------------------------------------
fd = open("file", O_RDONLY);
close(fd);
--------------------------------------------------------------------------------
The open(2) system call is implemented in fs/open.c:sys_open function and the real work is done by fs/open.c:filp_open() function, which is split into two parts:
open_namei(): fills in the nameidata structure containing the dentry and vfsmount structures.
dentry_open(): given a dentry and vfsmount, this function allocates a new struct file and links them together; it also invokes the filesystem specific f_op->open() method which was set in inode->i_fop when inode was read in open_namei() (which provided inode via dentry->d_inode).
The open_namei() function interacts with dentry cache via path_walk(), which in turn calls real_lookup(), which invokes the filesystem specific inode_operations->lookup() method. The role of this method is to find the entry in the parent directory with the matching name and then do iget(sb, ino) to get the corresponding inode - which brings us to the inode cache. When the inode is read in, the dentry is instantiated by means of d_add(dentry, inode). While we are at it, note that for UNIX-style filesystems which have the concept of on-disk inode number, it is the lookup method's job to map its endianness to current CPU format, e.g. if the inode number in raw (fs-specific) dir entry is in little-endian 32 bit format one could do:
--------------------------------------------------------------------------------
unsigned long ino = le32_to_cpu(de->inode);
inode = iget(sb, ino);
d_add(dentry, inode);
--------------------------------------------------------------------------------
So, when we open a file we hit iget(sb, ino) which is really iget4(sb, ino, NULL, NULL), which does:
Attempt to find an inode with matching superblock and inode number in the hashtable under protection of inode_lock. If inode is found, its reference count (i_count) is incremented; if it was 0 prior to incrementation and the inode is not dirty, it is removed from whatever type list (inode->i_list) it is currently on (it has to be inode_unused list, of course) and inserted into inode_in_use type list; finally, inodes_stat.nr_unused is decremented.
If inode is currently locked, we wait until it is unlocked so that iget4() is guaranteed to return an unlocked inode.
If inode was not found in the hashtable then it is the first time we encounter this inode, so we call get_new_inode(), passing it the pointer to the place in the hashtable where it should be inserted to.
get_new_inode() allocates a new inode from the inode_cachep SLAB cache but this operation can block (GFP_KERNEL allocation), so it must drop the inode_lock spinlock which guards the hashtable. Since it has dropped the spinlock, it must retry searching the inode in the hashtable afterwards; if it is found this time, it returns (after incrementing the reference by __iget) the one found in the hashtable and destroys the newly allocated one. If it is still not found in the hashtable, then the new inode we have just allocated is the one to be used; therefore it is initialised to the required values and the fs-specific sb->s_op->read_inode() method is invoked to populate the rest of the inode. This brings us from inode cache back to the filesystem code - remember that we came to the inode cache when filesystem-specific lookup() method invoked iget(). While the s_op->read_inode() method is reading the inode from disk, the inode is locked (i_state = I_LOCK); it is unlocked after the read_inode() method returns and all the waiters for it are woken up.
Now, let's see what happens when we close this file descriptor. The close(2) system call is implemented in fs/open.c:sys_close() function, which calls do_close(fd, 1) which rips (replaces with NULL) the descriptor of the process' file descriptor table and invokes the filp_close() function which does most of the work. The interesting things happen in fput(), which checks if this was the last reference to the file, and if so calls fs/file_table.c:_fput() which calls __fput() which is where interaction with dcache (and therefore with inode cache - remember dcache is a Master of inode cache!) happens. The fs/dcache.c:dput() does dentry_iput() which brings us back to inode cache via iput(inode) so let us understand fs/inode.c:iput(inode):
If parameter passed to us is NULL, we do absolutely nothing and return.
if there is a fs-specific sb->s_op->put_inode() method, it is invoked immediately with no spinlocks held (so it can block).
inode_lock spinlock is taken and i_count is decremented. If this was NOT the last reference to this inode then we simply check if there are too many references to it and so i_count can wrap around the 32 bits allocated to it and if so we print a warning and return. Note that we call printk() while holding the inode_lock spinlock - this is fine because printk() can never block, therefore it may be called in absolutely any context (even from interrupt handlers!).
If this was the last active reference then some work needs to be done.
The work performed by iput() on the last inode reference is rather complex so we separate it into a list of its own:
If i_nlink == 0 (e.g. the file was unlinked while we held it open) then the inode is removed from hashtable and from its type list; if there are any data pages held in page cache for this inode, they are removed by means of truncate_all_inode_pages(&am