2.7 Bottom Halves
Sometimes it is reasonable to split the work performed inside an interrupt handler into immediate work (e.g. acknowledging the interrupt, updating the statistics, etc.) and work that can be postponed until later, when interrupts are enabled (e.g. postprocessing the data, waking up processes waiting for that data, etc.).
Bottom halves are the oldest mechanism for deferred execution of kernel tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism was added, called 'task queues', which will be the subject of the next section.
Bottom halves are serialised by the global_bh_lock spinlock, i.e. there can be only one bottom half running on any CPU at a time. However, if global_bh_lock is not available when attempting to execute the handler, the bottom half is marked (i.e. scheduled) for execution, so that processing can continue instead of busy-looping on global_bh_lock.
There can only be 32 bottom halves registered in total. The functions required to manipulate bottom halves are as follows (all exported to modules); a usage sketch follows the list:
void init_bh(int nr, void (*routine)(void)): installs the bottom half handler pointed to by the routine argument into slot nr. The slot ought to be enumerated in include/linux/interrupt.h in the form XXXX_BH, e.g. TIMER_BH or TQUEUE_BH. Typically, a subsystem's initialisation routine (init_module() for modules) installs the required bottom half using this function.
void remove_bh(int nr): does the opposite of init_bh(), i.e. de-installs the bottom half installed at slot nr. There is no error checking performed here, so, for example, remove_bh(32) will panic/oops the system. Typically, a subsystem's cleanup routine (cleanup_module() for modules) uses this function to free up the slot, which can later be reused by some other subsystem. (TODO: wouldn't it be nice to have /proc/bottom_halves list all registered bottom halves on the system? That means global_bh_lock must be made read/write, obviously)
void mark_bh(int nr): marks bottom half in slot nr for execution. Typically, an interrupt handler will mark its bottom half (hence the name!) for execution at a "safer time".
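As a minimal sketch of how these fit together (the FOO_BH slot and the foo_* names below are illustrative assumptions, not taken from the kernel source; FOO_BH is assumed to have been added to the enum in include/linux/interrupt.h):

--------------------------------------------------------------------------------
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/sched.h>

/* deferred work: runs with interrupts enabled, but serialised against
 * all other bottom halves by global_bh_lock */
static void foo_bh(void)
{
        /* postprocess data, wake up waiting processes, etc. */
}

/* immediate work: runs in the interrupt handler itself */
static void foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        /* acknowledge the hardware, update the stats ... */
        mark_bh(FOO_BH);                /* defer the rest until a "safer time" */
}

int init_module(void)
{
        init_bh(FOO_BH, foo_bh);        /* claim slot FOO_BH */
        return 0;
}

void cleanup_module(void)
{
        remove_bh(FOO_BH);              /* free the slot for reuse */
}
--------------------------------------------------------------------------------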
Bottom halves are globally locked tasklets, so the question "when are bottom half handlers executed?" is really "when are tasklets executed?". And the answer is, in two places: a) on each schedule() and b) on each interrupt/syscall return path in entry.S (TODO: therefore, the schedule() case is really boring - it is like adding yet another very, very slow interrupt; why not get rid of the handle_softirq label from schedule() altogether?).
2.8 Task Queues
Task queues can be thought of as a dynamic extension to the old bottom halves. In fact, in the source code they are sometimes referred to as "new" bottom halves. More specifically, the old bottom halves discussed in the previous section have these limitations:
There are only a fixed number (32) of them.
Each bottom half can only be associated with one handler function.
Bottom halves are consumed with a spinlock held so they cannot block.
So, with task queues, an arbitrary number of functions can be chained and processed one after another at a later time. One creates a new task queue using the DECLARE_TASK_QUEUE() macro and queues a task onto it using the queue_task() function. The task queue can then be processed using run_task_queue(). Instead of creating your own task queue (and having to consume it manually) you can use one of Linux's predefined task queues, which are consumed at well-known points (a usage sketch follows the list):
tq_timer: the timer task queue, run on each timer interrupt and when releasing a tty device (closing or releasing a half-opened terminal device). Since the timer handler runs in interrupt context, the tq_timer tasks also run in interrupt context and thus cannot block.
tq_scheduler: the scheduler task queue, consumed by the scheduler (and also when closing tty devices, like tq_timer). Since the scheduler executes in the context of the process being re-scheduled, the tq_scheduler tasks can do anything they like, i.e. block, use process context data (but why would they want to), etc.
tq_immediate: this is really a bottom half IMMEDIATE_BH, so drivers can queue_task(task, &tq_immediate) and then mark_bh(IMMEDIATE_BH) to be consumed in interrupt context.
tq_disk: used by the low-level block device access code (and RAID) to start the actual requests. This task queue is exported to modules but shouldn't be used except for the special purposes for which it was designed.
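To make the tq_immediate case concrete, here is a minimal hypothetical sketch (the my_* names are assumptions used only for illustration): the interrupt handler queues a task on tq_immediate and marks IMMEDIATE_BH, and the task then runs shortly afterwards, still in interrupt context:

--------------------------------------------------------------------------------
#include <linux/tqueue.h>
#include <linux/interrupt.h>

/* runs later via IMMEDIATE_BH, i.e. still in interrupt context,
 * so it must not block */
static void my_deferred(void *data)
{
        /* postprocess the data collected by the interrupt handler */
}

static struct tq_struct my_task = {
        routine: my_deferred,   /* function to call */
        data:    NULL,          /* argument passed to routine */
};

static void my_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        my_task.data = dev_id;
        queue_task(&my_task, &tq_immediate);
        mark_bh(IMMEDIATE_BH);          /* consume tq_immediate soon */
}
--------------------------------------------------------------------------------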
Unless a driver uses its own task queues, it does not need to call run_task_queue() to process the queue, except under the circumstances explained below.
The reason the tq_timer/tq_scheduler task queues are consumed not only in the usual places but elsewhere (closing a tty device is but one example) becomes clear if one remembers that the driver can schedule tasks on the queue, and these tasks only make sense while a particular instance of the device is still valid - which usually means until the application closes it. So, the driver may need to call run_task_queue() to flush the tasks it (and anyone else) has put on the queue, because allowing them to run at a later time may make no sense - i.e. the relevant data structures may have been freed/reused by a different instance. This is the reason you see run_task_queue() on tq_timer and tq_scheduler in places other than the timer interrupt and schedule() respectively.
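A driver-private queue looks roughly like this (again a hypothetical sketch; my_queue, my_work_fn and the flush point are assumptions): the driver declares its own queue, queues tasks onto it, and flushes it at a moment of its own choosing, e.g. in its release() method, so that no task runs after the instance's data structures are gone:

--------------------------------------------------------------------------------
#include <linux/fs.h>
#include <linux/tqueue.h>

static DECLARE_TASK_QUEUE(my_queue);    /* driver-private task queue */

static void my_work_fn(void *data)
{
        /* runs when the driver flushes my_queue below */
}

static struct tq_struct my_work = {
        routine: my_work_fn,
        data:    NULL,
};

/* somewhere in the driver, e.g. in an ioctl or interrupt handler */
static void my_defer_some_work(void)
{
        queue_task(&my_work, &my_queue);
}

/* the driver consumes its own queue; here we flush pending tasks when
 * the last user closes the device */
static int my_release(struct inode *inode, struct file *file)
{
        run_task_queue(&my_queue);
        return 0;
}
--------------------------------------------------------------------------------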
2.9 Tasklets
Not yet, will be in future revision.
2.10 Softirqs
Not yet, will be in future revision.
2.11 How System Calls Are Implemented on the i386 Architecture
There are two mechanisms under Linux for implementing system calls:
lcall7/lcall27 call gates;
int 0x80 software interrupt.
Native Linux programs use int 0x80 whilst binaries from foreign flavours of UNIX (Solaris, UnixWare 7 etc.) use the lcall7 mechanism. The name 'lcall7' is historically misleading because it also covers lcall27 (e.g. Solaris/x86), but the handler function is called lcall7_func.
When the system boots, the function arch/i386/kernel/traps.c:trap_init() is called which sets up the IDT so that vector 0x80 (of type 15, dpl 3) points to the address of system_call entry from arch/i386/kernel/entry.S.
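The relevant part of trap_init() boils down to a single call (quoted here from memory, so treat the exact form as an approximation); set_system_gate() installs a gate with dpl 3, which is what allows userspace to invoke it:

--------------------------------------------------------------------------------
        set_system_gate(SYSCALL_VECTOR, &system_call);  /* SYSCALL_VECTOR == 0x80 */
--------------------------------------------------------------------------------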
When a userspace application makes a system call, the arguments are passed via registers and the application executes the 'int 0x80' instruction. This causes a trap into kernel mode and the processor jumps to the system_call entry point in entry.S. What this does is:
Save registers.
Set %ds and %es to KERNEL_DS, so that all data (and extra segment) references are made in kernel address space.
If the value of %eax is greater than or equal to NR_syscalls (currently 256), fail with the ENOSYS error.
If the task is being ptraced (tsk->ptrace & PT_TRACESYS), do special processing. This is to support programs like strace (analogue of SVR4 truss(1)) or debuggers.
Call sys_call_table+4*(syscall_number from %eax). This table is initialised in the same file (arch/i386/kernel/entry.S) to point to individual system call handlers which under Linux are (usually) prefixed with sys_, e.g. sys_open, sys_exit, etc. These C system call handlers will find their arguments on the stack where SAVE_ALL stored them.
Enter 'system call return path'. This is a separate label because it is used not only by int 0x80 but also by lcall7, lcall27. This is concerned with handling tasklets (including bottom halves), checking if a schedule() is needed (tsk->need_resched != 0), checking if there are signals pending and if so handling them.
Linux supports up to 6 arguments for system calls. They are passed in %ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see _syscall6() in asm-i386/unistd.h). The system call number is passed via %eax.
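As an illustration from the userspace side, a system call can be issued directly with inline assembly. This is a minimal sketch (my_write is a made-up wrapper; real programs normally go through libc or the _syscallN() macros):

--------------------------------------------------------------------------------
#include <asm/unistd.h>         /* __NR_write */

static inline int my_write(int fd, const void *buf, unsigned int count)
{
        int ret;

        __asm__ __volatile__ ("int $0x80"
                : "=a" (ret)                        /* result comes back in %eax */
                : "0" (__NR_write),                 /* syscall number in %eax */
                  "b" (fd), "c" (buf), "d" (count)  /* arguments in %ebx, %ecx, %edx */
                : "memory");

        return ret;     /* a small negative value encodes -errno */
}
--------------------------------------------------------------------------------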
2.12 Atomic Operations
There are two types of atomic operations: bitmaps and atomic_t. Bitmaps are very convenient for maintaining a concept of "allocated" or "free" units from some large collection where each unit is identified by some number, for example free inodes or free blocks. They are also widely used for simple locking, for example to provide exclusive access to open a device. An example of this can be found in arch/i386/kernel/microcode.c:
--------------------------------------------------------------------------------
/*
 * Bits in microcode_status. (31 bits of room for future expansion)
 */
#define MICROCODE_IS_OPEN 0     /* set if device is in use */

static unsigned long microcode_status;
--------------------------------------------------------------------------------
There is no need to initialise microcode_status to 0, as the BSS is explicitly zero-cleared under Linux.
--------------------------------------------------------------------------------
/*
 * We enforce only one user at a time here with open/close.
 */
static int microcode_open(struct inode *inode, struct file *file)
{
        if (!capable(CAP_SYS_RAWIO))
                return -EPERM;

        /* one at a time, please */
        if (test_and_set_bit(MICROCODE_IS_OPEN, &microcode_status))
                return -EBUSY;

        MOD_INC_USE_COUNT;
        return 0;
}
--------------------------------------------------------------------------------
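The matching release method clears the bit again so that the next open() can succeed; it is sketched here from memory, so the exact code may differ slightly from arch/i386/kernel/microcode.c:

--------------------------------------------------------------------------------
static int microcode_release(struct inode *inode, struct file *file)
{
        clear_bit(MICROCODE_IS_OPEN, &microcode_status);
        MOD_DEC_USE_COUNT;
        return 0;
}
--------------------------------------------------------------------------------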
The operations on bitmaps are:
void set_bit(int nr, volatile void *addr): set bit nr in the bitmap pointed to by addr.
void clear_bit(int nr, volatile void *addr): clear bit nr in the bitmap pointed to by addr.
void change_bit(int nr, volatile void *addr): toggle bit nr (if set clear, if clear set) in the bitmap pointed to by addr.
int test_and_set_bit(int nr, volatile void *addr): atomically set bit nr and return the old bit value.