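Reading notes: How to add a system call to OpenSolaris

Original author: Eric Schrock

Original post:

http://blogs.sun.com/roller/page/eschrock

Annotated by: Badcoffee

Email: blog.oliver@gmail.com

Blog: http://blog.csdn.net/yayong

July 2005

Foreword: Adding a simple system call to an operating system is a good way to get familiar with its kernel code. This post gives the basic steps for adding a system call to the Solaris kernel.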

How to add a system call to OpenSolaris

When I first started in the Solaris group, I was faced with two equally difficult tasks: learning the development model, and understanding the source code. For both these tasks, the recommended method is usually picking a small bug and working through the process. For the curious, the first bug I putback to ON was 4912227 (ptree call returns zero on failure), a simple bug with near zero risk. It was the first step down a very long road.

As another first step, someone suggested adding a very simple system call to the kernel. This turned out to be a whole lot harder than one would expect, and has so many subtle aspects that experienced Solaris engineers (myself included) still miss some of the necessary changes. With that in mind, I thought a reasonable first OpenSolaris blog would be describing exactly how to add a new system call to the kernel.

For the purposes of this post, we will assume that it's a simple system call that lives in the generic kernel code, and we'll put the code into an existing file to avoid having to deal with Makefiles. The goal is to print an arbitrary message to the console whenever the system call is issued.

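Notes:

Solaris development poses two difficulties: understanding the Solaris development model (the process side of things), and understanding the Solaris source code. A good way to get familiar with the process is to pick a very small Solaris bug and work through it.

For understanding the source code, a good starting point is to add a very simple system call. This is harder than it sounds, though: there are many subtle details that even experienced Solaris engineers miss.

The author takes this as the starting point and describes how to add a system call to the Solaris kernel.

To keep things as simple as possible, the author puts the code for the new call into an existing source file, which avoids any Makefile changes. The new system call merely prints an arbitrary message to the console when it is invoked.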

1. Picking a syscall number

Before writing any real code, we first have to pick a number that will represent our system call. The main source of documentation here is syscall.h, which describes all the available system call numbers, as well as which ones are reserved. The maximum number of syscalls is currently 256 (NSYSCALL), which doesn't leave much space for new ones. This could theoretically be extended - I believe the hard limit is in the size of sysset_t, whose 16 integers must be able to represent a complete bitmask of all system calls. This puts our actual limit at 16*32, or 512, system calls. But for the purposes of our tutorial, we'll pick system call number 56, which is currently unused. For my own amusement, we'll name our (my?) system call 'schrock'. So first add the following line to syscall.h:

#define SYS_uadmin 55

#define SYS_schrock 56

#define SYS_utssys 57

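Notes:

4. Step 1: pick a system call number and add a definition for it to syscall.h, the header that lists all the system call numbers currently available.

5. The maximum number of system calls is set by NSYSCALL in systm.h and is currently 256. Most of those numbers are already taken, so there is little room for new ones.

6. In theory the maximum could be raised, but sysset_t imposes a limit: it is a bitmask covering every system call, and its definition in syscall.h shows that it holds at most 16 * 32 = 512 bits.

7. To keep things simple, the author uses number 56, which happens to be unused, and names the system call "schrock".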
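The 16 * 32 = 512 limit follows from how a bitmask indexed by system call number is laid out. The following is a minimal userland sketch, assuming a structure shaped like sysset_t (an array of 16 32-bit words). The names demo_sysset_t, demo_empty, demo_set and demo_ismember are hypothetical and are not the kernel's own definitions.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_WORDS 16                  /* word count the post quotes for sysset_t */
#define DEMO_BITS  (DEMO_WORDS * 32)   /* 16 * 32 = 512 representable numbers */

/* Hypothetical stand-in for sysset_t: one bit per system call number. */
typedef struct {
    uint32_t word[DEMO_WORDS];
} demo_sysset_t;

/* Clear every bit in the set. */
static void
demo_empty(demo_sysset_t *sp)
{
    memset(sp, 0, sizeof (*sp));
}

/* Mark system call `num` as a member; -1 if it cannot be represented. */
static int
demo_set(demo_sysset_t *sp, unsigned int num)
{
    if (num >= DEMO_BITS)
        return (-1);
    sp->word[num / 32] |= (uint32_t)1 << (num % 32);
    return (0);
}

/* Return 1 if system call `num` is in the set, 0 otherwise. */
static int
demo_ismember(const demo_sysset_t *sp, unsigned int num)
{
    if (num >= DEMO_BITS)
        return (0);
    return ((int)((sp->word[num / 32] >> (num % 32)) & 1));
}
```

With this layout, number 56 lands in word 1, bit 24, and any number from 512 upward has no bit to live in, which is why the size of the mask caps the number of system calls.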

2. Writing the syscall handler

Next, we have to actually add the function that will get called when we invoke the system call. What we should really do is add a new file schrock.c to usr/src/uts/common/syscall, but I'm trying to avoid Makefiles. Instead, we'll just stick it in getpid.c:

#include <sys/cmn_err.h>

int
schrock(void *arg)
{
	char	buf[1024];
	size_t	len;

	/* copy the null-terminated message in from user space */
	if (copyinstr(arg, buf, sizeof (buf), &len) != 0)
		return (set_errno(EFAULT));

	/* print the message on the console */
	cmn_err(CE_WARN, "%s", buf);

	return (0);
}

Note that declaring a buffer of 1024 bytes on the stack is a very bad thing to do in the kernel. We have limited stack space, and a stack overflow will result in a panic. We also don't check that the length of the string was less than our scratch space. But this will suffice for illustrative purposes. The cmn_err() function is the simplest way to display messages from the kernel.

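Notes:

Step 2: implement the system call function. To avoid touching Makefiles, the author adds the new call, schrock, to getpid.c. The implementation is simple: it prints a given string on the console.

The function declares a 1024-byte buffer, which is allocated on the kernel stack. Kernel stack space is very limited, so a buffer this large is a bad idea: overflowing the stack will panic the system. In general, to avoid exhausting the kernel stack, both local variables and nested function calls have to be weighed against the stack space they use.

10. The OpenSolaris source shows that copyinstr() copies a null-terminated string from user space into kernel space. Its prototype is:

int copyinstr(const char *uaddr, char *kaddr, size_t maxlength, size_t *lencopied);

The first two arguments are the source string in user space and the destination buffer in kernel space; the third is the size of the destination buffer; the fourth receives the number of bytes actually copied.

11. cmn_err() is roughly the counterpart of Linux's printk(): it writes kernel messages to the console.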
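The post points out that the handler never checks whether the string fit into its scratch buffer. The sketch below is a userland mock of the bounded-copy behaviour, written under the assumption (taken from my reading of copyinstr(9F), not from the post) that the call returns 0 on success and ENAMETOOLONG when the string does not fit, and that the copied length includes the terminating null. mock_copyinstr and mock_schrock are hypothetical names; the mock copies between ordinary buffers, so it cannot produce the EFAULT a real user-space fault would.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical userland mock of copyinstr(): copy a null-terminated string
 * from src into dst, writing at most maxlength bytes.  Assumed semantics:
 * 0 on success, ENAMETOOLONG if the string plus its terminator does not
 * fit; *lencopied receives the number of bytes written.
 */
static int
mock_copyinstr(const char *src, char *dst, size_t maxlength, size_t *lencopied)
{
    size_t i;

    for (i = 0; i < maxlength; i++) {
        dst[i] = src[i];
        if (src[i] == '\0') {
            if (lencopied != NULL)
                *lencopied = i + 1;     /* count includes the terminator */
            return (0);
        }
    }
    if (lencopied != NULL)
        *lencopied = maxlength;
    return (ENAMETOOLONG);
}

/* Handler-shaped caller that passes the specific error back to its caller. */
static int
mock_schrock(const char *uarg, char *out, size_t outlen)
{
    size_t len;
    int err;

    err = mock_copyinstr(uarg, out, outlen, &len);
    if (err != 0)
        return (err);   /* too long here; a real copyinstr may also fault */
    return (0);
}
```

Returning the error from the copy, rather than always reporting EFAULT, lets the caller tell an oversized string apart from a bad pointer.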

3. Adding an entry to the syscall table

We need to place an entry in the system call table. This table lives in sysent.c, and makes heavy use of macros to simplify the source. Our system call takes a single argument and returns an integer, so we'll need to use the SYSENT_CI macro. We need to add a prototype for our syscall, and add an entry to the sysent and sysent32 tables:

int rename();

void rexit();

int schrock();

int semsys();

int setgid();

/* ... */

/* 54 */ SYSENT_CI("ioctl", ioctl, 3),

/* 55 */ SYSENT_CI("uadmin", uadmin, 3),

/* 56 */ SYSENT_CI("schrock",schrock,1),

/* 57 */ IF_LP64(

SYSENT_2CI("utssys", utssys64, 4),

SYSENT_2CI("utssys", utssys32, 4)),

/* ... */

/* 54 */ SYSENT_CI("ioctl", ioctl, 3),

/* 55 */ SYSENT_CI("uadmin", uadmin, 3),

/* 56 */ SYSENT_CI("schrock",schrock,1),

/* 57 */ SYSENT_2CI("utssys", utssys32, 4),

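Notes:

12. Step 3: add an entry to the system call table. The table lives in sysent.c, which uses many macros to keep the source compact.

13. sysent and sysent32 hold the system call tables. sysent.c describes them as follows: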

* This table is the switch used to transfer to the appropriate

* routine for processing a system call. Each row contains the

* number of arguments expected, a switch that tells systrap()

* in trap.c whether a setjmp() is not necessary, and a pointer

* to the routine.

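So each entry in the source gives the system call's name, a pointer to its handler, and the number of arguments it takes.

sysent32 is used on a 64-bit kernel, where it holds the table for 32-bit system calls.

14. The new call returns a single value of type int, which is 32 bits wide under both LP64 and ILP32, so SYSENT_CI is the macro to use. The relevant definitions are in sysent.c: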

/* returns a 64-bit quantity for both ABIs */

#define SYSENT_C(name, call, narg) { (narg), SE_64RVAL, NULL, NULL, (llfcn_t)(call) }

/* returns one 32-bit value for both ABIs: r_val1 */

#define SYSENT_CI(name, call, narg) { (narg), SE_32RVAL1, NULL, NULL, (llfcn_t)(call) }

/* returns 2 32-bit values: r_val1 & r_val2 */

#define SYSENT_2CI(name, call, narg) { (narg), SE_32RVAL1|SE_32RVAL2, NULL, NULL, (llfcn_t)(call) }

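As these show, a different macro applies depending on the type and number of return values; this example needs SYSENT_CI.

SYSENT_CI takes the name of the call as a string, then the handler function (cast to the function pointer type llfcn_t), then the number of arguments. So besides adding entries to the sysent and sysent32 tables, we also have to declare the schrock() function.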
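To make the role of the table concrete, here is a small userland sketch of table-driven dispatch: a number indexes an array of rows, and each row carries an argument count and a handler pointer. It is a simplified analogue only. struct demo_sysent, demo_table, demo_dispatch and the doubling handler are hypothetical, and the real sysent rows carry more fields than shown here.

```c
#include <stddef.h>

#define DEMO_NSYSCALL 64    /* small table for the sketch; the post quotes 256 */

typedef long (*demo_fn_t)(long, long, long);

/* Hypothetical, simplified analogue of one system call table row. */
struct demo_sysent {
    int         narg;   /* number of arguments the handler expects */
    demo_fn_t   callc;  /* handler, or NULL for an unused slot */
};

/* Stand-in handler: doubles its first argument so the result is observable. */
static long
demo_schrock(long a0, long a1, long a2)
{
    (void) a1;
    (void) a2;
    return (a0 * 2);
}

/* Designated initializers leave every other slot zeroed, i.e. unused. */
static const struct demo_sysent demo_table[DEMO_NSYSCALL] = {
    [56] = { 1, demo_schrock },
};

/* Look up `num` and call its handler; -1 means "no such system call". */
static long
demo_dispatch(int num, long a0, long a1, long a2)
{
    if (num < 0 || num >= DEMO_NSYSCALL || demo_table[num].callc == NULL)
        return (-1);
    return (demo_table[num].callc(a0, a1, a2));
}
```

Because the lookup is by position, the row for a new call has to sit at exactly the index chosen in syscall.h, which is why the post adds its entry between 55 and 57 in both tables.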

4. /etc/name_to_sysnum

At this point, we could write a program to invoke our system call, but the point here is to illustrate everything that needs to be done to integrate a system call, so we can't ignore the little things. One of these little things is /etc/name_to_sysnum, which provides a mapping between system call names and numbers, and is used by dtrace(1M), truss(1), and friends. Of course, there is one version for x86 and one for SPARC, so you will have to add the following lines to both the intel and SPARC versions:

ioctl 54

uadmin 55

schrock 56

utssys 57

fdsync 58

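Notes:

15. Step 4: add the new system call number to /etc/name_to_sysnum. The main work is already done at this point, and an application could already invoke the new system call, but the tutorial sets out to cover every step needed to integrate a system call, so these details cannot be skipped.

16. /etc/name_to_sysnum gives programs such as dtrace(1M) and truss(1) a mapping between system call names and numbers. Both the Intel and the SPARC versions of the file have to be modified.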
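A consumer of this mapping only needs to scan "name number" pairs. The sketch below does that over an in-memory copy of the text, assuming nothing beyond the two-column format shown in the listing. demo_name_to_sysnum is a hypothetical helper, not a function from any Solaris library, and it simply stops at the first line it cannot parse.

```c
#include <stdio.h>
#include <string.h>

/*
 * Search text in the two-column "name number" format for `name`.
 * Returns the number, or -1 if the name is absent.  `text` is an
 * in-memory copy of the file contents (an assumption made to keep the
 * sketch self-contained).
 */
static int
demo_name_to_sysnum(const char *text, const char *name)
{
    char entry[64];
    int num, consumed;

    while (*text != '\0') {
        /* " %63s %d" reads one pair; %n records how far we got */
        if (sscanf(text, " %63s %d%n", entry, &num, &consumed) != 2)
            break;
        if (strcmp(entry, name) == 0)
            return (num);
        text += consumed;
    }
    return (-1);
}
```

Given the five lines from the listing, looking up "schrock" yields 56, and a name that is not present yields -1.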

5. truss(1)

Truss does fancy decoding of system call arguments. In order to do this, we need to maintain a table in truss that describes the type of each argument for every syscall. This table is found in systable.c. Since our syscall takes a single string, we add the following entry:

{"ioctl", 3, DEC, NOV, DEC, IOC, IOA}, /* 54 */

{"uadmin", 3, DEC, NOV, DEC, DEC, DEC}, /* 55 */

{"schrock", 1, DEC, NOV, STG}, /* 56 */

{"utssys", 4, DEC, NOV, HEX, DEC, UTS, HEX}, /* 57 */

{"fdsync", 2, DEC, NOV, DEC, FFG}, /* 58 */

Don't worry too much about the different constants. But be sure to read up on the truss source code if you're adding a complicated system call.

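Notes:

17. Step 5: so that truss(1) can decode the arguments of the new system call, add a matching entry to the systable table in systable.c.

18. systable is a table maintained by truss(1) that describes, for each system call, the number of arguments and how the return values and arguments should be displayed. It is defined as follows: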

const struct systable systable[] = {
    { NULL,      8, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX},
    {"_exit",    1, DEC, NOV, DEC},              /* 1 */
    {"forkall",  0, DEC, NOV},                   /* 2 */
    {"read",     3, DEC, NOV, DEC, IOB, UNS},    /* 3 */
    {"write",    3, DEC, NOV, DEC, IOB, UNS},    /* 4 */
    {"open",     3, DEC, NOV, STG, OPN, OCT},    /* 5 */
    /* ... */
    {"cladm",    3, DEC, NOV, CLC, CLF, HEX},    /* 253 */
    { NULL,      8, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX, HEX},
    {"umount2",  2, DEC, NOV, STG, MTF},         /* 255 */
    { NULL,     -1, DEC, NOV},
};

Each row describes one system call and corresponds to the structure systable, which is defined as follows:

struct systable {
    const char *name;       /* name of system call */
    short       nargs;      /* number of arguments */
    char        rval[2];    /* return value types */
    char        arg[8];     /* argument types */
};

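So in each row of systable the first value is the call name, the second is the number of arguments, the third and fourth describe the return values, and the remaining values (up to eight) describe the arguments. With this description truss(1) knows the format of every system call's arguments and return values and can print them correctly. The entry for the new system call is:

{"schrock", 1, DEC, NOV, STG}, /* 56 */

Given the above and the definitions of DEC, NOV and STG in print.h, the meaning of this row is clear: a decimal first return value, no second return value, and a single string argument.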
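The way such a row drives output can be sketched in a few lines: walk the argument type codes and pick a print format for each. This is an illustration of the idea, not truss's implementation. The D_* codes, struct demo_systable and demo_format are hypothetical, and the sketch assumes the string argument lives in the same address space, whereas truss has to read it from the traced process.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical format codes; truss defines its own set in print.h. */
enum { D_NOV, D_DEC, D_HEX, D_STG };

struct demo_systable {
    const char *name;       /* name of system call */
    short       nargs;      /* number of arguments */
    char        rval[2];    /* return value types */
    char        arg[8];     /* argument types */
};

static const struct demo_systable demo_entry =
    { "schrock", 1, { D_DEC, D_NOV }, { D_STG } };

/* Format one call as name(arg, ...) into buf; 0 on success, -1 if truncated. */
static int
demo_format(const struct demo_systable *st, const intptr_t *args,
    char *buf, size_t buflen)
{
    size_t used;
    int i, n;

    n = snprintf(buf, buflen, "%s(", st->name);
    if (n < 0 || (size_t)n >= buflen)
        return (-1);
    used = (size_t)n;

    for (i = 0; i < st->nargs; i++) {
        const char *sep = (i == 0) ? "" : ", ";
        size_t left = buflen - used;

        switch (st->arg[i]) {
        case D_STG:     /* argument is a pointer to a string */
            n = snprintf(buf + used, left, "%s\"%s\"", sep,
                (const char *)args[i]);
            break;
        case D_HEX:
            n = snprintf(buf + used, left, "%s0x%lX", sep,
                (unsigned long)args[i]);
            break;
        default:        /* D_DEC and anything unrecognised */
            n = snprintf(buf + used, left, "%s%ld", sep, (long)args[i]);
            break;
        }
        if (n < 0 || (size_t)n >= left)
            return (-1);
        used += (size_t)n;
    }

    n = snprintf(buf + used, buflen - used, ")");
    if (n < 0 || (size_t)n >= buflen - used)
        return (-1);
    return (0);
}
```

Declaring the argument as a string is what lets the tool show the message text instead of a raw pointer value.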

6. proc_names.c

This is the file that gets missed the most often when adding a new syscall. Libproc uses the table in proc_names.c to translate between system call numbers and names. Why it doesn't make use of /etc/name_to_sysnum is anybody's guess, but for now you have to update the systable array in this file:

"ioctl", /* 54 */

"uadmin", /* 55 */

"schrock", /* 56 */

"utssys", /* 57 */

"fdsync", /* 58 */

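Notes:

19. Step 6: so that Libproc recognizes the new system call, add the corresponding line to proc_names.c. This step is the one most often forgotten. Why Libproc keeps its own mapping between system call names and numbers instead of using /etc/name_to_sysnum is something perhaps only its authors know.

20. Libproc is a set of interfaces that Solaris provides for accessing the proc file system; the commands described in proc(1) are built on it. The interfaces live in the shared library libproc.so. For the proc file system itself, see proc(4).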

7. Putting it all together

Finally, everything is in place. We can test our system call with a simple program:

#include <sys/syscall.h>

int
main(int argc, char **argv)
{
	syscall(SYS_schrock, "OpenSolaris Rules!");
	return (0);
}

If we run this on our system, we'll see the following output on the console:

June 14 13:42:21 halcyon genunix: WARNING: OpenSolaris Rules!

Because we did all the extra work, we can actually observe the behavior using truss(1), mdb(1), or dtrace(1M). As you can see, adding a system call is not as easy as it should be. One of the ideas that has been floating around for a while is the Grand Unified Syscall(tm) project, which would centralize all this information as well as provide type information for the DTrace syscall provider. But until that happens, we'll have to deal with this process.

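Notes:

21. Finally, write a small program to test the new system call. This glosses over an important and fairly involved step: rebuilding the OpenSolaris kernel and then installing or updating OpenSolaris so that the new call is available. Because all the required changes were made, we can not only invoke the new system call but also use all of the OpenSolaris debugging tools on it, such as truss(1), mdb(1) and dtrace(1M).

22. At the end of the post the author mentions a possible future improvement to OpenSolaris: centralizing all the definitions related to system calls and providing system call type information for the DTrace syscall provider. Until that is done, adding a system call means going through the whole process described here.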
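The duplication this walkthrough runs into (syscall.h, sysent.c, name_to_sysnum, the truss systable, proc_names.c) is the kind of thing a single master list can remove. The sketch below uses the C "X-macro" technique as one illustration of that idea. It is not the design of the Grand Unified Syscall project, whose details the post does not give, and every DEMO_ and demo_ name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Single master list: X(name, number, nargs). */
#define DEMO_SYSCALLS(X)    \
    X(ioctl,   54, 3)       \
    X(uadmin,  55, 3)       \
    X(schrock, 56, 1)

/* Expansion 1: numeric constants, the role syscall.h plays. */
#define DEMO_ENUM(name, num, nargs) DEMO_SYS_##name = (num),
enum { DEMO_SYSCALLS(DEMO_ENUM) };

/* Expansion 2: a name/number table, the role proc_names.c plays. */
struct demo_info {
    const char *name;
    int         num;
    int         nargs;
};
#define DEMO_ROW(name, num, nargs) { #name, (num), (nargs) },
static const struct demo_info demo_rows[] = { DEMO_SYSCALLS(DEMO_ROW) };

#define DEMO_NROWS (sizeof (demo_rows) / sizeof (demo_rows[0]))

/* Look up a name by number; NULL if the number is not in the list. */
static const char *
demo_sysname(int num)
{
    size_t i;

    for (i = 0; i < DEMO_NROWS; i++) {
        if (demo_rows[i].num == num)
            return (demo_rows[i].name);
    }
    return (NULL);
}

/* Look up a number by name; -1 if the name is not in the list. */
static int
demo_sysnum(const char *name)
{
    size_t i;

    for (i = 0; i < DEMO_NROWS; i++) {
        if (strcmp(demo_rows[i].name, name) == 0)
            return (demo_rows[i].num);
    }
    return (-1);
}
```

With this arrangement, adding a call means adding one X(...) line, and the constant and both lookups pick it up at the next build.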
