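Reading notes: x86 syscall primer
Original author: Russ Blaine
Original post:
http://blogs.sun.com/roller/page/rab
Translator and annotator: Badcoffee
Email: blog.oliver@gmail.com
Blog: http://blog.csdn.net/yayong
July 2005
Preface: Getting started with something as complex as an operating system is a headache. To help newcomers get their bearings, this post discusses the basic system call infrastructure on Solaris x86 and Solaris x64.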
x86 syscall primer
Getting started on a project as complex as an operating system can be quite a daunting task. To help OpenSolaris newcomers sort out their head from their tail, here's a look at the system call infrastructure on Solaris x86 and Solaris x64.
I'll go over the different system call methods used, their departure points in userland and entry points in the kernel, and then we'll actually follow one into the kernel with the debugger to see it all in action.
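Notes:
1. I personally think system calls are the best starting point for learning an operating system, because they are the gateway from user mode into kernel mode. Evidently we are not the only ones who find operating systems complex: even a kernel developer calls it "quite a daunting task". So don't lose heart when you get stuck.
2. "Sort out their head from their tail" is an idiom meaning roughly "get one's bearings".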
Background
Processors in the x86 world support a number of different system call methods, and some are faster than others. In Solaris, unoptimized system calls take one of three possible paths into the kernel:
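Notes:
3. x86 processors support many system call methods, some faster than others. In Solaris, an unoptimized system call takes one of three possible paths into the kernel.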
lcall $0x27
Used for years as the standard Solaris syscall method.
int $0x91
The int instruction has been used by Linux for years. Solaris finally adopted it as the base syscall method in Solaris 11 (under development), and earned a significant performance increase as a result. It will be available soon in a Solaris 10 update.
lcall $0x7
Used by some (very old) statically linked binaries.
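Notes:
4. lcall uses the x86 call gate mechanism. lcall $0x27 is the standard Solaris system call method; lcall $0x7 appears in very old statically linked Solaris binaries.
5. int uses the interrupt gate provided by x86. int $0x91 is being introduced in Solaris 11, where it brings a significant performance gain, and it will soon appear in a Solaris 10 update. Linux and FreeBSD use the same mechanism with a different interrupt vector: int $0x80.
The x86 CPU supports four kinds of gates:
Interrupt gate -- used by Windows/Linux/Unix systems for interrupt handling and system calls
Trap gate -- generally used for exception handling
Call gate -- used by Linux/Unix to implement system services and to stay compatible with older applications
Task gate -- not used by modern operating systems because it is slow and limits the number of tasks; only early Linux 2.0 used it
For details on the x86 gate mechanisms, see Volume 3 (System Programming) of the Intel Pentium 4 manual set.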
Fast Syscalls and Hardware Capability Libraries
When a well-behaved application makes a system call, it jumps through a wrapper function in libc. Changing the instruction used to enter the kernel becomes a matter of changing the wrappers in libc. Recently I integrated support for faster, chip-proprietary system calls into Solaris 10: sysenter (from Intel) and syscall (from AMD). Along with new kernel entry points, new hwcap (as in "hardware capability") versions of libc were provided to take advantage of these new, faster instructions (Tim Marsland has written about the hw capability architecture and Darren Moffat has written about how the system goes about selecting and using a hwcap libc).
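Notes:
6. An application normally makes a system call through a wrapper function in libc. The wrapper ultimately enters kernel mode using one of the instructions the CPU provides for that purpose.
7. The author recently integrated support for faster, chip-specific system call instructions into Solaris 10: Intel's sysenter and AMD's syscall. Along with new kernel entry points, new hardware capability versions of libc were provided that use these faster instructions. The author also links to two related posts about the hwcap libraries.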
I often get confused about which system call method is used on which type of system. For the record, the following table shows which methods are supported by the various flavor combinations of x86 kernels, CPUs, and user application types shipping today:
u64 = 64-bit user applications
u32 = 32-bit user applications
                              syscall                           sysenter
64-bit kernel   Intel Xeon    u64 (64-bit libc)                 u32 (hwcap1)
                AMD Opteron   u64 (64-bit libc), u32 (hwcap2)   -
32-bit kernel   Intel Xeon    -                                 u32 (hwcap1)
                AMD Opteron   u32                               u32 (hwcap1)
(The hwcap libraries referenced live in the /usr/lib/libc directory.)
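Notes:
8. The table above shows which libc version is used on Intel Xeon and AMD Opteron under 32-bit and 64-bit kernels:
With a 64-bit Solaris kernel, the 64-bit libc (u64) uses the syscall instruction on both Xeon and Opteron, presumably because AMD led the way in 64-bit technology and Intel had to follow.
A 64-bit Solaris kernel also provides a 32-bit libc for 32-bit applications, in a different version for each CPU:
u32 (hwcap1) -- the hardware capability 1 version of libc, which supports Intel's fast system call instructions SYSENTER/SYSEXIT
u32 (hwcap2) -- the hardware capability 2 version of libc, which supports AMD's fast system call instructions SYSCALL/SYSRET
With a 32-bit Solaris kernel, AMD and Intel both use the hardware capability 1 version of libc.
Intel has supported the fast system call instructions SYSENTER/SYSEXIT since as early as the Pentium II 300 (Family 6, Model 3, Stepping 3). The AMD Opteron is compatible with them in 32-bit mode. In 64-bit mode AMD took the lead with SYSCALL/SYSRET as the fast system call instructions, and Intel's EM64T had to be compatible with them.
For more on the Intel and AMD fast system call instructions, see the article "Linux 2.6 对新型 CPU 快速系统调用的支持" (Linux 2.6 support for fast system calls on newer CPUs). For the full story, read the Intel and AMD system programming manuals.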
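The support table above can be read as a plain lookup. The Python sketch below transcribes it into a dictionary. The names SUPPORT and fast_methods are hypothetical, chosen only for illustration, and the data is copied from the table as reconstructed in this post rather than queried from any Solaris interface.

```python
# Which fast system call instruction each application type can use,
# keyed by (kernel, cpu). Transcribed from the table in this post;
# the structure and names are illustrative, not a Solaris API.
SUPPORT = {
    ("64-bit", "Intel Xeon"): {
        "syscall": ["u64 (64-bit libc)"],
        "sysenter": ["u32 (hwcap1)"],
    },
    ("64-bit", "AMD Opteron"): {
        "syscall": ["u64 (64-bit libc)", "u32 (hwcap2)"],
        "sysenter": [],
    },
    ("32-bit", "Intel Xeon"): {
        "syscall": [],
        "sysenter": ["u32 (hwcap1)"],
    },
    ("32-bit", "AMD Opteron"): {
        "syscall": ["u32"],
        "sysenter": ["u32 (hwcap1)"],
    },
}


def fast_methods(kernel, cpu, app):
    """Return the fast instructions listed for an app type ('u32' or 'u64')."""
    row = SUPPORT[(kernel, cpu)]
    return [insn for insn, apps in row.items()
            if any(entry.startswith(app) for entry in apps)]


if __name__ == "__main__":
    print(fast_methods("64-bit", "AMD Opteron", "u32"))
    print(fast_methods("32-bit", "Intel Xeon", "u64"))
```

A combination with no entry in the table simply yields an empty list, which here means the application falls back to the unoptimized path.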
Digging In
To illustrate this, let's take a look at the libc source code. It lives under the usr/src/lib/libc directory. The important entries here are:
i386/ - 32-bit source code and unoptimized binary
amd64/ - 64-bit source code and binary
i386_hwcap1/ - Intel CPU-specific source code and binary
i386_hwcap2/ - AMD CPU-specific source code and binary
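Notes:
9. These are the libc source paths. Reading syscall.s under i386/sys and amd64/sys, together with the macro definitions in the Makefiles in the i386_hwcap1 and i386_hwcap2 source directories, shows how the four libc versions differ.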
A simple system call to use for this example is mkdir(2). We can use mdb to disassemble the text bits and see how libc jumps into the kernel:
rab> mdb /lib/libc.so.1
Loading modules: [ libc.so.1 ]
> mkdir::dis
mkdir: movl $0x50,%eax
mkdir+5: syscall
mkdir+7: jb -0x82847 <__cerror>
mkdir+0xd: ret
We can see that the system call number (see Eric Schrock's post for more information on system call numbers) is stashed away in register %eax so the kernel can find it later, and then the syscall instruction is executed to transfer control to the kernel.
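Notes:
10. Disassembling libc's mkdir(2) with mdb shows that it is just a thin wrapper: it puts the system call number in the eax register and then uses the syscall instruction to enter the kernel.
12. The system call number for mkdir is 0x50, or 80 in decimal. The definition is in syscall.h:
#define SYS_mkdir 80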
This example is on an AMD Opteron system, because otherwise we'd expect to find either lcall $0x27 or sysenter as the control transfer instruction. We can get at the unoptimized libc by unmounting the hwcap library:
rab> su
Password:
# umount /lib/libc.so.1
rab> mdb /lib/libc.so.1
Loading modules: [ libc.so.1 ]
> mkdir::dis
mkdir: movl $0x50,%eax
mkdir+5: lcall $0x27,$0x0
mkdir+0xc: jb -0x82b2c <__cerror>
mkdir+0x12: ret
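Notes:
13. With the hwcap library unmounted from /lib/libc.so.1, what remains is the unoptimized libc, and the instruction that makes the system call is now lcall $0x27. The author presumably ran this on Solaris 10; in OpenSolaris the unoptimized libc should already use int $0x91. See my notes 15 and 16 below.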
Tracing it back to the source
Ah-hah - now let's look at the source for the libc mkdir(2)
wrapper to complete the userland picture:
rab> pwd
.../usr/src/lib/libc/common/sys
rab> cat mkdir.s
[ snip ]
#include "SYS.h"
SYSCALL_RVAL1(mkdir)
RET
SET_SIZE(mkdir)
Notes:
14. This is the libc implementation of mkdir. It simply uses the SYSCALL_RVAL1 macro which, judging by its name, is meant for system calls that return a single value.
In order to organize the source in a portable way that avoids reproducing the same code in more than one place, many portions of libc are implemented as preprocessor macros. mkdir(2) is so simple that it needs nothing but the SYSCALL macro, found in SYS.h. For reasons too boring to repeat here, the SYSCALL macro eventually expands into a corresponding SYSTRAP macro. All 32-bit variants of libc share one SYS.h, and preprocessor macros defined via Makefiles in the binary directories determine which instructions go into the SYSTRAP macro:
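Notes:
15. The SYSCALL* macros exist mainly to avoid duplicating code in many places; each expands into a corresponding SYSTRAP macro. The SYSCALL* macros are defined in SYS.h, and the macro definitions in the Makefiles in the i386_hwcap1 and i386_hwcap2 source directories decide which SYSTRAP variant is used.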
rab> pwd
.../usr/src/lib/libc/i386/inc
rab> grep SYSTRAP_RVAL1 SYS.h
#define SYSTRAP_RVAL1(name) __SYSCALL(name)
#define SYSTRAP_RVAL1(name) __SYSENTER(name)
#define SYSTRAP_RVAL1(name) __SYSLCALL(name)
One of the above macros is used depending on which libc is being built: __SYSCALL() for hwcap2, __SYSENTER() for hwcap1, and __SYSLCALL() for the unoptimized base libc at /lib/libc.so.1.
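Notes:
16. As shown, depending on the macro definitions in the Makefiles in the i386_hwcap1 and i386_hwcap2 directories, libc is built as the hwcap2 version using __SYSCALL(), the hwcap1 version using __SYSENTER(), or the unoptimized version (as noted earlier, lcall $0x27 on Solaris 10 and int $0x91 on OpenSolaris).
In fact, no 32-bit libc, not even the hwcap1 libc, implements every system call with __SYSENTER(). System calls with more than one return value still use lcall $0x27 or int $0x91. The OpenSolaris 32-bit libc source file SYS.h contains these definitions:
#define SYSTRAP_RVAL2(name) __SYSCALLINT(name)
#define SYSTRAP_2RVALS(name) __SYSCALLINT(name)
#define SYSTRAP_64RVAL(name) __SYSCALLINT(name)
So OpenSolaris implements system calls with multiple return values using int $0x91.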
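The per-variant selection of SYSTRAP_RVAL1 can be modeled in a few lines. The Python sketch below is only an illustration of the idea, not the real build: the variant keys ("hwcap2", "hwcap1", "base") and the function name expand_systrap_rval1 are hypothetical, while the macro names and the two instruction strings come from the SYS.h excerpt and the Solaris 10 mdb disassembly quoted in this post.

```python
# Hypothetical model of how one shared SYS.h yields different wrappers.
# Variant keys are made up for this sketch; macro names are from the post.
TRAP_MACRO = {
    "hwcap2": "__SYSCALL",    # AMD-specific libc
    "hwcap1": "__SYSENTER",   # Intel-specific libc
    "base": "__SYSLCALL",     # unoptimized /lib/libc.so.1 (Solaris 10)
}

# Instruction emitted after "movl $SYS_<name>, %eax" by each macro.
# Only the bodies shown in the post are listed here.
TRAP_INSN = {
    "__SYSCALL": "syscall",
    "__SYSLCALL": "lcall $0x27, $0",
}


def expand_systrap_rval1(variant, name):
    """Return the text SYSTRAP_RVAL1(name) stands for in a given libc variant."""
    macro = TRAP_MACRO[variant]
    insn = TRAP_INSN.get(macro)
    if insn is None:
        # The post does not show the body of __SYSENTER, so it is left
        # unexpanded rather than guessed.
        return f"{macro}({name})"
    return f"movl $SYS_{name}, %eax; {insn}"


if __name__ == "__main__":
    for variant in ("hwcap2", "hwcap1", "base"):
        print(variant, "->", expand_systrap_rval1(variant, "mkdir"))
```

The point of the model is that the wrapper source (mkdir.s) never changes; only the table entry chosen at build time does.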
rab> cat SYS.h
[ snip ]
#define __SYSLCALL(name) /* CSTYLED */ movl $SYS_/**/name, %eax; lcall $SYSCALL_TRAPNUM, $0
[ snip ]
#define __SYSCALL(name) /* CSTYLED */ movl $SYS_/**/name, %eax; .byte 0xf, 0x5 /* syscall */
We added support for AMD's syscall instruction to Solaris, but we were using a slightly older version of our assembler which (embarrassingly enough) didn't yet recognize the instruction, so its opcode had to be manually hard-coded into libc.
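Notes:
17. Because the assembler used for development was slightly out of date and did not yet recognize the AMD Opteron syscall instruction, the __SYSCALL macro uses the instruction's machine code directly.
The OpenSolaris SYS.h also contains the implementation that supports the new int $0x91:
#define __SYSCALLINT(name) /* CSTYLED */ movl $SYS_/**/name, %eax; int $T_SYSCALLINT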
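To make the hard-coded opcode concrete, the Python sketch below assembles the two mkdir wrappers by hand. It is an illustration under stated assumptions: the helper names are hypothetical, and the encodings used are the standard 32-bit x86 ones (0xB8 plus a 32-bit immediate for a move into %eax, 0x0F 0x05 for syscall, 0x9A plus a 32-bit offset and 16-bit selector for a far call). The resulting lengths line up with the offsets in the mdb listings earlier in this post.

```python
import struct


def encode_movl_eax(imm32):
    """movl $imm32, %eax -> opcode 0xB8 followed by a little-endian imm32."""
    return b"\xb8" + struct.pack("<I", imm32)


def encode_syscall():
    """syscall -> the two bytes hard-coded in __SYSCALL: 0x0f, 0x05."""
    return bytes([0x0F, 0x05])


def encode_lcall(selector, offset=0):
    """lcall $selector, $offset -> 0x9A, 32-bit offset, 16-bit selector."""
    return b"\x9a" + struct.pack("<IH", offset, selector)


SYS_mkdir = 80  # 0x50, from syscall.h as quoted in the notes

# Bytes up to (not including) the "jb __cerror" that follows in libc.
hwcap2_wrapper = encode_movl_eax(SYS_mkdir) + encode_syscall()
base_wrapper = encode_movl_eax(SYS_mkdir) + encode_lcall(0x27)

if __name__ == "__main__":
    print(hwcap2_wrapper.hex(), len(hwcap2_wrapper))
    print(base_wrapper.hex(), len(base_wrapper))
```

The first sequence is 7 bytes long, which is why the jb sits at mkdir+7 in the hwcap2 disassembly; the second is 12 bytes, matching mkdir+0xc in the unoptimized one.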
Jumping Over the Fence
That's all for userland; the easy part is over. Because the actual workings of the differing system call instructions vary widely, the kernel uses separate code paths to deal with each. The function entry points used are (shown are only those for 32-bit applications making system calls):
Kernel          Entry Instruction   Kernel Entry Point
64-bit kernel   lcall*
                syscall             sys_syscall32
                sysenter
32-bit kernel   lcall
                sysenter
* In the 64-bit kernel, 32-bit system calls made via lcall come in to the system via a segment-not-present trap (#np), a matter which is beyond the scope of this document. Trust me, you don't want to get into segmentation now...
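Notes:
18. The table above lists only the kernel entry points for system calls made by 32-bit applications. To support the different system call instructions, the kernel implements a separate handler code path for each.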
Seeing it in Action
Using the kernel debugger we can step out of the classroom and watch these creatures in their native wild habitats. Boot a machine and from the system console get the kernel debugger loaded and ready. Enter the debugger, and then set a breakpoint on the syscall entry point. I'm still using the same Opteron machine as above (running the 64-bit kernel), so I need to re-mount the hwcap library:
root> mount -O -F lofs /usr/lib/libc/libc_hwcap2.so.1 /lib/libc.so.1
Notes:
19. Because the author unmounted the hwcap2 libc earlier, it has to be mounted on /lib/libc.so.1 again before the hwcap2 version can be used.
root> mdb -K
Welcome to kmdb
Loaded modules: [ cpc ptm ufs unix krtld sppp nca lofs genunix ip logindmux usba
specfs nfs random sctp ]
[0]> sys_syscall32:b
[0]> :c
kmdb: stop at sys_syscall32
kmdb: target stopped at:
sys_syscall32: swapgs
[1]> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc230a0 1b 0 0 60 no no t-0 ffffffff82b38520 fsflush
1 ffffffff8bdd1800 1b 0 0 49 no no t-0 ffffffff8cc991e0 ksh
We set a breakpoint, and tripped over it immediately after continuing (because system calls are a very common occurrence on even an idle machine). We can see that CPU1 tripped the breakpoint first (as evidenced by the [1] in the kmdb prompt), and that ksh is the process running. Which system call is the shell making?
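Notes:
20. The author uses mdb -K to enter kmdb, sets a kernel breakpoint on sys_syscall32, the 64-bit kernel's entry point for system calls from 32-bit applications (see the table earlier), and then uses :c to resume the kernel:
[0]> sys_syscall32:b ; set the breakpoint
[0]> :c ; resume execution
21. Even on an idle machine system calls happen very frequently, so CPU1 reaches the breakpoint almost at once. The kmdb prompt [1] shows that the debugger stopped on CPU1, and ::cpuinfo shows that the user process ksh is running on CPU1.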
Remember that the libc wrapper function stashed the system call number in register %eax. When we are in the 64-bit kernel, %eax is the lower 32 bits of register %rax:
[1]> <rax=D
98
Notes:
22. The libc wrapper stores the call number in eax. On the Opteron, eax is the low 32 bits of rax, so we inspect rax directly and print it in decimal.
That's syscall 98, which, according to the sysent table (see sysent.c), is the shell doing a sigaction(2) (which makes sense, because shells are always messing around with signals).
Notes:
23. System call 98 is sigaction(2), which makes sense because shells work with signals all the time.
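The step from a raw %rax value to a system call name can be sketched as follows. This is a minimal Python illustration, not kernel code: SYSCALL_NAMES and syscall_from_rax are hypothetical names, and the dictionary holds only the three call numbers that appear in this walkthrough, hand-copied rather than read from the kernel's sysent table.

```python
# System call numbers that appear in this walkthrough. The authoritative
# table is the kernel's sysent array; this dict is only an illustration.
SYSCALL_NAMES = {80: "mkdir", 98: "sigaction", 115: "mmap"}


def syscall_from_rax(rax):
    """Return (number, name) for the call number in the low 32 bits of %rax."""
    eax = rax & 0xFFFFFFFF  # %eax is the low half of %rax
    return eax, SYSCALL_NAMES.get(eax, "unknown")


if __name__ == "__main__":
    print(syscall_from_rax(98))
    # The upper 32 bits are ignored, whatever they happen to contain.
    print(syscall_from_rax(0xDEADBEEF00000050))
```

Masking before the lookup mirrors what the wrapper does: it writes only %eax, so only the low 32 bits carry the call number.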
Clear the breakpoint and try the same thing with the 64-bit entry point (it is sys_syscall()), but this time enter the debugger by sending a break over the console (how one does this varies depending on the terminal being used to access the console):
[1]> :z
[1]> sys_syscall:b
[1]> :c
root>
root>
root>
Notes:
24. Clear the previous breakpoint, set a breakpoint on sys_syscall, the 64-bit kernel's entry point for system calls from 64-bit applications, and resume.
Because this is an otherwise idle machine, nothing trips the 64-bit syscall breakpoint just yet. There just aren't very many 64-bit processes running. We can run one manually to trigger the breakpoint:
root> /usr/bin/amd64/ls
kmdb: stop at sys_syscall
kmdb: target stopped at:
sys_syscall: swapgs
[1]> <rax=D
115
We see that the first 64-bit system call made by the 64-bit ls is mmap(2), which makes sense because the 64-bit dynamic linker needs to begin setting up the new process's address space.
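Notes:
25. Because the machine is idle and few 64-bit applications are running, nothing hits the breakpoint after resuming. The author therefore runs the 64-bit ls by hand to trigger it. The system call number is then that of mmap(2), which also makes sense: when a program starts, the 64-bit dynamic linker first uses mmap to set up the new process's address space.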