This chapter describes processes and threads in NetBSD. This includes process startup, traps and system calls, process and thread creation and termination, signal delivery, and thread scheduling.
CAUTION! This chapter is a work in progress: it has not been reviewed yet, neither for typos nor for technical mistakes.
On Unix systems, new programs are started using the execve system call. If successful, execve replaces the currently-executing program with a new one. This is done within the same process, by reinitializing the whole virtual memory mapping and loading the new program binary in memory. All of the process's threads except the calling one are terminated, and the calling thread's CPU context is reset to execute the new program's startup.
Here is the execve prototype:

```c
int execve(const char *path, char *const argv[], char *const envp[]);
```
path is the filesystem path to the new executable. argv and envp are two NULL-terminated string arrays that hold the new program's arguments and environment variables. execve is responsible for copying these arrays to the new process stack.
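As a userland illustration of the call (not kernel code), a program typically pairs fork with execve so that the parent survives the replacement; the helper name run_echo and the use of /bin/echo are this example's own choices:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run /bin/echo in a child process via fork + execve and
 * return the child's exit status (0 on success). */
int
run_echo(void)
{
	char *const argv[] = { "echo", "hello", NULL };
	char *const envp[] = { "GREETING=hello", NULL };
	pid_t pid = fork();

	if (pid == 0) {
		/* Child: replace this program with /bin/echo. */
		execve("/bin/echo", argv, envp);
		/* Only reached if execve failed. */
		_exit(127);
	}
	int status;
	waitpid(pid, &status, 0);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```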
Here is the top-down modular diagram for the execve implementation in the NetBSD kernel when executing a native 32-bit ELF binary on an i386 machine:

```
src/sys/kern/kern_exec.c: sys_execve
  src/sys/kern/kern_exec.c: execve1
    src/sys/kern/kern_exec.c: check_exec
      src/sys/kern/kern_verifiedexec.c: veriexec_verify
      src/sys/kern/exec_conf.c: *execsw[]->es_makecmds
        src/sys/kern/exec_elf32.c: exec_elf_makecmds
          src/sys/kern/exec_elf32.c: exec_check_header
          src/sys/kern/exec_elf32.c: exec_read_from
          src/sys/kern/exec_conf.c: *execsw[]->u.elf_probe_func
            src/sys/kern/exec_elf32.c: netbsd_elf_probe
          src/sys/kern/exec_elf32.c: elf_load_psection
          src/sys/kern/exec_elf32.c: elf_load_file
          src/sys/kern/exec_conf.c: *execsw[]->es_setup_stack
            src/sys/kern/exec_subr.c: exec_setup_stack
    *fetch_element
      src/sys/kern/kern_exec.c: execve_fetch_element
    *vcp->ev_proc
      src/sys/kern/exec_subr.c: vmcmd_map_zero
      src/sys/kern/exec_subr.c: vmcmd_map_pagedvn
      src/sys/kern/exec_subr.c: vmcmd_map_readvn
      src/sys/kern/exec_subr.c: vmcmd_readvn
    src/sys/kern/exec_conf.c: *execsw[]->es_copyargs
      src/sys/kern/kern_exec.c: copyargs
    src/sys/kern/kern_clock.c: stopprofclock
    src/sys/kern/kern_descrip.c: fdcloseexec
    src/sys/kern/kern_sig.c: execsigs
    src/sys/kern/kern_ras.c: ras_purgeall
    src/sys/kern/exec_subr.c: doexechooks
    src/sys/sys/event.h: KNOTE
      src/sys/kern/kern_event.c: knote
    src/sys/kern/exec_conf.c: *execsw[]->es_setregs
      src/sys/arch/i386/i386/machdep.c: setregs
    src/sys/kern/kern_exec.c: exec_sigcode_map
    src/sys/kern/kern_exec.c: *p->p_emul->e_proc_exit (NULL)
    src/sys/kern/kern_exec.c: *p->p_emul->e_proc_exec (NULL)
```
execve calls execve1 with a pointer to a function called fetch_element, which is responsible for loading program arguments and environment variables into kernel space. The primary reason for this abstraction is to allow fetching pointers from a 32-bit process on a 64-bit system.
execve1 uses a variable of type struct exec_package (defined in src/sys/sys/exec.h) to share various information with the called functions.
The es_makecmds method is responsible for checking whether the program can be loaded, and for building a set of virtual memory commands (vmcmds) that can be used later to set up the virtual memory space and to load the program code and data sections. The set of vmcmds is stored in the ep_vmcmds field of the exec package. Using this vmcmd set allows the exec operation to be cancelled before a commitment point.
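The vmcmd pattern — record the work as data first, then run the whole set or discard it untouched — can be sketched in plain userland C. All names here (toy_vmcmd, toy_map_zero, run_vmcmds, demo_vmcmds) are invented for the sketch and only echo the kernel's field names:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of NetBSD's vmcmd set: each command carries a method
 * pointer (ev_proc in struct exec_vmcmd) plus its arguments. */
struct toy_vmcmd {
	int (*ev_proc)(struct toy_vmcmd *);
	long ev_addr;
	long ev_len;
};

static long mapped_bytes;	/* What our fake "method" acts on. */

static int
toy_map_zero(struct toy_vmcmd *cmd)
{
	mapped_bytes += cmd->ev_len;	/* Pretend to map zero-fill. */
	return 0;
}

/* Run every command in order; stop on the first error. The caller
 * may instead drop the array before this "commit" step. */
int
run_vmcmds(struct toy_vmcmd *cmds, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int error = cmds[i].ev_proc(&cmds[i]);
		if (error != 0)
			return error;
	}
	return 0;
}

/* Build a two-command set, "commit" it, and return bytes mapped. */
long
demo_vmcmds(void)
{
	struct toy_vmcmd cmds[] = {
		{ toy_map_zero, 0x1000, 4096 },
		{ toy_map_zero, 0x2000, 8192 },
	};
	mapped_bytes = 0;
	if (run_vmcmds(cmds, 2) != 0)
		return -1;
	return mapped_bytes;
}
```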
The exec switch is an array of struct execsw entries, defined in src/sys/kern/exec_conf.c: execsw[]. The struct execsw itself is defined in src/sys/sys/exec.h. Each entry in the exec switch is written for a given executable format and a given kernel ABI. It contains test methods to check whether a binary fits the format and ABI, and the methods to load it and start it up if it does. Various methods called within the execve code path can be found here.
Table 3.1. struct execsw fields summary

Field name | Description
---|---
es_hdrsz | The size of the executable format header
es_makecmds | A method that checks whether the program can be executed and, if it can, creates the vmcmds required to set up the virtual memory space (this includes loading the executable code and data sections).
u.elf_probe_func, u.ecoff_probe_func, u.macho_probe_func | Executable probe method, used by the es_makecmds method to check whether the binary can be executed. The u field is a union that contains probe methods for the ELF, ECOFF and Mach-O formats.
es_emul | The struct emul used for handling different kernel ABIs. It is covered in detail in Section 3.2.3, “Multiple kernel ABI support with the emul switch”.
es_prio | A priority level for this exec switch entry. This field helps choose the order in which exec switch entries are tested.
es_arglen | XXX ?
es_copyargs | Method used to copy the new program's arguments and environment to user space.
es_setregs | Machine-dependent method used to set up the initial process CPU registers.
es_coredump | Method used to produce a core dump from the process.
es_setup_stack | Method called by es_makecmds to produce the set of vmcmds for setting up the new process stack.
execve1 iterates over the exec switch entries, ordered by es_prio, and calls the es_makecmds method of each entry until it gets a match. The es_makecmds method fills the exec package's ep_vmcmds field with vmcmds that will be used later to set up the new process virtual memory space. See Section 3.1.3.2, “Virtual memory space setup commands (vmcmds)” for details about the vmcmds.
The executable format probe is called by the es_makecmds method. Its job is simply to check whether the executable binary can be handled by this exec switch entry. It can check a signature in the binary (e.g., an ELF note section), the name of the dynamic linker embedded in the binary, and so on. Some probe functions act as wildcards and are used as a last resort, with the help of the es_prio field. This is the case for the native 32-bit ELF entry, for instance.
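A minimal userland sketch of this priority-ordered probing might look as follows; the structures, the probe logic, and the fixed priority range are simplified stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of execve1's walk over the exec switch: entries are
 * tried in priority order and the first matching probe wins. */
struct toy_execsw {
	const char *es_name;	/* For the demo's benefit. */
	int es_prio;		/* Lower value = tried first. */
	int (*es_probe)(const unsigned char *hdr);
};

static int
probe_elf(const unsigned char *hdr)
{
	return memcmp(hdr, "\177ELF", 4) == 0;	/* ELF magic */
}

static int
probe_any(const unsigned char *hdr)
{
	(void)hdr;
	return 1;	/* Wildcard entry, used as last resort. */
}

/* Return the name of the first entry (in priority order) whose
 * probe accepts the header, or NULL if none matches. */
const char *
match_execsw(const struct toy_execsw *sw, size_t n,
    const unsigned char *hdr)
{
	for (int prio = 0; prio < 10; prio++)
		for (size_t i = 0; i < n; i++)
			if (sw[i].es_prio == prio && sw[i].es_probe(hdr))
				return sw[i].es_name;
	return NULL;
}

/* Two-entry switch: a specific ELF entry and a wildcard. */
const char *
demo_match(const unsigned char *hdr)
{
	static const struct toy_execsw sw[] = {
		{ "wildcard", 9, probe_any },
		{ "elf", 0, probe_elf },
	};
	return match_execsw(sw, 2, hdr);
}
```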
Vmcmds are stored in an array of struct exec_vmcmd (defined in src/sys/sys/exec.h) in the ep_vmcmds field of the exec package, until execve1 decides to either execute or destroy them. struct exec_vmcmd defines, in its ev_proc field, a pointer to the method that will perform the command. The other fields are used to store the method's arguments.
Four methods are available in src/sys/kern/exec_subr.c:

Table 3.2. vmcmd methods

Name | Description
---|---
vmcmd_map_pagedvn | Map memory from a vnode. Appropriate for handling demand-paged text and data segments.
vmcmd_map_readvn | Read memory from a vnode. Appropriate for handling non-demand-paged text/data segments, i.e., impure objects (a la OMAGIC and NMAGIC).
vmcmd_readvn | XXX ?
vmcmd_map_zero | Map a region of zero-filled memory.
Vmcmds are created using new_vmcmd, and can be destroyed using kill_vmcmd.
The es_setup_stack field of the exec switch holds a pointer to the method in charge of generating the vmcmds for setting up the stack space. Filling the stack with arguments and environment is done later, by the es_copyargs method.
For native ELF binaries, netbsd32_elf32_copyargs (obtained by a macro from the elf_copyargs method in src/sys/kern/exec_elf32.c) is used. It calls copyargs (from src/sys/kern/kern_exec.c) for the part of the job that is not specific to ELF.
copyargs has to copy the argument and environment strings from the kernel copy (in the exec package) back to the new process stack in userland. Then the arrays of pointers to the strings are reconstructed, and finally the pointers to the arrays, and the argument count, are copied to the top of the stack. The new program's stack pointer will be set to point to the argument count, followed by the argument array pointer, as expected by any ANSI program.
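The layout copyargs produces can be imitated in a toy buffer. This sketch ignores alignment, the auxiliary table, and the fact that the real copy goes through copyout; struct toy_stack and the function names are invented for the illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXCELLS 32
#define STRSPACE 256

/* One flat "stack": pointer-sized cells (argc, argv[], NULL,
 * envp[], NULL) followed by the string data, as on the real
 * user stack after copyargs has run. */
struct toy_stack {
	long cells[MAXCELLS];
	char strspace[STRSPACE];
	size_t stroff;
};

/* Copy string s into the stack's string area, return its address. */
static char *
push_string(struct toy_stack *st, const char *s)
{
	char *dst = st->strspace + st->stroff;
	size_t len = strlen(s) + 1;
	memcpy(dst, s, len);
	st->stroff += len;
	return dst;
}

/* Lay out argc + argv[] + envp[] the way copyargs does. */
void
toy_copyargs(struct toy_stack *st, char *const argv[], char *const envp[])
{
	size_t cell = 0, argc = 0;

	st->stroff = 0;
	while (argv[argc] != NULL)	/* argc sits at the stack top */
		argc++;
	st->cells[cell++] = (long)argc;
	for (size_t i = 0; i < argc; i++)
		st->cells[cell++] = (long)push_string(st, argv[i]);
	st->cells[cell++] = 0;		/* argv[] terminator */
	for (size_t i = 0; envp[i] != NULL; i++)
		st->cells[cell++] = (long)push_string(st, envp[i]);
	st->cells[cell++] = 0;		/* envp[] terminator */
}
```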
Dynamic ELF executables are special: they need a structure called the ELF auxiliary table to be copied onto the stack. The table is an array of key/value pairs for various items such as the address of the ELF header in user memory, the page size, or the entry point of the ELF executable.
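A program can inspect its own auxiliary table. The sketch below uses glibc's getauxval, which is a Linux-specific convenience (NetBSD programs would instead locate the table past the end of envp[]):

```c
#include <assert.h>
#include <sys/auxv.h>	/* getauxval: glibc-specific, not NetBSD */
#include <unistd.h>

/* Read one entry of the ELF auxiliary table the kernel copied
 * onto our stack at exec time: the page size (AT_PAGESZ). */
long
aux_pagesize(void)
{
	return (long)getauxval(AT_PAGESZ);
}
```

The value read back from the auxiliary table must agree with what sysconf(_SC_PAGESIZE) reports, since both describe the same kernel-provided fact.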
Note that when starting a dynamic ELF executable, the ELF loader (also known as the interpreter: /usr/libexec/ld.elf_so) is loaded along with the executable by the kernel. The ELF loader is started by the kernel and is responsible for starting the executable itself afterwards.
es_setregs is a machine-dependent method responsible for setting up the initial process CPU registers. On any machine, the method has to set the registers holding the instruction pointer, the stack pointer and the machine state. Some ports need more work (for instance, i386 also sets up the segment registers and the Local Descriptor Table). The CPU registers are stored in a struct trapframe, available from struct lwp.
After execve has finished its work, the new process is ready to run. It is placed on the run queue and will be picked up by the scheduler when appropriate. From the scheduler's point of view, starting or resuming a process execution is the same operation: returning to userland. This involves switching to the process's virtual memory space and loading the process's CPU registers. By loading the machine state register with the system bit cleared, kernel privileges are dropped.
XXX details
When the processor encounters an exception (memory fault, division by zero, system call instruction, and so on), it executes a trap: control is transferred to the kernel, and after some assembly routines in locore.S, the CPU drops into syscall_plain (from src/sys/arch/i386/i386/syscall.c on i386) for system calls, or into the trap function (from src/sys/arch/i386/i386/trap.c on i386) for other traps. There is also a syscall_fancy system call handler, which is only used when the process is being traced by ktrace.
The struct emul is defined in src/sys/sys/proc.h. It defines various methods and parameters for handling system calls and traps. Each kernel ABI supported by the NetBSD kernel has its own struct emul. For instance, the Linux ABI defines emul_linux in src/sys/compat/linux/common/linux_exec.c, and the native ABI defines emul_netbsd in src/sys/kern/kern_exec.c.
The struct emul for the current ABI is obtained from the es_emul field of the exec switch entry that was selected by execve. The kernel holds a pointer to it in the process's struct proc (defined in src/sys/sys/proc.h).
Most importantly, the struct emul defines the system call handler function, and the system call table.
Each kernel ABI has its own system call table. The table maps system call numbers to the functions implementing the system calls in the kernel (e.g., system call number 2 is fork). The native system call table can be found in src/sys/kern/syscalls.master.
This file is not written in C. After any change, it must be processed using the Makefile available in the same directory. syscalls.master processing is controlled by the configuration found in syscalls.conf, and it outputs several files:
Table 3.3. Files produced from syscalls.master

File name | Description
---|---
syscallargs.h | Defines the system call argument structures, used to pass data from the system call handler function to the functions implementing the system calls.
syscalls.c | An array of strings containing the names of the system calls.
syscall.h | Preprocessor defines for each system call name and number.
sysent.c | An array containing, for each system call, an entry with the number of arguments, the size of the system call argument structure, and a pointer to the function that implements the system call in the kernel.
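The sysent.c idea — a table indexed by system call number, each slot holding an argument count and a function pointer — can be sketched in userland C. The type toy_register_t, the table contents, and the TOY_ENOSYS error value are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

typedef long toy_register_t;	/* stands in for register_t */

#define TOY_ENOSYS 78		/* "no such syscall" error code */

/* Toy model of a sysent entry: argument count plus the
 * implementing function. */
struct toy_sysent {
	int sy_narg;
	int (*sy_call)(void *args, toy_register_t *retval);
};

static int
toy_getpid(void *args, toy_register_t *retval)
{
	(void)args;
	*retval = 42;	/* Pretend our pid is 42. */
	return 0;
}

/* The table itself: slot 7 is our fake getpid, everything
 * else is unimplemented. */
static const struct toy_sysent toy_sysent_table[] = {
	[7] = { 0, toy_getpid },
};

/* Dispatch: look the number up and call the implementation,
 * as the syscall handler does with the ABI's table. */
int
toy_syscall(int code, void *args, toy_register_t *retval)
{
	size_t n = sizeof(toy_sysent_table) / sizeof(toy_sysent_table[0]);

	if (code < 0 || (size_t)code >= n ||
	    toy_sysent_table[code].sy_call == NULL)
		return TOY_ENOSYS;
	return toy_sysent_table[code].sy_call(args, retval);
}
```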
In order to avoid namespace collisions, non-native ABIs have a syscalls.conf defining output file names prefixed by tags (e.g., linux_ for the Linux ABI).
System call argument structures (syscallargs for short) are always used to pass arguments to the functions implementing the system calls. Each system call has its own syscallarg structure. This encapsulation layer is there to hide endianness differences.
All functions implementing system calls have the same prototype:

```c
int syscall(struct lwp *l, void *v, register_t *retval);
```
l is the struct lwp for the calling thread, v is the syscallarg structure pointer, and retval is a pointer to the return value.
When executing 32-bit binaries on a 64-bit system, care must be taken to use only addresses below 4 GB. This is a problem at process creation time, when the stack and heap are allocated, but also on each system call, where 32-bit pointers handled by the 32-bit process are manipulated by the 64-bit kernel.
For a kernel built as a 64-bit binary, a 32-bit pointer is not something that makes sense: pointers can only be 64 bits long. This is why 32-bit pointers are defined as a u_int32_t synonym called netbsd32_pointer_t (in src/sys/compat/netbsd32/netbsd32.h). For copyin and copyout, true 64-bit pointers are required. They are obtained by casting the netbsd32_pointer_t through the NETBSD32PTR64 macro.
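The idea can be sketched as follows; the toy_ names mirror NetBSD's netbsd32_pointer_t and NETBSD32PTR64, but the definitions are this sketch's own, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* In a 64-bit kernel, a 32-bit userland pointer is stored as a
 * plain 32-bit integer and widened back through a macro before
 * being handed to copyin or copyout. */
typedef uint32_t toy_netbsd32_pointer_t;

/* Widen: 32-bit integer -> uintptr_t -> real pointer. */
#define TOY_NETBSD32PTR64(p32) ((void *)(uintptr_t)(p32))

/* A 32-bit process's iovec as seen by the 64-bit kernel:
 * its pointer member shrinks to 4 bytes. */
struct toy_netbsd32_iovec {
	toy_netbsd32_pointer_t iov_base;	/* 4 bytes, not 8 */
	uint32_t iov_len;
};
```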
Most of the time, implementing a 32-bit system call is just a matter of casting pointers and calling the 64-bit version of the system call. An example of such a situation can be found in src/sys/compat/netbsd32/netbsd32_time.c: netbsd32_timer_delete. Provided that the 32-bit system call argument structure pointer is called uap, and the 64-bit one is called ua, helper macros called NETBSD32TO64_UAP, NETBSD32TOP_UAP, NETBSD32TOX_UAP, and NETBSD32TOX64_UAP can be used. The sources in src/sys/compat/netbsd32 provide multiple examples.
For each kernel ABI, the struct emul defines a machine-dependent sendsig function, which is responsible for altering the process user context so that it calls a signal handler. sendsig builds a stack frame containing the CPU registers as they were before the signal handler invocation. The CPU registers are altered so that, on return to userland, the process executes the signal handler and has its stack pointer set to the new stack frame.
If requested at sigaction call time, sendsig will also add a struct siginfo to the stack frame.
Last but not least, sendsig may copy to the stack a small piece of assembly code involved in signal cleanup, called the signal trampoline. This is detailed in the next section. Note that modern NetBSD native programs do not use this feature anymore: it is only used for older programs and for the emulation of other OSes.
Once the signal handler returns, the kernel must destroy the signal handler context and restore the previous process state. This can be achieved in two ways.
First method, using the kernel-provided signal trampoline: sendsig has copied the signal trampoline onto the stack and has prepared the stack and/or CPU registers so that the signal handler returns to the signal trampoline. The job of the signal trampoline is to call the sigreturn or setcontext system call, handing it a pointer to the CPU registers saved on the stack. This restores the CPU registers to their values from before the signal handler invocation, and the next time the process returns to userland, it will resume its execution where it stopped.
The native signal trampoline for i386 is called sigcode and can be found in src/sys/arch/i386/i386/locore.S. Each emulated ABI has its own signal trampoline, which can be quite close to the native one, except, usually, for the sigreturn system call number.
The second method is to use a signal trampoline provided by libc. This is what modern NetBSD native programs do. At the time the sigaction system call is invoked, the libc stub hands the kernel a pointer to a signal trampoline in libc, which is in charge of calling setcontext. sendsig will use that pointer as the return address for the signal handler. This method is better than the previous one because it removes the need for an executable stack page where the signal trampoline is stored. The trampoline is now stored in the code segment of libc. For instance, on i386, the signal trampoline is named __sigtramp_siginfo_2 and can be found in src/lib/libc/arch/i386/sys/__sigtramp2.S.