Chapter 3. Processes and threads

3.1. Process startup

3.1.1. `execve` usage

On Unix systems, new programs are started using the execve system call. If successful, execve replaces the currently executing program with a new one. This is done within the same process, by reinitializing the whole virtual memory mapping and loading the new program binary into memory. All of the process's threads except the calling one are terminated, and the calling thread's CPU context is reset to execute the new program's startup code.

Here is the execve prototype:

`int execve(const char *path, char *const argv[], char *const envp[]);`

path is the filesystem path to the new executable. argv and envp are two NULL-terminated string arrays that hold the new program arguments and environment variables. execve is responsible for copying the arrays to the new process stack.
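As an illustration of the calling convention, here is a minimal userland sketch. Because execve replaces the calling program, it is usually called in a child created with fork. The helper name run_program is hypothetical and is not part of libc; the exit status 127 for a failed execve is only a convention borrowed from the shell.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run `path` in a child process and return its exit status,
 * or -1 if the child could not be created or did not exit normally.
 * run_program is a hypothetical helper, not a libc function.
 */
int
run_program(const char *path, char *const argv[], char *const envp[])
{
	pid_t pid;
	int status;

	assert(path != NULL && argv != NULL && envp != NULL);

	pid = fork();
	if (pid == -1)
		return -1;

	if (pid == 0) {
		/* Child: on success execve never returns. */
		execve(path, argv, envp);
		_exit(127);	/* only reached if execve failed */
	}

	/* Parent: wait for the program started by the child. */
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Both argv and envp must end with a NULL entry, and argv[0] conventionally holds the program name.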

3.1.2. Overview of in-kernel `execve` code path

Here is the top-down modular diagram for the execve implementation in the NetBSD kernel when executing a native 32 bit ELF binary on an i386 machine:

src/sys/kern/kern_exec.c: sys_execve
- src/sys/kern/kern_exec.c: execve1
  - src/sys/kern/kern_exec.c: check_exec
    
    src/sys/kern/kern_verifiedexec.c: veriexec_verify
    
    src/sys/kern/exec_conf.c: *execsw[]->es_makecmds
    
    src/sys/kern/exec_elf32.c: exec_elf_makecmds
    
    src/sys/kern/exec_elf32.c: exec_check_header
    
    src/sys/kern/exec_elf32.c: exec_read_from
    
    src/sys/kern/exec_conf.c: *execsw[]->u.elf_probe_func
    
    src/sys/kern/exec_elf32.c: netbsd_elf_probe
    
    src/sys/kern/exec_elf32.c: elf_load_psection
    
    src/sys/kern/exec_elf32.c: elf_load_file
    
    src/sys/kern/exec_conf.c: *execsw[]->es_setup_stack
    
    src/sys/kern/exec_subr.c: exec_setup_stack
  - *fetch_element
    
    src/sys/kern/kern_exec.c: execve_fetch_element
  - *vcp->ev_proc
    
    src/sys/kern/exec_subr.c: vmcmd_map_zero
    
    src/sys/kern/exec_subr.c: vmcmd_map_pagedvn
    
    src/sys/kern/exec_subr.c: vmcmd_map_readvn
    
    src/sys/kern/exec_subr.c: vmcmd_readvn
  - src/sys/kern/exec_conf.c: *execsw[]->es_copyargs
    
    src/sys/kern/kern_exec.c: copyargs
  - src/sys/kern/kern_clock.c: stopprofclock
  - src/sys/kern/kern_descrip.c: fdcloseexec
  - src/sys/kern/kern_sig.c: execsigs
  - src/sys/kern/kern_ras.c: ras_purgeall
  - src/sys/kern/exec_subr.c: doexechooks
  - src/sys/sys/event.h: KNOTE
    
    src/sys/kern/kern_event.c: knote
  - src/sys/kern/exec_conf.c: *execsw[]->es_setregs
    
    src/sys/arch/i386/i386/machdep.c: setregs
  - src/sys/kern/kern_exec.c: exec_sigcode_map
  - src/sys/kern/kern_exec.c: *p->p_emul->e_proc_exit (NULL)
  - src/sys/kern/kern_exec.c: *p->p_emul->e_proc_exec (NULL)

execve calls execve1 with a pointer to a function called fetch_element, which is responsible for loading program arguments and environment variables into kernel space. The primary reason for this abstraction is to allow fetching pointers from a 32 bit process on a 64 bit system.

execve1 uses a variable of type struct exec_package (defined in src/sys/sys/exec.h) to share various pieces of information with the called functions.

The es_makecmds method is responsible for checking whether the program can be loaded, and for building a set of virtual memory commands (vmcmds) that can be used later to set up the virtual memory space and to load the program's code and data sections. The set of vmcmds is stored in the ep_vmcmds field of the exec package. Using this set of vmcmds allows the execution to be cancelled before a commitment point.

3.1.3. Multiple executable format support with the exec switch

The exec switch is an array of struct execsw, defined in src/sys/kern/exec_conf.c as execsw[]. struct execsw itself is defined in src/sys/sys/exec.h.

Each entry in the exec switch is written for a given executable format and a given kernel ABI. It contains test methods to check whether a binary fits the format and ABI, and methods to load it and start it up if it does. Several of the methods called in the execve code path are found here.

Table 3.1. struct execsw fields summary

Field name	Description
`es_hdrsz`	The size of the executable format header
`es_makecmds`	A method that checks whether the program can be executed and, if so, creates the vmcmds required to set up the virtual memory space (this includes loading the executable code and data sections).
`u.elf_probe_func` `u.ecoff_probe_func` `u.macho_probe_func`	Executable probe method, used by the `es_makecmds` method to check if the binary can be executed. The `u` field is a union that contains probe methods for the ELF, ECOFF and Mach-O formats
`es_emul`	The struct emul used for handling different kernel ABIs. It is covered in detail in Section 3.2.3, “Multiple kernel ABI support with the emul switch”.
`es_prio`	A priority level for this exec switch entry. This field helps choose the test order for exec switch entries
`es_arglen`	XXX ?
`es_copyargs`	Method used to copy the new program's arguments and environment variables into user space
`es_setregs`	Machine-dependent method used to set up the initial process CPU registers
`es_coredump`	Method used to produce a core from the process
`es_setup_stack`	Method called by `es_makecmds` to produce a set of vmcmds for setting up the new process stack.

execve1 iterates over the exec switch entries, using the es_prio field for ordering, and calls the es_makecmds method of each entry until it gets a match.

The es_makecmds method fills the exec package's ep_vmcmds field with the vmcmds that are used later to set up the new process's virtual memory space. See Section 3.1.3.2, “Virtual memory space setup commands (vmcmds)” for details about the vmcmds.
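The iteration over the exec switch can be pictured with the following userland sketch. It is only a model: struct pkg_sketch, struct execsw_sketch, the table and both makecmds functions are simplified, hypothetical stand-ins for the kernel structures, and the table is assumed to be already ordered by priority.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct exec_package: only the image header. */
struct pkg_sketch {
	const unsigned char *hdr;
	size_t hdrlen;
	const char *matched;	/* name of the entry that accepted the image */
};

/* Simplified stand-in for struct execsw. */
struct execsw_sketch {
	const char *name;
	int (*es_makecmds)(struct pkg_sketch *);	/* returns 0 on match */
};

static int
elf_makecmds_sketch(struct pkg_sketch *p)
{
	static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

	if (p->hdrlen < sizeof(magic) || memcmp(p->hdr, magic, sizeof(magic)) != 0)
		return -1;
	return 0;
}

static int
script_makecmds_sketch(struct pkg_sketch *p)
{
	if (p->hdrlen < 2 || p->hdr[0] != '#' || p->hdr[1] != '!')
		return -1;
	return 0;
}

/* Entries are assumed to be already sorted by priority. */
static const struct execsw_sketch execsw_table[] = {
	{ "elf", elf_makecmds_sketch },
	{ "script", script_makecmds_sketch },
};

/* Try each entry in turn; stop at the first one that accepts the image. */
int
check_exec_sketch(struct pkg_sketch *p)
{
	size_t i, n = sizeof(execsw_table) / sizeof(execsw_table[0]);

	for (i = 0; i < n; i++) {
		if (execsw_table[i].es_makecmds(p) == 0) {
			p->matched = execsw_table[i].name;
			return 0;
		}
	}
	return -1;	/* no entry accepted the image */
}
```

In this model a rejected image leaves the package untouched, which mirrors the idea that nothing is committed until an entry has matched.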

3.1.3.1. Executable format probe

The executable format probe is called by the es_makecmds method. Its job is simply to check whether the executable binary can be handled by this exec switch entry. It can check a signature in the binary (e.g. an ELF note section), the name of a dynamic linker embedded in the binary, and so on.

Some probe functions act as wildcards and accept any binary of their format. They are used as a last resort, with the help of the es_prio field. This is the case for the native ELF 32 bit entry, for instance.
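The interplay between specific probes, wildcard probes and the priority can be sketched as follows. Everything here is hypothetical: struct probe_entry, the numeric priorities (lower is tried first) and the "/lib/ld-linux" interpreter prefix are assumptions made for the example, not the kernel's actual tests.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct probe_entry {
	const char *abi;
	int prio;				/* lower value is tried earlier (assumption) */
	int (*probe)(const char *interp);	/* returns 0 if accepted */
};

/* Specific probe: accept only binaries asking for a Linux-style loader. */
int
linux_probe_sketch(const char *interp)
{
	static const char prefix[] = "/lib/ld-linux";

	return strncmp(interp, prefix, sizeof(prefix) - 1) == 0 ? 0 : -1;
}

/* Wildcard probe: accepts anything, so it must be tried last. */
int
native_probe_sketch(const char *interp)
{
	(void)interp;
	return 0;
}

static int
by_prio(const void *a, const void *b)
{
	const struct probe_entry *x = a, *y = b;

	return (x->prio > y->prio) - (x->prio < y->prio);
}

/* Sort the table by priority in place, then return the first matching ABI. */
const char *
select_abi(struct probe_entry *tab, size_t n, const char *interp)
{
	size_t i;

	qsort(tab, n, sizeof(tab[0]), by_prio);
	for (i = 0; i < n; i++)
		if (tab[i].probe(interp) == 0)
			return tab[i].abi;
	return NULL;
}
```

Without the ordering step, a wildcard entry listed first would shadow every more specific probe.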

3.1.3.2. Virtual memory space setup commands (vmcmds)

Vmcmds are stored in an array of struct exec_vmcmd (defined in src/sys/sys/exec.h) in the ep_vmcmds field of the exec package, before execve1 decides to execute or destroy them.

struct exec_vmcmd defines, in the ev_proc field, a pointer to the method that will perform the command. The other fields are used to store the method's arguments.

Four methods are available in src/sys/kern/exec_subr.c:

Table 3.2. vmcmd methods

Name	Description
`vmcmd_map_pagedvn`	Map memory from a vnode. Appropriate for handling demand-paged text and data segments.
`vmcmd_map_readvn`	Read memory from a vnode. Appropriate for handling non-demand-paged text/data segments, i.e. impure objects (a la OMAGIC and NMAGIC).
`vmcmd_readvn`	XXX ?
`vmcmd_map_zero`	Maps a region of zero-filled memory

Vmcmds are created using new_vmcmd, and can be destroyed using kill_vmcmd.
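The idea of recording commands first and executing or discarding them later can be modelled in userland as follows. This is a sketch under stated assumptions: the structures, the fixed-size set, the toy byte-array "address space" and every function name ending in _sketch are hypothetical and much simpler than the kernel's struct exec_vmcmd.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VMCMD_MAX 8

struct vmcmd_sketch;
typedef int (*vmcmd_proc_t)(unsigned char *mem, const struct vmcmd_sketch *cmd);

struct vmcmd_sketch {
	vmcmd_proc_t ev_proc;		/* method that performs the command */
	size_t ev_addr;			/* offset in the toy address space */
	size_t ev_len;			/* length of the region */
	const unsigned char *ev_src;	/* source bytes, stands in for a vnode */
};

struct vmcmd_set_sketch {
	size_t used;
	struct vmcmd_sketch cmds[VMCMD_MAX];
};

/* Fill a region with zeroes. */
int
map_zero_sketch(unsigned char *mem, const struct vmcmd_sketch *cmd)
{
	memset(mem + cmd->ev_addr, 0, cmd->ev_len);
	return 0;
}

/* Copy a region from the source object. */
int
readvn_sketch(unsigned char *mem, const struct vmcmd_sketch *cmd)
{
	memcpy(mem + cmd->ev_addr, cmd->ev_src, cmd->ev_len);
	return 0;
}

/* Record a command; nothing is executed yet. */
int
new_vmcmd_sketch(struct vmcmd_set_sketch *set, vmcmd_proc_t proc,
    size_t addr, size_t len, const unsigned char *src)
{
	struct vmcmd_sketch *cmd;

	if (set->used == VMCMD_MAX)
		return -1;
	cmd = &set->cmds[set->used++];
	cmd->ev_proc = proc;
	cmd->ev_addr = addr;
	cmd->ev_len = len;
	cmd->ev_src = src;
	return 0;
}

/* Discard all recorded commands without running them. */
void
kill_vmcmds_sketch(struct vmcmd_set_sketch *set)
{
	set->used = 0;
}

/* Run the recorded commands in order, then empty the set. */
int
run_vmcmds_sketch(struct vmcmd_set_sketch *set, unsigned char *mem)
{
	size_t i;
	int error;

	for (i = 0; i < set->used; i++) {
		error = set->cmds[i].ev_proc(mem, &set->cmds[i]);
		if (error != 0)
			return error;
	}
	kill_vmcmds_sketch(set);
	return 0;
}
```

Until run_vmcmds_sketch is called, the target memory is untouched, so giving up on the execution only requires discarding the set.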

3.1.3.3. Stack virtual memory space setup

The es_setup_stack field of the exec switch holds a pointer to the method in charge of generating the vmcmds for setting up the stack space. Filling the stack with arguments and environment is done later, by the es_copyargs method.

For native ELF binaries, netbsd32_elf32_copyargs (obtained through a macro from the elf_copyargs method in src/sys/kern/exec_elf32.c) is used. It calls copyargs (from src/sys/kern/kern_exec.c) for the part of the job that is not specific to ELF.

copyargs has to copy the argument and environment strings back from the kernel copy (in the exec package) to the new process stack in userland. Then the arrays of pointers to the strings are reconstructed, and finally the pointers to the arrays and the argument count are copied to the top of the stack. The new program's stack pointer is set to point to the argument count, followed by the argument array pointer, as expected by any ANSI program.
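The following userland sketch gives an idea of the kind of image that ends up on the stack. It is not the kernel's copyargs: the function name is hypothetical, the "stack" is an ordinary array of words, and the exact layout (the argument count, then the argv pointers and a NULL, then the envp pointers and a NULL, then the strings) is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Lay out argc, the argv and envp pointer arrays and their strings
 * starting at sp. Returns the number of words used, or 0 if the
 * `avail` words at sp are not enough.
 */
size_t
copyargs_sketch(uintptr_t *sp, size_t avail, char *const argv[],
    char *const envp[])
{
	size_t argc, envc, i, strbytes = 0, nptr, need;
	uintptr_t *w = sp;
	char *dst;

	for (argc = 0; argv[argc] != NULL; argc++)
		strbytes += strlen(argv[argc]) + 1;
	for (envc = 0; envp[envc] != NULL; envc++)
		strbytes += strlen(envp[envc]) + 1;

	/* argc, argv[] plus NULL, envp[] plus NULL */
	nptr = 1 + (argc + 1) + (envc + 1);
	need = nptr + (strbytes + sizeof(*sp) - 1) / sizeof(*sp);
	if (need > avail)
		return 0;

	dst = (char *)(sp + nptr);	/* strings live above the pointers */

	*w++ = argc;
	for (i = 0; i < argc; i++) {
		strcpy(dst, argv[i]);
		*w++ = (uintptr_t)dst;
		dst += strlen(argv[i]) + 1;
	}
	*w++ = 0;			/* argv terminator */
	for (i = 0; i < envc; i++) {
		strcpy(dst, envp[i]);
		*w++ = (uintptr_t)dst;
		dst += strlen(envp[i]) + 1;
	}
	*w++ = 0;			/* envp terminator */

	return need;
}
```

In the kernel the strings are copied out to user memory rather than written directly, and the recorded pointers are user addresses.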

Dynamic ELF executables are special: they need a structure called the ELF auxiliary table to be copied onto the stack. The table is an array of key and value pairs for various things such as the ELF header address in user memory, the page size, or the entry point of the ELF executable.

Note that when starting a dynamic ELF executable, the ELF loader (also known as the interpreter: /usr/libexec/ld.elf_so) is loaded with the executable by the kernel. The ELF loader is started by the kernel and is responsible for starting the executable itself afterwards.
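From the point of view of its consumer, the ELF auxiliary table mentioned above is a simple array of pairs ending with a null key. The sketch below shows a lookup over such an array; the structure, the function name and the AUX_ constants are hypothetical names chosen for this example (the numeric values follow the usual ELF convention for the null, program header, page size and entry point keys).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Keys used in this sketch; names are local to the example. */
enum {
	AUX_NULL = 0,	/* end of table */
	AUX_PHDR = 3,	/* address of the program headers */
	AUX_PAGESZ = 6,	/* page size */
	AUX_ENTRY = 9	/* entry point of the executable */
};

struct aux_sketch {
	uintptr_t a_type;	/* key */
	uintptr_t a_val;	/* value */
};

/* Return the value stored for `type`, or `fallback` if it is absent. */
uintptr_t
aux_lookup(const struct aux_sketch *aux, uintptr_t type, uintptr_t fallback)
{
	for (; aux->a_type != AUX_NULL; aux++)
		if (aux->a_type == type)
			return aux->a_val;
	return fallback;
}
```

A dynamic loader would typically walk this table once at startup to find where the executable was mapped and where to jump when it is done.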

3.1.3.4. Initial register setup

es_setregs is a machine-dependent method responsible for setting up the initial process CPU registers. On any machine, the method has to set the registers holding the instruction pointer, the stack pointer and the machine state. Some ports need more work (for instance, i386 sets up the segment registers and the Local Descriptor Table).

The CPU registers are stored in a struct trapframe, available from struct lwp.

3.1.3.5. Return to userland

After execve has finished its work, the new process is ready to run. It is available in the run queue and will be picked up by the scheduler when appropriate.

From the scheduler's point of view, starting or resuming a process execution is the same operation: returning to userland. This involves switching to the process's virtual memory space and loading the process's CPU registers. By loading the machine state register with the system bit off, kernel privileges are dropped.

XXX details

3.2. Traps and system calls

When the processor encounters an exception (memory fault, division by zero, system call instruction...), it executes a trap: control is transferred to the kernel and, after some assembly routines in locore.S, execution reaches syscall_plain (from src/sys/arch/i386/i386/syscall.c on i386) for system calls, or the trap function (from src/sys/arch/i386/i386/trap.c on i386) for other traps.

There is also a syscall_fancy system call handler which is only used when the process is being traced by ktrace.

3.2.1. Traps

XXX write me

3.2.2. System call implementation in libc

XXX write me

3.2.3. Multiple kernel ABI support with the emul switch

struct emul is defined in src/sys/sys/proc.h. It defines various methods and parameters to handle system calls and traps. Each kernel ABI supported by the NetBSD kernel has its own struct emul. For instance, the Linux ABI defines emul_linux in src/sys/compat/linux/common/linux_exec.c, and the native ABI defines emul_netbsd in src/sys/kern/kern_exec.c.

The struct emul for the current ABI is obtained from the es_emul field of the exec switch entry that was selected by execve. The kernel holds a pointer to it in the process' struct proc (defined in src/sys/sys/proc.h).

Most importantly, the struct emul defines the system call handler function, and the system call table.

3.2.4. The syscalls.master table

Each kernel ABI has a system call table. The table maps system call numbers to the functions implementing the system calls in the kernel (e.g. system call number 2 is fork). The native system call table can be found in src/sys/kern/syscalls.master.

This file is not written in C. After any change, it must be processed by the Makefile available in the same directory. The processing of syscalls.master is controlled by the configuration found in syscalls.conf, and it outputs several files:

Table 3.3. Files produced from syscalls.master

File name	Description
`syscallargs.h`	Defines the system call argument structures, used to pass data from the system call handler function to the function implementing the system call.
`syscalls.c`	An array of strings containing the names for the system calls
`syscall.h`	Preprocessor defines for each system call name and number
`sysent.c`	An array containing for each system call an entry with the number of arguments, the size of the system call arguments structure, and a pointer to the function that implements the system call in the kernel

In order to avoid namespace collisions, non-native ABIs have a syscalls.conf that defines output file names prefixed by a tag (e.g. linux_ for the Linux ABI).

System call argument structures (syscallarg for short) are always used to pass arguments to the functions implementing the system calls. Each system call has its own syscallarg structure. This encapsulation layer is here to hide endianness differences.

All functions implementing system calls have the same prototype:

`int syscall(struct lwp *l, void *v, register_t *retval);`

l is the struct lwp for the calling thread, v is the syscallarg structure pointer, and retval is a pointer to the return value.
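To make the calling convention concrete, here is a userland model of a system call table and its dispatch. It is a sketch, not kernel code: struct lwp_sketch, register_sketch_t, struct sysent_sketch, the sys_add_sketch call and its number are all hypothetical, and real argument structures are generated from syscalls.master rather than written by hand.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long register_sketch_t;		/* stands in for register_t */
struct lwp_sketch { int l_id; };	/* stands in for struct lwp */

typedef int (*sy_call_sketch_t)(struct lwp_sketch *, void *,
    register_sketch_t *);

/* One table entry: argument count, argument size, implementation. */
struct sysent_sketch {
	short sy_narg;
	short sy_argsize;
	sy_call_sketch_t sy_call;
};

/* Argument structure of a hypothetical "add" system call. */
struct sys_add_args_sketch {
	long a;
	long b;
};

/* Implementation: arguments arrive through v, the result goes to retval. */
int
sys_add_sketch(struct lwp_sketch *l, void *v, register_sketch_t *retval)
{
	struct sys_add_args_sketch *uap = v;

	(void)l;
	*retval = uap->a + uap->b;
	return 0;			/* 0 means success, otherwise an errno */
}

int
sys_nosys_sketch(struct lwp_sketch *l, void *v, register_sketch_t *retval)
{
	(void)l;
	(void)v;
	(void)retval;
	return ENOSYS;
}

static const struct sysent_sketch sysent_table[] = {
	{ 0, 0, sys_nosys_sketch },				/* 0 */
	{ 2, (short)sizeof(struct sys_add_args_sketch), sys_add_sketch },	/* 1 */
};

/* Handler: look up the system call number and call the implementation. */
int
syscall_sketch(struct lwp_sketch *l, size_t code, void *args,
    register_sketch_t *retval)
{
	if (code >= sizeof(sysent_table) / sizeof(sysent_table[0]))
		return ENOSYS;
	return sysent_table[code].sy_call(l, args, retval);
}
```

The function's own return value carries the error code, while the value handed back to the user program travels through retval.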

3.2.5. Managing 32 bit system calls on 64 bit systems

When executing 32 bit binaries on a 64 bit system, care must be taken to only use addresses below 4 GB. This is a problem at process creation, when the stack and heap are allocated, but also for each system call, where 32 bit pointers handled by the 32 bit process are manipulated by the 64 bit kernel.

For a kernel built as a 64 bit binary, a 32 bit pointer is not something that makes sense: pointers can only be 64 bits long. This is why 32 bit pointers are defined as a u_int32_t synonym called netbsd32_pointer_t (in src/sys/compat/netbsd32/netbsd32.h).

For copyin and copyout, true 64 bit pointers are required. They are obtained by casting the netbsd32_pointer_t through the NETBSD32PTR64 macro.

Most of the time, implementing a 32 bit system call is just a matter of casting pointers and calling the 64 bit version of the system call. An example of such a situation can be found in src/sys/compat/netbsd32/netbsd32_time.c: netbsd32_timer_delete. Provided that the 32 bit system call argument structure pointer is called uap, and the 64 bit one is called ua, the helper macros NETBSD32TO64_UAP, NETBSD32TOP_UAP, NETBSD32TOX_UAP, and NETBSD32TOX64_UAP can be used. Sources in src/sys/compat/netbsd32 provide multiple examples.
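The conversion step can be illustrated with a small standalone sketch. None of these names come from the NetBSD sources: ptr32_sketch_t plays the role of netbsd32_pointer_t, PTR32TO64_SKETCH plays the role of NETBSD32PTR64, and the two argument structures belong to a made-up system call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 32 bit user pointer as seen by a 64 bit kernel: just an integer. */
typedef uint32_t ptr32_sketch_t;

/* Widen a 32 bit pointer value into a real pointer. */
#define PTR32TO64_SKETCH(p32)	((void *)(uintptr_t)(p32))

/* Arguments as laid out by the 32 bit process. */
struct demo_args32 {
	int32_t fd;
	ptr32_sketch_t buf;
	uint32_t len;
};

/* Arguments as expected by the native implementation. */
struct demo_args {
	int fd;
	void *buf;
	size_t len;
};

/* Convert field by field; only the pointer needs special care. */
void
demo_args32_to_64(const struct demo_args32 *uap, struct demo_args *ua)
{
	ua->fd = uap->fd;
	ua->buf = PTR32TO64_SKETCH(uap->buf);
	ua->len = uap->len;
}
```

Once the native structure is filled in, the 32 bit entry point can simply call the native implementation with it.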

3.3. Processes and threads creation

3.3.1. `fork`, `clone`, and `pthread_create` usage

XXX write me

3.3.2. Overview of `fork` code path

XXX write me

3.3.3. Overview of `pthread_create` code path

XXX write me

3.4. Processes and threads termination

3.4.1. `exit`, and `pthread_exit` usage

XXX write me

3.4.2. Overview of `exit` code path

XXX write me

3.4.3. Overview of `pthread_exit` code path

XXX write me

3.5. Signal delivery

3.5.1. Deciding what to do with a signal

XXX write me

3.5.2. The `sendsig` function

For each kernel ABI, struct emul defines a machine-dependent sendsig function, which is responsible for altering the process user context so that it calls a signal handler.

sendsig builds a stack frame containing the CPU registers as they were before the signal handler invocation. The CPU registers are then altered so that, on return to userland, the process executes the signal handler and has its stack pointer set to the new stack frame.

If requested at sigaction call time, sendsig will also add a struct siginfo to the stack frame.

Last but not least, sendsig may copy a small piece of assembly code involved in signal cleanup, called the signal trampoline. This is detailed in the next section. Note that modern NetBSD native programs no longer use this feature: it is only used for older programs and for emulation of other operating systems.
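The register manipulation performed by sendsig, and undone later when the handler returns, can be modelled with a toy register set and a toy stack. This is only a sketch: the structures and function names are hypothetical, the stack pointer is a byte offset into an array, and real implementations also deal with alignment, the signal mask and the optional struct siginfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy user context: program counter, stack pointer, machine state. */
struct regs_sketch {
	uintptr_t pc;
	uintptr_t sp;		/* byte offset into the toy stack */
	uintptr_t flags;
};

/* Frame pushed on the user stack before running the handler. */
struct sigframe_sketch {
	int sf_signum;
	struct regs_sketch sf_regs;	/* context to restore afterwards */
};

/* Save the context on the stack and redirect execution to the handler. */
void
sendsig_sketch(struct regs_sketch *regs, unsigned char *stack, int signum,
    uintptr_t handler)
{
	struct sigframe_sketch frame;

	memset(&frame, 0, sizeof(frame));
	frame.sf_signum = signum;
	frame.sf_regs = *regs;

	regs->sp -= sizeof(frame);			/* the stack grows down */
	memcpy(stack + regs->sp, &frame, sizeof(frame));	/* a copyout in the kernel */
	regs->pc = handler;
}

/* Restore the context saved in the frame the stack pointer points to. */
void
sigreturn_sketch(struct regs_sketch *regs, const unsigned char *stack)
{
	struct sigframe_sketch frame;

	memcpy(&frame, stack + regs->sp, sizeof(frame));
	*regs = frame.sf_regs;
}
```

After sigreturn_sketch, the registers are exactly what they were before the signal was delivered, so the interrupted code resumes as if nothing had happened.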

3.5.3. Cleaning up state after signal handler execution

Once the signal handler returns, the kernel must destroy the signal handler context and restore the previous process state. This can be achieved in two ways.

The first method uses the kernel-provided signal trampoline: sendsig has copied the signal trampoline onto the stack and has prepared the stack and/or CPU registers so that the signal handler returns to the signal trampoline. The job of the signal trampoline is to call the sigreturn or setcontext system call, passing a pointer to the CPU registers saved on the stack. This restores the CPU registers to their values before the signal handler invocation, and the next time the process returns to userland, it resumes execution where it stopped.

The native signal trampoline for i386 is called sigcode and can be found in src/sys/arch/i386/i386/locore.S. Each emulated ABI has its own signal trampoline, which can be quite close to the native one, except usually for the sigreturn system call number.

The second method is to use a signal trampoline provided by libc. This is what modern NetBSD native programs do. When the sigaction system call is invoked, the libc stub passes a pointer to a signal trampoline in libc, which is in charge of calling setcontext.

sendsig will use that pointer as the return address for the signal handler. This method is better than the previous one, because it removes the need for an executable stack page where the signal trampoline is stored. The trampoline is now stored in the code segment of libc. For instance, for i386, the signal trampoline is named __sigtramp_siginfo_2 and can be found in src/lib/libc/arch/i386/sys/__sigtramp2.S.
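From the program's side all of this is invisible: it installs a handler with sigaction and, once the handler has returned, execution continues where it was interrupted. The sketch below shows that visible behaviour using only standard POSIX calls; the helper name deliver_to_self is hypothetical, and which trampoline is actually used depends on the system and libc the program runs on.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t last_signo;

/* Three-argument handler, as requested with SA_SIGINFO. */
static void
on_signal(int signo, siginfo_t *info, void *ctx)
{
	(void)info;
	(void)ctx;
	last_signo = signo;
}

/*
 * Install a handler for signo, send the signal to ourselves and
 * return the signal number seen by the handler, or -1 on error.
 */
int
deliver_to_self(int signo)
{
	struct sigaction sa, old;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_signal;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(signo, &sa, &old) == -1)
		return -1;

	last_signo = 0;
	if (raise(signo) != 0)
		return -1;

	/* Reaching this point means the saved context was restored. */
	(void)sigaction(signo, &old, NULL);
	return last_signo;
}
```

The handler returns with an ordinary function return; the jump back into deliver_to_self is the work of the trampoline and the context restore described above.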

3.6. Threads scheduling

XXX write me
