Explanation of how lazy FPU context switching works.
Tohru Nishimura - Nara Institute of Science and Technology
FPU hardware typically has a single set of registers to hold the current FPU context. Each process has an area of memory reserved (in u_pcb under NetBSD/mips) to hold that process's FPU state while it is not executing. Loading and saving the FPU state on every context switch consumes a significant number of CPU cycles.
Modern CPUs provide an option to disable the execution of FP instructions. When the CPU attempts to execute an FP instruction while the FPU is disabled, it posts an exception and the operating system runs the 'FPU was unavailable for me' handler on behalf of the executing process. The handler checks and prepares the FPU for use, then restarts the process at the FP instruction which posted the exception. This time the FP instruction executes normally, and later FP instructions do not produce the 'FPU is unavailable' condition unless another process takes the FPU in the meantime.
Every process is created without FPU ownership and is prohibited from using the FPU. If the process never executes any FP instructions, nothing special happens to it and the FPU is not touched during the execution of that process.
If a process prohibited from using the FPU attempts to execute an FP instruction, the CPU posts an 'unavailable' exception. The global variable fpcurproc indicates which process has ownership of the FPU. At that point the FPU hardware contains the state of that owner process, which is different from the curproc that posted the exception. The unavailable handler saves the FPU hardware context into the reserved area of fpcurproc, loads curproc's FPU context into the FPU registers, and makes curproc the new owner. The initial load of a process's FPU context clears the entire FPU. In this way, the FPU context switch is deferred until a different process attempts to use the FPU. Because the vast majority of programs do not use any FP instructions, lazy (deferred) FPU context switching significantly reduces the number of expensive FPU save/load operations.
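The handler logic described above can be sketched as a small user-space simulation in C. This is a minimal sketch under stated assumptions, not the NetBSD code: only curproc and fpcurproc come from the text, while struct fpstate, the fp_save and fp_used fields, fpu_enabled, fpu_switches, context_switch(), fpu_unavailable(), fp_store() and fp_load() are names invented here for illustration. The hardware trap is modelled by an explicit check of fpu_enabled before each simulated FP instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NFPREGS 32

/* Hypothetical stand-ins for illustration; not the NetBSD definitions. */
struct fpstate {
    double regs[NFPREGS];
};

struct proc {
    struct fpstate fp_save; /* reserved save area (u_pcb under NetBSD/mips) */
    int fp_used;            /* nonzero once the process has owned the FPU */
};

static struct fpstate fpu_hw;  /* the single hardware register set */
static int fpu_enabled;        /* models the CPU's "FP instructions allowed" bit */
static struct proc *curproc;   /* process currently running */
static struct proc *fpcurproc; /* process whose state lives in fpu_hw, or NULL */
static unsigned fpu_switches;  /* number of FPU save/load operations performed */

/* Ordinary context switch: the FPU registers are not touched. The FPU is
 * left enabled only if the incoming process already owns it. */
static void context_switch(struct proc *next)
{
    curproc = next;
    fpu_enabled = (fpcurproc == next);
}

/* The 'FPU unavailable' handler: the deferred FPU context switch. */
static void fpu_unavailable(void)
{
    assert(!fpu_enabled);
    if (fpcurproc != curproc) {
        if (fpcurproc != NULL)
            fpcurproc->fp_save = fpu_hw;       /* save the old owner's state */
        if (curproc->fp_used) {
            fpu_hw = curproc->fp_save;         /* reload previously saved state */
        } else {
            memset(&fpu_hw, 0, sizeof fpu_hw); /* initial load clears the FPU */
            curproc->fp_used = 1;
        }
        fpcurproc = curproc;
        fpu_switches++;
    }
    fpu_enabled = 1;
}

/* Simulated FP instructions: trap if the FPU is disabled, then restart. */
static void fp_store(int r, double v)
{
    if (!fpu_enabled)
        fpu_unavailable();
    fpu_hw.regs[r] = v;
}

static double fp_load(int r)
{
    if (!fpu_enabled)
        fpu_unavailable();
    return fpu_hw.regs[r];
}
```

In this model a process that never calls fp_store() or fp_load() never causes a save or load, and switching away from the FPU owner and back again without another process using the FPU in between costs nothing, because the owner's state is still in the hardware registers.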
Matt Thomas adds that you need to be careful to properly clean up the lazy FP context when the fpcurproc exits.
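One way to picture that cleanup is the sketch below. It assumes a hypothetical exit hook named fpu_proc_exit() and a hypothetical fpu_enabled flag modelling the CPU's FPU enable bit; neither name is taken from NetBSD. The point is that if the exiting process still owns the FPU, ownership must be dropped, otherwise a later 'unavailable' handler would try to save the hardware state into the save area of a process that no longer exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal process structure for illustration. */
struct proc {
    int pid;
};

static struct proc *fpcurproc; /* current FPU owner, or NULL if none */
static int fpu_enabled;        /* models the CPU's FPU enable bit */

/* Hypothetical hook called while process p is exiting. */
static void fpu_proc_exit(struct proc *p)
{
    assert(p != NULL);
    if (fpcurproc == p) {
        fpcurproc = NULL; /* state is discarded; there is nothing to save */
        fpu_enabled = 0;  /* the next FP instruction must trap again */
    }
}
```

With fpcurproc set to NULL, the next process to trap simply loads its own context without saving anything first.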
The expensive FPU context switch syndrome is similar to the situation faced by an MMU on a process context switch. The MMU is a rather complicated device which may hold a complex internal 'state' describing the process's address space or, more unusually, a 'task description' covering the runtime environment, nature and features of processes as defined by the CPU hardware. Some MMUs have dedicated register(s) which point to the memory region describing the process's address space. In that case the cost of an MMU context switch can be reduced by keeping multiple memory regions and switching between them by updating the dedicated register(s) with a special MMU instruction. A certain CPU design is widely known to have a hilariously spectacular method of MMU context switch which involves saving/loading a number of registers, then traversing a memory region to establish the new process's runtime context, at the cost of an astonishingly large number of CPU cycles. This hardware-supported context switch capability is costly and seldom used in practice, and many consider it CISCy or a waste of silicon.