Friday, September 5, 2014

Windows Internals - A look into SwapContext routine

Here I am really taking advantage of my summer vacations and back again with a second part of the Windows thread scheduling articles. In the previous blog post I discussed the internals of quantum end context switching (a flowchart). However, the routine responsible for context switching itself wasn't discussed in detail and that's why I'm here today.

Here are some notes that'll help us through this post :
 1 - The routine which contains code that does context switching is SwapContext and it's called internally by KiSwapContext. There are some routines that prefer to call SwapContext directly and do the housekeeping that KiSwapContext does themselves.
 2 - The routines above (KiSwapContext and SwapContext) are implemented in ALL context switches that are performed no matter what is the reason of the context switch (preemption,wait state,termination...).
 3 - SwapContext is originally written in assembly and it doesn't have any prologue or epilogue that are normally seen in ordinary conventions, imagine it like a naked function.
 4 - Neither SwapContext or KiSwapContext is responsible for setting the CurrentThread and NextThread fields of the current KPRCB. It is the responsibility of the caller to store the new thread's KTHREAD pointer into pPrcb->CurrentThread and queue the current thread (we're still running in its context) in the ready queue before calling KiSwapContext or SwapContext which will actually perform the context-switch.
 Usually before calling KiSwapContext, the old irql (before raising it to DISPATCH_LEVEL) is stored in CurrentThread->WaitIrql , but there's an exception discussed later in this article.

So buckle up and let's get started :
Before digging through SwapContext let's first start by examining what its callers supply to it as arguments.
SwapContext expects the following arguments:
- ESI : (PKTHREAD) A pointer to the New Thread's structure.
- EDI : (PKTHREAD) A pointer to the old thread's structure.
- EBX : (PKPCR) A pointer to PCR (Processor control region) structure of the current processor.
- ECX : (KIRQL) The IRQL in which the thread was running before raising it to DISPATCH_LEVEL.
By callers, I mean the KiSwapContext routine and some routines that call SwapContext directly (ex : KiDispatchInterrupt).
Let's start by seeing what's happening inside KiSwapContext :
This routine expects 2 arguments the Current thread and New thread KTHREAD pointers in ECX and EDX respectively (__fastcall).
Before storing both argument in EDI and ESI, It first proceeds to save these and other registers in the current thread's (old thread soon) stack:
EBP : The stack frame base pointer (SwapContext only updates ESP).
EDI : The caller might be using EDI for something else ,save it.
ESI : The caller might be using ESI for something else ,save it too.
EBX : The caller might be using EBX for something else ,save it too.
Note that these registers will be popped from this same thread's stack when the context will be switched from another thread to this thread again at a later time (when it will be rescheduled to run).
After pushing the registers, KiSwapContext stores the self pointer to the PCR in EBX (fs:[1Ch]).Then it stores the CurrentThread->WaitIrql value in ECX, now that everything is set up KiSwapContext is ready to call SwapContext.

Again, before going through SwapContext let me talk about routines that actually call SwapContext directly and exactly the KiDispatchInterrupt routine that was referenced in my previous post.
Why doesn't KiDispatchInterrupt call KiSwapContext ?
Simply because it just needs to push EBP,EDI and ESI onto the current thread's stack as it already uses EBX as a pointer to PCR.

Here, we can see a really great advantage of software context switching where we just save the registers that we really need to save, not all registers.

Now , we can get to SwapContext and explain what it does in detail.
The return type of SwapContext is a boolean value that tells the caller (in the new thread's stack) whether the new thread has any APCs to deliver or not.

Let's see what SwapContext does in these 15 steps:

1 - The first thing that SwapContext does is verify that the new thread isn't actually running , this is only right when dealing with a multiprocessor system where another processor might be actually running the thread.If the new thread is running SwapContext just loops until the thread stops running. The boolean value checked is NewThread->Running and after getting out of the loop, the Running boolean is immediately set to TRUE.

2 - The next thing SwapContext does is pushing the IRQL value supplied in ECX. To spoil a bit of what's coming in the next steps (step 13) SwapContext itself pops ECX later, but after the context switch. As a result we'll be popping the new thread's pushed IRQL value (stack switched).

3 - Interrupts are disabled, and PRCB cycle time fields are updated with the value of the time-stamp counter. After the update, Interrupts are enabled again.

4 - increment the count of context switches in the PCR (Pcr->ContextSwitches++;) , and push Pcr->Used_ExceptionList which is the first element of PCR (fs:[0]). fs:[0] is actually a pointer to the last registered exception handling frame which contains a pointer to the next frame and also a pointer to the handling routine (similar to usermode), a singly linked list simply. Saving the exception list is important as each thread has its own stack and thus its own exception handling list.

5 - OldThread->NpxState is tested, if it's non-NULL, SwapContext proceeds to saving the floating-points registers and FPU related data using fxsave instruction. The location where this data is saved is in the initial stack,and exactly at (Initial stack pointer - 528 bytes) The fxsave output is 512 bytes long , so it's like pushing 512 bytes onto the initial stack , the other 16 bytes are for stack-alignment I suppose.The Initial stack is discussed later during step 8.

6 - Stack Swapping : Save the stack pointer in OldThread->KernelStack and load NewThread->KernelStack into ESP. We're now running in the new thread's stack, from now on every value that we'll pop was previously pushed the last time when the new thread was preparing for a context-switch.

7 - Virtual Address Space Swapping : The old thread process is compared with the new thread's process if they're different CR3 register (Page directory pointer table register) is updated with the value of : NewThread->ApcState.Process->DirectoryTableBase. As a result, the new thread will have access to a valid virtual address space. If the process is the same, CR3 is kept unchanged. The local descriptor table is also changed if the threads' processes are different.

8 -  TSS Esp0 Switching : Even-though I'll dedicate a future post to discuss TSS (task state segment) in detail under Windows , a brief explanation is needed here. Windows only uses one TSS per processor and uses only (another field is also used but it is out of the scope of this article) ESP0 and SS0 fields which stand for the kernel stack pointer and the kernel stack segment respectively. When a usermode to kernelmode transition must be done as a result of an interrupt,exception or system service call... as part of the transition ESP must be changed to point to the kernel stack, this kernel stack pointer is taken from TSS's ESP0 field. Logically speaking, ESP0 field of the TSS must be changed on every context-switch to the kernel stack pointer of the new thread. In order to do so, SwapContext takes the kernel stack pointer at NewThread->InitialStack (InitialStack = StackBase - 0x30) ,it substrats the space that it has used to save the floating-point registers using fxsave instruction and another additional 16 bytes for stack alignment, then it stores the resulted stack pointer in the TSS's Esp0 field : pPcr->TssCopy.Esp0 (TSS can be also accessed using the TR segment register).

9 - We've completed the context-switch now and the old thread can be finally marked as "stopped running" by setting the previously discussed boolean value "Running" to FALSE. OldThread->Running = FALSE.

10 - If fxsave was previously executed by the new thread (the last time its context was switched), the data (floating-point registers...) saved by it is loaded again using xrstor instruction.

11 - Next the TEB (Thread environment block) pointer is updated in the PCR :
pPcr->Used_Self = NewThread->Teb . So the Used_Self field of the PCR points always to the current thread's TEB.

12 - The New thread's context switches count is incremented (NewThread->ContextSwitches++).

13 - It's finally the time to pop the 2 values that SwapContext pushed , the pointer to the exception list and the IRQL from the new thread's stack. the saved IRQL value is restored in ECX and the exception list pointer is popped into its field in the PCR.

14 - A check is done to see if the context-switch was performed from a DPC routine (Entering a wait state for example) which is prohibited. If pPrcb->DpcRoutineActive boolean is TRUE this means that the current processor is currently executing a DPC routine and SwapContext will immediately call KeBugCheck which will show a BSOD : ATTEMPTED_SWITCH_FROM_DPC.

15 - This is the step where the IRQL (NewThread->WaitIrql) value stored in ECX comes to use. As mentionned earlier SwapContext returns a boolean value telling the caller if it has to deliver any pending APCs. During this step SwapContext will check the new thread's ApcState to see if there are any kernel APCs pending. If there are : a second check is performed to see if special kernel APCs are disabled , if they're not disabled ECX is tested to see if it's PASSIVE_LEVEL, if it is above PASSIVE_LEVEL an APC_LEVEL software interrupt is requested and the function returns FALSE. Actually the only case that SwapContext returns TRUE is if ECX is equal to PASSIVE_LEVEL so the caller will proceed to lowering IRQL to APC_LEVEL first to call KiDeliverApc and then lower it to PASSIVE_LEVEL afterwards.

Special Case :
This special case is actually about the IRQL value supplied to SwapContext in ECX. The nature of this value depends on the caller in such way that if the caller will lower the IRQL immediately upon returning from SwapContext or not.
Let's take 2 examples : KiQuantumEnd and KiExitDispatcher routines. (KiQuantumEnd is the special case)

If you disassemble KiExitDispatcher you'll notice that before calling KiSwapContext it stores the OldIrql (before it was raised to DISPATCH_LEVEL) in the WaitIrql of the old thread so when the thread gains execution again at a later time SwapContext will decide whether there any APCs to deliver or not. KiExitDispatcher makes use of the return value of KiSwapContext (KiSwapContext returns the same value returned by SwapContext) to lower the IRQL. (see step 15 last sentence).
However, by disassembling KiQuantumEnd you'll see that it's storing APC_LEVEL at the old thread's WaitIrql without even caring about in which IRQL the thread was running before. If you refer back to my flowchart in the previous article you'll see that KiQuantumEnd always insures that SwapContext returns FALSE , first of all because KiQuantumEnd was called as a result of calling KiDispatchInterrupt which is meant to be called when a DISPATCH_LEVEL software interrupt was requested.Thus, KiDispatchInterrupt was called by HalpDispatchSoftwareInterrupt which is normally called by HalpCheckForSoftwareInterrupt. HalpDispatchSoftwareInterrupt is the function responsible for raising the IRQL to the software interrupt level (APC_LEVEL or DISPATCH_LEVEL) and upon returning from it HalpCheckForSoftwareInterrupt recovers back the IRQL to its original value (OldIrql). So the reason why KiQuantumEnd doesn't care about KiSwapContext return value because it won't proceed to lowering the IRQL (not its responsibility) nor to deliver any APCs that's why it's supplying APC_LEVEL as an old IRQL value to SwapContext so that it will return FALSE. However, a software interrupt might be requested by SwapContext if there are any pending APCs.
KiDispatchInterrupt which calls SwapContext directly uses the same approach as KiQuantumEnd, instead of storing the value at OldThread->WaitIrql it just moves it into ECX.

Post notes :
- Based on Windows 7 32 bit :>
- For any questions or suggestions feel free to leave a comment below or send me an email :

See you again soon :)


1 comment: