Tuesday, October 28, 2014

Code obFU(N)scation mixing 32 and 64 bit mode instructions

1 - Introduction

This article is about a fun way to obfuscate code that takes advantage of the ability of 64bit Windows to manage and run 32bit processes. As we will see, it is a very effective technique that can make analysis really time consuming and annoying.
64bit Windows natively runs 64bit processes and kernel drivers but, for backward compatibility, it can also run old 32bit executables through the WoW64 subsystem. On the Intel x86-64 architecture this is implemented via hardware features offered by the CPU that allow 32bit mode code to switch to 64bit mode and vice versa.
The trick relies on these 32bit/64bit switches: you can craft an executable that contains both 32bit and 64bit code, and you can make the code jump from one to the other at any time. Unfortunately, almost all debuggers seem to be ineffective in dealing with these jumps (of those I tested, only remote kernel debugging using Windbg can step through the code).
Also the disassemblers don't handle the situation very well, as they are designed to handle only one architecture at a time.
Long story short: a real mess and a nightmare for analysis!

2 - 32bit/64bit switch

Let's start by analysing how the switch between 32bit and 64bit works, then we can see how it can be abused and what problems it causes for static analysis tools.

2.1 - The basics: how it works

The best way to understand how 64bit Windows handles 32bit processes is to see it in action: let's start a remote kernel debugging session and see what happens when we debug a 32bit process. In particular, we are going to debug the 32bit API CreateFile to see how the code interfaces with the 64bit operating system. Starting from the API entry point, we arrive at the following code:

00000000`7698b62b 89450c          mov     dword ptr [ebp+0Ch],eax
00000000`7698b62e 8d45f8          lea     eax,[ebp-8]
00000000`7698b631 50              push    eax
00000000`7698b632 ffd6            call    esi {ntdll_772b0000!ZwCreateFile}
00000000`7698b634 8bd8            mov     ebx,eax
00000000`7698b636 bf220000c0      mov     edi,0C0000022h
00000000`7698b63b 3bdf            cmp     ebx,edi

This is where the library KERNELBASE.dll calls the ntdll.dll API ZwCreateFile. In good old 32bit Windows, ntdll, among other things, acts as a wrapper providing the transition from usermode to kernelmode (that is, it implements a syscall). Now things are different: we step into the call and we get:

ntdll_772b0000!ZwCreateFile:
00000000`772d00a4 b852000000      mov     eax,52h
00000000`772d00a9 33c9            xor     ecx,ecx
00000000`772d00ab 8d542404        lea     edx,[esp+4]
00000000`772d00af 64ff15c0000000  call    dword ptr fs:[0C0h]
00000000`772d00b6 83c404          add     esp,4
00000000`772d00b9 c22c00          ret     2Ch

There is no sysenter/syscall/int 2E here, so this code is not calling the kernel yet. Instead, it is calling the following:

wow64cpu!X86SwitchTo64BitMode:
00000000`74c62320 ea1e27c6743300  jmp     0033:74C6271E

A far jump? You don't really see this type of jump very often in 32bit, so why is it used here? Because it is switching to 64bit mode (the normal usermode code segment for 32bit is 0x0023, and this jump is going to segment 0x0033)! In fact, segment 0x0033 has some specific properties, let's have a look:

kd> dg 33
                                                    P Si Gr Pr Lo
Sel        Base              Limit          Type    l ze an es ng Flags
---- ----------------- ----------------- ---------- - -- -- -- -- --------
0033 00000000`00000000 00000000`00000000 Code RE Ac 3 Nb By P  Lo 000002fb

It is a code segment with Read/Execute attributes, usermode privilege (ring 3), and the Long bit is set (that is, the segment is for 64bit mode). So now we know how to switch from 32bit to 64bit, but what about the opposite? Since we are executing a 32bit process, it must be possible to switch back to 32bit from 64bit. If we keep debugging, we will pass through the following APIs:

wow64cpu!CpupReturnFromSimulatedCode
wow64cpu!TurboDispatchJumpAddressStart
wow64!Wow64SystemServiceEx
wow64!whNtCreateFile

and finally land on:

ntdll!NtCreateFile
0033:00000000`77121860 4c8bd1          mov     r10,rcx
0033:00000000`77121863 b852000000      mov     eax,52h
0033:00000000`77121868 0f05            syscall
0033:00000000`7712186a c3              ret

The system call itself happens in 64bit mode: on Intel CPUs the syscall instruction is not valid in 32bit mode, and executing it there raises an exception. This is an interesting detail, because it tells us that all the APIs that require a transition to kernelmode must switch to 64bit. (Hint: if you can control the switch to 64bit you can implement a cheap API logger ;))
We finish debugging this API and we get to what we were looking for:

0033:00000000`74c626b0 4489442410      mov     dword ptr [rsp+10h],r8d
0033:00000000`74c626b5 458b85c8000000  mov     r8d,dword ptr [r13+0C8h]
0033:00000000`74c626bc 4c89442418      mov     qword ptr [rsp+18h],r8
0033:00000000`74c626c1 458b85bc000000  mov     r8d,dword ptr [r13+0BCh]
0033:00000000`74c626c8 4c890424        mov     qword ptr [rsp],r8 
0033:00000000`74c626cc 48cf            iretq

The iretq instruction is similar to a ret: it returns to the address that is on the top of the stack, but it also pops from there the values used to restore the registers CS, EFL, RSP and SS. We have come full circle: a far jump took us from 32bit code into 64bit mode, and an iretq brings us back.


And this is all we need to know about the mode switches.
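
As a side note, the selectors seen above (0x0023 and 0x0033) are not magic numbers: a segment selector packs a descriptor table index, a table indicator bit and a requested privilege level. The following is just a small sketch to decode them; the struct and function names are my own and are not part of any Windows or debugger API.

```cpp
#include <cstdint>
#include <cstdio>

// Fields of an x86 segment selector (architectural layout):
//   bits 0-1  = requested privilege level (RPL)
//   bit  2    = table indicator (0 = GDT, 1 = LDT)
//   bits 3-15 = index into the descriptor table
struct Selector {
    unsigned rpl;
    bool     ldt;
    unsigned index;
};

// Splits a 16bit selector value into its three fields.
Selector decode_selector(uint16_t sel)
{
    Selector s;
    s.rpl   = sel & 0x3;
    s.ldt   = ((sel >> 2) & 0x1) != 0;
    s.index = sel >> 3;
    return s;
}

// Illustrative helper: prints one selector in a readable form.
void print_selector(uint16_t sel)
{
    Selector s = decode_selector(sel);
    std::printf("%04x: index %u, %s, ring %u\n",
                (unsigned)sel, s.index, s.ldt ? "LDT" : "GDT", s.rpl);
}
```

Calling print_selector(0x33) and print_selector(0x23) shows that both selectors are ring 3 GDT entries and differ only in the index. Whether the selected descriptor is a 64bit one is decided by its Long bit, as in the "dg 33" output above.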

2.2 - Abusing 32bit/64bit switches

If Windows library code can simply jump back and forth between 32bit and 64bit mode, then why can't we? In fact, we can, just fine! As an example I have crafted a 32bit executable that performs a jump to 64bit mode and then jumps back to 32bit. Here it is:

.text:00401000 _main           proc near             
.text:00401000                 call    ds:DebugBreak

 ...

.text:00401010                 jmp     far ptr 33h:401019
.text:00401010 _main           endp
.text:00401010

 ...

.text:00401019                 db  48h ; sub  rsp, 4
.text:0040101A                 db  83h
.text:0040101B                 db 0ECh
.text:0040101C                 db    4
.text:0040101D                 db  89h ; mov  dword ptr [rsp], eax
.text:0040101E                 db    4
.text:0040101F                 db  24h
.text:00401020                 db  48h ; mov    rax, rsp
.text:00401021                 db  8Bh 
.text:00401022                 db 0C4h 
.text:00401023                 db  50h ; push   rax
.text:00401024                 db  90h ; nop
.text:00401025                 db  90h ; nop
.text:00401026                 db  90h ; nop
.text:00401027                 db  90h ; nop
.text:00401028                 db  5Bh ; pop    rbx
.text:00401029                 db  48h ; mov    rax, 2Bh
.text:0040102A                 db 0B8h 
.text:0040102B                 db  2Bh 
.text:0040102C                 db    0
.text:0040102D                 db    0
.text:0040102E                 db    0
.text:0040102F                 db    0
.text:00401030                 db    0
.text:00401031                 db    0
.text:00401032                 db    0
.text:00401033                 db  50h ; push   rax
.text:00401034                 db  53h ; push   rbx
.text:00401035                 db  48h ; mov    rax, 246h
.text:00401036                 db 0B8h 
.text:00401037                 db  46h 
.text:00401038                 db    2
.text:00401039                 db    0
.text:0040103A                 db    0
.text:0040103B                 db    0
.text:0040103C                 db    0
.text:0040103D                 db    0
.text:0040103E                 db    0
.text:0040103F                 db  50h ; push   rax
.text:00401040                 db  48h ; mov    rax, 23h
.text:00401041                 db 0B8h 
.text:00401042                 db  23h 
.text:00401043                 db    0
.text:00401044                 db    0
.text:00401045                 db    0
.text:00401046                 db    0
.text:00401047                 db    0
.text:00401048                 db    0
.text:00401049                 db    0
.text:0040104A                 db  50h ; push   rax
.text:0040104B                 db  48h ; mov    rax, 401080h
.text:0040104C                 db 0B8h 
.text:0040104D                 db  80h 
.text:0040104E                 db  10h
.text:0040104F                 db  40h 
.text:00401050                 db    0
.text:00401051                 db    0
.text:00401052                 db    0
.text:00401053                 db    0
.text:00401054                 db    0
.text:00401055                 db  50h ; push   rax
.text:00401056                 db  48h ; iretq
.text:00401057                 db 0CFh 
...
.text:00401080                 pop     eax
...

I compiled a simple C program, and in the main() function I put a call to DebugBreak to conveniently spawn the remote debugger, followed by a series of nops which I later overwrote with the opcodes I needed. You can clearly see the far jump at address 0x00401010: it jumps to segment 0x0033 and virtual address 0x00401019. The code at 0x00401019 is meant to be read as 64bit instructions, but the executable is loaded in IDA as a 32bit PE, so you see it as data and not as 64bit instructions.
I have put comments at addresses 0x00401019, 0x0040101D etc. to indicate the 64bit instructions. They simply push the correct values on the stack in order to be able to switch back to 32bit mode. In order, the following values are pushed:
  • the stack segment selector
  • the stack pointer
  • the eflags register
  • the code segment selector (0x0023 is the standard usermode code segment)
  • the instruction pointer (in this case, it is 0x00401080)
The iretq will restore all these values, starting the execution in 32bit mode from address 0x0023:0x00401080, but bear in mind that the 64bit code also changes the state of the registers in 32bit mode. So it's up to you to preserve the registers that need to be saved across switches.
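
To make the layout concrete, here is a minimal sketch that simulates the five pushes and returns the frame as the iretq finds it in memory. It is only a model of the stack, not code from the test program, and the function name and the sample stack pointer used below are made up for illustration.

```cpp
#include <cstdint>
#include <vector>

// Simulates the five 64bit pushes performed before the iretq and returns
// the resulting frame as it sits in memory, lowest address first
// (element 0 is what rsp points at when the iretq executes).
std::vector<uint64_t> build_iretq_frame(uint64_t ss, uint64_t rsp,
                                        uint64_t rflags, uint64_t cs,
                                        uint64_t rip)
{
    std::vector<uint64_t> pushed;   // in push order, first push first
    pushed.push_back(ss);           // stack segment selector
    pushed.push_back(rsp);          // stack pointer to restore
    pushed.push_back(rflags);       // flags register
    pushed.push_back(cs);           // code segment selector
    pushed.push_back(rip);          // instruction pointer

    // The stack grows downwards, so the last value pushed has the lowest
    // address: reverse the push order to get the in-memory layout.
    return std::vector<uint64_t>(pushed.rbegin(), pushed.rend());
}
```

For example, build_iretq_frame(0x2B, 0x12FF00, 0x246, 0x23, 0x401080) yields the qwords 0x401080, 0x23, 0x246, 0x12FF00, 0x2B, which is the order in which the iretq consumes them (the stack pointer value 0x12FF00 is an arbitrary placeholder).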

2.3 - Some issues with the disassemblers

Of course you can always open the PE file as a binary file in IDA64, and then manually disassemble those instructions, but there are some issues:
  • The file is opened as a binary file, which means that if an opcode is referencing a memory location IDA will not show you the x-refs. For instance, if you have "mov   rax, 0x00402000", since the file is loaded as a binary file and not as a PE, there will not be a reference to the virtual address 0x00402000.
  • IDA will not know where the 64bit code snippets are in the file, so you will need to manually get every virtual address from the 32bit PE, translate it to a file offset and then find it in the 64bit binary file loaded in IDA. Annoying!
  • If you have a complex computation (for example, a decryption routine) that interleaves 32bit and 64bit instructions to perform a task, then following the whole routine through static analysis is really a pain: you need to use two sessions of IDA to understand all the code.
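
The address translation mentioned in the second point is mechanical, so it can at least be scripted. Below is a rough sketch assuming you have already read the image base and the section headers out of the PE file; the Section struct only mirrors the four header fields needed here, and all the names are my own.

```cpp
#include <cstdint>
#include <vector>

// Minimal view of a PE section header: only the fields needed here.
struct Section {
    uint32_t virtual_address;   // RVA of the section start
    uint32_t virtual_size;      // size of the section in memory
    uint32_t raw_pointer;       // file offset of the section data
    uint32_t raw_size;          // size of the section data in the file
};

// Translates a virtual address into a file offset.
// Returns false if the address is not backed by raw data of any section.
bool va_to_file_offset(uint32_t image_base,
                       const std::vector<Section>& sections,
                       uint32_t va, uint32_t* offset)
{
    if (va < image_base)
        return false;
    const uint32_t rva = va - image_base;

    for (const Section& s : sections) {
        if (rva >= s.virtual_address &&
            rva - s.virtual_address < s.raw_size) {
            *offset = s.raw_pointer + (rva - s.virtual_address);
            return true;
        }
    }
    return false;
}
```

With a hypothetical .text section at RVA 0x1000 whose data starts at file offset 0x400, and an image base of 0x00400000, the virtual address 0x00401019 maps to file offset 0x419.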
To solve these problems, IDA actually lets you interleave 32bit and 64bit code: you can load a 32bit PE file in IDA64, then locate the 64bit snippet and create a 64bit segment specifying the starting and ending address of such snippet. In this case you can successfully browse a 32bit PE file, disassembling 64bit instructions where needed. The drawback is that you manually have to create a segment each time you see a 64bit code snippet, which is rather annoying. The result is something like this:

.text:00401000 _main       proc near
.text:00401000       call    ds:DebugBreak
.text:00401006       nop
.text:00401007       nop
.text:00401008       mov     edx, 12345678h
.text:0040100D       nop
.text:0040100E       nop
.text:0040100F       nop
.text:00401010       jmp     far ptr 33h:401020h
.text:00401010 _main       endp
...
.text:0040101F _text       ends
TEST1:0000000000401020 ; ===========================================================================
TEST1:0000000000401020 ; Segment type: Regular
TEST1:0000000000401020 TEST1       segment byte public '' use64
TEST1:0000000000401020       assume cs:TEST1
TEST1:0000000000401020       ;org 401020h
TEST1:0000000000401020       assume es:nothing, ss:nothing, ds:nothing
TEST1:0000000000401020       mov     rax, rsp
TEST1:0000000000401023       push    rax
TEST1:0000000000401024       nop
TEST1:0000000000401025       nop
TEST1:0000000000401026       nop
TEST1:0000000000401027       nop
TEST1:0000000000401028       pop     rbx
TEST1:0000000000401029       mov     rax, 2Bh
TEST1:0000000000401033       push    rax
TEST1:0000000000401034       push    rbx
TEST1:0000000000401035       mov     rax, 246h
TEST1:000000000040103F       push    rax
TEST1:0000000000401040       mov     rax, 23h
TEST1:000000000040104A       push    rax
TEST1:000000000040104B       mov     rax, 401080h
TEST1:0000000000401055       push    rax
TEST1:0000000000401056       add     rdx, rdx
TEST1:0000000000401059       iretq
TEST1:0000000000401059 ; ---------------------------------------------------------------------------
...
TEST1:000000000040107F TEST1       ends
.text:00401080 ; ===========================================================================
.text:00401080 ; Segment type: Pure code
.text:00401080 ; Segment permissions: Read/Execute
.text:00401080 _text       segment para public 'CODE' use32
.text:00401080       assume cs:_text
.text:00401080       ;org 401080h
.text:00401080       assume es:TEST1, ss:TEST1, ds:_data
.text:00401080       nop
.text:00401081       nop
...
.text:004012D4       nop
.text:004012D5       nop
.text:004012D6       xor     eax, eax

.text:004012D8       retn

Notice that you don't get any cross references for memory locations between segments; even manually using the "offset" command won't work.

These issues show up mainly in static analysis: if you are debugging the code you can just follow it and the obfuscation won't matter. Or will it? Well, it turns out that debuggers don't work very well with 64bit code and, besides, it is common to analyse parts of an executable without being able to run them, so this is a serious issue.

3 - Debuggers

Let's have a quick overview of the debugging problems.

3.1 - Which one works?

I have tested some common debuggers and, as I briefly mentioned in section 2.3, the results are poor:

  • Ollydbg - It can debug a 32bit process, but it won't be able to trace the far jumps. If you try to step over/into one of those jumps, the debugger will lose control, and will end up somewhere else in the code.
  • Syser Win32 Debugger - Same as Ollydbg.
  • Syser kernel debugger - It doesn't run on 64bit Windows.
  • Windbg local debugger - Same as Ollydbg.
  • Windbg remote kernel debugger - The only one that works. When doing remote debugging, you can step into the far jumps and the iretqs, so you can debug the code. Unfortunately there are some other limitations, like the code assembler (that is, the "a" command) does not support 64bit instructions, so if you have to patch an executable for any reason, you will have to patch the opcode bytes manually. Not the end of the world, but not nice either.
  • IDA - You can try to use IDA's built-in debugger, but it won't directly load 64bit PE executables. It requires you to use the dbgsrv component from Windbg and then start a remote debugging session. I have not fully tested this feature, but since it uses dbgsrv it may work. Still, it requires remote debugging.
If you want to debug an executable that switches between 32bit and 64bit you need to use Windbg remote kernel debugging; I have not found another easy way to do it. Luckily, machines nowadays are pretty powerful and capable of running virtual machines, but still, it would be much easier to be able to debug this sort of code locally.

3.2 - A small workaround

I have said that Ollydbg (and basically all other usermode debuggers) is not able to step through far jumps, and that if you try you lose control of the execution, but there is still a way around the problem. If you know the 32bit address to which the 64bit code will return (via an iretq), then you can put a breakpoint (bpx) on it, let the program run, and the debugger will break there, thus skipping the 64bit code completely. To explain it more clearly:
  • you arrive at a far jump that will switch to 64bit mode
  • you know that the 64bit code will return to 32bit address xyz
  • you set a bpx on address xyz
  • you let the program run
  • the debugger will break on xyz
In this way, you completely bypass the 64bit snippet. But of course, it requires you to have previously analysed such snippet, and determined which 32bit address it will return to, which slows everything down.
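
For snippets that follow the simple patterns used in this article, the return address can be recovered with a small heuristic instead of by hand. The sketch below is an assumption-laden helper of my own: it only recognises an iretq that is immediately preceded by "push imm32" or by "mov rax, imm64 / push rax", and it will miss (or misread) anything more creative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Heuristic: finds the first iretq (48 CF) in a 64bit snippet and tries to
// recover the instruction pointer pushed right before it.
// Recognised forms (nothing else is handled):
//   68 imm32  48 CF                 push imm32 / iretq
//   48 B8 imm64  50  48 CF          mov rax, imm64 / push rax / iretq
// Note: "push imm32" sign-extends in 64bit mode; the value is returned
// zero-extended, which is the same for addresses below 0x80000000.
bool find_iretq_return(const std::vector<uint8_t>& code, uint64_t* target)
{
    for (size_t p = 0; p + 1 < code.size(); ++p) {
        if (code[p] != 0x48 || code[p + 1] != 0xCF)
            continue;                           // not an iretq

        if (p >= 5 && code[p - 5] == 0x68) {    // push imm32
            uint64_t v = 0;
            for (int i = 3; i >= 0; --i)        // little-endian immediate
                v = (v << 8) | code[p - 4 + i];
            *target = v;
            return true;
        }

        if (p >= 11 && code[p - 11] == 0x48 && code[p - 10] == 0xB8 &&
            code[p - 1] == 0x50) {              // mov rax, imm64 / push rax
            uint64_t v = 0;
            for (int i = 7; i >= 0; --i)
                v = (v << 8) | code[p - 9 + i];
            *target = v;
            return true;
        }
    }
    return false;
}
```

Feeding it the bytes 48 B8 80 10 40 00 00 00 00 00 50 48 CF from the listing in section 2.2 gives 0x00401080, which is the address where you would set the bpx.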

4 - Some examples of obfuscation

The state of the registers (and of the memory, stack etc.) is maintained across switches, which means you can perform any computation splitting parts of it between 32bit and 64bit.
For example, we can modify the test code in section 2.2 as follows (this time, for clarity, I'm writing the assembly code instead of the opcodes):

------------ 32bit code ------------

.text:00401008          mov     edx, 12345678h  ; set edx before 64bit
.text:0040100D          nop
.text:0040100E          nop
.text:0040100F          nop
.text:00401010          jmp     far ptr 33h:401019h

...

------------ 64bit code ------------

.text:00401019          sub     rsp, 4
.text:0040101D          mov     dword ptr [rsp], eax
.text:00401020          mov     rax, rsp
.text:00401023          push    rax
.text:00401024          nop
.text:00401025          nop
.text:00401026          nop
.text:00401027          nop
.text:00401028          pop     rbx
.text:00401029          mov     rax, 2Bh
.text:00401033          push    rax
.text:00401034          push    rbx
.text:00401035          mov     rax, 246h
.text:0040103F          push    rax
.text:00401040          mov     rax, 23h
.text:0040104A          push    rax
.text:0040104B          mov     rax, 401080h
.text:00401055          push    rax
.text:00401056          add     rdx, rdx        ; modifies edx
.text:00401059          iretq                   ; returns to 32bit address 0x00401080

...

------------ 32bit code ------------

.text:00401080 pop  eax  ; edx is now 0x12345678 + 0x12345678
...

The code starts by setting a value (0x12345678) in the register EDX. Then it jumps to 64bit mode, where the 64bit instructions simply double the value of EDX. When the code returns to 32bit mode, EDX contains the doubled value (0x2468ACF0). The same holds for the stack: you can push 32bit values from 64bit mode, and they will remain on the stack (assuming you don't change the stack pointer with the iretq). This means you can hide stack parameters for API calls. Moreover, you can hide the API call itself: all you need to do is jump into 64bit mode and call the corresponding 64bit version.
This may require some preparation (push the correct parameters, type conversions, etc.), but it's nothing too complicated:


This is an example of how you can call an API from 64bit, but of course you can do it in many other ways, or you can even invoke the SYSCALL yourself.
Another interesting trick is to use a snippet of code that can be executed in both 32bit and 64bit mode, and that performs a different computation depending on which mode you are in. For example the sequence of bytes

 48 03 D2

can be:

 - 64bit
   add   rdx, rdx
 - 32bit
   dec   eax
   add   edx, edx

so you can call the same opcodes and have them behave differently. Or, even worse, you can add JMP instructions in your code from both 32bit and 64bit to the same opcodes, but only one of them is really executed at runtime, for example:

.text:00401010                jmp     far ptr 33h:401050h
...
.text:00401020                jmp     401050h
...
.text:00401050   48 03 D2     ???

It becomes difficult to understand which of the two jumps is actually going to be executed at runtime. This is particularly annoying if you are trying to write a tool that automatically finds 64bit code snippets and disassembles them for you: if the tool blindly disassembles the bytes at 0x00401050 as 64bit code, it may get it wrong, because the real code might only ever reach them in 32bit mode through the jump at 0x00401020.
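
A first step for such a tool could be a plain byte scan for the far jump encoding (opcode 0xEA, a 32bit offset, then selector 0x0033). This is only a sketch: the struct and function names are made up for illustration, and a linear scan like this reports candidates, not proof, since the same bytes can show up inside data or in the middle of other instructions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One candidate far jump into segment 0x0033.
struct FarJump {
    uint32_t address;   // virtual address of the EA opcode
    uint32_t target;    // 32bit offset encoded in the jump
};

// Scans a code buffer loaded at virtual address `base` for the 7-byte
// pattern EA <imm32> 33 00 ("jmp far ptr 33h:imm32" in 32bit mode).
std::vector<FarJump> find_far_jumps(const std::vector<uint8_t>& code,
                                    uint32_t base)
{
    std::vector<FarJump> hits;
    for (size_t i = 0; i + 7 <= code.size(); ++i) {
        if (code[i] != 0xEA || code[i + 5] != 0x33 || code[i + 6] != 0x00)
            continue;

        FarJump fj;
        fj.address = base + (uint32_t)i;
        fj.target  = (uint32_t)code[i + 1]
                   | ((uint32_t)code[i + 2] << 8)
                   | ((uint32_t)code[i + 3] << 16)
                   | ((uint32_t)code[i + 4] << 24);
        hits.push_back(fj);
    }
    return hits;
}
```

Each reported target is only a place where 64bit code may start; as the example above shows, deciding whether those bytes are also reachable in 32bit mode still needs control flow analysis.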

5 - Tools

Compilers are not designed to handle this situation either, so developing this trick is not straightforward. Compilers, like debuggers and disassemblers, are designed to handle ONE architecture at a time. Mixing 32bit and 64bit is not easy, but it is not too difficult to write tools or plugins that generate 64bit snippets to be embedded inside a 32bit executable. You can for example use the "__emit" compiler intrinsic available in old Visual Studio versions, or you can use NASM or other assemblers to generate both 32bit and 64bit code and then merge them into one single executable.
Here are my proposals to help you implement this kind of obfuscation.

5.1 - How to include the obfuscation in a Visual Studio project

To show you how to implement the obfuscation in your own Visual Studio project, I have crafted a POC that you can easily modify.
I have first created a 32bit Visual Studio project called "Asm_C" containing two files: "main.cpp" and "test.asm". "main.cpp" simply executes "run_asm64()", the assembly routine that is located in "test.asm", and demonstrates how this routine modifies the value of the "Key" variable.
In particular, this routine consists of:
  • opcodes to jump into 64bit mode;
  • the 64bit opcodes corresponding to the assembly code you want to execute (in this case, the ones that modify the "Key" variable);
  • opcodes to return to 32bit mode.
This is done to work around the lack of support for the two architectures together: you can't mix 32bit code and 64bit code in the same project, but you can use the corresponding opcodes instead!
Note that I have also used raw opcodes for the far jump into 64bit mode, although it runs as 32bit code. This is because MASM does not seem to support far jumps properly.
Here are the listings:

main.cpp

------------------------------- CUT HERE ---------------------------------

#include <windows.h>
#include <stdio.h>

extern "C" void run_asm64(void);
extern "C" int Key = 0x10000000;

int main(void)
{
    printf("Key before 64bit: %08x \n", Key);
    run_asm64();    // far jump to 64bit mode, modify Key, iretq back
    printf("Key after 64bit: %08x \n", Key);
    return 0;
}

------------------------------- CUT HERE ---------------------------------

test.asm

------------------------------- CUT HERE ---------------------------------

.586
.MODEL FLAT, C
.STACK
.DATA

Extern Key:DWORD

.CODE             ;Indicates the start of a code segment.

run_asm64 PROC

db 0EAh ; jump to enter 64 bit
dd offset LocEnter
db 033h, 000h
LocEnter:
; Add the 64bit opcodes within the two lines
; warning: remember to preserve the registers you trash
;-----------------------------------------------------

db 051h, 048h, 0b9h                              ; push rcx / mov rcx, offset Key
dd offset Key
dd 0
db 081h, 001h, 078h, 056h, 034h, 012h, 059h      ; add dword ptr [rcx], 012345678h / pop rcx

;-----------------------------------------------------
db 048h, 083h, 0ech, 004h ; sub  rsp, 4
db 089h, 004h, 024h ; mov  dword ptr [rsp], eax (save eax for later)
db 048h, 08bh, 0c4h ; mov rax, rsp
db 06ah, 02bh ; push stack segment selector
db 50h ; push stack pointer (in rax)
db 068h, 046h, 002h, 000h, 000h ; push eflags
db 06ah, 023h ; push code selector
db 068h
dd offset LocExit ; push instruction pointer
db 048h, 0cfh ; iretq
LocExit:
pop eax ; restore eax
ret

run_asm64 ENDP 
END

------------------------------- CUT HERE ---------------------------------

To obtain the 64bit opcodes I've created a 64bit Visual Studio project named "Dummy64" containing two files: "main.cpp" and "dummy.asm".
"dummy.asm" contains the 64bit assembly code that we want to assemble to obtain the corresponding binary opcodes.
"main.cpp" loops through all the opcodes of the compiled "DummyAsm" routine and prints them but, first, it looks for a jump (opcode 0xE9) and skips it. This is done because some toolchains (Visual Studio, for instance) may emit a small stub, a so-called "trampoline", that jumps to the function body: this check is meant to follow the trampoline to the real code.
The code also supports a sort of relocation procedure: for example, in this POC, we use the variable "Var1" to refer to the "Key" variable in the "Asm_C" project.
Of course, you can use the same trick every time you want to use in your 64bit code something that has been defined in the 32bit code.

main.cpp

------------------------------- CUT HERE ---------------------------------

#include <windows.h>
#include <stdio.h>

extern "C" void DummyAsm(void);
extern "C" int Var1;

int main(void)
{
    int i, Line;

    unsigned char *Routine = (unsigned char *)DummyAsm;

    // follow the trampoline, if any: E9 is "jmp rel32", and the
    // displacement is signed (long is 32bit with Visual Studio)
    // and relative to the end of the 5-byte jump
    if(Routine[0] == 0xE9) {
        Routine += *(long *)(&Routine[1]) + 5;
    }

    Line = 0;
    for(i = 0; ; i++) {
        if(*(unsigned long *)(&Routine[i]) == 0xAAAAAAAA) { // dummy signature
            break;
        }

        // the address of Var1 from this source will be relocated
        // with the address of Key from the 32bit source
        if(*(unsigned long long *)(&Routine[i]) == (unsigned long long)&Var1) {
            printf("\n dd offset Key \n dd 0");
            i += 7;     // skip the 8-byte address (the loop adds the last byte)
            Line = -1;  // start a new db line after the relocation
        }
        else if(Line % 8 == 0) {
            printf("\n db 0%02xh", Routine[i]);
        }
        else {
            printf(", 0%02xh", Routine[i]);
        }

        Line++;
    }

    return 0;
}

------------------------------- CUT HERE ---------------------------------

dummy.asm

------------------------------- CUT HERE ---------------------------------

.DATA

Var1 DWORD 0
PUBLIC Var1

.CODE

DummyAsm PROC
; write the code within the two lines
; warning: remember to preserve the registers you trash
;----------------------------------------------

   push rcx
   mov rcx, offset Var1
   add dword ptr [rcx], 012345678h
   pop rcx

;----------------------------------------------
   db 0AAh, 0AAh, 0AAh, 0AAh ; dummy signature
DummyAsm ENDP

END

------------------------------- CUT HERE ---------------------------------

To sum up, I propose the following steps:
  • You create a 32bit project ("Asm_C", in this case) containing both the C/C++ files with the 32bit code and the ASM files in which you will put the 64bit routines.
  • Each 64bit routine must contain proper code to enter and leave 64bit mode.
  • Each 64bit routine must be encoded as opcodes, using the "Dummy64" project.
  • If you want to use a portion of memory that has been previously allocated from the 32bit code (like a variable, an array, a structure and so on), just use a different one in 64bit and remember to relocate it to the one you are really referring to in 32bit, using the trick we saw in the "Dummy64" project.

5.2 - How to (nearly) automate the obfuscation

To automate the obfuscation you can take advantage of Visual Studio itself! In fact, you can use the /FA option in the Visual Studio command line (or "Project Properties -> Configuration properties -> C/C++ -> Output files -> Assembler output -> Assembly-Only Listing") and turn off the /GL option (or "Project properties -> Configuration properties -> C/C++ -> Optimization -> Whole Program Optimization") to obtain the assembly sources of your project without cross-file optimizations. Finally you can assemble and link the resulting assembly files by typing "ml file_1.asm ... file_n.asm" in the Visual Studio command line.
N.B. The /GL setting is crucial, because with Whole Program Optimization turned off the compiler does not mix the code between the project files: if a routine is located in "main.cpp", the corresponding assembly will be in "main.asm", whereas with it turned on, due to optimization, the routine could end up in any other generated assembly file and the ML command won't work!
So you can:
  1. Create a 32bit Visual Studio C/C++ project and compile it in the way described above.
  2. Select any instruction from the obtained assembly listing and substitute it with a bunch of opcodes in 64bit mode that have the same behavior, taking care to add the code to jump into and out of 64bit mode.
  3. Compile and link the assembly files.
Of course, you can automate step 2 very easily and craft your own obfuscator: it won't take long if you use any programming language that supports regular expressions.
For example, I followed these steps and substituted the assembly instruction "push 14h" with the following assembly code:

db 0EAh ; jump to enter 64 bit
dd offset LocEnter
db 033h, 000h
LocEnter:
;------------------------

db 048h, 083h, 0ech, 004h ; sub  rsp, 4
db 0c7h, 004h, 024h, 014h, 000h, 000h, 000h ; mov  dword ptr [rsp], 14h

;------------------------
db 048h, 083h, 0ech, 004h ; sub  rsp, 4
db 089h, 004h, 024h ; mov  dword ptr [rsp], eax
db 048h, 08bh, 0c4h ; mov rax, rsp
db 06ah, 02bh ; push stack segment selector
db 50h ; push stack pointer (in rax)
db 068h, 046h, 002h, 000h, 000h ; push eflags
db 06ah, 023h ; push code selector
db 068h
dd offset LocExit
db 048h, 0cfh ; iretq
LocExit:
pop eax

I then linked the assembly files and it worked just fine. I have also disassembled the executable before and after the modification; here are the listings.

Before the modification:

.text:00401018                 push    offset aMain_txt
.text:0040101D                 call    FileCreate
.text:00401022                 push    14h             ; the instruction we're going to replace
.text:00401024                 push    offset aTestTest
.text:00401029                 push    0  
.text:0040102B                 push    eax 
.text:0040102C                 mov     hObject, eax
.text:00401031                 call    FileSeekWrite

After the modification:

.text:00401018                 push    offset aMain_txt ; "main.txt"
.text:0040101D                 call    FileCreate
.text:00401022                 jmp     far ptr 33h:401029h
.text:00401022 _main           endp ; sp-analysis failed
.text:00401022
.text:00401022 ; ---------------------------------------------------------------------------
.text:00401029                 db 48h, 83h, 0ECh
.text:0040102C                 dd 2404C704h, 14h, 4EC8348h, 48240489h, 2B6AC48Bh, 2466850h
.text:0040102C                 dd 236A0000h
.text:00401048                 db 68h
.text:00401049                 dd offset loc_40104F
.text:0040104D                 db 48h, 0CFh
.text:0040104F ; ---------------------------------------------------------------------------
.text:0040104F
.text:0040104F loc_40104F:                             ; DATA XREF: .text:00401049 o
.text:0040104F                 pop     eax
.text:00401050                 push    offset aTestTest ; "test test"
.text:00401055                 push    0
.text:00401057                 push    eax
.text:00401058                 mov     dword_403040, eax
.text:0040105D                 call    FileSeekWrite

Totally messy and, as I mentioned before, a very effective way to hide the parameters of a function. Note that this kind of obfuscation is really powerful and, unlike with standard packers, the clear code never appears in memory ready to be dumped. You can also use this idea to implement other obfuscation techniques. For example, you can easily create a little program that adds a lot of junk code all over the assembly listing. Spreading around the trick from the end of section 4, that is, filling your source with pieces of code that can be interpreted in both 32bit and 64bit mode, will also be very frustrating for whoever has to analyse your program.
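
Since the enter/exit wrapper is always the same, an obfuscator can generate it around any 64bit payload. The following is a minimal sketch of that idea, assuming the stub is placed at a known virtual address below 0x80000000 and that the payload is already encoded as 64bit opcodes; build_stub and put32 are hypothetical names of my own, and the byte layout is the one used in the listings above.

```cpp
#include <cstdint>
#include <vector>

// Appends a 32bit value in little-endian byte order.
static void put32(std::vector<uint8_t>& out, uint32_t v)
{
    for (int i = 0; i < 4; ++i)
        out.push_back((uint8_t)(v >> (8 * i)));
}

// Builds: far jump to 64bit mode, payload, exit sequence, pop eax.
// `base` is the virtual address at which the stub will be placed.
std::vector<uint8_t> build_stub(uint32_t base,
                                const std::vector<uint8_t>& payload)
{
    // Fixed 64bit exit sequence, up to the "push LocExit" (not included).
    static const uint8_t kExitHead[] = {
        0x48, 0x83, 0xEC, 0x04,         // sub  rsp, 4
        0x89, 0x04, 0x24,               // mov  dword ptr [rsp], eax
        0x48, 0x8B, 0xC4,               // mov  rax, rsp
        0x6A, 0x2B,                     // push 2Bh  (stack segment selector)
        0x50,                           // push rax  (stack pointer)
        0x68, 0x46, 0x02, 0x00, 0x00,   // push 246h (eflags)
        0x6A, 0x23                      // push 23h  (code segment selector)
    };

    // LocEnter is right after the 7-byte far jump.
    const uint32_t enter = base + 7;
    // LocExit follows the payload, the exit head, the 5-byte
    // "push imm32" and the 2-byte iretq.
    const uint32_t exit_addr = enter + (uint32_t)payload.size()
                             + (uint32_t)sizeof(kExitHead) + 5 + 2;

    std::vector<uint8_t> out;
    out.push_back(0xEA);                // jmp far ptr 33h:LocEnter
    put32(out, enter);
    out.push_back(0x33);
    out.push_back(0x00);
    out.insert(out.end(), payload.begin(), payload.end());
    out.insert(out.end(), kExitHead, kExitHead + sizeof(kExitHead));
    out.push_back(0x68);                // push LocExit
    put32(out, exit_addr);
    out.push_back(0x48);                // iretq
    out.push_back(0xCF);
    out.push_back(0x58);                // LocExit: pop eax (32bit code)
    return out;
}
```

Called with base 0x00401022 and the 11 payload bytes that replace "push 14h", it produces a 46-byte stub whose far jump targets 0x00401029 and whose iretq returns to 0x0040104F, matching the listing above.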

6 - Evolutions

This trick alone is very effective, but there are other good obfuscation techniques that have been used in various malware and packers. Combine the old obfuscation techniques with this new one and you obtain code that is nearly impossible to analyse... well, not impossible, but very very hard!

7 - (Not) detecting the obfuscation

I tried using Intel's Pin instrumentation toolkit (I used the 32bit version) to trace the test application I created, hoping that Pin would be able to identify and follow the far jumps that go from 32bit to 64bit. Unfortunately, Pin seems to be unable to handle these jumps as well (I also found people reporting this problem in the official Pin forum). This is the source code of the Pintool I have written:

------------------------------- CUT HERE ---------------------------------

#include <stdio.h>
#include "pin.H"

namespace WINDOWS
{
#include <windows.h>
}

FILE * OutTrace;
ADDRINT ExceptionDispatcher = 0;
bool LastJmp64 = false;

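// Called before every instrumented instruction.
// Opcode is passed in by the instrumentation call but is not used here.
VOID DetectFarJmp(ADDRINT InstrEip, UINT32 Opcode)
{
    // Log the first instruction seen after a far jump to the 64bit segment
    if (LastJmp64)
    {
        fprintf(OutTrace, "after jmp 64: eip %08x \n", InstrEip);
        LastJmp64 = false;
    }

    // jmp far ptr16:32 is encoded as EA <4 byte offset> <2 byte selector>;
    // bytes 5 and 6 hold the selector, here 0x0033
    const UINT8 *Bytes = (const UINT8 *)InstrEip;
    if (Bytes[0] == 0xEA && Bytes[5] == 0x33 && Bytes[6] == 0)
    {
        fprintf(OutTrace, "Jump seg 64! eip %08x \n", InstrEip);
        LastJmp64 = true;
    }
}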

// Instrumentation callback: insert a call to DetectFarJmp before each instruction
VOID Instruction(INS Ins, VOID *v)
{
    INS_InsertCall(Ins, IPOINT_BEFORE, (AFUNPTR)DetectFarJmp,
                   IARG_INST_PTR, IARG_UINT32, INS_Opcode(Ins), IARG_END);
}

VOID Fini(INT32 code, VOID *v)
{
    fprintf(OutTrace, "Terminating execution\n");
    fflush(OutTrace);
    fclose(OutTrace);
}

INT32 Usage()
{
    PIN_ERROR("Itrace pintool 1\n");
    return -1;
}

int main(int argc, char * argv[])
{
    OutTrace = fopen("itrace.txt", "wb");

    // Resolve KiUserExceptionDispatcher so its address can be recognised in the log
    WINDOWS::HMODULE hNtdll;
    hNtdll = WINDOWS::LoadLibrary("ntdll");
    ExceptionDispatcher = (ADDRINT)WINDOWS::GetProcAddress(hNtdll, "KiUserExceptionDispatcher");
    fprintf(OutTrace, "Exception handler address: %08x \n", ExceptionDispatcher);
    WINDOWS::FreeLibrary(hNtdll);

    // Initialize pin; stop here if the command line is invalid
    if (PIN_Init(argc, argv))
    {
        return Usage();
    }

    // Register Instruction to be called to instrument instructions
    INS_AddInstrumentFunction(Instruction, 0);

    // Register Fini to be called when the application exits
    PIN_AddFiniFunction(Fini, 0);

    // Start the program, never returns
    fprintf(OutTrace, "Starting Pintool\n");
    PIN_StartProgram();

    return 0;
}

------------------------------- CUT HERE ---------------------------------

It simply recognises a far jump to segment 0x33 by its opcode bytes and, when one is found, prints the address of the next instruction that gets executed. Running the test produces the following log before the application crashes:

Exception handler address: 772f0124 
Starting Pintool
Jump seg 64! eip 748f2320 
after jmp 64: eip 773010b2 
Jump seg 64! eip 748f2320 
after jmp 64: eip 772ffb9a 
Jump seg 64! eip 748f2320 
after jmp 64: eip 772ffa1a 
Jump seg 64! eip 748f2320 
after jmp 64: eip 772ffa1a
...
Jump seg 64! eip 01121022 
after jmp 64: eip 772f0124 

As we can see, the jumps within system DLLs are correctly detected, and the problem occurs only at address 0x01121022, which is the application's first far jump. We can tell because the next instruction is located at address 0x772f0124, which is the address of KiUserExceptionDispatcher (one of the functions called by Windows when an exception occurs).
Moreover, the application works perfectly if run normally and crashes only when run under Pin.
I haven't investigated these details deeply, but it seems that something goes wrong within Pin's instrumented code when it reaches the application's far jumps, while Pin may have its own logic to handle the switches made by Windows internal API calls.
And there goes another tool...!
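Since the Pintool only looks at three bytes of each instruction, the same check can also be sketched offline, without Pin, as a plain scan over a buffer of code bytes. The snippet below is only a rough illustration under that assumption: the names AppendFarJmp and FindFarJmp64 are hypothetical, and a linear byte scan ignores instruction boundaries, so on real code it can report false positives.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Selector of the 64bit code segment, as checked by the Pintool above.
const std::uint16_t kSelector64 = 0x33;

// Appends "jmp far selector:target" to out:
// opcode 0xEA, 32bit offset (little endian), 16bit selector (little endian).
void AppendFarJmp(std::vector<std::uint8_t>& out,
                  std::uint32_t target, std::uint16_t selector)
{
    out.push_back(0xEA);
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>(target >> (8 * i)));
    out.push_back(static_cast<std::uint8_t>(selector & 0xFF));
    out.push_back(static_cast<std::uint8_t>(selector >> 8));
}

// Returns the offset of every 7-byte window matching EA ?? ?? ?? ?? 33 00.
std::vector<std::size_t> FindFarJmp64(const std::vector<std::uint8_t>& code)
{
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i + 7 <= code.size(); ++i)
    {
        if (code[i] == 0xEA &&
            code[i + 5] == (kSelector64 & 0xFF) &&
            code[i + 6] == (kSelector64 >> 8))
        {
            hits.push_back(i);
        }
    }
    return hits;
}
```

Building a small buffer with AppendFarJmp and passing it to FindFarJmp64 is enough to check that the pattern matches what the Pintool looks for. A real scanner would have to decode instructions properly in both modes to be reliable.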
As a note: you can use the 32bit version of Pin to instrument a 64bit process too (although a 64bit version of Pin also exists): the process will be running in 32bit mode, but the 64bit module is loaded and can be run without problems. So I think it should also be possible to call 32bit code from a 64bit mode process, but I have not tried this yet.

8 - Conclusion

Legacy software and hardware are always a pain, and this is a good example of why. This obfuscation derives from the 32bit legacy in our shiny new 64bit CPUs, and it offers many advantages:
  • it hides computations by mixing operations in 32bit and 64bit modes
  • it hides parameters for API calls
  • it hides API calls
  • it destroys code and data cross references
  • it makes analysis time consuming
  • it can only be debugged via remote kernel debugging
  • it is difficult to build automated tools that undo this obfuscation
  • 64bit support in analysis tools in general is not very good
Note that when I say "hide" I mean that the code is difficult to visualize correctly in the disassembler or in usermode debuggers.
The code is still there, but the current tools have difficulty dealing with it.

Note: I wrote this blog entry about two years ago and proposed it to Phrack magazine. In the end they declined it, just a few months ago, so I decided to publish it now anyway. Some of the findings reported here were new at the time of writing, but were later published by other researchers (see the references). Also, even though I took some time to review this material again, some of the tool limitations I outlined may have been fixed in newer software releases. Hope you enjoyed the article anyway :)

9 - References

[1] Intel Manuals:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

[2] Windbg and Debugging tools for windows:
http://msdn.microsoft.com/en-us/windows/hardware/gg463009.aspx

[3] __emit:
http://msdn.microsoft.com/en-us/library/ms253948(v=vs.80).aspx

[4] Wow64:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa384274(v=vs.85).aspx

[5] Pin:
http://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

Other articles on the subject:

[6] Knockin' on Heaven's Gate - Dynamic Processor Mode Switching:
http://rce.co/knockin-on-heavens-gate-dynamic-processor-mode-switching/

[7] Call64, Bypassing Wow64 Emulation Layer:
http://waleedassar.blogspot.it/2013/01/call64-no-wow64-emulation-layer.html


[8] Ghost in the Shellcode 2014: Byte Sexual:
https://github.com/ctfs/write-ups/tree/master/ghost-in-the-shellcode-2014/byte-sexual