Monday, April 30, 2012

IOCCC: C code obfuscation at its best!


Finally the International Obfuscated C Code Contest is back again and this month, the 14 winning entries were announced! You can check their website and give a look to the code: www.ioccc.org.

In particular, there was one winning entry that made me really curious: the one liner by Taketo Konno. Reading the code I noticed some interesting tricks and I began to think two things: that this was a very cool entry and that it deserves to be analyzed as well :)

First of all, you have to download the C source code and the respective make file, put them in the same directory and type "make konno" to build the executable. Then, type the following: ./konno qwerty

And you will get something like this:

o o o o o o . . . .
 . . . . . . . . .
  . . . . . . .

What is it? What does this remind you of?
Try with different inputs if you have no ideas, but remember: all the inputs must be in same form "./konno just_a_single_argument_of_lower_case_letters".

Ok, here we are, after some attempt you should have realized what the output is… that is… the visual representation of the keystrokes!

Well, with this little hint we are ready to begin our analysis. The first step consists of re-shaping the code, in order to make it more clear and easy to understand:

main(_,l)
char **l;
{
  6 * putchar
  (
    (--_ % 20) ?
      (_ + _ / 21 & 56 > _) ?
        strchr(1[l], _ ^ "pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z" [_/2])?  
        111 : 46
      :32
    :10
  )
  ^ _ && main(2 + _, l);
}


Let's start by observing the structure of the code: it is a single line of code, where the function "main" performs a recursion

6*putchar(…) ^ _ && main(2 + _, l);

The function takes the parameters "_" and "l" respectively as the standard "argc" and "argv" and the recursion continues until the expression "6*putchar(…) ^ _ && main()" holds true.

Pay attention in identifying "_" with "argc", as this allow us to assume that its initialization value is 2.

Now, if you give a deeper look at the source-code, you will recognize the following construct:


(exp) ? true : false;

 
appearing three time (nested). So, this whole code is one big loop, and the nested operators, that are all inside the "putchar", must control the output of chars in every iteration. 


Let's start from the first ternary operator:

(--_ % 20) ? ... :10

this one, when false, will return the value 10 (0x0A in hex) which is the ascii value for "\n", (it will cause the cursor to go return on the following line). How does it work exactly?

The variable "_" is actually used as a counter for the main big loop, and it is incremented at every iteration because of the recursion ( main(2 + _, l) ) and because the "--" in the above expression. Every time the iterator is a multiple of 20, the condition of the ternary operator will evaulate to false, causing the code to output a "\n" char.

This handles the division of the output in lines, but what about the rest? Well, we can now look at what happens when the above ternary operator evaluates to true (that is, in every iteration where the counter is not a multiple of 20):
   
(_ + _ / 21 & 56 > _) ? ... : 32
    
mmm, not easy to follow all these arithmetic passages. Let's rewrite this line including parenthesis to highligh the operators precedence:
  
( (_ + (_ / 21)) & (56 > _) ) ? ... : 32

ha! Much better! When this evaluates to false, the space character is printed on screen (32 decimal = 0x20 hex, the ascii code for space).

We know from the output that the code will alternate spaces to other characters ("." or "o"), so the above condition must evaluate to false at alternate values of the counter.
First of all, 56 is the upper limit for the characters to be printed. The expression "56 > _" is always true for all iterations (can you guess what happens when the counter reaches value 56? :P).

The other part of the formula, instead, ("(_ + (_ / 21))") is more interesting: the code adds to the counter the value of the division between the counter and the number 21.

Let's explain it better: counter / 21 will result in 0 for the first 20 iterations, in 1 for the second 21, and in 2 for the last ones. The result of this sum is then "and-ed" with the value 1 (that is, the "true" resulting from "56 > _ "). 

So, in the first line, the counter is left unchanged, and all the even values of the counter and-ed with 1 will return false. Why? Because the number 1 in binary is... well... 1 :).

Every multiple of two, instead (that is, every even value of the counter), once translated in binary will correspond to a number that has the least significant bit set to 0. This means that in every iteration where the counter is even the "and" operation will be something like this:
   
bbbbb0  and
000001

where "b" is a bit that can be either 0 or 1. Of course the result of this and will always be false. On the other hand, it will always be true for odd values of the counter (because all odd values have the least significant bit set to 1).

BUT! This is only valid for the first line of the output! In the second line the expression "_ / 21" results in 1, which is then added to the counter.
So, the whole thing now works the opposite way: when the counter is even, it is incremented by one and it becomes odd, causing the whole ternary operator to evaluate to true. When it's odd instead, the increment makes it even, causing the ternary operator to evaluate to false, and therefore a space is printed on the screen.

Finally, the last line of the output begins with TWO spaces!! How is that possible? Well, the reason is in the chosen number "21"!
The output lines infact are aligned on multiples of 20, while the check on alternate spaces is aligned on multiples of 21. This means that the check of alternate spaces is off by one in respect of the alignment of lines.

In the third line you start with iteration 41 (well, actually iteration 42 which gets decremented to 41). This value of the counter makes it so that for the line alignment it represents the first element of the third line (the third set of 20), while for the space alignment it is the last element of the second line (the second set of 21).
    
Now the third and most nested operator is left:
   
strchr(1[l], _ ^ "pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z" [_/2])? 111 : 46
   
this is quite simple, but also quite interesting in the notation being used. 

First we see it returns the two values 111 or 46 (ascii values for "o" and "."), and correspond respectively to a char that is present and to a char that is not present in the input parameter string.
How is the check performed?? Simply with a strchr() function.

This function searches for a char in a string, and returns a pointer to that char if it is found, or NULL otherwise (which logically would evaluate respectively to true or false). The first parameter is weird! 
We have:
                        1[l]
   
no, it's not a typo! 1 is the array, and the char **l is the index. How is that possible?? Well, the semantics of the C language allows this weird notation; normally, when using an array, the code would resolve it like this:
   
type array[n] --> array + n*sizeof(type)

in order to calculate the pointer to reach the n-th element in the array. In the case of this IOCCC code, instead, the compiler does this:
   
char *test = 1[l];
----------------------------------------------
    mov         eax,dword ptr [ebp+0Ch]  ; eax = array address
    mov         ecx,dword ptr [eax+4]      ; ecx = element at address + 4
    mov         dword ptr [ebp-4],ecx       ; element is stored in test



remember that "l" is a char**, meaning that it is an array of pointers, and each element is of size 4. We access element 1, so 1*4 = 4. Therefore even when the index and the array are swapped, we still have
   
type n[array] -> array + n*sizeof(type)
   
that is, the pointer is still calculated and resolved correctly.
So this whole 1[l] thing is simply accessing argv[1], that is, the first parameter passed in the command line.

Going back to the strchr function, we see another similar trick in the second parameter:
   
"pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z" [_/2]

this time the trick is more intuitive, the string is treated as an array of chars, and C allows you to use the [] notation to access the elements. This is completely equivalent to the most "normal" form:
   
char Array[] = "pt`u}rxf~c{wk~zyHHOJ]QULGQ[Z";
...
char test = Array[_/2];
   
Ok, we know that the first parameter of the strchr is the string passed in the command line, but what about this second one? We see the counter makes an appearance. It is divided by two because we have seen it prints alternatively characters and spaces. Also, we see the character extracted from the string array is xored with the counter itself! This looks like some sort of partial encryption :P

Anyway, at every odd iteration a character is extracted from the above string, and after the xor with the counter it results in the n-th character of the keyboard (qwertyuio... etc). This character is then searched in the input parameter string, and if it is found the character "o" is returned by the ternary operator (and printed on the screen), or else the character "." is returned.

Riddle: because of the way the loop is calculated, some characters from the above string are skipped and never used, can you figure out which ones?
   
The last, final, question is: which is the condition that stops the loop?
The loop is performed through the recursive formula we saw in the beginning:
    
6*putchar(…) ^ _ && main(2 + _, l);
   
This is a logical "and": the first part stands for the termination condition, while the second one is the core of the recursion. Both the sides of the formula must evaluate to true in order for the recursion to work; the second part, though, only contains the "main" which doesn't return any value.
   
So this leaves us with only the first part that controls the iteration: the code will evaluate the first part of the formula, and if it is true it will try and evalute the second part (the "main"), but this part is a function, therefore the code will have to call the function in order to determine whether this function will return true or false. But this function is recursive! Therefore it doesn't matter what the "main" evaluates to, the recursion will cause the code to always follow it in any case.
   
The only way to terminate this loop is when the first part of the formula evaluates to false. This way, the code knows a priori that the second part of the formula can't make a difference, so it will not evaluate it, causing the recursion not to happen.
   
The first part of the formula is simply:
   
6*putchar(…) ^ _
   
when is this false? Well, remember that every line of output is of 20 chars? Every time the counter is a multiple of 20 a "\n" is printed (ascii code is 10 decimal).
   
In the last iteration, after the third line, the counter reaches the value 60. This is a multiple of 20, therefore the "\n" is printed by the expression within the "putchar" argument. Also, the printed charater is returned by "putchar" (that is, it returns 10). We see the return value (10) is multiplied by 6, then xored with the counter itself. Let's do the simple math:
   
(6*10) ^ 60
   
the result of this is 0, which logically evaluates to false, so we can see that only when the execution counter reaches the value of 60 the recursion will be interrupted, causing the code to get out of the nested recursions it has done so far, and ultimately terminate the program execution.
   
And that's all for this little one liner: very small code, but dense contents! It is very interesting to analyze tricky codes like this because you find a lot of quirks that you never see in "normal" C source codes, and you also learn a lot about the internal working of the C itself.
  
My congratulations to the author Taketo Konno for designing this very clever piece of code, which well deserves the "best one liner award" :)


Friday, April 20, 2012

RansomCrypt: an analysis of its crypto routines.


About a week ago, I noticed a new crypto ransomware (remember gpcode?): W32/RansomCrypt. The news came from the F-Secure website. So, as soon as I gathered a sample I started my analysis, but just when I was about to finish it... A twitter update told me that F-Secure had already done the work with the decryption script!
    
Anyway here you can find some additional details about the encryption/decryption and the way it works.
   
The first executable I came across was coded in Visual Basic and was packed with a modified UPX version.
   
After unpacking the sample, I started searching for standard encryption routines, but nothing caught my attention. However the sample did contain encrypted data, and debugging some routines that looked like home made decryption code, I ended up finding a second encrypted executable within the first one.
   
Luckily, this second executable wasn't coded in Visual Basic (if you have tried to reverse some VB executable you know what I mean!),  this one was also packed, but with standard UPX. Again, I unpacked it and began searching for some encryption routines. I found three candidates: a 16-bytes xor, and two TEA routines (both encryption and decryption).
   
(If you haven't ever reversed TEA, here's a little trick to recognize it: in most cases you just have to look for the magic constant 0x9E3779B9 and the work is done.)


I also noticed some interesting APIs: FindResource, SizeofResource, LoadResource, LockResource. They made me look into the resources of the file, where I found an encrypted block. Once decrypted, this block revealed the malware configuration.



To decrypt the configuration, I debugged this decryption routine and found out that the first 16 bytes of the resource were the key for the decryption of the configuration block. The decryption algorithm was the 16-bytes xor routine I had noted earlier.

Then I noticed a loop scanning for all the files in the system, and I thought that it was probably the same that also encrypted them.

I set a breakpoint there, and followed the key generation algorithm: the malware uses the same key employed to decrypt the configuration block, but it changes the endianness of every dword.
   
Key from the resource section:    4A 2E 94 46   60 64 85 5B   5A 86 89 8C   7F 63 6C 50
Key with swapped endianness:   46 94 2E 4A   5B 85 64 60   8C 89 86 5A   50 6C 63 7F

This key, combined with the first byte of the targeted filename, leads to the final encryption key that will be used with TEA.
  
   mov     dl, [eax]       ; get first char from filename
   mov     ecx, 10h
   mov     esi, offset BaseKey
   mov     edi, offset GeneratedKey

   KeyGeneration:
   lodsb
   xor     al, dl
   rol     dl, 1
   stosb
   loop    KeyGeneration


Or in C, in you prefer:
   
   #define ROTATE_LEFT(a,n)( (a<<n) | (a >> (8-n)) )
                                 
   A = Filename[0];
   for(i = 0; i < 16; i++)
   {
   GeneratedKey[i] = StaticKey[i] ^ A;
   A = ROTATE_LEFT(A, 1);
   }

Finally I noticed that the malware doesn't encrypt the entire file, but it skips the first 0x47 bytes. I believe that this is done because the first bytes of a file usually contain a header (depending on the type of the file), and these headers may contain predictable data that may allow known plaintext attacks.

We have gone through the encryption: they key is generated starting from a static seed, and then TEA is used, so if you have been infected you can generate your decryption keys and get your files back.

You can use the python script from F-Secure, which performs the decryption automatically. However, users should note that they should NOT rename the encrypted files: the encryption key is generated combining the static key with the first character of the filename, so changing the filename of an encrypted file may cause incorrect decryption.

Alternatively, the malware prompts the user for a code that if inserted will cause the malware to decrypt all encrypted files and then remove itself. The code will be sent of course only after a user sends money to the malware author.



Can't we just get this code? Does it really work? I tried forcing the tool to accept any code (I cracked it, if you want to put it that way), and the malware did decrypt the files and remove itself. So having the code may solve the problem immediately.

Unfortunately, the code is checked pretty heavily: the string you insert is hashed five times with MD5 (the hash is performed through Windows Crypto APIs) , and then checked with a hash that is hardcoded in the configuration; something like this:


   if MD5( MD5( MD5(MD5(MD5(Code))))) == 900140074FA550671FB6916CFF4D21CC

   then DecryptAndRemove()


So, unless you have the time and patience to crack this MD5 hashes chain, you are better off calculating your own keys and decrypting the files yourself.

Thursday, April 12, 2012

Flashback Trojan: domain generator algorithm demystified

We already heard a lot about Flashback, a trojan targeting users of Apple's Mac OS X that has currently infected more than 600,000 machines around the world, taking advantage of a java vulnerability  (CVE-2012-0507).
                
Kaspersky has conducted a good analysis: 
https://www.securelist.com/en/blog/208193441/Flashfake_Mac_OS_X_botnet_confirmed

Information from a user perspective has already been published: you can find removal scripts, patches, detection routines, and so on, easily all over the net. However the most interesting  data about this malware, from a security standpoint, is its spreading functionality.
I recently got a sample of this malware, and analyzed a bit of code. It was easy to determine the domain generator algorithm as I noticed the piece of code that set up the url:
                
                mov     edi, [ebp+StrParam1]
                mov     [esp+10h], eax  ; string param2 
                mov     dword ptr [esp+8], offset aSS ; char format 
                mov     dword ptr [esp+4], 200h ; size_t
                mov     [esp+0Ch], edi  ; string param1
                mov     [esp], ebx      ; dest
                call     snprintf
                
                aSS             db '%s%s',0
                
The domain generator algorithm is immediately before this part.
I studied it for a while and rewrote it in C language:
              
                #include <stdio.h> 
                #include <stdlib.h>
                #include <string.h>
                #include <time.h>

                
                void GenDomain(void)
                {
                    struct tm *ptm;
                    time_t rawtime;

                    unsigned int day, month, year;

                    unsigned int day1, day2, day3, month1, year1, year2, year3;
                    unsigned int i, j, size;
                    char *domain;  

                    time(&rawtime);
                    ptm = gmtime(&rawtime);
                    year = ptm->tm_year;
                    month = ptm->tm_mon;
                    day = ptm->tm_mday;
                 // compute day
                    day1 = day ^ (day << 16);
                    if((day ^ (day << 16)) <= 1 )
                    {
                      day2 = day << 24;
                      day3 = ~(day << 24);
                      if(day2 > 1)
                        day3 = day2;
                      day1 = day3;
                    }
                 // compute month
                    month1 = month ^ (month << 16);
                    if(month1 <= 7)
                    {
                      month1 = month << 24;
                      if((month << 24) <= 7)
                        month1 = ~(month << 24);
                    }
   
                 // compute year
                    year1 = year ^ (year << 16);
                    if((year ^ (year << 16)) <= 15)
                    {
                      year2 = year << 24;
                      year3 = ~year2;
                      if(year2 > 15)
                        year3 = year2;
                      year1 = year3;
                    }
                    size = (((16 * (month1 & 0xF8) ^ ((month1 ^ 4 * month1) >> 25) ^ ((day1 ^ (day1 << 13)) >> 19)) ^ ((year1 ^ (8 * year1)) >> 11)) & 3)+13;
                    j = size -1;
                    domain = (char*)malloc(size);
                    memset(domain, 0, size);

                    for(i = 0; i < j; i++)
                    {
                      day1 = ((day1 ^ (day1 << 13)) >> 19) ^ ((day1 & 0xFFFFFFFE) << 12);
                      month1 = ((month1 ^ 4 * month1) >> 25) ^ 16 * (month1 & 0xFFFFFFF8);
                      year1 = ((year1 ^ 8 * year1) >> 11) ^ ((year1 & 0xFFFFFFF0) << 17);

                      domain[i] = ((year1 ^ month1 ^ day1) % 25) + 97;
                    }

                 printf("Domain: %s \n", (char *)domain);
                }

                int main(void)
                {
                 GenDomain();
                 return 0;
                }

Basically it generates a stream of characters calculated from the current date (using day, month and year), so that every day a new domain is generated and contacted.
For example,  the domain of the day (12/04/2012) is:
                
                Domain: fhnqskxxwloxl 
                
Note that this routine generates only the domain name, and not the TLD. It seems that the possible TLDs used by the malware are encrypted with strong encryption and the key is uniquely generated from the infected machine at install time, therefore I'm unable to find them as I got only the payload and not the full installer.

We can do a little research using Google anyway and it seems that the following TLDs have been observed:

.com
.net
.kz
.in
.info

Now that the domain generator algorithm is demystified you can try to register one of the domains (if you can find one still available!) and perform your own traffic analysis!