X-MAS CTF 2019 - Kernel Crackme (484pt)

December 19, 2019

Kernel Crackme (484pt)


Description: Leaked from a famous malware author this looks like some sort of testing device for a new obfuscation. Stop the hacker before he delivers his final product!

Update: The driver (.sys) is unsigned. You can only load it by disabling driver signature enforcement and enabling test mode. To load the driver you need to start the exe as admin. If you want to run this please do it on a virtual machine. (Author: avlad171)



TLDR: The windows .exe loads a driver that operates a device on \\.\Crackme. The password is read in by the client and sent to the driver by writing to this file. The kernel runs a sort of hash function to produce a new stream of bytes that are sent back and compared against a static buffer.

For this challenge, we are provided with a Windows .exe and a .sys file which is a kernel driver. I’m mostly a mac/linux person so I don’t have a ton of experience with Windows internals and I though this would be a good learning experience.

Since we are dealing with a kernel driver, I wanted to set up a kernel debugging environment. I was able to set up two Windows 7 VM’s in VirtualBox: one as the target configured with kernel debugging and one as the host running WinDbg.

The Client

The .exe tries to load the driver and then asks for a license key and reads it into a buffer:


Then it communicates with the driver over \\.\Crackme by writing the license key and then reading the response:


If the response matches the following static buffer, we have the correct flag:


The Driver

After examining the driver code, we see a function that registers \Device\Crackme and sets up an IRP disbatch table.

The two functions of interest are +0x1720 (read data), +0x15C0 (write data). These functions are called to handle read and write requests (such as the ones the client will make). I figured this out initially, by setting kernel breakpoints and watching which functions were triggered. However another sign is the use of ProbeForWrite and ProbeForRead which are functions that ensure a target buffer is in usermode space and correctly aligned.

So at this point, we know that the client will send some data (hopefully the flag) to the driver and the driver will then send back some processed data (via the client making a ReadFile request).

In order to solve this challenge, we need to figure out where and how the data is being processed.

Examining the driver write function, we see the following:


After the ProbeForWrite call, we see a global buffer being copied to a userspace buffer. This buffer is not initialized to anything so it must be set prior to this call (i.e. in the driver read call).

As a side note, the memcpy routine here did not originally have symbols and looks fairly complex for something so simple. One pattern I’ve learned to recognize over time is “accelerated memcpy” functions where you have a pyramid of loops: (copy bytes) (copy words) (copy blocks with xmm registers) (copy words) (copy bytes).

So let’s find where this global buffer is initialized…

Looking at the driver read function, we see:


After the ProbeForRead it calls a check_data function and then optionally a process_data function. The check_data function just ensures the input string is null terminated (which occurs if we type a number of bytes not divisible by 8).

Then we finally see the function that initializes the new global buffer.

Inside this function, we see a deterministic pre_process function that we can ignore for now. Then the flag data is processed in blocks of 0x10 bytes. We can also see that the arguments to this function are the respective chunks of the string buffer and the global buffer. So our global buffer will have a 1:1 correspondence with the input string.


The process function operates on read/write buffers which I’ve labled: A and B. First, we loop over the input block and clear B while setting A to the flag data:


Then we see some sneaky return pointer mangling to jump to a switch block in a loop:


At first this seems like a VM challenge, however we see that the order of the blocks that are executed is always the same. Essentially we have something like this:

char c = 0;

while (1) {
    switch (c) {
        case 0: {
            // ...
            c = 8;
        case 8: {
            // ...
            c = 127;
        case 127: {
            // ...
            c = 98;
        // ...

I started reversing these blocks by hand and after 12 blocks it repeats. It repeats a total of 10 times and then the last block exits midway through. Two more (read only) buffers are referenced that I’ve labled F1 and F2.

Additionally one block uses the pre-processed buffer (labled l2 below) as a lookup table. I ran the binary and dynamically dumped this table since it is deterministic:

buf2 = '''
4b 61 50 64 53 67 56 6b 58 70 32 73 35 76 38 79
72 66 e6 f2 21 01 b0 99 79 71 82 ea 4c 07 ba 93
b5 92 3a db 94 93 8a 42 ed e2 08 a8 a1 e5 b2 3b
68 a5 d8 e9 fc 36 52 ab 11 d4 5a 03 b0 31 e8 38
a7 3e df 0e 5b 08 8d a5 4a dc d7 a6 fa ed 3f 9e
e2 4b d4 23 b9 43 59 86 f3 9f 8e 20 09 72 b1 be
82 83 7a 22 3b c0 23 a4 c8 5f ad 84 c1 2d 1c 3a
1a 1f fa 5a 21 df d9 fe e9 80 74 7a 28 ad 68 40
0f 5a f3 6e 2e 85 2a 90 c7 05 5e ea ef a8 36 aa
d6 5f 5f b1 f8 da 75 21 3f df 2b cb d0 77 1d 61
15 fb b0 c1 ed 21 c5 e0 d2 fe ee 2b 02 89 f3 4a
9c 61 c6 4d d3 18 30 c4 ae 5a 46 68 0b aa 8a 0b
'''.replace(' ','').replace('\n','')
buf2 = binascii.unhexlify(buf2)
buf2 = [ord(x) for x in buf2]

I implemented the process function in python to ensure I understood it:

def sim(A):
    B = [0] * 16

    for i in range(10):
        for j in range(16):
            A[j] ^= buf2[i*16+j]
        print(' '.join([hex(x) for x in A]))
        for j in range(16):
            A[j] = l2[A[j]]
        print(' '.join([hex(x) for x in A]))

        for j in range(16):
            B[j] = A[(j*5)&0xf]

        for j in range(16):
            A[j] = B[j]
        print(' '.join([hex(x) for x in A]))

        if i == 9:

        for k in range(4):
            B[k*4] = f1[A[k*4]] ^ f2[A[k*4+1]] ^ A[k*4+2] ^ A[k*4+3]
            B[k*4+1] = A[k*4] ^ f1[A[k*4+1]] ^ f2[A[k*4+2]] ^ A[k*4+3]
            B[k*4+2] = A[k*4] ^ A[k*4+1] ^ f1[A[k*4+2]] ^ f2[A[k*4+3]]
            B[k*4+3] = f2[A[k*4]] ^ A[k*4+1] ^ A[k*4+2] ^ f1[A[k*4+3]]
        print(' '.join([hex(x) for x in B]))

        for j in range(16):
            A[j] = B[j]
    for j in range(16):
        A[j] ^= buf2[0xa0+j]
    return A

Most of this processing is easy to reverse except the last chunk that uses F1 and F2. A naive brute force would search 32 bits so I wrote a solver with z3:

def crack_partial(b):
    s = Solver()

    F1 = Array('f1', BitVecSort(8), BitVecSort(8))
    F2 = Array('f2', BitVecSort(8), BitVecSort(8))

    for i in range(256):
        s.add(F1[i] == f1[i])
        s.add(F2[i] == f2[i])

    a = [BitVec('a%d' % d, 8) for d in range(4)]

    s.add(F1[a[0]] ^ F2[a[1]] ^ a[2] ^ a[3] == b[0])
    s.add(a[0] ^ F1[a[1]] ^ F2[a[2]] ^ a[3] == b[1])
    s.add(a[0] ^ a[1] ^ F1[a[2]] ^ F2[a[3]] == b[2])
    s.add(F2[a[0]] ^ a[1] ^ a[2] ^ F1[a[3]] == b[3])

    m = s.model()

    return [m[a[i]].as_long() for i in range(4)]

At this point you simply reverse each chunk of the target string and obtain:


comments powered by Disqus