# Eldar

Description:

We found this app on a recent Ubuntu system, but the serial is missing from serial.c. Can you figure out a valid serial?

Just put it into serial.c and run the app with make && make run.

Thanks!

Note: you don’t have to put anything else other than the serial into serial.c to solve the challenge.

# Prologue I played GoogleCTF 2022 with DiceGang and we got second place! I spent close to 12 hours on this challenge and tried a lot of things that ended up not working including:

• throwing everything into z3
• emulate in python
• translate to C and emulate faster

Eventually the approach that ended up working was disassembling everything and applying 11 lifter passes to reduce the number of instructions from 100k+ to ~1500. (and then throwing select parts into z3)

My lifting implementation was inspired by Reductor’s breach writeup for DiceCTF 2022. Especially the use of Python 3.10 match statements for pattern matching.

## Grok complexity

Something I’ve thought about during the course of many CTFs is “what makes a reversing challenge hard to do?”

Usually there are two parts of a reversing problem: first you need to grok (i.e. understand) the problem, then you need to solve it.

For binary-based challenges, grokking usually requires disassembling/decompiling etc… Other times, it may just require understanding what is the program trying to do at a high level. This may be required, for example, to implement a more efficient solver.

Sometimes (if you’re lucky), you can skip the grok phase and blindly throw a binary into angr or throw some data into z3.

For each challenge, the author must put in some amount of effort (I) to construct the challenge (both computational and creative).

The player must then put in some effort (G) to grok the challenge and (S) to solve it.

Presumably, there are some bounds: $$S \le f_s(I)$$ and $$G \le f_g(I)$$ that limit how difficult a problem can be based on the amount of effort required to create it.

The question is what is the order of $f_s$ and $f_g$?

We know from cryptography that we can construct problems that scale exponentially with the amount of work required to construct it (for example integer factorization). So $f_s$ is at least exponential. For practical reasons however, people generally design CTF problems that can be solved within a few hours of reasonable computation.

But how does grok complexity scale with input effort? This is a harder question to answer because it is difficult to quantify.

My theory is that $f_g$ is linear; that is, $G$ is linearly bounded by $I$. In other words, a challenge can’t be harder to grok than it was to create; you can’t get “extra confusion” for free.

However, you can use pre-existing effort to increase the amount of effective construction effort ala Assembly Theory.

For example, a challenge written in C and compiled with gcc has: $$I = I_{challenge} + I_{gcc}$$.

In order to grok this challenge, you need to decompile the program and understand it contextually. In some sense, this can be split into: $$G = G_{challenge} + G_{gcc}$$.

Where $G_{gcc}$ is the effort required to decompile the binary. With modern decompilers, C decompilation is usually pretty good and so you get something like 95% of $G_{gcc}$ for free just by using something like Binary Ninja or IDA.

With other toolchains, you may get more or less “free understanding.” For example, problems written in Rust or Golang are usually much harder to grok because the challenge has the added complexity of understanding large mangled code generated by the respective languages, which existing decompilers are currently not very good at understanding.

This principle captures the intuitive understanding that a problem is harder if there is no existing tooling and easier if there is good, existing tooling.

In the case of compilation/assembly/obfuscation pipelines, the size of code generally scales exponentially with the number of steps taken during construction (steps is roughly linear with effort).

In these situations, the amount of effort required to grok a binary is not linear with respect to the size of the binary but rather logarithmic.

During this challenge, I implemented 11 lifter stages to reduce the code size from 110000 instructions down to ~1500 in order to properly grok it and found it to be a really intersting example of this principle.

Based on the types of patterns I lifted, you could imagine the author writing the same lifts in reverse in order to actually construct the binary and so it shows a very clear linear correlation between $G$ and $I$.

# Overview

Ok enough rambling, let’s discuss the challenge:

We are given a serial.c file (where we enter the flag):

char serial[] = "";


A Makefile:

serial:
gcc -shared -o libserial.so serial.c
run:
./eldar
clean:
rm libserial.so > /dev/null


And a binary: eldar which has a really small main function: We also see hundreds of thousands of reloactions in eldar:

(objdump -R eldar)

...
00000000009ca49a R_X86_64_RELATIVE  *ABS*
000000000080409c R_X86_64_RELATIVE  *ABS*+0x00000000008040e4
00000000009ca4fa R_X86_64_COPY     ^[email protected]@Base
00000000008040fc R_X86_64_64       ^[email protected]@Base
00000000009ca4fa R_X86_64_RELATIVE  *ABS*
000000000080409c R_X86_64_RELATIVE  *ABS*+0x00000000008040fc
00000000009ca55a R_X86_64_COPY     ^[email protected]@Base
00000000008040fc R_X86_64_64       ^[email protected]@Base
00000000009ca55a R_X86_64_RELATIVE  *ABS*
000000000080409c R_X86_64_RELATIVE  *ABS*+0x00000000008040e4
00000000009ca5ba R_X86_64_COPY     ^[email protected]@Base
00000000008040fc R_X86_64_64       ^[email protected]@Base
...


Evidently fail is somehow being computed during the dynamic linking phase based on serial

So let’s dive into relocations.

# ELF Relocations

1. relocations are cool
2. relocations are when you, for relocating things
3. you can write data
• symbols
• or just numbers
• (sometimes two)
4. relocation is for linking, but uh..
• let me start over
5. the linker does the relocation with the symbols
6. ok, listen...
7. just do the relocation please

Dynamic ELF Relocations have a structure like:

typedef struct {
Elf64_Xword     r_info;
} Elf64_Rela;


And info consists of a type and (optional)symbol reference:

#define ELF64_R_SYM(info)             ((info)>>32)
#define ELF64_R_TYPE(info)            ((Elf64_Word)(info))


Dynamic symbols will also be relevant and they have the structure:

typedef struct {
Elf64_Word      st_name;
unsigned char   st_info;
unsigned char   st_other;
Elf64_Half      st_shndx;
Elf64_Xword     st_size;
} Elf64_Sym;


All relocations write to an address described by r_offset. The value written depends on the relocation type.

(Note lief uses address instead of r_offset so some of my later terminology and code uses r_address. Pls don’t be confused)

There were four types of relocations used in eldar as follows:

## REL (type 8)

Set relocation to a base-relative value (but base is always 0):

*(uint64_t *)r.r_offset = r.r_addend;


## R64 (type 1)

Set relocation to symbol value + addend:

*(uint64_t *)r.r_offset = symbols[ELF64_R_SYM(r.r_info)] + r.r_addend;


## R32 (type 10)

Same as R64 but truncated to 32-bit:

*(uint32_t *)r.r_offset = symbols[ELF64_R_SYM(r.r_info)].st_value + r.r_addend;


## CPY (type 5)

Elf64_Sym sym = symbols[ELF64_R_SYM(r.r_info)];
memcpy(r.r_offset, sym.st_value, sym.st_size);


# Part 1: Parsing

I found lief to be pretty nice for parsing relocation data and symbol references.

import lief
from collections import namedtuple

b = lief.ELF.parse('./eldar')

def to_sym(name):
assert len(name) == 1
return ord(name)

Rel = namedtuple('REL', ['dst','val','ridx'])
Copy = namedtuple('CPY', ['dst', 'symbol', 'ridx'])

def parse(b) -> list:
relocs = list(b.relocations)

print('[*] Parsing...')
instructions = []
for i in range(3, len(relocs)):
r = relocs[i]
match r.type:
case 1: # R64
instructions.append(
case 5: # CPY
instructions.append(
case 8: # REL
instructions.append(
case 10: # R32
instructions.append(

return instructions

def dump(instructions):
for op in instructions:
match op:
case Rel(dst, val, ridx):
print(f'[{ridx:04d}] :: rel {dst}, {val}')
case Copy(dst, symbol, ridx):
print(f'[{ridx:04d}] :: copy {dst}, {symbol}')
print(f'[{ridx:04d}] :: r64 {dst}, {symbol} + {addend}')
print(f'[{ridx:04d}] :: r64 {dst}, {symbol} + {addend}')

instructions = parse(b)
dump(instructions)


Produces:

...
 :: copy 8405268, 2
 :: rel 8405148, 8405268
 :: rel 8405156, 8
 :: copy 8414226, 2
 :: r64 8405196, 4 + 0
 :: rel 8414226, 0
 :: r64 8405197, 1 + 0
 :: rel 8405148, 8405690
 :: copy 8405244, 2
 :: rel 8405148, 8405196
 :: copy 8414394, 2
 :: r64 8405220, 4 + 0
 :: rel 8414394, 0
 :: rel 8405148, 8405220
 :: copy 8414490, 2
 :: r64 8405220, 5 + 0
 :: rel 8414490, 0
 :: copy 8414562, 2
 :: r64 8405220, 5 + 0
 :: rel 8414562, 0
 :: r64 8405220, 5 + 8405690
 :: copy 8405690, 5
 :: copy 8414666, 2
 :: r64 0, 6 + 0
 :: rel 8405148, 8405698
...
 :: copy 10840770, 2
 :: r64 8405244, 6 + 0
 :: rel 10840770, 0
 :: rel 8405148, 8405244
 :: copy 10840866, 2
 :: r64 8405244, 6 + 0
 :: rel 10840866, 0
 :: rel 8405148, 8405220
 :: copy 10840962, 2
 :: r64 8405244, 6 + 0
 :: rel 10840962, 0
 :: rel 8405148, 8405244
 :: copy 10841058, 2
 :: r64 8405244, 6 + 0
 :: rel 10841058, 0
 :: copy 10841130, 2
 :: r64 8405244, 6 + 0
 :: rel 10841130, 0
 :: rel 8405148, 8405220


More than 100,000 relocations! This also looks pretty unreadable…

What are those values anyways?

At this stage, I noticed that there were some weird reloactions like:

 :: r64 0, 6 + 0


This should be writing to a null pointer, but the program doesn’t crash. So maybe some of the relocations are modifying data as they are executed…

Instead of working with raw numbers, we can check what range the number falls in and assign it a symbolic value if it lies inside the .rela.dyn or .dynsym sections. We also know the flag (serial symbol) is in the range [0x404040:0x40405d] and fail is at 0x404060.

Let’s define our symbolic values:

@dataclass
class Symbol(object):
idx: int
def __repr__(self):
return f's{self.idx}'

@dataclass
class Reloc(object):
idx: int
def __repr__(self):
return f'r{self.idx}'

@dataclass
class Ref(object):
val: Any
def __repr__(self):
return f'&{self.val}'

@dataclass
sym: Symbol
field: str
def __repr__(self):
return f'{self.sym}.{self.field}'

@dataclass
reloc: Reloc
field: str
def __repr__(self):
return f'{self.reloc}.{self.field}'

off = 0
match self.field:
case 'r_info': off = 8

return (self.reloc.idx * 24) + off + rela.virtual_address

@dataclass
idx: int
def __repr__(self):
return f'flag[{self.idx}]'



And we can implement a format_addr funtion that converts a virtual address into a symbolic value:

rela = [x for x in b.sections if x.name == '.rela.dyn']
dynsym = [x for x in b.sections if x.name == '.dynsym']

):
r_offset = (offset // 24)
r_rem = offset % 24

match r_rem:
):
r_offset = (offset // 24)
r_rem = offset % 24

match r_rem:
case 8: return Symbol(r_offset)
else:


While parsing, we will call format_address on every numeric value first (not shown here). We get something like the following:

// Interesting section of NOPs:

// Writing some consecutive values into previously
// applied relocations:
 :: rel r3.r_info, 1
 :: rel r4.r_info, 4
 :: rel r5.r_info, 7
 :: rel r6.r_info, 10

 :: rel r87.r_info, 253
 :: rel s4, 0
 :: rel s2.st_size, 8
 :: r64 s4, s4 + 0
 :: rel s2, flag
 :: rel s2.st_size, 1


We can also see stuff like:

 :: copy r350.r_addend, s2
 :: r64 s4, s4 + 0


Relocation 349 is overwriting the addend in relocation 350. Effectively, this pair is performing s4 += s2 (the 0 in the addend is overwritten with the value of s2).

We also see a similiar pattern for overwriting addresses:

 :: copy r377.r_address, s2
 :: r64 0, s6 + 0


I.e. s2 = s6.

We could probably get some cleaner code if we pattern match on these structures and define higher level operations…

# Part 2: Do you even lift bro?

Now for the fun part!

### 1. Arr + Out references

During parsing, I noticed some relocations were acting as NOPs (writing 0 to a fixed address). And other relocations were storing arbitrary values in those relococation spots (like a uin64_t array).

Specifically, it seems like r3 to r88 are being used as an array. Also r89 seems to be some sort of output value (will get to that later).

We can modify format_addr to understand these references:

@dataclass
idx: int

def __repr__(self):
return f'out[{self.idx}]'

@dataclass
idx: int

def __repr__(self):
return f'arr[{self.idx}]'

):
r_offset = (offset // 24)
r_rem = offset % 24

# --- NEW ---
if r_offset >= 3 and r_offset <= 88:
arr_idx = (r_offset - 3) * 3 + (r_rem // 8)
elif r_offset == 89:
# --- END ---

match r_rem:
# ...


Sample 1 (102000 instructions)

// Nice array references!
 :: rel arr, 248
 :: rel arr, 249
 :: rel arr, 250
 :: rel arr, 251
 :: rel arr, 252
 :: rel arr, 253
 :: rel arr, 254
 :: rel arr, 255
 :: rel s4, 0
 :: rel s2, arr            // neat
 :: rel s2.st_size, 8
 :: r64 s4, s4 + 0
 :: rel s2, flag           // flag!
 :: rel s2.st_size, 1
 :: copy s7, s2
 :: rel s2, s7
 :: rel s2.st_size, 8
 :: r64 s4, s4 + 0
 :: r64 s4.9, s1 + 0
 :: rel s2, arr
 :: copy s6, s2


Much nicer!

Next we can replace Rel, R64 and Copy with the more understandable Mov and Add.

In order to know the size of the mov (can vary with Copy), we need to first trace through all the instructions and keep track of symbol sizes at each offset. We can just look for writes to a symbols st_size to see those changes. Luckily they are deterministic in this program.

Mov = namedtuple('mov', ['dst', 'src', 'sz', 'ridx'])

idx = 0

sizes = []
curr =  * 8
sizes.append(curr)

# Trace instructions and monitor the size of every symbol.
for instr in instructions:
c = list(curr)
match instr:
c[idx] = val
sizes.append(c)

# Iterate and find Mov/Add patterns.
while idx < len(instructions):
match instructions[idx]:
case Rel(dst, val, ridx):
instructions[idx] = Mov(dst, Ref(val), 8, ridx)
case Copy(dst, sym, ridx):
instructions[idx] = Mov(
dst, sym, sizes[idx][sym.idx], ridx)
idx += 1
return instructions

def remove_sizes(instructions):
# Sizes are now nops.
idx = 0
while idx < len(instructions):
match instructions[idx]:
case Mov(SymAddr(Symbol(s), 'st_size'), _, _, _):
instructions[idx:idx+1] = []

idx += 1
return instructions


We also need to update our dump function to support these new types:

def dump(instructions):
for op in instructions:
match op:
# --- snip ---
case Mov(dst, src, 8, ridx):
print(f'[{ridx:04d}] :: mov {dst}, {src}')
case Mov(dst, src, sz, ridx):
print(f'[{ridx:04d}] :: mov({sz}) {dst}, {src}')


For the rest of the writeup, I’ll omit changes to dump; just imagine we are also implementing print formats.

Sample 2 (96179 instructions)

// smoooth
 :: mov arr, &246
 :: mov arr, &247
 :: mov arr, &248
 :: mov arr, &249
 :: mov arr, &250
 :: mov arr, &251
 :: mov arr, &252
 :: mov arr, &253
 :: mov arr, &254
 :: mov arr, &255
 :: mov s4, &0
 :: mov s2, &arr
 :: add s4, s4, 0
 :: mov s2, &flag
 :: mov(1) s7, s2
 :: mov s2, &s7
 :: add s4, s4, 0
 :: r64 s4.9, s1 + 0
 :: mov s2, &arr
 :: mov s6, s2
 :: mov s2, &s4


### 3. Indirect mov

As mentioned before, there are some cases where the r_addend of a subsequent relocation is modified before it is applied. For example:

 :: mov r350.r_addend, s2
 :: add s4, s4, 0


This triplet is just s4 = s2. I think the third instruction (resetting r_addend) is an artifact of how the program was compiled, I don’t think it’s actually necessary.

Using Python 3.10 match ❤️, we can perform really clean pattern matching on this type of structure and replace it with a single Add instruction:

def lift_indirect(instructions):
idx = 0
while idx < len(instructions):
match instructions[idx:idx+3]:
case [
] if (
(r1 == ridx_2) and (r3 == ridx_2)
):
instructions[idx:idx+3] = [
]

idx += 1
return instructions


Sample 3 (58155 instructions)

 :: mov s4, &0
 :: mov s2, &arr
 :: add s4, s4, s2     // wow
 :: mov s2, &flag
 :: mov(1) s7, s2
 :: mov s2, &s7
 :: add s4, s4, s2     // cool
 :: r64 s4.9, s1 + 0
 :: mov s2, &arr
 :: mov s6, s2
 :: mov s2, &s4
 :: add s5, s4, s2     // amazing
 :: mov s2, &s5


### 4. Block

Examining the disassembly, we can spot a repeating block of code:

// ---------------
 :: mov s2, &arr
 :: add s4, s4, s2
 :: mov s2, &flag
 :: mov(1) s7, s2
 :: mov s2, &s7
 :: add s4, s4, s2
 :: r64 s4.9, s1 + 0
 :: mov s2, &arr
 :: mov s6, s2
 :: mov s2, &s4
 :: add s5, s4, s2
 :: mov s2, &s5
 :: add s5, s5, s2
 :: add s5, s5, s2
 :: add s5, s5, arr
 :: mov arr, s5
 :: add 0, s6, 0
// ---------------
 :: mov s2, &arr
 :: add s4, s4, s2
 :: mov s2, &flag
 :: mov(1) s7, s2
 :: mov s2, &s7
 :: add s4, s4, s2
 :: r64 s4.9, s1 + 0
 :: mov s2, &arr
 :: mov s6, s2
 :: mov s2, &s4
 :: add s5, s4, s2
 :: mov s2, &s5
 :: add s5, s5, s2
 :: add s5, s5, s2
 :: add s5, s5, arr
 :: mov arr, s5
 :: add 0, s6, 0
// ---------------


These blocks do:

# ---
s6 = (s6 + arr + flag) & 0xff
arr, arr[s6] = arr[s6], arr
# ---
s6 = (s6 + arr + flag) & 0xff
arr, arr[s6] = arr[s6], arr
# ---


The only things that seem to change are the arr and flag indexes used. We can lift it!

Block = namedtuple('block', ['arr', 'flag', 'ridx'])

def lift_block(instructions):
idx = 0
while idx < len(instructions):
match instructions[idx:idx+18]:
case [
Mov(_,arr,_,ridx),
Mov(_,flag,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
R32(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
]:
instructions[idx:idx+18] = [
Block(arr, flag, ridx)
]
idx += 1
return instructions


Sample 4 (23339 instructions)

// array set
 :: mov arr, &252
 :: mov arr, &253
 :: mov arr, &254
 :: mov arr, &255

// s4 = 0
 :: mov s4, &0

// repeating shuffles...
 :: shuffle &arr, &flag
 :: shuffle &arr, &flag
 :: shuffle &arr, &flag
 :: shuffle &arr, &flag
 :: shuffle &arr, &flag
 :: shuffle &arr, &flag
 :: shuffle &arr, &flag

// later
 :: add s5, s5, s2
 :: add s5, s5, s2
 :: add s5, s5, arr
 :: mov out, s5

// array is reset again...
 :: mov arr, &0
 :: mov arr, &1
 :: mov arr, &2
 :: mov arr, &3
 :: mov arr, &4
 :: mov arr, &5
 :: mov arr, &6
 :: mov arr, &7
 :: mov arr, &8
 :: mov arr, &9
 :: mov arr, &10
 :: mov arr, &11


# 5. Reset + ShuffleBlock

Since the array initialization seems to happen multiple times, lets lift that too. We can also lift the series of shuffle instructions:

Reset = namedtuple('reset', ['ridx'])
ShuffleBlock = namedtuple('shuffleblock', ['f1', 'f2', 'ridx'])

def lift_reset(instructions):
idx = 0
while idx < len(instructions) - 256:
good = True

for i in range(256):
op = instructions[idx+i]
if type(op) == Mov:
dst, src, _, _ = op
if dst != ArrAddr(i) or src != Ref(i):
good = False
break
else:
good = False
break

if good:
instructions[idx:idx+256] = [
Reset(instructions[idx].ridx)
]

idx += 1
return instructions

def lift_shuffle_block(instructions):
idx = 0
while idx < len(instructions) - 256:
good = True

for i in range(256):
op = instructions[idx+i]
if type(op) == Block:
arr, flag, ridx = op
good = False
break
else:
good = False
break

if good:
instructions[idx:idx+256] = [
ShuffleBlock(
instructions[idx].flag,
instructions[idx+1].flag,
instructions[idx].ridx)]

idx += 1
return instructions


Sample 5 (19259 instructions)

 :: reset                             // wow
 :: mov s4, &0
 :: shuffleblock &flag, &flag   // so smol
 :: mov s4, &0
 :: mov s2, &arr
 :: add s4, s4, s2
 :: r32 s4.9, s1, 0
// ...
 :: mov out, s5
 :: reset
 :: mov s4, &0
 :: shuffleblock &flag, &flag   // another!
 :: mov s4, &0
 :: mov s2, &arr
 :: add s4, s4, s2
 :: r32 s4.9, s1, 0


# 6. Output

In between each reset/shuffleblock pair, there are three repeating blocks of instructions that seem to write to the out area:

// ---------------
 :: mov s2, &arr
 :: add s4, s4, s2
 :: r32 s4.9, s1, 0
 :: mov s6, s2
 :: mov s2, &s4
 :: add s5, s4, s2
 :: mov s2, &s5
 :: add s5, s5, s2
 :: add s5, s5, s2
 :: add s5, s5, arr
 :: mov arr, s5
 :: mov s2, &arr
 :: mov s7, s2
 :: mov s2, &s5
 :: add 0, s6, 0
 :: mov s2, &s7
 :: add s6, s6, s2
 :: r32 s6.9, s1, 0
 :: mov s2, &s6
 :: add s5, s6, s2
 :: mov s2, &s5
 :: add s5, s5, s2
 :: add s5, s5, s2
 :: add s5, s5, arr
 :: mov out, s5
// ---------------
 :: mov s2, &arr
 :: add s4, s4, s2
 :: r32 s4.9, s1, 0
 :: mov s6, s2
 :: mov s2, &s4
 :: add s5, s4, s2
 :: mov s2, &s5
 :: add s5, s5, s2
 :: add s5, s5, s2
 :: add s5, s5, arr
 :: mov arr, s5
 :: mov s2, &arr
 :: mov s7, s2
 :: mov s2, &s5
 :: add 0, s6, 0
 :: mov s2, &s7
 :: add s6, s6, s2
 :: r32 s6.9, s1, 0
 :: mov s2, &s6
 :: add s5, s6, s2
 :: mov s2, &s5
 :: add s5, s5, s2
 :: add s5, s5, s2
 :: add s5, s5, arr
 :: mov out, s5
// ---------------


These are doing:

# ---
s4 = (s4 + arr) & 0xff
arr, arr[s4] = arr[s4], arr
out = arr[(arr + arr[s4]) & 0xff]
# ---
s4 = (s4 + arr) & 0xff
arr, arr[s4] = arr[s4], arr
out = arr[(arr + arr[s4]) & 0xff]
# ---


Let’s lift them:

Output = namedtuple('output', ['out', 'arr', 'ridx'])

def lift_output(instructions):
idx = 0
while idx < len(instructions):
match instructions[idx:idx+26]:
case [
Mov(_,arr,_,ridx),
R32(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
R32(_,_,_,_),
Mov(_,_,_,_),
Mov(_,_,_,_),
Mov(out,_,_,_),
]:
instructions[idx:idx+26] = [Output(out, arr, ridx)]
idx += 1
return instructions


Sample 6 (18659 instructions)

// hey this is looking almost readable
 :: reset
 :: mov s4, &0
 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr        // neat
 :: output out, &arr
 :: output out, &arr
 :: reset
 :: mov s4, &0
 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr
 :: output out, &arr
 :: output out, &arr
 :: reset
 :: mov s4, &0
 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr
 :: output out, &arr
 :: output out, &arr
 :: reset
 :: mov s4, &0
 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr
 :: output out, &arr
 :: output out, &arr
 :: reset
 :: mov s4, &0


After we get through flag, these patterns stop and we get some new code.

We encounter these new blocks but they don’t seem to be identical:

/ ----------
 :: mov s2, &out
 :: mov(1) s5, s2
 :: mov s6, &0
 :: mov s2, &s6
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: add s6, s6, s2
 :: add s6, s6, s2
 :: add s6, s6, s2
 :: add s7, s7, s2
// ----------
 :: mov s2, &out
 :: mov(1) s5, s2
 :: mov s6, &0
 :: mov s2, &s6
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s6, s6, s2
 :: mov s2, &s5
 :: add s6, s6, s2
 :: mov s2, &s6
 :: add s7, s7, s2
/ ----------


The blocks work like this:

// s5 = out
 :: mov s2, &out
 :: mov(1) s5, s2
 :: mov s6, &0

// compute some multiple of s5

 :: add s7, s7, s2


The inner multiple bit works by perforing either s6 += s5 (add value) or s6 += s6 (double current value).

For example:

 :: mov s2, &out
 :: mov(1) s5, s2    // s5 = out
 :: mov s6, &0       // s6 = 0

 :: mov s2, &s6
 :: add s6, s6, s2   // s6 = 0
 :: mov s2, &s5
 :: add s6, s6, s2   // s6 = out
 :: mov s2, &s6
 :: add s6, s6, s2   // s6 = 2 * out
 :: mov s2, &s5
 :: add s6, s6, s2   // s6 = 3 * out
 :: mov s2, &s6
 :: add s6, s6, s2   // s6 = 6 * out
 :: add s6, s6, s2   // s6 = 12 * out
 :: mov s2, &s5
 :: add s6, s6, s2   // s6 = 13 * out
 :: mov s2, &s6
 :: add s6, s6, s2   // s6 = 26 * out
 :: add s6, s6, s2   // s6 = 52 * out
 :: add s6, s6, s2   // s6 = 104 * out
 :: add s6, s6, s2   // s6 = 208 * out

 :: add s7, s7, s2   // s7 += (208 * out)


It takes a bit more work but we can also lift these operations:

MultAdd = namedtuple('multadd', ['out', 'val', 'k', 'ridx'])

idx = 0
while idx < len(instructions):
match instructions[idx:idx+3]:
# block prefix
case [
Mov(Symbol(2), out, _, ridx),
Mov(Symbol(5), Symbol(2), _, _),
Mov(Symbol(6), Ref(0), _, _),
]:
k = 0
double = False

ptr = idx + 3

good = True
while ptr < len(instructions):
match instructions[ptr]:
case Mov(Symbol(2), Ref(Symbol(6)), _, _):
double = True
case Mov(Symbol(2), Ref(Symbol(5)), _, _):
double = False
k = (k * 2) if double else (k + 1)
ptr += 1
break
case _:
good = False
break

ptr += 1

if good:
instructions[idx:ptr] = [
]

idx += 1

return instructions


Sample 7 (2377 instructions)

 :: mov s4, &0
 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr
 :: output out, &arr
 :: output out, &arr
 :: mov(2064) arr, s2
 :: mov s4, &0
 :: mov s5, &0
 :: mov s7, &-393039
 :: madd s7 += (&out * 155)   // math!
 :: madd s7 += (&out * 190)
 :: madd s7 += (&out * 208)
 :: madd s7 += (&out * 123)
 :: madd s7 += (&out * 214)
 :: madd s7 += (&out * 146)
 :: madd s7 += (&out * 211)
 :: madd s7 += (&out * 85)
 :: madd s7 += (&out * 87)
 :: madd s7 += (&out * 251)
 :: madd s7 += (&out * 122)
 :: madd s7 += (&out * 60)
 :: madd s7 += (&out * 237)
 :: madd s7 += (&out * 27)
 :: madd s7 += (&out * 63)
 :: madd s7 += (&out * 207)
 :: madd s7 += (&out * 33)
 :: madd s7 += (&out * 138)
 :: madd s7 += (&out * 51)
 :: madd s7 += (&out * 103)
 :: madd s7 += (&out * 58)
 :: madd s7 += (&out * 68)
 :: madd s7 += (&out * 120)
 :: madd s7 += (&out * 111)
 :: mov s2, &s5.11
 :: mov(5) s7.11, s2
 :: mov s2, &s7
 :: add s4, s4, s2
 :: mov s7, &-368502
 :: madd s7 += (&out * 254)
 :: madd s7 += (&out * 21)
 :: madd s7 += (&out * 157)
 :: madd s7 += (&out * 204)
 :: madd s7 += (&out * 56)
 :: madd s7 += (&out * 134)


## 8. Trunc

After each set of madd ops, we see some weird instructions:

 :: mov s2, &s5.11
 :: mov(5) s7.11, s2


We are writing to an 11-byte offset of a symbol which means &sym.st_value + 3.

Examining the code, we see that s5 is always zero. So this instruction is just zeroing the upper 5 bytes of the symbol value. In other words sym &= 0xffffff.

Trunc = namedtuple('trunc', ['val', 'k', 'ridx'])

def lift_truncate(instructions):
idx = 0
while idx < len(instructions):
match instructions[idx:idx+2]:
case [
]:
instructions[idx:idx+2] = [
Trunc(Symbol(7), 0xffffff, ridx)]
idx += 1
return instructions


Sample 8 (2336 instructions)

 :: madd s7 += (&out * 219)
 :: madd s7 += (&out * 74)
 :: madd s7 += (&out * 223)
 :: madd s7 += (&out * 220)
 :: madd s7 += (&out * 235)
 :: madd s7 += (&out * 151)
 :: madd s7 += (&out * 165)
 :: trunc s7 &= 0xffffff         // yay
 :: mov s2, &s7
 :: add s4, s4, s2
 :: mov s7, &-488931
 :: madd s7 += (&out * 207)
 :: madd s7 += (&out * 123)
 :: madd s7 += (&out * 9)


## 9. Array slots

Looking at this new bytecode, we see a ton of scary constants. And some new places that look like they might be arrays:

Note: setinfo is just an 8-byte mov into the symbol st_name field which sets various other fields.

// NOP relocations for array space
// Array data...
 :: mov r93678.r_info, &-4041842636978026168
 :: mov r93679.r_info, &5011179070235933765
 :: mov r93680.r_info, &-3458402876407461680
 :: mov r93681.r_info, &5009120185953550336
 :: mov r93682.r_info, &51245831810508869
 :: setinfo s3, name=0x1a, info=0x1, other=0x0, shndx=0x0
 :: add s5, s3, 0
 :: mov s2, &s5
 :: add s4, s4, s2
 :: mov s5, &0
 :: mov s7, &-55708
 :: madd s7 += (&flag * 76)
 :: madd s7 += (&flag * 104)
 :: mov s2, &flag
 :: mov(1) s5, s2
// Array space?
// Array data...
 :: mov r93806.r_info, &195
 :: setinfo s3, name=0x1a, info=0x1, other=0x0, shndx=0x0
 :: mov s2, &s5


These arrays always seem to consist of a bunch of nop instructions (mov baseaddr(), &0) followed by mov’s into the array with constant values.

We can scan bytecode for these patterns and lift into a fixed constant array:

ArraySlots = namedtuple('arr', ['values', 'ridx'])

def lift_array_slots(instructions):
idx = 0
while idx < len(instructions):
match instructions[idx]:
ptr = idx+1
while ptr < len(instructions):
op = instructions[ptr]
if type(op) != Mov or op.dst != BaseAddr():
break
ptr += 1

start = idx
end = ptr

data = []

# Check for movs into array.
offset = 0
while end + offset < len(instructions) and offset < ((end - start) * 3):
op = instructions[end + offset]
if type(op) == Mov and type(op.dst) is RelocAddr and op.dst.vaddr() == vstart + (offset * 8):
data.append(op.src.val)
else:
break
offset += 1

if len(data) > 0:
data +=  * (((end - start) * 3) - len(data))
instructions[idx:end+offset] = [
ArraySlots(data, ridx)
]

idx += 1
return instructions


Sample 9 (2161 instructions)

 :: mov s5, &0
// what a chonker
 :: array [0x5500404040c7c748, -0x38178276b71a76b8, 0x45c700000000fc45, -0x74c414ffffffff08, 0x458b48d06348fc45, 0x3c00b60fd00148e8, 0x6348fc458b147620, -0x2ffeb717ba74b730, -0x47f88981c3ff49f1, 0xb805eb00000001, 0x4583f84509000000, -0x4081e403827cfe04, -0x74b72f9cb703ba75, 0xb60fd00148e845, 0x458bf84509c0b60f, 0xc3c35d9848f8, 0x0, 0x0]
 :: setinfo s3, name=0x1a, info=0x1, other=0x0, shndx=0x0
 :: add s5, s3, 0
 :: mov s2, &s5
 :: add s4, s4, s2
 :: mov s5, &0
 :: mov s7, &-55708
 :: madd s7 += (&flag * 76)
 :: madd s7 += (&flag * 104)
 :: mov s2, &flag
 :: mov(1) s5, s2
// smol
 :: array [-0x60ff7fbf1bdadb80, 0xc3, 0x0]
 :: setinfo s3, name=0x1a, info=0x1, other=0x0, shndx=0x0


## 10. Shellcode

Now wtf are those values? At first I thought they might be constant target values for some matrix computations.

But I couldn’t help but noticing the 0x48 prefix and 0xc3 suffix and got suspicious about bytecode. I had also noticed earlier that the relocation section was marked rwx.

So let’s throw this first chunk into Binary Ninja and see what happens… Well… that is certainly bytecode. This first large function seems to just be checking that all the flag bytes are in range (0x20, 0x7e).

This is also the only big function, the rest seem to all be 9-byte shellcode data.

I don’t really understand how the shellcode is actually being executed here, but it has something to do with loading a symbol that points to the shellcode. We can recognize the following blocks as an “execute shellcode” block:

 :: array [-0x60ff7fbf1bdadb80, 0xc3, 0x0]
 :: setinfo s3, name=0x1a, info=0x1, other=0x0, shndx=0x0


Let’s lift!

from capstone import *
md = Cs(CS_ARCH_X86, CS_MODE_64)

# ---

Shellcode = namedtuple('shellcode', ['dst', 'code', 'ridx'])

def lift_shellcode(instructions):
idx = 0
while idx < len(instructions):
match instructions[idx:idx+6]:
case [
ArraySlots(values, ridx),
Mov(Symbol(2), _, _, _),
] if (rel2 == ridx) and (rel6 == ridx):
instructions[idx:idx+6] = [
Shellcode(dst, b''.join([(x & 0xffffffffffffffff).to_bytes(8, 'little') for x in values]), ridx)
]
idx += 1
return instructions


And we can also get fancy and disassemble with capstone when we dump:

def dump(instructions):
# ...
case Shellcode(dst, code, ridx):
print(f'[{ridx:04d}] :: exec {dst} <- {code.hex()}')
print('-' * 20)
for i in md.disasm(code, 0):
if i.mnemonic == 'ret':
break
print("    0x%x:\t%s\t%s" % (
i.mnemonic,
i.op_str.replace('0x8040e4', 's5')
.replace('0x8040cc', 's4')))
print('-' * 20)


Sample 10 (1771 instructions)

// have you seen sexier disassembly?

 :: add s4, s4, s2
 :: mov s7, &-30801
 :: madd s7 += (&flag * 15)
 :: madd s7 += (&flag * 41)
 :: mov s2, &flag
 :: mov(1) s5, s2
 :: exec r100498.r_address <- c00425e4408000b4c3000000000000000000000000000000
--------------------
0x0:	rol	byte ptr [s5], 0xb4
--------------------
 :: mov s2, &s5
 :: add s7, s7, s2
 :: mov s2, &flag
 :: mov(1) s5, s2
 :: exec r100515.r_address <- c02c25e440800003c3000000000000000000000000000000
--------------------
0x0:	shr	byte ptr [s5], 3
--------------------
 :: mov s2, &s5
 :: add s7, s7, s2
 :: mov s2, &flag
 :: mov(1) s5, s2
 :: exec r100532.r_address <- c00c25e440800064c3000000000000000000000000000000
--------------------
0x0:	ror	byte ptr [s5], 0x64
--------------------
 :: mov s2, &s5
 :: add s7, s7, s2
 :: madd s7 += (&flag * 167)
 :: mov s2, &flag
 :: mov(1) s5, s2
 :: exec r100607.r_address <- c00425e440800051c3000000000000000000000000000000
--------------------
0x0:	rol	byte ptr [s5], 0x51
--------------------
 :: mov s2, &s5


## 11. Aop

So this shellcode is just being used a mechanism to do more types of operations.

For example:

 :: mov s2, &flag
 :: mov(1) s5, s2
 :: exec r100515.r_address <- c02c25e440800003c3000000000000000000000000000000
--------------------
0x0:	shr	byte ptr [s5], 3
--------------------
 :: mov s2, &s5
 :: add s7, s7, s2


This block is computing s7 += (flag >> 3).

We can define a new operation aop (add operation) that is similar to madd but also contains an operation specifier:

Aop = namedtuple('aop', ['dst', 'op', 'val', 'k', 'ridx'])

def lift_aop(instructions):
idx = 0
while idx < len(instructions):
match instructions[idx:idx+5]:
case [
Mov(Symbol(2), val, _, ridx),
Mov(Symbol(5), Symbol(2), _, _),
Shellcode(_, data, _),
Mov(Symbol(2), Ref(Symbol(5)), _, _),
] if len(data) == 24 and (dst == dst2):
op = next(md.disasm(data, 0))

t = op.mnemonic
k = int(op.op_str.split(', ')[-1], 16)

instructions[idx:idx+5] = [
Aop(dst, t, val, k, ridx)
]

idx += 1
return instructions


Sample 11 (1467 instructions)

// out madd computation
 :: mov s5, &0
 :: mov s7, &-393039
 :: madd s7 += (&out * 155)
 :: madd s7 += (&out * 190)
 :: madd s7 += (&out * 208)
 :: madd s7 += (&out * 123)
 :: madd s7 += (&out * 214)
 :: madd s7 += (&out * 146)
 :: madd s7 += (&out * 211)
 :: madd s7 += (&out * 85)
 :: madd s7 += (&out * 87)
 :: madd s7 += (&out * 251)
 :: madd s7 += (&out * 122)
 :: madd s7 += (&out * 60)
 :: madd s7 += (&out * 237)
 :: madd s7 += (&out * 27)
 :: madd s7 += (&out * 63)
 :: madd s7 += (&out * 207)
 :: madd s7 += (&out * 33)
 :: madd s7 += (&out * 138)
 :: madd s7 += (&out * 51)
 :: madd s7 += (&out * 103)
 :: madd s7 += (&out * 58)
 :: madd s7 += (&out * 68)
 :: madd s7 += (&out * 120)
 :: madd s7 += (&out * 111)
 :: trunc s7 &= 0xffffff
 :: mov s2, &s7
 :: add s4, s4, s2

// flag 16-27 computation
 :: mov s2, &s7
 :: add s4, s4, s2
 :: mov s7, &-62593
 :: madd s7 += (&flag * 162)
 :: aop s7 += (&flag ror 59)
 :: madd s7 += (&flag * 4)
 :: aop s7 += (&flag shl 0)
 :: madd s7 += (&flag * 207)
 :: madd s7 += (&flag * 128)
 :: aop s7 += (&flag ror 21)
 :: aop s7 += (&flag or 72)
 :: madd s7 += (&flag * 23)
 :: aop s7 += (&flag rol 1)
 :: madd s7 += (&flag * 247)
 :: aop s7 += (&flag shr 3)
 :: trunc s7 &= 0xffffff
 :: mov s2, &s7
 :: add s4, s4, s2
 :: mov s7, &-80711
 :: madd s7 += (&flag * 62)
 :: aop s7 += (&flag or 139)
 :: madd s7 += (&flag * 21)
 :: madd s7 += (&flag * 239)
 :: madd s7 += (&flag * 145)
 :: madd s7 += (&flag * 103)
 :: aop s7 += (&flag shr 0)
 :: aop s7 += (&flag rol 228)
 :: madd s7 += (&flag * 184)
 :: madd s7 += (&flag * 241)
 :: madd s7 += (&flag * 72)
 :: aop s7 += (&flag xor 38)
 :: trunc s7 &= 0xffffff
 :: mov s2, &s7
 :: add s4, s4, s2
 :: mov s7, &-98024


# Part 3: Solution

Based on the previous disassembly, we can see that the constraint checker is dividied into chunks like this:

 :: mov s7, &-80711
 :: madd s7 += (&flag * 62)
 :: aop s7 += (&flag or 139)
 :: madd s7 += (&flag * 21)
 :: madd s7 += (&flag * 239)
 :: madd s7 += (&flag * 145)
 :: madd s7 += (&flag * 103)
 :: aop s7 += (&flag shr 0)
 :: aop s7 += (&flag rol 228)
 :: madd s7 += (&flag * 184)
 :: madd s7 += (&flag * 241)
 :: madd s7 += (&flag * 72)
 :: aop s7 += (&flag xor 38)
 :: trunc s7 &= 0xffffff
 :: mov s2, &s7
 :: add s4, s4, s2


Looking near the end, we see:

 :: mov fail(), &0
 :: exec s4 <- 8b0425cc408000f7d819c0c3000000000000000000000000
--------------------
0x0:	mov	eax, dword ptr [s4]
0x7:	neg	eax
0x9:	sbb	eax, eax
--------------------
 :: mov s2, &s4
 :: mov fail(), s2


So it looks like s4 must remain zero or else we will set the fail symbol and the flag will be invalid.

So our block above represents a summation of flag bytes operated on with another value. The summation plus -80711 (initial value of s7) must equal zero.

## 1. Flag pt.2

Similar to our lifters before, we can write a new function that patter matches on constraints with the second half of the flag (flag indexes 16 to 27) and pulls the expression into z3:

from z3 import *

def solve_end_flag(instructions):
orig = [BitVec('f%d' % i, 8) for i in range(16,28)]
flag = ( * 16) + [ZeroExt(24, x) for x in orig]

s = Solver()
for o in orig:

idx = 0
while idx < len(instructions):
match instructions[idx]:
case Mov(Symbol(7), Ref(v), _, ridx) if type(v) is int:
x = BitVecVal(v, 32)

ptr = idx+1
while ptr < len(instructions):
match instructions[ptr]:
x += (flag[f] * k)
case Aop(Symbol(7), op, Ref(FlagAddr(f)), k, _):
v = None
match op:
case 'and': v = flag[f] & k
case 'xor': v = flag[f] ^ k
case 'or': v = flag[f] | k
case 'rol': v = ZeroExt(24, RotateLeft(Extract(7,0,flag[f]), k))
case 'ror': v = ZeroExt(24, RotateRight(Extract(7,0,flag[f]), k))
case 'shl': v = ZeroExt(24, Extract(7,0,flag[f]) << k)
case 'shr': v = flag[f] >> k
case _:
raise Exception(f'unknown aop: {op}')
x += v
case Trunc(Symbol(7), k, _):
case _:
break

ptr += 1

idx += 1

print(s.check())

m = s.model()
flag = bytes([m.eval(o).as_long() for o in orig])

return flag


We get:

sat
b'3_3LF_m4g1c}'


## 2. Output array

I tried doing the same z3 solution for the out array checks (which just use madd) but it was taking a while to solve. ahem

Each individual constraint with only madd is equivalent to one row of a matrix multiplication and since there are exactly 24 out values and 24 constraints, we can solve this matrix.

Instead of using z3, we can pull out the actual matrix and use np.linalg.solve to solve it:

import numpy as np

# ---

def solve_out_arr(instructions):
X = []
Y = []

idx = 0
while idx < len(instructions):
match instructions[idx]:
case Mov(Symbol(7), Ref(v), _, ridx) if type(v) is int:
row = []

ptr = idx+1
while ptr < len(instructions):
match instructions[ptr]:
row.append(k)
case Trunc(Symbol(7), k, _):
X.append(row)
Y.append(-v)
case _:
break

ptr += 1

idx += 1

a = np.array(X, dtype=np.uint32)
b = np.array(Y, dtype=np.uint32)

return [int(x) for x in np.linalg.solve(a,b)]


We get:

[134, 72, 7, 236, 29, 49, 88, 228, 231, 231, 227, 16, 242, 80, 242, 1, 224, 113, 46, 224, 108, 91, 103, 182]


## 3. Flag pt.1

This out array was computed based on the first 16 bytes of the flag with these blocks:

 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr
 :: output out, &arr
 :: output out, &arr
 :: reset
 :: mov s4, &0
 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr
 :: output out, &arr
 :: output out, &arr
 :: reset
 :: mov s4, &0
 :: shuffleblock &flag, &flag
 :: mov s4, &0
 :: output out, &arr
 :: output out, &arr
 :: output out, &arr
 :: reset


Specifically, out[0:3] is derived from flag[0:2]. out[3:6] is derived from flag[2:4], etc…

We can simply brute force each pair of flag bytes to solve for a given output triplet:

def solve_start_flag(output):
def sim(a,b):
arr = list(range(256))

z = 0
for i in range(256):
p = a if i % 2 == 0 else b
z = (z + arr[i] + p) & 0xff
arr[i], arr[z] = arr[z], arr[i]

out = [0,0,0]
z = 0
for i in range(3):
z = (z + arr[i]) & 0xff
arr[i], arr[z] = arr[z], arr[i]
out[i] = arr[(arr[i] + arr[z]) & 0xff]

return out

def solve_chunk(k1,k2,k3):
# Brute force chunk.
for a in range(0x20, 0x7f):
for b in range(0x20, 0x7f):
out = sim(a,b)

if abs(out - k1) <= 1 and abs(out - k2) <= 1 and abs(out - k3) <= 1:
return (a,b)

return None

f = []
for i in range(0,len(output),3):
f += list(solve_chunk(*output[i:i+3]))

return bytes(f)


This gives us:

CTF{H0p3_y0u_l1k


Our final flag is: CTF{H0p3_y0u_l1k3_3LF_m4g1c}

# Conclusion

See the full solve script here: https://gist.github.com/hgarrereyn/9e536e8b3471d3cb8ecbb5932a776b95

I really enjoyed this challenge; weird VM reversing is one of my favorite subcategories and this one proved to be especially tricky due to self modifying code.