Tracking down a segfault in malloc_consolidate()

I’ve been making my way through From Nand to Tetris, a self-guided course that gives you the knowledge to go from having only a simulated NAND gate, through building a basic (simulated) CPU, all the way to writing an OS on top of your simulated hardware that can run Tetris. As a very hands-on learner, I love that Nand2Tetris gives you the resources you need to get started, and then leaves the implementation up to the learner.

I’ve been using the course as an opportunity to learn C, because I’m starting to move away from web programming and towards embedded systems, and my knowledge of lower-level languages is pretty basic.

One section of the course involves building an assembler, which converts a very basic assembly language into machine code, which can be read by the CPU built earlier in the course. This assembler is the first significant program I’ve written in C, and I’ve learned a ton. I ended up implementing my own hash table library to store the assembler’s symbol table, as well as learning to use Valgrind, the basics of Makefiles, and much more.

But when I added and then ran my very last unit test (I’ve been using minunit.h), I got an error that I’d never seen before: malloc_consolidate(): invalid chunk size. The segfault seemed to be happening on the line I’ve marked below, in the main() function of my test.c file:

    int main() {
        char *result = all_tests();
        if (result != 0) {
            printf("%s\n", result);
        } else {
            printf("ALL TESTS PASSED\n");  //  <-------------------
        }
        printf("Tests run: %d\n", tests_run);
    
        return result != 0;
    }

I couldn’t think of a reason why a simple printf statement could be causing a segfault, but after doing a little research, I found that memory corruption can cause errors in a completely different part of the program than where the corruption actually happens. With that in mind, I started commenting out parts of my tests until I tracked down the problem to a line where I deleted a hash table (I’ve left out most of the code for brevity):

    // ....
    
    ht_hash_table *ht = constructor(2 * (19 / 3));
    
    // ...
    
    char *loop = ht_search(ht, "LOOP");
    char *infinite_loop = ht_search(ht, "INFINITE_LOOP");
    mu_assert("first_pass failed to insert at least one label into the symbol table",
        (loop != NULL && !strcmp(loop, "0000000000001010")) &&
        (infinite_loop != NULL && !strcmp(infinite_loop, "0000000000010111")));
    reinit_char(&loop);
    reinit_char(&infinite_loop);
    
    // Test second_pass()
    fseek(fp_files.in, 0, SEEK_SET);
    second_pass(fp_files.in, fp_files.out, ht);
    
    char *counter = ht_search(ht, "counter");
    char *address = ht_search(ht, "address");
    mu_assert("second_pass failed to insert at least one symbol into the symbol table",
        (counter != NULL && !strcmp(counter, "0000000000010000")) &&
        (address != NULL && !strcmp(address, "0000000000010001")));
    reinit_char(&counter);
    reinit_char(&address);
    
    fclose(fp_files.in);
    fclose(fp_files.out);
    ht_delete(ht);   //  <--------- This is the line in question

Most of the other functions in that chunk of code aren’t related to the issue, but for reasons that’ll become clear later, reinit_char is just a helper function that frees a char* and sets it to NULL:

    void reinit_char(char **to_process) {
        free(*to_process);
        *to_process = NULL;
    }

While I now knew that my issue was happening in ht_delete(), I had no idea what kind of error I was looking for. I spent a ton of time in GDB stepping through parts of that function, and the functions it calls, but still couldn’t figure out what was going on. I tried Valgrind, but it didn’t show any errors:

    $ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose ./test

I came across a Stack Overflow answer mentioning Valgrind’s SGCheck tool (in the SO answer I linked to, it’s called PtrCheck, but it’s since been renamed). According to SGCheck’s documentation, it is “an experimental stack and global array overrun detector.” Well, that sounded like just what I needed, so I ran another Valgrind command:

    $ valgrind --tool=exp-sgcheck --verbose ./test

And while that did at least produce an error of some kind, it also crashed Valgrind, and the error it gave me didn’t seem to be related to the problem I was having.

Another route I tried was adding a couple more compiler flags to GCC:

  • -Wextra, because I wanted to make 100% sure that I was getting all the relevant info out of the compiler that I could. It found a few accidental implicit type conversions that I’d missed (mostly between int and unsigned int, in loops that relied on a sizeof call), but didn’t add any new info about what was causing my malloc_consolidate() error.
  • -fsanitize=undefined, which checks for undefined behavior, also didn’t produce any new information. I did read some SO questions implying that -fsanitize behaves somewhat differently in Clang than in GCC, so it’s possible the flag would have caught something under Clang; still, it should work pretty much the same across the two compilers, so I didn’t bother switching over.

Unsurprisingly, the glibc docs include a page on tools for checking heap consistency…I wish I’d checked those earlier. The two main tools it discusses are setting the MALLOC_CHECK_ environment variable, and adding the linker flag -lmcheck. At the bottom of the page, it says:

So, what’s the difference between using MALLOC_CHECK_ and linking with -lmcheck? MALLOC_CHECK_ is orthogonal with respect to -lmcheck. -lmcheck has been added for backward compatibility. Both MALLOC_CHECK_ and -lmcheck should uncover the same bugs - but using MALLOC_CHECK_ you don’t need to recompile your application.

With that in mind, I set MALLOC_CHECK_=2 and ran my tests again…and nothing changed. Bummer. Well, even though it supposedly does the same thing, I recompiled with -lmcheck, and…BINGO! I got a single new line of error information: block freed twice. Awesome.
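
For reference, the two invocations look roughly like this (the file names are placeholders for my actual build, and on newer glibc versions the exact behavior of these checks may differ):

```shell
# Run the existing binary with glibc's heap checking enabled (no rebuild).
# 2 = abort immediately with a diagnostic when corruption is detected.
MALLOC_CHECK_=2 ./test

# Or link the heap-consistency hooks in at build time:
gcc -Wall -Wextra -g -o test test.c -lmcheck
```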

An hour of stepping through my ht_delete() function later, I found the spot of the crash: it was when I freed one of the values (not the key, only the value) in my hash table. As it turned out, I’d built my hash table’s search function to take in a char *key and return a char * value, but the pointer I was returning was the actual pointer to the value stored in the hash table, not a duplicate. Part of my tests involved searching the hash table to make sure it contained all the symbols in the assembly file I fed into it, so I was storing the result of each search in a char * variable and then calling reinit_char() on that variable. reinit_char() was freeing the char * in my test function, which also freed the actual value stored in the hash table! When ht_delete() later tried to free that same value, the result was a double free and heap corruption. Here’s a trimmed-down version of the snippet I shared above, but now with the problem pointed out:

    // ...
    
    ht_hash_table *ht = constructor(2 * (19 / 3));
    
    // ...
    
    char *loop = ht_search(ht, "LOOP");  // Search the hash table for a node with the key "LOOP"
    
    // ...
    
    reinit_char(&loop);  // Free `loop` and set it to NULL
    
    // ...
    
    // Now, when I try to delete the hash table, it goes to free the value corresponding to the key "LOOP".
    // But, since the `loop` variable above was a pointer directly to where the value was stored in the hash table,
    // when I free'd/NULL'd `loop`, the value in the hash table was deleted. The `ht_delete` method tries to free
    // that value again, resulting in a double free error and memory corruption.
    ht_delete(ht);       

I know that this is a relatively simple memory corruption problem, but since I’d never run into something like it before, it took me a lot of time and research to figure it out. I’m stoked to have some more insight into what’s going on when I run into a problem like this in the future, and I hope this helps out someone who’s trying to fix the same issue.