Using malloc_debug to Find Memory Related Bugs

Blog post by mmlr on Mon, 2010-02-08 01:17

Enabling malloc_debug

Since the malloc_debug heap implementation does a lot of unconditional error checking and validation, it isn't used by default. Instead it is part of libroot_debug.so. To run an application with libroot_debug.so instead of libroot.so, which automatically makes the app use the debug heap, you need to set the environment variable LD_PRELOAD to "libroot_debug.so". The easiest way to do so is to run your app from a Terminal like this:

		LD_PRELOAD=libroot_debug.so myApplication
	
With this you already get most of the malloc_debug features: always initialized memory, overwriting of freed memory, and overrun and alignment checks on deallocation.

Helpful Kernel Debugger Output

In general, when your app crashes or enters the debugger, the kernel will output relevant information to the syslog. This info is most often more to the point and easier to understand than what gdb will tell you in userland. Therefore I recommend always keeping a Terminal with the syslog output open while debugging. The easiest way is to run tail on the syslog like this:

	tail -F /var/log/syslog

Bug: Using Uninitialized Memory

One of the things malloc_debug does for you is to always initialize memory blocks it returns to you to a known value: 0xcc. This helps you find cases where you use allocated memory uninitialized.

Common Occurrence

Commonly this happens when forgetting to initialize class members or structure fields.

Example

	class TheClass {
		...
		BWindow *fWindow;
	};

	TheClass::TheClass()
		:
		fMemberA(0),
		fMemberB(NULL)
	{
	}

	...

	void
	TheClass::SomeMethod()
	{
		if (fWindow == NULL)
			fWindow = new BWindow(...);
		fWindow->Show();
		...
	}

In this example the fWindow member of the class is not initialized in the constructor. When it is later used, as in SomeMethod(), the code assumes that fWindow happens to be 0. This may or may not be the case; it is mostly random.

How To Spot

Since the data allocated by malloc() and friends as well as new (including the storage for your object members) is normally not initialized, the results of running the buggy code depend on what happens to be in these memory ranges at the time you run it. When running without the debug heap this will usually manifest in random misbehaviour that changes from run to run: sometimes it crashes, sometimes it doesn't. With the debug heap the memory returned by the memory allocation functions is always initialized to 0xcc. This means that if you are comparing uninitialized memory to NULL, for example, all of these checks will now fail reliably, and if you try to execute or access uninitialized pointers your application will reliably crash with a segfault. If you look at the syslog output of such a crash you can easily spot a line similar to this:

		KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0xccccccd4
	
The fault address (0xccccccd4 here) will vary depending on what exactly you accessed, but it is still easy to see that the base pointer used here was 0xcccccccc, indicating uninitialized memory. The syslog output will also contain a stack trace, so you should have most of the info you need to fix the problem.
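
To see where an address like 0xccccccd4 comes from, here is a small sketch. It assumes 32-bit addresses and a made-up struct whose accessed field sits 8 bytes into the object; neither the struct nor the function name come from the heap implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical object layout, for illustration only.
struct Example {
	uint32_t first;
	uint32_t second;
	uint32_t third;	// lives at offset 8
};

// Returns the address a 32-bit system would fault on when reading
// `third` through a pointer taken from 0xcc-filled memory.
uint32_t
fault_address_for_uninitialized_pointer()
{
	uint32_t pointerBits;
	memset(&pointerBits, 0xcc, sizeof(pointerBits));	// 0xcccccccc
	return pointerBits + (uint32_t)offsetof(Example, third);
}
```

Subtracting the field offset from the fault address in the syslog gets you back to the 0xcccccccc base pointer.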

Recommendations

Always initialize all your variables. The only exceptions are static variables, because these are always initialized to 0. I personally still initialize them to make it absolutely obvious, but that's a matter of taste (or coding style policy).
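
As a sketch of the fix for the example above, the constructor below initializes every member. The member types and the Window stand-in for BWindow are assumptions, since the original class body is elided. The helper constructs the object in storage pre-filled with 0xcc to imitate what the debug heap hands out.

```cpp
#include <cstddef>
#include <cstring>
#include <new>

class Window;	// hypothetical stand-in for BWindow

class TheClass {
public:
	TheClass()
		:
		fMemberA(0),
		fMemberB(NULL),
		fWindow(NULL)	// the initializer that was missing
	{
	}

	bool HasWindow() const { return fWindow != NULL; }

private:
	int fMemberA;
	void *fMemberB;
	Window *fWindow;
};

// Constructs a TheClass in storage pre-filled with 0xcc and reports
// whether fWindow looks like it has been set.
bool
has_window_after_construction()
{
	alignas(TheClass) unsigned char storage[sizeof(TheClass)];
	memset(storage, 0xcc, sizeof(storage));

	TheClass *object = new(storage) TheClass();
	bool result = object->HasWindow();
	object->~TheClass();
	return result;
}
```

With the initializer in place the NULL check behaves the same no matter what the memory contained before.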

Bug: Using Already Freed Memory

Another thing malloc_debug does for you is to always fill memory you return to the heap by means of free() or delete with a different known value: 0xdeadbeef. This helps you find cases where you use already freed memory.

Common Occurrence

Most often this happens when holding pointers to memory blocks in different locations of an application, for example when keeping multiple lists and forgetting to remove a pointer from one of them.
It can also happen if a certain process runs through in an order that was not expected, for example when event handlers are called while or after a target is freed (due to missing locking).
Sometimes it is also just a simple oversight like still accessing data inside an object that was just freed.

Example

	element_data *
	unlink_and_free(linked_list_element *previous, linked_list_element *element)
	{
		previous->next = element->next;
		free(element);
		return element->data;
	}

	void
	function()
	{
		...
		element_data *data = unlink_and_free(...);
		if (data->has_something)
			...
	}

When reading closely this is pretty obvious. Still it can easily occur, especially when refactoring code. The thing is that code like this will often work just fine due to the allocator not necessarily doing anything with the memory and not giving it back out quickly enough. Crashes coming from such bugs can be very rare and therefore frustrating to analyze.

How To Spot

When freeing memory, malloc_debug will overwrite the block you hand in before returning. The pattern used is 0xdeadbeef, and when accessing pointers in freed memory blocks the app will crash with a segfault. The syslog will contain a line similar to this:

		KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0xdeadbeef
	
The fault address will vary depending on what exactly is accessed. Note though that due to how the heap implementation works internally, the first sizeof(void *) bytes will be used as a free list element and therefore may not contain 0xdeadbeef after returning from free.

Recommendations

The fix depends on the actual bug. If it is a concurrency issue, introduce proper locking. If memory management seems to get difficult, consider introducing reference counting or other more advanced memory management techniques. Sometimes it is enough to take a good long look at the function the stack trace points to.
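
For the simple oversight in the example above, the fix is to read what you need before freeing. A minimal sketch follows; the struct definitions and the demo function are assumptions, as the original does not show them.

```cpp
#include <cstdlib>

// Hypothetical types; the original example does not show their definitions.
struct element_data {
	bool has_something;
};

struct linked_list_element {
	linked_list_element *next;
	element_data *data;
};

element_data *
unlink_and_free(linked_list_element *previous, linked_list_element *element)
{
	previous->next = element->next;
	element_data *data = element->data;	// read before freeing
	free(element);
	return data;
}

// Usage: builds a short list, unlinks its only heap element and checks
// that the returned data pointer is still usable.
bool
unlink_demo()
{
	element_data payload = { true };
	linked_list_element head = { NULL, NULL };

	linked_list_element *node
		= (linked_list_element *)malloc(sizeof(linked_list_element));
	if (node == NULL)
		return false;

	node->next = NULL;
	node->data = &payload;
	head.next = node;

	element_data *data = unlink_and_free(&head, node);
	return data == &payload && data->has_something && head.next == NULL;
}
```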

Bug: Double Free

Freeing memory twice, for example by actually calling free() twice on the same address or by using delete on already deleted objects.

Common Occurrence

Most often this is just an oversight. It can happen easily when using self-deleting objects and then deleting manually again on exit.

Example

	void
	TheClass::Recycle()
	{
		...
		delete this;
	}

	void
	main_function()
	{
		TheClass *object = new TheClass();
		function_that_will_indirectly_cause_a_recycle_of_the_object(object);
		delete object;
	}

Written out like this the bug is pretty obvious, but in real code it is usually not that easy to see.

How To Spot

A double free will cause the debugger to be invoked directly. The debugger message will be similar to this:

		free(): address 0x18006000 already exists in page free list (double free)
	
The line is located in the gdb output right after the gdb copyright, license info and the process group error but before the library loading output. If you have a lot of libraries (directly or indirectly) in use you may need to scroll up to see it. The syslog will also contain the debugger call as well as a stack trace.

Recommendations

Remove the superfluous free() / delete calls. If memory management gets difficult a solution might be to switch to reference counting or other more advanced techniques.
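
A rough sketch of the reference counting idea, applied to the example above. The Referenceable class here is a made-up, single-threaded illustration and not an existing API; a real implementation would need atomic counting.

```cpp
#include <cassert>

// Minimal, hypothetical reference counted base class; not thread safe.
class Referenceable {
public:
	Referenceable() : fReferenceCount(1) {}
	virtual ~Referenceable() {}

	void AcquireReference() { fReferenceCount++; }

	// Deletes the object when the last reference goes away.
	void ReleaseReference()
	{
		assert(fReferenceCount > 0);
		if (--fReferenceCount == 0)
			delete this;
	}

private:
	int fReferenceCount;
};

static int sDestroyedCount = 0;

class TheClass : public Referenceable {
public:
	virtual ~TheClass() { sDestroyedCount++; }

	// Gives up the reference held by whoever recycles the object.
	void Recycle() { ReleaseReference(); }
};

// Mirrors main_function() from the example: the recycler and the owner
// each hold one reference. Returns how often the object was destroyed.
int
run_example()
{
	sDestroyedCount = 0;
	TheClass *object = new TheClass();	// the owner's reference
	object->AcquireReference();		// reference handed to the recycler
	object->Recycle();			// recycler is done
	object->ReleaseReference();		// owner is done, object deleted here
	return sDestroyedCount;
}
```

Neither side calls delete directly anymore, so the order in which they finish no longer matters.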

Bug: Misaligned Free / Free of Unallocated Memory

Using free() on an address that is offset from the originally returned address or using free() on an address that wasn't allocated by the heap at all.

Common Occurrence

Misaligned frees can happen when doing pointer arithmetic.

Example

	void
	some_function()
	{
		char *string = strdup("hello");
		while (string[0] != 'l')
			string++;

		...
		free(string);
	}

In the example above the string variable has been advanced and therefore doesn't point to the same address the allocation returned.

How To Spot

A misaligned free will cause the debugger to be invoked. The debugger message will be similar to this:

		free(): address 0x1800102a does not fall on allocation boundary for page base 0x18001000 and element size 20
	
The line is again visible in both the gdb and the syslog output. The message contains the address that was supplied to free(), the page base from where this allocation has been made as well as the element size used to satisfy the allocation request. If relevant, the misalignment can be computed from these numbers:
		(address - base) - (int((address - base) / elementSize) * elementSize)
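
Put differently, the misalignment is the remainder of dividing the offset from the page base by the element size. A small sketch with a made-up function name, using the numbers from the sample message:

```cpp
#include <cstdint>

// Offset of `address` from the start of the element it falls into.
uint32_t
misalignment(uint32_t address, uint32_t pageBase, uint32_t elementSize)
{
	return (address - pageBase) % elementSize;
}
```

For the message above, misalignment(0x1800102a, 0x18001000, 20) yields 2: the address is 42 bytes into the page, two bytes past the start of the third 20 byte element.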
	
When freeing an address that hasn't been allocated by the heap at all, the free() call will fail and the debugger will be invoked with a message similar to this:
		free(): free failed for address 0x8000
	
The address will be the address supplied to free(). Note that this message will be triggered when you continue from a misaligned free as well.

Recommendations

Review the places where you do pointer arithmetic and make proper copies of the originally returned addresses where necessary.
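
One way to do that for the example above is to walk the string with a second pointer, so the variable holding the allocation never changes. This is only a sketch; the return value was added to make the function observable.

```cpp
#include <cstdlib>
#include <cstring>

// Returns the index of the first 'l' in a private copy of "hello",
// or -1 if there is none or the allocation fails.
int
some_function()
{
	char *string = strdup("hello");
	if (string == NULL)
		return -1;

	// Walk with a second pointer; `string` keeps the allocation address.
	const char *cursor = string;
	while (*cursor != '\0' && *cursor != 'l')
		cursor++;

	int index = *cursor == 'l' ? (int)(cursor - string) : -1;

	free(string);	// exactly the address strdup() returned
	return index;
}
```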

Bug: Overwriting Memory Past the Allocation

When overwriting memory past the allocated size this doesn't necessarily lead to a segfault. If the write stays within the heap address range it will usually just corrupt whatever is overwritten. Sadly malloc_debug cannot tell the exact place where this corruption happens. In some cases it can however tell you that it did happen sooner than you would otherwise notice.

Common Occurrence

Most often this is the classic buffer overrun, caused by writing more data into an allocated space than fits. It can also happen when (reinterpret) casting memory blocks to the wrong type and then using fields that aren't actually there.

Example

	void
	some_function(const char *inputString)
	{
		char *buffer = new char[64];
		strcpy(buffer, inputString);
		...
		delete[] buffer;
	}

The function uses the unsafe strcpy() instead of strncpy() and therefore never tells the copy routine how much space is available in the buffer. If the input string (including its terminating null byte) is longer than 64 bytes, memory past the buffer will be corrupted.
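
A bounded version of the example could look like the sketch below. The return value is an addition for illustration. Note that strncpy() does not terminate the string when it has to truncate, so the terminator is written explicitly.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Copies inputString into a 64 byte buffer, truncating if necessary,
// and returns what ended up in the buffer.
std::string
some_function(const char *inputString)
{
	const size_t kBufferSize = 64;
	char *buffer = new char[kBufferSize];

	// strncpy() never writes more than the given size, but it leaves the
	// string unterminated when it truncates, so terminate it explicitly.
	strncpy(buffer, inputString, kBufferSize - 1);
	buffer[kBufferSize - 1] = '\0';

	std::string result(buffer);
	delete[] buffer;
	return result;
}
```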

How To Spot

When not using interval based wall checking (see below) this form of corruption will be detected on free() / delete of the allocation where certain extra data stored on allocation is verified. Either of these two messages can occur depending on how far the memory has been corrupted:

		someone wrote beyond small allocation at 0x18003050; size: 64 bytes; allocated by 9084; value: 0x18003000
	
This is the so-called "wall" value being verified. Currently the wall is very small: it consists of exactly sizeof(addr_t) bytes and stores the address where it is expected to be found. This kind of wall checking is enough for many kinds of buffer overruns, but it can't detect more random accesses like those caused by casting to the wrong struct type. A more elaborate wall may be introduced at a later stage. The expected value can easily be calculated by adding the allocation address and the size (0x18003050 + 64 = 0x18003090 for the message above). The value actually found might give a clue as to what wrote beyond the allocated space.
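
To make the wall idea concrete, here is a simplified model of it. This is not the actual malloc_debug code; the function names are made up and the layout (wall placed directly behind the requested size) is an assumption based on the description above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// The wall stores the address at which it is located.
uintptr_t
expected_wall_value(uintptr_t allocation, size_t size)
{
	return allocation + size;
}

// Allocates `size` usable bytes followed by a wall.
void *
wall_alloc(size_t size)
{
	char *block = (char *)malloc(size + sizeof(uintptr_t));
	if (block == NULL)
		return NULL;

	uintptr_t wall = expected_wall_value((uintptr_t)block, size);
	memcpy(block + size, &wall, sizeof(wall));
	return block;
}

// Returns true if the wall behind the allocation is untouched.
bool
wall_intact(const void *allocation, size_t size)
{
	uintptr_t stored;
	memcpy(&stored, (const char *)allocation + size, sizeof(stored));
	return stored == expected_wall_value((uintptr_t)allocation, size);
}

// Usage: the wall is intact after allocation and broken after an overrun.
bool
wall_detects_overrun()
{
	const size_t size = 64;
	char *buffer = (char *)wall_alloc(size);
	if (buffer == NULL)
		return false;

	bool intactBefore = wall_intact(buffer, size);
	memset(buffer, 'a', size + sizeof(uintptr_t));	// runs into the wall
	bool intactAfter = wall_intact(buffer, size);

	free(buffer);
	return intactBefore && !intactAfter;
}
```

A write that skips over the wall and lands further behind the allocation leaves the stored value untouched, which is why this check misses the more random kinds of corruption.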

If the overwriting progresses further it is possible that another debugger call happens:

		leak check info has invalid size 1633771873 for element size 80, probably memory has been overwritten past allocation size
	
This one occurs when the leak checking info is compromised by the overrun and usually indicates the same issue.
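
The bogus size can be a clue in the same way the wall value can. As a sketch, with a made-up helper name: the 1633771873 from the sample message is 0x61616161, which is four ASCII 'a' characters, suggesting that a run of 'a's was written over the leak check info.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Interprets the four bytes of a 32-bit value as characters, in memory order.
std::string
bytes_as_text(uint32_t value)
{
	char bytes[sizeof(value)];
	memcpy(bytes, &value, sizeof(value));
	return std::string(bytes, sizeof(bytes));
}
```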

Recommendations

Check for places where unsafe string handling is done, like with strcpy, strcat or sprintf, and replace these calls with their bounded counterparts (strncpy, strlcat, snprintf). Review the sizes passed to memcpy and memset. Consider using higher level abstractions for string handling like BString or C++ strings.

Advanced Use of malloc_debug

There are also features present in malloc_debug that aren't enabled by default because of their performance overhead. These features include interval based wall checking and heap validation. There are also reporting features that allow you to request that heap info is dumped for you to manually analyze.

Accessing Advanced Features

The malloc_debug functions are declared in the malloc_debug.h header in the posix header directory. Depending on how recent your installation is, the header might not be there yet; it was introduced in r35431. Note that you can't just copy that header to an older installation and use these features, as the API was redesigned when the header was added, so the functions aren't compatible. Since this API is not present in the normal libroot, you need to explicitly link against libroot_debug.so when building your app with these function calls. So to use them, include the header:

		#include <malloc_debug.h>
	
And when compiling/linking use:
		-lroot_debug
	

Interval Based and Manual Wall Checking

Wall checking can be done for you at a certain interval automatically, or you can trigger wall checking manually if you already have a more specific suspicion.

	extern status_t heap_debug_start_wall_checking(int msInterval);
	extern status_t heap_debug_stop_wall_checking();

These functions are used to start and stop interval based wall checking. The start function takes an interval in milliseconds (so a value of 1000 will cause a wall check every second). Note that these checks have a certain overhead, so you might not want to use them with too small an interval.

	extern void heap_debug_validate_walls();

This will trigger a manual validation of the wall values of all allocations. It does the same as the interval based wall checker does.

When either of these methods detects that a wall value has been overwritten, a debugger call will be triggered. The output line is the same as for wall checking on free (see above).

Paranoid and Manual Validation

The heap has extensive validation functions which validate pretty much every aspect of the internal heap implementation. If memory corruption is going on and hits any of the data used by the allocator, validation will most likely detect it. This feature was mainly used while developing the heap implementation, but since it can also uncover random memory corruption it has been made available.

	extern void heap_debug_set_paranoid_validation(bool enabled);

Using this function paranoid validation can be enabled and disabled again. By default it is disabled. Paranoid validation means that after every heap operation (allocation, reallocation, free) a full validation of the corresponding heap will be done. Note that this is very performance intensive, especially when the heap usage gets bigger.

	extern void heap_debug_validate_heaps();

With that function a manual validation of all heaps can be triggered (internally the heap implementation uses different heap classes for the different allocation sizes). If you suspect a specific code part to be problematic you could for example run a wall check and a validation after running through that code.
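
Such a check could be wrapped up as in the sketch below. The heap_debug_* functions are the ones declared above; run_and_validate() and suspicious_code() are made-up names. The stand-ins in the #else branch only exist so the sketch builds on systems without malloc_debug.h. They merely count calls and detect nothing.

```cpp
#include <cstdlib>

#ifdef __HAIKU__
#include <malloc_debug.h>	// link with -lroot_debug
#else
// Stand-ins for other systems, assumptions for illustration only.
// The real functions live in libroot_debug.so.
static int sWallChecks = 0;
static int sHeapValidations = 0;
static void heap_debug_validate_walls() { sWallChecks++; }
static void heap_debug_validate_heaps() { sHeapValidations++; }
#endif

// Runs `suspect` and checks the heap right afterwards. If the code
// corrupted a wall or a heap structure, the debugger call comes from here.
void
run_and_validate(void (*suspect)())
{
	suspect();
	heap_debug_validate_walls();
	heap_debug_validate_heaps();
}

// Hypothetical code under suspicion.
static void
suspicious_code()
{
	char *buffer = (char *)malloc(64);
	if (buffer != NULL) {
		buffer[63] = '\0';	// stays within bounds
		free(buffer);
	}
}
```

Calling run_and_validate(suspicious_code) at a few places narrows down which part of the code corrupts memory, at a much lower cost than paranoid validation.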

Dumping Heap and Allocation Info

There are two functions that can be used to dump allocations done by the application and to dump general info about the heap. Note that the dumped info also includes allocations done by the system during startup of the application and by the system classes used by the application.

	extern void heap_debug_dump_allocations(bool statsOnly, thread_id thread);

The heap keeps certain info when allocating memory. Currently this is only the allocation size as well as the allocating thread. In the future this might be extended by stack traces. Using this function the allocations can be dumped to stdout. If statsOnly is true it will only print a few stats and not all allocations. If thread is >= 0 it is interpreted as a filter and only allocations done by that thread are dumped. If you want to dump all allocations just provide -1 as the thread argument.

	extern void heap_debug_dump_heaps(bool dumpAreas, bool dumpBins);

Dumping the heap info gives an idea about the internal structure of the heap and how it is currently used. Usually this info is of little value to the app developer directly and it was added mostly for completeness' sake.