Saving memory for free
By Rémi Duraffort on Wednesday, July 14 2010, 16:08 - VLC media player
We have reduced the memory footprint of VLC media player simply by repacking some important structures. Let's have a look at how structures are laid out in memory and how this affects memory usage.
Memory representation of a structure
When creating a structure you do not really care about its representation in memory. You expect the structure's size to be the sum of its components' sizes. Unfortunately the size of a structure also depends on other parameters.
These parameters are mainly:
- The CPU architecture (32 or 64 bits)
- Some optimizations done by the compiler
- The order in which the components appear in the structure
Let's have a look at the following basic structure:
typedef struct
{
    int i_age;
    char *psz_name;
    int i_level;
} people_t;
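For a 64-bit CPU, the structure will look like this in memory:
- bytes 0 to 3: i_age
- bytes 4 to 7: hole
- bytes 8 to 15: psz_name
- bytes 16 to 19: i_level
- bytes 20 to 23: padding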
As you can see, the structure is full of holes when compiled for a 64-bit processor: it uses 50% more memory than the sum of the sizes of its elements. The explanation is really simple: the CPU reads aligned memory faster than non-aligned memory. To improve performance, the compiler aligns each member of the structure on its natural alignment: on a typical 64-bit system a pointer is aligned on 64 bits, while an int only needs 32 bits. The total size of the structure is also rounded up to a multiple of its largest alignment (here 64 bits), so that every element of an array of structures stays aligned. On a typical 32-bit CPU, pointers and ints are both aligned on 32 bits, so this structure has no holes.
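The holes can also be observed from C itself. Here is a minimal sketch, assuming a typical 64-bit platform with 4-byte int and 8-byte pointers; the helper names people_wasted_bytes and people_print_layout are only used for illustration.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof, size_t */
#include <stdio.h>

typedef struct
{
    int   i_age;
    char *psz_name;
    int   i_level;
} people_t;

/* A structure is never smaller than the sum of its members. */
static_assert(sizeof(people_t) >= sizeof(int) + sizeof(char *) + sizeof(int),
              "structure smaller than its members");

/* Bytes lost to holes and padding: total size minus the sum of the members. */
static size_t people_wasted_bytes(void)
{
    size_t members = sizeof(int) + sizeof(char *) + sizeof(int);
    return sizeof(people_t) - members;
}

/* Print the offset of every member, then the size and the wasted bytes. */
static void people_print_layout(void)
{
    printf("i_age    at %zu\n", offsetof(people_t, i_age));
    printf("psz_name at %zu\n", offsetof(people_t, psz_name));
    printf("i_level  at %zu\n", offsetof(people_t, i_level));
    printf("size %zu, wasted %zu\n", sizeof(people_t), people_wasted_bytes());
}
```

Calling people_print_layout() from main on such a platform should report offsets 0, 8 and 16, a size of 24 and 8 wasted bytes. Other ABIs may give different numbers.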
Pahole: finding holes in your structures
Pahole is a tool that helps you find holes in your structures. On Debian you can install Pahole with the package dwarves.
gcc -g -o test test.c
pahole test
typedef struct {
int i_age; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
char *psz_name; /* 8 8 */
int i_level; /* 16 4 */
/* size: 24, cachelines: 1 */
/* sum members: 16, holes: 1, sum holes: 4 */
/* padding: 4 */
/* last cacheline: 24 bytes */
} people_t; /* definitions: 1 */
Pahole analyzes the binary produced by GCC (do not forget the -g switch to enable debug symbols) and lists the structures that contain holes. Pahole shows that:
- There is a 4-byte (32-bit) hole between i_age and psz_name
- The compiler adds 4 bytes of padding at the end of the structure
- The size of the structure is 24 bytes though the sum of its members is only 16 bytes
We can now reorganize the elements inside the structure to reduce its size:
typedef struct
{
    char *psz_name;
    int i_age;
    int i_level;
} people_t;
The structure now looks like this in memory:
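A minimal sketch to double-check this layout at build time, assuming 4-byte int and 4-byte or 8-byte pointers. The name people_packed_t is hypothetical and only used to avoid clashing with the original definition.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */

/* Hypothetical name, to avoid clashing with the original people_t. */
typedef struct
{
    char *psz_name;  /* member with the largest alignment first */
    int   i_age;
    int   i_level;   /* placed right after i_age, with no hole */
} people_packed_t;

/* No hole between members: each one starts where the previous one ends. */
static_assert(offsetof(people_packed_t, i_age) == sizeof(char *),
              "hole after psz_name");
static_assert(offsetof(people_packed_t, i_level)
              == offsetof(people_packed_t, i_age) + sizeof(int),
              "hole after i_age");

/* No trailing padding: the size equals the sum of the members.
 * This holds on common 32-bit and 64-bit ABIs, but may fail elsewhere. */
static_assert(sizeof(people_packed_t) == sizeof(char *) + 2 * sizeof(int),
              "unexpected padding");
```

With 8-byte pointers, the members sit at offsets 0, 8 and 12, and the structure takes 16 bytes instead of 24.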

Important structures in VLC
Let's have a look at the memory footprint of VLC media player when VLC isn't doing anything. Of course most of the memory is used by the Qt4 interface. Let's restart VLC without the Qt4 interface to look deeper into the memory footprint of the core.
Most of the memory used by an instance of VLC (without any interface) comes from the module bank. This structure lists the properties of every module the current VLC can launch. There are currently 369 modules in the source tree. However, some of these modules depend on the architecture and the operating system, so most VLC instances have approximately 200 modules. For each module, a structure called module_t is created. This structure contains another structure called module_config_t.
Analysis of this structure
With Pahole, we can look at the memory used by one instance of the structure:
pahole --class_name=module_config_t src/modules/.libs/libvlccore_la-entry.o
struct module_config_t {
int i_type; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
char * psz_type; /* 8 8 */
char * psz_name; /* 16 8 */
char i_short; /* 24 1 */
/* XXX 7 bytes hole, try to pack */
char * psz_text; /* 32 8 */
char * psz_longtext; /* 40 8 */
module_value_t value; /* 48 8 */
module_value_t orig; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
module_value_t saved; /* 64 8 */
module_value_t min; /* 72 8 */
module_value_t max; /* 80 8 */
vlc_callback_t pf_callback; /* 88 8 */
void * p_callback_data; /* 96 8 */
char * * ppsz_list; /* 104 8 */
int * pi_list; /* 112 8 */
char * * ppsz_list_text; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
int i_list; /* 128 4 */
/* XXX 4 bytes hole, try to pack */
vlc_callback_t pf_update_list; /* 136 8 */
vlc_callback_t * ppf_action; /* 144 8 */
char * * ppsz_action_text; /* 152 8 */
int i_action; /* 160 4 */
_Bool b_dirty; /* 164 1 */
_Bool b_advanced; /* 165 1 */
_Bool b_internal; /* 166 1 */
_Bool b_restart; /* 167 1 */
char * psz_oldname; /* 168 8 */
_Bool b_removed; /* 176 1 */
_Bool b_autosave; /* 177 1 */
_Bool b_unsaveable; /* 178 1 */
_Bool b_safe; /* 179 1 */
/* size: 184, cachelines: 3 */
/* sum members: 165, holes: 3, sum holes: 15 */
/* padding: 4 */
/* last cacheline: 56 bytes */
};
Pahole shows that the memory used by the structure (184 bytes) is 19 bytes bigger than the sum of its elements (165 bytes): 15 bytes are lost in three holes and 4 bytes in padding at the end.
Saving some memory
It is now really easy to save some memory by repacking the structure. The goal is simple: try to fill the holes. For example, there are two 4-byte holes (just after i_type and i_list). If i_type and i_list are placed side by side, both holes disappear.
The manual packing was done some months ago in this commit. This change saved a few kilobytes of memory just by repacking one structure.
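To illustrate the idea, here is a minimal sketch using a reduced, hypothetical structure: config_before_t and config_after_t are not the real module_config_t, and this is not the actual VLC patch. It assumes 4-byte int and 8-byte pointers.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof, size_t */

/* Before: each int is followed by a pointer, as i_type and i_list are in
 * module_config_t, which leaves a 4-byte hole after each int. */
typedef struct
{
    int   i_type;
    char *psz_name;
    int   i_list;
    char *psz_text;
} config_before_t;

/* After: the two ints sit side by side and share one 8-byte slot. */
typedef struct
{
    int   i_type;
    int   i_list;
    char *psz_name;
    char *psz_text;
} config_after_t;

/* Reordering members must never make the structure bigger. */
static_assert(sizeof(config_after_t) <= sizeof(config_before_t),
              "reordering grew the structure");

/* Bytes saved per instance by the reordering. */
static size_t config_saved_bytes(void)
{
    return sizeof(config_before_t) - sizeof(config_after_t);
}
```

Under these assumptions the structure shrinks from 32 to 24 bytes, so config_saved_bytes() returns 8. The saving is multiplied by the number of instances, which is why it matters for structures that are allocated many times.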
Comments
Interesting post. Sounds like a lot of work for a few kilobytes though.
To be honest, it is harder and takes longer to explain than to apply. You just need to run Pahole on some important (often used) structures, move some variables, recompile and you are done. The patch was done in less than 10 minutes. For a few kilobytes of memory, that's worth it.
There are other ways to pack than reordering. You can also use smaller types when adequate, at least for integers. For instance, this whole pack of booleans could be turned into a set of bit fields ("unsigned b_xxx:1;").
But there are other considerations. Code readability can get really bad if one reorders large structures blindly. Also, if the structure occupies more than one cache line, "related" members should be kept together for optimal performance. For instance, you'd want a mutex on the same cache line as the data that it protects, or the table length on the same cache line as the table head pointer.
Unfortunately, those savings are dwarfed by the memory size of decoded pictures and of the input byte stream cache. Those and the Qt4 runtime make up most of the VLC anonymous memory allocations.
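The bit-field suggestion from the comment above can be sketched as follows. The structure names are hypothetical and only the eight boolean flags of module_config_t are reproduced. The exact layout of bit fields is implementation-defined, so the sizes mentioned are typical rather than guaranteed.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

/* The eight flags as separate booleans: one byte each on common ABIs. */
typedef struct
{
    bool b_dirty, b_advanced, b_internal, b_restart;
    bool b_removed, b_autosave, b_unsaveable, b_safe;
} flags_bool_t;

/* The same flags as 1-bit fields: they can share one storage unit. */
typedef struct
{
    unsigned b_dirty:1, b_advanced:1, b_internal:1, b_restart:1;
    unsigned b_removed:1, b_autosave:1, b_unsaveable:1, b_safe:1;
} flags_bits_t;

/* Assumption: holds with common compilers (typically 4 bytes against 8),
 * but the C standard does not guarantee it. */
static_assert(sizeof(flags_bits_t) <= sizeof(flags_bool_t),
              "bit fields are larger than plain booleans");
```

The trade-off is that you cannot take the address of a bit field, and every access needs a few extra masking instructions.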