I ran across a mention of Callgrind in a weblog the other day. “Hm,” I thought. “I should use that to profile Parrot.”

I have a little PIR program that prints “Hello, world!”. I use it for valgrinding Parrot. Profiling Parrot’s startup and shutdown time seemed useful:

$ valgrind --tool=callgrind -v parrot hi.pir

When you do this, run callgrind annnotate on the resulting output file to get a nice report of which functions did the most work. I saw:

20,301,112  PROGRAM TOTALS

3,034,654  ???:dl_new_hash [/lib/ld-2.5.so]
2,985,581  mmd.c:mmd_expand_y [/home/chromatic/dev/parrot/blib/lib/libparrot.so.0.4.12]
2,974,773  ???:do_lookup_x [/lib/ld-2.5.so]
2,652,803  ???:_dl_relocate_object [/lib/ld-2.5.so]

Here’s what happened when I dug into the code.

Hm. Parrot spent almost three million units (ticks? cycles?) in mmd_expand_y(). What is that function?

Parrot registers all subroutines that participate in multi-dispatch in a table. This function expands that table in the Y direction. That seemed like a costly operation, so I decided to poke around in the code to see if I could improve the performance somewhat. Here’s the code:

static void
mmd_expand_y(Interp *interp, INTVAL func_nr, INTVAL new_y)
{
    funcptr_t *new_table;
    UINTVAL x;
    UINTVAL y;
    UINTVAL i;
    MMD_table *table = interp->binop_mmd_funcs + func_nr;

    x = table->x;
    assert(x);
    y = table->y;

    /* First, fill in the whole new table with the default function
       pointer. We only really need to do the new part, but... */
    new_table = (funcptr_t *)mem_sys_allocate(sizeof (funcptr_t) * x * new_y);
    for (i = 0; i < x * new_y; i++) {
        new_table[i] = NULL;
    }

    /* Then copy the old table over, if it existed in the first place. */
    if (table->mmd_funcs) {
        memcpy(new_table, table->mmd_funcs,
               sizeof (funcptr_t) * x * y);
        mem_sys_free(table->mmd_funcs);
    }
    table->y = new_y;
    table->mmd_funcs = new_table;
}

Hmm, that resizing looks suspicious, and I don’t like that comment. I originally decided to loop only from y to y_new, as there’s no sense in filling in values that the memcpy() will overwrite anyway.

Then I thought again.

All of this code is a resizing operation. libc already has a perfectly good function for that, called realloc()! Now Parrot does have wrappers around memory-management functions for various good reasons, but could I replace all of this fun with a call to mem_sys_realloc()?

Here’s my revised code:

static void
mmd_expand_y(Interp *interp, INTVAL func_nr, INTVAL new_y)
{
    funcptr_t *new_table;
    UINTVAL    new_size;
    MMD_table *table = interp->binop_mmd_funcs + func_nr;

    assert(table->x);

    new_size = sizeof (funcptr_t) * table->x * new_y;

    if (table->mmd_funcs)
        table->mmd_funcs = mem_sys_realloc(table->mmd_funcs, new_size);
    else
        table->mmd_funcs = (funcptr_t *)mem_sys_allocate(new_size);

    table->y = new_y; 
}

It’s 11 lines shorter and uses fewer variables. It also passes all of Parrot’s test suite (which seems to run perhaps 10% faster). Yes, it’s not quite the same algorithm; it may be necessary to throw in a memset() call too, but I have a fair amount of confidence in our test coverage here. How does it look in callgrind?

17,311,763  PROGRAM TOTALS

3,034,315  ???:dl_new_hash [/lib/ld-2.5.so]
2,972,930  ???:do_lookup_x [/lib/ld-2.5.so]
2,652,803  ???:_dl_relocate_object [/lib/ld-2.5.so]
  922,782  ???:_dl_lookup_symbol_x [/lib/ld-2.5.so]
  533,822  ???:check_match.7793 [/lib/ld-2.5.so]

Wow! That simple change–pushing an algorithm into libc where, presumably, it’s much faster–removed nearly three million units of work from “Hello, world!”. That’s 14% less work for a program which only starts Parrot, prints a line, and ends, and all from removing eleven lines of code. The remaining big bottlenecks are in the dynamic linker. Hmm.

Not all wins are this nice, but I’m very pleased.