I ran across a mention of Callgrind in a weblog the other day. “Hm,” I thought. “I should use that to profile Parrot.”
I have a little PIR program that prints “Hello, world!”. I use it for valgrinding Parrot. Profiling Parrot’s startup and shutdown time seemed useful:
$ valgrind --tool=callgrind -v parrot hi.pir
When you do this, run callgrind annnotate on the resulting output file to get a nice report of which functions did the most work. I saw:
20,301,112 PROGRAM TOTALS
3,034,654 ???:dl_new_hash [/lib/ld-2.5.so]
2,985,581 mmd.c:mmd_expand_y [/home/chromatic/dev/parrot/blib/lib/libparrot.so.0.4.12]
2,974,773 ???:do_lookup_x [/lib/ld-2.5.so]
2,652,803 ???:_dl_relocate_object [/lib/ld-2.5.so]
Here’s what happened when I dug into the code.
Hm. Parrot spent almost three million units (ticks? cycles?) in mmd_expand_y(). What is that function?
Parrot registers all subroutines that participate in multi-dispatch in a table. This function expands that table in the Y direction. That seemed like a costly operation, so I decided to poke around in the code to see if I could improve the performance somewhat. Here’s the code:
static void
mmd_expand_y(Interp *interp, INTVAL func_nr, INTVAL new_y)
{
funcptr_t *new_table;
UINTVAL x;
UINTVAL y;
UINTVAL i;
MMD_table *table = interp->binop_mmd_funcs + func_nr;
x = table->x;
assert(x);
y = table->y;
/* First, fill in the whole new table with the default function
pointer. We only really need to do the new part, but... */
new_table = (funcptr_t *)mem_sys_allocate(sizeof (funcptr_t) * x * new_y);
for (i = 0; i < x * new_y; i++) {
new_table[i] = NULL;
}
/* Then copy the old table over, if it existed in the first place. */
if (table->mmd_funcs) {
memcpy(new_table, table->mmd_funcs,
sizeof (funcptr_t) * x * y);
mem_sys_free(table->mmd_funcs);
}
table->y = new_y;
table->mmd_funcs = new_table;
}
Hmm, that resizing looks suspicious, and I don’t like that comment. I originally decided to loop only from y to y_new, as there’s no sense in filling in values that the memcpy() will overwrite anyway.
Then I thought again.
All of this code is a resizing operation. libc already has a perfectly good function for that, called realloc()! Now Parrot does have wrappers around memory-management functions for various good reasons, but could I replace all of this fun with a call to mem_sys_realloc()?
Here’s my revised code:
static void
mmd_expand_y(Interp *interp, INTVAL func_nr, INTVAL new_y)
{
funcptr_t *new_table;
UINTVAL new_size;
MMD_table *table = interp->binop_mmd_funcs + func_nr;
assert(table->x);
new_size = sizeof (funcptr_t) * table->x * new_y;
if (table->mmd_funcs)
table->mmd_funcs = mem_sys_realloc(table->mmd_funcs, new_size);
else
table->mmd_funcs = (funcptr_t *)mem_sys_allocate(new_size);
table->y = new_y;
}
It’s 11 lines shorter and uses fewer variables. It also passes all of Parrot’s test suite (which seems to run perhaps 10% faster). Yes, it’s not quite the same algorithm; it may be necessary to throw in a memset() call too, but I have a fair amount of confidence in our test coverage here. How does it look in callgrind?
17,311,763 PROGRAM TOTALS
3,034,315 ???:dl_new_hash [/lib/ld-2.5.so]
2,972,930 ???:do_lookup_x [/lib/ld-2.5.so]
2,652,803 ???:_dl_relocate_object [/lib/ld-2.5.so]
922,782 ???:_dl_lookup_symbol_x [/lib/ld-2.5.so]
533,822 ???:check_match.7793 [/lib/ld-2.5.so]
Wow! That simple change–pushing an algorithm into libc where, presumably, it’s much faster–removed nearly three million units of work from “Hello, world!”. That’s 14% less work for a program which only starts Parrot, prints a line, and ends, and all from removing eleven lines of code. The remaining big bottlenecks are in the dynamic linker. Hmm.
Not all wins are this nice, but I’m very pleased.


Your this article is very good
Doesn't this confirm Perl's doom? lwall would never have written code like this. If Parrot's written to this standard it'll take a lot of work to being it up to Perl 5 performance, no?
@Ralph, I'm not sure how to answer your questions. I do invite you to compare the code in almost any part of Parrot with the code in almost any part of Perl 5. One's under active development (and regular review of coding standards and clarity) and the other's in maintenance mode, where very few people want to disturb old code.
One also has had very little optimization and profiling work--because it's not yet finished.
@chromatic: The old parrot code you've replaced had poorer clarity than the new, and faster, code. It is longer with more hoops to jump through to understand what it is attempting, and the reader is further confused by having to wonder what they're missing. They must be missing something, they think, else why wouldn't the author had written it the new way.