I’ve spent several hours optimizing Parrot over the past few months. In particular, I’ve concentrated on the build process for Rakudo (Perl 6 on Parrot), as it exercises a lot of parts of Parrot. We don’t yet have accurate numbers on the improvements, but rough figures show that the parts of the build process I’ve optimized will be about twice as fast as they were three months ago, despite Rakudo having grown tremendously since then.
Some of this comes from luck, some comes from a deepening knowledge of Parrot internals, a lot of it comes thanks to Callgrind and KCacheGrind, and some of it is experience. My instincts are improving.
Despite some very deliberate differences between Parrot and the Perl 5 implementation, there are also some similarities. In particular, the default runcores of the two virtual machines are very similar. For every operation performed (that is, every logical operation expressed in Perl 5 or PIR source code), the default runcore dispatches to a C function which performs the operation and returns the next op.
The default Perl 5 runloop checks for pending signal delivery after each op, before looping again.
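To make the shape of that loop concrete, here is a minimal sketch of a dispatch loop which checks for a pending signal after each op. Every name in it (toy_op, toy_runops, toy_async_check and the counters) is made up for illustration; none of this is Perl's or Parrot's actual data structure, only the general pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy op: its function does the work and returns the
   next op to run, or NULL when the program is finished. */
typedef struct toy_op toy_op;
struct toy_op {
    toy_op *(*run)(toy_op *self);
    toy_op *next;
};

static int toy_ops_executed    = 0;
static int toy_pending_signal  = 0;  /* a real handler would set this */
static int toy_signals_handled = 0;

/* A trivial op body: count the execution and hand back the next op. */
static toy_op *toy_op_count(toy_op *self) {
    toy_ops_executed++;
    return self->next;
}

/* Stand-in for the per-op check: cheap when nothing is pending. */
static void toy_async_check(void) {
    if (toy_pending_signal) {
        toy_pending_signal = 0;
        toy_signals_handled++;
    }
}

/* The runloop: dispatch an op, then check for signals, then loop. */
static int toy_runops(toy_op *op) {
    assert(op != NULL);
    while ((op = op->run(op)) != NULL) {
        toy_async_check();
    }
    return 0;
}
```

Even when no signal ever arrives, the loop pays for one call and one test of a flag per op. That small, constant cost is what the rest of this experiment tries to measure.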
One of the best optimization strategies is to understand what happens frequently and what happens infrequently, and to try to make infrequent things mostly free — or at least cheap. Signal delivery in Perl 5 is pretty infrequent.
I reasoned, without profiling, that eliminating the check for signal delivery might provide a small performance improvement to Perl 5 code. You can’t eliminate signal delivery overall without removing features, but I had the notion that it’s possible to write code which runs when installing a signal handler (there are hooks for this) and replaces the default “don’t check signals” runcore with a runcore which does check signals. Perl Hacks demonstrates how to replace runloops (see also Runops::Trace), so it’s doable.
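The runloop swap could look something like the sketch below. It assumes the interpreter reaches its runloop through a function pointer, and every name here (swap_op, swap_current_runops, swap_install_signal_handler and so on) is hypothetical, not part of Perl's API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy op, as small as possible. */
typedef struct swap_op swap_op;
struct swap_op {
    swap_op *(*run)(swap_op *self);
    swap_op *next;
};

static int swap_checks = 0;  /* how many signal checks were performed */

static swap_op *swap_op_noop(swap_op *self) {
    return self->next;
}

/* Fast loop: no signal check at all. */
static int swap_runops_fast(swap_op *op) {
    assert(op != NULL);
    while ((op = op->run(op)) != NULL) {
        /* nothing to do between ops */
    }
    return 0;
}

/* Checked loop: performs a (counted) signal check after each op. */
static int swap_runops_checked(swap_op *op) {
    assert(op != NULL);
    while ((op = op->run(op)) != NULL) {
        swap_checks++;
    }
    return 0;
}

/* The interpreter would always enter the loop through this pointer. */
static int (*swap_current_runops)(swap_op *) = swap_runops_fast;

/* Hook run when user code installs a signal handler: from now on,
   newly entered runloops use the checking variant. */
static void swap_install_signal_handler(void) {
    swap_current_runops = swap_runops_checked;
}
```

One catch is visible even in the sketch: changing the pointer only affects loops entered afterwards. A loop that is already running keeps going without checks, so a real implementation would need some way to leave and re-enter the loop. That is part of the work of figuring out runloop swapping.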
I updated my copy of bleadperl, built a fresh version, and then looked for some long-running code to profile. t/op/pack.t looked likely — it’s the second largest test file of a Perl operator, and it’s not heavily tied to regex performance as is the largest test file. I ran it through callgrind.
30 seconds later, I had some performance data (amended):
Profiled target: ./perl -Ilib t/op/pack.t (PID 4285, part 1)
770,891,351 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
38,060,996 ???:Perl_sv_setsv_flags [/home/chromatic/dev/bleadperl/perl]
30,148,727 hooks.c:mem2chunk_check [/usr/lib/debug/libc-2.6.1.so]
28,032,914 ???:Perl_sv_upgrade [/home/chromatic/dev/bleadperl/perl]
25,980,923 malloc.c:_int_malloc [/usr/lib/debug/libc-2.6.1.so]
21,486,973 memmove.c:memmove [/usr/lib/debug/libc-2.6.1.so]
20,980,121 ???:Perl_runops_standard [/home/chromatic/dev/bleadperl/perl]
I edited Perl_runops_standard in run.c, removing the PERL_ASYNC_CHECK(); line:
    int
    Perl_runops_standard(pTHX)
    {
        dVAR;
        while ((PL_op = CALL_FPTR(PL_op->op_ppaddr)(aTHX))) {
            PERL_ASYNC_CHECK();  /* the signal check removed for this experiment */
        }
        TAINT_NOT;
        return 0;
    }
… rebuilt, and re-ran the benchmark:
Profiled target: ./perl -Ilib t/op/pack.t (PID 4367, part 1)
762,975,114 PROGRAM TOTALS
38,060,996 ???:Perl_sv_setsv_flags [/home/chromatic/dev/bleadperl/perl]
30,148,727 hooks.c:mem2chunk_check [/usr/lib/debug/libc-2.6.1.so]
28,032,914 ???:Perl_sv_upgrade [/home/chromatic/dev/bleadperl/perl]
25,980,923 malloc.c:_int_malloc [/usr/lib/debug/libc-2.6.1.so]
21,486,973 memmove.c:memmove [/usr/lib/debug/libc-2.6.1.so]
...
13,114,584 ???:Perl_runops_standard [/home/chromatic/dev/bleadperl/perl]
That’s a 1.03% reduction in total instructions executed, which is almost statistical noise. At five or six percent, it might have been worth considering. It looks like this optimization isn’t worth the work it would take to figure out runloop swapping.
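For anyone checking the arithmetic, the figure comes straight from the two PROGRAM TOTALS lines. The helper below is only an illustration; percent_saved is a made-up name.

```c
#include <assert.h>

/* Percentage of instructions saved, given callgrind Ir counts from
   before and after a change. */
static double percent_saved(long long before, long long after) {
    assert(before > 0);
    return 100.0 * (double)(before - after) / (double)before;
}
```

Calling percent_saved(770891351LL, 762975114LL) gives about 1.03 for the whole program, while percent_saved(20980121LL, 13114584LL) gives roughly 37 for Perl_runops_standard alone. The function itself got much cheaper, but it was less than three percent of the total to begin with.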
It’s not a bad investment of half an hour, but performance improvements in Perl will have to come from somewhere else. In particular, the 425-line monster Perl_sv_setsv_flags looks likely….

On a possibly unrelated note, have you ever used Coverity? I don't know if it would help with optimizations or just security.
http://www.coverity.com/
You wrote: "One of the best optimization strategies is to understand what happens frequently and what happens infrequently, and to try to make infrequent things mostly free — or at least cheap. Signal delivery in Perl 5 is pretty infrequent."
This sounds exactly backwards to me. Making the frequent things mostly free or at least cheap would be where the big rewards lie.
I think Mr. Jones has misread the strategy in this context.
It's not a matter of paying closer attention to code which runs infrequently, but to code that runs often yet is actually needed infrequently. What the author meant is that when you're not making use of a feature, it should use as few cycles as possible.
The improvement in question was to skip the check for signals awaiting delivery to user code unless a signal handler is present. Currently, the check happens even if there's no user-installed signal handler to receive signals.