Memory problems can be difficult to find and fix. I’m a huge fan of Valgrind, but it only works at the C level. The Parrot virtual machine allocates memory with malloc and releases it with free in a few places, and Valgrind is indispensable for making sure that those match.

Unfortunately for Valgrind, Parrot provides garbage collection, and two of its fundamental data structures rely on the correctness of the garbage collector. Some of the weirdest, and most difficult bugs to solve within Parrot are bugs in our implementation of garbage collection, usually where active objects get collected too soon.

Memory problems are often very unportable. Not only can one program demonstrate a problem where hundreds of other programs run without problems, but one operating system differs from another. Perhaps 64-bit platforms work just fine, but 32-bit platforms have a nasty segfault. Maybe the particulars of how you compiled the program may change memory layout sufficiently to avoid the problem.

Debugging GC-related problems reported on that one platform you don’t have access to is time-consuming and difficult. I discovered another way recently. Now reproducing GC problems in Parrot is possible across platforms, making them much easier to diagnose–and once you can diagnose a problem, you’ve done most of the work to fix them.

An Aside on Parrot’s GC

Parrot’s default garbage collector uses a mark-and-sweep algorithm. There are two types of objects in Parrot that rely on the garbage collector: strings and PMCs. Parrot creates large pools of memory which hold many string and PMC structures. Whenever code needs a new PMC or a string, Parrot grabs the next free object out of a pool, rather than calling malloc to ask the operating system to find enough free memory.

This has several nice effects, such as reducing memory fragmentation and obviating the need for potentially costly malloc calls.

Of course, Parrot also has to be able to return an unused string or PMC to the pool of free objects–and it has to be able to detect when a string or PMC is unused.

The mark-and-sweep algorithm has two parts. First, Parrot starts from a root set of objects it knows will always be live. They’re PMCs used internally to keep track of interpreter-specific data. Parrot sets a flag on each element of the root set, to use later, and calls a mark() method if one exists. (Strings do not have it; PMCs may, depending on the type of the PMC.) Any PMC which contains other strings or PMCs needs to mark those as live as well. In this way, Parrot walks a graph of all live objects in the system. At the end of this process, all objects reachable through the graph have their live flags set.

The sweep phase walks through the memory pools which contain all of the PMC and string structures and adds everything not marked as live to a linked-list of free objects for later use.

This is not necessarily the best GC algorithm, but there are a lot of potential improvements which start from this base, and it works pretty well for software under development. (With that said, Parrot’s GC is pluggable, and there’s a generational system in progress which could use the attention of someone with both knowledge and spare time to contribute.)

One obvious point of trouble in this system is that any code which relies on a PMC or string being live needs to be able to mark that object as live. Failing to mark contained objects is the source of many bugs in these systems.

Improving Predictability

One of the biggest problems affecting GC-debugging is that GC tends to be unpredictable. It’s not always easy to tell when GC will run just by examining a program, especially across multiple platforms. I’d recently diagnosed a problem in a very short test case provided by Jerry Gay by forcing a GC run right after a certain operation in a PIR program. By tracing the code of that operation, I found a spot where Parrot wasn’t marking an object as live appropriately. After I made that change, the object stuck around on all platforms even after forced GC runs.

Because creating a new object, using an object, freeing an object can all happen so far away, figuring out an object’s correct lifecycle is very important. I’ve found that forcing GC-related bugs to occur as soon as possible helps me track down why they happen.

A Runloop for GC Debugging

Any true computing machine executes a sequence of instructions. Like Perl 5 (and I suspect Python, the JVM, the CLR, your CPU, et cetera), Parrot has a central runloop which fetches the next instruction and dispatches to it:

    runops_slow_core(PARROT_INTERP, NOTNULL(opcode_t *pc))
    {
        while (pc) {
            if (pc < code_start || pc >= code_end)
                real_exception(interp, NULL, 1,
                    "attempt to access code outside of current code segment");

            CONTEXT(interp->ctx)->current_pc = pc;

            DO_OP(pc, interp);
        }

        return pc;
    }

In this code, pc (the Program Counter) points to the current opcode in the bytecode. Most of the code is sanity checking. The important line is DO_OP, which is a macro that executes the appropriate function for that opcode. This code is just a loop that executes functions corresponding to opcodes. Again, there are smarter runloops (and Parrot supports several), but this is a good example.

In the interest of predictability, I decided that the best way to make GC-related bugs immediately obvious was to force a full mark-and-sweep run before every opcode dispatch. The GC debugging runcore resembles:

    runops_gc_debug_core(PARROT_INTERP, NOTNULL(opcode_t *pc))
    {
        while (pc) {
            if (pc < code_start || pc >= code_end)
                real_exception(interp, NULL, 1,
                    "attempt to access code outside of current code segment");

            Parrot_do_dod_run(interp, 0);
            CONTEXT(interp->ctx)->current_pc = pc;

            DO_OP(pc, interp);
        }

        return pc;
    }

There’s one change. In theory, the only effect this should have is slowing down programs tremendously. In practice, it causes all sorts of wonderful failures, especially segfaults. This is good–segfaults are debuggable, especially when they occur repeatedly and reliably in small test cases, such as our PMC test suite.

Using the New Runloop to Find Problems

Running the tests for the Sub PMC through the GC debugging runcore reveals two problems:

    $ TEST_PROG_ARGS="--runcore=gcdebug" prove t/pmc/sub.t
    t/pmc/sub....ok 38/61
    #     Failed test (t/pmc/sub.t at line 1262)
    # Exited with error code: 139
    # Received:
    # Segmentation fault (core dumped)
    #
    # Expected:
    # "VAR1" => PMC 'PGE::Match' => "aabbb" @ 3
    #
    t/pmc/sub....NOK 55/61
    #     Failed test (t/pmc/sub.t at line 1278)
    # Exited with error code: 139
    # Received:
    # Segmentation fault (core dumped)
    #
    # Expected:
    # "VAR1" => PMC 'PGE::Match' => "aabbb" @ 3
    #
    t/pmc/sub....ok 60/61# Looks like you failed 2 tests of 61.
    t/pmc/sub....dubious
            Test returned status 2 (wstat 512, 0x200)
    DIED. FAILED tests 55-56
            Failed 2/61 tests, 96.72% okay
    Failed Test Stat Wstat Total Fail  List of Failed
    -------------------------------------------------
    t/pmc/sub.t    2   512    61    2  55-56
    Failed 1/1 test scripts. 2/61 subtests failed.
    Files=1, Tests=61,  4 wallclock secs ( 2.47 cusr +  0.66 csys =  3.13 CPU)
    Failed 1/1 test programs. 2/61 subtests failed.

The Parrot command-line option to use a different runcore is --runcore=gcdebug is the name of my new runcore; it’s available from the Subversion trunk right now and will be available the stable release of Parrot 0.5.0 on 20 November. Set the environment variable TEST_PROG_ARGS to contain any CLI options you want to pass to Parrot if you don’t invoke it directly on a PIR, PASM, or PBC file.

The second test failure isn’t terribly informative (but it works with the default runcore, which is a sign that there’s a GC problem somewhere waiting to happen), but the segfault in the first failing test is a great spot to start debugging. I’ll walk through the debugging strategy with gdb tomorrow.