As I mentioned in Debugging GC Problems in Parrot, being able to find a failure as soon as possible helps with debugging. With a new runcore for GC debugging, we’ve found several memory problems and fixed most of them. (Some require rethinking certain design decisions.)

Here’s the process to finding and fixing bugs of this type.

Sub 55

I ended the previous posting by showing that one of the test cases in t/pmc/sub.t failed under the GC debugging runcore. Here it is run on its own:

    $ ./parrot --runcore=gcdebug t/pmc/sub_55.pir
    Segmentation fault (core dumped)

Parrot should never segfault on invalid input, so this is clearly a problem worth investigating. I launched gdb:

    $ gdb ./parrot
    (gdb) run --runcore=gcdebug t/pmc/sub_55.pir

    ...

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread -1213212480 (LWP 14554)]
    0xb7c703bb in start_flatten (interp=0x804e008, st=0xbf8c5410, p_arg=0x81c5bf0)
        at src/inter_call.c:307
    307             if (!VTABLE_does(interp, p_arg, CONST_STRING(interp, "hash"))) {
    (gdb)

From experience, I know that the problem is that p_arg is a PMC with a missing vtable. The VTABLE_does macro looks in an array of function pointers attached to the PMC (this is the vtable), finds the function pointer for the does entry, and calls it with the remaining arguments. If any of those pointers are invalid, Parrot will crash with a segfault. Here’s how I know that this is a problem:

  (gdb) p p_arg
  $1 = (PMC *) 0x81c5bf0
  (gdb) p p_arg->vtable
  $2 = (VTABLE *) 0xdeadbeef

The p command displays information about the variable or expression. I know that the vtable pointer should only be 0xdeadbeef when the garbage collector has put a PMC on the free list; Parrot_dod_free_pmc() in src/gc/dod.c sets the PMC’s pointers to 0xdeadbeef explicitly. The problem is probably that this PMC gets freed inappropriately, when something still holds onto it. That something isn’t marking the PMC as live, during the mark phase of mark-and-sweep.

Usually fixing these problems means finding the code that creates the doomed PMC and figuring out where it ends up, but sometimes looking at the backtrace leading up to the segfault is also useful:

    (gdb) bt
    #0  0xb7c703bb in start_flatten (interp=0x804e008, st=0xbf8c5410,
        p_arg=0x81c5bf0) at src/inter_call.c:307
    #1  0xb7c7096c in fetch_arg_op (interp=0x804e008, st=0xbf8c5410)
        at src/inter_call.c:429
    #2  0xb7c70c92 in Parrot_fetch_arg (interp=0x804e008, st=0xbf8c5410)
        at src/inter_call.c:501

In this case, the immediately preceding call frame looks worth investigating, because p_arg doesn’t appear in any other calls. Thus fetch_arg_op() must access it somehow to pass to start_flatten(). (If you’re curious, this code is the part of Parrot which handles Parrot’s calling conventions between Parrot subroutines, methods, and continuations. It’s somewhat complex because Parrot supports named, positional, optional, and default parameters, as well as autoboxing, multi-dispatch, and CPS.)

Moving up the call stack and examining the code can be instructive:

    (gdb) up
    #1  0xb7c7096c in fetch_arg_op (interp=0x804e008, st=0xbf8c5410)
        at src/inter_call.c:429
    429                     start_flatten(interp, st, UVal_pmc(st->val));
    (gdb) l
    424                     UVal_pmc(st->val) = constant ? st->src.ctx->constants[idx]->u.key
    425                                           : CTX_REG_PMC(st->src.ctx, idx);
    426
    427                 if (st->src.sig & PARROT_ARG_FLATTEN) {
    428                     int retval;
    429                     start_flatten(interp, st, UVal_pmc(st->val));
    430
    431                     /* if the :flat arg is empty, just go to the next arg */
    432                     if (!st->src.slurp_n) {
    433                         st->src.mode &= ~CALL_STATE_FLATTEN;

up and down move up and down the call stack. Pass a number to jump directly to that call frame. l displays the surrounding source code of the current point.

Line 429 is most relevant here; it’s the call to start_flatten(). st is a structure representing the current call state. UVal_pmc accesses a PMC pointer stored in a C union. It may be the case that if those call state structures get marked for GC they need to mark any PMCs they contain. I don’t know. It’s also not clear what type of PMC this is. Knowing that might be useful to figure out what’s wrong.

I opened src/inter_call.c in Vim and searched for all uses of UVal_pmc, looking for assignments. Then I realized that the PMC might already be invalid at the point of assignment, so I need to figure out what creates it and where it gets stored.

Tracking PMC Creation

All PMC creation code is in src/pmc.c. Most PMCs get created by calls to pmc_new(), though some use pmc_new_noinit(). Both functions call into the GC to get a new free PMC object, so it’s possible to set a breakpoint there. I’m lazy and setting a breakpoint on pmc_new() usually works.

Parrot creates a lot of PMCs, so I prefer to use conditional breakpoints. This doesn’t always find the offending PMC right away, but it narrows down the problem considerably. p p_arg told me that the PMC’s memory address is 0×81c5bf0, so:

    (gdb) l pmc_new
    59      PARROT_API
    60      PARROT_CANNOT_RETURN_NULL
    61      PARROT_MALLOC
    62      PMC *
    63      pmc_new(PARROT_INTERP, INTVAL base_type)
    64      {
    65          PMC *pmc;
    66          PMC *const classobj = interp->vtables[base_type]->pmc_class;
    67          if (!PMC_IS_NULL(classobj) && PObj_is_class_TEST(classobj))
    68              pmc = VTABLE_instantiate(interp, classobj, PMCNULL);
    (gdb)
    69          else {
    70              pmc = get_new_pmc_header(interp, base_type, 0);
    71              VTABLE_init(interp, pmc);
    72          }
    73          return pmc;
    74      }
    75
    76      /*
    77
    (gdb) break 73 if pmc == 0x81c5bf0
    Breakpoint 1 at 0xb7ca43d3: file src/pmc.c, line 73.
    (gdb) run

Note that all line numbers are valid as of Parrot revision 22628 on the trunk.

The breakpoint stopped execution five times, then the segfault occurred. I ran the program again, using c to continue, until the fifth occurrence.

    Breakpoint 1, pmc_new (interp=0x804e008, base_type=42) at src/pmc.c:73
    73          return pmc;
    (gdb) bt
    #0  pmc_new (interp=0x804e008, base_type=42) at src/pmc.c:73
    #1  0xb7d273d4 in init_first_dest_named (interp=0x804e008, st=0xbf937c80)
        at src/inter_call.c:701
    #2  0xb7d280e3 in Parrot_process_args (interp=0x804e008, st=0xbf937c80,
        param_or_result=PARROT_PASS_PARAMS) at src/inter_call.c:1071
    #3  0xb7d283df in parrot_pass_args (interp=0x804e008, src_ctx=0x8216fe0,
        dest_ctx=0x82ab878, src_indexes=0x822624c, dest_indexes=0xb6b6aa1c,
        param_or_result=PARROT_PASS_PARAMS) at src/inter_call.c:1189
    #4  0xb7cb94f6 in Parrot_get_params_pc (cur_opcode=0xb6b6aa1c,
        interp=0x804e008) at src/ops/core.ops:543
    #5  0xb7d5b3db in runops_gc_debug_core (interp=0x804e008, pc=0xb6b6aa1c)
        at src/runops_cores.c:231

There’s one important piece of information that I didn’t have last time: the type of the PMC is 42. include/parrot/core_pmcs.h reveals that this value corresponds to the enumeration enum_class_Hash, which represents the Hash PMC. The backtrace shows that the PMC gets created in init_first_dest_named() as part of the argument passing process, which makes me a little bit suspicious.

The GC debugging runcore only runs the garbage collector between Parrot operations. It’s unlikely that Parrot will pause during argument passing to run the garbage collector, then resume, so I’m curious where this PMC will end up:

    (gdb) up
    #1  0xb7d273d4 in init_first_dest_named (interp=0x804e008, st=0xbf937c80)
        at src/inter_call.c:701
    701                 st->dest.slurp = pmc_new(interp, Parrot_get_ctx_HLL_type(interp, enum_class_Hash));
    (gdb) l
    696             else if (sig & PARROT_ARG_SLURPY_ARRAY) {
    697                 int idx;
    698
    699                 /* Create PMC for slurpy mode and register it; we must do this
    700                  * otherwise it may get collected. */
    701                 st->dest.slurp = pmc_new(interp, Parrot_get_ctx_HLL_type(interp, enum_class_Hash));
    702                 dod_register_pmc(interp, st->dest.slurp);
    703
    704                 /* pass the slurpy hash */
    705                 idx = st->dest.u.op.pc[i];
    706                 CTX_REG_PMC(st->dest.ctx, idx) = st->dest.slurp;

Line 701 is the key here, as it creates the Hash and assigns it to the call state. However, line 702 confuses me. dod_register_pmc() tells Parrot to pin a PMC so that it is always live during the mark phase. Something must unregister this PMC later on, and it occurs higher up the call stack in parrot_pass_args() on line 1193. I remember intending to write something like this code at some point.

This probably isn’t the culprit, as the lifecycle of this object looks correct to me. The Hash must get assigned elsewhere sometime while it’s valid, and that destination may not mark it correctly as live. That assignment occurs on line 706 (the line after my list command, but that’s okay).

The CTX_REG_PMC macro accesses a PMC register in a Parrot context. Contexts represent what would normally be call frames in a stack-based language. As Parrot is a register-based virtual machine that uses continuation-passing style for flow control, a context contains a variable number of the four types of registers as well as a pointer to the calling context and a pointer to the return context. Think of them as the equivalent to call frames and you’ll be fine.

I know that contexts get marked (mark_context() in src/sub.c because I patched that code a couple of days ago (but didn’t check in my change, as I wondered if it would cause GC problems). The next debugging step was to track down what happened to this destination context. The destination context is the arguments as the invoked entity will receive them.

Up one more level, Parrot_process_args() grabs the destination context out of the call state (line 957), but the rest of the functions don’t store it anywhere. Something beneath this point in the stack must assign the ctx pointer of the call state’s dest struct somewhere.

An interesting experiment is to set a watchpoint in GDB on this structure. A read watchpoint seems most appropriate. This doesn’t always work, because execution might leave the context of the expression’s lifetime, but it may work here. I moved up in the call stack to parrot_pass_args, set a watchpoint, and then continued:

    (gdb) up
    #3  0xb7c753df in parrot_pass_args (interp=0x804e008, src_ctx=0x8216fe0,
        dest_ctx=0x82ab878, src_indexes=0x822624c, dest_indexes=0xb6ab7a1c,
        param_or_result=PARROT_PASS_PARAMS) at src/inter_call.c:1189
    1189        Parrot_process_args(interp, &st, param_or_result);
    (gdb) rwatch st->dest->ctx
    Hardware read watchpoint 2: st->dest->ctx
    (gdb) c
    Continuing.
    Hardware read watchpoint 2: st->dest->ctx

    Value = (parrot_context_t *) 0x82ab878
    0xb7c7440e in init_first_dest_named (interp=0x804e008, st=0xbfb7dec0)
        at src/inter_call.c:706
    706                 CTX_REG_PMC(st->dest.ctx, idx) = st->dest.slurp;
    (gdb) c
    Continuing.

    Watchpoint 2 deleted because the program has left the block in
    which its expression is valid.
    0xb7c064f6 in Parrot_get_params_pc (cur_opcode=0xb6ab7a1c, interp=0x804e008)
        at src/ops/core.ops:543
    543         parrot_pass_args(interp, caller_ctx, ctx, src_indexes, dst_indexes, PARROT_PASS_PARAMS);
    (gdb)

That didn’t work. (That error is the same one I usually get.) What could be keeping hold of this PMC but not marking it? Looking back in the backtrace, I checked the signature of parrot_pass_args() once again:

    (gdb) l parrot_pass_args
    1164            NOTNULL(parrot_context_t *src_ctx),
    1165            NOTNULL(parrot_context_t *dest_ctx),
    1166            NOTNULL(opcode_t *src_indexes),
    1167            NOTNULL(opcode_t *dest_indexes),
    1168            arg_pass_t param_or_result)

Then at Parrot_init_arg_indexes_and_sig_pmc(), called once each for the source and destination contexts within parrot_pass_args(). Specifically, line 244 is:

    sti->ctx = ctx;

… called from:

    Parrot_init_arg_indexes_and_sig_pmc(interp, dest_ctx, dest_indexes,
        dest_signature, &st.dest);

… at line 1188 in parrot_pass_args(). That is, set st->dest to the destination context. Whatever that context is, it comes from the call frame immediately preceding this one, the get_params opcode in src/ops/core.ops:

    interp->current_params = _this;
    ctx                    = CONTEXT(interp->ctx);
    ccont                  = ctx->current_cont;

    caller_ctx  = ctx->caller_ctx;

    src_indexes = interp->current_args;
    dst_indexes = interp->current_params;

    /* the args and params are now 'used.' */
    interp->current_args   = NULL;
    interp->current_params = NULL;

    parrot_pass_args(interp, caller_ctx, ctx, src_indexes,
        dst_indexes, PARROT_PASS_PARAMS);

Thus if there’s a GC problem with the slurpy PMC not getting marked, it’s because the current context is not marking its caller context as live, which is exactly what my experimental patch removed. That’s a pretty clear confirmation that the patch was wrong, and that I shouldn’t commit it. Moving back to src/sub.c, I restored two lines into mark_context() at line 66:

    if (ctx->caller_ctx)
        mark_context(interp, ctx->caller_ctx);

From there, I rebuilt Parrot (make parrot), disabled the breakpoint in gdb (disable 1), and ran the program again:

    (gdb) run
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y
    `/home/chromatic/dev/parrot/parrot' has changed; re-reading symbols.
    Error in re-setting breakpoint 1:
    No source file named src/pmc.c.

    Starting program: /home/chromatic/dev/parrot/parrot --runcore=gcdebug t/pmc/sub_55.pir
    Error in re-setting breakpoint 1:
    No source file named src/pmc.c.
    Error in re-setting breakpoint 1:
    No source file named src/pmc.c.
    Breakpoint 1 at 0xb7ce83d3: file src/pmc.c, line 73.
    [Thread debugging using libthread_db enabled]
    [New Thread -1212933952 (LWP 18828)]
    [New Thread -1213072496 (LWP 18829)]
    [New Thread -1221465200 (LWP 18830)]
    "VAR1" => PMC 'PGE::Match' => "aabbb" @ 3

    Program exited normally.

I’m a little bit disappointed that my patch didn’t work; it cut out a lot of duplicate marking. Because Parrot allows native access to continuations from all languages running on Parrot (if they choose to support it, anyway), contexts form a call graph, and repeatedly marking the same context is possible when traversing the graph during the mark phase. It would be nice to be able to identify contexts marked during the current mark phase so as to avoid re-marking them here. A little bit of caching could save us a tremendous amount of work in long-lived programs with complex graphs. The solution is more than just adding a live flag to the context structure, as those flags need to get cleared during the sweep. (It’s easy to clear live flags when sweeping pools. Walk the pool and test each object for the live flag. If the object is live, clear the flag. Otherwise, put the object back on the free list.)

My disappointment runs shallow, however. When I first tested this patch, I ran the full test suite and nothing appeared extraordinary. I ran the PMC tests (t/pmc/*.t) under the GC debugging core and started to see unexpected failures. Even though the patch turned out to be faulty, the new runcore helped me see and diagnose the problem without exposing the problem to language implementors on other platforms. I had hoped that the problem would be elsewhere, but it’s clear that there’s a better solution elsewhere.

Preventing developers from checking in bugs is a tremendous benefit, and I learned something important in the process. That’s definitely progress.