As I mentioned in Debugging GC Problems in Parrot, being able to find a failure as soon as possible helps with debugging. With a new runcore for GC debugging, we’ve found several memory problems and fixed most of them. (Some require rethinking certain design decisions.)
Here’s the process to finding and fixing bugs of this type.
Sub 55
I ended the previous posting by showing that one of the test cases in t/pmc/sub.t failed under the GC debugging runcore. Here it is run on its own:
$ ./parrot --runcore=gcdebug t/pmc/sub_55.pir
Segmentation fault (core dumped)
Parrot should never segfault on invalid input, so this is clearly a problem worth investigating. I launched gdb:
$ gdb ./parrot
(gdb) run --runcore=gcdebug t/pmc/sub_55.pir
...
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1213212480 (LWP 14554)]
0xb7c703bb in start_flatten (interp=0x804e008, st=0xbf8c5410, p_arg=0x81c5bf0)
at src/inter_call.c:307
307 if (!VTABLE_does(interp, p_arg, CONST_STRING(interp, "hash"))) {
(gdb)
From experience, I know that the problem is that p_arg is a
PMC with a missing vtable. The VTABLE_does macro looks in an
array of function pointers attached to the PMC (this is the vtable), finds
the function pointer for the does entry, and calls it with the
remaining arguments. If any of those pointers are invalid, Parrot will
crash with a segfault. Here’s how I know that this is a problem:
(gdb) p p_arg
$1 = (PMC *) 0x81c5bf0
(gdb) p p_arg->vtable
$2 = (VTABLE *) 0xdeadbeef
The p command displays information about the variable or
expression. I know that the vtable pointer should only be
0xdeadbeef when the garbage collector has put a PMC on the
free list; Parrot_dod_free_pmc() in src/gc/dod.c sets
the PMC’s pointers to 0xdeadbeef explicitly. The problem is probably that
this PMC gets freed inappropriately, when something still holds onto it.
That something isn’t marking the PMC as live, during the mark phase of
mark-and-sweep.
Usually fixing these problems means finding the code that creates the doomed PMC and figuring out where it ends up, but sometimes looking at the backtrace leading up to the segfault is also useful:
(gdb) bt
#0 0xb7c703bb in start_flatten (interp=0x804e008, st=0xbf8c5410,
p_arg=0x81c5bf0) at src/inter_call.c:307
#1 0xb7c7096c in fetch_arg_op (interp=0x804e008, st=0xbf8c5410)
at src/inter_call.c:429
#2 0xb7c70c92 in Parrot_fetch_arg (interp=0x804e008, st=0xbf8c5410)
at src/inter_call.c:501
In this case, the immediately preceding call frame looks worth
investigating, because p_arg doesn’t appear in any other
calls. Thus fetch_arg_op() must access it somehow to pass to
start_flatten(). (If you’re curious, this code is the part of
Parrot which handles Parrot’s calling conventions between Parrot
subroutines, methods, and continuations. It’s somewhat complex
because Parrot supports named, positional, optional, and default
parameters, as well as autoboxing, multi-dispatch, and CPS.)
Moving up the call stack and examining the code can be instructive:
(gdb) up
#1 0xb7c7096c in fetch_arg_op (interp=0x804e008, st=0xbf8c5410)
at src/inter_call.c:429
429 start_flatten(interp, st, UVal_pmc(st->val));
(gdb) l
424 UVal_pmc(st->val) = constant ? st->src.ctx->constants[idx]->u.key
425 : CTX_REG_PMC(st->src.ctx, idx);
426
427 if (st->src.sig & PARROT_ARG_FLATTEN) {
428 int retval;
429 start_flatten(interp, st, UVal_pmc(st->val));
430
431 /* if the :flat arg is empty, just go to the next arg */
432 if (!st->src.slurp_n) {
433 st->src.mode &= ~CALL_STATE_FLATTEN;
up and down move up and down the call stack.
Pass a number to jump directly to that call frame. l displays
the surrounding source code of the current point.
Line 429 is most relevant here; it’s the call to
start_flatten(). st is a structure representing
the current call state. UVal_pmc accesses a PMC pointer stored
in a C union. It may be the case that if those call state structures get
marked for GC they need to mark any PMCs they contain. I don’t know. It’s
also not clear what type of PMC this is. Knowing that might be useful to
figure out what’s wrong.
I opened src/inter_call.c in Vim and searched for all uses of
UVal_pmc, looking for assignments. Then I realized that the
PMC might already be invalid at the point of assignment, so I need to
figure out what creates it and where it gets stored.
Tracking PMC Creation
All PMC creation code is in src/pmc.c. Most PMCs get created by
calls to pmc_new(), though some use
pmc_new_noinit(). Both functions call into the GC to get a new
free PMC object, so it’s possible to set a breakpoint there. I’m lazy and
setting a breakpoint on pmc_new() usually works.
Parrot creates a lot of PMCs, so I prefer to use conditional
breakpoints. This doesn’t always find the offending PMC right
away, but it narrows down the problem considerably. p p_arg
told me that the PMC’s memory address is 0×81c5bf0, so:
(gdb) l pmc_new
59 PARROT_API
60 PARROT_CANNOT_RETURN_NULL
61 PARROT_MALLOC
62 PMC *
63 pmc_new(PARROT_INTERP, INTVAL base_type)
64 {
65 PMC *pmc;
66 PMC *const classobj = interp->vtables[base_type]->pmc_class;
67 if (!PMC_IS_NULL(classobj) && PObj_is_class_TEST(classobj))
68 pmc = VTABLE_instantiate(interp, classobj, PMCNULL);
(gdb)
69 else {
70 pmc = get_new_pmc_header(interp, base_type, 0);
71 VTABLE_init(interp, pmc);
72 }
73 return pmc;
74 }
75
76 /*
77
(gdb) break 73 if pmc == 0x81c5bf0
Breakpoint 1 at 0xb7ca43d3: file src/pmc.c, line 73.
(gdb) run
Note that all line numbers are valid as of Parrot revision 22628 on the trunk.
The breakpoint stopped execution five times, then the segfault occurred.
I ran the program again, using c to continue, until the fifth
occurrence.
Breakpoint 1, pmc_new (interp=0x804e008, base_type=42) at src/pmc.c:73
73 return pmc;
(gdb) bt
#0 pmc_new (interp=0x804e008, base_type=42) at src/pmc.c:73
#1 0xb7d273d4 in init_first_dest_named (interp=0x804e008, st=0xbf937c80)
at src/inter_call.c:701
#2 0xb7d280e3 in Parrot_process_args (interp=0x804e008, st=0xbf937c80,
param_or_result=PARROT_PASS_PARAMS) at src/inter_call.c:1071
#3 0xb7d283df in parrot_pass_args (interp=0x804e008, src_ctx=0x8216fe0,
dest_ctx=0x82ab878, src_indexes=0x822624c, dest_indexes=0xb6b6aa1c,
param_or_result=PARROT_PASS_PARAMS) at src/inter_call.c:1189
#4 0xb7cb94f6 in Parrot_get_params_pc (cur_opcode=0xb6b6aa1c,
interp=0x804e008) at src/ops/core.ops:543
#5 0xb7d5b3db in runops_gc_debug_core (interp=0x804e008, pc=0xb6b6aa1c)
at src/runops_cores.c:231
There’s one important piece of information that I didn’t have last time:
the type of the PMC is 42. include/parrot/core_pmcs.h reveals that
this value corresponds to the enumeration enum_class_Hash,
which represents the Hash PMC. The backtrace shows that the PMC gets
created in init_first_dest_named() as part of the argument
passing process, which makes me a little bit suspicious.
The GC debugging runcore only runs the garbage collector between Parrot operations. It’s unlikely that Parrot will pause during argument passing to run the garbage collector, then resume, so I’m curious where this PMC will end up:
(gdb) up
#1 0xb7d273d4 in init_first_dest_named (interp=0x804e008, st=0xbf937c80)
at src/inter_call.c:701
701 st->dest.slurp = pmc_new(interp, Parrot_get_ctx_HLL_type(interp, enum_class_Hash));
(gdb) l
696 else if (sig & PARROT_ARG_SLURPY_ARRAY) {
697 int idx;
698
699 /* Create PMC for slurpy mode and register it; we must do this
700 * otherwise it may get collected. */
701 st->dest.slurp = pmc_new(interp, Parrot_get_ctx_HLL_type(interp, enum_class_Hash));
702 dod_register_pmc(interp, st->dest.slurp);
703
704 /* pass the slurpy hash */
705 idx = st->dest.u.op.pc[i];
706 CTX_REG_PMC(st->dest.ctx, idx) = st->dest.slurp;
Line 701 is the key here, as it creates the Hash and assigns it to the
call state. However, line 702 confuses me. dod_register_pmc()
tells Parrot to pin a PMC so that it is always live during the mark phase.
Something must unregister this PMC later on, and it occurs higher up the
call stack in parrot_pass_args() on line 1193. I remember
intending to write something like this code at some point.
This probably isn’t the culprit, as the lifecycle of this object looks
correct to me. The Hash must get assigned elsewhere sometime while it’s
valid, and that destination may not mark it correctly as live. That
assignment occurs on line 706 (the line after my list
command, but that’s okay).
The CTX_REG_PMC macro accesses a PMC register in a Parrot
context. Contexts represent what would normally be call frames in a
stack-based language. As Parrot is a register-based virtual machine that
uses continuation-passing style for flow control, a context contains a
variable number of the four types of registers as well as a pointer to the
calling context and a pointer to the return context. Think of them as the
equivalent to call frames and you’ll be fine.
I know that contexts get marked (mark_context() in
src/sub.c because I patched that code a couple of days ago (but
didn’t check in my change, as I wondered if it would cause GC problems).
The next debugging step was to track down what happened to this destination
context. The destination context is the arguments as the invoked entity
will receive them.
Up one more level, Parrot_process_args() grabs the
destination context out of the call state (line 957), but the rest of the
functions don’t store it anywhere. Something beneath this point in the
stack must assign the ctx pointer of the call state’s
dest struct somewhere.
An interesting experiment is to set a watchpoint in GDB on this
structure. A read watchpoint seems most appropriate. This doesn’t always
work, because execution might leave the context of the expression’s
lifetime, but it may work here. I moved up in the call stack to
parrot_pass_args, set a watchpoint, and then continued:
(gdb) up
#3 0xb7c753df in parrot_pass_args (interp=0x804e008, src_ctx=0x8216fe0,
dest_ctx=0x82ab878, src_indexes=0x822624c, dest_indexes=0xb6ab7a1c,
param_or_result=PARROT_PASS_PARAMS) at src/inter_call.c:1189
1189 Parrot_process_args(interp, &st, param_or_result);
(gdb) rwatch st->dest->ctx
Hardware read watchpoint 2: st->dest->ctx
(gdb) c
Continuing.
Hardware read watchpoint 2: st->dest->ctx
Value = (parrot_context_t *) 0x82ab878
0xb7c7440e in init_first_dest_named (interp=0x804e008, st=0xbfb7dec0)
at src/inter_call.c:706
706 CTX_REG_PMC(st->dest.ctx, idx) = st->dest.slurp;
(gdb) c
Continuing.
Watchpoint 2 deleted because the program has left the block in
which its expression is valid.
0xb7c064f6 in Parrot_get_params_pc (cur_opcode=0xb6ab7a1c, interp=0x804e008)
at src/ops/core.ops:543
543 parrot_pass_args(interp, caller_ctx, ctx, src_indexes, dst_indexes, PARROT_PASS_PARAMS);
(gdb)
That didn’t work. (That error is the same one I usually get.) What could
be keeping hold of this PMC but not marking it? Looking back in the
backtrace, I checked the signature of parrot_pass_args() once
again:
(gdb) l parrot_pass_args
1164 NOTNULL(parrot_context_t *src_ctx),
1165 NOTNULL(parrot_context_t *dest_ctx),
1166 NOTNULL(opcode_t *src_indexes),
1167 NOTNULL(opcode_t *dest_indexes),
1168 arg_pass_t param_or_result)
Then at Parrot_init_arg_indexes_and_sig_pmc(), called once
each for the source and destination contexts within
parrot_pass_args(). Specifically, line 244 is:
sti->ctx = ctx;
… called from:
Parrot_init_arg_indexes_and_sig_pmc(interp, dest_ctx, dest_indexes,
dest_signature, &st.dest);
… at line 1188 in parrot_pass_args(). That is, set
st->dest to the destination context. Whatever that context is,
it comes from the call frame immediately preceding this one, the
get_params opcode in src/ops/core.ops:
interp->current_params = _this;
ctx = CONTEXT(interp->ctx);
ccont = ctx->current_cont;
caller_ctx = ctx->caller_ctx;
src_indexes = interp->current_args;
dst_indexes = interp->current_params;
/* the args and params are now 'used.' */
interp->current_args = NULL;
interp->current_params = NULL;
parrot_pass_args(interp, caller_ctx, ctx, src_indexes,
dst_indexes, PARROT_PASS_PARAMS);
Thus if there’s a GC problem with the slurpy PMC not getting marked,
it’s because the current context is not marking its caller context as live,
which is exactly what my experimental patch removed. That’s a pretty clear
confirmation that the patch was wrong, and that I shouldn’t commit it.
Moving back to src/sub.c, I restored two lines into
mark_context() at line 66:
if (ctx->caller_ctx)
mark_context(interp, ctx->caller_ctx);
From there, I rebuilt Parrot (make parrot), disabled the
breakpoint in gdb (disable 1), and ran the program again:
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
`/home/chromatic/dev/parrot/parrot' has changed; re-reading symbols.
Error in re-setting breakpoint 1:
No source file named src/pmc.c.
Starting program: /home/chromatic/dev/parrot/parrot --runcore=gcdebug t/pmc/sub_55.pir
Error in re-setting breakpoint 1:
No source file named src/pmc.c.
Error in re-setting breakpoint 1:
No source file named src/pmc.c.
Breakpoint 1 at 0xb7ce83d3: file src/pmc.c, line 73.
[Thread debugging using libthread_db enabled]
[New Thread -1212933952 (LWP 18828)]
[New Thread -1213072496 (LWP 18829)]
[New Thread -1221465200 (LWP 18830)]
"VAR1" => PMC 'PGE::Match' => "aabbb" @ 3
Program exited normally.
I’m a little bit disappointed that my patch didn’t work; it cut out a lot of duplicate marking. Because Parrot allows native access to continuations from all languages running on Parrot (if they choose to support it, anyway), contexts form a call graph, and repeatedly marking the same context is possible when traversing the graph during the mark phase. It would be nice to be able to identify contexts marked during the current mark phase so as to avoid re-marking them here. A little bit of caching could save us a tremendous amount of work in long-lived programs with complex graphs. The solution is more than just adding a live flag to the context structure, as those flags need to get cleared during the sweep. (It’s easy to clear live flags when sweeping pools. Walk the pool and test each object for the live flag. If the object is live, clear the flag. Otherwise, put the object back on the free list.)
My disappointment runs shallow, however. When I first tested this patch, I ran the full test suite and nothing appeared extraordinary. I ran the PMC tests (t/pmc/*.t) under the GC debugging core and started to see unexpected failures. Even though the patch turned out to be faulty, the new runcore helped me see and diagnose the problem without exposing the problem to language implementors on other platforms. I had hoped that the problem would be elsewhere, but it’s clear that there’s a better solution elsewhere.
Preventing developers from checking in bugs is a tremendous benefit, and I learned something important in the process. That’s definitely progress.
