This article describes how the m88k-specific backend of the GNU C compiler, gcc, was fixed, from the discovery and analysis of the problems to the real fixing work. Since it started with almost zero gcc internals knowledge, it should be understandable by anyone able to read C code, and proves that diving into gcc is not as hard as one could imagine.
Most of the code snippets displayed below come from gcc 2.95 sources,
in the m88k specific code (found inside
gcc/config/m88k/m88k.*). For more details about the gcc
internals, the reader is welcome to refer to the Resources.
The Motorola 88000 architecture is not well-known today. Think of it as a bridge between the famous Motorola 68000 family and the well-known PowerPC family. Although now a dead architecture, many fine m88k-based systems were produced from 1988 to 1992, such as the Data General Aviion workstations and Motorola's own embedded systems.
Due to the elegance of its design and the availability of second-hand machines, the m88k systems became and remain quite popular among hobbyists. No wonder that several free operating systems are ported, or are being ported to, these machines. The most advanced effort was the OpenBSD/mvme88k port to the Motorola VME boards.
Nivas Madhur started the OpenBSD/mvme88k port in 1995. Back then, OpenBSD would still ship with gcc 2.7.2.1 and local patches (these were the days!), and the problems Nivas had to face were dire kernel bugs, so no real effort was done on the toolchain. Optimization was disabled in order to prevent compiler bugs, if any, from interfering.
Nivas Madhur eventually stopped working on the port. Dale Rahn integrated it into the OpenBSD main sources, but Dale did not have the resources to maintain it. Steve Murphree, Jr eventually took over. At some point, OpenBSD started to use gcc 2.8, which did not fix the optimizer problems.
Steve was very close to producing an OpenBSD/mvme88k release.
Unfortunately, some weeks before the code freeze, the in-tree gcc was updated
to egcs 1.1 and compiler problems started to plague the port: even at
-O0 non-optimization level, the compiler would not always output
correct code. As the kernel was compiled with -O as an exception
to the -O0 rule, people started running unreliable kernels.
At some point, the userret() code path factorization in the
OpenBSD kernel for all architectures required kernels to be built with
optimization, relying on userret() being an inline function.
Unfortunately, gcc, by design, will never inline functions (even if explicitly
requested) at -O0. The non-return point had been crossed: gcc had
to be fixed.
Finding and fixing compiler problems is never easy. It's like climbing a mountain: you need a good rope, and good assets. Moreover, a debugger is mostly useless: you're not trying to find why code behaves incorrectly, but rather what causes gcc to produce incorrect code.
In my case, I made sure to keep a known working gcc 2.8 binary in a safe place, which could be used to bootstrap gcc 2.95 first, then as a working reference. I also made sure I had a good backup.
After compiling gcc 2.95 with gcc 2.8, my first test was to compile and run a kernel and hope for the best. After a long compilation, my hopes were smashed very quickly: the kernel failed very early in an assertion:
panic: kernel diagnostic assertion "obj == NULL || anon == NULL" failed:
file "/usr/src/sys/uvm/uvm_page.c", line 899
dropping me into the debugger. Yet the function parameters from the traceback were apparently correct!
From the assertion message and the debugger traceback, I could easily
reconstruct the code flow. Since the assertion failure would happen in an
uvm_pagealloc_start invocation, I built the following simple
program to reproduce a similar flow:
$ cat assert.c
#include <stdio.h>
#include <sys/types.h>
#define KASSERT(e) ((e) ? (void) 0 : __assert( __FUNCTION__, #e))
void
__assert(const char *funcname, const char *error)
{
printf("Assertion failed in %s: %s\n", funcname, error);
/* exit(0); */
}
void *
uvm_pagealloc_strat(void *obj, u_int64_t off, void *anon, int flags, int strat,
int free_list)
{
KASSERT(anon == NULL);
return obj;
}
void *
uvm_pagealloc(void *obj, u_int64_t off, void *anon, int flags)
{
KASSERT(anon == NULL);
return obj;
}
main()
{
char *pg;
char *kobj = "kobj";
pg = uvm_pagealloc(kobj, 0, NULL, 0);
pg = uvm_pagealloc(kobj, 0, NULL, 0);
pg = uvm_pagealloc_strat(kobj, 0, NULL, 0, 0, 0);
pg = uvm_pagealloc(kobj, 0, NULL, 0);
}
$
When compiled with gcc 2.95, this program reproduced the problem:
$ gcc295 -O0 -o assert assert.c
$ ./assert
Assertion failed in uvm_pagealloc: anon == NULL
$
gcc 2.8 worked fine:
$ gcc28 -O0 -o assert assert.c
$ ./assert
$
More interestingly, the assertion would only fail for the first call but not afterwards. This would hint toward either incorrect stack or incorrect register usage, eventually corrected by the side effects of multiple function calls.
After some tinkering, I finally ended with this interesting sample program:
$ cat eighty2.c
#include <stdio.h>
#include <sys/types.h>
void
odd64(int oddmaker, u_int64_t stamper, int value)
{
printf("odd64: value=%d\n", value);
}
void
odd32(int oddmaker, u_int32_t stamper, int value)
{
printf("odd32: value=%d\n", value);
}
void
even64(int oddmaker, int evenmaker, u_int64_t stamper, int value)
{
printf("even64: value=%d\n", value);
}
void
even32(int oddmaker, int evenmaker, u_int32_t stamper, int value)
{
printf("even32: value=%d\n", value);
}
main()
{
odd64(0, 0, 1);
odd32(0, 0, 1);
even64(0, 0, 0, 1);
even32(0, 0, 0, 1);
}
$
When run, this program would produce:
$ gcc295 -O0 -o eighty2 eighty2.c
$ ./eighty2
odd64: value=3
odd32: value=1
even64: value=1
even32: value=1
$
instead of:
$ gcc28 -O0 -o eighty2 eighty2.c
$ ./eighty2
odd64: value=1
odd32: value=1
even64: value=1
even32: value=1
$
The problem was apparently tied to the use of a 64 bit argument, but only in some cases. Why?
Let's examine the m88k calling convention for these routines. The
canonical m88k calling convention mandates that the arguments are passed
in registers r2 to r9, with extra arguments
passed on the stack. If an argument can not fit in one register (such as
a double, or an int64_t), it will be put in two
consecutive registers starting at an even number, so that double word load
and store instructions can be used. In our case, the calling convention
would be:
Calling convention for even64()
r2 - oddmaker
r3 - evenmaker
r4, r5 - stamper
r6 - value
Calling convention for odd64()
r2 - oddmaker
r3 (wasted)
r4, r5 - stamper
r6 - value
However, looking at the code generated by gcc 2.95, odd64()
would be invoked with
r2 - oddmaker
r3 (unused)
r4, r5 - stamper
stack - value
Why would the last parameter be passed on the stack with the new compiler?
Let's dive in to the gcc sources. A large piece of code in
gcc/calls.c is responsible for function invocations, choosing
where to pass arguments, whether on the stack or in a specific
register. To do this, it relies upon a set of macros provided by the
processor-dependent backend of gcc: the FUNCTION_ARG macro
set.
In this particular case, the macros misbehave as soon as they encounter
a 64-bit parameter, for which it is necessary to skip and waste an
odd-numbered register. As such, the problem probably lies in
FUNCTION_ARG or FUNCTION_ARG_ADVANCE. The first
macro will decide where to put the argument while the second one updates a
position counter for the next FUNCTION_ARG update to know at
which register number or stack location to start.
A simple grep through the gcc sources shows that gcc 2.8 invokes
FUNCTION_ARG_ADVANCE in five places:
$ grep FUNCTION_ARGS_ADVANCE *.c
calls.c: FUNCTION_ARG_ADVANCE (args_so_far, TYPE_MODE (type), type,
calls.c: FUNCTION_ARG_ADVANCE (args_so_far, mode, (tree) 0, 1);
calls.c: FUNCTION_ARG_ADVANCE (args_so_far, Pmode, (tree) 0, 1);
calls.c: FUNCTION_ARG_ADVANCE (args_so_far, mode, (tree) 0, 1);
function.c: FUNCTION_ARG_ADVANCE (args_so_far, promoted_mode,
$
as does gcc 2.95:
$ grep FUNCTION_ARGS_ADVANCE *.c
calls.c: FUNCTION_ARG_ADVANCE (*args_so_far, TYPE_MODE (type), type,
calls.c: FUNCTION_ARG_ADVANCE (args_so_far, mode, (tree) 0, 1);
calls.c: FUNCTION_ARG_ADVANCE (args_so_far, Pmode, (tree) 0, 1);
calls.c: FUNCTION_ARG_ADVANCE (args_so_far, mode, (tree) 0, 1);
function.c: FUNCTION_ARG_ADVANCE (args_so_far, promoted_mode,
$
but note how the first invocation uses a pointer dereference now? Of
course, the FUNCTION_ARG_ADVANCE macro is not entirely
expansion-safe:
$ cd config/m88k
$ head -1069 m88k.h | tail -18
/* A C statement (sans semicolon) to update the summarizer variable
CUM to advance past an argument in the argument list. The values
MODE, TYPE and NAMED describe that argument. Once this is done,
the variable CUM is suitable for analyzing the *following* argument
with `FUNCTION_ARG', etc. (TYPE is null for libcalls where that
information may not be available.) */
#define FUNCTION_ARG_ADVANCE(CUM, MODE, TYPE, NAMED) \
do { \
enum machine_mode __mode = (TYPE) ? TYPE_MODE (TYPE) : (MODE); \
if ((CUM & 1) \
&& (__mode == DImode || __mode == DFmode \
|| ((TYPE) && TYPE_ALIGN (TYPE) > BITS_PER_WORD))) \
CUM++; \
CUM += (((__mode != BLKmode) \
? GET_MODE_SIZE (MODE) : int_size_in_bytes (TYPE)) \
+ 3) / 4; \
} while (0)
$
Notice how CUM, the first parameter, is used unprotected?
Guess what happens in the CUM++; statement when
CUM happens to be *args_so_far? Our exact
problem! Compiling a call to odd64() would trigger the
CUM++; statement from the macro invocation using the
pointer. This is just one more bug caused by unnoticed preprocessor-unsafe
code, especially since it had been working correctly in previous gcc
versions.
As a result of the bug, args_so_far would end up incremented,
pointing to a semi-random memory location holding a huge value. As a result,
subsequent FUNCTION_ARG invocations would consider that all the
r2-r9 registers are in use, placing the remaining arguments on the
stack.
That's not all. There is another bug left in there. Look more closely at the macro expansion:
enum machine_mode __mode = (TYPE) ? TYPE_MODE (TYPE) : (MODE); \
...
CUM += (((__mode != BLKmode) \
? GET_MODE_SIZE (MODE) : int_size_in_bytes (TYPE)) \
It is obvious that the GET_MODE_SIZE macro should be applied
to the revisited __mode, not the initial MODE. Yet
people keep telling me gcc 2.95 works for them under SysV/m88k. Happy fellows.
I eventually rewrote this macro as a function in my tree, as it had more
bugs and needed to follow the FUNCTION_ARG logic more closely,
which would make it too big to be worth kept as a macro:
/* Update the summarizer variable CUM to advance past an argument in
the argument list. The values MODE, TYPE and NAMED describe that
argument. Once this is done, the variable CUM is suitable for
analyzing the *following* argument with `FUNCTION_ARG', etc. (TYPE
is null for libcalls where that information may not be available.) */
void
m88k_function_arg_advance (args_so_far, mode, type, named)
CUMULATIVE_ARGS *args_so_far;
enum machine_mode mode;
tree type;
int named;
{
int bytes;
if ((type != 0) &&
(TREE_CODE(type) == RECORD_TYPE || TREE_CODE(type) == UNION_TYPE))
mode = BLKmode;
bytes = (mode != BLKmode) ? GET_MODE_SIZE (mode) : int_size_in_bytes(type);
if ((*args_so_far & 1) && (mode == DImode || mode == DFmode
|| ((type != 0) && TYPE_ALIGN (type) > BITS_PER_WORD)))
(*args_so_far)++;
(*args_so_far) += (bytes + UNITS_PER_WORD - 1) / UNITS_PER_WORD;
}
|
Even with the function calls fixed, gcc 2.95 was unable to recompile itself:
the build would die when running the helper program genrecog,
which would segfault and dump core.
The genrecog tool is used to generate insn-recog.c
from the machine-dependent backend script, in our case
config/m88k/m88k.md. It shares a lot of code with other
genfoo tools extracting the various information from the
m88k.md machine description file. In fact, I very quickly tracked
the breakage to genrecog.c. Compiling every other part of
genrecog except genrecog.c with gcc 2.95 and
compiling genrecog.c with gcc 2.8 produced a working binary.
While I could continue the build (and did several times!) either by
cross-generating insn-recog.c on a different machine or simply by
reverting to gcc 2.8 for genrecog.c build only, I had to dive in
this problem.
genrecog provides a lot of information (in fact, C source code)
on standard output. When compiled with gcc 2.95, it would die very quickly with
this output:
$ cd obj
$ ./genrecog /usr/src/gnu/egcs/gcc/config/m88k/m88k.md
/* Generated automatically by the program `genrecog'
from the machine description file `md'. */
#include "config.h"
#include "system.h"
#include "rtl.h"
#include "insn-config.h"
#include "recog.h"
#include "real.h"
#include "output.h"
#include "flags.h"
extern rtx gen_split_1 ();
extern rtx gen_split_2 ();
Memory fault (core dumped)
$
By searching for this kind of output (such as extern rtx
gen_split) in the source, I could quickly trace the segfault to this
snippet from main():
while (1)
{
c = read_skip_spaces (infile);
if (c == EOF)
break;
ungetc (c, infile);
desc = read_rtx (infile);
if (GET_CODE (desc) == DEFINE_INSN)
recog_tree = merge_trees (recog_tree,
make_insn_sequence (desc, RECOG));
else if (GET_CODE (desc) == DEFINE_SPLIT)
split_tree = merge_trees (split_tree,
make_insn_sequence (desc, SPLIT));
if (GET_CODE (desc) == DEFINE_PEEPHOLE
|| GET_CODE (desc) == DEFINE_EXPAND)
next_insn_code++;
next_index++;
}
The function responsible for the extern rtx gen_split
lines is make_insn_sequence. The program would then die
between the second and the third merge_trees invocation.
make_insn_sequence might produce an insn
sequence or whatever. To continue diving into genrecog, all
we need to know is what kind of return value it provides. A quick glance
at the source shows:
static struct decision_head make_insn_sequence PROTO((rtx, enum routine_type));
...
static struct decision_head merge_trees PROTO((struct decision_head,
struct decision_head));
These functions work on decision_head structures, which are
defined as:
/* Data structure for a listhead of decision trees. The alternatives
to a node are kept in a doubly-linked list so we can easily add nodes
to the proper place when merging. */
struct decision_head { struct decision *first, *last; };
The best way to know at which point invalid data would be generated was to
add a trace of the decision_head structures.
Adding traces to genrecog is very easy: since it outputs C
code, you just have to output your traces as C comments. Adding simple
traces to merge_trees proved very quickly that it was given
an invalid argument. But then traces in make_insn_sequence
would prove that it would produce valid data. The output would end like
this:
...
extern rtx gen_split_1 ();
/* make_insn_sequence -> 2e000.2e000 */
/* merge_trees <- 0.0 8.4008 */
/* merge_trees -> 8.4008 */
extern rtx gen_split_2 ();
/* make_insn_sequence -> 2e380.2e380 */
/* merge_trees <- 8.4008 18.4018 */
Memory fault (core dumped)
$
This is proof that the problem lies in the way short structures are passed to and returned from functions. Consider the following example:
$ cat maze.c
#include <stdio.h>
struct maze { const char *item1, *item2; };
struct maze
builder(void)
{
struct maze m;
m.item1 = "you are lost";
m.item2 = "in the maze.";
printf("builder: m.item1 = %p, item2 = %p\n", m.item1, m.item2);
return m;
}
void
checker(struct maze m)
{
printf("checker: m.item1 = %p, item2 = %p\n", m.item1, m.item2);
}
main()
{
checker(builder());
}
$
When compiled by gcc 2.8, it would output, for example:
$ gcc28 -O0 -o maze maze.c
$ ./maze
builder: m.item1 = 0x10f8, item2 = 0x1108
checker: m.item1 = 0x10f8, item2 = 0x1108
$
However, gcc 2.95 produces:
$ gcc295 -O0 -o maze maze.c
$ ./maze
builder: m.item1 = 0x10f8, item2 = 0x1108
checker: m.item1 = 0x0, item2 = 0x0
$
Trial and error showed that this problem only affected structures from 8 to 32 bytes, inclusive.
If you look at the generated code for main(), aside from the
prologue and epilogue code, you'll see that gcc 2.8 generates the following
correct code:
or r12,r0,r31
bsr _builder
bsr _checker
gcc 2.95 adds two extra instructions:
ld.d r24,r0,r31
or r12,r0,r31
bsr _builder
st.d r24,r0,r31
bsr _checker
Before we go further, we need to know more about the m88k calling
convention. The standard is never to pass the structures in registers,
always on the stack. If the function returns a struct itself, the address
of the returned struct should be set in r12 by the caller
which will have allocated the appropriate space.
In the gcc 2.8 code, the compiler has already optimized the call flow
so that the structure is immediately on the stack frame and is thus
immediately usable in checker(). After r12 is
set to point to the temporary location, builder() and
checker() are invoked.
gcc 2.95, on the other hand, adds two extra statements. The first one
saves the stack area which is about to be used by builder()
in registers r24 and r25. The second statement
restores this area immediately before invoking checker(),
effectively making it check uninitialized memory.
If we use a temporary variable to store the builder() result,
such as:
#if 0
checker(builder());
#else
struct maze m = builder();
checker(m);
#endif
then gcc 2.8 and gcc 2.95 will produce the same, correct, code:
addu r12,r30,8
bsr _builder
or r13,r0,r31
addu r11,r30,8
ld r12,r11,0
ld r11,r11,4
st r12,r13,0
st r11,r13,4
bsr _checker
In this case, the variable m is at r30(8).
After builder() returns, the structure is copied to
r31(0), which will be the frame for checker().
(Note that the structure could be copied with one ld.d and
one st.d instruction, rather than two ld and two
st...)
So we have a problem which affects structures only when they are not
forced to be in memory. It makes then sense to hunt for a bug in the
RETURN_IN_MEMORY macro, which will decide whether a
particular object can be returned in a register:
$ cd config/m88k
$ head -1001 m88k.h | tail -8
/* Disable the promotion of some structures and unions to registers. */
#define RETURN_IN_MEMORY(TYPE) \
(TYPE_MODE (TYPE) == BLKmode \
|| ((TREE_CODE (TYPE) == RECORD_TYPE || TREE_CODE(TYPE) == UNION_TYPE) \
&& !(TYPE_MODE (TYPE) == SImode \
|| (TYPE_MODE (TYPE) == BLKmode \
&& TYPE_ALIGN (TYPE) == BITS_PER_WORD \
&& int_size_in_bytes (TYPE) == UNITS_PER_WORD))))
$
But if we look back at the code for FUNCTION_ARG, the comments
say:
Structures and unions which are not 4 byte, and word
aligned are passed in memory rather than registers, even if they
would fit completely in the registers under OCS rules.
and the code indeed does:
if (type != 0 /* undo putting struct in register */
&& (TREE_CODE (type) == RECORD_TYPE || TREE_CODE (type) == UNION_TYPE))
mode = BLKmode;
...
if (mode == BLKmode
&& (TYPE_ALIGN (type) != BITS_PER_WORD
|| bytes != UNITS_PER_WORD))
return (rtx) 0;
which really prevents any structure larger than 4 bytes from being passed in a register.
A more realistic version of RETURN_IN_MEMORY becomes then:
/* Disable the promotion of some structures and unions to registers.
Note that this matches FUNCTION_ARG behavior. */
#define RETURN_IN_MEMORY(TYPE) \
(TYPE_MODE (TYPE) == BLKmode \
|| ((TREE_CODE (TYPE) == RECORD_TYPE || TREE_CODE(TYPE) == UNION_TYPE) \
&& (TYPE_ALIGN (TYPE) != BITS_PER_WORD || \
GET_MODE_SIZE (TYPE_MODE (TYPE)) != UNITS_PER_WORD)))
Unfortunately, this fix, albeit an improvement, did not solve the issue. However, this problem was specific to the m88k compiler and could not be triggered on various other architectures, from Alpha to Vax. Investigating the differences between m88k and the other CPUs was the next logical step.
Depending on the various way function calls work across the different cpus, with various calling conventions and stack layouts, not even counting register windows, the machine-independent part of gcc tries to offer as much flexibility as possible, letting architecture-dependent configuration files define the different behaviors they provide or expect, using a bunch of macros.
Some of these macros must be defined for every architecture, such as
LIBCALL_VALUE which defines where a function returns its
result (on m88k, in register r2), while other macros are only
defined if the architecture requires it. The m88k is unique in defining
REG_PARM_STACK_SPACE and
OUTGOING_REG_PARM_STACK_SPACE (actually, a few other
architectures do this as well), as well as
STACK_PARMS_IN_REG_PARM_AREA.
What do these macros tell the compiler? Let's quote from the manuals:
REG_PARM_STACK_SPACE (FNDECL)Define this macro if functions should assume that stack space has been allocated for arguments even when their values are passed in registers.
The value of this macro is the size, in bytes, of the area reserved for
arguments passed in registers for the function represented by
FNDECL, which can be zero if GNU CC is calling a library
function.
This space can be allocated by the caller, or be a part of the machine-dependent stack frame:
OUTGOING_REG_PARM_STACK_SPACE says which.
OUTGOING_REG_PARM_STACK_SPACENothing fancy here. The m88k-specific backend indeed will automagically allocate 32 bytes on the stack during the function prologue. Hey, wait, 32 bytes, exactly like the structure size threshold, where bigger structures don't get clobbered.
A simple experiment is to comment out
OUTGOING_REG_PARM_STACK_SPACE. Compiling gcc with these settings
will produce a stack-wasting compiler, because every function prologue will now
automagically allocate an extra 64 bytes on the stack: 32 from the
m88k-specific prologue and 32 from the architecture-independent prologue code,
since it has been now instructed to do so. However, when using this compiler,
the problem disappears completely, whatever the size of the structure.
Another path worth trying would be to remove this automatic stack
allocation, and undefine REG_PARM_STACK_SPACE. However, I am
afraid this could break some 88Open rule, or, even worse, some implicit
assumption hidden in the m88k-specific backend code.
Now it's time to look at STACK_PARMS_IN_REG_PARM_AREA. Quoting
the manuals again:
STACK_PARMS_IN_REG_PARM_AREADefine this macro if REG_PARM_STACK_SPACE is defined, but
the stack parameters don't skip the area specified by it.
Normally, when a parameter is not passed in registers, it is placed on the
stack beyond the REG_PARM_STACK_SPACE area.
Defining this macro suppresses this behavior and causes the parameter to be passed on the stack in its natural location.
This is very interesting, and also very obscure (I'll know what to blame for
my next headache). No other architecture defines this. If I understand this
correctly, it means that the REG_PARM_STACK_SPACE area can be
shared by both function call parameters, and local variables.
While gcc 2.8 would not care about this area being shared, and assumes we
know what we are doing, gcc 2.95 is more strict, and will explicitly save any
variable of this shared area, when it might be clobbered. This is exactly what
we witnessed. The area on the stack which has been saved before invoking
builder() and restored after it returned is the implicit location
for a temporary struct maze.
A real fix would be to teach gcc to handle the shared area as gcc 2.8 did,
but then this behaviour is objectionable, and probably more a bug than a
feature. My workaround was to stop defining
STACK_PARMS_IN_REG_PARM_AREA. The generated code becomes:
addu r12,r30,8
bsr _builder
addu r13,r31,32
addu r11,r30,8
ld r12,r11,0
ld r11,r11,4
st r12,r13,0
st r11,r13,4
bsr _checker
which is the same code as when the result of builder()
was stored in an explicit temporary variable, except for the stack
location: instead of being at the beginning of the
REG_PARM_STACK_SPACE area, it is now past this area. So there
is still some stack waste, but as told before I do not want to change
REG_PARM_STACK_SPACE at the moment.
With these bugs fixed, I reached a point where -O0 would
apparently always produce correct code. It was time to make optimization
reliable as well!
Miod Vallat makes all sort of strange telephony hardware work reliably for a living.
Return to the BSD DevCenter.
Copyright © 2009 O'Reilly Media, Inc.