My home office network connection went out for a while yesterday. Instead of taking the day off to rest and relax, I asked fellow Perl 6 and Parrot hacker Allison Randal to walk me through building a compiler for a little language with Parrot’s compiler tools.
Though we didn’t finish the project before the fairly useless telephone company fixed the problem, I learned a lot about how the compiler tools work. Here’s what I know now.
The Parrot compiler tools are a suite of utilities and languages and libraries intended to make it easier to write compilers for Parrot. This includes the official Perl 6 compiler, according to the current plan.
There are preliminary versions of each of these tools in the Parrot tree. So far, only one language really uses them — Punie, a port of Perl 1 to Parrot. (This is Allison’s current project and the real driver of the compiler tools. Though Punie itself seems fairly useless, it’s a good project because it has a low technical risk of failure, it’s similar in structure to Perl 6 but much smaller, and it has a wonderful pun in the name, even if I must admit that I came up with it.) There is also an APL port in the works from Will “hey, didn’t he do a lot of work on the Tcl port too?” Coleda.
Most languages don’t map directly to PIR constructs. That is, it’s possible only for trivial languages to use regular expressions or very simple parsers to translate their source code directly to source code that Parrot can execute. Real compiler suites are much more complicated — they need to lex and parse data and then emit some sort of object or source or byte code that other processes can run. This often requires multiple compilation stages or layers. That’s how Parrot’s compiler tools work.
The entire process is conceptually simple, once you understand the multi-stage approach. The compiler takes source code in another language and lexes and parses it into a data structure (usually a tree, though potentially a graph). From there, the compiler walks the tree, often transforming it into a different type of tree to make analysis, optimization, and code emission easier. Then the compiler emits its output.
In Parrot, PGE (the Parrot Grammar Engine) performs lexing and parsing and produces a parse tree. TGE (the Tree Grammar Engine) supports the transformation of the parse tree into a PAST (Parrot Abstract Syntax Tree) representation, then to a POST (Parrot Optimized/Output Syntax Tree) representation. Finally, TGE can transform POST into PIR, Parrot’s high-level native language.
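Put together, the whole pipeline looks something like this:

```
source code --PGE--> parse tree --TGE--> PAST --TGE--> POST --TGE--> PIR
```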
PGE is a pure-Parrot compiler for Perl 6 rules, as specified in Apocalypse, Exegesis, and Synopsis 5. If you’ve ever tried writing a parser with Perl 5 regular expressions, you’ll like Perl 6 rules much, much more.
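As a taste of the notation, here’s a sketch of what a rules-based grammar for a toy assignment language might look like. The grammar name, rule names, and productions are all invented for illustration; the exact grammar-declaration syntax PGE expects may differ slightly from what’s shown:

```
grammar Toy::Grammar;

rule program    { <statement>* }
rule statement  { <ident> ':=' <expression> ';' }
rule expression { <term> [ <op> <term> ]* }
rule op         { '+' | '-' }
rule term       { <ident> | <number> }
rule ident      { \w+ }
rule number     { \d+ }
```

Each named rule becomes a named capture in the resulting match object, so `<statement>` matches nest inside `<program>` — the parse tree falls out of the grammar for free.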
The process of writing a grammar for a language is straightforward. PGE doesn’t (yet) support BNF grammars, but the process was easy enough even for me, and I hate parsing. Admittedly, I deliberately chose a language with very simple syntax, but writing the parser took only about 30 minutes, and most of that was time spent debugging a simple whitespace-related typo.
PGE can produce a pre-compiled grammar — that is, it can emit PIR code that runs without needing PGE itself. Making this was trivial (though I admit to copying Allison’s Makefile and tweaking it before asking her for help). This part of the process was the easiest, and I ended up with a small driver program that loads the pre-compiled grammar, loads the source code of the program to compile, and runs that source through the grammar, giving me back a PGE match object: the parse tree.
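The driver looked roughly like this. The grammar name and file names are stand-ins, and the exact Parrot API for finding globals and doing I/O has shifted between versions, so treat this as a sketch of the shape rather than working code:

```
.sub main :main
    .param pmc args

    # load the pre-compiled grammar emitted by PGE
    load_bytecode 'ToyGrammar.pir'

    # slurp the source file named on the command line
    .local string filename, source
    .local pmc file
    filename = args[1]
    file     = open filename, '<'
    source   = read file, 65535
    close file

    # find the grammar's top-level rule and match the source against it
    .local pmc toprule, match
    toprule = find_global 'Toy::Grammar', 'program'
    match   = toprule(source)

    # 'match' is now the parse tree for the rest of the pipeline
.end
```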
The next step in the process is to transform the parse tree into an abstract syntax tree. One reason this is useful has to do with the way parsing works. It’s common to have a lot of little parsing rules for the various odd syntactic constructs in your language, each handled slightly differently even if they map to the same behavior. Building an AST unifies this behavior again, so the next stages of compilation can be somewhat less complex.
TGE is a set of tools that support attribute grammars. Think of these as little parsers that operate not on source code but on trees, producing different trees as output.
Could you walk the trees yourself to convert them? Absolutely — but haven’t you written enough tree-walking code?
TGE works a lot like PGE, in that you give it a small file of rules that match certain nodes in the tree, with (currently PIR) code to execute on the matching nodes. Ultimately, it won’t use only PIR, as PIR itself can be a bit clunky for creating complex data structures.
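A TGE rule pairs a node type with a chunk of PIR to run when the transformation visits a node of that type. This is a guess at the shape, modeled loosely on the examples in the Parrot tree; the node type, attribute name, and PAST class here are all invented:

```
transform result (Toy::Statement) {
    # build and return a PAST node for a statement node;
    # 'Toy::Statement' and 'PAST::Op' are illustrative names
    .local pmc past
    past = new 'PAST::Op'
    .return (past)
}
```

The engine handles the tree walking; your rules only say what each kind of node becomes.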
To make the most of TGE, you have to know something about how PAST and POST interact. As I see it, these are mostly conventions, but if the tools support them natively, you should get to know them.
PAST is the first tree transformation step. Actually, the step is really from parse tree to PAST. In this step, you use TGE to go from the match object’s parse tree to PAST nodes. PAST nodes are just Parrot objects, just as parse tree nodes are merely PGE objects. This does mean that you need to define (or steal) node types such as PAST::Op, and whatever other node types your PAST tree needs. I borrowed some from Punie, at Allison’s suggestion. Hopefully the existence of more languages will fill out the set of available and necessary node types sufficiently that they can become part of TGE proper and all languages can share the same objects.
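Defining such a node type is just ordinary Parrot class-building. A minimal sketch, using Parrot’s class opcodes (the attribute names are invented, and real PAST classes carry more than this):

```
.sub _onload :load
    # declare a PAST::Op class with a couple of attributes;
    # 'op' and 'children' are illustrative attribute names
    .local pmc class
    class = newclass 'PAST::Op'
    addattribute class, 'op'
    addattribute class, 'children'
.end
```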
At this point, understanding how abstract syntax trees work is very helpful. Without going into too much detail (either to make my fingers tired or to make a silly mistake in explanation that generates heaps of angry e-mails), the AST resembles the execution order of a program much more closely. That is, to add two numbers together, you must evaluate the two numbers first, then perform the addition. (This is much more complex and useful when it’s not simply two numbers, but two expressions you must evaluate to numbers before you can add them.)
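For example, an expression like `1 + 2 * 3` might become an AST shaped like this (node names invented):

```
      Op(+)
     /     \
 Val(1)   Op(*)
         /     \
     Val(2)   Val(3)
```

Walking the tree children-first visits 2 and 3, then the multiplication, then the addition — exactly the evaluation order the program needs — so a bottom-up walk gives you correct code emission almost for free.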
Why are there two steps to transform trees? The parse tree of match objects from PGE is too close to the parser; it’s difficult to perform general tree transformations on it without going to an intermediate step first. That step is PAST. The next step is POST.
The POST step allows for optimizations, but more importantly it transforms the PAST into a form that is much easier to emit and execute. It works much like the PAST step, requiring POST node types (and corresponding Parrot objects). It also has its own grammar file, specific to your language. There’s not much else to say here, except that the multi-stage tree transformations make compiling non-trivial languages much easier.
I’ll have more intelligent things to say about this with more experience.
The final transformation step is from POST to PIR. (This doesn’t have to be the final step. There’s no reason Parrot can’t or won’t someday have a native AST interface. This is just the simplest way to run and view the output.) There’s another tree grammar here that traverses the POST nodes, visiting each appropriately and emitting PIR output for it. The more sensible your POST, the easier it is to write the correct code in the correct order.
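For a tiny expression such as `1 + 2 * 3`, the emitted PIR might look something like this. This is real PIR syntax, though what your own POST grammar emits will certainly differ:

```
.sub main :main
    # children first: evaluate the multiplication,
    # then the addition that depends on its result
    .local int product, sum
    product = 2 * 3
    sum     = 1 + product
    print sum
    print "\n"
.end
```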
Once you have PIR, you can run it in Parrot natively, register your compiler suite as a Parrot compiler and let Parrot handle code in your language as it does code in any supported language, or even emit Parrot bytecode so that no one knows you started from a non-PIR language.
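Registration goes through Parrot’s `compreg` opcode, which has two forms: one registers a compiler object under a language name, the other fetches a registered compiler by name. Roughly — the `Toy` language and the sub names are hypothetical:

```
# register the compiler when this library loads
.sub _onload :load
    .local pmc compiler
    compiler = find_global 'Toy::Compiler', 'compile'
    compreg 'Toy', compiler
.end

# later, any Parrot code can compile and run Toy source
.sub run_toy
    .param string source
    .local pmc compiler, code
    compiler = compreg 'Toy'
    code     = compiler(source)
    code()
.end
```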
This does mean that writing a compiler suite for a language requires some experimentation to find the right way to represent the language in tree forms and where the right place is for a specific transformation. I don’t know of any good way to do this without experience and some familiarity with the existing PAST and POST objects.
Fortunately, having more languages use the compiler tools (Punie, APL, and my secret project, if I can get it to the point of running my next sample program and adding a few more primitives) will serve many purposes. First, it’ll mean that more people know how the tools work, likely leading to more people able to answer questions (and write about them). Second, it’ll inspire more refactorings to merge common and duplicated code into a single place. Third, it will (hopefully) lead to better tools, as all of the languages explore the same types of computations in different ways, making it obvious what is common and what is different.
If you want to use these tools now on your own language, or if you want to help with an existing language port, you do need to be comfortable with Parrot and PIR. It was tremendously helpful to translate my sample program to PIR myself, just to see what type of output I expected.
Even though it probably would have been faster (at least for my demo programs) to write a small interpreter or translate the code to PIR directly, using the compiler tools (and having Allison explain them by example) showed how valuable this process is for a serious language. Layered designs are as useful here as they are in other types of programs.