Yesterday’s How to Patch Perl 5 explained the big picture of how to add a new feature to a dynamic language with a virtual machine. Now it’s time to discuss the technical details.

The TODO item suggested that the implementation was probably as simple as translating ... into the Perl 5 code die "Unimplemented";. That’s eminently possible through a source filter such as Filter::Simple (think macros for Perl 5), but I wanted something better.

As I suggested yesterday, there are two steps. The first is convincing the Perl parser to recognize this syntactic construct in statement form, and the second is generating the correct ops to perform the desired behavior. Recognizing a syntactic construct is the job of the lexer, and here’s the relevant portion of the patch:

diff --git a/toke.c b/toke.c
index 431938f..fecbf9f 100644
--- a/toke.c
+++ b/toke.c
@@ -368,6 +368,7 @@ static struct debug_tokens {
     { WHEN,            TOKENTYPE_IVAL,         "WHEN" },
     { WHILE,           TOKENTYPE_IVAL,         "WHILE" },
     { WORD,            TOKENTYPE_OPVAL,        "WORD" },
+    { YADAYADA,                TOKENTYPE_IVAL,         "YADAYADA" },
     { 0,               TOKENTYPE_NONE,         NULL }
 };
 
@@ -4774,6 +4775,10 @@ Perl_yylex(pTHX)
        pl_yylval.ival = 0;
        OPERATOR(ASSIGNOP);
     case '!':
+       if (PL_expect == XSTATE && s[1] == '!' && s[2] == '!') {
+           s += 3;
+           LOP(OP_DIE,XTERM);
+       }
        s++;
        {
            const char tmp = *s++;
@@ -5025,10 +5030,14 @@ Perl_yylex(pTHX)
            AOPERATOR(DORDOR);
        }
      case '?':                 /* may either be conditional or pattern */
-        if(PL_expect == XOPERATOR) {
+       if (PL_expect == XSTATE && s[1] == '?' && s[2] == '?') {
+           s += 3;
+           LOP(OP_WARN,XTERM);
+       }
+       if (PL_expect == XOPERATOR) {
             char tmp = *s++;
             if(tmp == '?') {
-                 OPERATOR('?');
+               OPERATOR('?');
             }
              else {
                 tmp = *s++;
@@ -5067,6 +5076,10 @@ Perl_yylex(pTHX)
            PL_expect = XSTATE;
            goto rightbracket;
        }
+       if (PL_expect == XSTATE && s[1] == '.' && s[2] == '.') {
+           s += 3;
+           OPERATOR(YADAYADA);
+       }
        if (PL_expect == XOPERATOR || !isDIGIT(s[1])) {
            char tmp = *s++;
            if (*s == tmp) {

This lexer is a state machine with a little bit of lookahead. I added a new token type for the lexer to pass to the parser — the YADAYADA token. This isn’t always necessary, but it does add a small bit of self-documentation to the process.

As you might expect, the lexer walks through a program’s source code one character at a time, and there’s an enormous switch statement at its heart. The first two chunks of lexing code add support for !!! and ??? respectively. It should be easy to see how they look for three exclamation points or question marks in a row: the checks against s[1] and s[2] are the lookahead. PL_expect records what kind of token the parser expects next. This syntax is only valid where the Perl grammar expects a statement, hence the comparison against XSTATE.
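To make the shape of that check concrete outside of toke.c, here is a minimal sketch in plain C. Every name in it (struct lexer, lex_triple, the X_ and T_ constants) is a hypothetical stand-in of mine, not part of the Perl source, and it models only the three-character lookahead guarded by the "expecting a statement" state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real lexer's state and token types. */
enum expect { X_STATE, X_OPERATOR, X_TERM };
enum token  { T_NONE, T_DIE, T_WARN, T_YADAYADA };

struct lexer {
    const char *s;       /* current position in the source buffer */
    enum expect expect;  /* what the parser wants to see next */
};

/* True when character c appears three times in a row at s.
 * The comparisons short-circuit, so a short string is never overrun. */
static int is_triple(const char *s, char c)
{
    return s[0] == c && s[1] == c && s[2] == c;
}

/* Try to recognize !!!, ???, or ... at the current position.
 * Returns T_NONE and leaves the lexer untouched if nothing matches. */
static enum token lex_triple(struct lexer *lx)
{
    assert(lx != NULL && lx->s != NULL);

    /* The new syntax is only valid where a statement may begin. */
    if (lx->expect != X_STATE)
        return T_NONE;

    if (is_triple(lx->s, '!')) {
        lx->s += 3;
        lx->expect = X_TERM;   /* a message expression should follow */
        return T_DIE;
    }
    if (is_triple(lx->s, '?')) {
        lx->s += 3;
        lx->expect = X_TERM;
        return T_WARN;
    }
    if (is_triple(lx->s, '.')) {
        lx->s += 3;            /* this toy leaves the expect state alone */
        return T_YADAYADA;
    }
    return T_NONE;
}
```

Calling lex_triple on a lexer positioned at "!!! $msg" in the X_STATE state consumes the three exclamation points and leaves the pointer on the following space. The same input in the X_OPERATOR state is left alone, so the ordinary handling of ! can proceed.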

The LOP macro is a little confusing, because it ties directly into how Perl represents programs internally.

I said before that ... is equivalent to die "Unimplemented";. Similarly, !!! $some_message; is equivalent to die $some_message; and ??? $another_message; is equivalent to warn $another_message;. This means that the optree produced by these new operators must be the same as the optree produced by their equivalent forms.

This is easy to discover in Perl 5:

$ perl -MO=Concise -e 'die "Unimplemented"'
6  <@> leave[1 ref] vKP/REFC ->(end)
1     <0> enter ->2
2     <;> nextstate(main 1 -e:1) v ->3
5     <@> die[t1] vK/1 ->6
3        <0> pushmark s ->4
4        <$> const[PV "Unimplemented"] s ->5
-e syntax OK

This probably looks like gibberish to you, but it’s a textual representation of Perl’s optree. The numbers correspond to execution order, and the symbols in angle brackets denote the type of op. What’s most important here is the die op. The leading @ means that it’s a list op, and it has two children: a stack-manipulating pushmark op (Perl 5 is a stack-based VM) and a constant string (the SvPV type).

The LOP macro in the tokenizer produces a list op of the given type (you can figure out what OP_WARN and OP_DIE represent) and expects a following term — a string or variable or expression which evaluates to a term. That’s all. The rest of the parsing process grafts this branch into the optree properly, and everything works as expected.
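As a loose model of that idea (not the real macro, which does more bookkeeping), the sketch below shows a lexer helper that hands the parser a generic "list operator" token, attaches the opcode as the token's value, and records what should come next. All of the names here, including list_op_token, TK_LSTOP, and the OPC_ constants, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; they only mirror roles in the lexer. */
enum expect     { X_STATE, X_OPERATOR, X_TERM };
enum opcode     { OPC_DIE, OPC_WARN };
enum token_kind { TK_LSTOP };   /* generic "list operator" token */

struct lexer {
    const char *s;        /* current position in the source */
    enum expect expect;   /* what the parser should see next */
    int ival;             /* value attached to the token just returned */
};

/* Toy counterpart to LOP(op, next): emit a list-operator token that
 * carries the opcode, and note what kind of token should follow it. */
static enum token_kind list_op_token(struct lexer *lx, enum opcode op,
                                     enum expect next)
{
    assert(lx != NULL);
    lx->ival = (int)op;   /* the grammar uses this to pick the op type */
    lx->expect = next;    /* e.g. X_TERM: an expression comes next */
    return TK_LSTOP;
}
```

The point of the design is that the grammar needs only one rule for "list operator followed by arguments"; which operator it is travels along as data rather than as a distinct token.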

What about the ... chunk? It stands on its own; it’s a complete statement. In that case, the lexer consumes the input and produces the YADAYADA token for the parser to process appropriately. Here’s the relevant part of the patch for the parser:

diff --git a/perly.y b/perly.y
index ad7b552..22790f9 100644
--- a/perly.y
+++ b/perly.y
@@ -72,7 +72,7 @@
 %token  FORMAT SUB ANONSUB PACKAGE USE
 %token  WHILE UNTIL IF UNLESS ELSE ELSIF CONTINUE FOR
 %token  GIVEN WHEN DEFAULT
-%token  LOOPEX DOTDOT
+%token  LOOPEX DOTDOT YADAYADA
 %token  FUNC0 FUNC1 FUNC UNIOP LSTOP
 %token  RELOP EQOP MULOP ADDOP
 %token  DOLSHARP DO HASHBRACK NOAMP
@@ -106,7 +106,7 @@
 %left  ','
 %right  ASSIGNOP
 %right  '?' ':'
-%nonassoc DOTDOT
+%nonassoc DOTDOT YADAYADA
 %left  OROR DORDOR
 %left  ANDAND
 %left  BITOROP
@@ -1227,6 +1227,11 @@ term     :       termbinop
                        }
        |       WORD
        |       listop
+       |       YADAYADA
+                       {
+                         $$ = newLISTOP(OP_DIE, 0, newOP(OP_PUSHMARK, 0),
+                               newSVOP(OP_CONST, 0, newSVpvs("Unimplemented")));
+                       }
        ;
 
 /* "my" declarations, with optional attributes */

The first two patch chunks register the YADAYADA token as a valid token and give it no particular associativity. The last chunk is more interesting; it’s a new branch of the term rule in the grammar. As far as the grammar is concerned, wherever a term is valid in Perl 5, ... is valid, though the lexer only produces the YADAYADA token where it expects a statement.

Like the LOP macro, all that’s necessary here is to produce a valid branch of the optree consisting of the die op with two children: a pushmark op and a string (in SvPV form). That’s exactly what this code does.
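If the nesting of constructor calls in that grammar action is hard to picture, the following sketch builds the same shape with a deliberately tiny, hypothetical node type. The struct op, new_op, new_list_op, and exec_order names are mine, and real Perl ops carry far more state than this. A post-order walk over the toy tree visits pushmark, then the constant, then die, which matches the numbering in the B::Concise output shown earlier; Perl itself threads ops together for execution rather than walking the tree this way.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified op node. */
enum op_type { OP_T_PUSHMARK, OP_T_CONST, OP_T_DIE };

struct op {
    enum op_type type;
    const char *pv;       /* string payload, used only by OP_T_CONST */
    struct op *first;     /* first child, if any */
    struct op *sibling;   /* next child of the same parent */
};

static struct op *new_op(enum op_type type, const char *pv)
{
    struct op *o = calloc(1, sizeof *o);
    assert(o != NULL);
    o->type = type;
    o->pv = pv;
    return o;
}

/* Build a list op whose children are `first` followed by `last`. */
static struct op *new_list_op(enum op_type type, struct op *first,
                              struct op *last)
{
    struct op *o = new_op(type, NULL);
    o->first = first;
    first->sibling = last;
    return o;
}

/* The branch built for "...": die(pushmark, const "Unimplemented"). */
static struct op *build_yadayada(void)
{
    return new_list_op(OP_T_DIE,
                       new_op(OP_T_PUSHMARK, NULL),
                       new_op(OP_T_CONST, "Unimplemented"));
}

/* Post-order walk: children first, then the parent. Appends op types
 * to out starting at index n and returns the new count. */
static size_t exec_order(const struct op *o, enum op_type *out, size_t n)
{
    for (const struct op *kid = o->first; kid != NULL; kid = kid->sibling)
        n = exec_order(kid, out, n);
    out[n] = o->type;
    return n + 1;
}

/* Release a node, its children, and any siblings that follow it. */
static void free_ops(struct op *o)
{
    while (o != NULL) {
        struct op *next = o->sibling;
        free_ops(o->first);
        free(o);
        o = next;
    }
}
```

Calling build_yadayada() and passing the result to exec_order with a three-element array fills it with the pushmark, const, and die types in that order.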

The rest of the patch is tests, documentation, and changes to generated files. It took eighteen lines of code (being very generous about whitespace and braces counting as lines) to add three new features to Perl 5. Not all features are this easy to add, and I’ve certainly written prettier code, but this was a modest task for a Friday evening, and a relatively easy way to demonstrate how a VM-hosted programming language works at the parsing level.