Yesterday’s How to Patch Perl 5 explained the big picture of how to add a new feature to a dynamic language with a virtual machine. Now it’s time to discuss the technical details.
The TODO item suggested that the implementation was probably as simple as translating ... into the Perl 5 code die "Unimplemented";. That’s eminently possible through a source filter such as Filter::Simple (think macros for Perl 5), but I wanted something better.
As I suggested yesterday, there are two steps. The first is convincing the Perl parser to recognize this syntactic construct in statement form, and the second is generating the correct ops to perform the desired behavior. Recognizing a syntactic construct is the job of the lexer, and here’s the relevant portion of the patch:
diff --git a/toke.c b/toke.c
index 431938f..fecbf9f 100644
--- a/toke.c
+++ b/toke.c
@@ -368,6 +368,7 @@ static struct debug_tokens {
{ WHEN, TOKENTYPE_IVAL, "WHEN" },
{ WHILE, TOKENTYPE_IVAL, "WHILE" },
{ WORD, TOKENTYPE_OPVAL, "WORD" },
+ { YADAYADA, TOKENTYPE_IVAL, "YADAYADA" },
{ 0, TOKENTYPE_NONE, NULL }
};
@@ -4774,6 +4775,10 @@ Perl_yylex(pTHX)
pl_yylval.ival = 0;
OPERATOR(ASSIGNOP);
case '!':
+ if (PL_expect == XSTATE && s[1] == '!' && s[2] == '!') {
+ s += 3;
+ LOP(OP_DIE,XTERM);
+ }
s++;
{
const char tmp = *s++;
@@ -5025,10 +5030,14 @@ Perl_yylex(pTHX)
AOPERATOR(DORDOR);
}
case '?': /* may either be conditional or pattern */
- if(PL_expect == XOPERATOR) {
+ if (PL_expect == XSTATE && s[1] == '?' && s[2] == '?') {
+ s += 3;
+ LOP(OP_WARN,XTERM);
+ }
+ if (PL_expect == XOPERATOR) {
char tmp = *s++;
if(tmp == '?') {
- OPERATOR('?');
+ OPERATOR('?');
}
else {
tmp = *s++;
@@ -5067,6 +5076,10 @@ Perl_yylex(pTHX)
PL_expect = XSTATE;
goto rightbracket;
}
+ if (PL_expect == XSTATE && s[1] == '.' && s[2] == '.') {
+ s += 3;
+ OPERATOR(YADAYADA);
+ }
if (PL_expect == XOPERATOR || !isDIGIT(s[1])) {
char tmp = *s++;
if (*s == tmp) {
This lexer is a state machine with a little bit of lookahead. I added a new token type for the lexer to pass to the parser — the YADAYADA token. This isn’t always necessary, but it does add a small bit of self-documentation to the process.
As you might expect, the lexer walks through a program’s source code one
character at a time, and there’s an enormous switch statement at its heart.
The first two chunks of lexing code add support for !!! and
??? respectively. It should be easy to see how they look for
three exclamation points or question marks in a row. PL_expect
records what kind of construct the parser expects next, so these checks fire
only where the Perl grammar expects a statement; the lookahead itself is the
peek at s[1] and s[2].
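To make the shape of that logic concrete, here is a minimal sketch of a switch-driven lexer with an "expect" state and a three-character peek. It is a toy, not code from toke.c: the Lexer struct, the Expect and Token enums, and both function names are made up for illustration, and the real macros do considerably more bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; none of them come from toke.c. */
typedef enum { EXPECT_STATE, EXPECT_TERM, EXPECT_OPERATOR } Expect;
typedef enum { TOK_YADAYADA, TOK_DIE, TOK_WARN, TOK_OTHER, TOK_EOF } Token;

typedef struct {
    const char *s;   /* current position in the source text */
    Expect expect;   /* what the grammar expects to see next */
} Lexer;

/* Nonzero if the next three characters are all `c`.  The && chain stops
   at the first mismatch, so it never reads past the terminating NUL. */
int three_in_a_row(const char *s, char c)
{
    return s[0] == c && s[1] == c && s[2] == c;
}

Token next_token(Lexer *lx)
{
    assert(lx != NULL && lx->s != NULL);
    while (*lx->s == ' ')
        lx->s++;

    int at_statement = (lx->expect == EXPECT_STATE);

    switch (*lx->s) {
    case '\0':
        return TOK_EOF;
    case '!':
        if (at_statement && three_in_a_row(lx->s, '!')) {
            lx->s += 3;
            lx->expect = EXPECT_TERM;      /* a message term follows */
            return TOK_DIE;
        }
        break;
    case '?':
        if (at_statement && three_in_a_row(lx->s, '?')) {
            lx->s += 3;
            lx->expect = EXPECT_TERM;      /* a message term follows */
            return TOK_WARN;
        }
        break;
    case '.':
        if (at_statement && three_in_a_row(lx->s, '.')) {
            lx->s += 3;
            lx->expect = EXPECT_OPERATOR;  /* the statement is complete */
            return TOK_YADAYADA;
        }
        break;
    default:
        break;
    }

    /* Everything else: consume one character.  A semicolon puts the
       toy lexer back into "expecting a statement" state. */
    lx->expect = (*lx->s == ';') ? EXPECT_STATE : EXPECT_OPERATOR;
    lx->s++;
    return TOK_OTHER;
}
```

Feeding this a Lexer that starts in EXPECT_STATE on the text "..." yields TOK_YADAYADA, while the same three dots after another token (as in "x ...") come back as ordinary characters, because the expect state no longer says "statement".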
The LOP macro is a little confusing. This ties directly in
with how Perl represents programs internally.
I said before that ... is equivalent to die
"Unimplemented";. Similarly, !!! $some_message; is
equivalent to die $some_message; and ???
$another_message; to warn $another_message;. This means
that the optree produced by these new operators must be the same as the optree
produced by their equivalent forms.
This is easy to discover in Perl 5:
$ perl -MO=Concise -e 'die "Unimplemented"'
6 <@> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v ->3
5 <@> die[t1] vK/1 ->6
3 <0> pushmark s ->4
4 <$> const[PV "Unimplemented"] s ->5
-e syntax OK
This probably looks like gibberish to you, but it’s a textual representation
of Perl’s optree. The numbers correspond to execution order, and the bracketed
symbols denote the type of op. What’s most important here is the
die op. The leading @ means that it’s a list op, and
it obviously has two children, a stack-manipulating pushmark op
(Perl 5 is a stack-based VM) and a constant string (the SvPV type).
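If the tree-versus-execution-order distinction is hard to picture, the following sketch models only the die branch of that output as a tiny C tree and walks it children-first. The Op struct here is a hypothetical stand-in and far simpler than Perl's real op structures; build_die and exec_order are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A hypothetical, heavily simplified op node; not Perl's actual layout. */
typedef struct Op {
    const char *name;          /* "die", "pushmark", "const", ... */
    const char *value;         /* payload for a const op, otherwise NULL */
    struct Op *first_child;
    struct Op *next_sibling;
} Op;

/* Fill three caller-provided nodes with the die branch and return its
   root: a list op whose children are a pushmark and a constant string. */
Op *build_die(Op nodes[3], const char *message)
{
    assert(nodes != NULL);
    nodes[0] = (Op){ "pushmark", NULL, NULL, &nodes[1] };
    nodes[1] = (Op){ "const", message, NULL, NULL };
    nodes[2] = (Op){ "die", NULL, &nodes[0], NULL };
    return &nodes[2];
}

/* Record op names in execution order: every child runs before its
   parent.  `n` is how many slots are already used; the return value
   is the new count. */
size_t exec_order(const Op *op, const char **out, size_t n, size_t cap)
{
    for (const Op *kid = op->first_child; kid != NULL; kid = kid->next_sibling)
        n = exec_order(kid, out, n, cap);
    if (n < cap)
        out[n] = op->name;
    return n + 1;
}
```

Calling exec_order on the root returned by build_die visits pushmark, then const, then die, which matches the relative order of ops 3, 4, and 5 in the Concise listing.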
The LOP macro in the tokenizer produces a list op of the given
type (you can figure out what OP_WARN and OP_DIE
represent) and expects a following term — a string or variable or expression
which evaluates to a term. That’s all. The rest of the parsing process grafts
this branch into the optree properly, and everything works as expected.
What about the ... chunk? It stands on its own; it’s a
complete statement. In that case, the lexer consumes the input and produces
the YADAYADA token for the parser to process appropriately.
Here’s the relevant part of the patch for the parser:
diff --git a/perly.y b/perly.y
index ad7b552..22790f9 100644
--- a/perly.y
+++ b/perly.y
@@ -72,7 +72,7 @@
%token FORMAT SUB ANONSUB PACKAGE USE
%token WHILE UNTIL IF UNLESS ELSE ELSIF CONTINUE FOR
%token GIVEN WHEN DEFAULT
-%token LOOPEX DOTDOT
+%token LOOPEX DOTDOT YADAYADA
%token FUNC0 FUNC1 FUNC UNIOP LSTOP
%token RELOP EQOP MULOP ADDOP
%token DOLSHARP DO HASHBRACK NOAMP
@@ -106,7 +106,7 @@
%left ','
%right ASSIGNOP
%right '?' ':'
-%nonassoc DOTDOT
+%nonassoc DOTDOT YADAYADA
%left OROR DORDOR
%left ANDAND
%left BITOROP
@@ -1227,6 +1227,11 @@ term : termbinop
}
| WORD
| listop
+ | YADAYADA
+ {
+ $$ = newLISTOP(OP_DIE, 0, newOP(OP_PUSHMARK, 0),
+ newSVOP(OP_CONST, 0, newSVpvs("Unimplemented")));
+ }
;
/* "my" declarations, with optional attributes */
The first two patch chunks register the YADAYADA token as a
valid token and give it no particular associativity. The last chunk is more
interesting; it’s a new branch of the term rule in the grammar.
Wherever a term is valid in the Perl 5 grammar, ... is valid.
Like the LOP macro, all that’s necessary here is to produce a
valid branch of the optree consisting of the die op with two
children: a pushmark op and a string (in SvPV form). That’s
exactly what this code does.
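The claim that the new syntax must build the same tree as its long form can be checked mechanically in a toy model. The sketch below assumes a made-up Node type and two made-up "grammar actions"; it only illustrates the idea of comparing the two branches structurally and does not use any of Perl's real constructors.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tree node for illustration; not a Perl data structure. */
typedef struct Node {
    const char *type;        /* "die", "pushmark", "const" */
    const char *text;        /* string payload, or NULL */
    struct Node *kids[2];
} Node;

Node *new_node(const char *type, const char *text, Node *first, Node *second)
{
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->type = type;
    n->text = text;
    n->kids[0] = first;
    n->kids[1] = second;
    return n;
}

/* Toy action for the long form: die <message>; */
Node *action_die(const char *message)
{
    Node *mark = new_node("pushmark", NULL, NULL, NULL);
    Node *msg = new_node("const", message, NULL, NULL);
    return new_node("die", NULL, mark, msg);
}

/* Toy action for `...`: the source has no operands, so the action
   supplies the message itself, mirroring the shape of the patch. */
Node *action_yadayada(void)
{
    return new_node("die", NULL,
                    new_node("pushmark", NULL, NULL, NULL),
                    new_node("const", "Unimplemented", NULL, NULL));
}

/* Structural equality: same types, same payloads, same children. */
int same_tree(const Node *a, const Node *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    if (strcmp(a->type, b->type) != 0)
        return 0;
    if ((a->text == NULL) != (b->text == NULL))
        return 0;
    if (a->text != NULL && strcmp(a->text, b->text) != 0)
        return 0;
    return same_tree(a->kids[0], b->kids[0])
        && same_tree(a->kids[1], b->kids[1]);
}

void free_tree(Node *n)
{
    if (n == NULL)
        return;
    free_tree(n->kids[0]);
    free_tree(n->kids[1]);
    free(n);
}
```

In this model, same_tree(action_yadayada(), action_die("Unimplemented")) is true, and it becomes false as soon as the message differs. That is the same check you can do by eye with two runs of B::Concise.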
The rest of the patch is tests, documentation, and changes to generated files. It took eighteen lines of code (being very generous about whitespace and braces counting as lines) to add three new features to Perl 5. Not all features are this easy to add, and I’ve certainly written prettier code, but this was a modest task for a Friday evening, and a relatively easy way to demonstrate how a VM-hosted programming language works at the parsing level.

Will toke.c understand Unicode? I once tried to get Ruby to understand "≡" as a bona fide operator (instead of just a method), but I couldn't make it work.
@Daniel, it does, though I believe you have to use the utf8 pragma to notify the parser to interpret characters with high bits set as UTF-8. I've just browsed toke.c and can't explain what's going on to my satisfaction. See lambda.pm on the CPAN, for example.