I believe that a programming language should never crash, even given bad input. There may be cases where it reports obscure syntax errors that are difficult to understand, but crashing is unacceptable.

One way to make sure that there are no crashes is to feed your parser as much invalid input as you can imagine and check that you only ever get syntax errors. (I suppose another way is to write formal proofs for your parser, but even then you may have bugs in your implementation.)

To do that, you need a large corpus of valid programs and a way to generate a large corpus of mostly-valid programs that aren’t quite right.

I installed Algorithm::MarkovChain and set to work.

Valid Parrot Programs

The Parrot distribution includes several libraries and example programs written in its native PIR language. As well, most of the test suite uses PIR code internally. After you run the test suite, the t/ directory will contain several additional files with the .pir extension.

I decided to use all of this code together as the initial corpus to seed the Markov module, so I merged it all into one file:

  $ cat $( find . -name '*.pir' ) > train_markov.pir

I realize that there may be better ways to write that code, but it was easier to write it once than to find the optimal approach.
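One such alternative, as a rough sketch only, uses the core File::Find module. The helper name build_corpus and its arguments are made up for illustration. Its one advantage over the shell version is that it skips the output file, so that running it a second time does not feed the old corpus back into the new one.

```perl
use strict;
use warnings;

use File::Find;
use File::Spec;

# build_corpus() is a hypothetical helper, not part of Parrot or any module.
# It concatenates every *.pir file under $root into $out, skipping $out itself,
# and returns the number of files it merged.
sub build_corpus
{
    my ($root, $out) = @_;
    my $out_abs      = File::Spec->rel2abs( $out );

    my @files;
    find(
        {
            no_chdir => 1,    # keep $_ as the full path of each file
            wanted   => sub {
                return unless -f $_ && /\.pir\z/;
                return if File::Spec->rel2abs( $_ ) eq $out_abs;
                push @files, $_;
            },
        },
        $root
    );

    open( my $out_fh, '>', $out ) or die "Cannot write '$out': $!\n";

    for my $file ( sort @files )
    {
        open( my $in_fh, '<', $file ) or die "Cannot read '$file': $!\n";
        while ( my $line = <$in_fh> )
        {
            print {$out_fh} $line;
        }
        close $in_fh;
    }

    close $out_fh or die "Cannot close '$out': $!\n";
    return scalar @files;
}
```

Calling build_corpus( '.', 'train_markov.pir' ) from the top of the Parrot checkout should produce roughly the same file as the one-liner, with the files merged in sorted order.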

Potentially Interesting PIR Programs

Algorithm::MarkovChain is easy to use. Create an object, seed it with a list of valid symbols, and ask it to spew out some results.

  use Algorithm::MarkovChain;

  open( my $fh, '<', 'train_markov.pir' ) or die "Cannot read file: $!\n";
  chomp(my @symbols = <$fh>);

  my $chain = Algorithm::MarkovChain->new();
  $chain->seed( symbols => \@symbols );

  my $prog  = ".sub main\n" . join( "\n", $chain->spew() ) . "\n.end\n";

I added the standard PIR subroutine start and end tokens to keep Parrot from reporting the first and most obvious syntax error: code outside of a subroutine.
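To see why the output of such a generator is mostly valid rather than pure noise, here is a toy first-order Markov chain over lines. It is a sketch only, and it is not how Algorithm::MarkovChain is implemented; train() and generate() are hypothetical names, and the tiny corpus is invented. Every generated line comes from the corpus, and every adjacent pair of lines appeared next to each other somewhere in the corpus, but the sequence as a whole may never have existed.

```perl
use strict;
use warnings;

# A toy first-order Markov chain over lines. This is NOT the implementation
# of Algorithm::MarkovChain; train() and generate() are hypothetical names
# used only for this illustration.

# Build a table mapping each symbol to every symbol that directly followed it.
sub train
{
    my @symbols = @_;
    my %next;

    for my $i ( 0 .. $#symbols - 1 )
    {
        push @{ $next{ $symbols[$i] } }, $symbols[ $i + 1 ];
    }

    return \%next;
}

# Walk the table from $start, choosing a random follower at each step,
# until there are $length symbols or the current symbol has no followers.
sub generate
{
    my ($next, $start, $length) = @_;
    my @out = ($start);

    while ( @out < $length )
    {
        my $followers = $next->{ $out[-1] } or last;    # dead end: stop early
        push @out, $followers->[ rand @$followers ];
    }

    return @out;
}

# an invented corpus; single quotes keep Perl from interpolating $I0 and $I1
my @corpus = (
    '$I0 = 1',
    '$I1 = $I0 + 1',
    'print $I1',
    '$I0 = 1',
    'print $I0',
);

my $model = train( @corpus );
print join( "\n", generate( $model, '$I0 = 1', 4 ) ), "\n";
```

Because duplicated lines appear several times in a follower list, common transitions are chosen more often, which is the property that makes a large corpus of real programs a useful seed.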

Automatically Confusing the Bird

That gave me one possibly-invalid PIR program. I wanted to find as many as possible, but only keep those that actually crashed Parrot. I decided to take the generated program and actually run it through Parrot, then look for a normal syntax error message. Without that, I can assume that the generated program is interesting in some sense and can keep it to debug the program and add to the test suite. I used IPC::Open3 to manage this process.
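The absence of a syntax error message is an indirect signal. A complementary check, which the program in this post does not perform, is to inspect the child's wait status: on Unix-like systems the low seven bits of $? hold the number of the signal that killed the process, if any. The sketch below assumes that convention; classify_run() is a hypothetical helper, and the message pattern is the same one the full program matches against.

```perl
use strict;
use warnings;

# classify_run() is a hypothetical helper for this sketch. $status is a wait
# status as found in $? after waitpid(); $output is whatever the child printed.
sub classify_run
{
    my ($status, $output) = @_;

    # the low seven bits are non-zero when a signal ended the child
    my $signal = $status & 127;
    return "signal $signal" if $signal;

    # an ordinary, well-behaved rejection of bad input
    return 'syntax error'
        if defined $output && $output =~ /imcc:syntax error/;

    # neither killed nor rejected: worth a look
    return 'interesting';
}
```

A status whose signal bits are set would be the strongest evidence of a real crash, so a variation on the fuzzer could keep those programs even when some error text was printed first.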

I had two options. I could feed the program to Parrot on its standard input, or I could write the program to a temporary file and invoke Parrot on it normally. The second option seemed conceptually simpler, so I used File::Temp to generate a temporary file. If that PIR program does something interesting, I use File::Copy to copy the program to a file I can investigate later.
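The temporary-file approach comes down to two steps: write the text out, then run a command on the file and capture what it prints. Here is a minimal sketch of those steps with File::Temp and IPC::Open3. Both helper names are hypothetical, and the usage at the bottom runs perl itself (through $^X) as a stand-in, on the assumption that Parrot may not be installed wherever this is tried.

```perl
use strict;
use warnings;

use File::Temp 'tempfile';
use IPC::Open3;

# Both helpers are hypothetical names for this sketch.

# Write $text to a fresh temporary file and return the file's name.
sub write_temp_program
{
    my ($text)      = @_;
    my ($fh, $file) = tempfile();

    print {$fh} $text;
    close $fh or die "Cannot close '$file': $!\n";

    return $file;
}

# Run @command, returning its combined STDOUT/STDERR and its wait status.
sub run_and_capture
{
    my @command = @_;

    # a false third argument puts the child's STDERR on the same handle as STDOUT
    my $pid = open3( my $writer, my $reader, undef, @command );
    close $writer;

    # read everything before reaping, so the child never blocks on a full pipe
    my $output = do { local $/; <$reader> };
    waitpid $pid, 0;

    return ( defined $output ? $output : '', $? );
}

# Stand-in usage: perl runs the file here; with Parrot installed the call
# would be run_and_capture( 'parrot', $file ) instead.
my $file              = write_temp_program( qq{print "hello\\n";\n} );
my ($output, $status) = run_and_capture( $^X, $file );
unlink $file;
```

Reading the output before calling waitpid matters only when the child prints a lot, but it costs nothing and avoids a hang that is hard to diagnose in a program meant to run unattended.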

I also decided to let the program run indefinitely, with a SIGINT handler to stop the loop cleanly. I also added a one-second sleep() after running Parrot and waiting for it to exit, so that I can run this in the background and other programs in the foreground.

After a few thousand explorations, I didn’t find a single interesting PIR program. That improved my confidence somewhat. It might be worth culling the corpus for the initial seed to avoid the problem of several similar programs, and it might be worth tuning the generator to provide longer or more divergent programs. Still, it took half an hour (including installing the module and figuring out how to use it) to start this project and get useful results, even if those results are, so far, “No crashes.”

Here’s the full program.

  #! perl

  use strict;
  use warnings;

  use File::Copy;
  use File::Temp 'tempfile';

  use IPC::Open3;
  use Algorithm::MarkovChain;

  open( my $fh, '<', 'train_markov.pir' ) or die "Cannot read file: $!\n";
  chomp(my @symbols = <$fh>);

  my $running = 1;
  $SIG{INT}   = sub { $running = 0 };

  my $chain   = Algorithm::MarkovChain->new();
  $chain->seed( symbols => \@symbols );

  while ( $running )
  {
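      my $prog         = ".sub main\n" . join( "\n", $chain->spew() ) . "\n.end\n";
      my ($tmp, $file) = tempfile();

      # write the generated program out so that Parrot has something to parse
      print {$tmp} $prog;
      close $tmp or die "Cannot write '$file': $!\n";

      # undef as the error handle merges Parrot's STDERR into $reader
      my $pid          = open3( my ($writer, $reader), undef, 'parrot', $file );
      close $writer;

      # read before waitpid so that a full pipe cannot block the child
      my $err          = join '', <$reader>;
      waitpid $pid, 0;

      sleep( 1 );

      # an ordinary syntax error is uninteresting; discard the program
      if ( $err =~ /imcc:syntax error/ )
      {
          unlink $file;
          next;
      }

      my $dest = 'markov_' . time() . '.pir';
      copy( $file, $dest ) or die "Cannot copy '$file' to '$dest': $!\n$prog\n";
      unlink $file;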
  }