Building Recursive Descent Parsers with Pythonby Paul McGuire
What is "parsing"? Parsing is processing a series of symbols to extract their meaning. Typically, this means reading the words of a sentence and drawing information from them. When application programs need to process data that is provided as text, they must use some form of parsing logic. This logic scans the text characters and character groups (words) and recognizes patterns of groups to extract the underlying commands or information.
Software parsers are usually special-purpose programs, built to process a specific form of text. This text could be a set of encoded notations on insurance or medical forms; function declarations in a C header file; node-edge descriptions showing interconnections of a graph; HTML tags in a web page; or interactive commands to configure a network, modify or rotate a 3D image, or navigate through an adventure game. In each case, the parsers process a specific set of character groups and patterns. This set of patterns is the parser's grammar.
For instance, when parsing the string
Hello, World! you might wish to parse any greeting that follows its general pattern. Hello, World! begins with a salutation: the word Hello. There are many salutations--Howdy, Greetings, Aloha, G'day, and so on--so you could define a very narrow greeting grammar to begin with a single-word salutation. A comma character follows, and then comes the object of the greeting, the greetee, also a single word. Lastly, some form of terminating punctuation ends the greeting, such as an exclamation point. This greeting grammar looks roughly like this (reading
:: as "is composed of"):
word :: group of alphabetic characters salutation :: word comma :: "," greetee :: word endPunctuation :: "!" greeting :: salutation comma greetee endPunctuation
This is the Backus-Naur form (BNF). There are various dialects for symbols representing optional/mandatory constructs, repetition, alternation, and so on.
Once you have specified the target grammar in a BNF, you must then convert it to an executable form. A common technique is to develop a recursive-descent parser, one that defines functions that read individual terminal constructs of the grammar, and then higher-level functions that call the lower-level functions. Functions can return success or failure if they match at the current parsing position (or return matching tokens for success and raise exceptions for failure).
What Is Pyparsing?
Pyparsing is a Python class library that helps you to quickly and easily create recursive-descent parsers. Here is the pyparsing implementation of the
Hello, World! example:
from pyparsing import Word, Literal, alphas salutation = Word( alphas + "'" ) comma = Literal(",") greetee = Word( alphas ) endPunctuation = Literal("!") greeting = salutation + comma + greetee + endPunctuation
Several pyparsing features help developers create their text-parsing functions quickly:
- Grammars are native Python, so no separate grammar definition files are required
- No special syntax is required, outside
^for Or (longest, or "greedy" match),
|for MatchFirst (first match), and
- No separate code-generation step is required
- It implicitly skips white space and comments that may appear between parse elements; there is no need to clutter your grammar with markers for ignorable text
The pyparsing grammar shown will parse not only
Hello, World! but also any of:
Listing 1 contains a full
Hello, World! parser, including output of the parsed results.
from pyparsing import Word, Literal, alphas salutation = Word( alphas + "'" ) comma = Literal(",") greetee = Word( alphas ) endPunctuation = Literal("!") greeting = salutation + comma + greetee + endPunctuation tests = ("Hello, World!", "Hey, Jude!", "Hi, Mom!", "G'day, Mate!", "Yo, Adrian!", "Howdy, Pardner!", "Whattup, Dude!" ) for t in tests: print t, "->", greeting.parseString(t) =========================== Hello, World! -> ['Hello', ',', 'World', '!'] Hey, Jude! -> ['Hey', ',', 'Jude', '!'] Hi, Mom! -> ['Hi', ',', 'Mom', '!'] G'day, Mate! -> ["G'day", ',', 'Mate', '!'] Yo, Adrian! -> ['Yo', ',', 'Adrian', '!'] Howdy, Pardner! -> ['Howdy', ',', 'Pardner', '!'] Whattup, Dude! -> ['Whattup', ',', 'Dude', '!']