What's New with Regular Expressions

New Engines for Old Languages

Between the first and second editions, both Python and Tcl got completely new regular-expression engines. They started a trend that has grown in popularity and continues today: being "Perl5 Compatible."



The Spread of Perl's Regex Flavor

Perl version 5 broke new ground with innovative regex features so useful that they were not lost on the developers of other languages. Soon other languages had "Perl5 Compatible" regex engines, supporting such features as lazy quantifiers, non-capturing parentheses, inline mode modifiers, lookahead, and a free-spacing mode.
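
As a rough illustration of the five features just listed, here is how they look in Python's re module, one of the Perl5-style engines. The sample strings and variable names are made up for this sketch, and other engines may differ in the details.

```python
import re

text = "<b>bold</b> and <b>more</b>"

# Lazy quantifier: .*? stops at the first closing tag; greedy .* runs to the last.
lazy = re.findall(r"<b>(.*?)</b>", text)
greedy = re.findall(r"<b>(.*)</b>", text)

# Non-capturing parentheses: (?:...) groups without creating a numbered group,
# so the surname is still group 1.
m = re.match(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace")
surname = m.group(1)

# Inline mode modifier: (?i) turns on case-insensitive matching from inside the regex.
inline = re.findall(r"(?i)perl", "Perl, PERL, perl")

# Lookahead: (?=...) checks what follows without consuming it.
prices = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

# Free-spacing mode: re.VERBOSE ignores whitespace and allows comments in the pattern.
date = re.compile(r"""
    (\d{4}) -   # year
    (\d{2}) -   # month
    (\d{2})     # day
""", re.VERBOSE)
parts = date.match("2002-07-15").groups()
```

In Python the free-spacing mode is requested with the re.VERBOSE flag; Perl spells the same idea as the /x modifier.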

In the first edition of the book, these features were covered only in the Perl-specific chapter. By 2002, though, they could be found in Perl, Python, PHP, Ruby, the .NET Framework, several Java packages, Tcl, and more, so they are now given "first class" treatment in the main body of the book.

Of course, none of the new implementations exactly mimicked Perl, sometimes for better, sometimes for worse. And at the same time, Perl itself evolved, adding new features and modifying how old ones worked. This only exacerbates the problem that "Perl5 Compatible" really ends up meaning "has at least a few features that were introduced in Perl version 5." Just which features are supported, and how, remain important issues. The semantics of how a particular new construct works in such-and-such a situation may not seem important on paper, but they become paramount if you actually want to use the construct.

The second edition, in moving the discussion of these new issues from the Perl-specific chapter to the general discussion, highlights and explores the differences found in the wild so that you can turn the particulars of the implementation you use to your advantage.

New Features in Perl (and Others!)

As I mentioned, over the years Perl has changed as well. When the first edition came out, Perl 5.003 had just been released. The second edition covers Perl 5.8. Major new features include lookbehind, atomic grouping, conditionals, embedded code, regex objects, regex overloading, the /gc modifier, Unicode-related features, and the special variables $^N, @-, and @+.

It took just one sentence to list the major new features, but they contain a lot of power. Some of the more advanced languages have picked up on some of these features, which means that there's a lot of variation in semantics. For example, among the regex implementations covered in the second edition, there are three different sets of semantics for what lookbehind can be applied to. (Did you know that you can do much more with .NET lookbehind than with Perl lookbehind? In fact, Perl's lookbehind is the most restrictive of them all.)
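
To make the point about lookbehind restrictions concrete, here is a small sketch using Python's re module, chosen only as a convenient example of an engine that limits what lookbehind can contain. The patterns and sample text are invented for illustration, and the sketch does not show how the Perl or .NET rules differ.

```python
import re

# A lookbehind whose contents have one fixed length is accepted.
fixed = re.findall(r"(?<=USD )\d+", "USD 10, EUR 20, USD 30")

# A lookbehind that could match a varying amount of text (\s+ here) is
# rejected by Python's re when the pattern is compiled.
try:
    re.compile(r"(?<=USD\s+)\d+")
    variable_ok = True
except re.error:
    variable_ok = False
```

A pattern that compiles under one engine may therefore be a compile-time error under another, which is why it pays to check the lookbehind rules of the tool you actually use.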

Perl is not the only cradle of innovation. Sun's java.util.regex package introduces modes to help with Unicode matching and possessive quantifiers, which are related to atomic grouping. (The Sun engineers actually added possessive quantifiers after having seen my suggestion in the first edition that they would be useful -- and indeed, in practice, they are quite useful.) java.util.regex also allows you to do set operations (union, intersection, and subtraction) when specifying characters within a character class -- very nice! The .NET package even has an innovative "right-to-left" match mode, although one that's fraught with some uncertainty.
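
The effect that atomic grouping and possessive quantifiers share is that text, once matched, is never given back. The sketch below imitates that effect in Python with a lookahead plus a backreference rather than with native syntax. This emulation is a swapped-in technique chosen so the example runs on older Python versions (native support arrived in Python 3.11), and the group name `run` is made up for the example.

```python
import re

subject = "12345"

# Ordinary greedy quantifier: \d+ first takes all five digits, then gives
# one back so that the literal 5 can match.
greedy = re.search(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the whole digit run, the
# backreference consumes exactly that text, and nothing can be given back
# afterwards, so the trailing 5 never finds a digit left to match.
atomic = re.search(r"(?=(?P<run>\d+))(?P=run)5", subject)
```

Here `greedy` matches the whole string while `atomic` is None. Failing fast on input like this is the practical benefit: the engine skips backtracking that could never lead to a match.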

Tcl's regex engine is different from all others in that at times it acts as if it's been implemented with one type of technology, while at other times it acts completely differently. Of course, the second edition still covers in depth the workings and practical ramifications of the ways to implement a regex engine, but Tcl's engine is a hybrid with the best of both worlds.

Unicode

Unicode has been around for a long time, but it's now really starting to catch on and achieve critical mass. Unicode isn't a regex-specific concept, but it raises many issues important to the regular-expression user. Many, it seems, don't understand what "Unicode" really means, and they confuse it with "UTF-8," "UTF-16," and "UCS-2."

The second edition covers the background on Unicode with a concise treatment that demystifies it for the reader. Also discussed are the concepts of code points, combining characters, properties, blocks, scripts, and line terminators. It looks at the differences among the Unicode versions supported by various tools, Unicode-related regex issues, and how "rich" the regex support for Unicode really is. Does dot match a Unicode character, or a Unicode code point? Does \w match Unicode word-related characters? Does \w's idea of what a "word-related character" is match with \b's? How does a character class treat a Unicode character? Can you include Unicode characters within a class? Within the rest of the regex? Do ^ and $ understand Unicode line terminators? Does dot? These are just a few of the many questions the book both teaches you to pose, and answers.
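
A few of these questions can be put directly to an engine. The sketch below asks them of Python 3's re module, with sample strings invented for the purpose. The answers it gets apply to that module only, and other tools may well answer differently.

```python
import re
import unicodedata

# "e" followed by a combining acute accent: one visible character, two code points.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# Dot matches one code point at a time, so the decomposed form takes two dots.
dots_decomposed = len(re.findall(r".", decomposed))
dots_composed = len(re.findall(r".", composed))

# \w on str patterns is Unicode-aware by default; re.ASCII narrows it to [a-zA-Z0-9_].
sample = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", sample)
ascii_words = re.findall(r"\w+", sample, re.ASCII)

# With re.MULTILINE, ^ matches after "\n" but not after U+2028 LINE SEPARATOR.
starts = re.findall(r"^\w+", "one\u2028two\nthree", re.MULTILINE)
```

In this engine, then, dot works on code points, \w follows Unicode by default, and ^ does not treat the Unicode line separator as a line terminator.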

The Result

I was very pleased with how the first edition turned out, and the enthusiastic response it got has been gratifying, but with the passage of time, some of the particulars slowly moved out of date. There's no doubt that this will happen again over the next several years, but this second edition is much better suited to hold its own than the first edition ever was. I'm quite happy with how it turned out. Whether you program in Perl or Java or VB.NET or Python or PHP or C# or Ruby or any language with regular-expression support, I hope and believe that the second edition will provide you with a wealth of practical information and helpful examples.

O'Reilly & Associates recently released Mastering Regular Expressions, 2nd Edition.


Jeffrey E. F. Friedl did kernel development for Omron Corporation in Kyoto, Japan for eight years before moving in 1997 to Silicon Valley to apply his regular-expression know-how to financial news and data for a little-known company called "Yahoo!"