What's New with Regular Expressions
by Jeffrey E. F. Friedl, author of Mastering Regular Expressions, 2nd Edition
Not long after finishing the first edition of Mastering Regular Expressions in early 1997, I started to work for Yahoo, writing programs that processed and managed financial news and data. I worked the industry-standard 20-hour days, using regular expressions day in and day out to parse data feeds.
Yet despite this long-term, intensive use of regular expressions, my job didn't leave time for keeping up on how regular expressions evolved in the larger world beyond the one I spent my time in. So, when I started to work on the second edition of Mastering Regular Expressions and started refocusing on the field, I was rather shocked to find out how much had really changed. Originally, I'd naively thought that the second edition would require only a short update (perhaps three months), mostly consisting of adding HTML-related examples. In the end, it became almost a complete rewrite, taking two years.
Yes, the new edition now has many more HTML-related examples, but there's so much more than that. This article touches on some of the high-level changes between the first and second edition of the book.
Perhaps the largest and easiest-to-notice change is the new coverage of languages that have risen in prominence since the first edition.
Five years ago, there were no regular-expression packages for Java, but today there are many. Sun now even includes java.util.regex as of Java 1.4. Is it the best one? What others are popular? Which are good for what I want to do? What are the tradeoffs?
Java wasn't even mentioned in the first edition, but it receives a thorough treatment throughout the second, with its own chapter devoted to Java-specific issues. In it, I look at no fewer than seven different packages (including the popular Apache Regexp and Jakarta ORO packages), and help guide you in choosing which is best for you.
Whether you love Microsoft or hate it, there's no denying the popularity of Visual Basic. With the regular-expression package in the .NET Framework, Microsoft provides a package that can be used by VB.NET, C#, Visual C++, and any other language that wants to link to it -- even Python and Perl! The consistency is appealing, but even more important is the package itself: it's powerful and fast, and it can hold its head up high next to Perl or any other regex package out there.
Like any package, it has its good points and bad points, and its share of bugs. A full chapter on .NET-specific regex issues helps to clarify things, and helps to make up for the exceedingly poor documentation that Microsoft provides with the package.
Other languages touched on in the second edition that were not mentioned in the first include Ruby, PHP, and even procmail and MySQL.
Between the first and second edition, both Python and Tcl got completely new regular-expression engines. They started a trend that has increased in popularity and continues today: being "Perl5 Compatible."
Perl version 5 broke new ground with innovative regex features so useful that they were not lost on the developers of other languages. Soon other languages had "Perl5 Compatible" regex engines, supporting such features as lazy quantifiers, non-capturing parentheses, inline mode modifiers, lookahead, and a free-spacing mode.
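To make the feature list above concrete, here is a minimal sketch using Python's `re` module, one of the "Perl5 Compatible" engines. The sample strings and variable names are invented for illustration, and other engines may differ in the details.

```python
import re

html = "<b>bold</b> and <b>more</b>"

# Lazy quantifiers: .* grabs as much as it can, .*? as little as it can.
greedy = re.findall(r"<b>.*</b>", html)   # ['<b>bold</b> and <b>more</b>']
lazy = re.findall(r"<b>.*?</b>", html)    # ['<b>bold</b>', '<b>more</b>']

# Non-capturing parentheses: (?:...) groups the alternation without
# creating a capture, so only the surname is captured.
captures = re.match(r"(?:Mr|Ms)\. (\w+)", "Ms. Jones").groups()  # ('Jones',)

# Inline mode modifier: (?i) turns on case-insensitive matching
# from inside the regex itself.
inline_ci = re.match(r"(?i)perl", "PERL") is not None  # True

# Lookahead: (?=...) checks what follows without consuming it. Here a digit
# gets a comma when groups of exactly three digits run to the end.
with_commas = re.sub(r"(\d)(?=(?:\d{3})+$)", r"\1,", "1234567")  # '1,234,567'

# Free-spacing mode: whitespace and #-comments inside the regex are ignored.
iso_date = re.compile(r"""
    (\d{4}) -    # year
    (\d{2}) -    # month
    (\d{2})      # day
""", re.VERBOSE)
date_parts = iso_date.match("2002-07-15").groups()  # ('2002', '07', '15')
```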
In the first edition of the book, these features were covered only in the Perl-specific chapter. By 2002, though, these features could be found in Perl, Python, PHP, Ruby, the .NET Framework, several Java packages, Tcl, and more, so these features are now given "first class" treatment in the main body of the book.
Of course, none of the new implementations exactly mimicked Perl, sometimes for better, sometimes for worse. And at the same time, Perl itself evolved, adding new features and modifying how old ones worked. This only exacerbates the fact that "Perl5 Compatible" really ends up meaning "has at least a few features that were introduced in Perl version 5." Just which features are supported, and how, remain important issues. The semantics of how a particular new construct works in such-and-such a situation may not seem important on paper, but becomes paramount if you actually want to use the construct.
The second edition, in moving the discussion of these new issues from the Perl-specific chapter to the general discussion, highlights and explores the differences found in the wild so that you can turn the particulars of the implementation you use to your advantage.
As I mentioned, over the years Perl has changed as well. When the first edition came out, Perl 5.003 had just been released. The second edition covers Perl 5.8. Major new features include lookbehind, atomic grouping, conditionals, embedded code, regex objects, regex overloading, Unicode-related features, and new special variables.
It took just one sentence to list the major new features, but they contain a lot of power. Some of the more advanced languages have picked up on some of these features, which means that there's a lot of variation in semantics. For example, among the regex implementations covered in the second edition, there are three different sets of semantics for what lookbehind can be applied to. (Did you know that you can do much more with .NET lookbehind than with Perl lookbehind? In fact, Perl's lookbehind is the most restrictive of them all.)
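As one data point on how lookbehind restrictions show up in practice, the sketch below uses Python's `re` module, which accepts only fixed-width lookbehind. It illustrates one set of semantics, not the three the book compares, and it assumes current CPython behavior. The sample strings are invented.

```python
import re

# A fixed-width lookbehind is accepted: match digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost: $45, weight: 30")  # ['45']

# Alternatives are fine as long as every one has the same width.
amounts = re.findall(r"(?<=USD|EUR)\d+", "USD10 EUR20 JPY30")  # ['10', '20']

# A variable-width lookbehind is rejected when the regex is compiled.
try:
    re.compile(r"(?<=\$\s*)\d+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False  # this branch is taken by Python's re module
```

An engine with less restrictive lookbehind would accept the last pattern, so the same regex can be an error in one tool and perfectly valid in another.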
Perl is not the only cradle of innovation. Sun's java.util.regex package introduces modes to help with Unicode matching, as well as possessive quantifiers, which are related to atomic grouping. (The Sun engineers actually added possessive quantifiers after having seen my suggestion in the first edition that they would be useful -- and indeed, in practice, they are quite useful.) java.util.regex also allows you to do set operations when specifying characters within a character class -- very nice! The .NET package even has an innovative "right-to-left" match mode, although one that's fraught with a certain uncertainty.
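To show what possessive quantifiers and atomic grouping change, here is a rough Python sketch. Python's `re` gained native possessive quantifiers and atomic groups only in version 3.11, so this example emulates the effect with a lookahead plus a backreference, which works in older versions too. It is an emulation chosen for portability, not Sun's implementation. The second half approximates a Java class intersection such as `[a-z&&[^aeiou]]` with a lookahead, since `re` has no set-operation syntax.

```python
import re

# Ordinary greedy quantifier: \d+ first takes '123', then gives back the
# final digit so that the trailing \d can match.
backtracking = re.match(r"\d+\d", "123")
backtracking_text = backtracking.group(0) if backtracking else None  # '123'

# Emulated atomic group: the lookahead captures the longest digit run and
# \1 then consumes exactly that text. Nothing can be given back, so the
# trailing \d has no digit left to match and the overall match fails.
atomic = re.match(r"(?=(\d+))\1\d", "123")  # None

# Emulated class intersection: "a lowercase letter that is not a vowel".
consonants = re.findall(r"(?=[a-z])[^aeiou]", "regex 101")  # ['r', 'g', 'x']
```

Refusing to give characters back is what lets a possessive or atomic construct fail quickly instead of exploring backtracking paths that cannot succeed.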
Tcl's regex engine is different from all others in that at times it acts as if it's been implemented with one type of technology, while at other times it acts completely differently. Of course, the second edition still covers in depth the workings and practical ramifications of the ways to implement a regex engine, but Tcl's engine is a hybrid with the best of both worlds.
Unicode has been around for a long time, but it's now really starting to catch on and achieve a critical mass. Unicode isn't a regex-specific concept, but it has many issues important to the regular-expression user. Many, it seems, don't understand what "Unicode" really means, and they confuse it with "UTF-8," "UTF-16," and "UCS-2."
The second edition covers the background on Unicode with a concise treatment that demystifies it for the reader. Also discussed are the concepts of code points, combining characters, properties, blocks, scripts, and line terminators. It looks at the differences among the Unicode versions supported by various tools, Unicode-related regex issues, and how "rich" the regex support for Unicode really is. Does dot match a Unicode character, or a Unicode code point? Does \w match Unicode word-related characters? Does \w's idea of what a "word-related character" is match with \b's? How does a character class treat a Unicode character? Can you include Unicode characters within a class? Within the rest of the regex? Do ^ and $ understand Unicode line terminators? Does dot? These are just a few of the many questions the book both teaches you to pose, and to answer.
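A few of these questions can be put to an engine directly. The sketch below asks them of Python 3's `re` module with `str` patterns. The answers are specific to that engine and will differ in other tools, and the sample strings are invented.

```python
import re

text = "na\u00efve caf\u00e9"  # 'naïve café' with precomposed ï and é

# Does \w match Unicode word-related characters? By default, yes.
# With re.ASCII it falls back to [a-zA-Z0-9_].
unicode_words = re.findall(r"\w+", text)          # ['naïve', 'café']
ascii_words = re.findall(r"\w+", text, re.ASCII)  # ['na', 've', 'caf']

# Does dot match a character or a code point? Here, a code point:
# 'e' followed by a combining accent needs two dots.
composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT
dot_composed = re.fullmatch(r".", composed) is not None      # True
dot_decomposed = re.fullmatch(r".", decomposed) is not None  # False

# Do $ and dot understand Unicode line terminators? Here, no: only '\n'
# counts, so U+2028 LINE SEPARATOR does not end a line for $, and dot
# matches it like any other character.
two_lines = "one\u2028two"
line_ends = re.findall(r"\w+$", two_lines, re.MULTILINE)  # ['two']
dot_matches_ls = re.fullmatch(r".", "\u2028") is not None  # True
```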
I was very pleased with how the first edition turned out, and the enthusiastic response it got has been gratifying, but with the passage of time, some of the particulars slowly moved out of date. There's no doubt that this will happen again over the next several years, but this second edition is much better suited to hold up over time than the first edition ever was. I'm quite happy with how it turned out. Whether you program in Perl or Java or VB.NET or Python or PHP or C# or Ruby or any language with regular-expression support, I hope and believe that the second edition will provide you with a wealth of practical information and helpful examples.
O'Reilly & Associates recently released Mastering Regular Expressions, 2nd Edition.
Jeffrey E. F. Friedl did kernel development for Omron Corporation in Kyoto, Japan, for eight years before moving in 1997 to Silicon Valley to apply his regular-expression know-how to financial news and data for a little-known company called "Yahoo!"
Copyright © 2009 O'Reilly Media, Inc.