Subject: Re: the state of regex(3)
To: NetBSD Userlevel Technical Discussion List <tech-userlevel@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 09/28/2004 17:26:29
[ On Tuesday, September 28, 2004 at 17:38:21 (+0100), Alistair Crooks wrote: ]
> Subject: Re: the state of regex(3)
>
> With thanks to Greg for his benchmarking, which I've deleted, but is in
> the archive.
Well just to throw another monkey wrench into the mix I should point out
that since I made that post back in January I've discovered that a
version of Henry Spencer's more recent re-implementation of the regex(3)
library is available, for example in recent releases of TCL. It has a
very liberal copyright license (despite the fact that some folks,
e.g. even Jeffrey Friedl, have mistakenly assumed he's only released it
for use in TCL to date -- that's simply not true -- the only released
code is embedded in TCL, but it is separately licensed from TCL and it
can be extracted into a separate package by anyone with the programming
skills necessary to do so, and this has already been done by the
PostgreSQL folks, among others).
Henry's new library implements what he calls "_advanced_ REs". The new
library also supports POSIX Basic REs and POSIX Extended REs. His
documentation sasy that "POSIX EREs are almost an exact subset of AREs."
Most of the extensions in AREs are based on ideas from Perl5's RE
extensions, though it seems AREs are not quite compatible with full
Perl5 REs, nor with PCRE's implementation of Perl5 REs. AREs are in
some senses a superset of Perl5 REs, especially the metasyntax.
As for speed, well I've not yet done a careful benchmark of this new
library vs. anything else. Henry's new library has been described as a
hybrid DFA/NFA engine that dynamically uses the best algorithm for any
given RE. Henry posted the following note about its capabilities:
In general, the Tcl RE-matching engine is much less sensitive to the exact
form of the RE than traditional matching engines. Things that it does
quickly will be fast no matter how you write them; things that it does
slowly will be slow no matter how you write them. The old folklore about
hand-optimizing your REs simply does not apply.
(from <URL:http://groups.google.ca/groups?selm=Ft0Lqv.76o%40spsystems.net&output=gplain>)
In his book "Mastering Regular Expressions" Jeffrey Friedl goes so far
as to say "If this technology [Henry's hybrid DFA/NFA engine] were more
widespread, much of this chapter would not be needed." (referring to
chapter 6 "Crafting an Efficient Expression")
One thing I can say about what I've learned of PCRE performance over the
past six months or so is that it's extremely easy to accidentally cause
PCRE do do a lot of backtracking (especially on large strings) and not
always easy or sometimes even possible to rewrite the epxression to stop
or limit backtracking, at least not given my meager understanding of
PCRE optimisations.
Size-wize Henry's new code, including the locale-specific code added by
Scriptics Corp., is only ~9500 lines (PCRE-4.5 is over 13,300 lines),
though the i386 object compares as follows (PCRE has more internal
documentation :-):
$ size libregex.a
text data bss dec hex filename
160 0 0 160 a0 regfronts.o (ex libregex.a)
28 0 0 28 1c regfree.o (ex libregex.a)
7936 0 0 7936 1f00 regexec.o (ex libregex.a)
1056 288 0 1344 540 regerror.o (ex libregex.a)
32264 936 0 33200 81b0 regcomp.o (ex libregex.a)
=====
42668
$ size libpcre.a libpcreposix.a
text data bss dec hex filename
736 0 0 736 2e0 maketables.o (ex libpcre.a)
756 0 0 756 2f4 get.o (ex libpcre.a)
1856 0 0 1856 740 study.o (ex libpcre.a)
29796 1108 0 30904 78b8 pcre.o (ex libpcre.a)
3340 0 0 3340 d0c pcreposix.o (ex libpcreposix.a)
=====
37592
I guess I should ask Henry directly whether he's done any work recently
to create a proper separate release or not....
I should also update all of the regex codebases I've benchmarked, add
Henry's latest code to the list, and at least re-run my basic benchmark
again. (Apparently TRE has improved vastly since my last run too!)
BTW, Doug McIlroy wrote a regex(3) test harness (since updated and now
maintained by Glenn Fowler (AT&T Labs)) (Google for testregex.c +
Fowler) which we should use in our regression tests. There's a good
sized set of test data for it too.
If I had more time I would try pulling Henry's latest library into a
copy of my netbsd-1-6 tree too. Doing that may even be much easier to
do than integrating PCRE....
--
Greg A. Woods
+1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>