Subject: Re: the state of regex(3)
To: None <tech-userlevel@netbsd.org>
From: Christos Zoulas <christos@zoulas.com>
List: tech-userlevel
Date: 09/28/2004 18:40:55
In article <20040928163821.GA8789@nef.pbox.org>,
Alistair Crooks <agc@pkgsrc.org> wrote:
>On Fri, Jan 02, 2004 at 05:48:18PM -0500, Greg A. Woods wrote:
>> I had forgotten that I had a basic bit of test harness for doing simple
>> regex testing and benchmarking with the egrep implementation by James
>> Howard and Dag-Erling Smørgrav (which is just a wrapper around any POSIX
>> regex library). Remembering this prompted me to fetch and compile the
>> latest versions of the various libraries mentioned so far and give them
>> each a test run.
>> [...]
>> For the rest here are some timing results from the following silly test
>> I use to find obvious viruses in e-mail, as run across about 64MB of
>> accumulated virus e-mail. So far PCRE is the clear winner by a country
>> mile and TRE is way ahead of the rest of the pack. TRE will probably
>> also improve quite a bit more before there's a 1.x release of it. TRE
>> has become very much more interesting in the latest release too -- it
>> now has true support for approximate pattern matching using real EREs
>> (i.e. in a manner vastly superior to the old agrep).
>
>With thanks to Greg for his benchmarking, which I've deleted, but is in
>the archive.
>
>Thomas Klausner has just updated the PCRE package to 5.0. It's
>interesting to note that this update says:
>
> Log Message:
> Update to 5.0:
>
> Release 5.0 13-Sep-04
> ---------------------
>
> The licence under which PCRE is released has been changed to the more
> conventional "BSD" licence.
>
> In the code, some bugs have been fixed, and there are also some major changes
> in this release (which is why I've increased the number to 5.0). Some changes
> are internal rearrangements, and some provide a number of new facilities.
>
>Assuming that the internal rearrangements have not clobbered the performance
>in any way, is there any reason to stay with the old regex(3) implementation?
>Shouldn't we just move to pcre?
Well,
1. The license is indeed BSD, but formatted differently.
2. The code is indented in a GNUish style with the following differences:
   - code starts at column 0
   - compound statements are sometimes on the same line:
         if (blaf) { foo; }
   - sometimes if/then/else statements are formatted like:
         if (blaf) a = b; else
           {
           c = d;
           }
   - other times the indentation rules are more complex:
         if (a)
           {
           if (a == b) c = d;
           else if (a == d) f = g;
           else
             {
             e=h;
             }
           }
3. The documentation looks ok, but will need some cleanup.
4. POSIX conformance: REG_NEWLINE will not follow POSIX, according to the docs.
So the license is fine. The code is not our style and not my favorite to
maintain, but that is not a real showstopper (although it would be nice if
the author were convinced to follow a more traditional style). The docs are
ok, but the real sticking point is POSIX conformance, or isn't it?
christos