Subject: Re: the state of regex(3) (was: Policy questions)
To: Jaromir Dolecek <jdolecek@NetBSD.org>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 01/04/2004 16:05:06
[ On Sunday, January 4, 2004 at 13:16:25 (+0100), Jaromir Dolecek wrote: ]
> Subject: Re: the state of regex(3) (was: Policy questions)
>
> Would be nice to evaluate PCRE as possible regex engine replacement.
> It's definitely much more widely used, and thus hopefully faster
> and even more stable than what we have now. Anyone would want to
> do some performance comparisons?
Hopefully you've seen the basic performance comparison I posted
earlier.
PCRE is definitely a lot faster than all the other implementations at
what really counts: matching. It might be slower than some alternatives
at compiling expressions (though probably not slower than Henry's code :-).
> What is missing in PCRE from true POSIX EREs, BTW?
I'm not aware of any POSIX ERE features missing from PCRE, since Perl
regular expressions are vastly more featureful than POSIX EREs.
However, there are some minor differences in behaviour.
From the pcreposix(3) manual page:
MATCHING NEWLINE CHARACTERS
This area is not simple, because POSIX and Perl take different
views of things. It is not possible to get PCRE to obey POSIX
semantics, but then PCRE was never intended to be a POSIX engine.
The following table lists the different possibilities for
matching newline characters in PCRE:
                          Default   Change with
  . matches newline          no     PCRE_DOTALL
  newline matches [^a]       yes    not changeable
  $ matches \n at end        yes    PCRE_DOLLARENDONLY
  $ matches \n in middle     no     PCRE_MULTILINE
  ^ matches \n in middle     no     PCRE_MULTILINE
This is the equivalent table for POSIX:
                          Default   Change with
  . matches newline          yes    REG_NEWLINE
  newline matches [^a]       yes    REG_NEWLINE
  $ matches \n at end        no     REG_NEWLINE
  $ matches \n in middle     no     REG_NEWLINE
  ^ matches \n in middle     no     REG_NEWLINE
PCRE's behaviour is the same as Perl's, except that there is no
equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE and Perl,
there is no way to stop newline from matching [^a].

The default POSIX newline handling can be obtained by setting
PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way to make
PCRE behave exactly as for the REG_NEWLINE action.
--
Greg A. Woods
+1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>