Subject: Re: the state of regex(3) (was: Policy questions)
To: Jaromir Dolecek <jdolecek@NetBSD.org>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 01/04/2004 16:05:06
[ On Sunday, January 4, 2004 at 13:16:25 (+0100), Jaromir Dolecek wrote: ]
> Subject: Re: the state of regex(3) (was: Policy questions)
>
> Would be nice to evaluate PCRE as possible regex engine replacement.
> It's definitely much more widely used, and thus hopefully faster
> and even more stable than what we have now. Anyone would want to
> do some performance comparisons?

Hopefully you've already seen the basic performance comparison I
posted.

PCRE is definitely a lot faster than all the other implementations at
what really counts:  matching.  It might be slower than some alternatives
at compiling expressions (though probably not slower than Henry's code :-).

> What is missing in PCRE from true POSIX EREs, BTW?

I'm not aware of any POSIX ERE features missing from PCRE, since Perl
regular expressions are vastly more featureful than POSIX EREs.

However, there are some minor differences in behaviour.

From the pcreposix(3) manual page:

    MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take  dif-
       ferent views of things.  It is not possible to get PCRE to
       obey POSIX semantics, but then PCRE was never intended  to
       be a POSIX engine. The following table lists the different
       possibilities for matching newline characters in PCRE:

                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLARENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE

       This is the equivalent table for POSIX:

                                 Default   Change with

         . matches newline          yes      REG_NEWLINE
         newline matches [^a]       yes      REG_NEWLINE
         $ matches \n at end        no       REG_NEWLINE
         $ matches \n in middle     no       REG_NEWLINE
         ^ matches \n in middle     no       REG_NEWLINE

       PCRE's behaviour is the same as Perl's, except that  there
       is  no  equivalent for PCRE_DOLLARENDONLY in Perl. In both
       PCRE and Perl, there is no way to stop newline from match-
       ing [^a].

       The default POSIX newline handling can be obtained by set-
       ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there  is  no
       way  to  make  PCRE  behave exactly as for the REG_NEWLINE
       action.
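
For anyone who wants to check the POSIX half of those tables against
the regex(3) on their own system, something like the following should
do.  It is only a minimal sketch:  ere_matches() is a name made up for
illustration (it is not part of any library), and it uses nothing but
the standard regcomp(3)/regexec(3)/regfree(3) calls from <regex.h>.
The PCRE half is not covered here.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Return 1 if the POSIX ERE `pattern' matches somewhere in `text',
 * 0 if it does not, and -1 if the pattern fails to compile.
 * `extra_cflags' is OR-ed into REG_EXTENDED | REG_NOSUB, so pass 0
 * for the default newline handling or REG_NEWLINE to change it.
 */
int
ere_matches(const char *pattern, const char *text, int extra_cflags)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
		return -1;

	/* REG_NOSUB was given, so no match offsets are requested. */
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);

	return rc == 0 ? 1 : 0;
}
```

With a conforming regex(3) the rows of the POSIX table should come out
like this:

    ere_matches("a.b",    "a\nb", 0)            gives 1
    ere_matches("a.b",    "a\nb", REG_NEWLINE)  gives 0
    ere_matches("a[^x]b", "a\nb", REG_NEWLINE)  gives 0
    ere_matches("a$",     "a\n",  0)            gives 0
    ere_matches("a$",     "a\n",  REG_NEWLINE)  gives 1
    ere_matches("^b",     "a\nb", REG_NEWLINE)  gives 1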



-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>