Subject: Re: the state of regex(3)
To: Christos Zoulas <christos@zoulas.com>
From: Alistair Crooks <agc@pkgsrc.org>
List: tech-userlevel
Date: 09/28/2004 22:17:44
On Tue, Sep 28, 2004 at 06:40:55PM +0000, Christos Zoulas wrote:
> 4. POSIX conformance: REG_NEWLINE will not follow POSIX, according to the docs.
>
> So license is fine, code is not our style and not my favorite to maintain,
> but not a real showstopper (although it would be nice if the author was
> convinced to follow a more traditional style). Docs are ok, but the real
> stickler is POSIX conformance, or isn't it?
My reading of the docs shows that the default POSIX behaviour is the
same, and I know of no way to change the POSIX REG_NEWLINE regex
engine behaviour from the command line on egrep(1) or awk(1) (for
example).
pcre.txt says on this matter:
This area is not simple, because POSIX and Perl take different views of
things. It is not possible to get PCRE to obey POSIX semantics, but
then PCRE was never intended to be a POSIX engine. The following table
lists the different possibilities for matching newline characters in
PCRE:
                          Default  Change with
. matches newline         no       PCRE_DOTALL
newline matches [^a]      yes      not changeable
$ matches \n at end       yes      PCRE_DOLLAR_ENDONLY
$ matches \n in middle    no       PCRE_MULTILINE
^ matches \n in middle    no       PCRE_MULTILINE
This is the equivalent table for POSIX:
                          Default  Change with
. matches newline         yes      REG_NEWLINE
newline matches [^a]      yes      REG_NEWLINE
$ matches \n at end       no       REG_NEWLINE
$ matches \n in middle    no       REG_NEWLINE
^ matches \n in middle    no       REG_NEWLINE
PCRE's behaviour is the same as Perl's, except that there is no
equivalent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there
is no way to stop newline from matching [^a].
The default POSIX newline handling can be obtained by setting
PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
behave exactly as for the REG_NEWLINE action.
So default POSIX newline handling is possible with PCRE.
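To make the POSIX table above concrete, here is a minimal sketch using
regcomp(3) and regexec(3). The helper names re_match() and
check_posix_newline_table() are made up for illustration and do not exist
in our tree; the assertions simply restate the rows of the POSIX table,
first with default flags and then with REG_NEWLINE.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if pattern matches text, 0 if it does
 * not, and -1 if the pattern fails to compile.
 */
int
re_match(const char *pattern, int cflags, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}

/* Hypothetical check: one pair of assertions per row of the POSIX table. */
void
check_posix_newline_table(void)
{
	const int def = REG_EXTENDED;
	const int nl = REG_EXTENDED | REG_NEWLINE;

	/* . matches newline: yes by default, no with REG_NEWLINE */
	assert(re_match("a.b", def, "a\nb") == 1);
	assert(re_match("a.b", nl, "a\nb") == 0);

	/* newline matches [^a]: yes by default, no with REG_NEWLINE */
	assert(re_match("x[^a]y", def, "x\ny") == 1);
	assert(re_match("x[^a]y", nl, "x\ny") == 0);

	/* $ matches before \n at end: no by default, yes with REG_NEWLINE */
	assert(re_match("abc$", def, "abc\n") == 0);
	assert(re_match("abc$", nl, "abc\n") == 1);

	/* $ matches before \n in middle: no by default, yes with REG_NEWLINE */
	assert(re_match("abc$", def, "abc\ndef") == 0);
	assert(re_match("abc$", nl, "abc\ndef") == 1);

	/* ^ matches after \n in middle: no by default, yes with REG_NEWLINE */
	assert(re_match("^def", def, "abc\ndef") == 0);
	assert(re_match("^def", nl, "abc\ndef") == 1);
}
```

Calling check_posix_newline_table() from a test program should pass
silently on a conforming regex(3). The rows that flip under REG_NEWLINE
are the ones PCRE cannot reproduce exactly, per the quoted text.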
In our whole src tree, I can find the following uses of REG_NEWLINE:
usr.bin/m4/gnum4.c: REG_NEWLINE | REG_EXTENDED);
usr.bin/nl/nl.c: &argstr[1], REG_NEWLINE|REG_NOSUB)) != 0) {
usr.sbin/user/user.c: if (regcomp(&r, line, REG_EXTENDED|REG_NEWLINE) != 0) {
and these could be converted to PCRE fairly easily, I would have said.
If Jason could help me out and tell me exactly what the sticking point
is, I'd be grateful. Is it any worse than defining POSIX_MISTAKE for
libc builds? (And yes, I know what POSIX_MISTAKE is for; I'm talking
about the whole area of POSIX regular expressions.)
Regards,
Alistair