NetBSD-Bugs archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
misc/40387: Serious bug in regular expression library (regex) affected sed
>Number: 40387
>Category: misc
>Synopsis: Serious bug in regular expression library (regex) affected sed
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: misc-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Jan 13 14:10:02 +0000 2009
>Originator: Chris Kuklewicz
>Release: 4.0 and 4.99.59
>Organization:
Not Applicable
>Environment:
I have no output to paste here
>Description:
I have tracked the source of a bug originally seen on OS X back to the same bug
in FreeBSD and NetBSD.
The bug is in the libc implementation of regular expressions, used via the API
in "regex.h" specifically regexec.
Programs which use this library are affected, including "sed" which I will use
to illustrate the bug below.
The bug effects both the whole text matched and the choice of text reported as
matching parenthesized subexpressions. The bug violates the leftmost longest
matching rule for POSIX regular expressions. The bug violates the principle
that the order of subexpressions around the "|" operator should not affect the
whole text matched.
A terminal sessions using three sed commands with extended regular expressions:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
a[b]
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[ab]
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E
's/((X)(aba|b|ab)(aba|b|ab)(Y)(aba|b|ab)*(Z))/[\1] => [\2][\3][\4][\5][\6][\7]/'
[XababaYababaZ] => [X][ab][aba][Y][b][Z]
The first two commands differ only in the order of "()" and "." around the "|"
operator. According to the POSIX specification this must not affect the whole
match, which is report by "&" in sed's replacement specification. The leftmost
longest match is "ab" and is correctly reported by the second command. The
first command incorrect does not match starting with the "a" and matches just
the "b".
This establishes that the bug violates the POSIX specification and can affect a
program that only cares about the whole text matched (ignoring captured
parenthesized subexpressions).
The third command shows an internal inconsistency in the output. The
(aba|b|ab)* operator should capture the text matched by the last repetition of
the (aba|b|ab) subexpression and report that in \6. In particular \6 should
match "aba" just like \3 does.
As shown above the FreeBSD output is [XababaYababaZ] => [X][ab][aba][Y][b][Z]
The correct output should have been [XababaYababaZ] => [X][ab][aba][Y][abc][Z]
Could there be a way to explain the FreeBSD output? Consider the following
questions:
If the last repetition of (aba|b|ab)* matched \6 to "b" then how is the text
between that "b" skipped so that "(Z)" matched \7 against the final "Z" in the
text?
If the "b" reported by \6 is the last "b" in the text then there is no part of
the pattern that can match the next "a".
If the "b" reported by \6 is the next-to-last "b" in the text then there is no
part of the pattern that can match the preceding "a".
Checking the indices returned by regex shows that \6 was matched against the
last "b" in the text.
Thus reporting \6 as matching "b" is contradictory, there is no consistent
interpretation of what regexec returned.
This establishes that the bug is can violate not only the POSIX specification
but also violate any rational and consistent alternate specification.
I found the first example by using Haskell's QuickCheck to randomly generate
test cases. The second example was found by using Glenn Fowler's "AT&T Labs
Research regex(3) regression tests" from
http://www.research.att.com/~gsf/testregex/
I "tested" this bug on NetBSB by asking on the #netbsd channel of
irc.freenode.net and ASau and sjamaan answered.
I tested FreeBSD 6.3-PRERELEASE thanks to the free shell account at
m-net.arbornet.org and FreeBSD 8.0-CURRENT was tested by "methi" on the
#freebsd channel at irc.freenode.net
>How-To-Repeat:
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(()|.)(b)/[&]/"
[chrisk@m-net ~]$ echo "ab" | sed -E "s/(.|())(b)/[&]/"
[chrisk@m-net ~]$ echo XababaYababaZ | sed -E 's/((X)(aba|b|ab)(aba|b|ab)
If there is no bug then the first two both output "[ab]" and the last output
"[XababaYababaZ] => [X][ab][aba][Y][abc][Z]".
Currently the bug causes the first the output "a[b]" and the last to output
"[XababaYababaZ] => [X][ab][aba][Y][b][Z]".
>Fix:
I do not have the time to be able to fix the regex.h part of libc.
Home |
Main Index |
Thread Index |
Old Index