NetBSD-Bugs archive
bin/59057: /usr/bin/grep includes '[' in named character classes for UTF-8; also affects grep -w
>Number: 59057
>Category: bin
>Synopsis: /usr/bin/grep includes '[' in named character classes for UTF-8; also affects grep -w
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Feb 07 22:00:00 +0000 2025
>Originator: Mike Burrows
>Release: NetBSD 10.1
>Organization:
>Environment:
NetBSD wombat 10.1 NetBSD 10.1 (GENERIC) #0: Mon Dec 16 13:08:11 UTC 2024 mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
/usr/bin/grep mis-parses named character classes (such as [:alnum:])
in multibyte locales such as en_US.UTF-8.
The mis-parsing causes it to add '[' to the character class.
Thus, when using en_US.UTF-8, the character class [:digit:] will
include '[' in addition to '0','1','2','3','4','5','6','7','8','9'.
The problem affects "grep -w" because "-w" causes grep to surround the
user's pattern with
\(^\|[^[:alnum:]_]\)\( and \)\([^[:alnum:]_]\|$\)
This means that, when using a UTF-8 character set, "grep -w foo" won't
find instances of " foo[", because the '[' is treated incorrectly as a
character that is part of a word. (I first noticed the problem because
grep was failing to find instances of array names in .c files when
using "-w".) The problem affects egrep also, because it uses the same
code. It does not affect "fgrep -w" even though it's the same binary
because fgrep does not use the DFA code that contains the problem.
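For comparison, here is a minimal sketch of what the "-w" wrapping is
expected to do when named classes are parsed correctly. It is not grep's
code: it uses the system regcomp()/regexec(), and it rewrites the wrapper
in ERE syntax because the \| alternation above is a GNU extension to BRE.
The function name word_match() is hypothetical, and the word is pasted
into the pattern unescaped, so it only suits plain words like "foo".

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if `line` contains `word` as a whole
 * word, 0 if not, -1 on error.  The pattern is an ERE rendering of the
 * wrapper that "grep -w" places around the user's pattern.
 */
int word_match(const char *word, const char *line)
{
    char pattern[256];
    regex_t re;
    int n, rc;

    assert(word != NULL && line != NULL);

    n = snprintf(pattern, sizeof pattern,
                 "(^|[^[:alnum:]_])(%s)([^[:alnum:]_]|$)", word);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;                      /* word too long for the buffer */

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                      /* pattern did not compile */

    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With a regex library that parses [:alnum:] correctly, word_match("foo",
" foo[") should report a match, because '[' is neither alphanumeric nor
'_'. That is the behaviour "grep -w foo" fails to show in a UTF-8 locale.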
>How-To-Repeat:
echo ' foo[' | env -i LC_CTYPE=en_US.UTF-8 /usr/bin/grep -w foo
This should output the line ' foo[', but does not.
To see that the problem is to do with named character classes:
echo '[' | env -i LC_CTYPE=en_US.UTF-8 /usr/bin/grep '[[:digit:]]'
which outputs the line '[', even though the line contains no digits.
To see that this affects only multibyte locales:
echo '[' | env -i LC_CTYPE=C /usr/bin/grep '[[:digit:]]'
which outputs nothing, correctly.
>Fix:
I believe that /usr/bin/grep is built from the sources in
/usr/src/external/gpl2/grep/dist/src
and that the error is in the file dfa.c, in the routine
parse_bracket_exp_mb()
Line 508: the code notices the start of a character class:
    if (wc == L'[' && ...
Line 512: wc ('[') is copied into wc1:
    wc1 = wc;
Line 516: start of parse of named character class:
    if (cur_mb_len == 1 && (wc == L':' || wc == L'.' || wc == L'='))
Line 592: wc is set to -1, but wc1 continues to hold '[':
    wc = -1;
Lines 593-648: uses of wc1 here do not modify it
Line 649: the '[' in wc1 is copied into wc, for the next iteration:
    while ((wc = wc1) != L']');
And so the '[' that started the named character class is effectively
appended to it on the next iteration of the do-while loop.
This does not affect single-byte character sets, because their named
character classes are handled separately, starting at line 1021.
I believe that if wc1 were set to -1 at line 592, it would fix the
problem. e.g., make line 592 be: wc1 = wc = -1;
That change seems to work for the cases given above, but I have not
done enough testing to be certain that there are no unwanted
side-effects.
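The control flow described above can be modelled with a small
stand-alone sketch. This is not the dfa.c source. It is a simplified
model that works on plain bytes, recognises only the "[:" form, ignores
ranges and escapes, and assumes two things not stated above: that wc1 is
reset to -1 at the top of each iteration, and that a '[' not followed by
':' is swapped back into wc as an ordinary character. The names
next_char() and model_literals() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the next pattern byte, or ']' at end of input so the loop ends. */
static int next_char(const char **p)
{
    return **p ? (unsigned char)*(*p)++ : ']';
}

/*
 * Model of the do-while loop.  `body` is the text after the opening '['
 * of a bracket expression, e.g. "[:digit:]]" for the pattern [[:digit:]].
 * Returns the literal characters that end up in the set (named classes
 * are skipped, not expanded).  apply_fix selects "wc1 = wc = -1".
 */
const char *model_literals(const char *body, int apply_fix)
{
    static char out[64];
    size_t n = 0;
    const char *p = body;
    int wc, wc1;

    assert(body != NULL);

    wc = next_char(&p);
    do {
        wc1 = -1;                       /* assumption: reset each iteration */
        if (wc == '[') {
            wc1 = wc;                   /* cf. line 512: wc1 now holds '[' */
            wc = next_char(&p);
            if (wc == ':') {            /* cf. line 516, only ':' modelled */
                const char *end = strstr(p, ":]");
                p = end ? end + 2 : p + strlen(p);  /* skip class name */
                wc = -1;                /* cf. line 592 */
                if (apply_fix)
                    wc1 = -1;           /* proposed: wc1 = wc = -1; */
            } else {
                /* assumption: '[' is an ordinary character, swap back */
                int tmp = wc1; wc1 = wc; wc = tmp;
            }
        }
        if (wc1 == -1)
            wc1 = next_char(&p);        /* look ahead only if wc1 is unset */
        if (wc != -1 && n + 1 < sizeof out)
            out[n++] = (char)wc;        /* add a literal to the set */
    } while ((wc = wc1) != ']');        /* cf. line 649 */

    out[n] = '\0';
    return out;
}
```

In this model, model_literals("[:digit:]]", 0) returns "[" (the stray
bracket), while model_literals("[:digit:]]", 1) returns the empty string.
The model only illustrates the wc/wc1 carry-over; it says nothing about
side-effects elsewhere in the real parser.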