bin/59057: /usr/bin/grep includes '[' in named character classes for UTF-8; also affects grep -w

To: gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: bin/59057: /usr/bin/grep includes '[' in named character classes for UTF-8; also affects grep -w
From: ym3by-nb%yahoo.com@localhost
Date: Fri, 7 Feb 2025 22:00:01 +0000 (UTC)

>Number:         59057
>Category:       bin
>Synopsis:       /usr/bin/grep includes '[' in named character classes for UTF-8; also affects grep -w
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Feb 07 22:00:00 +0000 2025
>Originator:     Mike Burrows
>Release:        NetBSD 10.1
>Organization:
>Environment:
NetBSD wombat 10.1 NetBSD 10.1 (GENERIC) #0: Mon Dec 16 13:08:11 UTC 2024 mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64
>Description:
/usr/bin/grep mis-parses named character classes (such as [:alnum:])
when the locale uses a multibyte encoding, such as UTF-8.
The mis-parsing leads to it adding the '[' to the character class.
Thus, when using en_US.UTF-8, the character class [:digit:] will
include '[' in addition to '0','1','2','3','4','5','6','7','8','9'.
The problem affects "grep -w" because "-w" causes grep to surround the
user's pattern with
        \(^\|[^[:alnum:]_]\)\(     and     \)\([^[:alnum:]_]\|$\)
This means that, in a UTF-8 locale, "grep -w foo" won't
find instances of " foo[", because the '[' is treated incorrectly as a
character that is part of a word.  (I first noticed the problem because
grep was failing to find instances of array names in .c files when
using "-w".)  The problem affects egrep also, because it uses the same
code.  It does not affect "fgrep -w" even though it's the same binary
because fgrep does not use the DFA code that contains the problem.
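
For comparison, the wrapped pattern can be tried against the C
library's regex(3) routines rather than grep's DFA code.  The sketch
below is only an illustration under stated assumptions: word_match()
is a hypothetical helper, it uses the ERE spelling of the wrapper
shown above, and it inserts the user's pattern verbatim with no
escaping.  With a correct [:alnum:], a line such as ' foo[' is
expected to match.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper, not part of grep.  Wraps `pattern` the way the
   report describes for -w (written here in ERE syntax) and tests it
   against `line` using the C library's regex implementation.
   Returns 1 on a match, 0 on no match, -1 if the wrapped pattern does
   not fit in the buffer or fails to compile. */
int word_match(const char *pattern, const char *line)
{
    char wrapped[512];
    regex_t re;
    int n, rc;

    n = snprintf(wrapped, sizeof wrapped,
                 "(^|[^[:alnum:]_])(%s)([^[:alnum:]_]|$)", pattern);
    if (n < 0 || (size_t)n >= sizeof wrapped)
        return -1;

    if (regcomp(&re, wrapped, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, word_match("foo", " foo[") should return 1, while
word_match("foo", " foobar") should return 0.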

>How-To-Repeat:
        echo ' foo[' | env -i LC_CTYPE=en_US.UTF-8 /usr/bin/grep -w foo
This should output the line ' foo[', but does not.
To see that the problem is to do with named character classes:
        echo '[' | env -i LC_CTYPE=en_US.UTF-8 /usr/bin/grep '[[:digit:]]'
which outputs the line '[', even though the line contains no digits.
To see that this affects only multibyte locales:
        echo '[' | env -i LC_CTYPE=C /usr/bin/grep '[[:digit:]]'
which outputs nothing, correctly.
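
The expected class membership can also be cross-checked without
involving grep, by asking the C library directly.  This is a minimal
sketch, assuming the standard wctype(3) and iswctype(3) interfaces;
class_contains() is a hypothetical name introduced for illustration.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of grep.  Returns 1 if `ch` belongs to
   the named character class (for example "digit" or "alnum") in the
   current locale, 0 if it does not, and -1 if the class name is not
   recognized. */
int class_contains(const char *class_name, wchar_t ch)
{
    wctype_t type = wctype(class_name);

    if (type == (wctype_t)0)
        return -1;              /* unknown class name */

    return iswctype((wint_t)ch, type) != 0;
}
```

Here class_contains("digit", L'[') should return 0 and
class_contains("digit", L'7') should return 1, which is the behavior
grep's [[:digit:]] is expected to agree with.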

>Fix:
I believe that /usr/bin/grep is built from the sources in
        /usr/src/external/gpl2/grep/dist/src
and that the error is in the file dfa.c, in the routine
        parse_bracket_exp_mb()
Line 508: the code notices the start of a character class:  if (wc == L'[' && ...
Line 512: wc ('[') is copied into wc1:  wc1 = wc;
Line 516: start of parse of named character class:  if (cur_mb_len == 1 && (wc == L':' || wc == L'.' || wc == L'='))
Line 592: wc is set to -1, but wc1 continues to hold '[':  wc = -1;
Lines 593-648:  uses of wc1 here do not modify it
Line 649: the '[' in wc1 is copied into wc, for the next iteration:  while ((wc = wc1) != L']');
And so the '[' that started the named character class is effectively
appended to it on the next iteration of the do-while loop.
This does not affect single-byte character sets, because their named
character classes are handled separately, starting at line 1021.
I believe that if wc1 were set to -1 at line 592, it would fix the
problem.  e.g., make line 592 be: wc1 = wc = -1;
That change seems to work for the cases given above, but I have not
done enough testing to be certain that there are no unwanted
side-effects.
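
The control flow described above can be modeled in a few lines.  The
sketch below is NOT the dfa.c source: parse_bracket_model(), its
parameters, and the use of plain chars instead of wide characters are
all assumptions made for illustration, and ranges, [. .] and [= =]
are left out.  It only mimics how wc and wc1 are carried from one
iteration of the do-while loop to the next, with and without the
proposed reset of wc1.

```c
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical model of the loop described in the report.
   `body` is the text after the opening '[' of a bracket expression,
   e.g. "[:digit:]]" for the pattern "[[:digit:]]".  Literal members
   are collected in `chars`.  Returns the number of named classes
   seen, or -1 if the input ends before the closing ']'.  If
   `apply_fix` is nonzero, wc1 is reset together with wc after a named
   class, as the report proposes. */
int parse_bracket_model(const char *body, int apply_fix,
                        char *chars, size_t chars_size)
{
    size_t pos = 0, nchars = 0;
    int nclasses = 0;
    int wc, wc1;

    if (chars_size == 0)
        return -1;
    chars[0] = '\0';
    if (body[pos] == '\0')
        return -1;
    wc = (unsigned char)body[pos++];

    do {
        wc1 = -1;
        if (wc == '[') {
            wc1 = wc;                   /* the '[' is remembered in wc1 */
            if (body[pos] == '\0')
                return -1;
            wc = (unsigned char)body[pos++];
            if (wc == ':') {
                /* Named class: skip to just past the closing ":]". */
                const char *end = strstr(body + pos, ":]");
                if (end == NULL)
                    return -1;
                pos = (size_t)(end - body) + 2;
                nclasses++;
                wc = -1;
                if (apply_fix)
                    wc1 = -1;           /* proposed: wc1 = wc = -1 */
            } else {
                /* Not a named class: '[' is an ordinary member. */
                int tmp = wc1;
                wc1 = wc;
                wc = tmp;
            }
        }
        if (wc1 == -1) {                /* fetch the next character */
            if (body[pos] == '\0')
                return -1;
            wc1 = (unsigned char)body[pos++];
        }
        if (wc != -1 && nchars + 1 < chars_size) {
            chars[nchars++] = (char)wc; /* record a literal member */
            chars[nchars] = '\0';
        }
    } while ((wc = wc1) != ']');

    return nclasses;
}
```

In this model, parsing "[:digit:]]" with apply_fix set to 0 leaves
"[" in chars, which corresponds to the stray '[' described above;
with apply_fix set to 1, chars stays empty.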
