NetBSD-Bugs archive
Re: bin/57544: sed(1) and regex(3) problem with encoding
The following reply was made to PR bin/57544; it has been noted by GNATS.
From: tlaronde%polynum.com@localhost
To: gnats-bugs%netbsd.org@localhost
Cc: RVP <rvp%SDF.ORG@localhost>, Martin Husemann <martin%duskware.de@localhost>,
Taylor R Campbell <campbell+netbsd-tech-userlevel%mumble.net@localhost>
Subject: Re: bin/57544: sed(1) and regex(3) problem with encoding
Date: Mon, 31 Jul 2023 10:52:07 +0200
RVP has indeed found the culprit, so the following diff:
Index: regcomp.c
===================================================================
RCS file: /pub/NetBSD-CVS/src/lib/libc/regex/regcomp.c,v
retrieving revision 1.46
diff -u -r1.46 regcomp.c
--- regcomp.c 11 Mar 2021 15:00:29 -0000 1.46
+++ regcomp.c 31 Jul 2023 08:32:56 -0000
@@ -900,10 +900,10 @@
handled = false;
assert(MORE()); /* caller should have ensured this */
- c = GETNEXT();
+ c = (unsigned char)GETNEXT();
if (c == '\\') {
(void)REQUIRE(MORE(), REG_EESCAPE);
- cc = GETNEXT();
+ cc = (unsigned char)GETNEXT();
c = BACKSL | cc;
#ifdef REGEX_GNU_EXTENSIONS
if (p->gnuext) {
solves the problem.
Explanation: regex(3) uses an int to carry a char together with some
decoration. In p_simp_re(), when the char was part of an escaped
sequence, it set to 1 the bit immediately to the left of the bits needed
for a char:
# define BACKSL (1<<CHAR_BIT)
before fetching the next char. The later processing then tested for
this flag.
On a machine with signed chars and two's complement, where the sign bit
is "extended" when the char is converted to int, every negative char was
then treated as being an escaped sequence.
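
To make the sign extension concrete, here is a minimal sketch, not the
libc code itself. The helper names are hypothetical; only BACKSL is taken
from regcomp.c. It uses signed char explicitly so that it behaves the same
whatever the signedness of plain char, and it assumes two's complement
and CHAR_BIT == 8 for the example byte 0xE9.

```c
#include <assert.h>
#include <limits.h>

/* Same definition as in regcomp.c: the first bit above a char. */
#define BACKSL (1 << CHAR_BIT)

/*
 * Hypothetical helper: mimics the unpatched assignment.
 * A negative char is sign-extended, so every bit above the
 * low CHAR_BIT bits, including BACKSL, ends up set.
 */
static int
flagged_without_cast(signed char ch)
{
	int c = ch;

	return (c & BACKSL) != 0;
}

/*
 * Hypothetical helper: mimics the patched assignment.
 * The cast keeps the value in 0..UCHAR_MAX, so BACKSL stays
 * clear unless the code sets it on purpose.
 */
static int
flagged_with_cast(signed char ch)
{
	int c = (unsigned char)ch;

	return (c & BACKSL) != 0;
}

/* Small sanity check; -23 is byte 0xE9 read as a signed char. */
static void
check_backsl_demo(void)
{
	assert(flagged_without_cast((signed char)-23));
	assert(!flagged_with_cast((signed char)-23));
	assert(!flagged_without_cast('a'));
}
```

Calling check_backsl_demo() from a test program should pass silently:
the byte 0xE9 looks "escaped" without the cast and does not with it,
while a plain ASCII char is unaffected either way.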
From a cursory look, the difference between setting LC_CTYPE=C (no
problem) or LC_CTYPE=fr_FR.ISO8859-15 (just as an example) is perhaps
that in the first case extended RE are assumed, while in the latter case
legacy is used, hence not following the same path (legacy using
p_simp_re() while ERE uses p_ere_exp()).
But the whole code should be reviewed by someone who knows the
intricacies of locales and ctype, and the signed/unsigned problem
(and, to add more, two's complement) also needs a more thorough
review.
Ironically, in WHATSNEW (dating back to BSD 4.4...) there is this:
Most uses of "uchar" are gone; it's all chars now. Char/uchar
parameters are now written int/unsigned, to avoid possible portability
problems with unpromoted parameters. Some unsigned casts have been
introduced to minimize portability problems with shifting into sign
bits.
So signed/unsigned and portability problems are not new...
--
Thierry Laronde <tlaronde +AT+ polynum +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C