tech-userlevel archive
Re: bin/57544: sed(1) and regex(3) problem with encoding
In article <ZO955+sp0H7eZlwp%polynum.com@localhost>, <tlaronde%polynum.com@localhost> wrote:
>On Wed, Aug 30, 2023 at 02:32:25PM -0000, Christos Zoulas wrote:
>> In article <a2ba5261-bf4a-f3c7-c614-c54088391e0f%SDF.ORG@localhost>,
>> RVP <rvp%SDF.ORG@localhost> wrote:
>> >On Wed, 26 Jul 2023, tlaronde%polynum.com@localhost wrote:
>> >
>> >> $ export LC_CTYPE=fr_FR.ISO8859-15
>> >>
>> >> and then:
>> >>
>> >> $ echo "??" | sed 's/??\é/g'
>> >> sed: 1: "s/??\é/g": RE error: trailing backslash (\)
>> >>
>> >
>> >Not running NetBSD right now, but, FreeBSD 13.2 has the same issue which
>> >can be seen even with a plain grep(1)--as it relies on the libc regexp
>> >engine.
>> >
>> >Can you try the patch below (it is for NetBSD):
>>
>> Why don't we make next and end unsigned char so that all instances are fixed?
>
>Because one needs to review all the macros and all the invocations of
>the macros, since there are comparisons between next and other
>characters, and comparing unsigned char on one side and signed char on
>the other is sure to open another can of worms.
>
>I think RVP and I are in agreement about this: the whole lib should be
>carefully reviewed. The patch proposed by RVP (the two casts, last patch
>attached to the PR) is safe, correcting a fault and not modifying
>something else; perhaps---and even probably---not correcting all
>the faults but at least, immediately, not introducing new ones.
>
>I would have preferred that the library be "eight-bit clean",
>i.e. handling correctly the C language---ASCII---
>and treating the extra range as is, with higher level libraries, if the
>user wants them, dealing with extended character sets and regex in order
>to "compile" them to basic ones running on the core library, the way
>microcode converts CISC into RISC, with a simpler core (no
>extended chars), sticking to C, and so easier to make or prove
>correct (the higher library explaining character classes and so on
>according to the lang and the encoding etc.).
>
>This whole "i18n" and "l10n" is a nightmare---and this is a non-native
>English speaker writing it...
It is not that much code to review; I reviewed it and committed the minimal
change. There were 3 places where GETNEXT was promoted and not assigned to
a char.
Best,
christos
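
For readers following the thread, here is a minimal sketch of the general
hazard under discussion: a byte with the high bit set, read through a signed
char and promoted to int, becomes negative unless it is first cast to
unsigned char. The function names are hypothetical and this is not the
regcomp.c code or the committed change. It assumes 8-bit chars and uses
-23, which as an 8-bit value is 0xE9 (e-acute in ISO 8859-15).

```c
#include <assert.h>

/* Hypothetical helper: promote a byte the way a signed char is promoted.
 * Bytes above 0x7F come out negative. */
int promote_signed(signed char c)
{
    return c;
}

/* Hypothetical helper: cast to unsigned char before promotion, so the
 * result always lies in 0..255 and is safe to compare or use as an index. */
int promote_unsigned(signed char c)
{
    return (unsigned char)c;
}

/* Self-check of the two behaviours described above. */
void check_promotion(void)
{
    assert(promote_signed(-23) < 0);          /* sign-extended */
    assert(promote_unsigned(-23) == 0xE9);    /* stays a byte value */
    assert(promote_unsigned('\\') == '\\');   /* ASCII is unaffected */
}
```

Plain char may be signed or unsigned depending on the platform, which is
why the sketch spells out signed char explicitly.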