NetBSD-Bugs archive
bin/58619: nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales
>Number: 58619
>Category: bin
>Synopsis: nawk 2024-08-17 broken and incompatible for non-UTF-8 and non-C locales
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: bin-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Aug 20 07:10:00 +0000 2024
>Originator: Rin Okuyama
>Release: 10.99.11
>Organization:
Internet Initiative Japan Inc.
>Environment:
NetBSD rp64 10.99.11 NetBSD 10.99.11 (GENERIC64) #2: Tue Aug 20 13:15:56 JST 2024 rin@dancena:/home/rin/src/sys/arch/evbarm/compile/GENERIC64 evbarm
>Description:
nawk 2024-08-17 has recently been imported as /usr/bin/awk.
This version is based on the "2nd edition", but compatibility with
8-bit-clean single-byte locales like "C" seems to have been improved:
https://github.com/onetrueawk/awk/commit/1087d46
(BTW, their documentation is *REALLY* poor.)
However, it still gives broken results for non-UTF-8 multibyte
locales. The results are not only broken but also incompatible with
older versions, at least for non-8-bit-clean multibyte locales.
For example, in previous versions, the length() builtin counts the
number of bytes for, e.g., ja_JP.eucJP. The new version instead
counts the number of characters, misinterpreting the input as UTF-8 :(
>How-To-Repeat:
Try euc.txt, which I converted to EUC-JP from
http://www.jp.netbsd.org/ja/JP/index.html
---
$ ftp https://www.netbsd.org/~rin/euc.txt
...
$ env LC_CTYPE=ja_JP.eucJP \
    awk 'BEGIN{sum = 0} {sum += length($0)} END{print sum}' euc.txt
---
Older versions and 2024-08-17 give 10978 and 10418, respectively.
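The byte-oriented behaviour of the older versions can be cross-checked
without relying on awk's locale handling: summing length($0) over a file
should equal its size in bytes minus one newline per line. The sketch
below assumes a small hand-made EUC-JP sample (sample-euc.txt is a
hypothetical file name, not the euc.txt above) and that every line ends
with a newline; it only compares against awk in the "C" locale.

```shell
# Build a tiny EUC-JP sample: two lines, each with two 2-byte characters.
# (Hypothetical test data written with octal escapes, not euc.txt.)
printf '\244\242\244\244\n\244\246\244\250\n' > sample-euc.txt

# Reference value: total bytes minus one newline byte per line.
expected=$(( $(wc -c < sample-euc.txt) - $(wc -l < sample-euc.txt) ))

# Sum of length($0) as computed by awk in the byte-oriented "C" locale.
actual=$(LC_ALL=C awk '{sum += length($0)} END{print sum}' sample-euc.txt)

echo "expected=$expected actual=$actual"
```

Running the same awk command with LC_CTYPE=ja_JP.eucJP instead of
LC_ALL=C shows whether a given awk still counts bytes for that locale.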
>Fix:
A fix for the example above only:
https://gist.github.com/rokuyama/c7e6d12b6a7bcad0704f706c4f7e9569
However, I'm still not sure whether the "2nd edition" of
nawk should be used or not...