Re: bin/59029: cut(1) -n argument doesn't work (presently unsupported, though documented)

To: gutteridge%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, netbsd-bugs%netbsd.org@localhost, david%gutteridge.ca@localhost
Subject: Re: bin/59029: cut(1) -n argument doesn't work (presently unsupported, though documented)
From: "Robert Elz via gnats" <gnats-admin%NetBSD.org@localhost>
Date: Thu, 13 Feb 2025 04:50:02 +0000 (UTC)

The following reply was made to PR bin/59029; it has been noted by GNATS.

From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: gutteridge%netbsd.org@localhost
Subject: Re: bin/59029: cut(1) -n argument doesn't work (presently unsupported, though documented)
Date: Thu, 13 Feb 2025 11:46:16 +0700

     Date:        Thu, 13 Feb 2025 02:55:01 +0000 (UTC)
     From:        "David H. Gutteridge via gnats" <gnats-admin%NetBSD.org@localhost>
     Message-ID:  <20250213025501.B38C81A923C%mollari.NetBSD.org@localhost>

   |  It would be good to find an illustration of
   |  where the two approaches give varied output.)

 My guess (without testing it) would be a file which, up to some point,
 has contained only single byte (eg: ascii) chars, and which then has,
 at offset (say) 100

 100     A   B   XX  YYY  ZZ   C   D   E   F   G   H

 where a repeated letter (XX, YYY, ZZ) means one character that has a
 multi-byte encoding, one letter per byte, not two X chars, and the spaces
 are just padding for this e-mail.

 In that scheme, using -b the bytes would count

 100     A   B   XX  YYY  ZZ   C   D   E   F   G   H
         ^   ^   ^   ^    ^    ^   ^   ^   ^   ^   ^
        100 101 102 104  107  109 110 111 112 113 114

 (with the missing byte numbers being the additional bytes
 needed to encode the multi-byte characters, which don't easily
 fit in this display, unless I added more lines).

 But using -c the counts would be

 100     A   B   XX  YYY  ZZ   C   D   E   F   G   H
         ^   ^   ^   ^    ^    ^   ^   ^   ^   ^   ^
        100 101 102 103  104  105 106 107 108 109 110

 Specifying 109 as a position in a -b list means cut at the 'C', whereas
 specifying it in the -c list means cut at the 'G'.   In this case -n
 is irrelevant, as no multi-byte character would be broken, but it is
 clear that using code for -c to implement the user's -b is simply wrong,
 regardless of -n being given or not.
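
 The two counting schemes above can be checked with a short sketch.  It
 assumes UTF-8, and the concrete characters standing in for XX, YYY and ZZ
 (e-acute, the euro sign, n-tilde) as well as the helper names byte_at and
 char_at are illustrative choices only, nothing taken from any cut(1) source.

```python
# Illustration only: byte positions (-b) versus character positions (-c).
# Assumes UTF-8; positions are 1-based, as in cut(1) lists.

# 99 single-byte chars, so that 'A' lands at position 100.
# \u00e9 is 2 bytes (XX), \u20ac is 3 bytes (YYY), \u00f1 is 2 bytes (ZZ).
line = "." * 99 + "AB\u00e9\u20ac\u00f1CDEFGH"
data = line.encode("utf-8")


def byte_at(buf: bytes, pos: int) -> bytes:
    """Return the single byte at 1-based byte position pos (the -b view)."""
    return buf[pos - 1:pos]


def char_at(text: str, pos: int) -> str:
    """Return the character at 1-based character position pos (the -c view)."""
    return text[pos - 1]


if __name__ == "__main__":
    print(byte_at(data, 109))   # b'C'
    print(char_at(line, 109))   # G
```

 Position 109 lands on 'C' when counting bytes and on 'G' when counting
 characters, matching the two tables.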

 I'd assume the "special logic" you noted in the FreeBSD code is to handle
 the case where a -b list includes 105 - that is, a byte offset right in the
 middle of the Y character.   In that case, without -n, the cut would
 just happen there, right in the middle of the Y, but with -n the cut needs
 to either be before Y or after it, that is, offset 104 or 107 (which of
 the two is selected is probably entirely up to the coder).
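
 As a rough sketch of what such boundary handling has to compute (this is
 not FreeBSD's code, which I have not reproduced here), the following moves
 a byte offset that falls inside a character to one of the two neighbouring
 character starts.  It assumes UTF-8, where continuation bytes have the bit
 pattern 10xxxxxx; a real cut(1) would have to use the locale's encoding
 instead.  The function names are hypothetical.

```python
# Illustration only: snapping a 1-based byte offset to a character boundary.
# Assumes valid UTF-8 input; helper names are made up for this sketch.

# Same layout as the example: 'A' at byte 100, YYY occupying bytes 104-106.
data = ("." * 99 + "AB\u00e9\u20ac\u00f1CDEFGH").encode("utf-8")


def is_continuation(b: int) -> bool:
    """True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def char_start(buf: bytes, pos: int) -> int:
    """1-based offset of the first byte of the character containing pos."""
    i = pos - 1
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i + 1


def next_char_start(buf: bytes, pos: int) -> int:
    """pos if it already starts a character, else the next character's start."""
    i = pos - 1
    while i < len(buf) and is_continuation(buf[i]):
        i += 1
    return i + 1


if __name__ == "__main__":
    print(char_start(data, 105))       # 104
    print(next_char_start(data, 105))  # 107
    print(char_start(data, 109))       # 109
```

 Offset 105 sits inside Y, so it moves to 104 or 107 depending on the
 direction chosen, while 109 (the 'C') is already on a boundary and is
 left alone.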

 Neither the standard -b nor -c algorithm would get that right.  If you're
 looking for an implementation to import to improve ours, pick FreeBSD's.

 kre
