NetBSD-Bugs archive


Re: bin/59029: cut(1) -n argument doesn't work (presently unsupported, though documented)



The following reply was made to PR bin/59029; it has been noted by GNATS.

From: Robert Elz <kre%munnari.OZ.AU@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: gutteridge%netbsd.org@localhost
Subject: Re: bin/59029: cut(1) -n argument doesn't work (presently unsupported, though documented)
Date: Thu, 13 Feb 2025 11:46:16 +0700

     Date:        Thu, 13 Feb 2025 02:55:01 +0000 (UTC)
     From:        "David H. Gutteridge via gnats" <gnats-admin%NetBSD.org@localhost>
     Message-ID:  <20250213025501.B38C81A923C%mollari.NetBSD.org@localhost>
 
   |  It would be good to find an illustration of
   |  where the two approaches give varied output.)
 
 My guess (without testing it) is that the difference would show up in a
 file which, up to some point, contains only single byte (e.g. ASCII)
 chars, and then has, at offset (say) 100
 
 100	A  B  XX  YYY  ZZ   C   D   E   F   G   H
 
 where each run of repeated letters means one character that has a
 multi-byte encoding (XX is a single 2 byte character, not two X chars),
 and the spaces are just padding for this e-mail.
 
 In that scheme, using -b the positions would be counted in bytes
 
 100     A   B   XX  YYY  ZZ   C   D   E   F   G   H
         ^   ^   ^   ^    ^    ^   ^   ^   ^   ^   ^
        100 101 102 104  107  109 110 111 112 113 114
 
 (with the missing byte numbers being the additional bytes
 needed to encode the multi-byte characters, which don't easily
 fit in this display, unless I added more lines).
 
 But using -c the counts would be
 
 100     A   B   XX  YYY  ZZ   C   D   E   F   G   H
         ^   ^   ^   ^    ^    ^   ^   ^   ^   ^   ^
        100 101 102 103  104  105 106 107 108 109 110
 
 Specifying 109 as a position in a -b list means cut at the 'C', whereas
 specifying it in a -c list means cut at the 'G'.   In this case -n
 is irrelevant, as no multi-byte character would be broken, but it is
 clear that using code for -c to implement the user's -b is simply wrong,
 regardless of -n being given or not.
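
 The two counts above can be reproduced with a small Python sketch.  It
 is only an illustration, not code from any cut(1) implementation: the
 sample line, the choice of UTF-8, the stand-in characters (é, €, ñ for
 XX, YYY, ZZ) and the helper names byte_at and char_at are all assumptions
 made for this example.

```python
# Hypothetical sample line: 99 filler bytes so that 'A' lands at 1-based
# position 100, then three multi-byte characters (UTF-8 is assumed).
line = "." * 99 + "AB" + "é€ñ" + "CDEFGH"
data = line.encode("utf-8")  # é is 2 bytes, € is 3 bytes, ñ is 2 bytes


def byte_at(data: bytes, pos: int) -> bytes:
    """Return the byte at 1-based position pos (how -b counts)."""
    return data[pos - 1:pos]


def char_at(text: str, pos: int) -> str:
    """Return the character at 1-based position pos (how -c counts)."""
    return text[pos - 1]


print(byte_at(data, 109))  # the byte encoding 'C'
print(char_at(line, 109))  # the character 'G'
```

 The line is 114 bytes but only 110 characters long, which matches the
 positions of 'H' in the two diagrams.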
 
 I'd assume the "special logic" you noted in the FreeBSD code is to handle
 the case where a -b list includes 105 - that is, a byte offset right in the
 middle of the Y character.   In that case, without -n, the cut would
 just happen there, right in the middle of the Y, but with -n the cut needs
 to either be before Y or after it, that is, offset 104 or 107 (which of
 the two is selected is probably entirely up to the coder).
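
 As a rough sketch of what that extra logic has to work out (this is not
 FreeBSD's code, and it assumes UTF-8, where continuation bytes have the
 bit pattern 10xxxxxx), the two candidate cut points for a byte offset
 that lands inside a character could be found like this.  The names
 is_continuation and char_bounds are hypothetical.

```python
def is_continuation(b: int) -> bool:
    """True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def char_bounds(data: bytes, pos: int) -> tuple[int, int]:
    """For 1-based byte position pos, return the 1-based positions of the
    first byte of the character containing it and of the next character."""
    start = pos - 1
    while start > 0 and is_continuation(data[start]):
        start -= 1  # step back to the character's lead byte
    end = start + 1
    while end < len(data) and is_continuation(data[end]):
        end += 1    # step forward past the remaining continuation bytes
    return start + 1, end + 1


# Same hypothetical sample line as in the diagrams: 'A' at byte 100.
data = ("." * 99 + "AB" + "é€ñ" + "CDEFGH").encode("utf-8")
print(char_bounds(data, 105))  # (104, 107)
```

 With -n, a cut requested at byte 105 would then be moved to one of the
 two returned positions rather than splitting the character.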
 
 Neither the standard -b nor -c algorithm would get that right.  If you're
 looking for an implementation to import to improve ours, pick FreeBSD's.
 
 kre
 
 

