tech-userlevel archive

Re: sh(1) read: add LINE_MAX safeguard and "-n" option



On Fri, Sep 27, 2024 at 07:03:15AM +0700, Robert Elz wrote:
>     Date:        Tue, 24 Sep 2024 17:30:42 +0200
>     From:        tlaronde%kergis.com@localhost
>     Message-ID:  <ZvLbItYTlanIgVgV%kergis.com@localhost>
> 
>   | Furthermore the continuation test on:
>   |
>   | 		if (c != '\n')	/* \ \n is always just removed */
>   | 			goto wdch;
>   |
>   | seems wrong. Shouldn't it be?:
>   |
>   | 	 if (c != end)
>   | 		goto wdch;
> 
> Actually no, what is there now is what is intended.
> 
> The idea is that the input might need to be divided into many lines
> to meet the requirement that it be a text file, which means a max
> line length (as you're aware), and that max length is from the first
> char in the line to the next \n char (read's delimiter char has
> nothing to do with that use of \n).  To allow that, while not restricting
> the length of a record, the sequence \ \n is allowed to indicate
> continuation lines, regardless of what the delimiter is, and is simply
> removed from the input stream (just as in cpp and sh - and more).
> 
> Other than that usage, a \ also escapes the following char, avoids
> it being anything special (not a field (word) separator, not the
> delimiter, and of course, as \\ not the escape char either).
> 
> If the delimiter was \n (the default, or -d $'\n') then the end of line
> continuation removal causes it to vanish before the code checks if the
> delimiter has appeared, if the delimiter is something else, we don't want
> it to vanish, there is no point in that -- say we use "-d :", why would
> we then ever write \: in the input if those pair of chars are simply
> deleted?  Makes no sense.  What we would want is the escaped : there
> to be a regular char, not deleted, and not the delimiter either.
> 

I have an algebraic mind: I always think in terms of rules. A line, some
time ago, was defined as a sequence of bytes ending at the first
occurrence of '\n'. If a "line" is defined more generally as a sequence
of bytes ending at the first occurrence of whatever the delimiter byte
is, then a "continuation line" is marked by the escaped delimiter. And
if the delimiter is not '\n', then '\\' followed by '\n' yields a
literal '\n'.
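
With the default delimiter both readings give the same result, and that
part is portable. A small sketch to make it concrete (the function names
are mine and purely illustrative; only plain "read" and "read -r" are
used, no -d or -n):

```shell
# \<newline> is removed, so two physical lines become one record.
join_lines() {
    printf 'one\\\ntwo\n' | { read line; printf '%s\n' "$line"; }
}

# A backslash before a field separator makes it an ordinary character.
escaped_space() {
    printf 'a\\ b c\n' | { read first rest; printf '%s|%s\n' "$first" "$rest"; }
}

# With -r the backslash is not special: the record ends at the first \n.
raw_read() {
    printf 'one\\\ntwo\n' | { read -r line; printf '%s\n' "$line"; }
}

join_lines      # prints: onetwo
escaped_space   # prints: a b|c
raw_read        # prints: one\
```

The two readings only diverge once the delimiter is something other
than '\n', which is the case the man page would need to spell out.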

But this is all fuzzy, because read was intended for text files, that
is, essentially for lines delimited by '\n'; everything else has been
added, if not at random, then by usage (leaving aside that it can't be
a general binary read, since it can't handle the nul byte).

So, it's obviously up to you. But could you state it clearly (not à la
POSIX :-^) in the man page?

>[about option '-n']

Another corner case: with a limit (-n) meaning "stop reading at the
first of EOF, an unescaped delimiter, or that number of bytes read",
what do you do when the last byte read (the one that reaches the count)
is '\\'? Do you always absorb the following byte, even though
"read -n num" then reads num + 1 bytes, which is not what the user
asked for? Or do you leave the stray backslash in the last variable,
convert it to the sequence "\\", or remove it?
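
To show the input I have in mind, here is a sketch that does not use
"read -n" at all (its semantics being the open question); dd merely
stands in for the byte limit, and first_bytes is a name I made up:

```shell
# Hypothetical helper: keep only the first $1 bytes of stdin, the way a
# byte limit would.  Nothing here depends on "read -n".
first_bytes() {
    dd bs=1 count="$1" 2>/dev/null
}

# Record "ab\:cd", with ':' as the intended delimiter and a limit of 3:
# the limit is reached exactly on the backslash.
printf 'ab\\:cd\n' | first_bytes 3
echo
```

The three bytes are a, b and the backslash. Absorbing the next byte
would consume four bytes and store "ab:"; the alternatives store
"ab\", "ab\\" or "ab".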

>[...]
> 
> I have also added -z (currently, for not very important backward compat
> with the current impl) to issue an error if a \0 is encountered in the
> input (other than as the record delimiter).  Inverting the
> sense of that option probably makes more sense (-z to allow \0
> chars, and error without that option).   Either way this is very
> very simple and cheap to implement, as the code has to check for
> the \0 chars anyway.   (The error would cause the read to terminate
> with exit status 2, as does any other error).
> 
> Or that option could just go away again.    Opinions please? (everyone)

IMHO, the reverse: since the nul byte is currently ignored, the user is
not getting what he wants, perhaps without even knowing it. So signal
the problem (an error is better) and force the user to set -z
explicitly, meaning "I'm aware I'm not getting everything".
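
A sketch of what "not even knowing it" looks like. The exact result
depends on the shell (this is an assumption about the shells I know,
not a statement about the new code), but a shell variable cannot hold
the nul byte, so something is always lost without a word:

```shell
# Hypothetical helper: read one line with an embedded nul byte and print
# whatever ended up in the variable.
read_with_nul() {
    printf 'a\0b\n' | { IFS= read -r line; printf '%s' "$line"; }
}

# Three bytes went in before the newline; count what came out.
n=$(read_with_nul | wc -c)
echo "bytes kept: $((n))"
```

The count is less than 3, and the exit status of read gives no hint
that anything was dropped.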

Would it make sense to add a '-Z' option that translates a nul byte
into the sequence '\000' with the specification that such a sequence
is a constant one and is never interpreted, except by printf?
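
A sketch of the round trip I have in mind. -Z itself is only a
suggestion and is not used here; the variable simply holds what such an
option might store, and decode is an illustrative name:

```shell
# What a hypothetical -Z might leave in the variable: the nul written
# out as the four characters \000.
encoded='a\000b'

# printf %b expands backslash escapes in its argument, turning the
# sequence back into a real nul byte.
decode() {
    printf '%b' "$1"
}

n=$(decode "$encoded" | wc -c)
echo "decoded length: $((n))"
```

The variable holds six ordinary characters; only printf turns them
back into three bytes, one of them nul.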

-- 
        Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
                     http://www.kergis.com/
                    http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C

