Subject: bin/18738: tr(1) includes broken example
To: None <gnats-bugs@gnats.netbsd.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: netbsd-bugs
Date: 10/20/2002 08:42:21
>Number: 18738
>Category: bin
>Synopsis: tr(1) includes broken example
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: bin-bug-people
>State: open
>Class: doc-bug
>Submitter-Id: net
>Arrival-Date: Sun Oct 20 05:43:00 PDT 2002
>Closed-Date:
>Last-Modified:
>Originator: der Mouse
>Release: -current
>Organization:
Dis-
>Environment:
Any
>Description:
The tr(1) manpage includes an example

    Translate the contents of file1 to upper-case.

          tr "[:lower:]" "[:upper:]" < file1

which is slightly broken in that it will misbehave if the
character set in use has lowercase letters with no
corresponding uppercase letter, or vice versa. While ASCII
does not have any such, one of the commonest non-ASCII
character sets, ISO 8859-1, does - there is no uppercase
version of 0xff (y with double-dot diacritic). 0xdf (German
ss) might be another example, though I'm not sure.
8859-7 (Greek) is much more likely to be an example; it has at
least three characteristics any one of which is liable to break
that example:
 - I see no uppercase versions of 0xc0 or 0xe0 (iota and upsilon
   with a diacritic I don't know any name for).
 - 0xd3 has two lowercase versions, 0xf2 and 0xf3 (sigma).
 - 0xb6, 0xb8, 0xb9, 0xba, 0xbc, 0xbe, and 0xbf are all
   uppercase, and appear before the body of the uppercase
   alphabet, but their corresponding lowercase versions, 0xdc,
   0xdd, 0xde, 0xdf, 0xfc, 0xfd, and 0xfe, appear partly before
   and partly after the body of the lowercase alphabet. (These
   are vowels with what looks a bit like an acute accent but I
   think is a breathing mark of some sort.)
The manpage says that [:upper:] and [:lower:] are in "ascending
order", but does not clearly indicate whether this means
alphabetical order, codeset numeric order, or something else.
However, as far as I can see no choice of order can finesse an
issue like the two variants of lowercase sigma; the only way to
handle that and still make things like the manpage example work
would be to have [:upper:] include two copies of uppercase
sigma. And not even that helps any with 8859-1's 0xff or
8859-7's 0xc0 and 0xe0, where the set simply doesn't have any
corresponding uppercase character (perhaps because it doesn't
exist; I'm not sure in any of those three cases whether there
exists any uppercase version in the relevant languages). I
suppose you could decree that 8859-1 0xff and 8859-7 0xc0 and
0xe0 are neither uppercase nor lowercase, but quite aside from
violating least surprise, I don't think that could reasonably
be done with the lowercase sigmas.
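
A rough way to see whether a given locale is affected is to count the
single-byte members of each class and compare. This is only a sketch: the
helper names all_bytes and class_count are made up for illustration, it
assumes a single-byte locale (some tr implementations may reject stray bytes
under a multibyte locale), and equal counts do not prove that the two classes
pair up letter for letter. In the C locale both counts come out as 26.

```shell
#!/bin/sh
# Emit every byte value from 1 to 255 once (NUL is skipped).
all_bytes() {
    i=1
    while [ "$i" -le 255 ]; do
        # %b expands the \0ddd octal escape into a single byte.
        printf '%b' "\\0$(printf '%03o' "$i")"
        i=$((i + 1))
    done
}

# Count how many single-byte characters the current locale puts in a class.
# $1 is a class name such as "lower" or "upper".
class_count() {
    n=$(all_bytes | tr -cd "[:$1:]" | wc -c)
    echo $((n))   # arithmetic expansion strips any padding printed by wc
}

lower=$(class_count lower)
upper=$(class_count upper)
echo "lower=$lower upper=$upper"
if [ "$lower" -ne "$upper" ]; then
    echo "class sizes differ; tr '[:lower:]' '[:upper:]' cannot pair them up"
fi
```

Unequal counts are enough to show that the manpage example cannot map every
lowercase character to an uppercase partner in that locale.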
>How-To-Repeat:
Read the manpage. Think about character sets.
>Fix:
Removing the example is the simplest fix, but the most
dangerous, because the note about ordering for [:lower:] and
[:upper:] implies that something very much like that example
could be expected to work. I'd prefer to change the wording of
the example, perhaps something like

    When using a character set with lowercase and uppercase
    versions of all letters appearing in the same order (such as
    ASCII, but not common non-ASCII sets like ISO 8859-1 or
    8859-7), a command such as

          tr "[:lower:]" "[:upper:]"

    can be used to translate data to upper case.

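
For callers who only care about ASCII letters, one way to get the behaviour
the example promises is to pin the locale for the tr invocation. This is a
sketch, not a proposed manpage change; the function name to_upper_ascii is
hypothetical. It sidesteps the problem rather than solving it: in the C
locale only a-z are in [:lower:], so any non-ASCII letters pass through
untouched.

```shell
#!/bin/sh
# Upper-case ASCII letters only, regardless of the caller's locale.
# LC_ALL=C applies to this one tr invocation and nothing else.
to_upper_ascii() {
    LC_ALL=C tr '[:lower:]' '[:upper:]'
}

printf 'hello, world\n' | to_upper_ascii
```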
/~\ The ASCII                           der Mouse
\ / Ribbon Campaign
 X  Against HTML               mouse@rodents.montreal.qc.ca
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
>Release-Note:
>Audit-Trail:
>Unformatted: