Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: der Mouse <mouse@Rodents.Montreal.QC.CA>
List: tech-userlevel
Date: 03/12/2004 14:20:26
> If anyone needs convincing that utf-8 is a *good thing* this should
> do it.
I'd rather just use 32-bit chars. But that's probably just me.
> Now for the netbsd content. UTF-8 is designed so that it should have
> no impact on most programs that touch utf-8 content unless they are
> themselves drawing the screen content or arranging that screen output
> is nicely justified. [...file names...] Things "just work" from
> 8-bit clean programs.
Not when trying to interoperate with output from other 8-bit-clean
programs that use non-UTF-8 encodings (e.g., 8859-1). What will your UTF-8-aware
ls-and-uxterm do with a file named "École" created by an 8859-1 user
program? (Mangle it, almost undoubtedly, since the octet c9 looks like
the first octet of a two-octet UTF-8 sequence but the following octet,
63, is not a valid second octet for such a sequence. "׫foo»×" will
get mangled too, but differently, and the ׫ fundamentally differently
from the »×.)
This differs from the mangling performed by (say) using 8859-8 to
access files named using 8859-1; the latter will show the wrong
characters, but will preserve them. The former will mangle them
irreversibly - that file named École, if read into a UTF-8-name-aware
editor and written back out again, isn't going to be named with the
same octet sequence.
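A minimal sketch of the difference, under two assumptions. First, 8859-5 stands in for 8859-8 as the "wrong" 8-bit charset, because 8859-8 leaves the octet c9 unassigned and a strict table-driven codec would refuse it rather than show a wrong character. Second, the UTF-8-name-aware editor is assumed to substitute U+FFFD for octets it cannot decode; a real editor might do something else.

```python
original = "École".encode("iso8859_1")   # c9 63 6f 6c 65

# Wrong 8-bit charset: the wrong character is shown, but every octet
# maps to some character, so writing the name back restores the octets.
shown = original.decode("iso8859_5")     # stand-in for 8859-8 (assumption)
wrong_charset_roundtrip = shown.encode("iso8859_5")

# UTF-8 consumer that replaces undecodable octets with U+FFFD
# (assumed editor behaviour): the original octet c9 is gone for good.
seen = original.decode("utf-8", errors="replace")
utf8_roundtrip = seen.encode("utf-8")

print(original.hex(), wrong_charset_roundtrip.hex(), utf8_roundtrip.hex())
print(wrong_charset_roundtrip == original)   # octets preserved
print(utf8_roundtrip == original)            # octets changed
```

The 8-bit charsets are octet-for-octet mappings, so a wrong guess only changes what is displayed. The UTF-8 path has to do something with an ill-formed sequence, and whatever it does here is not invertible.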
Of course, this is just a time-delayed version of the interoperability
problems encountered when (say) trying to pipe output from a program
that writes 8859-1 into a program expecting UTF-8, only done by saving
the "output" octet sequences as a file name for the second program to
read.
But yes, I agree that setting a UTF-8 locale should cause programs like
ls to consider UTF-8 octet streams as safe to print, just as setting an
8859-* locale should cause programs like ls to consider 8859-*
printable octet streams as safe to print. (In the case of UTF-8 this
is more involved than it is for 8859; that's not directly relevant.)
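As a rough illustration of the locale-dependent "safe to print" decision: the sketch below takes the locale's charset as an explicit parameter and uses Python's str.isprintable() as the printability test. Both are assumptions for illustration; safe_to_print is a hypothetical name, and a real ls would consult the C locale instead.

```python
def safe_to_print(name: bytes, encoding: str) -> bool:
    """Return True if `name` can be shown as-is under `encoding`.

    Hypothetical helper: `encoding` stands in for the charset of the
    user's locale.
    """
    try:
        text = name.decode(encoding)
    except UnicodeDecodeError:
        # Only multi-octet encodings such as UTF-8 need this step:
        # the octets must form well-formed sequences before any
        # per-character test makes sense.
        return False
    # Rejects control characters (C0 and C1), among others.
    return text.isprintable()


latin1_name = "École".encode("iso8859_1")   # c9 63 6f 6c 65
utf8_name = "École".encode("utf-8")         # c3 89 63 6f 6c 65

for name in (latin1_name, utf8_name):
    for enc in ("iso8859_1", "utf-8"):
        print(name.hex(), enc, safe_to_print(name, enc))
```

Each name is printable only under the charset it was written in: the 8859-1 name is ill-formed as UTF-8, and the UTF-8 name contains the octet 89, which is a control character in 8859-1.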
/~\ The ASCII                           der Mouse
\ / Ribbon Campaign
 X  Against HTML               mouse@rodents.montreal.qc.ca
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B