tech-userlevel archive
Re: wide characters and i18n
Joerg Sonnenberger <joerg%britannica.bec.de@localhost> wrote:
> On Wed, Jul 14, 2010 at 07:38:42PM -0700, Erik Fair wrote:
> > I commend this well written paper to your attention:
> >
> > http://plan9.bell-labs.com/sys/doc/utf.html
>
> ...which is also simplistic in the assumption and problems faced. If you
> want to know about the issues with I18N and Unicode in specific, don't
> ask Americans. Don't ask Europeans either, they only have slightly more
> exposure to the problems.
I suppose you shouldn't ask Australians either, although I'm
Australian and have been mixed up with I18N issues on and off,
including for Asian languages, over the last 20 years or so,
and have seen some of the problems first hand.
Since the Plan 9 URL has been mentioned, I hope it's not too
off topic to say that I concur that that paper is too
simplistic about the advantages of Unicode and UTF-8, and that
the very same problems are present in Google's new Go
language, several of whose designers participated in the
Plan 9 work.
For anyone who's not interested in the gory details of this
sort of stuff, please stop reading now. It only gets uglier:
the world is a complex place. My Japanese friends have even
more objections to Unicode as "one size fits all" than I do,
which I won't attempt to explain here even if I were sure I
remembered them all.
For anyone who is interested in why s/ASCII/Unicode/ isn't
quite enough to write applications for worldwide use (even
worldwide use in a single language, or even worldwide use
only in English!) here are a few points I find left out of
most discussions of Unicode.
The first two are points on which I disagree specifically with
the Plan 9 paper:
1. the decision not to address Unicode combining characters
2. the idea that the use of Unicode is sufficient excuse not
to provide any of the functionality of locales
#1 means applications dealing with arbitrary Unicode data
(whether UTF-8 or not) must handle normalisation before even
being able to compare two strings for equality. (This is
progress?)
Even English has _some_ characters with accents, although they
are rare and English speakers have seemingly become very
tolerant of their loss in the computer age, so this isn't
"just" a problem for European languages. (Never mind the
rudeness of arbitrarily dropping accents from characters in
people's names.)
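
To make the comparison problem in #1 concrete, here is a minimal
sketch in Python using the standard unicodedata module. The helper
name same_text() and the choice of NFC as the default form are my
own assumptions for illustration; nothing here comes from the
Plan 9 paper.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to one normalisation form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"      # 'é' as a single code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)           # False: the code point sequences differ
print(same_text(composed, decomposed))  # True: equal once both are normalised
```

Both strings display identically, yet a plain equality test
treats them as different.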
For #2, the glaring breakages in almost any application are
threefold:
a) how do you sort anything?
Even presuming English-only I'd like dictionary order
sometimes, and other times ASCII for consistency with
other applications or printed material, if it has used
ASCII order.
Non-English languages of course have their own rules
which should be respected, and given the number of
languages in the world and variations in local
preferences it is only practical to allow _users_ to
define collation order if no pre-existing order matches
their preference or has been created for their
language.
b) how can you (ever) localise error messages?
It would be a reasonable argument to say that an error
message catalogue can be implemented independently of
POSIX style locales, but localisation of an application
certainly requires translation of error messages and
indeed most of a typical application's user interface.
c) how do you handle varying date formats?
If I had a dollar (anyone's dollar -- Australian,
Canadian, Singaporean, USD, whatever) for each time
I've seen a date and had to stop and evaluate whether
it was more likely MM/DD/YY or DD/MM/YY I imagine I
could have retired long since.
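
Points (a) and (c) can be illustrated with a short Python sketch.
The dictionary_key() function is a hypothetical, deliberately crude
stand-in for a user-defined collation order (it ignores accents and
case); real applications would more likely use locale-aware
collation such as locale.strxfrm() or a library like ICU. The
sample words and the sample date are made up for the example.

```python
import unicodedata
from datetime import datetime

def dictionary_key(word: str) -> tuple:
    """Hypothetical collation key: ignore accents and case, then
    fall back to the original string as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

words = ["Zebra", "apple", "\u00c9clair", "banana"]

# Plain code point order: upper case first, accented letters last.
code_point_order = sorted(words)
# The crude "dictionary" order defined above.
dictionary_order = sorted(words, key=dictionary_key)

print(code_point_order)   # ['Zebra', 'apple', 'banana', 'Éclair']
print(dictionary_order)   # ['apple', 'banana', 'Éclair', 'Zebra']

# The same date string read under two common conventions.
ambiguous = "03/04/10"
month_first = datetime.strptime(ambiguous, "%m/%d/%y").date()
day_first = datetime.strptime(ambiguous, "%d/%m/%y").date()

print(month_first.isoformat())  # 2010-03-04
print(day_first.isoformat())    # 2010-04-03
```

Neither ordering nor either date reading is "correct" without
knowing what the user expects, which is the information a locale
is meant to carry.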
3. An issue of current day importance (although not relevant
to Plan 9, as it was an operating system) is how file
systems handle Unicode.
For #3 Unix -- in theory -- isn't too bad: most of its file
systems will take a series of bytes, disallowing only '/'
(which is represented as itself in UTF-8 and never appears
inside a multi-byte sequence, so not typically a problem) and
'\0' (a byte UTF-8 produces only for U+0000 itself, so not a
problem either).
Where problems arise is where file systems (such as the
default file system on OS X) transform file names: the file
name you passed as valid UTF-8 to open() or creat() may not be
the same series of bytes you get back when you use readdir()
to examine the files in the directory. This makes for
"interesting times" for any software which wants to store a
list of file names and then access them.
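
The following Python sketch simulates that mismatch without
touching a real file system. It assumes, for illustration only,
a file system that returns names in decomposed form (plain NFD
here; the actual transformation on any given system may differ).
The helper find_listed_name() and the file names are
hypothetical.

```python
import unicodedata
from typing import List, Optional

def find_listed_name(wanted: str, listing: List[str]) -> Optional[str]:
    """Return the entry in `listing` that matches `wanted` once both
    are normalised to NFC, or None if there is no such entry."""
    wanted_nfc = unicodedata.normalize("NFC", wanted)
    for name in listing:
        if unicodedata.normalize("NFC", name) == wanted_nfc:
            return name
    return None

# Name as the application passed it to open(): precomposed 'é'.
requested = "r\u00e9sum\u00e9.txt"

# Simulated readdir() result from a file system assumed to
# store and return decomposed names.
listing = [unicodedata.normalize("NFD", requested), "notes.txt"]

print(requested in listing)                                 # False
print(find_listed_name(requested, listing) == listing[0])   # True
```

Note that matching after normalisation is itself a compromise: on
a file system that stores bytes untouched, two distinct files
could have names that normalise to the same string.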
> Itojun mentioned some of the issues in
> ftp://ftp.itojun.org/pub/paper/itojun-freenix2001-presen.ps.gz
Recommended.
My personal expectation is that -- like it or not -- Unicode
in the form of UTF-8 will be (if it isn't already) "the new
ASCII", but I _do_ wish that language (and operating system)
designers and vendors would:
i. specify the normal form of "their" UTF-8 strings
(and perhaps allow programmers to override the default)
ii. provide support for conversion to and from "foreign"
UTF-8 normalisation forms
iii. handle -- as gracefully as possible -- the existing file
system file name issues, and vendors should be encouraged
(severely, if that's what it takes) to allow file names
in _any_ Unicode encoding, and provide means to read
those file names "as written" (presumably: "as bytes,
trust me, I know what I'm doing") as well as "in my
preferred encoding" and with a choice of errors or "best
effort" conversion where file names are unrepresentable
(e.g. invalid UTF-8 sequence, code point doesn't fit into
UTF-16, etc).
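
As a rough sketch of the interface wished for in (iii), here is
how the "as written", "error" and "best effort" choices might look
in Python. The function decode_name() and the sample file name are
hypothetical; the sketch relies only on the standard bytes.decode()
error handlers. (Python's own os.listdir() already returns bytes
when given a bytes path, which covers the "as bytes, trust me"
case.)

```python
def decode_name(raw: bytes, best_effort: bool = False) -> str:
    """Decode a file name from UTF-8.

    With best_effort=False an invalid sequence raises
    UnicodeDecodeError; with best_effort=True each invalid
    sequence becomes U+FFFD REPLACEMENT CHARACTER.
    """
    return raw.decode("utf-8", errors="replace" if best_effort else "strict")

# A name written in Latin-1: the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

print(raw_name)                                  # the name "as written"
print(decode_name(raw_name, best_effort=True))   # caf\ufffd.txt, shown with a replacement mark

try:
    decode_name(raw_name)
except UnicodeDecodeError as err:
    print("cannot represent name:", err.reason)
```

The best-effort result is lossy, so it is suitable for display
but not for passing back to open().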
Which still leaves open the problem of locales and issues of
multi-lingual documents and applications where a single
Unicode code point really should be drawn with a different
glyph depending upon what language it is being used for, but
I did say at the start of this too-lengthy message that the
issues get ugly.
The problems are hard; naïve (that's "naive" with a diaeresis
above the 'i', in case it was garbled en-route to you)
solutions will always be incomplete. Sweeping the
incompleteness under the carpet with the words "Well, it works
for me" is ... unimpressive.
Giles