tech-userlevel archive
Re: wide characters and i18n
On Jul 15, 2010, at 13:42, David Holland wrote:
> The problem with UTF-8 in Unix is that it doesn't actually solve the
> labeling problem: given comprehensive adoption you no longer really
> need to know what kind of text any given file or string is, but you
> still need to know if the file contains text (UTF-8 encoded symbols)
> or binary (octets), because not all octet sequences are valid UTF-8.
>
> I don't see a viable way forward that doesn't involve labeling
> everything.
If your goal is to reach deterministic file content nirvana, yes, that's the
way to get there, but I'd argue it's an awful lot of work to deal with the M x
N software problem I mentioned. We'd have to add a type field to inodes, which
will trigger a very old debate about whether UNIX files should be just bags of
bytes, and the changes required for the full M x N solution are pervasive and
invasive. The easy counter-argument in an open source OS community is:
"OK, who's going to write and test all that code?"
The Plan 9 people didn't shoot for a utopia - as is often their wont, they
improved the situation a whole lot with a relatively modest effort:
Unicode/UTF-8 covers far more of the space of human writing systems than ASCII
or ISO-8859-1 ever did, and it's "good enough" for a much wider range of
applications than the previous default of ASCII or ISO-8859-1. (Does sort(1)
even work right with ISO-8859-1? The man page in NetBSD 5.0 is silent on that
question, but given where the diacritical characters sit in the ISO-8859-1
code space, I bet it doesn't collate properly with a straight byte-numerical
sort.)
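To make the sort(1) worry concrete: in ISO-8859-1, e-acute is byte 0xE9,
numerically greater than 'z' (0x7A), so a straight byte comparison puts every
accented word after the entire unaccented alphabet. A throwaway comparison of
strcmp(3) (bytewise) against strcoll(3) (locale-aware) shows the difference;
the locale name below is just a guess at an ISO-8859-1 locale that may be
installed on a given system:

#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *a = "\xE9tude";   /* "etude" with e-acute, ISO-8859-1 */
    const char *b = "zebra";

    /* Bytewise: 0xE9 > 'z' (0x7A), so the accented word sorts last */
    printf("strcmp : %s first\n", strcmp(a, b) < 0 ? "etude" : "zebra");

    /* Locale-aware, if such a locale exists on this system */
    if (setlocale(LC_COLLATE, "fr_FR.ISO8859-1") != NULL)
        printf("strcoll: %s first\n",
            strcoll(a, b) < 0 ? "etude" : "zebra");
    return 0;
}

Bytewise, "zebra" wins; with real collation, the accented word should come
first because e-acute collates with 'e'. Whether a given system's strcoll
actually implements that for the locale is a separate question - which is
rather the point.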
The more I ponder this, the more I think that:
1. the ASCII-default status quo isn't good enough any more (and I'm sure our
users in South and East Asia, not to mention Eastern Europe, would agree),
2. Unicode/UTF-8 as a new default offers backward compatibility while
expanding the character space enormously, and without anywhere near as much
work on our software (or as much of a paradigm shift, i.e. breaking "Unix
files are a bag of bytes"); see the encoding sketch after this list,
3. the "change the base software default" approach can allow us to examine and
call out our software's implicit assumptions (e.g. "I'm operating on ASCII" or
"I need to parse these bytes semantically") so that if/when we decide to make a
run a the bigger "let's handle all character sets" M x N problem, we'll know
much better what needs to be done.
4. we even have a "late mover" advantage - the Plan 9 paper describes what
they did, and there's standards work (hopefully sane) that we can use if we
deem it correct.
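On point 2, the backward compatibility is structural, not accidental: every
code point below 0x80 encodes as the identical single byte, and multi-byte
sequences use only bytes with the high bit set, so existing ASCII files are
already valid UTF-8 and no ASCII byte can appear inside a multi-byte
character. A sketch of the encoder (assuming the code point is already
validated, i.e. <= 0x10FFFF and not a surrogate) makes this plain:

#include <stdio.h>

/* Sketch: encode one code point as UTF-8; returns the byte count. */
static int
utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {              /* ASCII: identity mapping */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {      /* two bytes, both >= 0x80 */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                      /* four bytes, up to U+10FFFF */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int
main(void)
{
    unsigned char buf[4];
    int i, n;

    n = utf8_encode(0x41, buf);   /* 'A' -> 41, unchanged */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    printf("\n");
    n = utf8_encode(0xE9, buf);   /* e-acute -> C3 A9 */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    printf("\n");
    return 0;
}

This prints "41" and "C3 A9": the ASCII byte is untouched, and both bytes of
the multi-byte sequence have the high bit set.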
Think of it as stepwise refinement in the direction of character set
processing nirvana. My concern is that if we scope the problem too large by
trying to do everything, we'll never get it done, with lots of Sturm und Drang
in the process.
Erik <fair%netbsd.org@localhost>