tech-userlevel archive
Re: A draft for a multibyte and multi-codepoint C string interface
tlaronde%polynum.com@localhost wrote:
|Not all of Unicode's complexity is gratuitous. I ran into this while
|thinking about the next step for kerTeX: adding Unicode/UTF-8 support
|to TeX.
|
|Example: in the West, we use Arabic digits. Even though the
|"individual" digits are the same in Western languages and in Arabic,
|they belong to distinct character sets and so are not identical. In
|Western languages the digits are in the ASCII range, but in Arabic
|text they should not be, because from the code point one can deduce
|the language, and for example the direction of composition. Hence
|TeX, to take this example, could deduce the direction of composition
|from the Unicode range.
It's even worse, since some languages use different numeral
systems (like base 20), have no concept of the value 0, and/or have
special symbols/characters for certain important numbers, like
"1000" etc. Of course a digittoi() cannot handle these cases (and
afaik Unicode didn't put any effort into this: a digit value is only
defined where a direct mapping is possible).
So, for this, some locale-dependent pre/post parser is or would be
necessary -- I neither know of any implementation that really does
this, nor does the current POSIX / C environment offer a way to
implement such pre/postprocessors. But I also wouldn't really worry
about that, since the Inuit, the Indians and the like have brand-new
writing systems that they didn't invent on their own and which use a
LATIN-ish notation, other languages are dead and buried, and the
rest doesn't matter either -- at least for the computer programs we
are talking about.
--steffen