tech-userlevel archive
Re: A draft for a multibyte and multi-codepoint C string interface
> Also "cooperation" used to have a 'double dot' over the second 'o' to
> indicate that there are two short 'o' sounds.
Yeah, one of the few cases where LATIN SMALL LETTER O WITH DIAERESIS is
actually what Unicode calls it. (In some languages it's an umlaut, not
a diaeresis; in others, it's a separate letter, not a modified o in any
sense except typographically.)
Come to think of it, that's another issue with Unicode, for some
purposes: it not only provides multiple ways to represent some things,
it conflates semantically distinct but typographically identical things
(like `o modified by adding a diaeresis', `o modified by adding an
umlaut', and `distinct letter graphically identical to either of the
foregoing two'). It's a confused mess that sometimes appears to be
designed for typography, drawing typographically significant but
semantically irrelevant distinctions (such as having a separate
codepoint for the fi ligature) and sometimes appears to be designed to
draw semantically important but graphically irrelevant distinctions
(such as having different codepoints for LATIN CAPITAL LETTER A and
GREEK CAPITAL LETTER ALPHA).
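
To make the above concrete, here is a small sketch using Python's standard
unicodedata module (Python is my choice for illustration only; nothing in
this thread depends on it, and the variable names are made up). It shows
two representations of the same o-with-two-dots, the fi ligature that only
compatibility normalization folds away, and the Latin/Greek pair that no
normalization form merges:

```python
import unicodedata

# Two ways to write the same visible character.
precomposed = "\u00f6"   # one codepoint
decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS

print(unicodedata.name(precomposed))
# The strings differ codepoint by codepoint...
print(precomposed == decomposed)
# ...but canonical normalization maps one onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)

# A purely typographic distinction: the fi ligature has its own codepoint.
fi_ligature = "\ufb01"
print(unicodedata.name(fi_ligature))
# Canonical normalization keeps it; compatibility normalization expands it.
print(unicodedata.normalize("NFC", fi_ligature) == fi_ligature)
print(unicodedata.normalize("NFKC", fi_ligature) == "fi")

# A semantic distinction with no graphical difference in most fonts.
latin_a = "A"
greek_alpha = "\u0391"
print(unicodedata.name(latin_a), "/", unicodedata.name(greek_alpha))
# No normalization form treats these two as the same character.
print(unicodedata.normalize("NFKC", greek_alpha) == latin_a)
```

Note that nothing here can tell a diaeresis from an umlaut: both come out
as U+00F6, which is the conflation being complained about.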
And then there are cases where it's not possible to know how a glyph
(and/or codepoint, e.g., 0xe6 in 8859-1 or U+00E6 in Unicode) should be
handled without knowing the language in question. `æ', to continue
that example, is just a typographical frill in English, somewhat akin
to tlaronde's description of oe in French (`encyclopædia' and
`encyclopaedia' are linguistically the same thing), but it is a distinct
letter, with its own position in the alphabet and everything, in Danish
or Norwegian.
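
As a rough illustration of why the language matters, here is a
deliberately simplified Python sketch of sorting the same words under two
conventions. The key functions, the word list, and the hard-coded alphabet
are all hypothetical; real collation (locale tables, the old Danish
treatment of aa, and so on) is considerably more involved than this:

```python
# Danish alphabet order: the three extra letters come after z.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
DANISH_RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

def danish_key(word):
    # Treat æ as its own letter with its own rank; anything outside the
    # alphabet sorts last (an assumption made to keep the sketch short).
    return [DANISH_RANK.get(ch, len(DANISH_ALPHABET)) for ch in word.lower()]

def english_key(word):
    # Treat æ as a typographical variant of the two letters "ae".
    return word.lower().replace("æ", "ae")

words = ["ærlig", "zebra", "af", "abe"]

print(sorted(words, key=english_key))  # ærlig lands between abe and af
print(sorted(words, key=danish_key))   # ærlig lands after zebra
```

The input strings are identical in both calls; only the assumed language
changes where `ærlig' ends up.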
/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mouse%rodents-montreal.org@localhost
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B