tech-userlevel archive
Re: wide characters and i18n
See, I think I am understanding things, but then ....
>> b) doesn't use NUL ('\0'),
>
>Wrong. It uses a 0x00 octet (which is what I assume you're talking
>about) to represent U+0000. It does not use a 0x00 octet under any
>other conditions, though.
Okay ... I'm not up on the nomenclature. U+0000 means ... a particular
Unicode code point? I guess that's the Unicode NULL character, according
to what I've seen online.
I guess the real question is ... I'm used to C-style strings, where I
don't have to care about the length, but 0x00 is the terminator. Can
I still do that with Unicode? I mean, I see that U+0000 is a valid
Unicode code point, but it's not actually anything PRINTABLE, right?
Sure, I should be passing around lengths to everything, but I'm just
thinking of the amount of code that would need to be changed.
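As a rough sketch of why NUL-terminated handling can keep working with UTF-8: since a 0x00 octet only ever encodes U+0000, strlen() still finds the end of the string, but it counts octets, not characters. The helper name utf8_codepoints below is made up for illustration, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string. Continuation octets have the bit pattern 10xxxxxx, so every
 * octet that does NOT match that pattern starts a new code point.
 * Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "na\xc3\xaf" "ve" (the word "naive" with a diaeresis on the i), strlen() reports 6 octets while utf8_codepoints() reports 5 code points. The terminator logic itself does not need to change.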
>> But this brings up some possibly dumb questions: say I have a UTF8
>> byte sequence I want to display on standard out; do I simply use
>> printf("%s") like I have always been? Do I have to do something
>> different? If so, what?
>
>"That depends". It depends on whether printf tries to be smart (most
>printfs I'm familiar with treat strings as opaque octet sequences for
>things like %s, but I'd be surprised if there weren't some that went to
>the trouble to process characters rather than octets). It depends on
>how the octet sequence produced by your program is interpreted
>(terminal or terminal emulator handling UTF-8 or 8859-1 or what). It
>depends on what exactly you mean by "display on standard out", too.
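To illustrate the "opaque octet sequence" case described above, here is a small check, assuming a printf family that does not try to interpret %s arguments as characters (the behaviour the C standard describes for %s without the l modifier). The function name format_passes_bytes_through is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical check: format a UTF-8 string with "%s" into a buffer and
 * report whether the octets came through unchanged. Returns 1 if so,
 * 0 otherwise (including when the string does not fit in the buffer). */
int format_passes_bytes_through(const char *utf8)
{
    char buf[256];
    size_t len = strlen(utf8);
    int n = snprintf(buf, sizeof buf, "%s", utf8);

    if (n < 0 || (size_t)n != len || len >= sizeof buf)
        return 0;
    return memcmp(buf, utf8, len) == 0;
}
```

If this returns 1, printf("%s") is only moving octets; whether they show up as the intended characters is then entirely up to whatever reads standard output.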
I'm just thinking of the basic example of, "I want my command-line program
to print out something to the defined Unix standard output", which is what
most of them do. From what people are saying ... there's not really a way
of telling, today, if your terminal supports UTF-8, or 8859-1, or anything
else (unless it's embedded in locale information, somehow).
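For the "embedded in locale information" case, a minimal sketch using the POSIX setlocale() and nl_langinfo(CODESET) calls might look like the following. The function names are made up for illustration. Note the assumptions: this reports what the user's locale settings claim, not what the terminal actually does, and the exact spelling of the codeset name ("UTF-8" here) may differ between systems.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: adopt the locale from the environment (LANG,
 * LC_ALL, LC_CTYPE) and return the name of its character encoding. */
const char *locale_codeset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: 1 if the locale claims UTF-8, else 0.
 * Assumes the system spells the codeset name exactly "UTF-8". */
int locale_is_utf8(void)
{
    return strcmp(locale_codeset(), "UTF-8") == 0;
}
```

A program could call locale_is_utf8() once at startup and decide whether to emit UTF-8 or fall back to something else.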
Also, Aleksej says:
>Sorry, this is wrong. This assumes that you don't use anything ASCII
>compatible (more or less). I do, and "UTF-8 by default" will cause major
>pain to me and to many users here.
>
>The main reason for it is that UTF-8 wastes half of bandwidth on wire,
>and some of NetBSD tools don't tolerate long file names. E.g. pax.
>I meet border cases already, and UTF-8 by default will double on-wire
>length of file names in consideration.
This brings up a couple of questions:
- Isn't UTF-8 already ASCII compatible?
- How does UTF-8 waste half of the bandwidth?
- What would you prefer we do instead?
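For reference on the first two questions, here is a sketch of how many octets UTF-8 needs per code point. The function name utf8_len is made up, and it does not reject surrogates or values above U+10FFFF. My guess (an assumption, not something Aleksej said) is that the "half" refers to text kept in an 8-bit encoding, where each non-ASCII letter is one octet but becomes two in UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: number of octets UTF-8 uses to encode the code
 * point cp. No validation of cp is attempted. */
size_t utf8_len(unsigned long cp)
{
    if (cp < 0x80)
        return 1;   /* ASCII range: identical to the ASCII octet */
    if (cp < 0x800)
        return 2;   /* e.g. Latin-1 supplement, Greek, Cyrillic */
    if (cp < 0x10000)
        return 3;   /* rest of the Basic Multilingual Plane */
    return 4;       /* supplementary planes */
}
```

So utf8_len('A') is 1, which is the ASCII compatibility, while utf8_len(0x0416) (CYRILLIC CAPITAL LETTER ZHE) is 2, which would double a file name written entirely in such letters.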
--Ken