tech-userlevel archive
Re: wide characters and i18n
On Fri, 16 Jul 2010 10:56:36 -0400
Ken Hornstein <kenh%pobox.com@localhost> wrote:
> But this brings up some possibly dumb questions: say I have a UTF8
> byte sequence I want to display on standard out; do I simply use
> printf("%s") like I have always been? Do I have to do something
> different? If so, what?
>
That's the good thing about UTF-8: you can treat it as a sequence of
ordinary char objects. If your terminal supports UTF-8, then any sequence
of non-ASCII characters should be displayed correctly.
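For example, a minimal sketch (the string literal is just an
illustration) that prints a UTF-8 byte sequence with nothing more than
printf("%s"):

#include <stdio.h>

int main(void)
{
    /* "café" encoded as UTF-8 bytes; printf() passes them through untouched. */
    const char *s = "caf\xc3\xa9";

    printf("%s\n", s);
    return 0;
}

The C library never interprets the bytes here; the terminal does the
decoding.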
> Sad Clouds suggested using wchar_t (and I am assuming functions like
> wprintf()) everywhere. I see the functions to translate character
> strings into wchar_t ... but what do I use if I know that I have
> UTF-8? And the reason I asked earlier about locale is that the
> locale affects the way the multibyte character routines behave, which
> makes me think that the locale setting affects the encoding all of
> those routines are using.
I use wchar_t when I need to know that each character is represented by
a fixed-size object. That way you can take a pointer to a string and
look at every character individually just by incrementing the pointer.
Sometimes I scan from left to right, but occasionally I need to go from
right to left. For example, if you have a filename:
some_long_file_name.txt
To quickly extract the suffix '.txt' you just scan the string from
right to left until you hit the '.' character. I think with UTF-8 this
kind of string manipulation would be quite messy and you would have to
use a special library that understands UTF-8 encoding, etc.
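A minimal sketch of that right-to-left scan over a wchar_t string
(converting with mbstowcs() first; the filename and buffer size are
just illustrative):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    const char *name = "some_long_file_name.txt";
    wchar_t wname[256];

    /* Convert the multi-byte filename into fixed-width wchar_t objects. */
    if (mbstowcs(wname, name, sizeof(wname) / sizeof(wname[0])) == (size_t)-1)
        return 1;

    /* Walk backwards one character at a time until we hit '.'. */
    const wchar_t *p = wname + wcslen(wname);
    while (p > wname && *p != L'.')
        p--;

    if (*p == L'.')
        wprintf(L"suffix: %ls\n", p);   /* prints ".txt" */
    return 0;
}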
The multi-byte conversion functions are affected by the current locale.
Normally you call
setlocale(LC_CTYPE, "");
at the start of your program and never change the locale while it runs.
Setting the empty locale makes the multi-byte conversion functions query
the user's locale environment variables and perform conversions based on
those. So different users can use different locales, which may result in
different character encoding schemes, but the C library's wide-character
functions should handle that transparently.
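A minimal sketch of that pattern (the byte string is illustrative and
assumes the user runs in a UTF-8 locale):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Pick up the user's LANG/LC_CTYPE settings once, at startup. */
    if (setlocale(LC_CTYPE, "") == NULL) {
        fprintf(stderr, "cannot set locale from environment\n");
        return 1;
    }

    /* How these bytes are interpreted depends entirely on the locale set
     * above: in a UTF-8 locale they decode as "grüß dich", elsewhere they
     * may decode to different characters or be rejected. */
    const char *mb = "gr\xc3\xbc\xc3\x9f dich";

    /* With a NULL destination, mbstowcs() just reports how many wide
     * characters the string would convert to under the current locale. */
    size_t n = mbstowcs(NULL, mb, 0);
    if (n == (size_t)-1)
        fprintf(stderr, "invalid multi-byte sequence in this locale\n");
    else
        printf("%zu wide characters\n", n);
    return 0;
}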
There are two problems with C wide characters:
1. Switching to a different locale while the program is running is not
thread-safe and may result in weird errors. In practice this means you
can only use one locale for the lifetime of the program.
2. The interfaces of the C library's multi-byte to wide and wide to
multi-byte conversion functions are so badly designed it's not even
funny. The biggest problem with those functions is that they expect
NUL-terminated strings. If you have a partial (not NUL-terminated)
string in a buffer, you can't call a string conversion function on it,
because it won't stop until it finds a NUL and you end up with a buffer
overrun. You also cannot "artificially" NUL-terminate the string,
because after reading the NUL character the function resets the
mbstate_t object to its initial state, which will mess up the next
sequence of multi-byte characters if the encoding is stateful.
I spent two days jumping through hoops trying to figure out how to
convert partial strings. I think I nailed it in the end with a 30%
performance penalty, but it is still 3.5 times faster than iconv().
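Roughly, the approach (just a sketch, not the actual wrappers):
mbrtowc() takes an explicit byte count, reports an incomplete trailing
sequence with (size_t)-2, and carries the shift state across calls in
the mbstate_t, so it never needs a terminator in the input:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Convert up to 'len' bytes of 'buf' (not NUL-terminated) into 'out'.
 * Returns the number of bytes consumed.  An incomplete sequence at the
 * end of the buffer is stored in 'state' and completed on the next call. */
static size_t
convert_partial(const char *buf, size_t len, wchar_t *out, size_t outmax,
    mbstate_t *state)
{
    size_t used = 0, produced = 0;

    while (used < len && produced < outmax) {
        size_t r = mbrtowc(&out[produced], buf + used, len - used, state);
        if (r == (size_t)-2) {  /* incomplete tail: stashed in 'state' */
            used = len;
            break;
        }
        if (r == (size_t)-1)    /* invalid sequence: give up */
            break;
        if (r == 0)             /* embedded NUL: treat as end of input */
            break;
        used += r;
        produced++;
    }
    out[produced] = L'\0';
    return used;
}

int main(void)
{
    setlocale(LC_CTYPE, "");

    /* UTF-8 for "café!", deliberately split in the middle of the two-byte
     * 'é' (assumes a UTF-8 locale). */
    const char part1[] = "caf\xc3";
    const char part2[] = "\xa9!";
    wchar_t out[16];
    mbstate_t st;

    memset(&st, 0, sizeof(st));
    convert_partial(part1, 4, out, 15, &st);
    wprintf(L"after first chunk:  \"%ls\"\n", out);   /* "caf" */
    convert_partial(part2, 2, out, 15, &st);
    wprintf(L"after second chunk: \"%ls\"\n", out);   /* "é!" */
    return 0;
}

The caller keeps feeding buffers and the conversion picks up exactly
where it left off, without ever needing a NUL in the input.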
If anyone is interested, I can post the code for the wrapper
functions...