tech-userlevel archive
Unicode programming
Greetings all,
I want to preface this right up front by saying that these questions aren't
technically NetBSD-related (although they involve software that is portable
to NetBSD), but I know there are a bunch of people here who are much smarter
than I am and who understand all of this stuff, and since we've talked about
it here before, I hope this won't be out of line.
I've read up on Unicode, and of course I've read the stuff that was on this
mailing list last year (great help!). I have some additional questions that
I was hoping someone would be willing to answer.
Let's assume I have an application that I want to add Unicode support
to; let's also assume that in this application, whenever I get a
sequence of bytes, I already know what encoding those bytes are in
(it won't necessarily be UTF-8). This is a command-line application,
so I'm going to punt the heavy lifting in terms of displaying Unicode
glyphs to something else like xterm. I'm at the Plan 9 level of
Unicode support; by that I mean I mostly only care about stuff in the
Basic Multilingual Plane, and I'm not worried about text with different
orientations.
- I'm aware of the multibyte functions like mbrtowc(), and I know that
how these functions interpret their input depends on the encoding set
in your environment. But what I don't quite see is what these
functions are supposed to output in terms of "wide characters"; it seems
like this is unspecified. I gather that if the C language implementation
defines the macro __STDC_ISO_10646__ then you know that "wide" characters
are Unicode codepoints. If that macro isn't defined ... then I guess
what wide characters are is undefined? Is that correct?
- Assuming the above is correct ... what do programmers do in terms of
parsing things like UTF-8 into Unicode codepoints, since you don't
necessarily know that mbrtowc() will give you a Unicode codepoint on
some (looks like many) systems? I guess iconv() looks like something
that handles a lot of encodings, and it seems to be available in lots of
places; I'm also aware of ICU. I'm also wondering what people do about things
like finding out how many columns a particular series of Unicode codepoints
occupies; I know about things like wcswidth(), but again you're not
guaranteed that wide characters are Unicode codepoints.
- Internally to your programs, do you use UTF-8 as your representation?
UTF-16? UTF-32? I know, this depends on what you're doing; I'm just
trying to get a sense of what is common.
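To make the second question concrete: the fallback I can imagine is decoding
UTF-8 by hand, so the result doesn't depend on what wchar_t happens to mean.
Below is a minimal sketch of that idea, not a definitive implementation. The
names wchar_is_unicode() and utf8_decode() are made up for illustration, and
it only covers UTF-8; other encodings would still need something like iconv().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if the C implementation promises that
 * wchar_t values are ISO 10646 (Unicode) code points, 0 otherwise.
 */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/*
 * Hypothetical helper: decode one UTF-8 sequence starting at s, with n
 * bytes available, into *cp. Returns the number of bytes consumed (1..4),
 * or 0 if the input is empty, truncated, or not valid UTF-8.
 */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    uint32_t c, min;
    size_t len, i;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                      /* plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or bad lead byte */
    }

    if (n < len)
        return 0;                           /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, values past U+10FFFF, and surrogates. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

A caller would loop over the buffer, advancing by the return value and
treating 0 as a decoding error. wchar_is_unicode() only reports what the
compiler claims; when it returns 0, the uint32_t values from utf8_decode()
are still Unicode codepoints, but passing them to wcwidth() as wchar_t is not
guaranteed to be meaningful.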
Thanks for any advice you can give me,
--Ken