Subject: Re: codeset recoding engine
To: Erik Bertelsen <erik@mediator.uni-c.dk>
From: None <itojun@iijlab.net>
List: tech-kern
Date: 11/14/1999 17:56:22
I did not have enough coffee. Let me rephrase.
>Please be careful about the terminology: In my understanding, UTF-8 is -not-
>a character code (character set), but an encoding of multibyte characters into
>a sequence of bytes that are safely transmittable over a pure 7-bit ASCII
>channel.
>
>UTF-8 may be used to encode characters in several character codes (sets), e.g.
>LATIN-1 and UNICODE. Note that even for LATIN-1, UTF-8 is not the identity mapping.
This statement is not really true. You seem to assume UCS-4 here.
(Note that Latin-1 has a 1-to-1 mapping with the first 256 code points
of UCS-2 or UCS-4.)
If this observation is wrong, correct me...
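To make the Latin-1 point concrete, here is a minimal sketch in Python. Python and its built-in codecs are only a convenient illustration and are not part of the discussion above. It shows that a Latin-1 byte value equals the UCS code point, while the UTF-8 byte sequence for the same character is different once the value is above 0x7f.

```python
# Latin-1 byte 0xE9 is LATIN SMALL LETTER E WITH ACUTE.
latin1_bytes = bytes([0xE9])

# Decoding Latin-1 yields a code point numerically equal to the byte value,
# i.e. Latin-1 maps 1-to-1 onto the first 256 UCS code points.
text = latin1_bytes.decode("latin-1")
code_point = ord(text)

# UTF-8 encodes that same code point as two bytes, so UTF-8 is not the
# identity mapping for Latin-1 text.
utf8_bytes = text.encode("utf-8")

print(hex(code_point))   # 0xe9
print(utf8_bytes.hex())  # c3a9
```

So the 1-to-1 relation is between Latin-1 and UCS code points; UTF-8 is a separate step that serializes those code points into bytes.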
>I also think (but am not 100% sure) that UTF-8 is able to encode full ISO 10646
>characters if needed.
In the above, you seem to say that you are going to assume UCS-4.
Please don't ever, ever hardcode anything to UCS-2 (ISO 10646) or
UCS-4.
There are character sets that contain characters that cannot be
converted into characters in UCS-2 or UCS-4. Hence, you can't
put those characters into a UTF-8 stream.
> >> > I think you need two conversons:
> >> > kernel: filesystem-charset to utf-8
> >> > then
> >> > userland: utf-8 to LC_CHARSET.
The above two-step conversion assumes the following:
- every character in any character set can be converted into UCS-4.
  In other words, you are assuming that there will be no information
  loss in the "filesystem-charset -> utf-8" conversion.
- the locale library uses UCS-4 as the internal encoding for wchar_t
  (or, every internal encoding for rune_t in the BSD runelocale
  library is UCS-4).
Neither of these assumptions holds.
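The information-loss concern in the first assumption can be sketched with a toy model. Everything in it is hypothetical: the code values, the table FS_TO_PIVOT, the REPLACEMENT value, and the function to_pivot are invented for illustration and do not describe any real character set or kernel interface. The point is only that when the pivot encoding has no equivalent for some source characters, distinct file names can become indistinguishable after the first conversion step.

```python
# Hypothetical mapping from a filesystem charset to the pivot encoding.
# Source codes 0x03 and 0x04 deliberately have no pivot equivalent.
FS_TO_PIVOT = {0x01: 0x1000, 0x02: 0x1001}

# Hypothetical stand-in emitted when a source character has no mapping.
REPLACEMENT = 0xFFFD


def to_pivot(fs_codes):
    """Convert a name from the filesystem charset to the pivot encoding."""
    return [FS_TO_PIVOT.get(code, REPLACEMENT) for code in fs_codes]


# Two different names on disk ...
name_a = [0x01, 0x03]
name_b = [0x01, 0x04]

# ... become the same sequence after conversion, so the original
# names can no longer be told apart or recovered.
collide = to_pivot(name_a) == to_pivot(name_b)
print(collide)  # True
```

Rejecting unmappable characters instead of substituting them avoids the collision, but then such files cannot be named through the pivot at all.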
Also, there's no good way for the runelocale library to handle
characters outside of what LC_CHARSET is capable of handling (for
example, if you mount a Chinese filesystem while your LC_CHARSET is
for Japanese, you will be in big trouble).
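The Chinese/Japanese case can be checked with a small Python sketch. Python's codecs are used here only as a convenient way to test charset membership; they pivot through Unicode internally, which is an implementation detail of the illustration and not a recommendation. The choice of "gb2312" for the Chinese filesystem and "euc_jp" for the Japanese locale is an assumption made for the example.

```python
# A character that is common in Simplified Chinese text (U+8BF4).
ch = "\u8bf4"

# It is representable in a Chinese charset such as GB 2312 ...
gb_bytes = ch.encode("gb2312")

# ... but a Japanese locale charset such as EUC-JP has no code for it.
try:
    ch.encode("euc_jp")
    representable = True
except UnicodeEncodeError:
    representable = False

print(representable)
```

Whatever conversion path is used, a file name containing such a character has no faithful representation in the Japanese locale charset.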
itojun