Subject: Re: codeconv v3 - kernel code set recoding engine
To: None <dolecek@ics.muni.cz>
From: Noriyuki Soda <soda@sra.co.jp>
List: tech-kern
Date: 03/08/2000 01:04:50
> > For example:
> > (1) VFAT vs SJIS userland.
> > codeconv_t *k2u = codeconv_open("UTF-16LE", "SJIS");
> > codeconv_t *u2k = codeconv_open("SJIS", "UTF-16LE");
> > (2) SJIS MS-DOS fs (not VFAT, but FAT) vs UTF-8 userland:
> > codeconv_t *k2u = codeconv_open("SJIS", "UTF-8");
> > codeconv_t *u2k = codeconv_open("UTF-8", "SJIS");
>
> Did FAT really use SJIS? EUC-encoded?
FAT on Japanese MS-DOS doesn't support eucJP; it only supports SJIS.
> I always thought that FAT supports only a subset of ASCII - namely
> [A-Z0-9_-?] + one dot.
FAT on Japanese MS-DOS (before the VFAT age) only supports SJIS.
Another surprising thing is that SJIS FAT file names can contain "\" (0x5c)
as a pathname character (as the second byte of a kanji).
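To illustrate the 0x5c problem, here is a minimal sketch of a separator
scan that is aware of SJIS two-byte characters. The function names are
hypothetical (they are not part of codeconv), and the lead-byte ranges
assumed are the usual Shift_JIS ones, 0x81-0x9f and 0xe0-0xfc. For
example, the kanji 0x95 0x5c has 0x5c as its second byte, and a naive
byte-wise search for "\" would split the name there.

```c
#include <stddef.h>

/* Nonzero if b is a Shift_JIS lead byte (first byte of a two-byte
 * character). Ranges assumed: 0x81-0x9f and 0xe0-0xfc. */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
}

/*
 * Hypothetical helper: return the offset of the first real '\'
 * separator in an SJIS path of length len, or -1 if there is none.
 * A 0x5c that is the second byte of a two-byte character is skipped.
 */
static long sjis_find_backslash(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        if (sjis_is_lead(s[i]) && i + 1 < len) {
            i += 2;             /* skip the whole two-byte character */
            continue;
        }
        if (s[i] == 0x5c)
            return (long)i;     /* a genuine separator */
        i++;
    }
    return -1;
}
```

With the bytes 0x95 0x5c 0x5c 'a', the scan reports the separator at
offset 2, not at offset 1.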
> > (3) NFSv4 with UTF-8 vs SJIS userland
> > codeconv_t *k2u = codeconv_open("UTF-8", "SJIS");
> > codeconv_t *u2k = codeconv_open("SJIS", "UTF-8");
> > I think there is no reason to use one codeconv_t for opposite
> > direction conversion.
>
> As I said, I thought it would be convenient. That's the only
> reason I've done it this way for now :)
If that's the only reason, please don't do it that way.
> > No, it does cost.
> > There are cases that only one direction conversion is needed.
>
> But typically, caller would need conversion in both directions,
> so why not provide it with what is commonly needed ?
That assumption is wrong.
For example, Japanese console I/O often requires conversion in only one
direction (i.e. for output only), because the input side is covered by a
userland input method. (An input method typically has a process size of
more than 1MB and a dictionary of more than 5MB.)
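A rough sketch of what I mean, assuming one handle per direction. The
struct layout, the toy codeconv_open() body and console_conv_init() are
all made up for illustration; only the codeconv_open(from, to) calling
shape comes from the examples above.

```c
#include <stdlib.h>

/* Hypothetical one-direction handle; the real layout is not specified. */
typedef struct codeconv {
    const char *from;   /* source code set name */
    const char *to;     /* target code set name */
} codeconv_t;

/* Toy stand-in for codeconv_open(): one handle describes one direction. */
static codeconv_t *codeconv_open(const char *from, const char *to)
{
    codeconv_t *cc = malloc(sizeof(*cc));

    if (cc != NULL) {
        cc->from = from;
        cc->to = to;
    }
    return cc;
}

/* Hypothetical console state: only the output direction is ever opened. */
struct console_conv {
    codeconv_t *out;    /* userland code set -> display code set */
    codeconv_t *in;     /* stays NULL: input is done by the userland
                         * input method, not by the kernel */
};

/* Returns 0 on success, -1 if the output converter cannot be opened. */
static int console_conv_init(struct console_conv *c,
    const char *user, const char *display)
{
    c->out = codeconv_open(user, display);
    c->in = NULL;
    return c->out != NULL ? 0 : -1;
}
```

A caller that needs both directions simply opens two handles, so nothing
is lost, and a caller like the console does not pay for the direction it
never uses.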
> Furthermore, separate codeconv_enc() & codeconv_dec() (or whatever
> they would be named) provide better type checking, FWIW.
No.
	u = codeconv_k2u(cc, k);
	k = codeconv_u2k(cc, u);
isn't any different from
	u = codeconv(k2u_cc, k);
	k = codeconv(u2k_cc, u);
as far as type checking is concerned.
> > IMHO, passing endianness is the wrong abstraction. Why is passing
> > endianness needed when a more general function like iconv(3) doesn't
> > need it?
>
> I imagine there might be other options which might be "configurable"
> per-codeconv and usable for several code sets. But using unique
> code set name (like "Unicode-LE") is also ok.
Mm, "Unicode-*" is a bad name, too. :-)
There are many Unicode variants, e.g.
	UTF-7
	UTF-8
	UTF-16 little endian
	UTF-16 big endian
	UTF-16 with byte order mark
So, please don't use just "Unicode"; please use "UTF-16XX" or
something similar.
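As a small illustration of why the name has to say which variant is
meant, here is a sketch that stores the same code point in the two
UTF-16 byte orders. The helper names are hypothetical, and for brevity
they only handle code points in the BMP (no surrogate pairs).

```c
#include <stdint.h>

/* Store a BMP code point as UTF-16 little endian: low byte first. */
static void utf16le_put(uint16_t cp, unsigned char out[2])
{
    out[0] = (unsigned char)(cp & 0xff);
    out[1] = (unsigned char)(cp >> 8);
}

/* Store a BMP code point as UTF-16 big endian: high byte first. */
static void utf16be_put(uint16_t cp, unsigned char out[2])
{
    out[0] = (unsigned char)(cp >> 8);
    out[1] = (unsigned char)(cp & 0xff);
}
```

U+3042 becomes the bytes 0x42 0x30 in one case and 0x30 0x42 in the
other, so a code set name like "Unicode" alone cannot tell the converter
what is on disk.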
> > It makes sense to use/share same function and implementation for NTFS
> > and Joliet extension.
> > But it doesn't make sense to implement it on codeconv layer.
>
> To me, it makes good sense - codeconv has all information it needs.
> It knows both the "source" and "target" code set. It knows best how to
> compare individual codes in a string.
Hmm, I'll try to think of a better way to define name comparison
functions. Could you wait for a while?
> > Case-folded comparison is much more difficult than you think.
> > For example, I've heard that there is a difference between MS-Windows
> > 98 and MS-Windows NT about filename comparison. (e.g. handling of
> > Cyrillic characters)
>
> Well, we don't need to emulate case comparison as done by specific
> operating systems - we can do it right :) The only case where code
> depends on case folded comparison is in NTFS - file names in NTFS
> directory are indexed case-insensitively.
No. (At least not for case conversion functions.)
If we don't do it the same way as the original OS, we might create a
filename which cannot be accessed from the original OS. :-<
> > you cannot use following codeconv_t:
> > codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE");
> > rather, you have to use this:
> > codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE-Win95");
> > for Windows 98
> > codeconv_t *cc = codeconv_open("SJIS", "UTF-16LE-WinNT");
> >
> > Do you really want to do this?
>
> If Win95 Unicode and WinNT Unicode are really different, we need to do
> this anyway, as you've noted in a followup mail.
Yup.
But that doesn't mean the code conversion layer should support case folding.
--
soda