tech-userlevel archive
Re: localeio
hi,
> > 1. please don't install LC_* those codeset is not supported by
> > iconv_open(3) yet,
> > such as ISCII-DEV(LC_CTYPE that i maintain keep this rule).
>
> Fine, I have no problem with this. Do you have such a list?
as far as i can tell from a quick look:
* be_BY.CP1131
  iconv(3) is ok, but LC_CTYPE is not.
  we can get be_BY.CP1131's LC_CTYPE source from FreeBSD too,
  but the only request we got was for CP1251.
  http://mail-index.netbsd.org/tech-userlevel/2006/03/14/0000.html
  (hi, cheusov!)
* am_ET.UTF-8, he_IL.UTF-8, mn_MN.UTF-8
  LC_CTYPE support is missing.
  yes, we can add an en_US.UTF-8 -> {am_ET,he_IL,mn_MN}.UTF-8 alias.
* hi_IN.ISCII-DEV
  LC_CTYPE and iconv(3) support are missing.
  i need a conversion table, which i have been looking for.
* zh_CN.GB2312
  zh_CN.GB2312 should be an alias of zh_CN.eucCN;
  this is a redundancy in FreeBSD.
> Hmm, I'm not sure that anything that we do will be compatible with
> GNU's ideas. Do we have to be constrained by GNU?
but it seems that the Free Standards Group / Linux Standard Base has come to
have influence in ISO/IEC WG14 (C) and WG15 (POSIX).
# some glibc2 extensions have become ISO/IEC Technical Reports (such as TR24731-2).
> > at this point, no magic, no version controlled locale-db format
> > is not good idea.
>
> I'm still not convinced. A version might make some things easier but
> will also add complexity. I'm still not convinced 100% of the benefit.
> However, I'm starting to lean that way. I think I may have a method
> of encoding at least a rudimentary header with standard tools.
i can't remember well, but glibc2 has wide-string versions of some of
LC_*'s string fields (if i'm wrong, correct me).
if that is reasonable for implementing our libc's locale functions, we might do so too.
yes, i know those fields can be generated *on the fly* with mbsrtowcs(3)
at the time setlocale(3) is called.
but string -> wide-string conversion is costly at run time.
alternatively, localedef(1) can generate those wide-string fields and store
them in the locale-db in advance.
i prefer the latter, but wide-strings in the locale-db may require care for
byteorder(3) etc.
that's why i proposed using the citrus_db* stuff.
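
to make the two options above concrete, here is a minimal sketch in C. it is not taken from any libc: field_to_wide() and wide_to_be32() are hypothetical names, and the 32-bit big-endian layout is only one assumed way to keep a stored wide-string independent of the host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/*
 * Convert a multibyte string (in the current LC_CTYPE encoding) to wide
 * characters, as could be done "on the fly" at setlocale(3) time.
 * Returns the number of wide characters stored (excluding the
 * terminator), or (size_t)-1 on an invalid sequence.
 */
size_t
field_to_wide(const char *mbs, wchar_t *dst, size_t dstlen)
{
    mbstate_t st;
    const char *src = mbs;
    size_t n;

    assert(mbs != NULL && dst != NULL && dstlen > 0);
    memset(&st, 0, sizeof(st));
    n = mbsrtowcs(dst, &src, dstlen - 1, &st);
    if (n == (size_t)-1)
        return n;
    dst[n] = L'\0';     /* always terminate, even when truncated */
    return n;
}

/*
 * Hypothetical on-disk encoding: store each wide character as a 32-bit
 * big-endian value so the file does not depend on host byte order.
 * Returns the number of bytes written (4 per character).
 */
size_t
wide_to_be32(const wchar_t *ws, size_t n, unsigned char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t v = (uint32_t)ws[i];

        out[4 * i + 0] = (unsigned char)(v >> 24);
        out[4 * i + 1] = (unsigned char)(v >> 16);
        out[4 * i + 2] = (unsigned char)(v >> 8);
        out[4 * i + 3] = (unsigned char)v;
    }
    return 4 * n;
}
```

the first function is the run-time cost paid on every setlocale(3) call; the second is the kind of extra care a precompiled wide-string field would need. note that with a CSI wchar_t the stored values are only meaningful to the encoding module that produced them.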
> > to introduce flexibility, i think it's better to use key-value pair db
> > format.
> > src/lib/libc/citrus/citrus_db*.[ch] stuff may good for this purpose.
> > # easy to use as match as plain-text, i believe.
>
> I'm not sure I agree. The database format should be trivially
> parsed or rather loaded by the library. All of the work would be
> done up front by the tools that create it. The missing localedef(1)
> would normally do the job. But in the interim a simple plain-text
> file is far easier to create. Well, plain-text is an over
> simplification. The file is really a sequence of bytes. The strings
> and string like things are newline terminated. I think this keeps
> things MI. I'm just not sure about multi-byte sequences. You'll
> forgive me I don't deal with multi-byte characters in my day-to-day.
the most important thing is that we have to keep the ABI and allow the file format to change.
i'm not opposed to importing FreeBSD's text locale-db
as long as we can change the format later.
so i would prefer introducing a sub-directory like:
/usr/share/locale/*/LC_*/*
> Also what does the citrus_db* stuff gain over say using db(3)?
no, berkeley-db is too heavy for implementing iconv(3) and iconvdata,
so tshiozak-san wrote a tiny, fast, nestable hashmap implementation with an
on-disk format read via mmap(2); that's the citrus_db* stuff.
because of that we can't use db(1); we have to write
new tools, like mkcsmapper(1) and mkesdb(1) for iconvdata.
> Where is any of the citrus stuff documented?
sorry, not documented(hi, tshiozak-san).
src/usr.bin/{mkcsmapper,mkesdb} may be a good example, i hope.
> Is it used anyplace other than iconv(3)?
no, but tshiozak-san intended to use it for the LC_* implementation
when he wrote it, AFAIK.
> > files under /usr/share should be MI,
> > because these can be shared among different MACHINE_ARCH by NFS etc.
> > of course db file generated by citrus_db*[c.h] is MI.
>
> Exactly. This is why they haven't been encoded as anything other
> than plain-text. I'm not sure it is practice to share files across
> different OSes. The citrus stuff maybe MI but citrus isn't everywhere.
i think sharing LC_* databases across different OSes is not required,
only sharing across all the different versions and architectures of NetBSD.
currently we keep forward-compatibility with FreeBSD's LC_CTYPE format,
but they changed the format at 6.0, so i think it is better to remove
_ReadCTypeAsRune() in src/lib/libc/locale/setrunelocale.c.
and we might have to move the LC_* stuff to /usr/libdata/locale or /usr/lib/locale.
> Your proposal is to add the additional "indirection" (sub-directory)
> to all of the categories. This might be reasonable. It would allow
> for backward compatibility.
i think LC_CTYPE were too, but... :(
> > SUSv3 spec is very ambigious about ``where do we *copy* from information?''
> > if this means:
> > ``copy from /usr/share/locale/en_US.UTF-8/* that compliled by localedef(1)''
> > we have to restore from (multi)byte-sequence in plain-text db to
> > charmap's symbol-name, it is *impossible*(yes i know LC_CTYPE too).
>
> Huh? Why would that be the case? Either copy means take from the
> source (which seems to be GNU's method and that used by IRIX) or
> directly from the "compiled" binary.
by the term *ambiguous* (sorry for the misspelling), i meant that the
semantics of the copy instruction changed between SUSv3
and ISO/IEC TR14652.
SUSv3 says:
<cite>
the copy statement names a valid, existing locale, then localedef shall
behave as if the source definition had contained a valid category source
definition for the named locale.
</cite>
at this point it is clear: the locale must be "existing" and "valid".
this means that if we write the following localedef source:
charmap "UTF-8"
LC_CTYPE
copy "ja_JP.eucJP"
END LC_CTYPE
then the following code must work fine:
#include <assert.h>
#include <locale.h>

int
main(void)
{
    /* the locale named by the copy statement must already be installed */
    char *loc = setlocale(LC_CTYPE, "ja_JP.eucJP");
    assert(loc != NULL);
    return 0;
}
and we only copy from the installed "compiled" locale-db.
but in ISO/IEC TR14652, the semantics of the copy instruction have changed:
charmap "UTF-8"
LC_CTYPE
copy "i18n"
END LC_CTYPE
<cite>
4.1.3 Names for copy keyword
In most of the categories a "copy" keyword is allowed.
The name specified with this copy keyword is one of:
- "i18n" which indicate the "i18n" FDCC-set defined in this specification,
- the name of a FDCC-set or POSIX locale registered by the process defined
in ISO/IEC 15897,
- any other name which may be recognized in some local context - not being
recommended as an international specification.
</cite>
"i18n" is neither an existing locale nor a valid locale name,
so setlocale(LC_CTYPE, "i18n") may not work.
here, copy means take from the source.
> I'm not sure I understand the conversion back to "plain-text". Note
> that the current plain-text database isn't really plain-text. It
> is actually a sequence of bytes. I think the multi-byte sequences
> just happen to come out "right".
consider the following case:
charmap "UTF-8"
LC_TIME
copy "ja_JP.eucJP"
END LC_TIME
if our wchar_t were a UCS4 codepoint (meaning we could define
__STDC_ISO10646__ like glibc2), we could easily convert
ja_JP.eucJP -> ja_JP.UTF-8 directly with steps like these:
- open the ja_JP.eucJP locale-db and read the multibyte (=eucJP) sequences.
- load the eucJP encoding module.
- convert multibyte (=eucJP) to wchar_t (=UCS4) with mbrtowc() in the eucJP module.
- load the UTF-8 encoding module.
- convert wchar_t (=UCS4) to multibyte (=UTF-8) with wcrtomb() in the UTF-8 module.
- save the multibyte (=UTF-8) result to the newly created ja_JP.UTF-8 locale-db.
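
the steps above can be sketched with only standard C calls, under the loud assumption that every locale shares one wchar_t codespace (i.e. __STDC_ISO10646__ is defined). recode_via_wchar() is a hypothetical helper, it uses the string variants mbsrtowcs(3)/wcsrtombs(3) rather than per-character mbrtowc()/wcrtomb(), and switching LC_CTYPE with setlocale(3) stands in for "loading an encoding module".

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Re-encode the string `in` from the codeset of locale `from` to the
 * codeset of locale `to`, going through wchar_t.
 * ASSUMPTION: both locales use the same wchar_t codespace (e.g. UCS4).
 * With a CSI wchar_t the result would be garbage.
 * Returns 0 on success, -1 on failure.
 */
int
recode_via_wchar(const char *from, const char *to,
    const char *in, char *out, size_t outlen)
{
    wchar_t wbuf[256];
    mbstate_t st;
    const char *src = in;
    const wchar_t *wsrc = wbuf;
    size_t n;

    assert(in != NULL && out != NULL);

    /* decode with the source locale's multibyte machinery */
    if (setlocale(LC_CTYPE, from) == NULL)
        return -1;
    memset(&st, 0, sizeof(st));
    n = mbsrtowcs(wbuf, &src, 256, &st);
    if (n == (size_t)-1 || src != NULL)
        return -1;      /* invalid sequence, or input too long */

    /* encode with the target locale's multibyte machinery */
    if (setlocale(LC_CTYPE, to) == NULL)
        return -1;
    memset(&st, 0, sizeof(st));
    n = wcsrtombs(out, &wsrc, outlen, &st);
    if (n == (size_t)-1 || wsrc != NULL)
        return -1;      /* unrepresentable, or output too small */
    return 0;
}
```

a call such as recode_via_wchar("ja_JP.eucJP", "ja_JP.UTF-8", in, out, sizeof(out)) would be the interesting case. the helper leaves the process-wide LC_CTYPE set to `to`, so a real tool would save and restore it.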
but our wchar_t is not UCS4, because we follow a CSI (=Codeset Independent) policy.
# a UCS4-hardwired wchar_t is not enough; read itojun-san's paper:
# http://www.usenix.org/events/usenix01/freenix01/full_papers/hagino/hagino_html/index.html
we can't directly convert from eucJP's wchar_t (=JIS) to UTF-8's wchar_t (=UCS4),
because the encoding modules don't know how to map JIS <-> UCS4 codepoints
(that requires a huge conversion table).
that's why i think it is *impossible*, but...
> > Solaris, they don't copy information from
> > /usr/lib/localedef/src/en_US.UTF-8/*.src
> > but /usr/share/locale/en_US.UTF8/* stuffs, as far as i know from
> > truss(1)'s output.
>
> Which version of Solaris? I don't have /usr/share/locale on my
> Solaris 9 box. I've got /usr/lib/localedef/src and /usr/lib/locale.
> The latter has dynamic shared objects created via localedef(1).
sorry, s/\/usr\/share/\/usr\/lib/; please.
Solaris's wchar_t is not UCS4 but CSI; they don't define __STDC_ISO10646__.
this is the same policy as NetBSD.
it seems that my Solaris 8 box's localedef(1) loads the ja_JP.eucJP encoding module
and calls its __{mbtowc,wctomb}_dense_eucjp functions
(this is from truss(1) output; i don't read their CDDL source.
correct me if i'm wrong).
i guess the following conversion happens internally:
multibyte(eucJP) -> wchar_t(JIS) -> ? -> wchar_t(UCS4) -> multibyte(UTF-8)
i once said that converting wchar_t(JIS) -> wchar_t(UCS4) is *impossible*,
but the Solaris people found a way; i think they use iconv(3)'s tables.
of course, we can adopt the same approach as Solaris.
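
for reference, a table-driven conversion of that kind looks roughly like this through the public iconv(3) interface. this is only a sketch: recode_via_iconv() is a hypothetical helper, and i am not claiming it is what Solaris's localedef(1) does internally.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert `inlen` bytes from codeset `from` to codeset `to` using the
 * system's iconv(3) tables.
 * Returns the number of bytes written to `out`, or (size_t)-1 on failure.
 */
size_t
recode_via_iconv(const char *from, const char *to,
    const char *in, size_t inlen, char *out, size_t outlen)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv(3) takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen, outleft = outlen, r;

    assert(in != NULL && out != NULL);
    cd = iconv_open(to, from);  /* note the order: target first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;
    r = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (r != (size_t)-1)
        /* flush any pending shift sequence (stateful encodings) */
        r = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);
    if (r == (size_t)-1)
        return (size_t)-1;
    return outlen - outleft;
}
```

for the case discussed here the call would be recode_via_iconv("EUC-JP", "UTF-8", ...), assuming the platform ships those tables under those names.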
but we might not, since localedef(1) is assumed to be part of the toolchain,
so cross-build capability is required.
- putting iconv(3) and iconvdata (up to 10MB) into libnbcompat would be quite a hell.
- dynamically loading encoding modules is not portable.
and i think converting between different codesets is quite an overhead.
so, using the symbol-names stored in the LC_* database is a very, very simple
and reasonable way.
yes, you may think that is useless information for the libc runtime, a waste of memory.
so my idea is to "split locale-db into pieces" like:
/usr/share/locale/*/LC_*/
  localedb.1    => the only file libc's locale functions read.
  localedefdb.1 => stores the localedef source's symbol-names for
                   localedef(1)'s copy instruction.
  charmapdb.1   => (LC_CTYPE only) used by iconv(3),
                   built from charmap + repertoiremap.
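
a small sketch of how a consumer could locate one of these pieces. localedb_path() and LOCALEDB_VERSION are hypothetical, and reading the ".1" suffix as a format version number is my assumption for the example, not something settled in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* ASSUMPTION: the ".1" suffix of the file names is a format version */
#define LOCALEDB_VERSION 1

/*
 * Build "<root>/<locale>/<category>/<dbname>.<version>" into buf.
 * Returns 0 on success, -1 if the locale name contains a '/' or the
 * path does not fit into buf.
 */
int
localedb_path(char *buf, size_t buflen, const char *root,
    const char *locale, const char *category, const char *dbname)
{
    int n;

    assert(buf != NULL && buflen > 0);
    /* do not let a locale name escape the locale directory */
    if (strchr(locale, '/') != NULL)
        return -1;
    n = snprintf(buf, buflen, "%s/%s/%s/%s.%d",
        root, locale, category, dbname, LOCALEDB_VERSION);
    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return 0;
}
```

with this layout libc would only ever ask for "localedb", localedef(1) would additionally open "localedefdb" when it meets a copy instruction, and a later format change would only need a new suffix next to the old file.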
> > # localedef(1) is quite a beast from ``spec then code'' outer space.
> > # please read my past tech-userlevel's post:
>
> No kidding. But the spec wasn't created in a vacuum. It just
> tried to codify existing stuff. In this case I think the stuff
> that originally came from System V. I also don't believe it
> was ``spec then code'' as the System V stuff probably existed
> before the spec. From what I've seen, these standards are pretty
> much codify existing. Unlike some others...
"spec then buy code" :)
my point is that the localedef(1) spec apparently lacks care for stateful
encodings that use locking shifts, such as ISO-2022, hz-gb2312 and so on,
even though mbrtowc/wcrtomb were designed to support them.
the SysV MNLS spec is too old to support them, but it wasn't brushed up until TR14652.
(and TR14652 has some problems...)
> I know I read it when you originally replied to me. Not sure I
> understood all of it. I meant to read it again and get back to
> you.
i believe ISO/IEC TR14652's charmap extension, <escseq2022>, is not
enough to support the existing ISO-2022 locale stuff.
the possible solutions are:
1. introduce our own extensions (like a <netbsd:xxx> tag).
2. temporarily withdraw the ISO-2022 locales, such as
ja_JP.ISO2022-JP, ja_JP.ISO2022-JP2, ja_JP.ct (i'm not willing...)
very truly yours.
--
Takehiko NOZAKI <tnozaki%NetBSD.org@localhost>
--
Takehiko NOZAKI<takehiko.nozaki%gmail.com@localhost>