tech-userlevel archive
Re: libcodecs(3), take 3
Hi, Al
It seems some points are still not addressed,
probably because my description was too sketchy (sorry for that).
The problems are:
* duplicated functionality
- libcodecs(3) duplicates functionality already provided by iconv(3).
This is undesirable because of code size and inconsistency.
It should call iconv(3) internally.
For example, the current implementations of utf8_to_unicode16() and
unicode16_to_utf8() don't support the surrogate pair feature of
the Unicode standard.
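To make the surrogate pair point concrete, here is a minimal sketch of the arithmetic a UTF-16 encoder needs for code points above U+FFFF. The function names utf16_encode() and utf16_decode_pair() are hypothetical and are not part of libcodecs; with iconv(3) underneath, none of this would have to be written by hand.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of libcodecs): encode one Unicode code
 * point as UTF-16 code units. Returns the number of units written
 * (1 or 2), or 0 if cp is not a valid Unicode scalar value.
 */
size_t
utf16_encode(uint32_t cp, uint16_t out[2])
{
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;		/* out of range, or a lone surrogate */
	if (cp < 0x10000) {
		out[0] = (uint16_t)cp;	/* BMP: a single code unit */
		return 1;
	}
	cp -= 0x10000;			/* 20 significant bits remain */
	out[0] = (uint16_t)(0xD800 | (cp >> 10));	/* high surrogate */
	out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));	/* low surrogate */
	return 2;
}

/* Inverse: combine a high/low surrogate pair into one code point. */
uint32_t
utf16_decode_pair(uint16_t hi, uint16_t lo)
{
	return 0x10000 +
	    (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

For instance, U+1F600 does not fit in one 16-bit unit and becomes the pair D83D DE00, which a converter without surrogate support cannot produce or consume.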
* limited extensibility
- The code conversion feature often needs more codesets.
With the current translation naming scheme, libcodecs has to be changed
each time a codeset is added. That's undesirable.
If the translation name for code conversion has the following format,
and if libcodecs internally calls iconv(3), many codesets will be
supported by libcodecs automatically:
current naming:
ascii2ebcdic
ebcdic2ascii
to_unicode
to_utf8
desirable naming scheme:
iconv(FROM_CODESET,TO_CODESET)
- Mapping names for wctrans(3) will be added in the future.
With the current translation naming scheme, libcodecs has to be changed
each time a mapping is added. That's undesirable.
If the translation name has the following
format, no change is necessary when a mapping is added:
current naming:
to_lower
to_upper
desirable naming scheme:
wctrans(MAPPING_NAME)
Providing "to_lower"/"to_upper" as aliases for wctrans("tolower") and
wctrans("toupper") may be a good idea because they are used so frequently, though.
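A rough sketch of how the proposed naming scheme could be parsed, so that new codesets and mappings need no change to the library. The function names parse_iconv_name() and parse_wctrans_name() are hypothetical, and the exact syntax (no spaces, one comma) is my assumption; only wctrans(3) is a real interface here.

```c
#include <string.h>
#include <wctype.h>

/*
 * Hypothetical parser: split "iconv(FROM,TO)" into its two codeset
 * names, each stored in a buffer of `len` bytes.
 * Returns 0 on success, -1 if the name does not have that form.
 */
int
parse_iconv_name(const char *name, char *from, char *to, size_t len)
{
	const char *p, *comma, *close;

	if (strncmp(name, "iconv(", 6) != 0)
		return -1;
	p = name + 6;
	comma = strchr(p, ',');
	close = strrchr(p, ')');
	if (comma == NULL || close == NULL || close < comma ||
	    close[1] != '\0')
		return -1;
	if (comma == p || close == comma + 1)
		return -1;		/* empty codeset name */
	if ((size_t)(comma - p) >= len || (size_t)(close - comma - 1) >= len)
		return -1;		/* name would not fit */
	memcpy(from, p, (size_t)(comma - p));
	from[comma - p] = '\0';
	memcpy(to, comma + 1, (size_t)(close - comma - 1));
	to[close - comma - 1] = '\0';
	return 0;
}

/*
 * Hypothetical resolver: map "wctrans(NAME)", or the aliases
 * "to_lower"/"to_upper", to a wctrans_t via wctrans(3).
 * Returns (wctrans_t)0 if the name is not recognized.
 */
wctrans_t
parse_wctrans_name(const char *name)
{
	char mapping[64];
	size_t n;

	if (strcmp(name, "to_lower") == 0)
		return wctrans("tolower");
	if (strcmp(name, "to_upper") == 0)
		return wctrans("toupper");
	n = strlen(name);
	if (n < 10 || strncmp(name, "wctrans(", 8) != 0 || name[n - 1] != ')')
		return (wctrans_t)0;
	if (n - 9 >= sizeof(mapping))
		return (wctrans_t)0;
	memcpy(mapping, name + 8, n - 9);	/* text between the parens */
	mapping[n - 9] = '\0';
	return wctrans(mapping);
}
```

The two codeset names from parse_iconv_name() would go straight to iconv_open(3), and the wctrans_t to towctrans(3), so whatever the system supports becomes available without touching libcodecs.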
* naming inconsistency
- Just using "EBCDIC" is inconsistent with the existing NetBSD
installation, because we already support the following
EBCDIC variants:
$ iconv -l | grep -i ebcdic | tr '\012' ' '
ebcdic-at-de ebcdic-at-de-a ebcdic-be ebcdic-br ebcdic-ca-fr
ebcdic-cp-ar1 ebcdic-cp-ar2 ebcdic-cp-be ebcdic-cp-ca
ebcdic-cp-ch ebcdic-cp-dk ebcdic-cp-es ebcdic-cp-fi
ebcdic-cp-fr ebcdic-cp-gb ebcdic-cp-gr ebcdic-cp-he
ebcdic-cp-is ebcdic-cp-it ebcdic-cp-nl ebcdic-cp-no
ebcdic-cp-roece ebcdic-cp-se ebcdic-cp-tr ebcdic-cp-us
ebcdic-cp-wt ebcdic-cp-yu ebcdic-cyrillic ebcdic-dk-no
ebcdic-dk-no-a ebcdic-es ebcdic-es-a ebcdic-es-s ebcdic-fi-se
ebcdic-fi-se-a ebcdic-fr ebcdic-int ebcdic-it ebcdic-jp-e
ebcdic-jp-kana ebcdic-pt ebcdic-uk
* naming ambiguity
- The names "to_unicode" and "to_utf8" are ambiguous because
they don't indicate which codeset they convert from.
- The name "EBCDIC" itself is ambiguous.
(cf. "iconv -l | grep -i ebcdic")
- The name "unicode" itself is ambiguous.
It is possible that "unicode" means:
- UCS-4
- UTF-8
- UTF-8 with byte order mark
- UTF-16 Big Endian without byte order mark
- UTF-16 Big Endian with byte order mark
- UTF-16 Little Endian without byte order mark
- UTF-16 Little Endian with byte order mark
- and many more.
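To illustrate why the explicit codeset name matters, here is a small sketch that converts through iconv(3) with both codesets spelled out. The wrapper name convert() is hypothetical. I am assuming the POSIX prototype of iconv(3), whose input argument is char **; some older systems declare it as const char **. I am also assuming that the names "UTF-8", "UTF-16BE" and "UTF-16LE" are accepted by the local iconv_open(3).

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper: convert the NUL-terminated string `in` from
 * codeset `from` to codeset `to` using iconv(3). Returns the number of
 * bytes written to `out`, or (size_t)-1 on failure.
 */
size_t
convert(const char *from, const char *to, const char *in,
    char *out, size_t outlen)
{
	iconv_t cd;
	char *inp = (char *)in;		/* POSIX iconv(3) wants char ** */
	char *outp = out;
	size_t inleft = strlen(in), outleft = outlen, r;

	cd = iconv_open(to, from);	/* note the order: to, then from */
	if (cd == (iconv_t)-1)
		return (size_t)-1;	/* unknown codeset pair */
	r = iconv(cd, &inp, &inleft, &outp, &outleft);
	iconv_close(cd);
	if (r == (size_t)-1)
		return (size_t)-1;	/* invalid input or out too small */
	return outlen - outleft;
}
```

With this, the same input "A" yields the bytes 00 41 for "UTF-16BE" and 41 00 for "UTF-16LE". A bare "unicode" cannot say which of the two the caller gets.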
* bugs
- mixed2lower() and mixed2upper() use a cast to pass "char *" to
towctrans(3). This doesn't work for multibyte codesets.
- As written above, the current implementations of utf8_to_unicode16() and
unicode16_to_utf8() don't support the surrogate pair feature of
the Unicode standard.
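For the first bug, a multibyte-safe version would have to decode each character to a wide character before calling towctrans(3), instead of casting bytes. A minimal sketch follows; the function name mb_transform() is hypothetical and this is not the libcodecs code. It assumes the caller has already selected the locale with setlocale(3), since mbrtowc(3) and wcrtomb(3) interpret the bytes according to LC_CTYPE.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical replacement for a byte-wise case mapper: apply the
 * mapping `desc` to every character of the multibyte string `in`,
 * interpreted in the current locale. Returns the number of bytes
 * written to `out` (excluding the NUL), or (size_t)-1 on error.
 */
size_t
mb_transform(const char *in, char *out, size_t outlen, wctrans_t desc)
{
	mbstate_t ist, ost;
	char buf[MB_LEN_MAX];
	size_t n, m, used = 0, left = strlen(in);
	wchar_t wc;

	memset(&ist, 0, sizeof(ist));
	memset(&ost, 0, sizeof(ost));
	while (left > 0) {
		n = mbrtowc(&wc, in, left, &ist);  /* decode one character */
		if (n == (size_t)-1 || n == (size_t)-2)
			return (size_t)-1;	/* invalid or truncated input */
		if (n == 0)
			break;			/* decoded a NUL; stop */
		/* Map the wide character, then re-encode it. */
		m = wcrtomb(buf, (wchar_t)towctrans((wint_t)wc, desc), &ost);
		if (m == (size_t)-1 || used + m >= outlen)
			return (size_t)-1;	/* unencodable, or no room */
		memcpy(out + used, buf, m);
		used += m;
		in += n;
		left -= n;
	}
	if (used >= outlen)
		return (size_t)-1;		/* no room for the NUL */
	out[used] = '\0';
	return used;
}
```

A call such as mb_transform(src, dst, sizeof(dst), wctrans("toupper")) then maps whole characters, so a multibyte sequence is never split in the middle the way a per-byte cast splits it.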
--
soda