tech-userlevel archive
Re: libcodecs(3), take 2
>>>>> On Tue, 21 Sep 2010 08:02:40 +0200, Alistair Crooks
>>>>> <agc%pkgsrc.org@localhost> said:
> ascii2ebcdic
> [charset] convert the input from ASCII character encodings
> to EBCDIC character encodings.
> ebcdic2ascii
> [charset] convert the input from EBCDIC character encodings
> to ASCII character encodings.
I don't think those are good names, because EBCDIC has so many variants.
> to-lower [charset] change any uppercase letters in the input string
> to lowercase.
> to-upper [charset] change any lowercase letters in the input string
> to uppercase.
Those are problematic, because to-lower/to-upper conversions
are affected by the current locale setting.
Also, it would be better to use "wctrans(tolower)"/"wctrans(toupper)" or
something similar, so that all character mapping names accepted by
wctrans(3) can be supported in the future, although NetBSD currently
only supports tolower/toupper. (wctrans(3) is affected by the current
locale too.)
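
To illustrate the wctrans(3) idea, here is a minimal sketch. The helper name apply_wctrans() is hypothetical and not part of libcodecs(3). It only shows that the mapping name can be passed through to wctrans(3) unchanged, so any name the current locale supports works without a separate transformation per mapping.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: apply the wctrans(3) mapping called `mapping`
 * to every character of the wide string `s`, in place.
 * Returns 0 on success, or -1 if the current locale does not
 * recognise the mapping name.
 */
int
apply_wctrans(wchar_t *s, const char *mapping)
{
	wctrans_t desc = wctrans(mapping);

	if (desc == (wctrans_t)0)
		return -1;	/* unknown mapping name in this locale */
	for (; *s != L'\0'; s++)
		*s = (wchar_t)towctrans((wint_t)*s, desc);
	return 0;
}
```

A caller would pass "tolower" or "toupper" today. The result still depends on the LC_CTYPE locale in effect at the time of the call.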
> to-unicode [charset] translate to unicode-16 from UTF-8
>
>
> to-utf8 [charset] translate from unicode-16 to UTF-8
Those are bad names, since Unicode is a concept that covers
UTF-16+BOM, UTF-16BE, UTF-16LE, UTF-8, UTF-8+BOM, UCS-4 and others.
What does "to-unicode" really do?
Does it convert UTF-8 to UTF-16LE? Or UTF-16BE? Or UTF-16LE+BOM
or UTF-16BE+BOM?
Does "to-utf8" remove the BOM from UTF-16? Or add a BOM when the
UTF-16 input didn't have one?
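
To make the ambiguity concrete, the following sketch encodes a single BMP code point as UTF-16 in each of the variants mentioned above. The function name utf16_put_bmp() is made up for illustration, and surrogate pairs are deliberately not handled.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: write one BMP code point (no surrogate
 * handling) as UTF-16 into `out`, optionally preceded by a BOM
 * (U+FEFF).  `out` must have room for 4 bytes.
 * Returns the number of bytes written (2 or 4).
 */
size_t
utf16_put_bmp(uint16_t cp, int big_endian, int with_bom, unsigned char *out)
{
	size_t n = 0;

	if (with_bom) {
		const uint16_t bom = 0xFEFF;
		out[n++] = (unsigned char)(big_endian ? bom >> 8 : bom & 0xFF);
		out[n++] = (unsigned char)(big_endian ? bom & 0xFF : bom >> 8);
	}
	out[n++] = (unsigned char)(big_endian ? cp >> 8 : cp & 0xFF);
	out[n++] = (unsigned char)(big_endian ? cp & 0xFF : cp >> 8);
	return n;
}
```

The same character "A" (U+0041) comes out as 41 00, 00 41, FF FE 41 00 or FE FF 00 41 depending on the flags, so a name like "to-unicode" does not say which byte sequence the caller will get.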
For code conversion, I think libcodecs(3) shouldn't handle codeset names
by itself. Maybe it makes sense to provide a transformation
"iconv(from_codeset,to_codeset)", though. In that case libcodecs(3)
can call iconv(3) internally for the actual conversion, and
ascii2ebcdic, ebcdic2ascii, to-unicode and to-utf8 are all unnecessary.
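
A rough sketch of what such an iconv-backed transformation could look like internally. The name codec_iconv() and its interface are assumptions for illustration only, not a proposal for the actual libcodecs(3) API. It also assumes the POSIX prototype of iconv(3), where the input buffer argument is a "char **"; some implementations declare it as "const char **" instead.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert `inlen` bytes at `in` from codeset
 * `from` to codeset `to`, writing at most `outlen` bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on error
 * (unknown codeset, invalid input, or output buffer too small).
 */
size_t
codec_iconv(const char *from, const char *to,
    const char *in, size_t inlen, char *out, size_t outlen)
{
	iconv_t cd = iconv_open(to, from);	/* note: target comes first */
	char *inp = (char *)in;			/* iconv(3) does not modify the input */
	char *outp = out;
	size_t outleft = outlen;
	size_t rv;

	if (cd == (iconv_t)-1)
		return (size_t)-1;
	rv = iconv(cd, &inp, &inlen, &outp, &outleft);
	if (rv != (size_t)-1)
		/* flush any pending shift sequence for stateful codesets */
		rv = iconv(cd, NULL, NULL, &outp, &outleft);
	iconv_close(cd);
	return rv == (size_t)-1 ? (size_t)-1 : outlen - outleft;
}
```

With this, the caller spells out the exact codesets, for example codec_iconv("UTF-8", "UTF-16LE", ...), and the byte order and BOM questions are answered by the codeset name rather than by libcodecs(3).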
--
soda