Re: libcodecs(3), take 2

To: Alistair Crooks <agc%pkgsrc.org@localhost>
Subject: Re: libcodecs(3), take 2
From: SODA Noriyuki <soda%yuruyuru.net@localhost>
Date: Tue, 21 Sep 2010 15:57:15 +0900

>>>>> On Tue, 21 Sep 2010 08:02:40 +0200, Alistair Crooks 
>>>>> <agc%pkgsrc.org@localhost> said:

>      ascii2ebcdic
>                   [charset] convert the input from ASCII character encodings
>                   to EBCDIC character encodings.

>      ebcdic2ascii
>                   [charset] convert the input from EBCDIC character encodings
>                   to ASCII character encodings.

I don't think those are good names, because EBCDIC has so many variants.

>      to-lower     [charset] change any uppercase letters in the input string
>                   to lowercase.

>      to-upper     [charset] change any lowercase letters in the input string
>                   to uppercase.

Those are problematic, because to-lower/to-upper conversions
are affected by the current locale setting.

Also, it's better to use "wctrans(tolower)"/"wctrans(toupper)" or
something similar, so that all character mapping names accepted by
wctrans(3) can be supported in the future, although NetBSD currently
supports only tolower/toupper.  (wctrans(3) is affected by the current
locale too.)
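
A minimal sketch of what a transformation keyed on the wctrans(3)
mapping name might look like.  apply_wctrans() is a hypothetical helper
name used only for illustration; it is not part of libcodecs(3).  Only
the standard wctrans(3) and towctrans(3) calls are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Apply a named wctrans(3) mapping ("tolower", "toupper", ...) to a
 * wide string in place.  Returns 0 on success, or -1 if the current
 * locale does not know the mapping name.
 * Hypothetical helper, for illustration only.
 */
int
apply_wctrans(const char *mapping, wchar_t *s)
{
	wctrans_t desc;

	assert(mapping != NULL && s != NULL);

	/* The name is looked up in the current LC_CTYPE locale. */
	desc = wctrans(mapping);
	if (desc == (wctrans_t)0)
		return -1;

	for (; *s != L'\0'; s++)
		*s = (wchar_t)towctrans((wint_t)*s, desc);
	return 0;
}
```

Because the lookup and the mapping both depend on LC_CTYPE, the caller
would still have to decide which locale is in effect (for example by
calling setlocale(3)) before using such a transformation.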


>      to-unicode   [charset] translate to unicode-16 from UTF-8
> 
> 
>      to-utf8      [charset] translate from unicode-16 to UTF-8

Those are bad names, since Unicode is a concept which includes many
encodings: UTF-16+BOM, UTF-16BE, UTF-16LE, UTF-8, UTF-8+BOM, UCS-4 and
others.

What does "to-unicode" really do?
Does it convert UTF-8 to UTF-16LE?  Or to UTF-16BE?  Or to UTF-16LE+BOM
or UTF-16BE+BOM?

Does "to-utf8" remove the BOM from UTF-16?  Or add a BOM when the
UTF-16 input didn't have one?
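
To make the ambiguity concrete, here is a small sketch that encodes a
single code point as UTF-16 with an explicit byte order and an optional
BOM.  The names utf16_encode(), put_unit() and enum utf16_order are
made up for this example and do not come from libcodecs(3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum utf16_order { UTF16_LE, UTF16_BE };	/* hypothetical names */

/* Store one 16-bit code unit in the requested byte order. */
static size_t
put_unit(uint8_t *out, uint16_t u, enum utf16_order order)
{
	if (order == UTF16_LE) {
		out[0] = (uint8_t)(u & 0xff);
		out[1] = (uint8_t)(u >> 8);
	} else {
		out[0] = (uint8_t)(u >> 8);
		out[1] = (uint8_t)(u & 0xff);
	}
	return 2;
}

/*
 * Encode one Unicode code point as UTF-16.  If with_bom is non-zero,
 * U+FEFF is written first.  out must have room for at least 6 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * is beyond U+10FFFF.
 */
size_t
utf16_encode(uint32_t cp, enum utf16_order order, int with_bom, uint8_t *out)
{
	size_t n = 0;

	assert(out != NULL);
	if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
		return 0;
	if (with_bom)
		n += put_unit(out + n, 0xfeff, order);
	if (cp < 0x10000) {
		n += put_unit(out + n, (uint16_t)cp, order);
	} else {
		/* Split into a high and a low surrogate. */
		cp -= 0x10000;
		n += put_unit(out + n, (uint16_t)(0xd800 | (cp >> 10)), order);
		n += put_unit(out + n, (uint16_t)(0xdc00 | (cp & 0x3ff)), order);
	}
	return n;
}
```

With this sketch, "A" (U+0041) becomes the bytes 41 00 as UTF-16LE
without a BOM, but FE FF 00 41 as UTF-16BE with a BOM.  A name like
"to-unicode" does not say which of these the caller gets.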

For code conversion, I think libcodecs(3) shouldn't handle codeset names
by itself.  It may make sense to provide a transformation
"iconv(from_codeset,to_codeset)", though.  In that case libcodecs(3)
can call iconv(3) internally for the actual conversion, and
ascii2ebcdic, ebcdic2ascii, to-unicode and to-utf8 all become unnecessary.
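
A rough sketch of how such a transformation could wrap iconv(3).
codeset_convert() is a hypothetical name, not an existing libcodecs(3)
interface, and the codeset names accepted depend entirely on the local
iconv implementation.  The sketch assumes the POSIX prototype with
"char **" for the input buffer; some systems declare it as
"const char **" instead.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Convert inlen bytes of "in" from from_codeset to to_codeset using
 * iconv(3).  Returns the number of bytes written to "out", or
 * (size_t)-1 on any failure (unknown codeset, invalid input, or
 * output buffer too small).
 * Hypothetical helper, for illustration only.
 */
size_t
codeset_convert(const char *from_codeset, const char *to_codeset,
    const char *in, size_t inlen, char *out, size_t outsize)
{
	iconv_t cd;
	char *inp = (char *)in;		/* iconv(3) does not modify the input */
	char *outp = out;
	size_t inleft = inlen, outleft = outsize, r;

	assert(in != NULL && out != NULL);

	/* Note the argument order: destination first, then source. */
	cd = iconv_open(to_codeset, from_codeset);
	if (cd == (iconv_t)-1)
		return (size_t)-1;

	r = iconv(cd, &inp, &inleft, &outp, &outleft);
	if (r != (size_t)-1) {
		/* Flush any pending shift sequence for stateful encodings. */
		r = iconv(cd, NULL, NULL, &outp, &outleft);
	}
	iconv_close(cd);

	return (r == (size_t)-1) ? (size_t)-1 : outsize - outleft;
}
```

With a wrapper along these lines, the caller names both codesets
explicitly, e.g. "UTF-8" and "UTF-16LE", so questions about byte order
and BOM are answered by the codeset name rather than by the
transformation name.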
-- 
soda

Follow-Ups:
- Re: libcodecs(3), take 2
  - From: Takehiko NOZAKI
- Re: libcodecs(3), take 2
  - From: Matthew Mondor

References:
- libcodecs(3), take 2
  - From: Alistair Crooks
