Re: libcodecs(3), take 3

To: SODA Noriyuki <soda%yuruyuru.net@localhost>
Subject: Re: libcodecs(3), take 3
From: Alistair Crooks <agc%pkgsrc.org@localhost>
Date: Wed, 29 Sep 2010 06:44:39 +0200

Hi Soda-san,

Thanks for your mail.

On Wed, Sep 29, 2010 at 08:29:27AM +0900, SODA Noriyuki wrote:
> Hi, Al
> 
> It seems some points are still not addressed.
> Probably because my description was too sketchy. (sorry for that).
> 
> The problems are:
> 
> * duplicated functionality
> 
>   - libcodecs(3) has duplicated functionality with iconv(3), this is
>     undesirable because of code size and inconsistency.
>     It should call iconv(3) internally.
>     For example, current implementation of utf8_to_unicode16() and
>     unicode16_to_utf8() doesn't support the surrogate pair feature of
>     the Unicode standard.

The original mail I sent mentioned that there were going to be
duplications, and that was because I wanted a single, regular way of
doing a number of transformations.

Nevertheless, I can understand your opposition to a number of the
codeset items.  I don't disagree with them -- you know way more than
me in these matters -- and so, to avoid elongating this design review
any further, I'll delete the ebcdic and utf8 functions from
libcodecs(3), and use the simpler ctype tolower(3) and toupper(3)
macros to modify alphabetic case.  I know this is a backwards step,
but the utility value (to me, and, I am assured by their mail, to
others) outweighs the minimal extra value to be had from ebcdic/ascii
conversion, and the simple utf-8 conversion functions.
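
A minimal sketch of that ctype-based fallback, assuming single-byte text in
the current locale. The names buf_tolower() and buf_toupper() are
illustrative only and are not libcodecs(3) functions:

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helpers (not part of libcodecs): change the case of a
 * byte buffer in place using the ctype(3) functions.
 */
void
buf_tolower(char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		/* go via unsigned char: a negative char is undefined for tolower(3) */
		buf[i] = (char)tolower((unsigned char)buf[i]);
	}
}

void
buf_toupper(char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		buf[i] = (char)toupper((unsigned char)buf[i]);
	}
}
```

This works byte by byte, so it is only correct for single-byte codesets;
multibyte text is left to other tools.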

> * limited extensibility
> 
>   - The code conversion feature often needs more codesets.
>     With current translation naming scheme, libcodecs has to be changed
>     at each addition of a codeset.  That's undesirable.
>     If the translation name for code conversion is the following format,
>     and if libcodecs internally calls iconv(3), many codesets will be
>     supported by libcodecs automatically:
>       current naming:
>               ascii2ebcdic
>               ebcdic2ascii
>               to_unicode
>               to_utf8
>       desirable naming scheme:
>               iconv(FROM_CODESET,TO_CODESET)

No, not really -- codecs_add() can easily be used to add an external
transformation routine; transformations do not have to be compiled in
to libcodecs(3) to be used. Different tables of transformations can
be used by adding different groups of transformation functions. By
default, if no transformations have been loaded by the time the first
transformation is attempted, all of the compiled-in transformations
will be loaded.

In this way, it's easy to add iconv-based transformations externally.
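
As a rough illustration, an external iconv-based transformation might look
like the sketch below. The function name and its argument layout are
assumptions made for this example; the codecs_add() signature is not given
in this mail, so the registration step is not shown. Only the iconv_open(3),
iconv(3) and iconv_close(3) calls are standard:

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical external transformation: convert `insize' bytes at `in'
 * from codeset `from' to codeset `to', writing into `out'.
 * Returns the number of bytes written, or -1 on any error.
 */
long
iconv_transform(const char *from, const char *to,
    const char *in, size_t insize, char *out, size_t outsize)
{
	iconv_t cd;
	/* POSIX declares the input as char **; some systems use const char ** */
	char *inp = (char *)in;
	char *outp = out;
	size_t inleft = insize;
	size_t outleft = outsize;
	size_t rc;

	cd = iconv_open(to, from);	/* note the order: to, then from */
	if (cd == (iconv_t)-1) {
		return -1;
	}
	rc = iconv(cd, &inp, &inleft, &outp, &outleft);
	if (rc != (size_t)-1) {
		/* flush any pending shift sequence for stateful codesets */
		rc = iconv(cd, NULL, NULL, &outp, &outleft);
	}
	(void)iconv_close(cd);
	if (rc == (size_t)-1) {
		return -1;
	}
	return (long)(outsize - outleft);
}
```

Because the codeset names are passed straight to iconv_open(3), any pair
that the installed iconv supports would be available without changing the
library itself.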

>   - The mapping name for wctrans(3) will be added in future.
>     With current translation naming scheme, libcodecs has to be changed
>     at each addition of a mapping.  That's undesirable.
>     If the translation name for code conversion is the following
>     format, no change is necessary at an addition of a mapping:
>       current naming:
>               to_lower
>               to_upper
>       desirable naming scheme:
>               wctrans(MAPPING_NAME)
>     Providing "to_lower"/"to_upper" as an alias of wctrans("tolower")
>     wctrans("toupper") may be a good idea due to its frequent use, though.

Yes, not sure I see it like that. The operation is to convert the case
of the alphabet in question. The C operation may be (using wctype.h ops)
called wctrans, but 1. I don't think like that, 2. if I want to convert
to upper or lower case, I'd really like libcodecs to be able to do that,
and 3. since I've removed the <wctype.h> header file, this part is a bit
moot.

> * naming inconsistency
> 
>   - Just using "EBCDIC" is inconsistent with existing NetBSD
>     installation, because we have already supported the following
>     EBCDIC variants:
> 
>       $ iconv -l | grep -i ebcdic | tr '\012' ' '
>       ebcdic-at-de ebcdic-at-de-a ebcdic-be ebcdic-br ebcdic-ca-fr
>       ebcdic-cp-ar1 ebcdic-cp-ar2 ebcdic-cp-be ebcdic-cp-ca
>       ebcdic-cp-ch ebcdic-cp-dk ebcdic-cp-es ebcdic-cp-fi
>       ebcdic-cp-fr ebcdic-cp-gb ebcdic-cp-gr ebcdic-cp-he
>       ebcdic-cp-is ebcdic-cp-it ebcdic-cp-nl ebcdic-cp-no
>       ebcdic-cp-roece ebcdic-cp-se ebcdic-cp-tr ebcdic-cp-us
>       ebcdic-cp-wt ebcdic-cp-yu ebcdic-cyrillic ebcdic-dk-no
>       ebcdic-dk-no-a ebcdic-es ebcdic-es-a ebcdic-es-s ebcdic-fi-se
>       ebcdic-fi-se-a ebcdic-fr ebcdic-int ebcdic-it ebcdic-jp-e
>       ebcdic-jp-kana ebcdic-pt ebcdic-uk

EBCDIC was a simple utility transformation that I added. I have no
real use for it, and no desire to enter into a charset naming war,
so I've deleted any reference to EBCDIC from the latest version.

> * naming ambiguity
> 
>   - The name "to_unicode" and "to_utf8" are ambiguous because
>     they don't indicate which codeset it converts from.
>       
>   - The name "EBCDIC" itself is ambiguous.
>     (c.f. "iconv -l | grep -i ebcdic")
> 
>   - The name "unicode" itself is ambiguous.
>     It is possible that "unicode" means:
>       - UCS-4
>       - UTF-8
>       - UTF-8 with byte order mark
>       - UTF-16 Big Endian without byte order mark
>       - UTF-16 Big Endian with byte order mark
>       - UTF-16 Little Endian without byte order mark
>       - UTF-16 Little Endian with byte order mark
>       - and many more.

Yes, I've deleted the UTF-8 and EBCDIC functionality to address this
point.

> * bugs
> 
>   - mixed2lower() and mixed2upper() are using a cast for passing "char *"
>     to towctrans(3).  This doesn't work for multibyte codesets.

Indeed. I've decided not to use the multibyte codeset so that this is no
longer an issue. Personally, I think this is a step backwards, but this
will move the process forward.
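
For reference, a multibyte-aware conversion would decode each character
before mapping it, rather than casting "char *". The sketch below is an
assumption about how that could be done with the standard mbrtowc(3),
towlower(3) and wcrtomb(3) functions; mb_tolower() is a hypothetical name
and not the removed libcodecs code. It assumes the caller has already
selected a locale with setlocale(3):

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: lower-case the NUL-terminated multibyte string
 * `in' into `out', one decoded character at a time.
 * Returns the number of bytes written (excluding the NUL), or -1.
 */
long
mb_tolower(const char *in, char *out, size_t outsize)
{
	mbstate_t ist, ost;
	char tmp[MB_LEN_MAX];
	size_t inleft = strlen(in);
	size_t used = 0;
	size_t n, m;
	wchar_t wc;

	if (outsize == 0) {
		return -1;
	}
	memset(&ist, 0, sizeof(ist));
	memset(&ost, 0, sizeof(ost));
	while (inleft > 0) {
		/* decode one multibyte character into a wide character */
		n = mbrtowc(&wc, in, inleft, &ist);
		if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
			return -1;	/* invalid or truncated sequence */
		}
		/* map the wide character, then re-encode it */
		m = wcrtomb(tmp, (wchar_t)towlower((wint_t)wc), &ost);
		if (m == (size_t)-1 || used + m >= outsize) {
			return -1;	/* unencodable, or no room left for the NUL */
		}
		memcpy(out + used, tmp, m);
		used += m;
		in += n;
		inleft -= n;
	}
	out[used] = '\0';
	return (long)used;
}
```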

>   - As written above, current implementation of utf8_to_unicode16() and
>     unicode16_to_utf8() doesn't support the surrogate pair feature of
>     the Unicode standard.

Yes, this has been addressed by deleting those 2 transformations.
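
For context, the surrogate pair feature means that a code point above
U+FFFF is stored as two 16-bit units in UTF-16. A minimal sketch of the
encoding side follows; utf16_encode() is a hypothetical name used only for
this example and is not the deleted code:

```c
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units stored in `out' (1 or 2),
 * or 0 if `cp' is not a valid Unicode scalar value.
 */
int
utf16_encode(uint32_t cp, uint16_t out[2])
{
	/* surrogate code points and values past U+10FFFF cannot be encoded */
	if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
		return 0;
	}
	if (cp < 0x10000) {
		/* Basic Multilingual Plane: a single unit */
		out[0] = (uint16_t)cp;
		return 1;
	}
	/* supplementary planes: split the 20-bit offset into two halves */
	cp -= 0x10000;
	out[0] = (uint16_t)(0xD800 + (cp >> 10));	/* high surrogate */
	out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));	/* low surrogate */
	return 2;
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, while U+0041 stays a
single unit.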

I would have got more use out of EBCDIC than UTF-8, I think, but neither
is strong enough for me to want to force the issue. At the same time, I
recognise the sincere problems you have with the transformations mentioned
above -- so it's best just to delete them.

Thanks once again for your mail, and for educating me much more thoroughly
than I had been in the ways of non-ASCII character sets.

With best wishes,
Alistair
