tech-userlevel: Re: behaviour of iconv in NetBSD and pkgsrc libiconv

Subject: Re: behaviour of iconv in NetBSD and pkgsrc libiconv
To: None <tech-userlevel@NetBSD.org>
From: None <joerg@britannica.bec.de>
List: tech-userlevel
Date: 04/03/2006 17:08:12

On Mon, Apr 03, 2006 at 03:26:45PM +0200, Bruno Haible wrote:
> > > In contrast, converters/libiconv stops the conversion at this point,
> > > returns an error and gives the application a chance to do something
> > > about the unconvertible character [1].
> >
> > The GNU implementation is clearly broken.
> 
> The GNU implementations of iconv() - both the one in glibc and libiconv -
> stop when an unconvertible character is encountered. This is not POSIX
> compliant, but I would qualify it as "useful", not "broken".

Actually, it isn't. Think for a moment about transliterations, e.g.
substituting "EUR" for the Euro symbol when converting to ISO-8859-1.
It is still technically an inexact conversion, but it is also the best
approximation available in that situation.
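
To make the point concrete, here is a minimal sketch of such a
transliterating conversion. The function name and signature are
hypothetical and not taken from any iconv implementation; it only
assumes that the input is already available as Unicode code points.
Like POSIX iconv(), it keeps going and reports the number of inexact
conversions instead of stopping at the first one.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of any iconv implementation: convert
 * Unicode code points to ISO-8859-1, transliterating U+20AC to "EUR".
 * Returns the number of inexact (non-reversible) conversions, or
 * (size_t)-1 if the output buffer is too small.
 */
size_t
to_latin1_translit(const uint32_t *in, size_t n, char *out, size_t outsz,
    size_t *outlen)
{
	size_t inexact = 0, o = 0;

	for (size_t i = 0; i < n; i++) {
		if (in[i] <= 0xFF) {
			/* Exact: Latin-1 is the first 256 code points. */
			if (o + 1 > outsz)
				return (size_t)-1;
			out[o++] = (char)(unsigned char)in[i];
		} else if (in[i] == 0x20AC) {
			/* Inexact but useful: Euro sign becomes "EUR". */
			if (o + 3 > outsz)
				return (size_t)-1;
			out[o++] = 'E';
			out[o++] = 'U';
			out[o++] = 'R';
			inexact++;
		} else {
			/* No approximation known here: fall back to '?'. */
			if (o + 1 > outsz)
				return (size_t)-1;
			out[o++] = '?';
			inexact++;
		}
	}
	*outlen = o;
	return inexact;
}
```

Converting the code points for "10 " followed by U+20AC yields the six
bytes "10 EUR" and a return value of 1, so the caller still learns that
the result is not reversible.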

> If you want to distinguish between invalid input and valid but unconvertible
> input, perform a conversion to "UTF-8".

Read Itojun's paper on why Unicode is *not* enough to classify this. In
short, you can lose information when converting to Unicode, and the
iconv backend could count such a lossy conversion as an inexact
representation.

> Alternatively you can feed bytes one by one into the conversion descriptor.
> But this is slow as well. You see that the POSIX spec is inadequate.

It was not intended for this usage. If you only want to convert to
Unicode and *expect* unconvertible characters in the input, use an API
more appropriate for this purpose, e.g. the wchar interface.
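
As a rough sketch of what I mean, assuming "the wchar interface" is
read as mbrtowc() from <wchar.h>: the loop below decodes one character
at a time in the current locale, so the application decides what to do
with each bad byte. The helper name and the skip-and-count policy are
only illustrations.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode a multibyte string in the current locale
 * one character at a time. Invalid bytes are skipped and counted
 * instead of aborting the conversion. Returns the number of wide
 * characters stored in out (at most outmax).
 */
size_t
decode_lenient(const char *s, size_t len, wchar_t *out, size_t outmax,
    size_t *nbad)
{
	mbstate_t st;
	size_t nout = 0;

	memset(&st, 0, sizeof st);
	*nbad = 0;
	while (len > 0 && nout < outmax) {
		wchar_t wc;
		size_t r = mbrtowc(&wc, s, len, &st);

		if (r == (size_t)-2) {
			/* Truncated sequence at the end of the input. */
			(*nbad)++;
			break;
		}
		if (r == (size_t)-1) {
			/*
			 * Invalid sequence: the state is undefined now,
			 * so reset it and skip a single byte.
			 */
			memset(&st, 0, sizeof st);
			(*nbad)++;
			s++;
			len--;
			continue;
		}
		if (r == 0)
			r = 1;	/* An embedded NUL consumes one byte. */
		out[nout++] = wc;
		s += r;
		len -= r;
	}
	return nout;
}
```

Which bytes count as invalid depends on the locale selected with
setlocale(), so the caller has to set that up first.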

Joerg