Subject: behaviour of iconv in NetBSD and pkgsrc libiconv
To: None <tech-userlevel@NetBSD.org>
From: Klaus Heinz <k.heinz.apr.sechs@onlinehome.de>
List: tech-userlevel
Date: 04/02/2006 17:53:06
Hi,
Yesterday I ran into a problem while working with libxml2/libxslt for the
NetBSD web site. It boils down to iconv() behaving differently in
NetBSD 3.0 (probably in all releases newer than 1.6) than in
converters/libiconv, which the XML tools use on NetBSD 1.6.2, where the
problem does not occur.
The root of the problem is that some characters (e.g. mdash, —)
cannot be converted from one code set (e.g. UTF-8) to another
(e.g. ISO-8859-2) because the destination code set has no
representation for them.
This is even mentioned in our man page iconv(3):
"If no conversion exists for a particular character, an
implementation-defined conversion is performed on this character."
NetBSD's iconv() completes the conversion of the whole buffer and maps
such characters to a question mark. The return value of iconv() shows
how many of those non-reversible conversions happened.
In contrast, converters/libiconv stops the conversion at this point,
returns an error and gives the application a chance to do something
about the unconvertible character [1].
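To make the two behaviours concrete, here is a minimal sketch of how an
application could tell them apart. The name convert_and_classify() and the
enum are made up for illustration, and the code assumes the POSIX prototype
that takes char ** for the input buffer (NetBSD 3 declares const char **,
so a cast would be needed there). Note that errno is only inspected when
iconv() returns (size_t)-1; after a successful call its value may be stale.

```c
#include <errno.h>
#include <iconv.h>
#include <string.h>

/* Hypothetical classification of what a single iconv() call did. */
enum conv_outcome {
    CONV_CLEAN,    /* everything was converted exactly */
    CONV_LOSSY,    /* finished, but some characters were substituted */
    CONV_STOPPED,  /* stopped at a character it could not convert */
    CONV_FAILED    /* any other failure (bad code set name, no space, ...) */
};

/*
 * Convert the NUL-terminated string `in` into `out` (at most outsize - 1
 * bytes plus a terminating NUL). *offset receives the number of input
 * bytes that were consumed, i.e. where the conversion stopped.
 */
static enum conv_outcome
convert_and_classify(const char *tocode, const char *fromcode,
                     const char *in, char *out, size_t outsize,
                     size_t *offset)
{
    iconv_t cd;
    char *src = (char *)in;      /* iconv() does not modify the input */
    char *dst = out;
    size_t srclen = strlen(in);
    size_t dstlen, ret;
    int saved_errno;

    if (outsize == 0)
        return CONV_FAILED;
    cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return CONV_FAILED;
    dstlen = outsize - 1;        /* keep one byte for the NUL */

    ret = iconv(cd, &src, &srclen, &dst, &dstlen);
    saved_errno = errno;         /* only meaningful if ret == (size_t)-1 */
    *dst = '\0';
    *offset = (size_t)(src - in);
    (void) iconv_close(cd);

    if (ret == (size_t)-1)
        return saved_errno == EILSEQ ? CONV_STOPPED : CONV_FAILED;
    /* A positive return value counts the non-reversible conversions. */
    return ret > 0 ? CONV_LOSSY : CONV_CLEAN;
}
```

Going by the results at the end of this mail, converting the mdash to
ISO-8859-2 should end up as CONV_LOSSY on NetBSD 3 and as CONV_STOPPED
(with *offset at the start of the mdash) with converters/libiconv or glibc.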
As far as I could find out, some other systems behave much like NetBSD
(Solaris 10, HP-UX 11i; I could not test FreeBSD [2]), while Linux/glibc
(Debian, RedHat) behaves the same way as converters/libiconv.
The man page available at
http://www.opengroup.org/onlinepubs/009695399/functions/iconv.html
shows that our implementation probably does the right thing as far as
standards are concerned but to me, the behaviour of converters/libiconv
appears to be more sensible: If there is some character that cannot be
converted, tell the application and provide pointers (see the arguments
for iconv()) where the problematic character can be found.
You could even argue that "implementation-defined conversion" may
include "interrupt the conversion and return an error".
The authors of libxml2 rely on this behaviour (probably without
realising it, all the world is Linux :-/) to replace such characters with
the equivalent XML notation for Unicode characters (a numeric character
reference such as &#8212; for the mdash).
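For illustration, a fallback of that kind might look roughly like the
sketch below. This is not libxml2's actual code; utf8_decode() and
convert_with_charrefs() are hypothetical names. It assumes UTF-8 input, a
stateless destination code set, the POSIX prototype with char ** for the
input buffer, and an iconv() that stops with EILSEQ at the offending
character. On an implementation that substitutes a question mark instead,
the fallback branch is simply never taken.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/*
 * Decode one UTF-8 sequence at s (n bytes available). Stores the code
 * point in *cp and returns the sequence length, or 0 if it is malformed.
 * Deliberately minimal: overlong forms are not rejected.
 */
static size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else
        return 0;
    if (len > n)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/*
 * Convert UTF-8 `in` to `tocode`, writing a NUL-terminated result to `out`.
 * Characters the destination cannot represent become "&#N;".
 * Returns 0 on success, -1 on any other error.
 */
static int convert_with_charrefs(const char *tocode, const char *in,
                                 char *out, size_t outsize)
{
    iconv_t cd;
    char *src = (char *)in;      /* iconv() does not modify the input */
    char *dst = out;
    size_t srclen = strlen(in);
    size_t dstlen;
    int rc = 0;

    if (outsize == 0)
        return -1;
    cd = iconv_open(tocode, "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;
    dstlen = outsize - 1;        /* keep one byte for the NUL */

    while (srclen > 0) {
        unsigned long cp;
        size_t n;
        int w;

        if (iconv(cd, &src, &srclen, &dst, &dstlen) != (size_t)-1)
            break;               /* all remaining input was converted */
        if (errno != EILSEQ) { rc = -1; break; }

        /* src now points at the character iconv() refused to convert. */
        n = utf8_decode((const unsigned char *)src, srclen, &cp);
        if (n == 0) { rc = -1; break; }
        w = snprintf(dst, dstlen + 1, "&#%lu;", cp);
        if (w < 0 || (size_t)w > dstlen) { rc = -1; break; }
        dst += w;  dstlen -= (size_t)w;
        src += n;  srclen -= n;
    }
    *dst = '\0';
    (void) iconv_close(cd);
    return rc;
}
```

The point of the sketch is that the fallback needs iconv() to leave the
input pointer at the offending character. When the implementation has
already written a question mark, the application can no longer tell which
character was lost.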
Because one of our goals is to follow standards "as much as is
practical", I would like to know whether we should really follow the
standard in this case.
ciao
Klaus
[1] http://mail.nl.linux.org/linux-utf8/2001-09/msg00050.html
[2] Surprisingly, on FreeBSD 5/6 I could not even find /usr/include/iconv.h
although it is mentioned in their man page for iconv().
/*
NetBSD 3:
cc -g -o iconvtest iconvtest.c
ret: 1, errno 2
utf8string: â
outstring : ?
srclen : 0
destlen : 255
NetBSD 3 with converters/libiconv:
cc -I/usr/pkg/include -o iconvtest iconvtest.c -L/usr/pkg/lib -R/usr/pkg/lib -liconv
ret: -1, errno 85
utf8string: â
outstring :
srclen : 3
destlen : 256
Debian:
ret: -1, errno 84
utf8string: â
outstring :
srclen : 3
destlen : 256
RedHat:
ret: -1, errno 84
utf8string: â
outstring :
srclen : 3
destlen : 256
Solaris 10:
ret: 1, errno 2
utf8string: â
outstring : ?
srclen : 0
destlen : 255
HP-UX 11i:
ret: 1, errno 0
utf8string: â
outstring :
srclen : 0
destlen : 255
*/
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>
/* mdash (—) cannot be converted to ISO-8859-* */
unsigned char utf8string[] = { (unsigned char)'\xE2',
                               (unsigned char)'\x80',
                               (unsigned char)'\x94',
                               '\0' };
unsigned char outbuffer[256];
int main(int argc, char * argv[]) {
    (void) memset(outbuffer, 0, sizeof(outbuffer));
    iconv_t cd = iconv_open("ISO-8859-2", "UTF-8");
    if ( cd != (iconv_t)-1 ) {
        unsigned char * src = utf8string;
        unsigned char * dest = outbuffer;
        size_t srclen = strlen((char *)src);
        size_t destlen = sizeof(outbuffer);
        size_t ret = iconv(cd, (const char **) &src, &srclen, (char **) &dest, &destlen);
        /* ret is a size_t; cast it so that (size_t)-1 prints as -1.
           errno is only meaningful when ret is -1, otherwise it may be stale. */
        printf("ret: %ld, errno %d\n", (long) ret, errno);
        printf("utf8string: %s\n", utf8string);
        printf("outstring : %s\n", outbuffer);
        printf("srclen : %lu\n", (unsigned long) srclen);
        printf("destlen : %lu\n", (unsigned long) destlen);
        (void) iconv_close(cd);
        return 0;
    } else {
        perror("could not obtain conversion descriptor");
        return 1;
    }
}