Subject: lib/36938: mbtowc misbehaving after invalid char sequence
To: None <lib-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: None <neil@daikokuya.co.uk>
List: netbsd-bugs
Date: 09/06/2007 13:15:00
>Number: 36938
>Category: lib
>Synopsis: mbtowc fails converting valid sequences after invalid one
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: lib-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Thu Sep 06 13:15:00 +0000 2007
>Originator: neil@daikokuya.co.uk
>Release: NetBSD 4.99.23
>Organization:
>Environment:
System: NetBSD duron.akihabara.co.uk 4.99.23 NetBSD 4.99.23 (GENERIC) #0: Sun Jul 15 10:39:38 JST 2007 root@duron.akihabara.co.uk:/usr/src/sys/arch/i386/compile/GENERIC i386
libc.so.12.150
Architecture: i386
Machine: i386
>Description:
See the commented example below. After mbtowc rejects an invalid
sequence, it fails to convert a subsequent valid sequence. This is
not limited to UTF-8; it also happens for other encodings, so I
believe the problem is generic, if indeed it is a bug. If it is not
a bug, mbtowc would seem to be useless in practice, since a single
bad sequence would break every later conversion. The code below
succeeds on Linux.
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
/* Valid 2-byte shift-JIS character, not valid UTF-8 sequence. */
const char sjis[] = "\x95\x5c";
/* Valid UTF-8, of course. */
const char space[] = " ";
int main (void)
{
  wchar_t wc;

  setlocale (LC_CTYPE, "ja_JP.UTF-8");

  /* Assert it is not state-dependent. */
  assert (mbtowc (&wc, 0, 1) == 0);

  /* Assert my charset beliefs. */
  assert (mbtowc (&wc, space, sizeof space) == 1);
  assert (mbtowc (&wc, sjis, sizeof sjis) == -1);

  /* Unnecessary assertion that we're not state-dependent, but
     just in case some state needs resetting. */
  assert (mbtowc (&wc, 0, 1) == 0);

  /* This assertion fails - I believe incorrectly. */
  assert (mbtowc (&wc, space, sizeof space) == 1);

  return 0;
}
>How-To-Repeat:
Compile and run above.
>Fix:
Unknown
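A possible caller-side workaround, offered only as a sketch and not
tested on the affected release: use mbrtowc with a caller-owned
mbstate_t and zero that state after any failed conversion, since the
C standard leaves the conversion state unspecified once mbrtowc has
returned (size_t)-1. The helper name convert_one is hypothetical and
not part of libc. It does not address the underlying mbtowc behaviour.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of libc.  Converts one multibyte
   character from s (at most n bytes) using the caller's state.
   Returns the byte count, 0 for the null character, or -1 on an
   invalid or incomplete sequence, mirroring mbtowc's return values. */
static int convert_one (wchar_t *wc, const char *s, size_t n, mbstate_t *st)
{
  size_t r;

  assert (st != NULL);
  r = mbrtowc (wc, s, n, st);
  if (r == (size_t) -1 || r == (size_t) -2)
    {
      /* A zero-filled mbstate_t is the initial conversion state, so
         the next call starts clean regardless of what went wrong. */
      memset (st, 0, sizeof *st);
      return -1;
    }
  return (int) r;
}
```

In the test case above, each mbtowc (&wc, s, n) call would become
convert_one (&wc, s, n, &st), with st declared as mbstate_t and
zero-filled with memset before first use.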
>Unformatted:
Around Jul15 2007