Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
To: None <lib-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Takehiko NOZAKI <th-nozaki@netwrk.co.jp>
List: netbsd-bugs
Date: 11/21/2007 16:25:02
The following reply was made to PR lib/36938; it has been noted by GNATS.
From: Takehiko NOZAKI <th-nozaki@netwrk.co.jp>
To: gnats-bugs@NetBSD.org
Cc: neil@daikokuya.co.uk
Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
Date: Thu, 22 Nov 2007 01:17:39 +0900
hi, Neil.
> Nozaki-san, reading the C standard again I think NetBSD is not
> behaving properly here, in the case of a non-state-dependent
> encoding. The standard says that calls to mbtowc alter the internal
> state "as necessary" (7.20.7). However, one of the assertions of
> the code I posted is that UTF-8 is not state-dependent; hence the
> converter should always be in the initial shift state (from the
> language user's point of view; I understand this may not be the case in
> the implementation). So I believe that, since UTF-8 is not a
> state-dependent encoding, we should be able to call mbtowc at any
> time and expect it to be in the initial shift state.
>
> I would agree that it is not 100% clear though.
>
i have now re-read ISO/IEC 9899:TC2 7.20.7 as well:
http://open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf
mmm, i might have misunderstood the reason why mbtowc(3)
returns -1 (not restartable) rather than -2 (restartable).
mbrtowc(3) has to store the bytes of the partial multibyte sequence
it was fed in mbstate_t, so that the conversion can be restarted.
but mbtowc(3) is not required to be restartable, so there's no need
to keep those bytes in its internal state.
if that is true, it means the internal state != the mbstate_t structure,
and a stateless encoding must not leave the internal state dirty.
you might be right.
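to make the restartable case concrete, here is a minimal sketch of what mbrtowc(3) is expected to do with the same kind of partial input. the helper names use_utf8_locale() and restart_demo() and the list of locale names are made up for this illustration, and the return values in the comments assume a libc whose UTF-8 converter reports incomplete input as (size_t)-2.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Try a few common UTF-8 locale names. Which ones exist is
 * system-dependent; the names here are assumptions.
 * Returns 1 on success, 0 if none could be selected.
 */
static int
use_utf8_locale(void)
{
    static const char *names[] = { "en_US.UTF-8", "C.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/*
 * Feed the three-byte sequence e3 80 80 to mbrtowc(3) in two pieces.
 * Returns 1 if the conversion restarted as expected, 0 if it did not,
 * and -1 if no UTF-8 locale was available.
 */
static int
restart_demo(wchar_t *out)
{
    const char head[2] = { (char)0xe3, (char)0x80 };
    const char tail[1] = { (char)0x80 };
    mbstate_t st;
    size_t r;

    if (!use_utf8_locale())
        return -1;
    memset(&st, 0, sizeof(st));    /* initial conversion state */

    /* incomplete sequence: (size_t)-2, the two bytes are kept in st */
    r = mbrtowc(out, head, sizeof(head), &st);
    if (r != (size_t)-2)
        return 0;

    /* the last byte completes the character from the saved state */
    r = mbrtowc(out, tail, sizeof(tail), &st);
    return r == 1;
}
```

the point is only that the caller-supplied mbstate_t carries the pending bytes between the two calls; mbtowc(3) has no such contract with its caller.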
...but the following code may also fail at line 22 with glibc2 (SuSE 10.0);
either they have the same problem, or they assume internal state == mbstate_t.
01 #include <assert.h>
02 #include <locale.h>
03 #include <stdlib.h>
04
05 /* partial UTF-8 string: mbtowc may return -1 (mbrtowc may return -2) */
06 const char partial_utf8[2] = { (char)0xe3, (char)0x80 };
07 /* valid UTF-8 string */
08 const char good_utf8[1] = { 0x20 };
09
10 int
11 main(void)
12 {
13     wchar_t wc;
14
15     setlocale(LC_CTYPE, "en_US.UTF-8");
16     assert(mbtowc(&wc, NULL, 0) == 0);
17     assert(mbtowc(&wc, good_utf8, sizeof(good_utf8)) == 1);
18     assert(mbtowc(&wc, partial_utf8, sizeof(partial_utf8)) == -1);
19 #if 0 /* omit re-initialization */
20     assert(mbtowc(&wc, NULL, 0) == 0);
21 #endif
22     assert(mbtowc(&wc, good_utf8, sizeof(good_utf8)) == 1);
23     return 0;
24 }
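for comparing several libcs without stopping at the first failed assert, a variant that simply returns the last result might be handy. this is only a sketch: probe_after_partial() is a hypothetical helper, the locale name is whatever UTF-8 locale the system provides, and -2 is used as a "locale not available" marker only because mbtowc(3) itself never returns it.

```c
#include <locale.h>
#include <stdlib.h>

/*
 * Report what mbtowc(3) returns for a plain space right after it has
 * been given a truncated UTF-8 sequence. If reset is nonzero, the
 * internal state is re-initialized in between with mbtowc(NULL, NULL, 0).
 * Returns -2 if the requested locale cannot be selected.
 */
static int
probe_after_partial(const char *locale_name, int reset)
{
    const char partial[2] = { (char)0xe3, (char)0x80 };
    const char good[1] = { 0x20 };
    wchar_t wc;

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;
    (void)mbtowc(NULL, NULL, 0);                 /* start from the initial state */
    (void)mbtowc(&wc, partial, sizeof(partial)); /* truncated input, expected to fail */
    if (reset)
        (void)mbtowc(NULL, NULL, 0);             /* same as line 20 above */
    return mbtowc(&wc, good, sizeof(good));
}
```

probe_after_partial("en_US.UTF-8", 0) corresponds to the program above with line 20 omitted: a result of -1 means the failed call left the internal state dirty, and 1 means it did not.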
anyway, i'll consider changing the current behavior.
very truly yours.
--
Takehiko NOZAKI <tnozaki@NetBSD.org>