netbsd-bugs: Re: lib/36938: mbtowc misbehaving after invalid char sequence

Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
To: None <lib-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Takehiko NOZAKI <th-nozaki@netwrk.co.jp>
List: netbsd-bugs
Date: 11/21/2007 16:25:02

The following reply was made to PR lib/36938; it has been noted by GNATS.

From: Takehiko NOZAKI <th-nozaki@netwrk.co.jp>
To: gnats-bugs@NetBSD.org
Cc: neil@daikokuya.co.uk
Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
Date: Thu, 22 Nov 2007 01:17:39 +0900

 hi, Neil.
 
 > Nozaki-san, reading the C standard again I think NetBSD is not
 > behaving properly here, in the case of a non-state-dependent
 > encoding.  The standard says that calls to mbtowc alter the internal
 > state "as necessary" (7.20.7).  However, one of the assertions of
 > the code I posted is that UTF-8 is not state-dependent; hence the
 > converter should always be in the initial shift state (from the
 > language user's point of view; I understand this may not be the case in
 > the implementation).  So I believe that, since UTF-8 is not a
 > state-dependent encoding, we should be able to call mbtowc at any
 > time and expect it to be in the initial shift state.
 > 
 > I would agree that it is not 100% clear though.
 > 
 
 I have now re-read ISO/IEC 9899:TC2 7.20.7 as well:
   http://open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf
 
 Hmm, I may have misunderstood the reason why mbtowc(3) returns
 -1 (not restartable) rather than -2 (restartable).
 
 mbrtowc(3) has to store the bytes of a partially fed multibyte
 sequence in its mbstate_t so that the conversion can be restarted.
 mbtowc(3), however, is not required to be restartable, so there is
 no need for it to store those bytes in its internal state.
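 
 To make the restart case concrete, here is a minimal sketch of the
 mbrtowc(3) behavior described above. select_utf8_locale() and
 restart_demo() are hypothetical helper names, and the locale names
 tried are assumptions about what the system has installed.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns 1 if one of them could be selected, 0 otherwise. */
static int
select_utf8_locale(void)
{
	static const char *const names[] = {
		"en_US.UTF-8", "C.UTF-8", "en_US.utf8"
	};
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		if (setlocale(LC_CTYPE, names[i]) != NULL)
			return 1;
	return 0;
}

/* Returns 1 if mbrtowc restarted as expected, 0 if it did not,
 * and -1 if no UTF-8 locale was available (nothing was checked). */
static int
restart_demo(void)
{
	/* U+3000 is E3 80 80 in UTF-8; feed it in two pieces. */
	static const char head[2] = { '\xe3', '\x80' };
	static const char tail[1] = { '\x80' };
	mbstate_t st;
	wchar_t wc;

	if (!select_utf8_locale())
		return -1;
	memset(&st, 0, sizeof(st));	/* initial conversion state */

	/* Incomplete but valid so far: (size_t)-2, progress is kept in st. */
	if (mbrtowc(&wc, head, sizeof(head), &st) != (size_t)-2)
		return 0;
	/* The last byte completes the character using the saved state. */
	if (mbrtowc(&wc, tail, sizeof(tail), &st) != 1)
		return 0;
	/* After a complete character the state should be initial again. */
	return mbsinit(&st) != 0;
}
```

 Calling restart_demo() from main() and checking for a nonzero result
 is enough to try it. The point is only that the caller-supplied
 mbstate_t is what carries the partial bytes between the two calls.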
 
 If that is true, it means the internal state is not the same thing
 as an mbstate_t structure, and a stateless encoding must not leave
 the internal state dirty.
 You might be right.
 
 ...but the following code may fail at line 22 with glibc2 (SuSE 10.0),
 so either they have the same problem too, or they assume
 internal state == mbstate_t.
 
 01 #include <assert.h>
 02 #include <locale.h>
 03 #include <stdlib.h>
 04 
 05 /* partial UTF-8 string: mbtowc may return -1 (mbrtowc may return -2) */
 06 const char partial_utf8[2] = { 0xe3, 0x80 };
 07 /* valid UTF-8 string */
 08 const char good_utf8[1] = { 0x20 };
 09 
 10 int
 11 main(void)
 12 {
 13 	wchar_t wc;
 14 
 15 	setlocale(LC_CTYPE, "en_US.UTF-8");
 16 	assert(mbtowc(&wc, NULL, 0) == 0);
 17 	assert(mbtowc(&wc, good_utf8, sizeof(good_utf8)) == 1);
 18 	assert(mbtowc(&wc, partial_utf8, sizeof(partial_utf8)) == -1);
 19 #if 0 /* omit re-initialization */
 20 	assert(mbtowc(&wc, NULL, 0) == 0);
 21 #endif
 22 	assert(mbtowc(&wc, good_utf8, sizeof(good_utf8)) == 1);
 23 	return 0;
 24 }
 
 Anyway, I'll consider changing the current behavior.
 
 very truly yours.
 -- 
 Takehiko NOZAKI <tnozaki@NetBSD.org>