netbsd-bugs: Re: lib/36938: mbtowc misbehaving after invalid char sequence

Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
To: None <lib-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Christos Zoulas <christos@zoulas.com>
List: netbsd-bugs
Date: 11/21/2007 18:25:02
The following reply was made to PR lib/36938; it has been noted by GNATS.

From: christos@zoulas.com (Christos Zoulas)
To: gnats-bugs@NetBSD.org, lib-bug-people@netbsd.org,
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, neil@daikokuya.co.uk
Cc: 
Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
Date: Wed, 21 Nov 2007 13:21:30 -0500

 On Nov 21,  4:25pm, th-nozaki@netwrk.co.jp (Takehiko NOZAKI) wrote:
 -- Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
 
 | The following reply was made to PR lib/36938; it has been noted by GNATS.
 | 
 | From: Takehiko NOZAKI <th-nozaki@netwrk.co.jp>
 | To: gnats-bugs@NetBSD.org
 | Cc: neil@daikokuya.co.uk
 | Subject: Re: lib/36938: mbtowc misbehaving after invalid char sequence
 | Date: Thu, 22 Nov 2007 01:17:39 +0900
 | 
 |  hi, Neil.
 |  
 |  > Nozaki-san, reading the C standard again I think NetBSD is not
 |  > behaving properly here, in the case of a non-state-dependent
 |  > encoding.  The standard says that calls to mbtowc alter the internal
 |  > state "as necessary" (7.20.7).  However, one of the assertions of
 |  > the code I posted is that UTF-8 is not state-dependent; hence the
 |  > converter should always be in the initial shift state (from the
 |  > language user's point of view; I understand this may not be the case in
 |  > the implementation).  So I believe that, since UTF-8 is not a
 |  > state-dependent encoding, we should be able to call mbtowc at any
 |  > time and expect it to be in the initial shift state.
 |  > 
 |  > I would agree that it is not 100% clear though.
 |  > 
 |  
 |  now i read ISO/IEC 9899:TC2 7.20.7 again too,
 |    http://open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf
 |  
 |  mmm, i might have been misunderstanding the reason why
 |  mbtowc(3) returns -1 (not restartable) rather than -2 (restartable).
 |  
 |  mbrtowc(3) has to store the partial bytes of the multibyte
 |  sequence fed to it in mbstate_t, so that it can restart.
 |  but for mbtowc(3), restart is not required, so there's no need
 |  to store those bytes in the internal state.
 |  
 |  if that is true, it means that internal-state != mbstate_t structure,
 |  and a stateless encoding must not make the internal state dirty.
 |  you might be right.
 |  
 |  ...but the following code may fail with glibc2 (SuSE 10.0) at line 22;
 |  either they have the same problem too, or they assume
 |  internal-state == mbstate_t.
 |  
 |  01 #include <assert.h>
 |  02 #include <locale.h>
 |  03 #include <stdlib.h>
 |  04 
 |  05 /* partial UTF-8 sequence: mbtowc may return -1 (mbrtowc may return -2) */
 |  06 const char partial_utf8[2] = { 0xe3, 0x80 };
 |  07 /* valid UTF-8 string */
 |  08 const char good_utf8[1] = { 0x20 };
 |  09 
 |  10 int
 |  11 main(void)
 |  12 {
 |  13 	wchar_t wc;
 |  14 
 |  15 	setlocale(LC_CTYPE, "en_US.UTF-8");
 |  16 	assert(mbtowc(&wc, NULL, 0) == 0);
 |  17 	assert(mbtowc(&wc, good_utf8, sizeof(good_utf8)) == 1);
 |  18 	assert(mbtowc(&wc, partial_utf8, sizeof(partial_utf8)) == -1);
 |  19 #if 0 /* omit re-initialization */
 |  20 	assert(mbtowc(&wc, NULL, 0) == 0);
 |  21 #endif
 |  22 	assert(mbtowc(&wc, good_utf8, sizeof(good_utf8)) == 1);
 |  23 	return 0;
 |  24 }
 |  
 |  anyway, i'll consider changing the current behavior.
 |  
 
 Either way, it would be useful to put parts of this discussion in comments
 in the code, so that the next person who touches it is aware of the issues.
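 
 As a rough illustration of the kind of comment meant here, the
 re-initialization that the test program leaves out at its line 20 could
 be wrapped up with the reasoning attached. This is only a sketch:
 mbtowc_clean is a made-up name, not an existing libc function, and it
 assumes the caller is happy to discard a partial sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of any libc): convert one multibyte
 * character with mbtowc(3), and never leave its hidden conversion
 * state dirty after a failure.
 *
 * Why: mbtowc(3) cannot be restarted (it reports -1, never -2), so
 * partial bytes kept in its internal state are of no use to the caller.
 * Some implementations keep them anyway, which makes the next call
 * fail even in a stateless encoding such as UTF-8 (see PR lib/36938).
 */
int
mbtowc_clean(wchar_t *pwc, const char *s, size_t n)
{
	int r;

	assert(s != NULL);	/* a NULL s is the reset request itself */

	r = mbtowc(pwc, s, n);
	if (r == -1) {
		/*
		 * A call with a null s puts mbtowc back into its initial
		 * conversion state (C99 7.20.7). The return value only
		 * says whether the encoding is state-dependent, so it is
		 * deliberately ignored here.
		 */
		(void)mbtowc(NULL, NULL, 0);
	}
	return r;
}
```

 A caller would use it exactly like mbtowc(3); after a -1 return the
 next call starts from the initial state, whether or not the
 implementation underneath keeps partial bytes around.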
 
 christos