Subject: Re: CVS commit: src/sys/dev/usb
To: Bill Studenmund <wrstuden@netbsd.org>
From: Tom Spindler <dogcow@babymeat.com>
List: tech-kern
Date: 02/26/2007 15:45:30
> > Please note that fs/unicode.h does not handle UTF-16 surrogates
> > correctly. What's worse, the API does not allow this to be fixed.
> >
> > (Unicode defines more characters than fit in a 16 bit int. In
> > UTF-16, a character with a code above 0xffff is represented as two
> > surrogate values. In UTF-8, it is encoded as a 5 byte sequence.
> > Encoding/decoding one 16 bit value at a time does not allow for this
> > conversion to be done correctly.)
Huh? You can encode 0x10000-0x10ffff in four UTF-8 bytes.
CESU-8, on the other hand, encodes each surrogate pair as six bytes -
but its usage is discouraged; see http://unicode.org/faq/utf_bom.html#30
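
To make the four-byte case concrete, here is a minimal sketch in C. The
function names utf16_pair_to_codepoint() and codepoint_to_utf8() are
hypothetical and are not part of fs/unicode.h; the point is only that
the converter has to see both surrogates together before it can emit
the UTF-8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine a UTF-16 surrogate pair into one code
 * point. Returns 0 if the two values are not a valid high/low pair.
 */
static uint32_t
utf16_pair_to_codepoint(uint16_t hi, uint16_t lo)
{
	if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
		return 0;
	return 0x10000 +
	    (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/*
 * Hypothetical helper: encode one code point as UTF-8. Returns the
 * number of bytes written (1 to 4), or 0 if cp is a lone surrogate
 * or lies above 0x10FFFF.
 */
static size_t
codepoint_to_utf8(uint32_t cp, uint8_t out[4])
{
	if (cp < 0x80) {
		out[0] = (uint8_t)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (uint8_t)(0xC0 | (cp >> 6));
		out[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogates are not characters */
		out[0] = (uint8_t)(0xE0 | (cp >> 12));
		out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (uint8_t)(0xF0 | (cp >> 18));
		out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (uint8_t)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

For example, U+1D11E is the surrogate pair 0xD834 0xDD1E in UTF-16 and
comes out as the four bytes f0 9d 84 9e in UTF-8. Running each
surrogate through the three-byte branch separately (without the
surrogate check) is what yields the six-byte CESU-8 form.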