Subject: Re: utf-8 and userland
To: None <tech-userlevel@NetBSD.org>
From: Dave Huang <khym@azeotrope.org>
List: tech-userlevel
Date: 03/12/2004 19:30:52
On Fri, Mar 12, 2004 at 08:02:45PM -0500, James K. Lowden wrote:
> On Sat, 13 Mar 2004, Noriyuki Soda <soda@sra.co.jp> wrote:
> > Yes, you can use iswprint(3) by converting the multibyte characters
> > to wide characters.
>
> I don't see how that can be right. iswprint(3) takes a wint_t argument;
> the UTF-8 character will be a sequence of 1-4 bytes. Even if you redefine
> the argument, how is ls(1) supposed to know where the character boundaries
> are?

I think he means you can use iswprint(3) _after_ converting the
multibyte characters to wide characters, not "by converting...". I.e.,
use mbtowc(3) to convert each UTF-8 sequence to a wide character first,
then use iswprint(3) to check the result. mbtowc(3) decodes according
to the current locale's encoding and returns the number of bytes it
consumed, so it knows where the character boundaries are.

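Roughly, the loop would look something like the sketch below. This is
only an illustration, not what ls(1) actually does; count_printable()
is a made-up name, and treating an undecodable byte as one unprintable
character is my own assumption about how to recover from bad input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: walk the multibyte string s, decoding one
 * character at a time with mbtowc(3) and classifying it with
 * iswprint(3). Decoding follows the current LC_CTYPE locale.
 */
void
count_printable(const char *s, size_t *printable, size_t *other)
{
	size_t left;

	assert(s != NULL && printable != NULL && other != NULL);
	left = strlen(s);
	*printable = 0;
	*other = 0;

	(void)mbtowc(NULL, NULL, 0);	/* reset the conversion state */
	while (left > 0) {
		wchar_t wc;
		int n = mbtowc(&wc, s, left);

		if (n <= 0) {
			/*
			 * Invalid or truncated sequence (assumption:
			 * count one byte as unprintable and resync).
			 */
			(void)mbtowc(NULL, NULL, 0);
			n = 1;
			(*other)++;
		} else if (iswprint((wint_t)wc)) {
			(*printable)++;
		} else {
			(*other)++;
		}
		/* n is the number of bytes this character occupied */
		s += n;
		left -= (size_t)n;
	}
}
```

The caller would need to do setlocale(LC_CTYPE, "") first, with the
environment selecting a UTF-8 locale, for multi-byte sequences to be
decoded as UTF-8; in the default "C" locale only plain ASCII is sure
to be handled.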
> It's my understanding that "wide characters" refer to a class of encodings
> that predate Unicode and UTF-8. New times, new features....
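
A wide character is just a character stored in a fixed-size type
that's bigger than a byte (wchar_t in C). While they may predate
Unicode, they're not obsolete or superseded by UTF-8: UTF-8 is a
multibyte encoding, and wide characters are what the isw*() functions
operate on once the bytes have been decoded.

Semi-offtopic, but tcsh's builtin ls-F handles UTF-8 properly.
However, command-line editing doesn't work right. I posted a bug report
to the tcsh-bugs mailing list, but I think it was ignored...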
--
Name: Dave Huang | Mammal, mammal / their names are called /
INet: khym@azeotrope.org | they raise a paw / the bat, the cat /
FurryMUCK: Dahan | dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 28 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++