tech-userlevel archive
Re: A draft for a multibyte and multi-codepoint C string interface
> I'm interested in useability. We already *permit* filenames to be
> encoded with UTF-8, but we don't *support* them.
That's what "support" _means_, in a lot of cases. "Do you support
filenames longer than 14 characters?"
> We permit two filenames in one directory whose letter sequence is
> identical if the byte sequence differs.
Sure. Like "HeLLo" and "hello". Some filesystems consider those to be
the same name.
> The sort order is arbitrary: "coeur" and "cœur" don't sort next to
> each other, although they should.
(a) that's encoding-dependent (whatever octet sequence it is that you
think of as the oe ligature may mean something completely different to
whoever created the file); (b) they can be made to by using
encoding-aware sorting code in whatever program is doing the sorting.
(Which actually has to be language-, or at least locale-, aware too;
consider the ae-vs-æ example, where the linguistically-appropriate sort
order for æ differs between English and Norwegian (and maybe others).)
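
A minimal Python sketch of point (b) and the locale caveat, for illustration only. The function `sort_key`, the language tags "en" and "no", and the rules inside it are hypothetical hand-written stand-ins, not taken from any real locale database; a real program would more likely lean on the C library's collation (strcoll(3)/strxfrm(3)). It assumes the names have already been decoded to characters, and it encodes just two conventions: English treats æ as the letters "ae", while Norwegian treats æ as a separate letter that sorts after z.

```python
def sort_key(word, lang):
    """Build a sort key for `word` under a toy collation for `lang`.

    Each element is (rank, letter); rank 0 is an ordinary letter and
    rank 1 sorts after every ordinary letter.
    """
    key = []
    for ch in word.lower():
        if ch == "\u00e6":                      # the ligature æ
            if lang == "en":
                # English convention: collate as the two letters "ae".
                key.extend([(0, "a"), (0, "e")])
            else:
                # Norwegian convention: a distinct letter placed after z.
                key.append((1, ch))
        else:
            key.append((0, ch))
    return key


words = ["zebra", "\u00e6re", "afar", "agave"]
english = sorted(words, key=lambda w: sort_key(w, "en"))
norwegian = sorted(words, key=lambda w: sort_key(w, "no"))
```

With these toy rules the word beginning with æ sorts first in the English ordering (as "ae...") and last in the Norwegian one, so the same list of names has two different "correct" orders depending on who is reading it.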
> The user has no way to know nor reason to care whether "året" uses
> four Unicode code points or five.
Or no Unicode anything, if the user doesn't happen to find Unicode
appropriate for the task at hand.
I think this is one of the most fundamental disagreements between us:
you seem to want user interfaces to hide such details, while I want
full visibility into what's really there (see below). And, you push
the hiding line so far that it actually crosses into the kernel; I
think that is user interface stuff and belongs in userland, in, well,
user interface code.
> If he types "vi året", I think the file should open if the character
> strings match regardless of the byte-sequences, but today the odds
> are 1:4 against.
Where'd you get 1:4? That seems to me to presume probabilities for a
number of things which I doubt you have even moderately precise numbers
for, such as the chance that the file was created using a different
encoding from the one used on the vi command line.
vi and/or the shell could actually do pretty much this now, if they
felt like it, by using a glob(3)-alike that considers any character
with more than one representation - or, more precisely, any octet
sequence which represents a character which has more than one
representation - to be a globbing wildcard matching any of its
representations.
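
A minimal Python sketch of such a glob(3)-alike, under assumptions the message itself does not make: that names are UTF-8, and that "more than one representation" means Unicode canonical equivalence (approximated here by comparing NFC forms). The names `canonical` and `loose_glob` are hypothetical, and the sketch takes a list of directory entries as bytes rather than reading a directory. Like glob(3), it can return more than one match; what to do then is left to the caller.

```python
import unicodedata


def canonical(name: bytes) -> bytes:
    """Map a name to one representative of its equivalence class.

    Names that are not valid UTF-8 are returned unchanged, so they
    only ever match themselves, octet for octet.
    """
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return name
    return unicodedata.normalize("NFC", text).encode("utf-8")


def loose_glob(typed: bytes, entries):
    """Return every entry equivalent to `typed`, in directory order."""
    want = canonical(typed)
    return [e for e in entries if canonical(e) == want]


# "året" with a precomposed å (four code points) ...
composed = "\u00e5ret".encode("utf-8")
# ... and with "a" followed by a combining ring above (five code points).
decomposed = "a\u030aret".encode("utf-8")

entries = [composed, decomposed, b"aret", b"\xe5ret"]  # last one is Latin-1
matches = loose_glob(decomposed, entries)
```

Here `matches` holds both the composed and the decomposed name, while the plain-ASCII "aret" and the Latin-1 octet sequence are left alone. Keeping this in the shell or editor leaves the kernel's view of names as plain octet sequences untouched.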
> Who considers that state of affairs good?
Me, for one. I want vi and shells, and command-line tools in general,
to give me visibility into what's really there. Not some
equivalence-class mangling of it. Nor do I want them, and even less
stuff on the kernel side of the privilege divide, to inflict any
particular encoding, especially not one as broken (for many purposes)
as UTF-8, on me.
I don't want "vi året" to match a filename in the filesystem whose
octet sequence is different from the one generated when I typed. Not
even if the octet sequences in question represent the same character in
an encoding which you think should have some kind of special status.
> I'm confident that glob(3) could be adapted to Unicode, that open(2)
> could canonicalize, that ffs could be changed to reflect the
> encoding, and mount(2) to enforce it.
Could be, yes.
> That's just a small matter of programming.
I think it's less small than you seem to. But until/unless someone
tries to do it, we can't really know.
> For it to happen, though, we need consensus that it's good and
> necessary. A consensus that seems surprisingly hard to establish.
I would say, _reassuringly_ hard to establish. But that difference
probably reflects nothing but which sides of the issue we're on.
/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML mouse%rodents-montreal.org@localhost
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B