tech-userlevel archive
Re: A draft for a multibyte and multi-codepoint C string interface
On Tue, 2 Apr 2013 18:08:01 +0200
tlaronde%polynum.com@localhost wrote:
> UTF-8 has the same role as UTC time. There is one and only one
> canonical representation, fixed. And the display of the information
> is customized according to user level rules.
UTC is a simpler problem. With UTF-8, the same sequence of characters may
be represented by more than one sequence of bytes. And, while NetBSD may
prevent non-canonical sequences in filenames, it must be able to mount
and cope with filesystems that were not so carefully managed by
other systems.
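
To make the point above concrete, here is a minimal sketch in Python
(chosen only for brevity; the thread itself is about a C interface). It
assumes the ambiguity in question is Unicode canonical equivalence, that
is, a precomposed character versus a base letter plus a combining mark.
The variable names are illustrative only.

```python
import unicodedata

# Two spellings of the same user-visible character "é".
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both are valid UTF-8, but the byte sequences differ.
composed_bytes = composed.encode("utf-8")      # b'\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")  # b'e\xcc\x81'

# A byte-for-byte comparison says the names differ ...
same_bytes = composed_bytes == decomposed_bytes

# ... while comparison after normalizing both to NFC says they match.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_bytes, same_after_nfc)
```

Neither spelling is malformed, so a filesystem written by another system
can legitimately contain either one.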
> So that the kernel interface should take and give UTF-8, and that
> filesystem drivers should take and give UTF-8, user level utilities
> converting from the current encoding to unicode and UTF-8.
>
> But that's all. If one user really wants to take into account
> acrobatics about collating sequences and the like, he can use/develop
> a program to do so.
You can't fob it off to userspace. At least I don't think so.
Consider open(2). Every element in the pathname needs
canonicalization. OK, userspace can do that. But what if the
filesystem doesn't conform? Say, because it's a CD-ROM, or a camera,
never mind NFS, sshfs, Samba, or puffs.
ISTM that to open a file, the kernel needs a more sophisticated
definition of string equality than a byte-for-byte comparison. At the
very least, it has to be able to canonicalize extant names on the disk,
and to deal somehow with duplicates.
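
As a rough illustration of what that could mean, the sketch below
compares directory entries by their NFC form instead of by raw bytes,
and reports on-disk names that collide once canonicalized. It is not
kernel code and not a proposed interface: the helper names canonical(),
lookup() and duplicates() are hypothetical, NFC is just one possible
choice of canonical form, and the sketch assumes every on-disk name is
valid UTF-8 (an invalid name would raise UnicodeDecodeError here).

```python
import unicodedata
from collections import defaultdict


def canonical(name: bytes) -> str:
    """Return the NFC form of an on-disk name, assumed to be valid UTF-8."""
    return unicodedata.normalize("NFC", name.decode("utf-8"))


def lookup(entries, wanted: bytes):
    """Return every on-disk name that is canonically equal to `wanted`."""
    key = canonical(wanted)
    return [entry for entry in entries if canonical(entry) == key]


def duplicates(entries):
    """Group on-disk names that collapse to the same canonical form."""
    groups = defaultdict(list)
    for entry in entries:
        groups[canonical(entry)].append(entry)
    return {key: names for key, names in groups.items() if len(names) > 1}


# A directory as another system might have written it: "café" appears
# twice, once precomposed and once with a combining accent.
entries = [b"caf\xc3\xa9", b"cafe\xcc\x81", b"readme"]

matches = lookup(entries, b"caf\xc3\xa9")
collisions = duplicates(entries)
print(matches)
print(collisions)
```

With these entries the lookup matches two distinct on-disk names, which
is exactly the duplicate case: something has to decide which one
open(2) returns, or whether to refuse.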
--jkl