Re: Internationalization: UTF-8 for file names

To: "Joseph Galbraith" <galb-list%vandyke.com@localhost>
Subject: Re: Internationalization: UTF-8 for file names
From: nisse%lysator.liu.se@localhost (Niels Möller)
Date: 22 Mar 2002 14:55:08 +0100

"Joseph Galbraith" <galb-list%vandyke.com@localhost> writes:

> I name a directory using Cyrillic, and put files named
> in Japanese inside of it.  Not a problem.  That's because
> NT uses Unicode to store file names.

Then I'd say you're not using different character sets: all parts of
the filename are stored as unicode in the filesystem. There's a
system-wide policy that filenames are always stored in unicode.

In this case (now I'm talking about a generic OS with unicode
filenames and locales; I don't know if and how Windows locales work in
detail), if I use a locale that says that I only understand filenames
in latin-x, filenames I create will be converted from latin-x to
unicode before they reach the real filesystem. When I access the
filesystem, names that are representable in latin-x get converted
before I see them, and if I want to access files using Japanese
filenames, I'll simply lose in one way or the other.
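
A minimal sketch of that lossy round trip, in Python. It assumes
Latin-1 stands in for "latin-x"; the helpers to_locale and from_locale
are hypothetical and only model what such an OS might do, with "?"
replacement as one of several possible ways to lose:

```python
def to_locale(name: str, charset: str = "latin-1") -> bytes:
    """Hypothetical: convert a unicode filename to the user's locale charset.

    Characters the charset cannot represent are replaced with '?'.
    """
    return name.encode(charset, errors="replace")


def from_locale(raw: bytes, charset: str = "latin-1") -> str:
    """Hypothetical: convert locale bytes back to unicode."""
    return raw.decode(charset)


swedish = "smörgås.txt"   # representable in Latin-1
japanese = "日本語.txt"    # not representable in Latin-1

# Names within the locale's repertoire survive the round trip.
round_trip_ok = from_locale(to_locale(swedish)) == swedish

# Names outside it are mangled, so the original file cannot be named.
mangled = from_locale(to_locale(japanese))
```

Here round_trip_ok is true, while mangled no longer equals the
original Japanese name, so the user cannot refer to that file.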

I claim this problem is *irrelevant* for NFS and sftp: It's good
enough if things that *work* using a local filesystem also work over
the network. For configurations that cause trouble even on a local
filesystem, it's fine if we don't get it right either (although we
should fail as gracefully as possible).

> My guess is that there are other filesystems
> out there that use unicode for file names.
> BeOS's filesystem did, if I remember correctly.

Unix filesystems (or any eight-bit clean filesystem) support unicode.
The only thing you need to do is to create a system-wide policy that
says that all filenames should use utf-8 with some particular
normalization form. In particular, normalization is needed for the
"/" character.
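
To illustrate why a normalization form has to be part of the policy,
here is a small Python sketch. The choice of NFC and the helper name
canonical_name are assumptions made for the example, not something
any spec mandates:

```python
import unicodedata

composed = "M\u00f6ller"     # 'ö' as a single code point (U+00F6)
decomposed = "Mo\u0308ller"  # 'o' followed by a combining diaeresis (U+0308)

# Both render the same, but their utf-8 byte sequences differ, so an
# eight-bit clean filesystem treats them as two distinct filenames.
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")


def canonical_name(name: str, form: str = "NFC") -> bytes:
    """Hypothetical policy: normalize first, then encode as utf-8."""
    return unicodedata.normalize(form, name).encode("utf-8")


# After normalization, both spellings map to the same bytes.
same_after_policy = canonical_name(composed) == canonical_name(decomposed)
```

Without the policy the two spellings are different files; with it
they are the same name.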

> Well, we could go that route, however, in that case, 
> when UTF-8 is not in use, we must specify what charset
> is in use, according to my reading of RFC2277 3.1.

I'll not argue about IETF policies, as I'm not familiar enough with
them.

> I really think just specifying filenames as being encoded
> in UTF-8 is the best solution.  UTF-8 is already used
> throughout the other pieces of SSH; it complies with RFC2277,
> and it allows systems that can support multiple char-sets to
> work.

That is acceptable to me IF and ONLY IF the spec says which party is
responsible for implementing the unicode equivalence requirements.
(I don't recall offhand what the other ssh specs say about
normalization of passwords and usernames, but they are also *badly*
broken if they neglect the normalization issues.)

> I really think my users would be happiest if this stuff just
> worked, which is possible with UTF-8 required, but seems
> unlikely otherwise.

As I said above, I can accept utf-8 on the wire, iff the normalization
stuff is addressed properly. But you should be aware that it does
*not* magically solve all character set problems.

For instance, consider a unix system where one user uses latin-1 for
his filenames, and another user uses utf-8. The server can't know
this; perhaps there's a system-wide policy that utf-8 should be used,
and the latin-1 user ignores it. Then he will not be able to access
his files over sftp (and he actually has a *better* chance with the
current sftp protocol: sftp will work perfectly as long as he uses it
between similar systems).
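
A rough Python sketch of that failure. The function name_for_wire is
a hypothetical stand-in for a server that assumes every on-disk name
is utf-8; a real server might behave differently, for example by
substituting replacement characters instead of failing:

```python
def name_for_wire(raw_name: bytes) -> str:
    """Hypothetical server step: interpret on-disk bytes as utf-8.

    Raises UnicodeDecodeError if the bytes are not valid utf-8.
    """
    return raw_name.decode("utf-8")


# The same visible name, as stored by two users with different habits.
utf8_on_disk = "Möller.txt".encode("utf-8")      # b'M\xc3\xb6ller.txt'
latin1_on_disk = "Möller.txt".encode("latin-1")  # b'M\xf6ller.txt'

# The utf-8 user's file is handled fine.
utf8_name = name_for_wire(utf8_on_disk)

# The latin-1 user's file cannot be interpreted under the utf-8 policy.
try:
    name_for_wire(latin1_on_disk)
    latin1_accessible = True
except UnicodeDecodeError:
    latin1_accessible = False
```

With the current protocol the bytes would simply pass through
unchanged, which is why the latin-1 user is better off as long as
both ends agree.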

Don't get me wrong, I'm not saying that handling this situation right
should be a requirement for sftp (it's similar to the situation
described in my first paragraphs that I argue we need not solve). I'm
just giving an example of a real problem that isn't solved by adopting
utf-8 on the wire. utf-8 is a more or less universal character
encoding. It is not a magic wand.

Regards,
/Niels
