Re: SFTP and unicode file names...

To: Niels Möller <nisse%lysator.liu.se@localhost>, Joseph Galbraith <galb-list%vandyke.com@localhost>
Subject: Re: SFTP and unicode file names...
From: Jeffrey Hutzelman <jhutz%cmu.edu@localhost>
Date: Fri, 08 Oct 2004 12:05:30 -0400

On Friday, October 08, 2004 13:28:45 +0200 Niels Möller <nisse%lysator.liu.se@localhost> wrote:

Joseph Galbraith <galb-list%vandyke.com@localhost> writes:

Here are some possibilities:

1. Let the server say what it is going to use,
    UTF-8 or 'undefined-raw' at the beginning
    of the sftp session.

    pro: simple.  really simple.
    con: Doesn't address cases where the server
         might be able to do utf-8 for some file
         systems (ext-3 under fedora, I think is
         one example) but not others.


I'd much prefer the simple way. Just let the server say what character
set it uses, and do no conversions at the server end (a server that
really knows what it is doing MAY handle a file system with mixed
character sets by converting them to and from utf-8, and use utf-8
exclusively on the wire, but I don't think that is very practical).
Some possible values are "unknown", "usascii", "iso-8859-1", "utf-8",
and "utf-8-with-normalization-form-c".

So far, so good. Note that the appropriate set of possible values is that
contained in the IANA Character Set registry
(http://www.iana.org/assignments/character-sets). When a character set in
that registry has more than one name or alias, the alias designated as
"preferred MIME name" SHOULD be used.
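As a rough illustration of the "preferred MIME name" rule, here is a
minimal Python sketch. The table is a tiny hand-copied excerpt of the
registry, not the registry itself, and the function name
advertised_charset and the fallback to "unknown" are assumptions made
for this example only.

```python
# Hand-copied excerpt of IANA registry names/aliases, keyed in lowercase
# because registry names are matched case-insensitively. Illustrative only.
PREFERRED_MIME_NAME = {
    "ansi_x3.4-1968": "US-ASCII",
    "us-ascii": "US-ASCII",
    "us": "US-ASCII",
    "csascii": "US-ASCII",
    "iso_8859-1:1987": "ISO-8859-1",
    "iso-8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "l1": "ISO-8859-1",
    "csisolatin1": "ISO-8859-1",
    "utf-8": "UTF-8",
    "csutf8": "UTF-8",
}


def advertised_charset(name):
    """Map a registry name or alias to its preferred MIME name.

    Names missing from the excerpt map to "unknown" (an assumption,
    borrowing the "unknown" value suggested earlier in the thread).
    """
    return PREFERRED_MIME_NAME.get(name.strip().lower(), "unknown")
```

With this table, a server configured with "latin1" would advertise
"ISO-8859-1" rather than one of the other aliases.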


But see below...

(Note that utf-8 with undefined normalization is suboptimal for
filenames and for identifiers in general).

Yes. I'd normally argue that in any case where UTF-8 will be used on the
wire, it probably ought to be normalized. However, the key issue here is
not really normalization of names sent by the server but of those sent by
the client, which refer to the names in the filesystem. Unfortunately,
there are filesystems that use unnormalized Unicode, so the server would be
required to do path processing itself, normalizing all the names in each
directory until it found one matching what the user requested. That's
going to get unwieldy real fast, I think.
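To make the cost concrete, the lookup described above can be sketched in
Python roughly as follows. This is only an illustration under stated
assumptions: find_entry, resolve_path and the list_directory callback
are hypothetical names, NFC is picked as the normalization form, and a
real server would be reading directories rather than calling a callback.

```python
import unicodedata


def find_entry(requested, directory_entries):
    """Return the stored name whose NFC form equals the NFC form of the
    requested name, or None. Every entry may have to be normalized."""
    wanted = unicodedata.normalize("NFC", requested)
    for stored in directory_entries:
        if unicodedata.normalize("NFC", stored) == wanted:
            return stored
    return None


def resolve_path(requested_path, list_directory):
    """Resolve a '/'-separated path one component at a time.

    list_directory is a hypothetical callback that takes a tuple of the
    already-resolved (stored) components and returns the names found in
    that directory. Returns the stored components, or None if any
    component has no match.
    """
    resolved = []
    for component in requested_path.split("/"):
        if not component:
            continue  # skip empty components from leading or doubled slashes
        match = find_entry(component, list_directory(tuple(resolved)))
        if match is None:
            return None
        resolved.append(match)
    return resolved
```

Each path component costs a scan of a whole directory, with one
normalization per entry, which is the part that gets unwieldy.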

I think the typical unix filesystem uses usascii for all "system
files", and then has some home directories with iso-8859-x names, some
with utf-8 names, which all include usascii. (I admit that I don't
have any experience with asian unix installations). Then it will work
reasonably to set the advertised charset from the user's $LC_CTYPE.

I would expect this to be more or less the case. In fact, I'd expect that
on most systems, most users will be using the _same_ character set.
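For what it's worth, deriving the advertised value from the user's
locale environment might look something like the Python sketch below.
The function name charset_from_locale_env is made up for this example,
the mapping of the C/POSIX locale to "US-ASCII" and the "unknown"
fallback are assumptions, and the codeset is returned as written in the
locale name without being mapped to a registry name.

```python
def charset_from_locale_env(env):
    """Guess a character set name from locale environment variables.

    env is a mapping such as os.environ. The usual POSIX precedence is
    followed: LC_ALL overrides LC_CTYPE, which overrides LANG. Locale
    names have the form language[_territory][.codeset][@modifier].
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            break
    else:
        return "unknown"  # nothing set at all

    value = value.split("@", 1)[0]  # drop any @modifier
    if "." in value:
        return value.split(".", 1)[1]  # the codeset part
    if value in ("C", "POSIX"):
        return "US-ASCII"  # assumption: treat the C locale as usascii
    return "unknown"  # locale names no codeset
```

A server could call this with the session user's environment and fall
back to "unknown" when the locale does not name a codeset.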

There will naturally be some difficulties if different parts of the
filesystem tree use different character sets, or if there are single
pathnames that use components (directory names) with different
character sets. However, those difficulties are precisely the same
with *local* access to the file system, so it makes absolutely no
sense to try to solve them in the sftp spec.

I agree -- solving that problem is beyond our ability, and I do not think
it is worth the effort to try.

There is one thing I'd do differently. Whenever possible, I'd prefer that
the server not just advertise the character set but actually do the
conversions. The theory here is that the server is likely to have at its
disposal the tools needed to convert between Unicode and any other
character set in use on the server. The _client_, on the other hand, can
only reasonably be expected to have those tools for character sets in use
on the client. When these sets are disjoint, interoperability demands the
use on the wire of a common character set (i.e. Unicode).


Let the server advertise the character set it thinks is in use.
Let the client decide whether it wants UTF-8 or raw bytes.
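A minimal Python sketch of that split, assuming the server knows its
advertised charset and the client has somehow signalled its choice. The
function names and the client_wants_utf8 flag are invented for the
example; nothing here is taken from the sftp draft.

```python
def name_to_wire(stored, server_charset, client_wants_utf8):
    """Convert a filename (bytes as stored on disk) for sending.

    Raw mode, or a server charset of "unknown", passes bytes through
    untouched. Otherwise the server converts to UTF-8. A name that is
    not valid in server_charset raises UnicodeDecodeError.
    """
    if not client_wants_utf8 or server_charset == "unknown":
        return stored
    return stored.decode(server_charset).encode("utf-8")


def name_from_wire(wire, server_charset, client_wants_utf8):
    """Convert a filename received from the client back to the bytes
    the filesystem is expected to hold (the inverse of name_to_wire)."""
    if not client_wants_utf8 or server_charset == "unknown":
        return wire
    return wire.decode("utf-8").encode(server_charset)
```

The point of the design is that only the server needs a converter for
its own character set; the client needs nothing beyond UTF-8 unless it
explicitly asks for raw bytes.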

-- Jeff

