Re: SFTP and unicode file names...

To: Jeffrey Hutzelman <jhutz%cmu.edu@localhost>
Subject: Re: SFTP and unicode file names...
From: Joseph Galbraith <galb-list%vandyke.com@localhost>
Date: Sat, 09 Oct 2004 16:48:22 -0600

Jeffrey Hutzelman wrote:

On Friday, October 08, 2004 13:28:45 +0200 Niels Möller <nisse%lysator.liu.se@localhost> wrote:
Joseph Galbraith <galb-list%vandyke.com@localhost> writes:
Here are some possibilities:

1. Let the server say what it is going to use,
    UTF-8 or 'undefined-raw' at the beginning
    of the sftp session.

    pro: simple.  really simple.
    con: Doesn't address cases where the server
         might be able to do utf-8 for some file
         systems (ext-3 under fedora, I think is
         one example) but not others.
I'd much prefer the simple way. Just let the server say what character
set it uses, and do no conversions at the server end (a server that
really knows what it is doing MAY handle a file system with mixed
character sets by converting them to and from utf-8, and use utf-8
exclusively on the wire, but I don't think that is very practical).
Some possible values are "unknown", "usascii", "iso-8859-1", "utf-8",
and "utf-8-with-normalization-form-c".
So far, so good. Note that the appropriate set of possible values is
that contained in the IANA Character Set registry
(http://www.iana.org/assignments/character-sets). When a character set
in that registry has more than one name or alias, the alias designated
as "preferred MIME name" SHOULD be used.
But see below...
(Note that utf-8 with undefined normalization is suboptimal for
filenames and for identifiers in general).
Yes. I'd normally argue that in any case where UTF-8 will be used on
the wire, it probably ought to be normalized. However, the key issue
here is not really normalization of names sent by the server but of
those sent by the client, which refer to the names in the filesystem.
Unfortunately, there are filesystems that use unnormalized Unicode, so
the server would be required to do path processing itself, normalizing
all the names in each directory until it found one matching what the
user requested. That's going to get unwieldy real fast, I think.
I think the typical unix filesystem uses usascii for all "system
files", and then has some home directories with iso-8859-x names, some
with utf-8 names, which all include usascii. (I admit that I don't
have any experience with asian unix installations). Then it will work
reasonably to set the advertised charset from the user's $LC_CTYPE.
I would expect this to be more or less the case. In fact, I'd expect
that on most systems, most users will be using the _same_ character set.
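As a rough illustration of advertising a charset derived from the user's
locale, a server might do something like the following Python sketch. The
function names and the small mapping table are hypothetical and cover only
a few charsets mentioned in this thread; nl_langinfo(CODESET) is the usual
POSIX way to ask for the locale's codeset, but the string it returns is not
guaranteed to be an IANA name, which is why a mapping step is assumed.

```python
import codecs
import locale

# Hypothetical table: Python codec name -> IANA preferred MIME name.
# Only a few charsets from this thread are listed; a real server would
# need a far more complete mapping.
_IANA_NAMES = {
    "ascii": "US-ASCII",
    "iso8859-1": "ISO-8859-1",
    "utf-8": "UTF-8",
    "euc_jp": "EUC-JP",
    "shift_jis": "Shift_JIS",
}


def advertised_charset(codeset: str) -> str:
    """Map a locale codeset string to the name a server might advertise.

    Falls back to "unknown" when the codeset is not recognised.
    """
    try:
        canonical = codecs.lookup(codeset).name
    except LookupError:
        return "unknown"
    return _IANA_NAMES.get(canonical, "unknown")


def codeset_from_environment() -> str:
    """Return the codeset implied by LC_ALL / LC_CTYPE / LANG (Unix only).

    Note: setlocale() changes process-wide state.
    """
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # keep whatever locale is already active
    return locale.nl_langinfo(locale.CODESET)
```

A server would then call advertised_charset(codeset_from_environment())
once at session start and send "unknown" whenever the lookup fails.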
There will naturally be some difficulties if different parts of the
filesystem tree use different character sets, or if there are single
pathnames that use components (directory names) with different
character sets. However, those difficulties are precisely the same
with *local* access to the file system, so it makes absolutely no
sense to try to solve them in the sftp spec.
I agree -- solving that problem is beyond our ability, and I do not
think it is worth the effort to try.
There is one thing I'd do differently. Whenever possible, I'd prefer
that the server not just advertise the character set but actually do the
conversions. The theory here is that the server is likely to have at
its disposal the tools needed to convert between Unicode and any other
character set in use on the server. The _client_, on the other hand,
can only reasonably be expected to have those tools for character sets
in use on the client. When these sets are disjoint, interoperability
demands the use on the wire of a common character set (i.e. Unicode).


I would definitely prefer to see the server do the translation
when it can... that's why we went to UTF-8 in the first place.
If the server knows the charset, it should just do the conversion.
For example, Windows systems don't know about EUC-JIS (for Japanese
encoding); they use an alternate encoding called Shift-JIS.  So the server
can easily translate from EUC-JIS to UTF-8, using built-in OS support
(if it is using EUC-JIS anyway).  There has to be custom code written
on the client side though to do the conversion.  That is yucky.  So
it really is better for the server to do the translation.

The problem is if the server doesn't (really) know the charset,
and tries to do the conversion anyway (as would happen, for
example, using the user's LC_CTYPE).  In this case, the conversion
loses data, and can not be reversed.  This means that the client
not only can't display a meaningful filename, but it can't even
open the file by sending back the garbage name (because the server
can't reverse the conversion.)
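The failure described above can be reproduced with a few lines of Python.
This is only a sketch: the helper name round_trips is made up, and it
assumes the server replaces undecodable bytes with U+FFFD rather than
rejecting the name outright.

```python
def round_trips(raw: bytes, guessed_charset: str) -> bool:
    """Translate raw to UTF-8 using a guessed charset, then reverse it.

    Returns True only if the reverse conversion recovers the original
    bytes, i.e. the client could still open the file by name.
    """
    # Server side: decode with the guess, replacing undecodable bytes.
    on_wire = raw.decode(guessed_charset, errors="replace").encode("utf-8")
    # Server side again, when the client sends the name back.
    recovered = on_wire.decode("utf-8").encode(guessed_charset, errors="replace")
    return recovered == raw


# A name stored as ISO-8859-1 on a server that guesses UTF-8:
stored = "Möller.txt".encode("iso-8859-1")   # b'M\xf6ller.txt'
print(round_trips(b"plain.txt", "utf-8"))    # pure ASCII survives
print(round_trips(stored, "utf-8"))          # the 0xF6 byte is lost
```

Once the 0xF6 byte has been replaced, nothing on the wire records what it
was, so neither side can reconstruct the name the filesystem expects.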

So, what I'd like is a way for the server to convert what it
can, and to give the client something that can be used to open
files even if it can't convert them.

So... how about the following--

1. An extension the server sends to advise of the charset
   it thinks is most likely in use (probably from the
   user's LC_CTYPE under unix; windows will probably just
   always send UTF-8).

   By the way, how does one get a charset out of the IANA
   registry from the user's NLS environment under most
   unixes?  Is it a simple thing to do?

2. An extension the client can use to control the translation
   on the server; something like translation-mask:

    SSH_TRANSLATE_KNOWN         0x00000001
               Translate files with a known charset.  If the
               translation fails, use the rawname instead.

    SSH_TRANSLATE_GUESSED       0x00000002
               Translate files if the server has a reasonable
               idea what the charset is, but doesn't know.

               (For example, using the user's LC_CTYPE.)

               If the translation fails (an invalid character
               for example) the server should use the rawname
               instead.

    SSH_TRANSLATE_INCLUDE_KNOWN_RAW    0x00000004
               Include the rawname as a part of the attrib
               if charset was known, and translation was successful.
               (If translation was unsuccessful, the rawname is
               sent as the name.)

               (New attribute field SSH_FILEXFER_ATTR_RAW_NAME.)

    SSH_TRANSLATE_INCLUDE_GUESSED_RAW    0x00000008
               Include the rawname as a part of the attrib
               if charset was guessed, and translation was successful.

    The initial mode is SSH_TRANSLATE_KNOWN|SSH_TRANSLATE_GUESSED.

3. Allow the server and the client to prepend a 0xFF to any
   filename to indicate it should be passed down to the system
   APIs without further processing.
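The translation mask in item 2 could be modelled roughly as follows. This
is a sketch under assumptions, not a wire format: the Translate class and
the wire_name helper are hypothetical, and INCLUDE_GUESSED_RAW is assumed
to occupy its own bit (0x8) so that it can be set independently of
INCLUDE_KNOWN_RAW.

```python
from enum import IntFlag
from typing import Optional, Tuple


class Translate(IntFlag):
    KNOWN = 0x00000001
    GUESSED = 0x00000002
    INCLUDE_KNOWN_RAW = 0x00000004
    INCLUDE_GUESSED_RAW = 0x00000008  # assumed value, distinct from KNOWN_RAW


INITIAL_MODE = Translate.KNOWN | Translate.GUESSED


def wire_name(raw: bytes, charset: Optional[str], known: bool,
              mask: Translate) -> Tuple[bytes, Optional[bytes]]:
    """Return (name, raw_name_attribute) for one directory entry.

    charset is the server's idea of the name's encoding; known says
    whether that is certain or merely a guess (e.g. from LC_CTYPE).
    """
    wanted = Translate.KNOWN if known else Translate.GUESSED
    if charset is None or not (mask & wanted):
        return raw, None                      # translation not requested
    try:
        name = raw.decode(charset).encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        return raw, None                      # failed: rawname is the name
    include = (Translate.INCLUDE_KNOWN_RAW if known
               else Translate.INCLUDE_GUESSED_RAW)
    return name, (raw if mask & include else None)
```

With the initial mode, a name that decodes cleanly is sent as UTF-8 and a
name that does not is sent untouched; setting INCLUDE_GUESSED_RAW adds the
original bytes alongside a successfully guessed translation.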

The thing this does is allow most clients to continue
to operate as they do under sftp v4 & 5, except that
if the translation fails, behavior is well defined
and there is a way to open the file, even if the
filename can not be displayed correctly.

If a user runs into trouble, they can set INCLUDE_GUESSED_RAW,
or turn off TRANSLATE_GUESSED.

Okay... after all that, I've just realized that the per-file
tag for rawmode is more complicated than I thought-- it has
to be per path component.  Rats.
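To see why the tag has to be per component, here is a hypothetical Python
sketch that applies the 0xFF prefix to each path component separately. The
helper names are made up, "/" is assumed as the separator, and the marker
works only because the byte 0xFF never occurs in well-formed UTF-8.

```python
RAW_MARKER = b"\xff"  # never appears in well-formed UTF-8


def encode_component(raw: bytes, charset: str) -> bytes:
    """Translate one component to UTF-8, or tag it as raw on failure."""
    try:
        return raw.decode(charset).encode("utf-8")
    except UnicodeDecodeError:
        return RAW_MARKER + raw


def decode_component(wire: bytes, charset: str) -> bytes:
    """Reverse encode_component: strip the tag or convert back."""
    if wire.startswith(RAW_MARKER):
        return wire[1:]            # pass through to the system APIs as-is
    return wire.decode("utf-8").encode(charset)


def encode_path(raw_path: bytes, charset: str) -> bytes:
    return b"/".join(encode_component(c, charset)
                     for c in raw_path.split(b"/"))


def decode_path(wire_path: bytes, charset: str) -> bytes:
    return b"/".join(decode_component(c, charset)
                     for c in wire_path.split(b"/"))


# A UTF-8 home directory containing one ISO-8859-1 file name:
path = (b"home/" + "jürgen".encode("utf-8")
        + b"/" + "Möller".encode("iso-8859-1"))
wire = encode_path(path, "utf-8")
print(wire.split(b"/"))
print(decode_path(wire, "utf-8") == path)
```

Only the last component carries the marker here; a single flag for the
whole path would force the directory name to be sent raw as well.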

I really wanted to be able to rely on the server as much
as possible and only punt the exceptions to the client
for processing... and the client wouldn't really process
them, because it would know it didn't know the charset
(cause if it was the one the server thought it was,
it would have worked on the server side.)  It would just
use them as opaque handles to get at the file contents--
or the user would have to manually intervene to display
the filename.

Argh... well, I'm going to go off and think now.

- Joseph
