IETF-SSH archive


SFTP and unicode file names...



Okay, I've been forced to face up to the unfortunate truth:

Unix directories can contain files encoded in multiple
different char-sets and the server has no way to tell
what these multiple char-sets are and translate them
to UTF-8.  Because the transformation is one way, once
the server has mistranslated the filename, there
is no way for the client to get back to the original
data.
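
To make the one-way problem concrete, here is a minimal
sketch. It assumes two hypothetical file names stored on
disk in ISO-8859-1, and a server that wrongly treats every
name as UTF-8 and substitutes U+FFFD for bytes it cannot
decode. That is only one of several ways a server might
mistranslate; the names and the substitution policy are
illustrative assumptions, not anything from the draft.

```python
# Two hypothetical on-disk names, both encoded in ISO-8859-1.
name_a = "caf\u00e9.txt".encode("latin-1")   # b'caf\xe9.txt'
name_b = "caf\u00e8.txt".encode("latin-1")   # b'caf\xe8.txt'

def mistranslate(raw: bytes) -> bytes:
    """Assumed server behaviour: treat raw bytes as UTF-8 and
    replace anything undecodable with U+FFFD before sending."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")

sent_a = mistranslate(name_a)
sent_b = mistranslate(name_b)

# The two distinct names collapse to the same bytes on the wire,
# so the client cannot recover which original was meant.
print(name_a != name_b)    # the on-disk names differ
print(sent_a == sent_b)    # the transmitted names do not
```

Once the names have collapsed, no amount of cleverness on
the client side can tell the two files apart.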

So, for these file-systems, the best possible thing
to do is send the filename raw and let the client
(with help from the user) decode it.

On the other hand, maximum possible interoperability
between different languages and regions is obtained
through use of UTF-8 where available.

I haven't been able to come up with a solution I
really like.

Here are some possibilities:

1. Let the server say what it is going to use,
   UTF-8 or 'undefined-raw' at the beginning
   of the sftp session.

   pro: simple.  really simple.
   con: Doesn't address cases where the server
        might be able to do UTF-8 for some file
        systems (ext3 under Fedora is, I think,
        one example) but not others.

2. Allow the filename to be prefixed by a flag
   byte saying it is raw.  For example, the
   byte 0xFF is an invalid UTF-8 lead byte.
   If the first byte of the filename is 0xFF,
   then the 0xFF is discarded, and the rest
   of the string is the 'raw, undefined' filename
   data.

   pro: It handles the real life complexity of
        being able to tell sometimes, but not
        others.
   con: It is a little more complex, and a bit
        icky.

3. Give the filename structure.  Filenames are
   always specified in the following structure:

   uint32 length of the structure
   boolean utf-8
   byte   filename[length-1]

   pro: This also handles being able to give
        UTF-8 sometimes, but not all the time.
   pro: This isn't icky.
   con: This is a bit more complex.

4. Use the high order bit of the length field
   to flag raw mode.  In practice, no file name
   will ever be more than 2 gig long :-)  We
   can safely borrow that bit for other purposes.

   pro: This also handles being able to give
        UTF-8 sometimes, but not all the time.
   con: Slightly icky, a little more complex.
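
For comparison, here is a rough sketch of how options 2, 3
and 4 might look on the wire. It assumes the usual SSH
conventions (big-endian uint32 length, one-byte boolean);
the function names and constants are hypothetical and
exist only for this illustration. Option 1 is a per-session
setting, so it has no per-filename encoding to show.

```python
import struct

RAW_FLAG_BYTE = b"\xff"        # option 2: invalid UTF-8 lead byte
RAW_LENGTH_BIT = 0x80000000    # option 4: high bit of the length

# Option 2: ordinary string, raw names prefixed with 0xFF.
def encode_opt2(name: bytes, is_utf8: bool) -> bytes:
    payload = name if is_utf8 else RAW_FLAG_BYTE + name
    return struct.pack(">I", len(payload)) + payload

def decode_opt2(buf: bytes):
    (length,) = struct.unpack_from(">I", buf, 0)
    payload = buf[4:4 + length]
    if payload[:1] == RAW_FLAG_BYTE:
        return payload[1:], False      # discard the flag byte
    return payload, True

# Option 3: uint32 length, boolean utf-8, byte filename[length-1].
def encode_opt3(name: bytes, is_utf8: bool) -> bytes:
    return struct.pack(">IB", len(name) + 1, int(is_utf8)) + name

def decode_opt3(buf: bytes):
    length, flag = struct.unpack_from(">IB", buf, 0)
    return buf[5:4 + length], flag != 0

# Option 4: high-order bit of the length field flags raw mode.
def encode_opt4(name: bytes, is_utf8: bool) -> bytes:
    field = len(name) | (0 if is_utf8 else RAW_LENGTH_BIT)
    return struct.pack(">I", field) + name

def decode_opt4(buf: bytes):
    (field,) = struct.unpack_from(">I", buf, 0)
    length = field & ~RAW_LENGTH_BIT
    return buf[4:4 + length], (field & RAW_LENGTH_BIT) == 0

if __name__ == "__main__":
    raw = b"caf\xe9.txt"               # hypothetical non-UTF-8 name
    for enc, dec in ((encode_opt2, decode_opt2),
                     (encode_opt3, decode_opt3),
                     (encode_opt4, decode_opt4)):
        wire = enc(raw, False)
        print(enc.__name__, wire.hex(), dec(wire))
```

In this sketch all three carry the same one bit of extra
information per filename; they differ only in where the
bit lives and in what a receiver that does not know about
the flag would make of the data.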

Can anyone think of a better solution?

I think I prefer solution 3; I could live
with 2 or 4, but I'd really rather not go with 1.

What do others think?

Thanks,

Joseph


