IETF-SSH archive


Re: SFTP and unicode file names...



Joseph Galbraith <galb-list%vandyke.com@localhost> writes:
> Unix directories can contain files encoded in multiple different
> char-sets and the server has no way to tell what these multiple
> char-sets are and translate them to UTF-8.  Because the transformation
> is one way, once the server has mistranslated the filename, there is
> no way for the client to get back to the original data.
> 
> So, for these file-systems, the best possible thing to do is send the
> filename raw and let the client (with help from the user) decode it.

I support this. While it would be nice for all filenames to be in UTF-8
or reliably convertible to/from it, it is beyond the power of this
working group to cause this to happen, so this requirement simply won't
be implemented.

(We've pinched stuff from NFS before. Do they have any views on this
issue?)

> On the other hand, maximum possible interoperability between different
> language and regions is obtained through use of UTF-8 where available.
> 
> I haven't been able to come up with a solution I really like.
> 
> Here are some possibilities:
> 
> 1. Let the server say what it is going to use, UTF-8 or
>     'undefined-raw' at the beginning of the sftp session.
> 
>     pro: simple.  really simple.
>     con: Doesn't address cases where the server might be able to do
>          utf-8 for some file systems (ext-3 under fedora, I think is
>          one example) but not others.

It's true that this is rather limiting. But if your directory tree has
heterogeneous UTF-8-ness, don't you run into problems with having some
path components "raw" and others UTF-8? Given our approach of paths
being rather opaque things, I can't think of a way round this.

Therefore, I suspect that something like this simple approach may have
to be taken...

> 2. Allow the filename to be prefixed by a flag byte saying it is raw.
>     For example, the byte 0xFF is an invalid UTF-8 lead byte. If the
>     first byte of the filename is 0xFF, then the 0xFF is discarded,
>     and the rest of string is the 'raw, undefined' filename data.
> 
>     pro: It handles the real life complexity of being able to tell
>          sometimes, but not others.
>     con: It is a little more complex, and a bit icky.
> 
> 3. Give the filename structure.  Filenames are always specified in the
>     following structure:
> 
>     uint32 length of the structure
>     boolean utf-8
>     byte   filename[length-1]
> 
>     pro: This also handles being able to give UTF-8 sometimes, but not
>          all the time.
>     pro: This isn't icky.
>     con: This is a bit more complex.

In particular, backwards-compatible implementations now need two sets of
filename-handling code.
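
To make the cost concrete, here is a rough sketch of what option 3 might
look like on the wire. It assumes the usual SSH encodings (big-endian
uint32, one-byte boolean); the function names encode_name and decode_name
are made up for illustration and are not from any draft.

```python
import struct

def encode_name(name: bytes, is_utf8: bool) -> bytes:
    """Pack a filename as: uint32 length, boolean utf-8, filename bytes.

    Assumption: 'length' counts the boolean plus the filename, matching
    the filename[length-1] field in the proposed structure.
    """
    body = (b"\x01" if is_utf8 else b"\x00") + name
    return struct.pack(">I", len(body)) + body

def decode_name(buf: bytes):
    """Return (name, bytes_consumed).

    name is a str when the utf-8 flag is set, otherwise the raw bytes
    are handed back untouched for the client or user to interpret.
    """
    if len(buf) < 4:
        raise ValueError("truncated length field")
    (length,) = struct.unpack_from(">I", buf, 0)
    if length < 1 or len(buf) < 4 + length:
        raise ValueError("truncated filename structure")
    is_utf8 = buf[4] != 0
    raw = buf[5:4 + length]
    if is_utf8:
        return raw.decode("utf-8"), 4 + length
    return raw, 4 + length
```

An implementation that also speaks the older protocol versions would
need this alongside its existing plain-string filename path, which is
the duplication I mean.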

> 4. Use the high order bit of the length field to flag raw mode.  In
>     practice, no file name will ever be more than 2 gig long :-)  We
>     can safely borrow that bit for other purposes.
> 
>     pro: This also handles being able to give UTF-8 sometimes, but not
>          all the time.
>     con: Slightly icky, a little more complex.
> 
> Can anyone think of a better solution?
> 
> I think I prefer solution three, I could live with 2 or 4; I'd really
> rather not go with 1.

I think I slightly prefer 2 to 4, if the issue I mention above can be
got round.
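
For comparison, a minimal sketch of how a receiver might tell raw from
UTF-8 under options 2 and 4. The names decode_opt2, decode_opt4 and
RAW_FLAG are hypothetical; the only assumptions are the standard SSH
big-endian uint32 length prefix and the rules as Joseph describes them.

```python
import struct

RAW_FLAG = 0x80000000  # option 4: high-order bit of the length field

def decode_opt2(payload: bytes):
    """Option 2: payload is the string body, after the length prefix.

    A leading 0xFF (never valid in UTF-8) marks the rest as raw.
    """
    if payload[:1] == b"\xff":
        return ("raw", payload[1:])
    return ("utf-8", payload.decode("utf-8"))

def decode_opt4(buf: bytes):
    """Option 4: buf starts at the uint32 length field itself."""
    (word,) = struct.unpack_from(">I", buf, 0)
    is_raw = bool(word & RAW_FLAG)
    length = word & ~RAW_FLAG  # real length is the low 31 bits
    payload = buf[4:4 + length]
    if len(payload) != length:
        raise ValueError("truncated filename")
    if is_raw:
        return ("raw", payload)
    return ("utf-8", payload.decode("utf-8"))
```

Option 2 leaves the generic string parser alone and only touches the
filename layer, whereas option 4 changes how the length field itself is
read.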


