IETF-SSH archive


Re: SFTP and unicode file names...



Jeffrey Hutzelman <jhutz%cmu.edu@localhost> writes:

> On Monday, October 11, 2004 16:54:06 +0200 Niels Möller
> <nisse%lysator.liu.se@localhost> wrote:

> > I like Jeffrey Hutzelman's proposal: Have two modes of operation, and
> > let the client select which mode it prefers,
> >
> >  1. Server tells client the server's best guess as to what character
> >     set is used for filenames, and doesn't convert filenames in any
> >     way.
> >
> >  2. All filenames on the wire are utf-8. Server converts filenames to
> >     and from utf-8 on a best effort basis, according to its best
> >     guess of the actual charset. (What's the right thing to do if/when
> >     conversion fails, I don't know yet).
> 
> IMHO it is important that the server identify its character set up
> front, so that the client is able to use that information in deciding
> which mode to use.

Agree.

> Also, the server needs to be able to indicate that it is incapable
> of performing Unicode conversion. I'd like to be able to say that
> the server MUST be capable of performing the conversion, but I don't
> think that's realistic.

The server can be incapable of doing a proper conversion in at least
two ways. Either the server has no idea whatsoever which charset to
use to interpret its local filenames, or it knows somehow that it
should use a particular character set (say, euc-jis) but hasn't been
compiled with support for that conversion.

I don't think it is possible to forbid the first failure mode; we must
allow the server to say charset "unknown" or "" in this case.

We could forbid the second case, in effect forcing the server to say
"unknown" when it in fact knows the character set, but isn't able to
convert it. But I don't think it is a good idea to do that; it is
better to at least tell the client what the charset is.
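
As a rough sketch of the three states a server could report (charset
unknown, charset known but not convertible, charset known and
convertible), something like the following would do. The function name
announce_charset and the (name, can_convert) return shape are made up
for illustration and are not taken from any draft; Python's
codecs.lookup merely stands in for "was this conversion compiled in?".

```python
import codecs
from typing import Optional, Tuple


def announce_charset(guess: Optional[str]) -> Tuple[str, bool]:
    """Return (charset name to announce, whether conversion is possible).

    Hypothetical helper: 'guess' is the server's best guess at the local
    filename charset, or None/"" if it has no idea.
    """
    if not guess:
        # First failure mode: no idea which charset the filenames use.
        return "unknown", False
    try:
        codecs.lookup(guess)
    except LookupError:
        # Second failure mode: the charset is known but no converter is
        # available. Still tell the client the name.
        return guess, False
    return guess, True
```

A client could then pick the raw mode whenever the second element is
false, and still make use of the announced name if it can convert the
names itself.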

> I can think of only two cases in which the conversion from UTF-8 to
> the local character set can fail.

I don't think conversion utf-8 -> local is a big problem. Some
filenames can't be represented, and they must simply be treated as
non-existing files. And invalid utf-8 should be a protocol error
(under no circumstances should an implementation be allowed to say
that it's using utf-8, and then send invalid utf-8 filenames).
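
A minimal sketch of that rule for the utf-8 -> local direction,
assuming a server written in Python. ProtocolError and wire_to_local
are names invented for this example, and returning None is only one way
of expressing "treat it as a non-existing file".

```python
from typing import Optional


class ProtocolError(Exception):
    """The peer claimed utf-8 but sent bytes that are not valid utf-8."""


def wire_to_local(wire_name: bytes, local_charset: str) -> Optional[bytes]:
    """Convert a utf-8 filename from the wire to the local charset.

    Returns None when the name cannot be represented locally, so the
    caller can report it as a non-existing file.
    """
    try:
        text = wire_name.decode("utf-8")  # strict decoding by default
    except UnicodeDecodeError as exc:
        raise ProtocolError("invalid utf-8 in filename") from exc
    try:
        return text.encode(local_charset)
    except UnicodeEncodeError:
        # Not representable in the local charset: no such file.
        return None
```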

Conversion the other way is more difficult. The problem arises when
the server thinks it should convert from A to utf-8 (based, for
example, on examining the user's $LC_CTYPE), but in fact some file
names are stored using a different charset B, where B sequences are
invalid when interpreted as A-characters. The only example I can come
up with off the top of my head is A = utf-8 and B = latin1, but I'm
pretty sure there are some other examples where neither A nor B is
utf-8.
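
To make the A = utf-8, B = latin1 case concrete, here is a small Python
check. The helper decodes_as is made up for this illustration; the
filename is just an example.

```python
def decodes_as(name: bytes, charset: str) -> bool:
    """True if 'name' is a valid byte sequence in 'charset'."""
    try:
        name.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# A file name written by a latin1 application: "café" stored as one
# byte per character, with 0xE9 for the final letter.
stored_name = "caf\u00e9".encode("latin-1")

# The server assumes utf-8 (charset A). The lone 0xE9 byte starts a
# multi-byte sequence that never completes, so strict decoding fails.
valid_as_utf8 = decodes_as(stored_name, "utf-8")

# Interpreted as latin1 (charset B) the same bytes are fine.
valid_as_latin1 = decodes_as(stored_name, "latin-1")
```

Note that the reverse mismatch does not fail in the same way: every
byte sequence is valid latin1, so utf-8 names read as latin1 convert
"successfully" into the wrong characters.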

I could think of some ways to deal with this: (i) pretend the files
don't exist, or (ii) replace unexpected characters or character
sequences with uniquely chosen private use characters. I think (i)
will be a very confusing failure mode for users, and (ii) seems quite
ugly.
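
For what it's worth, option (ii) could look roughly like the sketch
below. The mapping of each unconvertible byte to U+F700 plus the byte
value is an arbitrary choice made for this example, as are the names
PUA_BASE, "pua-escape" and local_to_wire. It also shows part of the
ugliness: a file whose real name already contains one of these private
use characters would be indistinguishable from an escaped one.

```python
import codecs

# Hypothetical mapping: undecodable byte 0xNN becomes U+F7NN, which
# lies inside the Basic Multilingual Plane private use area.
PUA_BASE = 0xF700


def _to_private_use(exc):
    """Codec error handler replacing bad bytes by private use chars."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(chr(PUA_BASE + b) for b in bad)
    return replacement, exc.end  # resume decoding after the bad bytes


codecs.register_error("pua-escape", _to_private_use)


def local_to_wire(raw_name: bytes, assumed_charset: str) -> bytes:
    """Convert a local filename to utf-8, escaping what can't be decoded."""
    text = raw_name.decode(assumed_charset, errors="pua-escape")
    return text.encode("utf-8")
```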

Telling the client "sorry, here's some strange filename I can't
convert to utf-8, you can try again with raw filenames, if you like"
seems simpler to understand. Note that this can happen even (or
perhaps it's even more likely to happen) when the local character set
is believed to be utf-8, and some file names violate this assumption.
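
A sketch of that behaviour on the server side, assuming Python. The
NameResult record and convert_for_listing are hypothetical names, and
how the failure would actually be signalled on the wire is left open
here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameResult:
    wire_name: Optional[bytes]  # utf-8 name, or None if conversion failed
    conversion_failed: bool     # True: client may retry with raw names


def convert_for_listing(raw_name: bytes, assumed_charset: str) -> NameResult:
    """Convert one local filename to utf-8, or report that it can't be done."""
    try:
        text = raw_name.decode(assumed_charset)  # strict decoding
    except UnicodeDecodeError:
        # Neither hide the file nor invent characters: say so explicitly.
        return NameResult(wire_name=None, conversion_failed=True)
    return NameResult(wire_name=text.encode("utf-8"), conversion_failed=False)
```

With assumed_charset set to "utf-8", a stray latin1 name yields
conversion_failed=True rather than a silently dropped or rewritten
entry.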

Regards,
/Niels


