IETF-SSH archive


Re: SFTP and unicode file names...





On Monday, October 11, 2004 22:23:17 +0200 Niels Möller <nisse%lysator.liu.se@localhost> wrote:

Also, the server needs to be able to indicate that it is incapable
of performing Unicode conversion. I'd like to be able to say that
the server MUST be capable of performing the conversion, but I don't
think that's realistic.

The server can be incapable of doing a proper conversion in at least
two ways: either the server has no idea whatsoever which charset to
use to interpret its local filenames, or it knows somehow that it
should use a particular character set (say, euc-jis), but it hasn't been
compiled with support for that conversion.

I don't think it is possible to forbid the first failure mode; we must
allow the server to say charset "unknown" or "" in this case.

We could forbid the second case, in effect forcing the server to say
"unknown" when it in fact knows the character set, but isn't able to
convert it. But I don't think it is a good idea to do that; it is
better to at least tell the client what the charset is.

Agree.


I can think of only two cases in which the conversion from UTF-8 to
the local character set can fail.

I don't think conversion utf-8 -> local is a big problem. Some
filenames can't be represented, and they must simply be treated as
non-existing files. And invalid utf-8 should be a protocol error
(under no circumstances should an implementation be allowed to say
that it's using utf-8, and then send invalid utf-8 filenames).

Agree.
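
To make the two outcomes above concrete, here is a minimal sketch in Python. It is not taken from any SFTP draft or implementation: the names ProtocolError and utf8_to_local are hypothetical, and the choice to signal "non-existing file" by returning None is an assumption made only for illustration.

```python
from typing import Optional


class ProtocolError(Exception):
    """Hypothetical error type: the peer violated the protocol."""


def utf8_to_local(raw: bytes, local_charset: str) -> Optional[bytes]:
    """Convert a filename received as UTF-8 into the local charset.

    Returns None when the name is valid UTF-8 but cannot be represented
    locally; the caller would then treat it as a non-existing file.
    """
    try:
        # Strict decoding: any malformed sequence is rejected outright.
        name = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # The peer claimed UTF-8 and sent something else.
        raise ProtocolError("invalid UTF-8 in filename") from exc
    try:
        return name.encode(local_charset)
    except UnicodeEncodeError:
        # Valid name, but no representation in the local charset.
        return None
```

For example, with local_charset set to "iso-8859-1", the UTF-8 bytes for "é" convert to the single byte 0xE9, a Japanese name yields None, and the lone byte 0xFF raises ProtocolError.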


Telling the client "sorry, here's some strange filename I can't
convert to utf-8, you can try again with raw filenames, if you like"
seems simpler to understand. Note that this can happen even (or
perhaps it's even more likely to happen) when the local character set
is believed to be utf-8, and some file names violate this assumption.

I think it is more likely that the local character set is believed to be UTF-8 and actually is not, than that we will believe the local character set to be something "flat" like iso-8859-1 and then find characters which are invalid in that set.

In any case, I agree the behaviour you describe seems to be the most sane thing to do in this situation.
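
A rough sketch of that server-side behaviour, in Python, might look like the following. The function name local_to_utf8 is hypothetical, and how the "cannot convert" condition is reported on the wire is deliberately left out, since nothing in this thread settles it.

```python
from typing import Optional


def local_to_utf8(raw: bytes, local_charset: str) -> Optional[bytes]:
    """Try to convert a local filename to UTF-8.

    Returns None when the bytes are not valid in the charset the server
    believes it is using. The caller would then tell the client that the
    name cannot be converted, leaving it free to retry with raw names.
    """
    try:
        return raw.decode(local_charset).encode("utf-8")
    except UnicodeDecodeError:
        return None
```

Note that with a flat charset such as iso-8859-1 every byte sequence decodes, so the failure path is only reached for structured encodings such as UTF-8. For instance, the bytes 63 61 66 E9 ("caf" followed by 0xE9) convert cleanly from iso-8859-1 but give None when the local charset is believed to be UTF-8.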

FWIW, it is possible to tell with fairly high probability whether a particular string is UTF-8 or some flat 8-bit character set, because valid UTF-8 is fairly well structured and the multi-byte sequences which represent non-7-bit characters are fairly unlikely to occur in flat character sets. So when conversion to UTF-8 is being done by the server, an implementation can apply a heuristic to support filesystems containing both UTF-8 and an 8-bit character set, even in separate components of the same pathname.
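
One way such a heuristic could look, as a Python sketch: each pathname component that is already valid UTF-8 is passed through unchanged, and anything else is assumed to be in a fallback 8-bit charset. The function names, the "/" separator, and the iso-8859-1 default are all assumptions for illustration, and the guess can be wrong for 8-bit names that happen to form valid UTF-8.

```python
def component_to_utf8(component: bytes,
                      fallback_charset: str = "iso-8859-1") -> bytes:
    """Return one pathname component as UTF-8, guessing its encoding."""
    try:
        # Valid UTF-8 (including plain ASCII) is assumed to be UTF-8.
        component.decode("utf-8")
        return component
    except UnicodeDecodeError:
        # Otherwise assume the flat 8-bit charset and convert.
        return component.decode(fallback_charset).encode("utf-8")


def path_to_utf8(path: bytes, fallback_charset: str = "iso-8859-1") -> bytes:
    """Apply the heuristic to each '/'-separated component independently."""
    return b"/".join(
        component_to_utf8(part, fallback_charset)
        for part in path.split(b"/")
    )
```

With this sketch, a path whose first directory is stored as UTF-8 and whose file name is stored as iso-8859-1 comes out as consistent UTF-8, because each component is examined on its own.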

-- Jeff


