IETF-SSH archive


Re: SFTP and unicode file names...





On Monday, October 11, 2004 16:54:06 +0200 Niels Möller <nisse%lysator.liu.se@localhost> wrote:

Joseph Galbraith <galb-list%vandyke.com@localhost> writes:

I would definitely prefer to see the server do the translation
when it can... that's why we went to UTF-8 in the first place.

I think there's one more use case that you need to consider, which I
expect is quite common:

The remote filesystem using the foo charset. The local system using
the same foo charset. Why do I think this is common? Because on both
sides, it's the same user's files, and the user is likely to use his
or her favourite charset (iso-8859-1, utf-8, euc-jis, whatever) on
most or all systems where he or she has an account.

What you call "raw mode" will work fine in this case, no matter if the
sftp implementation on server or client side knows about the foo
charset.

I like Jeffrey Hutzelman's proposal: Have two modes of operation, and
let the client select which mode it prefers:

 1. Server tells client the server's best guess as to what character
    set is used for filenames, and doesn't convert filenames in any
    way.

 2. All filenames on the wire are utf-8. Server converts filenames to
    and from utf-8 on a best-effort basis, according to its best
    guess of the actual charset. (What's the right thing to do if/when
    conversion fails, I don't know yet).
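
The two modes above can be sketched roughly as follows. This is only an illustration in Python: the function name `name_to_wire` and the mode labels "raw" and "utf8" are made up for this example and are not part of any SFTP draft. What to do when conversion fails is left open, as in the text, so the sketch simply lets the decoding error propagate.

```python
def name_to_wire(raw_name: bytes, mode: str, server_charset: str) -> bytes:
    """Return the bytes a server would send for one on-disk filename.

    Hypothetical helper: 'mode' is whatever the client selected, and
    'server_charset' is the server's best guess at the filename charset.
    """
    if mode == "raw":
        # Mode 1: no conversion at all; the client has been told
        # server_charset separately and interprets the bytes itself.
        return raw_name
    if mode == "utf8":
        # Mode 2: best-effort conversion to UTF-8 using the server's guess.
        # A wrong guess may raise UnicodeDecodeError here.
        return raw_name.decode(server_charset).encode("utf-8")
    raise ValueError(f"unknown mode: {mode!r}")
```

With an iso-8859-1 filename, mode 1 passes the bytes through untouched, while mode 2 re-encodes them as UTF-8.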

IMHO it is important that the server identify its character set up front, so that the client is able to use that information in deciding which mode to use. Also, the server needs to be able to indicate that it is incapable of performing Unicode conversion. I'd like to be able to say that the server MUST be capable of performing the conversion, but I don't think that's realistic.
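
One way a client might use that up-front information is sketched below. The policy itself is an assumption made for illustration (prefer raw mode when both sides share a charset, otherwise ask for UTF-8 if the server can convert), and `choose_mode` and its parameters are hypothetical names, not protocol fields.

```python
import codecs
from typing import Optional


def choose_mode(server_charset: Optional[str],
                server_can_convert: bool,
                local_charset: str) -> str:
    """Pick "raw" or "utf8" from what the server announced (illustrative policy)."""
    same_charset = False
    if server_charset is not None:
        try:
            # codecs.lookup() canonicalises aliases such as "latin-1"
            # and "ISO-8859-1" to the same codec name.
            same_charset = (codecs.lookup(server_charset).name
                            == codecs.lookup(local_charset).name)
        except LookupError:
            # Charset name unknown to this client: treat as "different".
            same_charset = False
    if same_charset:
        return "raw"      # same charset on both ends, nothing to convert
    if server_can_convert:
        return "utf8"     # let the server translate to and from UTF-8
    return "raw"          # server cannot convert; raw is all that is left
```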

If the server is capable of doing conversion, then conversion from the local character set to Unicode should never fail. If it does, something is very wrong.

I can think of only two cases in which the conversion from UTF-8 to the local character set can fail. The first is when the input is somehow invalid (bad UTF-8, illegal Unicode code points, etc.). We could handle this in a number of ways, up to and including terminating the connection. :-)
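
Detecting the first case is cheap. As a minimal sketch in Python (the helper name `decode_wire_name` is hypothetical), a strict UTF-8 decode rejects malformed sequences, overlong encodings and encoded surrogates; what the server then does with a rejected name is a separate policy decision.

```python
from typing import Optional


def decode_wire_name(wire: bytes) -> Optional[str]:
    """Return the decoded filename, or None if the bytes are not valid UTF-8."""
    try:
        # errors="strict" makes the decoder raise instead of substituting.
        return wire.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
```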

The second case is when the input is valid, but contains characters not present in the local character set. The server's mapping should be good enough to handle cases where the same local character can be represented in multiple ways in Unicode (for example, there are at least two ways to write Å in Unicode, but only one way in iso-8859-1). For characters which are genuinely not in the local character set, the server can return an appropriate error depending on context. For example, trying to open a file whose name contains an untranslatable character will always fail with something ENOENT-like. Trying to create such a file should fail with an illegal-filename error.
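
A rough sketch of this second case, again in Python. It assumes that composing the name with Unicode normalization form NFC is an acceptable way to fold the different spellings of Å together, and it uses EINVAL to stand in for "illegal filename"; both choices, and the name `to_local_name`, are illustrative assumptions rather than anything the protocol specifies.

```python
import errno
import unicodedata


def to_local_name(name: str, local_charset: str, creating: bool) -> bytes:
    """Convert a decoded filename to the server's local charset.

    Raises a context-dependent OSError if a character cannot be represented.
    """
    # NFC turns "A" + COMBINING RING ABOVE (and ANGSTROM SIGN) into the
    # single precomposed character that iso-8859-1 can encode.
    composed = unicodedata.normalize("NFC", name)
    try:
        return composed.encode(local_charset)
    except UnicodeEncodeError:
        if creating:
            # Creating a file with an unrepresentable name: illegal filename.
            raise OSError(errno.EINVAL, "illegal filename") from None
        # Opening one: such a file cannot exist locally, so ENOENT-like.
        raise FileNotFoundError(errno.ENOENT, "no such file") from None
```

Both "A" followed by U+030A and the precomposed U+00C5 end up as the single byte 0xC5 in iso-8859-1, while a name containing, say, a CJK character fails with the error appropriate to the operation.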



-- Jeff


