Re: SFTP and unicode file names...

To: Niels Möller <nisse%lysator.liu.se@localhost>
Subject: Re: SFTP and unicode file names...
From: Jeffrey Hutzelman <jhutz%cmu.edu@localhost>
Date: Mon, 11 Oct 2004 17:01:55 -0400

On Monday, October 11, 2004 22:23:17 +0200 Niels Möller <nisse%lysator.liu.se@localhost> wrote:

Also, the server needs to be able to indicate that it is incapable
of performing Unicode conversion. I'd like to be able to say that
the server MUST be capable of performing the conversion, but I don't
think that's realistic.


The server can be incapable of doing a proper conversion in at least
two ways: Either the server has no idea whatsoever which charset to
use to interpret its local filenames. Or it knows somehow that it
should use a particular character set (say, euc-jis), but it hasn't been
compiled with support for that conversion.

I don't think it is possible to forbid the first failure mode; we must
allow the server to say charset "unknown" or "" in this case.

We could forbid the second case, in effect forcing the server to say
"unknown" when it in fact knows the character set, but isn't able to
convert it. But I don't think it is a good idea to do that; it is
better to at least tell the client what the charset is.


Agree.

I can think of only two cases in which the conversion from UTF-8 to
the local character set can fail.


I don't think conversion utf-8 -> local is a big problem. Some
filenames can't be represented, and they must simply be treated as
non-existing files. And invalid utf-8 should be a protocol error
(under no circumstances should an implementation be allowed to say
that it's using utf-8, and then send invalid utf-8 filenames).


Agree.
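
To make the two outcomes concrete, here is a minimal sketch of the
UTF-8 -> local direction as described above. The names ProtocolError
and utf8_name_to_local are hypothetical and are not taken from any
draft or implementation; the sketch only assumes that a malformed name
is rejected outright, while a well-formed name that the local charset
cannot represent is reported as "no such file".

```python
from typing import Optional


class ProtocolError(Exception):
    """Hypothetical error: the peer sent a filename that is not valid UTF-8."""


def utf8_name_to_local(name: bytes, local_charset: str) -> Optional[bytes]:
    """Convert a UTF-8 filename from the wire into the local charset.

    Returns None when the name is valid UTF-8 but has no representation
    in local_charset; the caller would treat that as a non-existent file.
    """
    try:
        # Strict decoding: malformed sequences are rejected, not repaired.
        text = name.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ProtocolError("filename is not valid UTF-8") from exc
    try:
        return text.encode(local_charset)
    except UnicodeEncodeError:
        # Well-formed name, but not representable locally.
        return None
```

A server built this way never has to guess: a bad byte sequence is the
sender's fault, and an unrepresentable name simply cannot match any
local file.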

Telling the client "sorry, here's some strange filename I can't
convert to utf-8, you can try again with raw filenames, if you like"
seems simpler to understand. Note that this can happen even (or
perhaps it's even more likely to happen) when the local character set
is believed to be utf-8, and some file names violate this assumption.

I think it is more likely that the local character set is believed to
be UTF-8 and actually is not, than that we will believe the local
character set to be something "flat" like iso-8859-1 and then find
characters which are invalid in that set.

In any case, I agree the behaviour you describe seems to be the most
sane thing to do in this situation.

FWIW, it is worth noting that it is possible to tell with fairly high
probability whether a particular string is UTF-8 or some flat 8-bit
character set, because valid UTF-8 is fairly well structured and the
sequences which represent non-7-bit characters are fairly unlikely to
occur in flat character sets. So when conversion to UTF-8 is being
done by the server, it is possible for an implementation to apply a
heuristic to allow support of filesystems containing both UTF-8 and an
8-bit character set, even in separate components of the same pathname.
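
A rough sketch of such a heuristic, applied per pathname component. The
function names and the choice of iso-8859-1 as the fallback are
assumptions for illustration only; a real server would pick the
fallback from its own configuration. A component that decodes as strict
UTF-8 is passed through unchanged, and anything else is assumed to be
in the flat 8-bit set and is converted.

```python
def component_to_utf8(component: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return one pathname component as UTF-8 bytes.

    Heuristic: bytes that form valid UTF-8 are taken to be UTF-8 already;
    otherwise they are interpreted in the (assumed) fallback 8-bit charset.
    """
    try:
        component.decode("utf-8")  # strict check; result is discarded
        return component
    except UnicodeDecodeError:
        return component.decode(fallback).encode("utf-8")


def path_to_utf8(raw_path: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Apply the heuristic to each '/'-separated component independently."""
    return b"/".join(
        component_to_utf8(part, fallback) for part in raw_path.split(b"/")
    )
```

Pure 7-bit names are valid UTF-8 and pass through untouched, so only
components with high-bit bytes are ever subject to the guess. The guess
can still be wrong for an 8-bit name that happens to form valid UTF-8
sequences, which is why this is a heuristic and not a guarantee.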


-- Jeff
