IETF-SSH archive


Re: SFTP and unicode file names...



Jeffrey Hutzelman <jhutz%cmu.edu@localhost> writes:

> I think it is more likely that the local character set is believed to
> be UTF-8 and actually is not, than that we will believe the local
> character set to be something "flat" like iso-8859-1 and then find
> characters which are invalid in that set.

I agree, with the reservation that I don't have much experience with
character sets that use eight-bit characters and lots of different
shift states.

> FWIW, it is worth noting that it is possible to tell with fairly high
> probability whether a particular string is UTF-8 or some flat 8-bit
> character set,

That may be correct (at least that's an advertised property of utf-8),
but as far as I can see, that doesn't help much unless you have some
clue about *which* flat 8-bit character set was used. If you know that
a file system contains mixed latin-1 and utf-8 filenames (and no other
character sets), you can probably make the heuristics work most of the
time, but if you have a filesystem with mixed utf-8, iso-8859-1,
iso-8859-5 and koi8, it gets rather difficult.
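
A minimal sketch of that ambiguity, in Python. The filename and the
list of candidate character sets are made up for illustration; the
point is only that a strict utf-8 decode can reject the name, while
every flat 8-bit decoder accepts it and produces different text:

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed utf-8 (strict decode succeeds)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical filename, stored on disk in koi8-r.
name = "файл".encode("koi8_r")

# The utf-8 test works: these bytes are not valid utf-8.
is_utf8 = looks_like_utf8(name)

# But every flat 8-bit decoder accepts the same bytes, and each one
# yields a different string. Nothing in the bytes says which is right.
candidates = {
    enc: name.decode(enc)
    for enc in ("iso-8859-1", "iso-8859-5", "koi8_r")
}

for enc, text in candidates.items():
    print(enc, repr(text))
```

So the check separates utf-8 from "something else", but choosing among
the flat character sets still needs outside knowledge.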

I don't think heuristic decoding is a good idea. One problem is that
you really need to be able to convert both ways:

Say you have a directory containing filenames in utf-8 and iso-8859-1.
The client asks for a directory listing, and you give it a utf-8
listing, using some working heuristic for converting the latin-1 names
to utf-8 on the fly. Next, the user selects one file using the
client's ui, and the client tries to open it. Now the server has to
figure out if the filename to be opened should be converted to latin-1
or not (in the worst case, there exist files with both the utf-8 name
and the corresponding latin-1 name). I think this gets ugly.
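
The worst case can be sketched like this, in Python. The directory
contents and the helper names (heuristic_decode, raw_names_for) are
hypothetical, and the heuristic is simply "try utf-8, else assume
latin-1":

```python
def heuristic_decode(raw: bytes) -> str:
    """Guess: valid utf-8 is utf-8, anything else is treated as latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def raw_names_for(name: str, on_disk: list) -> list:
    """All on-disk byte names that a listed name could refer to."""
    guesses = {name.encode("utf-8")}
    try:
        guesses.add(name.encode("iso-8859-1"))
    except UnicodeEncodeError:
        pass  # name has characters outside latin-1
    return [raw for raw in on_disk if raw in guesses]


# Hypothetical directory: the same name stored once in utf-8 and once
# in latin-1.
on_disk = [b"caf\xc3\xa9.txt", b"caf\xe9.txt"]

# What the client sees: two identical entries.
listing = [heuristic_decode(raw) for raw in on_disk]

# What the server faces when the client opens one of them.
matches = raw_names_for(listing[0], on_disk)
print(listing, matches)
```

Both entries reach the client as the same string, and the open request
matches two different files, so the server cannot tell which one the
user selected.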

The hack I mentioned, to replace the sequences that are invalid utf-8
with uniquely chosen private use characters, has a better chance of
surviving the roundtrip to the client, but I'm sure that has other
problems.
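
One way such a mapping could look, as a rough Python sketch and not a
proposal for the draft. The private use range U+F780..U+F7FF and the
function names are arbitrary choices for illustration. It leans on
Python's "surrogateescape" error handler to find the invalid bytes,
then moves them into the private use area so the result is valid
utf-8 on the wire:

```python
PUA_BASE = 0xF700  # assumed: invalid byte 0xNN is sent as U+F7NN


def to_wire(raw: bytes) -> str:
    """Decode a filename, escaping each invalid byte as a private use char."""
    # surrogateescape turns an undecodable byte 0xNN into U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(PUA_BASE + ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


def from_wire(name: str) -> bytes:
    """Inverse of to_wire: restore the escaped bytes exactly."""
    text = "".join(
        chr(0xDC00 + ord(c) - PUA_BASE)
        if PUA_BASE + 0x80 <= ord(c) <= PUA_BASE + 0xFF
        else c
        for c in name
    )
    return text.encode("utf-8", errors="surrogateescape")


latin1_name = b"caf\xe9.txt"      # not valid utf-8
utf8_name = b"caf\xc3\xa9.txt"    # valid utf-8

print(repr(to_wire(latin1_name)), repr(to_wire(utf8_name)))
```

The two names now stay distinct in the listing and both map back to
their original bytes. One of the other problems shows up right away: a
file whose valid utf-8 name already contains a character from the
chosen private use range is indistinguishable from an escaped byte on
the way back.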

Regards,
/Niels


