IETF-SSH archive
SFTP and unicode file names...
Okay, I've been forced to face up to the unfortunate truth:
Unix directories can contain files encoded in multiple
different char-sets and the server has no way to tell
what these multiple char-sets are and translate them
to UTF-8. Because the transformation is one way, once
the server has mistranslated the filename, there
is no way for the client to get back to the original
data.
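
To make the one-way problem concrete, here is a small Python
sketch. The filename and the choice of ISO-8859-1 are made up
for illustration, and the "server" is assumed to substitute
U+FFFD for bytes it cannot decode; a real server might do
something different, but any guess has the same problem.

```python
# A filename stored on disk in ISO-8859-1 (Latin-1), not UTF-8.
raw_name = "café.txt".encode("latin-1")    # b'caf\xe9.txt'

# A different on-disk name that differs only in the non-ASCII byte.
other_name = b"caf\xea.txt"

# Hypothetical server behaviour: assume UTF-8 and replace anything
# that does not decode with U+FFFD (the replacement character).
sent = raw_name.decode("utf-8", errors="replace")
sent_other = other_name.decode("utf-8", errors="replace")

# The client can re-encode what it received, but the original byte
# is gone, and two distinct files now have the same name.
round_trip = sent.encode("utf-8")
lost = round_trip != raw_name
collided = sent == sent_other
```

Both `lost` and `collided` come out true, so the client cannot
open either file by the name it was given.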
So, for these file-systems, the best possible thing
to do is send the filename raw and let the client
(with help from the user) decode it.
On the other hand, maximum possible interoperability
between different languages and regions is obtained
through use of UTF-8 where available.
I haven't been able to come up with a solution I
really like.
Here are some possibilities:
1. Let the server say what it is going to use,
UTF-8 or 'undefined-raw' at the beginning
of the sftp session.
pro: simple. really simple.
con: Doesn't address cases where the server
might be able to do UTF-8 for some file
systems (ext3 under Fedora is, I think,
one example) but not others.
2. Allow the filename to be prefixed by a flag
byte saying it is raw. For example, the
byte 0xFF is an invalid UTF-8 lead byte.
If the first byte of the filename is 0xFF,
then the 0xFF is discarded, and the rest
of the string is the 'raw, undefined' filename
data.
pro: It handles the real life complexity of
being able to tell sometimes, but not
others.
con: It is a little more complex, and a bit
icky.
3. Give the filename structure. Filenames are
always specified in the following structure:
uint32 length of the structure
boolean utf-8
byte filename[length-1]
pro: This also handles being able to give
UTF-8 sometimes, but not all the time.
pro: This isn't icky.
con: This is a bit more complex.
4. Use the high order bit of the length field
to flag raw mode. In practice, no file name
will ever be more than 2 gig long :-) We
can safely borrow that bit for other purposes.
pro: This also handles being able to give
UTF-8 sometimes, but not all the time.
con: Slightly icky, a little more complex.
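
For comparison, here is a rough Python sketch of how options
2, 3 and 4 might look on the wire. It assumes the usual SSH
conventions (big-endian uint32 lengths, a one-byte boolean)
and that the server already knows whether a given name is
UTF-8. The function names are invented for this example and
are not part of any draft.

```python
import struct

RAW_FLAG = b"\xff"        # option 2: 0xFF never occurs in valid UTF-8
RAW_BIT = 0x80000000      # option 4: high-order bit of the length field


def encode_opt2(name: bytes, is_utf8: bool) -> bytes:
    """Option 2: ordinary string, prefixed with 0xFF when raw."""
    payload = name if is_utf8 else RAW_FLAG + name
    return struct.pack(">I", len(payload)) + payload


def decode_opt2(wire: bytes):
    (length,) = struct.unpack_from(">I", wire)
    payload = wire[4:4 + length]
    if payload[:1] == RAW_FLAG:
        return payload[1:], False     # discard the flag byte
    return payload, True


def encode_opt3(name: bytes, is_utf8: bool) -> bytes:
    """Option 3: uint32 length, boolean utf-8, filename[length-1]."""
    return struct.pack(">IB", len(name) + 1, 1 if is_utf8 else 0) + name


def decode_opt3(wire: bytes):
    length, flag = struct.unpack_from(">IB", wire)
    return wire[5:4 + length], flag != 0


def encode_opt4(name: bytes, is_utf8: bool) -> bytes:
    """Option 4: high-order bit of the length set when raw."""
    field = len(name) | (0 if is_utf8 else RAW_BIT)
    return struct.pack(">I", field) + name


def decode_opt4(wire: bytes):
    (field,) = struct.unpack_from(">I", wire)
    length = field & ~RAW_BIT & 0xFFFFFFFF
    return wire[4:4 + length], not (field & RAW_BIT)
```

In this sketch, options 2 and 3 each cost one extra byte for
a raw name, and option 3 costs one for every name; option 4
costs nothing, but an implementation that does not know about
the flag would read the length as larger than 2 gig.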
Can anyone think of a better solution?
I think I prefer solution 3. I could live
with 2 or 4; I'd really rather not go with 1.
What do others think?
Thanks,
Joseph