IETF-SSH archive


Re: SFTP and unicode file names...



> Okay, I've been forced to face up to the unfortunate truth:
>
> Unix directories can contain files encoded in multiple
> different char-sets and the server has no way to tell
> what these multiple char-sets are and translate them
> to UTF-8.  Because the transformation is one way, once
> the server has mistranslated the filename, there
> is no way for the client to get back to the original
> data.
>
> So, for these file-systems, the best possible thing
> to do is send the filename raw and let the client
> (with help from the user) decode it.
>
> On the other hand, maximum possible interoperability
> between different languages and regions is obtained
> through use of UTF-8 where available.

A user can use cat/type/ls/dir <FILENAME> to access a local file and
should be able to do the same with SFTP.
For example, "echo 'get <FILENAME>' | sftp localhost" should get the same
file.
In all cases <FILENAME> should be the same, encoded in the user's
charmap (charset/codeset/etc.).

>
> I haven't been able to come up with a solution I
> really like.
>
> Here are some possibilities:
>
> 1. Let the server say what it is going to use,
>     UTF-8 or 'undefined-raw' at the beginning
>     of the sftp session.
> [SNIP]

I'm not sure that the server should be responsible for deciding the
encoding: 'UTF-8' or 'RAW'.
Since a client can request more than one file in an SFTP session,
negotiation of the file name encoding should happen at the beginning of
the session.
The client should request the encoding from the server.


I guess that a new extension "encoding" will solve the problem:
1.) The client sends the server a list of accepted encodings and the
server returns the preferred one, or "RAW" or "UTF-8".
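
A minimal sketch of that exchange, in Python. The function name, the
argument layout and the rule "first entry in the server's own preference
order that the client also accepts" are assumptions made for
illustration; nothing here is defined by any SFTP draft.

```python
def choose_encoding(client_accepts, server_preference, fallback="RAW"):
    """Pick the encoding the server would return to the client.

    client_accepts:    encoding names sent by the client (hypothetical).
    server_preference: names the server supports, most preferred first.
    fallback:          "RAW" or "UTF-8", returned when nothing matches.
    """
    accepted = {name.upper() for name in client_accepts}
    for name in server_preference:
        if name.upper() in accepted:
            return name.upper()
    return fallback


# The server prefers ISO8859-5 over KOI8-R and the client accepts both.
print(choose_encoding(["KOI8-R", "ISO8859-5"], ["ISO8859-5", "KOI8-R"]))
# No common encoding, so the fallback is returned.
print(choose_encoding(["CP1251"], ["ISO8859-1"]))
```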

To do this, an sftp implementation MUST implement the "encoding"
extension.
The extension should be defined in the draft the same way "newline" is
defined.

I am not sure that sftp can use names like "ascii", "usascii", "C",
"POSIX", "ANSI_X3.4....", since ASCII defines only a 7-bit charset. When
an SFTP server supports a 7-bit encoding it should (must?) reject file
names containing symbols with a code greater than 127.
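
The check itself is small. A sketch in Python, where the function name
and the choice to signal rejection with ValueError are assumptions:

```python
def check_seven_bit(name: bytes) -> bytes:
    """Return name unchanged if every byte is <= 127, else raise ValueError."""
    for offset, byte in enumerate(name):
        if byte > 127:
            raise ValueError(
                f"byte 0x{byte:02x} at offset {offset} is not 7-bit"
            )
    return name


check_seven_bit(b"readme.txt")      # accepted
try:
    check_seven_bit(b"f.\xe1\xe2")  # bytes above 127, rejected
except ValueError as err:
    print("rejected:", err)
```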

When the encoding is not set, the server should treat file names as being
in "raw" or "utf-8" format.
Which one must be announced in "Server Initialization".
An empty "encoding" is an alias for "RAW".
When the encoding is set, the server must convert "local filename in
encoding" <-> "wire filename in utf-8".
The client may convert "wire filename in utf-8" <-> "local filename in
encoding". Note that the client knows which name conversion is done on
the server.
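
The server-side conversion could look roughly like this in Python. The
CODECS table and both function names are assumptions for illustration;
the Python codec names are the standard library ones. Decoding is strict,
so a name that is not valid in the agreed encoding raises an error
instead of being silently mistranslated.

```python
# Map the encoding names used in this mail to Python codec names.
CODECS = {
    "UTF-8": "utf-8",
    "ISO8859-1": "iso8859_1",
    "ISO8859-5": "iso8859_5",
    "KOI8-R": "koi8_r",
    "CP1251": "cp1251",
}


def local_to_wire(local_name: bytes, encoding: str) -> bytes:
    """Convert a local file name to the wire form (UTF-8 unless RAW)."""
    if encoding in ("", "RAW"):  # empty "encoding" is an alias for RAW
        return local_name
    return local_name.decode(CODECS[encoding]).encode("utf-8")


def wire_to_local(wire_name: bytes, encoding: str) -> bytes:
    """Convert a wire file name (UTF-8 unless RAW) to the local form."""
    if encoding in ("", "RAW"):
        return wire_name
    return wire_name.decode("utf-8").encode(CODECS[encoding])


# First three Cyrillic letters, uppercase then lowercase, in KOI8-R.
local = "\u0410\u0411\u0412\u0430\u0431\u0432".encode("koi8_r")
wire = local_to_wire(local, "KOI8-R")
print(wire_to_local(wire, "KOI8-R") == local)
```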

I guess that this solution is interoperable with SFTP clients of version
1, 2 or 3.
For version 4 (four) clients, a server that supports "encoding" should
announce 3 (three) as its maximum version.
For clients of version 1, 2 or 3 the server must use "RAW".
A server of version N (N>=5) must support the "UTF-8", "RAW" and
"ISO8859-1" encodings.
A server of version N (N>=5) may support the "ISO8859-X" encodings, where
X is in the range 2-15.
A client of version N (N>=5) must support the "encoding" extension with
"UTF-8" and "RAW".
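
One possible reading of these version rules, sketched in Python. Both
function names are assumptions, and so is the use of "the lower of the
client and server versions" for every case other than version 4 clients.

```python
def announced_version(client_version: int, server_max_version: int,
                      server_supports_encoding: bool) -> int:
    """Version the server announces to a client of the given version."""
    if client_version == 4 and server_supports_encoding:
        return 3  # version 4 clients are treated as version 3
    return min(client_version, server_max_version)


def required_server_encodings(version: int) -> set:
    """Encodings a server must handle at the given announced version."""
    if version >= 5:
        return {"UTF-8", "RAW", "ISO8859-1"}
    return {"RAW"}  # versions 1, 2 and 3 always use raw file names


for client in (3, 4, 5):
    version = announced_version(client, 5, True)
    print(client, version, sorted(required_server_encodings(version)))
```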


P.S.:
As an example, Cyrillic uses many encodings. The most popular are
ISO8859-5, KOI8-R and CP1251.
With Cyrillic, one UTF-8 file name can address different files on the
file system, depending on the encoding.
In this case the SFTP client, with help from the user, is responsible for
selecting the correct encoding.
This is the same case as access to the local file system.
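
A short Python illustration of why one name maps to several files. NAME
and the helper names are chosen here for illustration; the codec names
are the ones in the Python standard library.

```python
# First three Cyrillic letters, uppercase then lowercase.
NAME = "\u0410\u0411\u0412\u0430\u0431\u0432"

ENCODINGS = {
    "UTF-8": "utf-8",
    "ISO8859-5": "iso8859_5",
    "KOI8-R": "koi8_r",
    "CP1251": "cp1251",
}


def on_disk_names(name: str) -> dict:
    """Byte sequence stored on disk for name under each encoding."""
    return {label: name.encode(codec) for label, codec in ENCODINGS.items()}


def misread(name: str, written_as: str, read_as: str) -> str:
    """What a user sees when bytes written in one encoding are read in another."""
    return name.encode(ENCODINGS[written_as]).decode(ENCODINGS[read_as])


for label, raw in on_disk_names(NAME).items():
    print(f"{label:10} {raw.hex(' ')}")

# A KOI8-R name shown in a CP1251 terminal is a different Cyrillic string.
print(misread(NAME, "KOI8-R", "CP1251") != NAME)
```

All four byte sequences differ, so the same visible name refers to four
distinct directory entries.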

I don't have a problem addressing the correct file name on a remote host.
My system is properly set up and I can use UTF-8, ISO8859-5, KOI8-R and
CP1251 in file/directory names and in GUI terminals.
For the test, in directory $HOME/tmp/cyr I have four files with names in
the format f.<ENCODING>.<NAME>,
where <ENCODING> is one of those mentioned above and
<NAME> is the first three letters of the Cyrillic alphabet in uppercase
followed by the same letters in lowercase, in the same encoding.
The content of each file is "data.<ENCODING>:<NAME>", where <ENCODING> and
<NAME> match the file name.
In four xterms, one for every encoding, I ran the same command sequence.
The results are in the attached images "ssh_session.UTF-8.png",
"ssh_session.ISO8859-5.png", "ssh_session.KOI8-R.png"
and "ssh_session.CP1251.png".
With the command "echo get tmp/cyr/f.*.<NAME> | sftp localhost", where
sftp is the openssh SFTP version 3 client, the file that I get depends on
my locale charmap.


Regards,
Roumen

Attachment: ssh_session.UTF-8.png
Description: PNG image

Attachment: ssh_session.ISO8859-5.png
Description: PNG image

Attachment: ssh_session.KOI8-R.png
Description: PNG image

Attachment: ssh_session.CP1251.png
Description: PNG image


