On Friday, October 08, 2004 13:28:45 +0200 Niels Möller <nisse%lysator.liu.se@localhost> wrote:
> Joseph Galbraith <galb-list%vandyke.com@localhost> writes:
>
> > Here are some possibilities:
> >
> > 1. Let the server say what it is going to use, UTF-8 or 'undefined-raw' at the beginning of the sftp session.
> >    pro: simple. really simple.
> >    con: Doesn't address cases where the server might be able to do utf-8 for some file systems (ext-3 under fedora, I think is one example) but not others.
>
> I'd much prefer the simple way. Just let the server say what character set it uses, and do no conversions at the server end (a server that really knows what it is doing MAY handle a file system with mixed character sets by converting them to and from utf-8, and use utf-8 exclusively on the wire, but I don't think that is very practical). Some possible values are "unknown", "usascii", "iso-8859-1", "utf-8", and "utf-8-with-normalization-form-c".
So far, so good. Note that the appropriate set of possible values is that contained in the IANA Character Set registry (http://www.iana.org/assignments/character-sets). When a character set in that registry has more than one name or alias, the alias designated as "preferred MIME name" SHOULD be used.
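To illustrate the "preferred MIME name" rule, here is a minimal sketch of how a server might canonicalize a label before advertising it. The table is a small hand-copied excerpt of the registry, and the function name `preferred_name` is made up for this example; a real implementation would load the full registry.

```python
# Hypothetical excerpt of the IANA registry: alias (lowercased) -> name to advertise.
# Only a few entries are shown; this is not the complete registry.
PREFERRED_MIME_NAME = {
    "ansi_x3.4-1968": "US-ASCII",
    "iso646-us": "US-ASCII",
    "us-ascii": "US-ASCII",
    "iso_8859-1:1987": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "iso-8859-1": "ISO-8859-1",
    "csutf8": "UTF-8",
    "utf-8": "UTF-8",
}


def preferred_name(label):
    """Return the name to advertise for a charset label, or None if it is not in the table.

    Charset names are compared case-insensitively.
    """
    return PREFERRED_MIME_NAME.get(label.strip().lower())
```

With this excerpt, `preferred_name("Latin1")` gives "ISO-8859-1", while a label such as "usascii" is not in the table and yields None, so the server would have to pick a registered name instead.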
But see below...
> (Note that utf-8 with undefined normalization is suboptimal for filenames and for identifiers in general).
Yes. I'd normally argue that in any case where UTF-8 will be used on the wire, it probably ought to be normalized. However, the key issue here is not really normalization of names sent by the server but of those sent by the client, which refer to the names in the filesystem. Unfortunately, there are filesystems that use unnormalized Unicode, so the server would be required to do path processing itself, normalizing all the names in each directory until it found one matching what the user requested. That's going to get unwieldy real fast, I think.
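To make the cost concrete, here is a rough sketch of the path processing the server would be stuck with. It assumes the client sends NFC names and the filesystem stores arbitrary unnormalized Unicode; `find_entry`, `resolve_path`, and the `list_dir` callback are names invented for this example, not part of any draft.

```python
import unicodedata


def find_entry(requested, entries):
    """Return the stored name whose NFC form equals the NFC form of `requested`, or None."""
    wanted = unicodedata.normalize("NFC", requested)
    for stored in entries:
        # Every name in the directory has to be normalized before comparing.
        if unicodedata.normalize("NFC", stored) == wanted:
            return stored
    return None


def resolve_path(components, list_dir):
    """Map client-supplied components to the names actually stored on disk.

    `list_dir` is a hypothetical callback: given the tuple of components
    resolved so far, it returns the names in that directory.
    """
    resolved = []
    for component in components:
        match = find_entry(component, list_dir(tuple(resolved)))
        if match is None:
            return None
        resolved.append(match)
    return resolved
```

Note that every component of every path costs a full directory listing plus one normalization per entry, which is where the unwieldiness comes from.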
> I think the typical unix filesystem uses usascii for all "system files", and then has some home directories with iso-8859-x names, some with utf-8 names, which all include usascii. (I admit that I don't have any experience with asian unix installations). Then it will work reasonably to set the advertised charset from the user's $LC_CTYPE.
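A minimal sketch of deriving the advertised charset from the user's locale settings, assuming the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG) and locale names of the form language_TERRITORY.codeset@modifier. The function name `advertised_charset` is invented here, and treating the C/POSIX locale as US-ASCII is an assumption of this sketch.

```python
def advertised_charset(env):
    """Guess the charset to advertise from a mapping of environment variables."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            break
    else:
        # None of the variables is set to a non-empty value.
        return "unknown"

    if value in ("C", "POSIX"):
        # Assumption: treat the portable locale as plain US-ASCII.
        return "US-ASCII"

    name = value.split("@", 1)[0]  # drop a modifier such as "@euro"
    if "." not in name:
        # No codeset given, e.g. "sv_SE"; do not guess.
        return "unknown"
    return name.split(".", 1)[1]
```

A server could call it as `advertised_charset(os.environ)` after switching to the user's environment. The codeset is returned as spelled in the locale name, so it may still need mapping to a registered charset name.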
I would expect this to be more or less the case. In fact, I'd expect that on most systems, most users will be using the _same_ character set.
> There will naturally be some difficulties if different parts of the filesystem tree use different character sets, or if there are single pathnames that use components (directory names) with different character sets. However, those difficulties are precisely the same with *local* access to the file system, so it makes absolutely no sense to try to solve them in the sftp spec.
I agree -- solving that problem is beyond our ability, and I do not think it is worth the effort to try.
There is one thing I'd do differently. Whenever possible, I'd prefer that the server not just advertise the character set but actually do the conversions. The theory here is that the server is likely to have at its disposal the tools needed to convert between Unicode and any other character set in use on the server. The _client_, on the other hand, can only reasonably be expected to have those tools for character sets in use on the client. When these sets are disjoint, interoperability demands the use on the wire of a common character set (i.e. Unicode).
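As a rough sketch of what server-side conversion amounts to, assuming the server knows the charset its filenames are stored in and uses UTF-8 on the wire. The function names are made up for this example, and error handling is deliberately left to the caller.

```python
def name_to_wire(stored: bytes, server_charset: str) -> bytes:
    """Convert a filename as stored on the server into UTF-8 for the wire.

    Raises UnicodeDecodeError if the stored bytes are not valid in server_charset.
    """
    return stored.decode(server_charset).encode("utf-8")


def name_from_wire(wire: bytes, server_charset: str) -> bytes:
    """Convert a UTF-8 filename received from the client into the server's charset.

    Raises UnicodeEncodeError if the name has no representation in server_charset.
    """
    return wire.decode("utf-8").encode(server_charset)
```

For example, `name_to_wire(b"caf\xe9", "iso-8859-1")` yields `b"caf\xc3\xa9"`, and the client never needs to know anything about ISO-8859-1. A name the server's charset cannot represent fails on the server, where the failure can be reported sensibly.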
Let the server advertise the character set it thinks is in use. Let the client decide whether it wants UTF-8 or raw bytes.

-- Jeff