der Mouse wrote:
As I see it, this amounts to "the IETF position is that humans think of these things as character strings, so we demand that they be handled as character strings by the protocol".
Absolutely not. The IETF position is that if I am attempting to log in to machine H via SSH, I should be able to do so by knowing the necessary bits: username, password, etc. The requirement is that no matter what user interface I use to enter these bits, I should be able to successfully authenticate. Now if I happen to be in front of a keyboard-based interface which is Unicode aware and happens to generate "SMALL LETTER u WITH DIAERESIS" as two code points represented as two 32-bit values or 8 octets, instead of the non-Unicode-aware system which uses a single code point represented as a single byte, I have a problem. I type exactly the same thing on both keyboards and get extremely different octet strings. Are you telling me that once I configure a login to work from one particular platform and user interface configuration, I should be locked into that choice to the exclusion of all the other system types and user input methods which are available? I would find it hard to believe that anyone could decide that this is desirable.
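To make the mismatch concrete, a small sketch of my own (the encodings are illustrative; nothing here is mandated by a spec):

    # Illustration: the same abstract character arrives as one octet or as
    # eight, and only normalization lets the two forms compare equal.
    import unicodedata

    precomposed = "\u00fc"     # one code point: LATIN SMALL LETTER U WITH DIAERESIS
    decomposed  = "u\u0308"    # two code points: 'u' + COMBINING DIAERESIS

    print(precomposed == decomposed)          # False: the raw strings differ
    print(precomposed.encode("latin-1"))      # b'\xfc'  -- a single octet
    print(decomposed.encode("utf-32-be"))     # 8 octets -- two 32-bit values

    nfc_a = unicodedata.normalize("NFC", precomposed)
    nfc_b = unicodedata.normalize("NFC", decomposed)
    print(nfc_a == nfc_b)                     # True: compare only after normalizing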
What is the IETF position, then, on how someone such as me should handle the situation I'm faced with: writing software specified from this point of view (ssh, in my case) for systems on which these entities are _not_ character strings (a fairly traditional Unix variant, NetBSD in my case)? I'm faced with an encoding-agnostic filesystem interface and implementation, wherein filename components are sequences of octets not including 0x00 and 0x2f, independent of any characters; I'm faced with password hashing routines that work with octet strings, not character strings; etc.
As an AFS developer I am very sympathetic to the situation. Unfortunately, there are no true raw octet strings. Octet sequences are created within a context, and without knowing that context it is not possible to properly manipulate the octets. At the present time AFS does not support a notion of storing character-set context information. This causes severe problems for users who want to access the names associated with directories and files from heterogeneous systems. File names created from most Unix user interfaces in Western Europe will produce strings using Latin-1 code points. Those from Eastern Europe will use Latin-2. Linux systems may store unnormalized UTF-8. Windows systems will store one of the many IBM/MS-DOS OEM code pages. A name created on one system will not only be displayed incorrectly to users of another system; it may be completely unparsable.
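A tiny illustration (the octets and code pages are hypothetical, not taken from any real volume):

    # Illustration only (not AFS code): the same stored octets decode to
    # different names depending on which character-set context is assumed.
    name_bytes = b"caf\xe6"                 # hypothetical filename octets, no stored context

    print(name_bytes.decode("latin-1"))     # 'cafæ'  -- Western European reading
    print(name_bytes.decode("iso8859-2"))   # 'cafć'  -- Eastern European reading
    print(name_bytes.decode("cp437"))       # 'cafµ'  -- DOS OEM code page reading
    try:
        name_bytes.decode("utf-8")
    except UnicodeDecodeError:
        print("unparsable as UTF-8")        # the name cannot even be parsed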
At the moment the only safe set of strings that can be used are those restricted to US-ASCII. This is because US-ASCII is the only common set of values which will be properly interpreted without additional context information, which is not available. In the long run we are going to need to fix AFS to do one of two things: (1) store context information associating the character set used to create each name AND provide the means necessary for file servers to translate names from one character set to all the other possible sets; or (2) provide support for a normalized character set which is inclusive of all characters which users may be able to enter. Having worked on the character-set translation capabilities of C-Kermit, I can tell you that storing context information and providing translation is lossy and imperfect. UNICODE solves the problem in a much nicer and more heterogeneous manner. It is by no means perfect, but biting the bullet and supporting it makes the end user experience oh so much nicer. In the coming year I will be adding UNICODE support to AFS. I expect that all file systems will have to provide support for it in the years to come. Operating systems which do not provide support for character-set processing will find a smaller and smaller percentage of users.
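To see why I say translation is lossy while a single normalized form is not, a small sketch (the name and character sets are my own illustration):

    # Illustration only: pairwise 8-bit translation loses characters,
    # a single normalized Unicode form does not.
    import unicodedata

    name = "Grüße_ćevapi"   # hypothetical name mixing Latin-1 and Latin-2 repertoires

    # Option (1): translate into another 8-bit set -- anything outside the
    # target repertoire must be substituted, so the translation is lossy.
    print(name.encode("latin-1", errors="replace"))   # 'ć' comes out as b'?'

    # Option (2): store one normalized Unicode form -- every character survives.
    stored = unicodedata.normalize("NFC", name).encode("utf-8")
    print(stored.decode("utf-8") == name)             # True: lossless round trip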
Are such systems beyond the pale for the IETF, and I can do anything I want, with a suggestion that I try to stay within something like the spirit of the spec? Is it simply not possible to implement ssh (or anything else specified with similar normalization rules) on such a system within the spec without converting all the affected code (filename, username, and password handling in ssh's case) to the character-string paradigm? Am I required to reject attempted non-ASCII strings in these places for no reason other than an inability to know what the user intended the character set - if any - to be? (For that matter, what grounds are there for assuming that octets in the ASCII range are intended to correspond to ASCII characters, rather than, say, KOI-7?)
You have to draw the line somewhere if you are going to make progress at improving the cross-platform user experience. Systems without support for character-set processing are useful only when all of the systems they share information with are used in exactly the same context. In a distributed heterogeneous environment such as the Internet, this assumption cannot be made. If a system wants to assume that all of its local input is in KOI-7, and an SSH implementation wants to be able to support that, then the implementation must provide for character-set translation from KOI-7 to UNICODE. If you need such translation tables, they are available from the Unicode Consortium and are implemented within a wide number of open source packages.
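The shape of that translation step, as a sketch (Python ships no KOI-7 codec, so KOI8-R stands in here purely for illustration; a real implementation would carry its own KOI-7 mapping table):

    # Illustration of "decode local input to Unicode before it is compared
    # or transmitted".  KOI8-R is a stand-in for KOI-7.
    local_input = b"\xd0\xc1\xd2\xcf\xcc\xd8"       # hypothetical local octets
    as_unicode = local_input.decode("koi8_r")       # -> 'пароль' (code points, not octets)
    print(as_unicode, as_unicode.encode("utf-8"))   # one unambiguous wire form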
Or what? Given how common such systems are, it seems a bit odd that the IETF would take a position so apparently incompatible with them. As an implementer I find the situation rather confusing; there's obviously something I don't understand going on, and I'd like to know what the IETF's idea of the right thing for me to do here is.
You do what Kermit has done since 1981. When moving information between systems you convert from the local character set to a network-neutral form, and then the receiver converts to its local form. Before the advent of UNICODE, Kermit was forced to rely on the user to choose an intermediary character set which would be inclusive of all characters used and be understood by both systems. When this was not possible, substitution rules and best guesses forced the data stream to become lossy. With the availability of UNICODE, the set of characters which can be sent without loss has been greatly enlarged. Normalization rules are used to prevent multiple representations of a common input form from preventing interoperability. While this has a negative impact on the ability to display strings to the end user after use, it enhances the ability to provide for cross-platform comparison and computation.

Jeffrey Altman
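P.S. A rough sketch of the flow just described (the character sets and helper names are illustrative assumptions, not part of any spec):

    # Illustrative sketch of the Kermit-style flow: each side converts
    # between its local form and a network-neutral, normalized Unicode form.
    import unicodedata

    def to_network_form(local_octets: bytes, local_charset: str) -> bytes:
        """Local form -> network-neutral form (normalized UTF-8)."""
        text = local_octets.decode(local_charset)
        return unicodedata.normalize("NFC", text).encode("utf-8")

    def to_local_form(wire_octets: bytes, local_charset: str) -> bytes:
        """Network-neutral form -> this system's local form."""
        return wire_octets.decode("utf-8").encode(local_charset)

    # A Latin-1 sender and a UTF-8 sender (typing 'u' plus a combining
    # diaeresis) produce the same canonical wire form ...
    wire_a = to_network_form("Grüße".encode("latin-1"), "latin-1")
    wire_b = to_network_form("Gru\u0308ße".encode("utf-8"), "utf-8")
    assert wire_a == wire_b

    # ... and each receiver gets it back in its own local form.
    print(to_local_form(wire_a, "latin-1"))    # b'Gr\xfc\xdfe'
    print(to_local_form(wire_a, "utf-8"))      # b'Gr\xc3\xbc\xc3\x9fe'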