der Mouse wrote:
As I see it, this amounts to "the IETF position is that humans think of these things as character strings, so we demand that they be handled as character strings by the protocol".
Absolutely not. The IETF position is that if I am attempting to log in to machine H via SSH, I should be able to do so by knowing the necessary bits: username, password, etc. The requirement is that no matter what user interface I use to enter these bits, I should be able to successfully authenticate. Now if I happen to be in front of a keyboard-based interface which is Unicode aware and happens to generate "SMALL LETTER u WITH DIAERESIS" as two code points represented as two 32-bit values or 8 octets, instead of the non-Unicode-aware system which uses a single code point represented as a single byte, I have a problem. I type exactly the same thing on both keyboards and get extremely different octet strings. Are you telling me that once I configure a login to work from one particular platform and user interface configuration, I should be locked into that choice to the exclusion of all the other system types and user input methods which are available? I would find it hard to believe that anyone could decide that this is desirable.
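To make the mismatch concrete, a small sketch of my own (the encodings are illustrative; nothing here is mandated by a spec):

    # Illustration: the same abstract character arrives as one octet or as
    # eight, and only normalization lets the two forms compare equal.
    import unicodedata

    precomposed = "\u00fc"     # one code point: LATIN SMALL LETTER U WITH DIAERESIS
    decomposed  = "u\u0308"    # two code points: 'u' + COMBINING DIAERESIS

    print(precomposed == decomposed)          # False: the raw strings differ
    print(precomposed.encode("latin-1"))      # b'\xfc'  -- a single octet
    print(decomposed.encode("utf-32-be"))     # 8 octets -- two 32-bit values

    nfc_a = unicodedata.normalize("NFC", precomposed)
    nfc_b = unicodedata.normalize("NFC", decomposed)
    print(nfc_a == nfc_b)                     # True: compare only after normalizing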
What is the IETF position, then, on how someone such as me should handle the situation I'm faced with: writing software specified from this point of view (ssh, in my case) for systems on which these entities are _not_ character strings (a fairly traditional Unix variant, NetBSD in my case)? I'm faced with an encoding-agnostic filesystem interface and implementation, wherein filename components are sequences of octets not including 0x00 and 0x2f, independent of any characters; I'm faced with password hashing routines that work with octet strings, not character strings; etc.
As an AFS developer I am very sympathetic to the situation. Unfortunately, there are no true raw octet strings. Octet sequences are created within a context, and without knowing that context it is not possible to properly manipulate the octets. At the present time AFS does not support a notion of storing character-set context information. This causes severe problems for users who want to access the names associated with directories and files from heterogeneous systems. File names created from most Unix user interfaces in Western Europe will produce strings using Latin-1 code points. Those from Eastern Europe will use Latin-2. Linux systems may store unnormalized UTF-8. Windows systems will store one of the many IBM/MS-DOS OEM code pages. A name created on one system will not only be displayed incorrectly to users of another system; it may be completely unparsable.
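A tiny illustration (the octets and code pages are hypothetical, not taken from any real volume):

    # Illustration only (not AFS code): the same stored octets decode to
    # different names depending on which character-set context is assumed.
    name_bytes = b"caf\xe6"                 # hypothetical filename octets, no stored context

    print(name_bytes.decode("latin-1"))     # 'cafæ'  -- Western European reading
    print(name_bytes.decode("iso8859-2"))   # 'cafć'  -- Eastern European reading
    print(name_bytes.decode("cp437"))       # 'cafµ'  -- DOS OEM code page reading
    try:
        name_bytes.decode("utf-8")
    except UnicodeDecodeError:
        print("unparsable as UTF-8")        # the name cannot even be parsed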
At the moment the only safe set of strings that can be used are those restricted to US-ASCII. This is because US-ASCII is the only common set of values which will be properly interpreted without additional context information, which is not available. In the long run we are going to need to fix AFS to do one of two things: (1) store context information associating the character set used to create each name AND provide the means necessary for file servers to translate names from one character set to all the other possible sets; or (2) provide support for a normalized character set which is inclusive of all characters which users may be able to enter. Having worked on the character-set translation capabilities of C-Kermit, I can tell you that storing context information and providing translation is lossy and imperfect. UNICODE solves the problem in a much nicer and more heterogeneous manner. It is by no means perfect, but biting the bullet and supporting it makes the end user experience oh so much nicer. In the coming year I will be adding UNICODE support to AFS. I expect that all file systems will have to provide support for it in the years to come. Operating systems which do not provide support for character-set processing will find a smaller and smaller percentage of users.
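To see why I say translation is lossy while a single normalized form is not, a small sketch (the name and character sets are my own illustration):

    # Illustration only: pairwise 8-bit translation loses characters,
    # a single normalized Unicode form does not.
    import unicodedata

    name = "Grüße_ćevapi"   # hypothetical name mixing Latin-1 and Latin-2 repertoires

    # Option (1): translate into another 8-bit set -- anything outside the
    # target repertoire must be substituted, so the translation is lossy.
    print(name.encode("latin-1", errors="replace"))   # 'ć' comes out as b'?'

    # Option (2): store one normalized Unicode form -- every character survives.
    stored = unicodedata.normalize("NFC", name).encode("utf-8")
    print(stored.decode("utf-8") == name)             # True: lossless round trip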
Are such systems beyond the pale for the IETF, and I can do anything I want, with a suggestion that I try to stay within something like the spirit of the spec? Is it simply not possible to implement ssh (or anything else specified with similar normalization rules) on such a system within the spec without converting all the affected code (filename, username, and password handling in ssh's case) to the character-string paradigm? Am I required to reject attempted non-ASCII strings in these places for no reason other than an inability to know what the user intended the character set - if any - to be? (For that matter, what grounds are there for assuming that octets in the ASCII range are intended to correspond to ASCII characters, rather than, say, KOI-7?)
You have to draw the line somewhere if you are going to make progress at improving the cross-platform user experience. Systems without support for character-set processing are useful only when all of the systems they share information with are used in exactly the same context. In a distributed heterogeneous environment such as the Internet, this assumption cannot be made. If a system wants to assume that all of its local input is in KOI-7, and an SSH implementation wants to be able to support that, then the implementation must provide for character-set translation from KOI-7 to UNICODE. If you need such translation tables, they are available from the Unicode Consortium and are implemented within a wide number of open source packages.
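The shape of that translation step, as a sketch (Python ships no KOI-7 codec, so KOI8-R stands in here purely for illustration; a real implementation would carry its own KOI-7 mapping table):

    # Illustration of "decode local input to Unicode before it is compared
    # or transmitted".  KOI8-R is a stand-in for KOI-7.
    local_input = b"\xd0\xc1\xd2\xcf\xcc\xd8"       # hypothetical local octets
    as_unicode = local_input.decode("koi8_r")       # -> 'пароль' (code points, not octets)
    print(as_unicode, as_unicode.encode("utf-8"))   # one unambiguous wire form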
Or what? Given how common such systems are, it seems a bit odd that the IETF would take a position so apparently incompatible with them. As an implementer I find the situation rather confusing; there's obviously something I don't understand going on, and I'd like to know what the IETF's idea of the right thing for me to do here is.
You do what Kermit has done since 1981. When moving information between systems you convert from the local character set to a network-neutral form, and then the receiver converts to its local form. Before the advent of UNICODE, Kermit was forced to rely on the user to choose an intermediary character set which would be inclusive of all characters used and be understood by both systems. When this was not possible, substitution rules and best guesses forced the data stream to become lossy. With the availability of UNICODE, the set of characters which can be sent without loss has been greatly enlarged. Normalization rules are used to prevent multiple representations of a common input form from preventing interoperability. While this has a negative impact on the ability to display strings to the end user after use, it enhances the ability to provide for cross-platform comparison and computation.

Jeffrey Altman
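P.S. A rough sketch of the flow just described (the character sets and helper names are illustrative assumptions, not part of any spec):

    # Illustrative sketch of the Kermit-style flow: each side converts
    # between its local form and a network-neutral, normalized Unicode form.
    import unicodedata

    def to_network_form(local_octets: bytes, local_charset: str) -> bytes:
        """Local form -> network-neutral form (normalized UTF-8)."""
        text = local_octets.decode(local_charset)
        return unicodedata.normalize("NFC", text).encode("utf-8")

    def to_local_form(wire_octets: bytes, local_charset: str) -> bytes:
        """Network-neutral form -> this system's local form."""
        return wire_octets.decode("utf-8").encode(local_charset)

    # A Latin-1 sender and a UTF-8 sender (typing 'u' plus a combining
    # diaeresis) produce the same canonical wire form ...
    wire_a = to_network_form("Grüße".encode("latin-1"), "latin-1")
    wire_b = to_network_form("Gru\u0308ße".encode("utf-8"), "utf-8")
    assert wire_a == wire_b

    # ... and each receiver gets it back in its own local form.
    print(to_local_form(wire_a, "latin-1"))    # b'Gr\xfc\xdfe'
    print(to_local_form(wire_a, "utf-8"))      # b'Gr\xc3\xbc\xc3\x9fe'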