Re: UTF8

To: ietf-ssh%NetBSD.org@localhost
Subject: Re: UTF8
From: der Mouse <mouse%Rodents.Montreal.QC.CA@localhost>
Date: Tue, 4 Jan 2005 11:14:07 -0500 (EST)

>> As I see it, this amounts to "the IETF position is that humans think
>> of these things as character strings, so we demand that they be
>> handled as character strings by the protocol".
> Absolutely not.  The IETF position is that if I am attempting to
> login to machine H via SSH, I should be able to do so by knowing the
> necessary bits: username, password, etc.

But which is (say) the username?  The character string g e-acute r a r
d, or the octet string 0x67 0xe9 0x72 0x61 0x72 0x64?  A human is more
likely to think of it as the former; the reality to the computer is
more likely to be the latter.  (At least assuming an encoding-agnostic
user database such as is at issue here.)  So does "entering the username"
mean typing g e-acute r a r d (for any of the various ways of typing
those characters), or does it mean typing whatever is necessary to
generate 0x67 0xe9 0x72 0x61 0x72 0x64?  (Note that either or both may
be impossible to do under reasonably plausible circumstances.)
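
To make the two readings concrete, here is a minimal Python sketch.  It
is only an illustration, not part of any ssh implementation, and the
variable names are made up.  It assumes the octets quoted above are the
ISO 8859-1 encoding of the name (one reading consistent with those
bytes) and shows that the same character string becomes a different
octet string under UTF-8.

```python
# Character-string view of the username: g e-acute r a r d.
name = "g\u00e9rard"

# Octet-string view from the text above (assumed here to be ISO 8859-1).
stored = bytes([0x67, 0xE9, 0x72, 0x61, 0x72, 0x64])

latin1_octets = name.encode("iso-8859-1")
utf8_octets = name.encode("utf-8")

print(latin1_octets == stored)  # does the Latin-1 encoding match the stored octets?
print(utf8_octets.hex(" "))     # the UTF-8 octets, for comparison
print(utf8_octets == stored)    # does the UTF-8 encoding match the stored octets?
```

A client that types the "same" name but encodes it as UTF-8 sends seven
octets where the database holds six, so an octet-wise comparison fails
even though a human would call the two names identical.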

The stated IETF position on interoperability makes no sense unless it's
based on the former of those two positions, which is why I phrased my
gloss on it the way I did.

> Are you telling me that once I configure a login to work from one
> particular platform and user interface configuration that I should be
> locked into that choice exclusive of all the other system types and
> user input methods which are available?

No; even if you go with the octet-string model, you are locked in only
to system types and input methods that permit you to generate that
octet string.

Very much the way, in fact, that the character-string model locks you
into the ability to generate the desired character string.

It's just a question of which lock you prefer to be in.

> In the long run we are going to need to fix AFS to do one of two
> things:

> (1) store context information associating the character set [...]
> (2) provide support for a normalized character set [...]

Only if AFS is (or becomes) philosophically committed to considering
file names to be character strings.  (While this may not be a wrong
choice, it is still a choice, and you seem to be arguing from a
position that is unaware of that.)

Character strings make a lot of sense from some points of view, yes -
and that's true not only of filenames but of other things, such as
usernames.  Character strings are a better match to the way most people
think of them, if nothing else.  But they bring a whole passel of
problems with them, some of which we're discussing here.

The biggest problem is perhaps the one that got me writing to the list
about this: a large body of existing code takes the octet-string point
of view, and it is not clear what the best way is to impedance-match it
to a spec that takes the character-string point of view.

> You have to draw the line somewhere if you are going to make progress
> at improving cross platform user experience.

I guess what I don't quite see is how rendering ssh unimplementable (or
implementable only crippledly, such as by restricting everything to
ASCII) on traditional Unix systems is going to improve anything.
Honestly, what I expect it to do is to create two incompatible dialects
of ssh, one taking the character-string point of view and the other
taking the octet-string point of view, with humans required to deal
with the mismatch whenever they meet.  (There may be a third dialect
that imposes willy-nilly some guessed character set on the octet-string
environment....)

> Systems without support for character-set processing are useful only
> when all of the systems they share information with are used in
> exactly the same context.

I think that's too strong.  Rather, I would say, they allow mismatches
to show through in some form, usually in the form of text in one
character set being displayed in another and coming through as
nonsense.  This is not to say that they're _not_ useful in the face of
such things, just _less_ useful, or at least less transparently useful.
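
As a rough illustration of a mismatch "showing through", the following
Python sketch pushes the same name through the wrong decoder in each
direction.  The choice of ISO 8859-1 and UTF-8 as the two character
sets is an assumption made for the example; any mismatched pair behaves
similarly.

```python
name = "g\u00e9rard"  # g e-acute r a r d

# UTF-8 octets shown by a display that assumes ISO 8859-1:
# every octet is a valid Latin-1 character, so nothing fails;
# the text is simply wrong.
shown_as_latin1 = name.encode("utf-8").decode("iso-8859-1")

# ISO 8859-1 octets shown by a display that assumes UTF-8:
# 0xe9 does not start a valid sequence here, so it is replaced by U+FFFD.
shown_as_utf8 = name.encode("iso-8859-1").decode("utf-8", errors="replace")

# ascii() is used so the output does not depend on the terminal's own charset.
print(ascii(shown_as_latin1))
print(ascii(shown_as_utf8))
```

In both cases five of the six characters survive, which is the sense in
which the result is less useful rather than useless.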

The corresponding upside, of course, is a simpler implementation and
more flexibility.

>> [...] I'd like to know what the IETF's idea of the right thing for
>> me to do here is.
> You do what Kermit has done since 1981.  When moving information
> between systems you convert from the local character set to a network
> neutral form

But this step cannot be done when I'm sending, because all I have is an
octet string.  I don't know what character set it's in; strictly
speaking, I don't even know whether it _is_ in a character set, though
for usernames and passwords it is extremely likely that it is, at least
in someone's mind (and for filenames it's reasonably likely).

> and then the receiver converts to its local form.

And this is equally impossible, for similar reasons.
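
The difficulty on both ends can be sketched in Python as follows.  The
list of candidate character sets is hypothetical, picked only because
all three assign some character to every octet involved; nothing in the
octet string itself says which of them, if any, is the intended one.

```python
octets = bytes([0x67, 0xE9, 0x72, 0x61, 0x72, 0x64])

# Hypothetical candidate list; the octets carry no charset label.
candidates = ["iso-8859-1", "koi8-r", "cp437"]

decodings = {cs: octets.decode(cs) for cs in candidates}
for cs, text in decodings.items():
    # ascii() keeps the output independent of the terminal's own charset.
    print(cs, ascii(text))

# Every candidate decodes cleanly and round-trips to the same octets,
# yet the resulting character strings disagree with one another.
all_round_trip = all(text.encode(cs) == octets for cs, text in decodings.items())
distinct_readings = len(set(decodings.values()))
print(all_round_trip, distinct_readings)
```

Since each guess is self-consistent, there is no error for the sender
or receiver to detect; converting to a network-neutral form requires
charset information that an octet-string environment does not have.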

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse%rodents.montreal.qc.ca@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

