IETF-SSH archive


Re: Unimplementability [was Re: adding IUTF8 to encoded terminal modes in SSH Protocol Assigned Numbers]



>> The basic problem is that various strings are specified as being
>> UTF-8, but things in question are things that the systems in
>> question don't store as character strings, but rather as octet
>> strings.
> Are you thinking only about usernames and passwords, or are there
> other strings on the wire where this is a problem?

Usernames and passwords are the only things that came to mind
immediately.  On a quick look at the RFCs, I also see:

- SSH_MSG_USERAUTH_BANNER text
- SSH_MSG_USERAUTH_PASSWD_CHANGEREQ prompt
- SSH_MSG_DISCONNECT reason
- SSH_MSG_DEBUG text
- SSH_MSG_CHANNEL_OPEN_FAILURE description
- SSH_MSG_CHANNEL_REQUEST exit-signal error message

Some of these (eg, DISCONNECT reasons) are not much of an issue.
Others (eg, USERAUTH_BANNER text) are not so simple.

>> This means that either an ssh implementation has to have some
>> configuration switch telling it what the string encoding is in use
>> or it will conform only if the local admins stick to UTF-8 for those
>> things.
> I think this is the right way, at least in theory.

Which is?  I gave two alternatives; your "this is the right way" is
rather ambiguous.

> If you have non-ascii usernames and passwords on your system, and
> want to accept logins from remote systems, you can't expect any
> interoperability if you don't even know what character set you are
> using.  You'd have to either tell the other end what you're using, or
> use some specific encoding on the wire and convert locally.

There actually is a third option, which, as theoretically undesirable
as it is, has the advantage that it actually works in practice: push
that up to the user.  If my username - or password - is (hex) 4d f8 b5
a7 eb, then if I'm at a device which uses 8859-1, I need to type <M>
<o-slash> <micro> <section> <e-diaeresis>; if I'm at a device which
uses 8859-7, I need to type <M> <psi> <dialytika tonos> <section>
<lambda>; if I'm using CP850, I don't know what the last four octets
would be as characters, but I don't think it would be either of those.
Whichever way, the correct five octets get transmitted, and that's what
the system needs.  After all, the username stored in /etc/passwd is an
octet string, not a character string (there's no encoding attached to
it), and the password hash for comparison is computed over an octet
string, not a character string.
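
To make the octets-versus-characters point concrete, here is a minimal
Python sketch that decodes those same five octets under each of the
three encodings mentioned above.  The codec names are Python's, and
readings() is a hypothetical helper invented for illustration; nothing
here is taken from any ssh implementation.

```python
import unicodedata

# The five octets from the example; nothing in them names an encoding.
OCTETS = bytes.fromhex("4df8b5a7eb")

def readings(octets, codecs=("iso8859-1", "iso8859-7", "cp850")):
    """Decode the same octets under each codec (Python's codec names)."""
    return {codec: octets.decode(codec) for codec in codecs}

for codec, text in readings(OCTETS).items():
    # Print Unicode character names so the output stays plain ASCII.
    names = ", ".join(unicodedata.name(ch, "?") for ch in text)
    print(f"{codec}: {names}")
```

Each line of output lists a different set of characters for the same
octet string, which is why the user has to know what to type on each
device.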

> SSH chooses the latter approach, but the first approach would have
> the same need for configuring what encoding to advertise.

> Not sure how big a hassle it is in practice.

For me?  None, because I use ASCII for those things, and all the HCI
I/O devices I use also use ASCII.  But I've used lots of terminals (and
terminal emulators) with various things in the 128-255 positions, and
if I had non-ASCII octets in my username and/or password, I'd have had
to vary the characters typed in order to get the right octet sequences.

You may not like this.  _I_ may not like it.  It may not be ideal.  But
it's how a lot of things worked, including historical Unix, and the way
a nontrivial number still work: everything is octet sequences once the
bits leave the keyboard until they come back to the screen; they are
character sequences only in the minds of humans working with them (that
is, the encoding tags necessary to turn octet sequences into character
sequences are not carried with them, instead existing only in humans'
minds and (usually implicitly) in I/O devices).

> I guess in theory it's possible on a unix system to have different
> users use different character encodings for their passwords.  I
> don't see any good way to provide reliable interoperability in that
> case (and no, I don't think it's a good solution to say passwords are
> octet strings and it's the user's responsibility to figure out what
> the corresponding characters are on each system; maybe we disagree
> here).

I'm not sure whether I think it's a good solution in the abstract.  I
_am_ sure it's the solution that's in place now in a substantial
fraction of the installed base; trying to pretend otherwise is just
playing head-in-the-sand and leads to things like the ssh spec we have,
which faces implementors writing for systems like Unix variants -
anything using getpwnam()/crypt()/etc, which perforce operate on octet
strings rather than character strings - with only a few choices, none
of them particularly good, as I outlined last email.
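
A rough sketch of why this bites, in Python.  Assumptions: SHA-256
stands in for crypt() (the real hash differs, but it likewise sees
only octets), and stored_hash() is a hypothetical name.  The password
was set from an 8859-1 terminal; a client that faithfully sends the
"same characters" as UTF-8 produces different octets, so a server
that hashes what it receives without converting will not match.

```python
import hashlib

def stored_hash(octets: bytes) -> str:
    """Stand-in for crypt(): hashes raw octets, with no notion of encoding."""
    return hashlib.sha256(octets).hexdigest()

# The password as a human thinks of it: M, o-slash, micro, section, e-diaeresis.
password_chars = "M\u00f8\u00b5\u00a7\u00eb"

# What the password database holds: a hash of the 8859-1 octets.
on_disk = stored_hash(password_chars.encode("iso8859-1"))

# What a server would compute if it hashed the UTF-8 wire form unconverted.
from_utf8_wire = stored_hash(password_chars.encode("utf-8"))

# What it computes if the original five octets arrive untouched.
from_raw_octets = stored_hash(bytes.fromhex("4df8b5a7eb"))

print(on_disk == from_utf8_wire)   # False
print(on_disk == from_raw_octets)  # True
```

To make the UTF-8 case work, the server would need to know which
encoding to convert back to, and that is exactly the piece of
information the system does not record.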

>> In moussh's case, I chose to treat those things as opaque octet
>> strings.  [...]
> I think it makes sense to have the default be either ascii or utf-8,

That might be good advice to someone designing an OS.  But, as an
implementor, I'm faced with existing OSes, which store octet strings,
not tagged with the encoding information necessary to make them
character strings, and with no place to find any "this system uses
8859-1" or "this system uses ASCII" configuration, because there isn't
any such configuration.  Existing systems - at least some of them! -
don't do that; they push the issue off to the user, as I outlined
above.  I could in principle have given moussh a configuration option
telling it what encoding the local system uses, but that presumes that
such a thing is well-defined, which may not be the case: there's
nothing preventing a system from being connected to an 8859-1 terminal
used by person A in French and an 8859-7 terminal used by person B in
Greek at the same time.  It wouldn't be ideal, in that it pushes off to
humans the potential conflicts between, eg, person A wanting username
XYZ and person B wanting username TUV, when XYZ in 8859-1 and TUV in
8859-7 end up with the same octet sequence.  (With long names, like
"mouse", this isn't very likely, but lots of login names are very
short, like, two letters; e-acute a-ring isn't really that much more or
less plausible a username than iota epsilon, to name two sequences that
would clash in the above example.)
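
The clash is easy to check.  A small Python sketch (codec names are
Python's; the variable names are just for illustration):

```python
# Person A types e-acute, a-ring on an 8859-1 terminal.
name_a = "\u00e9\u00e5"
# Person B types iota, epsilon on an 8859-7 terminal.
name_b = "\u03b9\u03b5"

octets_a = name_a.encode("iso8859-1")
octets_b = name_b.encode("iso8859-7")

# Different characters, identical octets: the system sees one username.
print(octets_a.hex(" "), "|", octets_b.hex(" "))  # e9 e5 | e9 e5
print(name_a == name_b, octets_a == octets_b)     # False True
```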

I was - and am - not sure what the best option is, so I went with the
easiest to implement, easiest to change later, and IMO most in keeping
with the Unix philosophy.  Making moussh unusable on systems on which
multiple encodings are in use, just to toe the IETF's religious "Thou
Shalt Use UTF-8" line, struck me as a Bad Thing, especially since, as
it stands, it's still usable by those who do like UTF-8; use UTF-8 for
all those things and moussh conforms just fine.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


