IETF-SSH archive


UTF-8 [was Re: New Version Notification - draft-sgtatham-secsh-iutf8-05.txt]



> What’s inherently broken in using UTF-8...?

Different characters occupy different amounts of space.

(Some) characters are larger than one addressing unit (on most
machines, one octet).
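
A minimal sketch of the two points above, in Python, reading "space" as
octets of storage (that reading is an assumption; display width varies
too).  The sample characters are arbitrary picks for illustration.

```python
# Arbitrary sample characters, written as escapes so the source stays ASCII:
# A, e-acute, euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of octets each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} octet(s) in UTF-8")
```

Only the first of these fits in a single octet; the rest take two,
three, and four respectively.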

There are octet sequences which are not valid UTF-8 character
sequences.  This results in text tools that break on small amounts of
non-UTF-8 text mixed into the text they're handling.  (This is not
really a problem with UTF-8 proper - there are also octets that are not
valid 8859-1 text, for example - but a problem with how it's
implemented; in my experience UTF-8 text tools break when faced with
non-UTF-8 octet sequences, whereas single-octet text tools usually
don't break when faced with invalid octets.)
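
As a rough illustration of that difference, here is a sketch using
Python's built-in codecs.  The sample octets are made up, and Python's
"latin-1" codec is more permissive than a strict reading of 8859-1
would be, since it accepts all 256 octet values.

```python
# Made-up input: 0xE9 is e-acute in Latin-1, but on its own it is not
# a valid UTF-8 sequence (it announces a multi-octet character that never comes).
data = b"caf\xe9 ok"

# A strict UTF-8 decoder refuses the whole input.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A single-octet decoder carries on regardless.
latin1_text = data.decode("latin-1")

# UTF-8 tools do not have to break: the bad octet can be replaced
# with U+FFFD, or smuggled through so the original octets can be restored.
lenient_text = data.decode("utf-8", errors="replace")
smuggled = data.decode("utf-8", errors="surrogateescape")
restored = smuggled.encode("utf-8", errors="surrogateescape")
```

Whether a given tool breaks is thus a choice made by the tool; the
strict behaviour just happens to be the default here.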

Some characters have multiple distinct encodings.  (Okay, that too is
not really UTF-8 proper - it's actually Unicode.)
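
A small sketch of what this looks like in practice, assuming the point
is about precomposed versus combining forms (the choice of e-acute is
only an example):

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent

# The two render the same but are different code point sequences,
# so they compare unequal and encode to different UTF-8 octets.
same_by_default = precomposed == decomposed
utf8_precomposed = precomposed.encode("utf-8")   # two octets
utf8_decomposed = decomposed.encode("utf-8")     # three octets

# Normalizing both sides to one form (NFC here) makes them compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

Any comparison or lookup that skips the normalization step will treat
the two spellings as different text.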

I've seen it said (by the git documentation) that transcoding from some
character sets like 8859-1 to UTF-8 is not a reversible operation.
This seems dubious to me, but, if true, it would be another, and fairly
strong, strike against UTF-8 in my opinion.
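
For what it is worth, a quick check with Python's codecs suggests the
8859-1 case round-trips cleanly.  This is only a sketch: it relies on
Python's "latin-1" codec, which maps each octet to the code point of the
same value, and it says nothing about other character sets or about
what the git documentation actually had in mind.

```python
# Every possible octet value, treated as 8859-1 text.
all_octets = bytes(range(256))

# 8859-1 -> UTF-8: values 0x00-0x7F stay one octet, 0x80-0xFF become two.
as_utf8 = all_octets.decode("latin-1").encode("utf-8")

# UTF-8 -> 8859-1 again.
round_trip = as_utf8.decode("utf-8").encode("latin-1")

print(len(as_utf8), round_trip == all_octets)
```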

That's just what comes to mind immediately.  I don't use UTF-8 myself if
I can help it (when I run into something using it my major concern is
how to make it stop doing so), so it's entirely possible there are
others I'm just not aware of.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse%rodents-montreal.org@localhost
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


