IETF-SSH archive

How to treat utf8 text with overlong utf8 sequences?



How should an implementation treat overlong ("non-minimum form") utf8
sequences that it receives in the various utf8 strings in the protocol?

It matters the most for utf8 strings that are displayed to the user,
e.g. the prompt strings in SSH_MSG_USERAUTH_INFO_REQUEST, where the
specification recommends control character filtering.

I prefer doing the control filtering before converting the data to the
local character set, because it's pretty well defined which
ucs4/unicode values are control characters (namely u0000-u001f,
u007f-u009f).
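
A minimal sketch of that filtering step, applied to already-decoded
code points. The names is_control and filter_controls and the "?"
replacement character are assumptions made for illustration, not
anything the specification defines. Note that the range above also
covers tab and newline, which a real implementation may want to let
through.

```python
def is_control(cp: int) -> bool:
    """True for the ranges named above: U+0000-U+001F and U+007F-U+009F."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def filter_controls(text: str, replacement: str = "?") -> str:
    """Replace every control character in decoded text.

    The replacement policy is an assumption; dropping the character or
    rejecting the whole string would be equally plausible choices.
    """
    return "".join(replacement if is_control(ord(ch)) else ch for ch in text)


if __name__ == "__main__":
    # A prompt with an embedded ESC (0x1b) that would start a terminal
    # escape sequence if displayed unfiltered.
    print(filter_controls("Password\x1b[2J: "))
```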

If we allow overlong control character sequences, then e.g. ESC can
be represented in utf8 as

  0x1b, (0xc0 0x9b), (0xe0 0x80 0x9b) ... or (0xfc 0x80 0x80 0x80 0x80 0x9b)
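
To make the byte sequences above concrete, here is a rough sketch
that encodes a code point using a chosen number of bytes, following
the 1 to 6 byte layout of RFC 2279. The function name
encode_with_length is hypothetical, and the code deliberately does
not enforce the shortest form, since the point is to produce the
overlong variants.

```python
# Lead-byte marker for each sequence length in the RFC 2279 layout.
LEAD_MARK = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}


def encode_with_length(cp: int, n: int) -> bytes:
    """Encode code point cp in exactly n bytes, overlong or not."""
    if n == 1:
        if cp > 0x7F:
            raise ValueError("code point does not fit in one byte")
        return bytes([cp])
    if n not in LEAD_MARK:
        raise ValueError("sequence length must be 1..6")
    # The lead byte carries 7 - n payload bits, each continuation byte 6.
    payload_bits = (7 - n) + 6 * (n - 1)
    if cp < 0 or cp >= (1 << payload_bits):
        raise ValueError("code point does not fit in this length")
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    return bytes([LEAD_MARK[n] | cp] + tail[::-1])


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, encode_with_length(0x1B, n).hex(" "))
```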

Filtering gets easier if I can first check if the utf8 string contains
overlong sequences at an early stage, and treat that as a protocol
error.
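
One way such an early check could look, as a sketch only: decode the
RFC 2279 style sequences by hand and reject any sequence that is
longer than its code point needs. The names decode_strict and
Utf8ProtocolError are made up for this example, and mapping the
exception to an actual disconnect is left out.

```python
from typing import List

# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


class Utf8ProtocolError(ValueError):
    """Hypothetical error type for malformed utf8 from the remote end."""


def decode_strict(data: bytes) -> List[int]:
    """Return the code points in data, rejecting overlong sequences."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b
        elif b < 0xC0:
            raise Utf8ProtocolError("unexpected continuation byte")
        elif b < 0xE0:
            n, cp = 2, b & 0x1F
        elif b < 0xF0:
            n, cp = 3, b & 0x0F
        elif b < 0xF8:
            n, cp = 4, b & 0x07
        elif b < 0xFC:
            n, cp = 5, b & 0x03
        elif b < 0xFE:
            n, cp = 6, b & 0x01
        else:
            raise Utf8ProtocolError("0xfe and 0xff never occur in utf8")
        if i + n > len(data):
            raise Utf8ProtocolError("truncated sequence")
        for c in data[i + 1:i + n]:
            if c & 0xC0 != 0x80:
                raise Utf8ProtocolError("bad continuation byte")
            cp = (cp << 6) | (c & 0x3F)
        if cp < MIN_FOR_LENGTH[n]:
            raise Utf8ProtocolError("overlong sequence")
        out.append(cp)
        i += n
    return out
```

With this, the single byte 0x1b decodes to ESC as usual, while
(0xc0 0x9b) and the longer forms raise an error before any control
character filtering has to look at them.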

Much the same question applies to the utf8 encoding of ud800-udfff
(surrogates) and the non-characters ufffe and uffff, which are also not
supposed to ever occur in valid utf8 text.
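
Once the text is available as code points, this check is a simple
range test. A small sketch, where is_forbidden and first_forbidden
are hypothetical helper names and the set of rejected values is
exactly the one listed above:

```python
from typing import Iterable, Optional


def is_forbidden(cp: int) -> bool:
    """True for surrogates (U+D800-U+DFFF) and for U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)


def first_forbidden(code_points: Iterable[int]) -> Optional[int]:
    """Return the first forbidden code point, or None if all are fine."""
    for cp in code_points:
        if is_forbidden(cp):
            return cp
    return None


if __name__ == "__main__":
    # 0xD800 is what the byte sequence (0xed 0xa0 0x80) would decode to.
    print(first_forbidden([0x41, 0xD800, 0x42]))
```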

RFC 2279 does not address these questions, as far as I can see.

I'm tempted to treat any use of overlong or otherwise invalid utf8
strings that I receive from the remote end as a protocol error.

* Do you think that is a reasonable thing to do?

* Does it violate the ssh specification?

* Will it cause any interoperability problems in practice?

Regards,
/Niels


