IETF-SSH archive
How to treat utf8 text with overlong utf8 sequences?
What do you think about sending overlong / "non-minimum form" utf8
sequences in various utf8 strings in the protocol?
It matters the most for utf8 strings that are displayed to the user,
e.g. the prompt strings in SSH_MSG_USERAUTH_INFO_REQUEST, where the
specification recommends control character filtering.
I prefer doing the control filtering before converting the data to the
local character set, because it's pretty well defined which
ucs4/unicode values are control characters (namely u0000-u001f,
u007f-u009f).
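
As a rough illustration of filtering on code points rather than on bytes in the local character set, here is a minimal Python sketch. The names is_control and filter_controls and the "?" replacement are hypothetical, not taken from any SSH implementation. The sketch filters every code point in the two ranges above, including tab and newline; whether to let some of those through is a separate policy decision.

```python
def is_control(cp: int) -> bool:
    """True for code points in u0000-u001f or u007f-u009f."""
    return 0x00 <= cp <= 0x1F or 0x7F <= cp <= 0x9F


def filter_controls(text: str, replacement: str = "?") -> str:
    """Replace every control character in already-decoded text."""
    return "".join(replacement if is_control(ord(ch)) else ch for ch in text)


# Example: an ESC embedded in a prompt string is replaced before display.
print(filter_controls("Password\x1b[2J: "))
```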
If we allow overlong control character sequences, then e.g. ESC can
be represented in utf8 as
0x1b, (0xc0 0x9b), (0xe0 0x80 0x9b) ... or (0xfc 0x80 0x80 0x80 0x80 0x9b)
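
To make the byte sequences above concrete, the following Python sketch decodes a single sequence using the RFC 2279 bit layout (1 to 6 bytes) without any minimum-length check. The function lenient_decode_one is a hypothetical helper written for this illustration only. A decoder of this kind maps all of the listed forms to the same value, 0x1b.

```python
# (mask, pattern, length) for the lead byte of each multi-byte form.
LEADS = [
    (0xE0, 0xC0, 2),
    (0xF0, 0xE0, 3),
    (0xF8, 0xF0, 4),
    (0xFC, 0xF8, 5),
    (0xFE, 0xFC, 6),
]


def lenient_decode_one(seq: bytes) -> int:
    """Decode one sequence, deliberately NOT rejecting overlong forms."""
    first = seq[0]
    if first < 0x80:
        return first
    for mask, pattern, length in LEADS:
        if first & mask == pattern:
            if len(seq) != length:
                raise ValueError("wrong sequence length")
            cp = first & ~mask & 0xFF  # payload bits of the lead byte
            for b in seq[1:]:
                if b & 0xC0 != 0x80:
                    raise ValueError("bad continuation byte")
                cp = (cp << 6) | (b & 0x3F)
            return cp
    raise ValueError("invalid lead byte")


forms = [
    b"\x1b",
    b"\xc0\x9b",
    b"\xe0\x80\x9b",
    b"\xfc\x80\x80\x80\x80\x9b",
]
for form in forms:
    print(form.hex(), hex(lenient_decode_one(form)))
```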
Filtering gets easier if I can first check if the utf8 string contains
overlong sequences at an early stage, and treat that as a protocol
error.
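
One way such an early check could look is sketched below in Python. It decodes the whole string and raises as soon as a sequence is longer than its value requires, using the value ranges from RFC 2279. ProtocolError, sequence_length and decode_strict are hypothetical names; how a real implementation reports the error (disconnect, ignore the message, and so on) is left open here.

```python
class ProtocolError(Exception):
    """Hypothetical error type for malformed data from the remote end."""


# Smallest value that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def sequence_length(first: int) -> int:
    """Sequence length implied by a lead byte."""
    if first < 0x80:
        return 1
    if first < 0xC0:
        raise ProtocolError("unexpected continuation byte")
    for limit, length in ((0xE0, 2), (0xF0, 3), (0xF8, 4), (0xFC, 5), (0xFE, 6)):
        if first < limit:
            return length
    raise ProtocolError("invalid lead byte")


def decode_strict(data: bytes) -> list:
    """Return the decoded values, rejecting overlong sequences."""
    values = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        if len(chunk) < n:
            raise ProtocolError("truncated sequence")
        # Payload bits of the lead byte: 7 bits for n == 1, else 7 - n bits.
        cp = chunk[0] if n == 1 else chunk[0] & (0xFF >> (n + 1))
        for b in chunk[1:]:
            if b & 0xC0 != 0x80:
                raise ProtocolError("bad continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        if cp < MIN_FOR_LENGTH[n]:
            raise ProtocolError("overlong sequence")
        values.append(cp)
        i += n
    return values
```

With this check in place, the control character filter only ever sees ESC as the single byte 0x1b; the longer forms never get that far.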
Much the same question applies to the utf8 encodings of ud800-udfff
(surrogates) and the non-characters ufffe and uffff, which are also not
supposed to ever occur in valid utf8 text.
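
For comparison, here is a hedged Python sketch that leans on the language's built-in strict decoder instead of a hand-written one. Python's "utf-8" codec rejects overlong forms and encoded surrogates, but it accepts ufffe and uffff, so those are checked separately. It also rejects the 5 and 6 byte forms and values above u10ffff, which is stricter than RFC 2279. ProtocolError and check_received_text are hypothetical names.

```python
class ProtocolError(Exception):
    """Hypothetical error type for malformed data from the remote end."""


NON_CHARACTERS = (0xFFFE, 0xFFFF)


def check_received_text(data: bytes) -> str:
    """Decode text from the remote end, or raise ProtocolError."""
    try:
        # Strict decoding: overlong forms and surrogates raise here.
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ProtocolError("invalid utf8: %s" % exc) from exc
    for ch in text:
        if ord(ch) in NON_CHARACTERS:
            raise ProtocolError("non-character u%04x in text" % ord(ch))
    return text


samples = [
    b"Password: ",     # ordinary text
    b"\xed\xa0\x80",   # ud800 (surrogate) encoded as three bytes
    b"\xef\xbf\xbe",   # ufffe
    b"\xc0\x9b",       # overlong ESC
]
for sample in samples:
    try:
        print(repr(check_received_text(sample)))
    except ProtocolError as err:
        print("rejected:", err)
```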
RFC 2279 does not address these questions, as far as I can see.
I'm tempted to treat any use of overlong or otherwise invalid utf8
strings that I receive from the remote end as a protocol error.
* Do you think that is a reasonable thing to do?
* Does it violate the ssh specification?
* Will it cause any interoperability problems in practice?
Regards,
/Niels