IETF-SSH archive


Re: How to treat utf8 text with overlong utf8 sequences?



I wrote:

> What do you think about sending overlong / "non-minimum form" utf8
> sequences in various utf8 strings in the protocol?

Thanks to all those who answered (Simon, Simon and Derek).

Unicode-3.0 (the version I have on paper) is a little vague: decoders
are apparently expected to decode overlong utf8 sequences, although
implementations are not allowed to generate them. Unicode-3.2 is more
explicit; see http://www.unicode.org/reports/tr28/,
in particular "Table 3.1B. Legal UTF-8 Byte Sequences".

This table explicitly excludes overlong utf8 sequences, and
utf8-encodings of unicode/utf16 surrogate characters. So it seems
fairly safe to treat use of such utf8 sequences as protocol errors in
ssh. Also RFC 3629 spells this out quite clearly.
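
A minimal sketch of such a check, assuming Python; the function name
is_legal_utf8 is hypothetical, and the byte ranges are taken from the
UTF-8 syntax in RFC 3629 section 4. It rejects overlong forms, encoded
surrogates, and anything above U+10FFFF, but (like the table) it still
accepts the non-characters:

```python
def is_legal_utf8(data: bytes) -> bool:
    """Return True if data consists only of byte sequences allowed by
    the UTF-8 syntax in RFC 3629 (hypothetical helper, for illustration)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                      # single-byte (ASCII)
            i += 1
            continue
        # For each lead byte: number of trailing bytes, and the allowed
        # range of the FIRST trailing byte. The narrowed ranges are what
        # exclude overlong forms and surrogates.
        if 0xC2 <= b0 <= 0xDF:              # C0/C1 would be overlong
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:                    # below A0 would be overlong
            need, lo, hi = 2, 0xA0, 0xBF
        elif b0 == 0xED:                    # above 9F would be a surrogate
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xF0:                    # below 90 would be overlong
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:                    # above 8F would exceed U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:                               # 80-C1 and F5-FF never start a sequence
            return False
        if i + need >= n:                   # truncated sequence
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):        # remaining bytes are plain tails
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

For example, is_legal_utf8(b"\xc0\xaf") is False (an overlong encoding
of "/"), and is_legal_utf8(b"\xed\xa0\x80") is False (the surrogate
U+D800).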

Neither the table in tr28 nor RFC 3629 explicitly forbids the unicode
"non-characters" U+FFFE and U+FFFF, and the corresponding utf8 sequences
(0xef 0xbf 0xbe) and (0xef 0xbf 0xbf), but in the ssh context, I think
it makes sense to treat them as protocol errors too.
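
One way this extra rule could look, as a sketch only: the function name
decode_ssh_utf8 is hypothetical, and using ValueError to signal a
protocol error is an assumption, not something the ssh drafts specify.
It relies on Python 3's strict utf-8 decoder, which already rejects
overlong sequences and encoded surrogates, and then checks for the two
non-characters separately:

```python
# The two non-characters discussed above.
NONCHARACTERS = {0xFFFE, 0xFFFF}

def decode_ssh_utf8(data: bytes) -> str:
    """Decode a utf8 string field, treating illegal sequences and the
    non-characters U+FFFE / U+FFFF as errors (hypothetical helper)."""
    # Strict decoding raises UnicodeDecodeError on overlong forms,
    # surrogates, truncated sequences, and code points above U+10FFFF.
    text = data.decode("utf-8")
    for ch in text:
        if ord(ch) in NONCHARACTERS:
            raise ValueError("non-character U+%04X in utf8 string" % ord(ch))
    return text
```

Since UnicodeDecodeError is a subclass of ValueError, a caller can
catch ValueError alone to handle both kinds of failure.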

I hope we don't need to clarify this explicitly in the ssh
specification, as it's a general utf8 issue. However, I think we should
update our utf-8 references to refer to RFC 3629.

Regards,
/Niels


