Source-Changes-HG archive
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]
[src/trunk]: src/lib/libedit UTF-8 fixes from Ingo Schwarze:
details: https://anonhg.NetBSD.org/src/rev/aa5b73afcddd
branches: trunk
changeset: 343504:aa5b73afcddd
user: christos <christos%NetBSD.org@localhost>
date: Mon Feb 08 17:18:43 2016 +0000
description:
UTF-8 fixes from Ingo Schwarze:
1. Assume that errno is non-zero when entering read_char()
and that read(2) returns 0 (indicating end of file).
Then, the code will clear errno before returning.
(Obviously, the statement "errno = 0" is almost always
a bug unless there is save_errno = errno right before it
and the previous value is properly restored later,
in all reachable code paths.)
2. When encountering an invalid byte sequence, the code discards
all following bytes until MB_LEN_MAX overflows; consider, for
example, 0xc2 immediately followed by a few valid ASCII bytes.
Three of those ASCII bytes will be discarded.
3. On a POSIX system, EILSEQ will always be set after reading a
valid (yes, valid, not invalid!) UTF-8 character. The reason
is that mbtowc(3) will first be called with a length limit
(third argument) of 1, which will fail, return -1, and - on
a POSIX system - set errno to EILSEQ.
This third bug is mitigated a bit because i couldn't find any
system that actually conforms to POSIX in this respect: None
of OpenBSD, NetBSD, FreeBSD, Solaris 11, and glibc set errno
when an incomplete character is passed to mbtowc(3), even though
that is required by POSIX.
Anyway, that mbtowc(3) bug will be fixed at least in OpenBSD
after release unlock, so it would be good to fix this bug in
libedit before fixing the bug in mbtowc(3).
How can these three bugs be fixed?
1. As far as i understand it, the intention of the bogus errno = 0
is to undo the effects of failing system calls in el_wset(),
sig_set(), and read__fixio() if the subsequent read(2) indicates
end of file. So, restoring errno has to be moved right after
read__fixio(). Of course, neither 0 nor e is the right value
to restore: 0 is wrong if errno happened to be set on entry, e
would be wrong because if one read(2) fails but a second attempt
succeeds after read__fixio(), errno should not be touched. So,
the errno to be restored in this case has to be saved before
calling read(2) for the first time.
2. Solving the second issue requires distinguishing invalid and
incomplete characters, but that is impossible with the function
mbtowc(3) because it returns -1 in both cases and sets errno
to EILSEQ in both cases (once properly implemented).
It is vital that each input character is processed right away.
It is not acceptable to wait for the next input character before
processing the previous one because this is an interactive
library, not a batch system. Consequently, the only situation
where it is acceptable to wait for the next byte without first
processing the previous one(s) is when the previous one(s) form
an incomplete sequence that can be continued to form a valid
character.
Consequently, short of reimplementing a full UTF-8 state machine
by hand, the only correct way forward is to use mbrtowc(3).
Even then, care is needed to always have the state object
properly initialized before using it, and to not discard a valid
ASCII or UTF-8 lead byte if it happens to follow an invalid
sequence.
3. Fortunately, solution 2. also solves issue 3. as a side effect,
by no longer using mbtowc(3) in the first place.
diffstat:
lib/libedit/chartype.h | 4 +++-
lib/libedit/read.c | 39 ++++++++++++++++++++++++++++++++-------
2 files changed, 35 insertions(+), 8 deletions(-)
diffs (115 lines):
diff -r 4e984d3319be -r aa5b73afcddd lib/libedit/chartype.h
--- a/lib/libedit/chartype.h Mon Feb 08 16:44:45 2016 +0000
+++ b/lib/libedit/chartype.h Mon Feb 08 17:18:43 2016 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: chartype.h,v 1.15 2015/05/17 13:14:41 christos Exp $ */
+/* $NetBSD: chartype.h,v 1.16 2016/02/08 17:18:43 christos Exp $ */
/*-
* Copyright (c) 2009 The NetBSD Foundation, Inc.
@@ -61,6 +61,7 @@
#endif
#define ct_mbtowc mbtowc
+#define ct_mbrtowc mbrtowc
#define ct_mbtowc_reset mbtowc(0,0,(size_t)0)
#define ct_wctomb wctomb
#define ct_wctomb_reset wctomb(0,0)
@@ -116,6 +117,7 @@
#else /* NARROW */
#define ct_mbtowc error
+#define ct_mbrtowc error
#define ct_mbtowc_reset
#define ct_wctomb error
#define ct_wctomb_reset
diff -r 4e984d3319be -r aa5b73afcddd lib/libedit/read.c
--- a/lib/libedit/read.c Mon Feb 08 16:44:45 2016 +0000
+++ b/lib/libedit/read.c Mon Feb 08 17:18:43 2016 +0000
@@ -1,4 +1,4 @@
-/* $NetBSD: read.c,v 1.71 2014/07/06 18:15:34 christos Exp $ */
+/* $NetBSD: read.c,v 1.72 2016/02/08 17:18:43 christos Exp $ */
/*-
* Copyright (c) 1992, 1993
@@ -37,7 +37,7 @@
#if 0
static char sccsid[] = "@(#)read.c 8.1 (Berkeley) 6/4/93";
#else
-__RCSID("$NetBSD: read.c,v 1.71 2014/07/06 18:15:34 christos Exp $");
+__RCSID("$NetBSD: read.c,v 1.72 2016/02/08 17:18:43 christos Exp $");
#endif
#endif /* not lint && not SCCSID */
@@ -317,6 +317,7 @@
char cbuf[MB_LEN_MAX];
size_t cbp = 0;
int bytes = 0;
+ int save_errno = errno;
again:
el->el_signal->sig_no = 0;
@@ -332,9 +333,10 @@
default:
break;
}
- if (!tried && read__fixio(el->el_infd, e) == 0)
+ if (!tried && read__fixio(el->el_infd, e) == 0) {
+ errno = save_errno;
tried = 1;
- else {
+ } else {
errno = e;
*cp = '\0';
return -1;
@@ -343,24 +345,47 @@
/* Test for EOF */
if (num_read == 0) {
- errno = 0;
*cp = '\0';
return 0;
}
#ifdef WIDECHAR
if (el->el_flags & CHARSET_IS_UTF8) {
+ mbstate_t mbs;
+ size_t rbytes;
+again_lastbyte:
if (!utf8_islead((unsigned char)cbuf[0]))
goto again; /* discard the byte we read and try again */
++cbp;
- if ((bytes = ct_mbtowc(cp, cbuf, cbp)) == -1) {
- ct_mbtowc_reset;
+ /* This only works because UTF8 is stateless */
+ memset(&mbs, 0, sizeof(mbs));
+ switch (rbytes = ct_mbrtowc(cp, cbuf, cbp, &mbs)) {
+ case (size_t)-1:
+ if (cbp > 1) {
+ /*
+ * Invalid sequence, discard all bytes
+ * except the last one.
+ */
+ cbuf[0] = cbuf[cbp - 1];
+ cbp = 0;
+ goto again_lastbyte;
+ } else {
+ /* Invalid byte, discard it. */
+ cbp = 0;
+ goto again;
+ }
+ case (size_t)-2:
if (cbp >= MB_LEN_MAX) { /* "shouldn't happen" */
errno = EILSEQ;
*cp = '\0';
return -1;
}
+ /* Incomplete sequence, read another byte. */
goto again;
+ default:
+ /* Valid character, process it. */
+ bytes = (int)rbytes;
+ break;
}
} else if (isascii((unsigned char)cbuf[0]) ||
/* we don't support other multibyte charsets */
Home |
Main Index |
Thread Index |
Old Index