once again, some discussion about <ctype.h> interfaces....

To: NetBSD User-Level Technical Discussion List <tech-userlevel%NetBSD.org@localhost>
Subject: once again, some discussion about <ctype.h> interfaces....
From: "Greg A. Woods" <woods%planix.ca@localhost>
Date: Mon, 28 Jan 2013 22:42:57 -0800

Quite some time ago there was a thread here about "using the interfaces
in ctype.h".

More recently there was another thread on current-users about "Casting
ctype lookups".

Since the first discussion I'd been avoiding the problem, but not quite
correctly (basically by ignoring the EOF-is-different issue by including
a mask of the parameter in all array references).  I've undone that now,
but I'm not happy about it -- I'm getting far too many unnecessary
warnings on legacy code that should not have to be modified (i.e. code
that is passing 'char', never 'int', and thus never EOF).

I think I've now finally figured out a way for at least the macros to
avoid causing unnecessary warnings, and to make them work correctly
regardless of the width or sign of the parameter, though unfortunately
by generating new different warnings, at least with some warning flags.

I've added my discussion about it in the form of a comment to patch
<ctype.h>, and I'd appreciate any comment about the technical issues,
especially w.r.t. the new warnings this code generates in some uses and
how they might be avoided.

(apologies for this mode, but I'd like to keep the text of my comments
in one place so it remains consistent, and so when changed I have a
canonical location where I can find the changes)

--- ctype.h     27 Jan 2013 19:31:17 -0800      1.29
+++ ctype.h     28 Jan 2013 22:05:47 -0800      
@@ -84,6 +84,94 @@
 #endif
 __END_DECLS
 
+/*
+ * XXX these fix the problem of passing (signed char):
+ *
+#define isdigit(c)     ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) ((_ctype_ + 
1)[(unsigned char) (c)] & _N)))
+#define        islower(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & _L)))
+#define        isspace(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & _S)))
+#define        ispunct(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & _P)))
+#define        isupper(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & _U)))
+#define        isalpha(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & (_U|_L))))
+#define        isxdigit(c)     ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & (_N|_X))))
+#define        isalnum(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & (_U|_L|_N))))
+#define        isprint(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & (_P|_U|_L|_N|_B))))
+#define        isgraph(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & (_P|_U|_L|_N))))
+#define        iscntrl(c)      ((sizeof(c) > 1 && (c) == EOF) ? 0 : ((int) 
((_ctype_ + 1)[(unsigned char) (c)] & _C)))
+#define        tolower(c)      ((sizeof(c) > 1 && (c) == EOF) ? EOF : ((int) 
((_tolower_tab_ + 1)[(unsigned char) (c)])))
+#define        toupper(c)      ((sizeof(c) > 1 && (c) == EOF) ? EOF : ((int) 
((_toupper_tab_ + 1)[(unsigned char) (c)])))
+ *
+ * It is quite common for older C code to pass plain "char" values to <ctype.h>
+ * interfaces (after all the traditional "signature" of these functions defined
+ * them as accepting "char" parameters!), but now NetBSD warns of these uses,
+ * even if they are benign.  The simple fix is to add a cast to (unsigned char)
+ * for their parameters.  However having to modify working legacy code in this
+ * way is terribly annoying.  The above macros implement the <ctype.h>
+ * interfaces safely such that they can deal with signed char parameters which
+ * may contain negative values, while still dealing cleanly with the potential
+ * of being passed an int containing an EOF (-1) value, thus conforming very
+ * well with the requirements of the standards while not making deamons fly out
+ * of anyone's nose when legacy code is used to process data containing
+ * characters with their 8'th bit set.  Indeed they _always_ just do The Right
+ * Thing and avoid the adverse affects of sign extension when signed chars are
+ * used as their parameters.
+ *
+ * Note that these implementations could also allow removal of the
+ * (_ctype_+1)[] trick, but that would mean fixing up the functions that load
+ * LC_CTYPE locales, plus the isctype.c functions, as well as of course the
+ * definitions of _C_ctype_, _C_tolower, and _C_toupper.
+ *
+ * The extra test here will normally be avoided with any decent compiler due to
+ * the compile time sizeof() comparison since quite often the parameter will be
+ * a char of some sort (i.e. have a sizeof() == 1).  (I find that with my own
+ * collection of code, use of "int" with <ctype.h> interfaces is far less
+ * common than uses of plain "char".)
+ *
+ * However the comparison with EOF will elicit an un-desired warning from most
+ * compilers, such as GCC's "warning: comparison is always false due to limited
+ * range of data type", or Clang's "warning: will never be executed
+ * [-Wunreachable-code]") whenever the parameter is an "unsigned char" due to
+ * the impossibility of any unsigned char's value ever matching -1, i.e. as a
+ * signed int by default.  I don't know how to avoid these warnings and still
+ * achieve the desired compatability with even the normal set of common
+ * parameter types (char, signed char, unsigned char, and int).
+ *
+ * Obviously this doesn't help the real libc <ctype.h> functions -- sign
+ * extension will still happen when an (signed char) parameter with a negative
+ * value is passed to one of them, and then they will be stuck with, at
+ * minimum, an in-ability to distinguish between 0xFF and -1 (EOF) and so
+ * cannot use either an (unsigned char) cast as above or a mask of 0xFF without
+ * potentially returning the wrong value when passed EOF.
+ *
+ * However, since any potential EOF value should in theory always be dealt with
+ * in such a way as to avoid ever calling any of the <ctype.h> functions with
+ * it in the first place, perhaps the libc <ctype.h> functions could/should
+ * just play it safe and effectively treat EOF (i.e. when passed as an "int")
+ * as if it were 0xFF?  I'm not even certain this would contravene any
+ * standards as I can't find any firm definition of what these i"functions"
+ * should return if passed EOF.  Really, how often does EOF actually get passed
+ * to these functions?  I'd hope it's approximatlely never, and that those are
+ * the only places there are any real bugs in legacy code that need fixing!
+ * (Though they are also likely harder bugs to fix!)  The only danger I see is
+ * that programs using toupper() or tolower() _and_ which might pass the EOF
+ * (as an int) to those functions, might return a valid character instead of
+ * EOF.
+ *
+ * Perhaps compilers would be smart to generate a warning whenever a narrower
+ * signed parameter will be sign extended (if negative) to widen it to match
+ * the prototype (or the default parameter conversions).  I don't off-hand
+ * recall many other APIs where widening from a smaller signed value to a
+ * larger signed value is the norm, except in similar cases where sign
+ * extension inevitably causes confusion and/or undefined behaviour as it can
+ * here.
+ *
+ * Perhaps it would also be wise to set NetBSD's compiler to default to
+ * treating un-qualified "char" types as "unsigned char" as well.  Having
+ * "char" be always unsigned would perhaps help "hide" (in a very good way!)
+ * bugs with legacy code having to deal with more 8-bit (e.g. Latin-1) input
+ * than it was originally written to deal with.
+ */
+/* XXX the masks here should be macros that can be shared with 
src/lib/libc/gen/isctype.c */
 #define        isdigit(c)      ((int)((_ctype_ + 1)[(c)] & _N))
 #define        islower(c)      ((int)((_ctype_ + 1)[(c)] & _L))
 #define        isspace(c)      ((int)((_ctype_ + 1)[(c)] & _S))

-- 
                                                Greg A. Woods
                                                Planix, Inc.

<woods%planix.ca@localhost>        +1 250 762-7675        http://www.planix.ca/

Attachment: pgps976q8plSS.pgp
Description: PGP signature

Follow-Ups:
- Re: once again, some discussion about <ctype.h> interfaces....
  - From: Robert Elz
- Re: once again, some discussion about <ctype.h> interfaces....
  - From: Joerg Sonnenberger
- Re: once again, some discussion about <ctype.h> interfaces....
  - From: Mouse

Prev by Date: Re: Functions locality and instruction fetch
Next by Date: Re: once again, some discussion about <ctype.h> interfaces....
Previous by Thread: GPIB library available on NetBSD
Next by Thread: Re: once again, some discussion about <ctype.h> interfaces....
Indexes:

Home | Main Index | Thread Index | Old Index