tech-userlevel archive
Re: using the interfaces in ctype.h
On 21-Apr-08, at 12:09 PM, Alan Barrett wrote:
On Mon, 21 Apr 2008, Greg A. Woods; Planix, Inc. wrote:
If the implementation masked the value before using it, then it would be
unable to distinguish EOF from UCHAR_MAX (typically '\377').
Indeed, however the current implementation doesn't even try to "detect" or
"distinguish" EOF, and indeed passing EOF without casting it properly
and/or masking will result in an out-of-bounds array access in the current
implementation.
What are you smoking? The use of constructs like
(_ctype_ + 1)[c]
in NetBSD's implementation (both in the macros defined in ctype.h, and
in the C code defined in libc/gen/isctype.c) will access _ctype_[0] when
c == -1, and -1 happens to be the value that NetBSD uses for EOF.
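To make the indexing concrete, here is a minimal sketch of an
offset-by-one table lookup. The names demo_ctype, DEMO_NUM_CHARS,
DEMO_DIGIT and the helper functions are hypothetical stand-ins, not
NetBSD's actual definitions, and the sketch assumes EOF is -1.

```c
#include <limits.h>

/* Hypothetical stand-in for a ctype table: one extra slot at the
 * front for EOF (assumed to be -1), then one slot per unsigned char. */
#define DEMO_NUM_CHARS (1 << CHAR_BIT)
#define DEMO_DIGIT     0x04

static unsigned char demo_ctype[1 + DEMO_NUM_CHARS];

/* Mark the decimal digits; every other slot stays zero. */
static void demo_init(void)
{
    int ch;

    for (ch = '0'; ch <= '9'; ch++)
        demo_ctype[1 + ch] = DEMO_DIGIT;
}

/* Slot of demo_ctype that the expression (demo_ctype + 1)[c] touches. */
static long demo_slot(int c)
{
    return (long)c + 1;
}

/* Nonzero when that slot lies inside the table. */
static int demo_in_bounds(int c)
{
    long slot = demo_slot(c);

    return slot >= 0 && slot < (long)(1 + DEMO_NUM_CHARS);
}

/* Unchecked lookup in the style discussed above. Only call it with
 * -1 or a value in 0..UCHAR_MAX; anything else reads outside the table. */
static int demo_isdigit(int c)
{
    return (demo_ctype + 1)[c] & DEMO_DIGIT;
}
```

With these definitions, c == -1 lands on slot 0, which is inside the
table, while c == -2 or c == UCHAR_MAX + 1 would land outside it.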
Ah, right, OK, sorry, my mistake. However that's really just a
pedantic point irrelevant to my main argument. So, -1, which in our
case happens to be EOF, is OK.
However that does nothing to help for any other negative values.
Assuming that the caller will only use -1 or a value between 0 and
_CTYPE_NUM_CHARS - 1 is not safe when the implementation is accessing an
array of only _CTYPE_NUM_CHARS+1 elements and the prototype for the API
specifies a parameter of type "int". If the implementation were an
inline function that could protect the array from out-of-bounds access
then that would be fine, but it's not on NetBSD.
Since masking inside the
implementation would violate the requirement to distinguish EOF from
UCHAR_MAX, it's good that NetBSD doesn't do that.
Huh? That makes no sense whatsoever.
For example (assuming 8-bit chars), if the implementation did the
equivalent of
c = c & 0xff;
before it used the value of c, then inputs of -1 (EOF) and 0xff (a
perfectly valid unsigned char, not the same as EOF) would both be
changed to 0xff, making it impossible for the rest of the code to
distinguish between these two inputs.
Huh? In NetBSD both (_ctype_+1)[-1]==0 and (_ctype_+1)[0xff]==0 so
what's to be distinguished?!?!?!?
Furthermore how's that any different than suggesting that the caller
cast the parameter with "(unsigned char)" or "(int)(unsigned char)"?
The cast still causes the passed value to be effectively masked with
0xFF and so even if the implementation did want to distinguish a
character of 0xFF from the value of EOF it could not.
You gain far more by building the cast into the implementation rather
than effectively forcing the application to employ it. At least with
it built in, the application won't prevent any future or alternate
implementation from detecting EOF before doing anything else with the
value.
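The three approaches argued over above can be compared side by side in a
small sketch. The function names are hypothetical, and the results noted
in the comments assume 8-bit chars and a two's complement int.

```c
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Index a masking implementation would use: -1 and 0xff both give 0xff
 * (on a two's complement machine). */
static int masked_index(int c)
{
    return c & UCHAR_MAX;
}

/* Index produced by a caller-side cast: conversion to unsigned char is
 * modular, so -1 and 0xff again both give UCHAR_MAX. */
static int cast_index(int c)
{
    return (int)(unsigned char)c;
}

/* An implementation that owns the cast can test for EOF first.
 * Returns a table slot: 0 for EOF, 1 + the character value otherwise. */
static int slot_eof_first(int c)
{
    if (c == EOF)
        return 0;
    return 1 + (int)(unsigned char)c;
}
```

Only slot_eof_first() keeps EOF and 0xff apart, and it still maps every
other int onto a slot inside a table of UCHAR_MAX + 2 entries.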
(FreeBSD, OpenBSD, and Darwin all seem to have much better
implementations, though they are all using proper (inline) functions,
which makes it easier in some ways to do it right.)
I am mildly curious. In what way are they "better"?
Well, they can't as easily be responsible for causing a program to crash,
for example.
You haven't shown an example, and I don't know what these other
implementations do.
If you'd like, I can point you at HTTP-accessible copies of the other
implementations.
Anyway, I don't subscribe to the theory that it's
"better" for the implementation to go out of its way to prevent an
erroneous program from crashing; I think that erroneous programs deserve
to crash.
However, making it crash with a useful error message and an
abort() is more friendly than just pressing on with bad data.
I would agree entirely though I suspect there are many folks who would
disagree (witness the outrage when assert() was sprinkled elsewhere
about in libc). However the NetBSD implementation doesn't even try --
it just behaves naively and may then access memory outside the defined
object's allocated storage. At least with a built-in mask on the
array access value nothing untoward can happen.
The more expensive inline function style of implementation would
afford both better ways of forcing an application to abort, as well as
better ways of safely ignoring values out of range, thus offering the
ideal solution to both our desire to force broken applications to
crash as well as the desire of others to treat them benignly and allow
them to run safely.
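As a rough illustration of the inline-function style, the sketch below
shows one range check feeding two policies: a lenient variant that
treats out-of-range values as "not a digit", and a strict variant that
aborts with a message. All names here are hypothetical and are not taken
from FreeBSD, OpenBSD, Darwin, or NetBSD.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_DIGIT 0x04

/* Slot 0 is reserved for EOF; slots 1.. hold the unsigned char values. */
static const unsigned char sketch_table[1 + UCHAR_MAX + 1] = {
    [1 + '0'] = SKETCH_DIGIT, [1 + '1'] = SKETCH_DIGIT,
    [1 + '2'] = SKETCH_DIGIT, [1 + '3'] = SKETCH_DIGIT,
    [1 + '4'] = SKETCH_DIGIT, [1 + '5'] = SKETCH_DIGIT,
    [1 + '6'] = SKETCH_DIGIT, [1 + '7'] = SKETCH_DIGIT,
    [1 + '8'] = SKETCH_DIGIT, [1 + '9'] = SKETCH_DIGIT,
};

/* The only arguments the is*() interfaces are defined for. */
static inline int sketch_in_range(int c)
{
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}

/* Table lookup; the caller must have checked the range already. */
static inline int sketch_lookup(int c)
{
    return sketch_table[c == EOF ? 0 : c + 1] & SKETCH_DIGIT;
}

/* Lenient policy: out-of-range values are simply "not a digit". */
static inline int sketch_isdigit_lenient(int c)
{
    return sketch_in_range(c) ? sketch_lookup(c) : 0;
}

/* Strict policy: out-of-range values abort with a diagnostic. */
static inline int sketch_isdigit_strict(int c)
{
    if (!sketch_in_range(c)) {
        fprintf(stderr, "sketch_isdigit: argument %d out of range\n", c);
        abort();
    }
    return sketch_lookup(c);
}
```

Neither variant can index outside sketch_table, at the cost of a
comparison or two per call that the bare macro does not pay.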
For my own use the built-in mask affords the latter solution
transparently to applications, and without having to hack too much of
the NetBSD code, so that's the way I'll go for the near term.
I recommend the following slightly more portable technique for
ctype.h:
#define _CTYPE_MASK ~(UINT_MAX << CHAR_BIT)
I believe that that's identical to UCHAR_MAX, given the way unsigned
arithmetic works, and that UCHAR_MAX+1 is guaranteed to be equal to
1<<CHAR_BIT.
Yes it may be true that UCHAR_MAX has the same value as my mask, at
least in NetBSD, but that's not how _ctype_ is defined in NetBSD.
_ctype_ is defined in terms of CHAR_BIT, so the definition I chose is
more readable and more logical (in my opinion, of course) than using
any other unrelated constant or macro referring to an unrelated
constant, and thus both the mask used to access _ctype_ and the
definition of _ctype_ itself are simultaneously dependent on the same
macro and independent of UCHAR_MAX. However perhaps my definition
should be:
#define _CTYPE_MASK ~(~0UL << CHAR_BIT)
just to be pedantic and portable and to avoid any reference to any
other constant.
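The claim that both spellings of the mask equal UCHAR_MAX can be checked
at compile time. This is a minimal sketch: the macro names are
hypothetical, outer parentheses have been added around each expansion,
and it assumes a C11 compiler and that CHAR_BIT is smaller than the
width of unsigned int (otherwise the first shift is undefined).

```c
#include <limits.h>

/* The two candidate mask definitions, parenthesized for safe expansion. */
#define MASK_FROM_UINT  (~(UINT_MAX << CHAR_BIT))
#define MASK_FROM_ULONG (~(~0UL << CHAR_BIT))

/* Both masks are unsigned values with exactly the low CHAR_BIT bits set. */
_Static_assert(MASK_FROM_UINT == UCHAR_MAX,
               "UINT_MAX-based mask differs from UCHAR_MAX");
_Static_assert(MASK_FROM_ULONG == UCHAR_MAX,
               "0UL-based mask differs from UCHAR_MAX");

/* Run-time form of the same check, for use in an ordinary test. */
static int masks_agree(void)
{
    return MASK_FROM_UINT == UCHAR_MAX && MASK_FROM_ULONG == UCHAR_MAX;
}
```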
#define isdigit(c) ((int)((_ctype_ + 1)[((c) & _CTYPE_MASK)] & _N))
That's just wrong, as I explained before. Given two distinct inputs
c == EOF (0xffffffff, if int is 32 bits) and c == UCHAR_MAX (0xff, if
char is 8 bits), the results from ((c) & _CTYPE_MASK) will be 0xff in
both cases, so the macro will be unable to distinguish between the two
inputs. OK, '\xff' doesn't happen to be a digit in any character set
that I know about, so it doesn't matter in this particular case, but
cases in which it does matter are easy to imagine.
In fact with the implementation of the NetBSD "ctype" is*() and to*()
APIs, nothing outside of the proper range of ASCII is meaningful and
so 0xFF is always outside the range of valid inputs.
Hang on, it's even worse than that. The C standard allows signed
integers to have a representation other than two's complement. The
result of (-1 & 0xff) on a one's complement machine will be 0xfe, not
0xff. NetBSD might not run on any one's complement machines, but I try
to consider them when writing code that's intended to be portable.
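Since one's complement hardware is hard to come by, the arithmetic can
only be simulated. The sketch below builds the bit patterns for a
negative number by hand in an assumed 32-bit word; the function names
are hypothetical, and this models the representations rather than
running on such a machine.

```c
#include <stdint.h>

/* Bit pattern of -magnitude in a 32-bit one's complement word:
 * every bit of the magnitude is inverted. */
static uint32_t ones_complement_neg(uint32_t magnitude)
{
    return ~magnitude;
}

/* Bit pattern of -magnitude in a 32-bit two's complement word:
 * invert every bit, then add one. */
static uint32_t twos_complement_neg(uint32_t magnitude)
{
    return ~magnitude + 1u;
}

/* Low eight bits of a word, as a 0xff mask would leave them. */
static uint32_t low_byte(uint32_t word)
{
    return word & 0xffu;
}
```

Here low_byte(ones_complement_neg(1)) is 0xfe, while
low_byte(twos_complement_neg(1)) is 0xff, matching the figures quoted
above.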
Your tangent about running on systems not supporting two's complement
is interesting, however I think it is well outside and beyond the
context of NetBSD, which I would humbly suggest will not run on any
such hardware any time soon and without vast effort on both the OS
side of things as well as within many applications which use the
"ctype" APIs. That's a boat that sailed off and sank quite some time
ago. :-)
--
Greg A. Woods; Planix, Inc.
<woods%planix.ca@localhost>