tech-userlevel archive
Re: bin/39002: harmful AWK extension: non-portable escaped character
I think it may still be important here to point out that AWK has
separate syntax for expressing regular expressions and strings because
of this very issue of what backslashes represent in each syntax. This
should be abundantly clear to any C programmer who has had occasion to
represent REs in C strings (or any other of the many languages
offering only C-like strings and no separate RE syntax), or to those
who have used something like lex which also has separate syntax for
regular expressions and strings.
As the awk(1) manual says:
    String constants are quoted " ", with the usual C escapes
    recognized within.
and:
    / re / is a constant regular expression; any string (constant or
    variable) may be used as a regular expression, except in the
    position of an isolated regular expression in a pattern.
Perhaps if the manual also explicitly warned that expressing an RE as
a string required extra escaping of all backslashes (instead of
relying on the reader's experience with C and/or shell (command-line)
strings) then this issue would, eventually, go _quietly_ away.
I note the mawk manual does say explicitly (w.r.t. string constant
syntax):
    If you escape any other character \c, you get \c, i.e., mawk
    ignores the escape.
and like the AWK manual it also declares that RE syntax is separate
from string syntax by saying:
    Regular expressions are enclosed in slashes,
and finally it says:
    Any expression can be used on the right hand side of the ~ or !~
    operators or passed to a built-in that expects a regular
    expression. If needed, it is converted to string, and then
    interpreted as a regular expression.
which in a round-about way also says what I've said above, at least to
anyone cognizant of the differences between strings and REs, i.e. that
care will have to be taken to properly represent backslashes and such
in strings that will be interpreted as regular expressions.
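A minimal sketch of that extra escaping, assuming a POSIX shell and
whatever awk is first in the PATH (the two-line sample input is made up
for illustration):

```sh
# Constant regular expression: a single backslash makes the dot literal.
printf 'a.b\naxb\n' | awk '$0 ~ /a\.b/ { print "re:", $0 }'

# The same RE held in a string: the string syntax consumes one level of
# backslashes first, so the backslash is doubled in order to reach the
# RE parser as \. and not as a bare dot.
printf 'a.b\naxb\n' | awk '$0 ~ "a\\.b" { print "str:", $0 }'
```

Both commands should print only the "a.b" line. Writing the string form
with a single backslash ("a\.b") is the non-portable case: what the
string parser does with an unrecognized escape differs between
implementations.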
I believe the mistake that triggered all of this was in assuming that
"gawk" can be used as an interpreter for a portable AWK language
script. It cannot. GAWK in its native mode is not AWK compatible.
GAWK has this glaring difference:
    The escape sequences may also be used inside constant regular
    expressions (e.g., /[ \t\f\n\r\v]/ matches whitespace characters).
In true AWK regular expressions are pure ("a `\' followed by any other
character (matching that character taken as an ordinary character, as
if the `\' had not been present)") and they are not cross-contaminated
by C-like syntax in the way that GAWK's are.
I don't know if GAWK's so-called "compatibility" mode corrects this
difference or not.
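One way to find out is simply to ask the implementation at hand. The
following is only a sketch: probe_re_escape is a hypothetical helper
name, and the test is limited to the \t escape. It makes no claim about
what any particular awk will answer.

```sh
# Hypothetical helper: run the given awk command line and report whether
# \t inside a constant regular expression matched a real tab character.
probe_re_escape() {
    if printf 'a\tb\n' | "$@" '/a\tb/ { found = 1 } END { exit !found }'
    then
        printf '%s: tab escape in /re/ honored\n' "$*"
    else
        printf '%s: tab escape in /re/ not honored\n' "$*"
    fi
}

probe_re_escape awk

# Where gawk is installed, its documented --traditional and --posix
# options can be probed the same way:
#   probe_re_escape gawk --traditional
#   probe_re_escape gawk --posix
```

The helper passes its arguments through as the awk command, so any
option an implementation offers for a compatibility mode can be tried
without editing the function.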
I think the GAWK people had far too much influence on the POSIX AWK
standardization, perhaps sadly because GAWK was one of the few
competing alternative (and open) implementations at the time the
standard was written. Perhaps this ambiguity in the POSIX AWK
standard was also due to the lack of an earlier firm RE standard which
GAWK could have adhered to and which POSIX AWK could have referenced,
i.e. one which would have disallowed C-like character escapes in pure
REs. GAWK is certainly the odd one out here and now.
On 24-Jun-08, at 10:32 AM, Valeriy E. Ushakov wrote:
> After successfully alienating and antagonizing your audience, don't be
> surprised people are not interested in hearing whatever rational
> argument you might actually have there.
Thanks! :-)
--
Greg A. Woods; Planix, Inc.
<woods%planix.ca@localhost>