Source-Changes archive


CVS commit: othersrc/external/bsd/agcre/dist



Module Name:    othersrc
Committed By:   agc
Date:           Wed Aug 16 23:38:35 UTC 2017

Added Files:
        othersrc/external/bsd/agcre/dist: internal.h

Log Message:
Just what this world needs - another regexp library. However, for
something I was doing, I needed a regexp library in C, BSD-licensed,
and able to be exposed to a wide range of expressions, some better
controlled than others.

The resulting library is libagcre, which implements regular expression
compilation and execution. It uses the Pike Virtual Machine approach,
and features:

+ standard POSIX features where sane
+ some/most Perl escapes
+ lazy matching via '?'
+ non-capturing parentheses (?:...)
+ in-expression case-insensitive directives are supported (?i)...(?-i)
+ all case-insensitivity is applied at expression execution time.
Case-insensitivity can be specified at expression compile time
and, if so, it will be remembered.  But the expression itself, once
compiled, can be used to match in both a case-sensitive and a
case-insensitive manner
+ utf8 is supported both for expressions and for input text when
matching
+ unicode escapes (in the Java format of \uABCD) are supported
+ exact multiple repetition specifiers {N}, and {N,M} are supported
+ backreferences are supported
+ utf16 (LE and BE) and utf32 (LE and BE) are supported, both for the
expression and for the input being searched
+ at the most basic level, individual 32-bit unicode characters are
matched
+ an egrep/grep implementation for matching unicode regexps
is included
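
As a rough illustration of what matching "individual 32-bit unicode
characters" implies for UTF-8 input, the sketch below decodes one
UTF-8 sequence into a 32-bit code point. It is not taken from
libagcre; the function name utf8_decode and its interface are
assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of libagcre): decode one UTF-8
 * sequence from s (at most n bytes long) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF.
 */
size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	uint32_t	c, min;
	size_t		len, i;

	assert(s != NULL && cp != NULL);
	if (n == 0) {
		return 0;
	}
	if (s[0] < 0x80) {
		/* plain ASCII, one byte */
		*cp = s[0];
		return 1;
	}
	if ((s[0] & 0xe0) == 0xc0) {
		len = 2; c = s[0] & 0x1f; min = 0x80;
	} else if ((s[0] & 0xf0) == 0xe0) {
		len = 3; c = s[0] & 0x0f; min = 0x800;
	} else if ((s[0] & 0xf8) == 0xf0) {
		len = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		/* stray continuation byte or invalid lead byte */
		return 0;
	}
	if (n < len) {
		return 0;
	}
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80) {
			return 0;
		}
		c = (c << 6) | (s[i] & 0x3f);
	}
	/* reject overlong forms, surrogates and out-of-range values */
	if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff)) {
		return 0;
	}
	*cp = c;
	return len;
}
```

A matcher built this way would call such a decoder once per input
position and compare the resulting code point, rather than comparing
raw bytes.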

A simple implementation of sets is used to provide inclusion and
exclusion information for unicode characters, which is taken directly
from unicode.org. No bitmasks are used - ranges are specified by
using an upper and a lower bound for the codepoints. Callbacks can
also be added to these sets, to provide functionality similar to
the ctype macros across the whole unicode character set.
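
The set representation described above could look something like
the following sketch. This is not the libagcre code: the type
cpset_t, the cpset_* functions, the fixed array limits and the
example callback are all hypothetical, chosen only to show ranges
held as lower/upper bounds, with callbacks as an extra membership
test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPSET_MAXRANGES	16	/* arbitrary limits for this sketch */
#define CPSET_MAXCBS	4

/* callback: returns non-zero if the code point belongs to the class */
typedef int (*cpset_cb_t)(uint32_t);

typedef struct cpset_t {
	uint32_t	lo[CPSET_MAXRANGES];	/* lower bounds */
	uint32_t	hi[CPSET_MAXRANGES];	/* upper bounds (inclusive) */
	size_t		rangec;
	cpset_cb_t	cb[CPSET_MAXCBS];
	size_t		cbc;
	int		negated;	/* exclusion set, as in [^...] */
} cpset_t;

void
cpset_init(cpset_t *set)
{
	set->rangec = 0;
	set->cbc = 0;
	set->negated = 0;
}

/* add the inclusive range lo..hi; returns 0 if the set is full */
int
cpset_add_range(cpset_t *set, uint32_t lo, uint32_t hi)
{
	assert(lo <= hi);
	if (set->rangec == CPSET_MAXRANGES) {
		return 0;
	}
	set->lo[set->rangec] = lo;
	set->hi[set->rangec] = hi;
	set->rangec++;
	return 1;
}

/* add a ctype-like callback; returns 0 if the set is full */
int
cpset_add_callback(cpset_t *set, cpset_cb_t cb)
{
	assert(cb != NULL);
	if (set->cbc == CPSET_MAXCBS) {
		return 0;
	}
	set->cb[set->cbc++] = cb;
	return 1;
}

/* membership test: ranges first, then callbacks, then negation */
int
cpset_contains(const cpset_t *set, uint32_t cp)
{
	size_t	i;
	int	found = 0;

	for (i = 0; !found && i < set->rangec; i++) {
		found = (cp >= set->lo[i] && cp <= set->hi[i]);
	}
	for (i = 0; !found && i < set->cbc; i++) {
		found = ((*set->cb[i])(cp) != 0);
	}
	return set->negated ? !found : found;
}

/* example callback: ASCII digits and Arabic-Indic digits only */
int
is_some_digit(uint32_t cp)
{
	return (cp >= '0' && cp <= '9') || (cp >= 0x660 && cp <= 0x669);
}
```

A linear scan is used here for brevity; with sorted, non-overlapping
ranges a binary search would be the obvious alternative.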

The standard regular expression basic3 torture test is passed with
4 known (and, I'd argue, incorrect) results flagged.  As expected,
the expression '(a?){9999}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa' matches
in linear time, as does the expression
'((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))'

        % time agcre '(a?){9999}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa' dist/tests/2.in
        aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
        0.063u 0.000s 0:00.06 100.0%    0+0k 0+0io 0pf+0w
        % time egrep '(a?){9999}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa' dist/tests/2.in
        ^C88.462u 0.730s 1:29.21 99.9%  0+0k 0+0io 0pf+0w
        %
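
To show why this style of matcher avoids the exponential blow-up
seen above, here is a minimal sketch of a Pike-style VM, following
the description in the article referenced in this message. It is
not the libagcre implementation: the instruction set, the names
inst_t and pikevm_match, and the fixed program limit are
assumptions for the example, and captures, character classes and
unicode are left out. It reports whether the program matches a
prefix of the input.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_CHAR, OP_SPLIT, OP_JMP, OP_MATCH };

typedef struct inst_t {
	int	op;
	int	c;	/* OP_CHAR: character to match */
	int	x;	/* OP_JMP, OP_SPLIT: first target */
	int	y;	/* OP_SPLIT: second target */
} inst_t;

#define MAXPROG	64	/* arbitrary limit for this sketch */

typedef struct threadlist_t {
	int	pc[MAXPROG];
	int	n;
} threadlist_t;

/*
 * Add a thread at pc, following jumps and splits.  The mark array
 * ensures each instruction is added at most once per input
 * position, which bounds the work per character by program size.
 */
static void
addthread(const inst_t *prog, threadlist_t *l, int *mark, int gen, int pc)
{
	if (mark[pc] == gen) {
		return;
	}
	mark[pc] = gen;
	switch (prog[pc].op) {
	case OP_JMP:
		addthread(prog, l, mark, gen, prog[pc].x);
		break;
	case OP_SPLIT:
		addthread(prog, l, mark, gen, prog[pc].x);
		addthread(prog, l, mark, gen, prog[pc].y);
		break;
	default:
		l->pc[l->n++] = pc;
		break;
	}
}

/* returns 1 if prog matches a prefix of input, 0 otherwise */
int
pikevm_match(const inst_t *prog, int proglen, const char *input)
{
	threadlist_t	 a, b, *clist = &a, *nlist = &b, *tmp;
	const char	*sp;
	int		 mark[MAXPROG] = { 0 };
	int		 gen = 1;
	int		 i;

	assert(proglen > 0 && proglen <= MAXPROG);
	clist->n = 0;
	addthread(prog, clist, mark, gen, 0);
	for (sp = input; ; sp++) {
		gen++;
		nlist->n = 0;
		/* run every live thread in lock step on this character */
		for (i = 0; i < clist->n; i++) {
			const inst_t *in = &prog[clist->pc[i]];

			if (in->op == OP_MATCH) {
				return 1;
			}
			if (in->op == OP_CHAR && *sp != '\0' &&
			    (unsigned char)*sp == in->c) {
				addthread(prog, nlist, mark, gen,
				    clist->pc[i] + 1);
			}
		}
		if (*sp == '\0') {
			break;
		}
		tmp = clist; clist = nlist; nlist = tmp;
	}
	return 0;
}
```

Because the thread list never holds more entries than there are
instructions, the running time of this sketch is bounded by the
input length times the program length, with no backtracking.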

The library and agcre utility have been run through valgrind to
confirm that there are no memory leaks.

In general, the emphasis is on a modern, predictable, VM-style,
well-featured regexp library, in C, with a BSD license. In
particular, sljit has not been used to speed up matching on certain
platforms; most Perl regexp features are supported, as are
backreferences, and UTF-8, UTF-16 and UTF-32.

Once again, I wouldn't expect anyone to use this as the main engine
in egrep. But I am always amazed at the uses for some of the things
that I write.

For more information about the Pike VM, and comparison to other
regexp implementations, please see:

        https://swtch.com/~rsc/regexp/regexp2.html

Alistair Crooks
Tue Aug 15 07:43:34 PDT 2017


To generate a diff of this commit:
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/agcre/dist/internal.h

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.



