Subject: Re: the state of regex(3)
To: NetBSD Userlevel Technical Discussion List <tech-userlevel@NetBSD.ORG>
From: Greg A. Woods <woods@weird.com>
List: tech-userlevel
Date: 01/02/2004 17:48:18
I had forgotten that I had a basic bit of test harness for doing simple
regex testing and benchmarking with the egrep implementation by James
Howard and Dag-Erling Smørgrav (which is just a wrapper around any POSIX
regex library). Remembering this prompted me to fetch and compile the
latest versions of the various libraries mentioned so far and give them
each a test run.
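For anyone who hasn't looked at what "a wrapper around any POSIX regex library" amounts to, here is a minimal sketch of the idea. It is not the actual grep source by Howard and Smørgrav; the helper names (compile_patterns, line_matches, free_patterns) are made up for illustration, and only the standard regcomp(), regexec() and regfree() calls from <regex.h> are assumed. Because nothing else is used, relinking against a different library that provides those same entry points should, in principle, swap the matcher without touching the caller.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Compile each pattern as a case-insensitive ERE, roughly what
 * "grep -E -i -e pat1 -e pat2 ..." needs.  Returns 0 on success, or the
 * regcomp() error code of the first failure, in which case every pattern
 * compiled so far has been freed again.
 * (Hypothetical helper, not taken from any real grep.)
 */
int compile_patterns(regex_t *res, const char *const *pats, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int rc = regcomp(&res[i], pats[i],
                         REG_EXTENDED | REG_ICASE | REG_NOSUB);
        if (rc != 0) {
            while (i > 0)
                regfree(&res[--i]);
            return rc;
        }
    }
    return 0;
}

/* Return 1 if any of the n compiled patterns matches line, else 0. */
int line_matches(const regex_t *res, size_t n, const char *line)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (regexec(&res[i], line, 0, NULL, 0) == 0)
            return 1;
    }
    return 0;
}

/* Release everything compile_patterns() set up. */
void free_patterns(regex_t *res, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        regfree(&res[i]);
}
```

A caller would compile the -e patterns once, run line_matches() on each input line, and call free_patterns() at exit. REG_NOSUB is used because a plain grep only needs a yes/no answer per line, not the match offsets.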
Here are the sizes of the statically linked grep binary built against
each of the libraries:
NetBSD i386 w/ GCC 2.95.3 nb3

library          text    data     bss     dec    hex  filename
NetBSD-regex   144646    8688   12856  166190  2892e  grep
pcre-4.5       153982    9776   12824  176582  2b1c6  grep
tre-0.6.4      156046    8944   12824  177814  2b696  grep
rx-1.5         156810   10512   13240  180562  2c152  grep
onig-20031224  200490   18320   12824  231634  388d2  grep
Unfortunately I get an immediate core dump from the Oniguruma library,
which looks to be a bug in its POSIX interface code.
For the rest, here are some timing results from the following silly test
I use to find obvious viruses in e-mail, as run across about 64MB of
accumulated virus e-mail. So far PCRE is the clear winner by a country
mile (roughly 20 times faster than the NetBSD regex in wall-clock time),
and TRE is well ahead of the remaining two. TRE will probably also
improve quite a bit more before there's a 1.x release of it. TRE has
become much more interesting in the latest release too -- it now has
true support for approximate pattern matching using real EREs (i.e. in
a manner vastly superior to the old agrep).
/usr/bin/time -l ./grep -D -E -i \
    -e 'The file was successfully deleted by RAV AntiVirus' \
    -e 'I send you this file in order to have your advice' \
    -e '^TV[nopqr][A-Z]...[AB]..A.A....*AAAA...*AAAA' \
    -e '^M35[GHIJK].`..`..*````' \
    -e '^[ ]*content-(disposition|type).*name[ ]*=[ ]*"?(.*\.(386|acm|ade|adp|app|asp|awx|ax|bas|bat|bin|cdf|chm|class|cmd|cnv|com|cpl|crt|csh|dll|dlo|doc|dot|drv|exe|flt|fot|hlp|hta|ini|inf|ins|isp|js|jse|lnk|mdb|mde|mod|msc|msi|msp|mst|nws|obj|ocx|olb|osd|ovl|pcd|pdr|pgm|pif|pkg|pot|ppt|pps|prg|reg|rpl|rtf|scr|script|sct|sh|sha|shtml|shs|swf|sys|tlb|tsp|ttf|vb|vlm|vxd|vxo|wiz|wll|wwk|pdr|url|vb|vbe|vbs|wsc|wsf|wsh|xla|xlb|xlc|xld|xlk|xll|xlm|xls|xlt|xlv|xlw|xnk))"?[ ]*$' \
    /mfbd/woods/virii > test.out
NetBSD-regex:
      192.77 real       191.70 user         0.04 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      2049  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         1  block output operations
         2  messages sent
         0  messages received
         0  signals received
         2  voluntary context switches
      2641  involuntary context switches

pcre-4.5:
        9.21 real         8.84 user         0.03 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      2046  page reclaims
         3  page faults
         0  swaps
         0  block input operations
         0  block output operations
        11  messages sent
         0  messages received
         0  signals received
        16  voluntary context switches
       145  involuntary context switches

tre-0.6.4:
       65.11 real        64.30 user         0.13 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      2344  page reclaims
         4  page faults
         0  swaps
         0  block input operations
         0  block output operations
        12  messages sent
         0  messages received
         0  signals received
        16  voluntary context switches
       942  involuntary context switches

rx-1.5:
      140.42 real       139.26 user         0.11 sys
         0  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
      3401  page reclaims
         4  page faults
         0  swaps
         0  block input operations
         0  block output operations
        12  messages sent
         0  messages received
         0  signals received
        16  voluntary context switches
      1994  involuntary context switches
FYI, those tests were run on a system with a 700MHz PIII CPU and 1GB of RAM.
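
To put rough numbers on the gaps, the wall-clock ("real") times reported above can be turned into speedup ratios relative to the NetBSD regex. The following is only a small sketch of that arithmetic; the struct and function names are invented for illustration, and the only data in it are the four "real" figures from the time(1) output above.

```c
#include <stdio.h>
#include <stddef.h>

/* One row per library: the "real" (wall-clock) seconds reported above. */
struct timing {
    const char *name;
    double real_s;
};

static const struct timing timings[] = {
    { "NetBSD-regex", 192.77 },   /* baseline, kept first */
    { "pcre-4.5",       9.21 },
    { "tre-0.6.4",     65.11 },
    { "rx-1.5",       140.42 },
};

static const size_t ntimings = sizeof(timings) / sizeof(timings[0]);

/* How many times faster "other" ran than the baseline. */
double speedup(double baseline_s, double other_s)
{
    return baseline_s / other_s;
}

/* Print each library's time and its speedup over the first entry. */
void print_speedups(FILE *out)
{
    size_t i;

    for (i = 0; i < ntimings; i++) {
        fprintf(out, "%-14s %7.2f s  %5.1fx\n",
                timings[i].name, timings[i].real_s,
                speedup(timings[0].real_s, timings[i].real_s));
    }
}
```

Calling print_speedups(stdout) gives ratios of roughly 21x for PCRE, 3x for TRE and 1.4x for rx against the NetBSD regex. These are single runs on one machine with one pattern set, so they should be read as rough indications only.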
--
Greg A. Woods
+1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>