Update to release version
git-svn-id: svn://bknr.net/svn/trunk/thirdparty/cl-ppcre@3601 4281704c-cde7-0310-8518-8e2dc76b1ff0
This commit is contained in:
@ -1,12 +1,13 @@
|
||||
Version 2.0.0
|
||||
2008-07-23
|
||||
2008-07-24
|
||||
Added named properties (\p{foo})
|
||||
Added Unicode support
|
||||
Introduced test functions for character classes
|
||||
Added optional test function optimization
|
||||
Cleaned up test suite, removed performance cruft
|
||||
Removed the various alternative system definitions (too much maintenance work)
|
||||
Exported PARSE-STRING
|
||||
Exported PARSE-STRING
|
||||
Changed default value of *USE-BMH-MATCHERS*
|
||||
General cleanup
|
||||
Lots of documentation additions
|
||||
|
||||
|
||||
177
doc/index.html
177
doc/index.html
@ -143,6 +143,7 @@ href="http://weitz.de/regex-coach/">The Regex Coach</a>.
|
||||
<li><a href="#backslash">Backslashes may confuse you...</a>
|
||||
</ol>
|
||||
<li><a href="#allegro">AllegroCL compatibility mode</a>
|
||||
<li><a href="#blabla">Hints, comments, performance considerations</a>
|
||||
<li><a href="#ack">Acknowledgements</a>
|
||||
</ol>
|
||||
|
||||
@ -223,8 +224,7 @@ they obviously don't make sense in Lisp.
|
||||
<li><code>\N{name}</code> (named characters), <code>\x{263a}</code>
|
||||
(wide hex characters), <code>\l</code>, <code>\u</code>,
|
||||
<code>\L</code>, and <code>\U</code>
|
||||
because they're actually not part of Perl's regex syntax and
|
||||
(honestly) because I was too lazy - but see <a href="http://weitz.de/cl-interpol/">CL-INTERPOL</a>.
|
||||
because they're actually not part of Perl's <em>regex</em> syntax - but see <a href="http://weitz.de/cl-interpol/">CL-INTERPOL</a>.
|
||||
|
||||
<li><code>\X</code> (extended Unicode), and <code>\C</code> (single
|
||||
character). But you can of course use all characters
|
||||
@ -1192,20 +1192,19 @@ play dirty tricks with implementation-dependent behaviour, though.</blockquote>
|
||||
<p><br>[Special variable]
|
||||
<br><a class=none name="*use-bmh-matchers*"><b>*use-bmh-matchers*</b></a>
|
||||
|
||||
<blockquote><br>Usually, the scanners created by <a
|
||||
href="#create-scanner"><code>CREATE-SCANNER</code></a> (or
|
||||
implicitly by other functions and macros) will use fast <a
|
||||
href="http://www-igm.univ-mlv.fr/~lecroq/string/node18.html">Boyer-Moore-Horspool
|
||||
matchers</a> to check for constant strings at the start or end of the
|
||||
regular expression. If <code>*USE-BMH-MATCHERS*</code> is
|
||||
<code>NIL</code> (the default is <code>T</code>), the standard
|
||||
function <a
|
||||
href="http://www.lispworks.com/documentation/HyperSpec/Body/f_search.htm"><code>SEARCH</code></a>
|
||||
will be used instead. This will usually be a bit slower but can save
|
||||
lots of space if you're storing many scanners. The <a
|
||||
href="#test">test suite</a> will automatically set
|
||||
<code>*USE-BMH-MATCHERS*</code> to <code>NIL</code> while you're running
|
||||
the default test.
|
||||
<blockquote><br>Usually, the scanners created
|
||||
by <a href="#create-scanner"><code>CREATE-SCANNER</code></a> (or
|
||||
implicitly by other functions and macros) will use the standard
|
||||
function <a href="http://www.lispworks.com/documentation/HyperSpec/Body/f_search.htm"><code>SEARCH</code></a>
|
||||
to check for constant strings at the start or end of the regular
|
||||
expression. If <code>*USE-BMH-MATCHERS*</code> is true (the default
|
||||
is <code>NIL</code>),
|
||||
fast <a href="http://www-igm.univ-mlv.fr/~lecroq/string/node18.html">Boyer-Moore-Horspool
|
||||
matchers</a> will be used instead. This will usually be faster but
|
||||
can make the scanners considerably bigger. Per BMH matcher - there
|
||||
can be up to two per scanner - a fixnum array of
|
||||
size <a href="#*regex-char-code-limit*"><code>*REGEX-CHAR-CODE-LIMIT*</code></a>
|
||||
is allocated and closed over.
|
||||
<p>
|
||||
Note: Due to the nature of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_ld_tim.htm"><code>LOAD-TIME-VALUE</code></a> and the <a
|
||||
href="#compiler-macro">compiler macro for <code>SCAN</code> and other functions</a>, some
|
||||
@ -1638,7 +1637,7 @@ convert a parse tree).
|
||||
<br> <br><h3><a name="unicode" class=none>Unicode properties</a></h3>
|
||||
|
||||
You can add support for Unicode properties to CL-PPCRE by loading
|
||||
the CL-PPCRE-UNICODE system:
|
||||
the CL-PPCRE-UNICODE system (which depends on <a href="http://weitz.de/cl-unicode/">CL-UNICODE</a>):
|
||||
<pre>
|
||||
(asdf:oos 'asdf:load-op :cl-ppcre-unicode)
|
||||
</pre>
|
||||
@ -2039,6 +2038,148 @@ To use the AllegroCL compatibility mode you have to
|
||||
</pre>
|
||||
<em>before</em> you compile CL-PPCRE.
|
||||
|
||||
<br> <br><h3><a class=none name="blabla">Hints, comments, performance considerations</a></h3>
|
||||
|
||||
Here are, in no particular order, a couple of things about CL-PPCRE
|
||||
and regular expressions in general that you might or might not want to
|
||||
read.
|
||||
|
||||
<ul>
|
||||
<li>A lot of hackers (especially users of Perl and other scripting
|
||||
languages) think that regular expressions are the greatest thing
|
||||
since slice bread and use it for almost everything. That is just
|
||||
plain wrong. Other hackers (especially Lispers) tend to think that
|
||||
regular expressions are the work of the devil and try to avoid them
|
||||
at all cost. That's also wrong. Regular expressions are a handy
|
||||
and useful addition to your toolkit which you should use when
|
||||
appropriate - you should just try to figure out first <em>if</em>
|
||||
they're appropriate for the task at hand.
|
||||
|
||||
<li>If you're concerned about the string syntax of regular
|
||||
expressions which can look like line noise and is really hard to
|
||||
read for long expressions, consider using
|
||||
CL-PPCRE's <a href="#create-scanner2">S-expression syntax</a>
|
||||
instead. It is less error-prone and you don't have to worry about
|
||||
escaping characters. It is also easier to manipulate
|
||||
programmatically.
|
||||
|
||||
<li>For alternations, order is important. The general rule is that
|
||||
the regex engine tries from left to right and tries to match as much
|
||||
as possible.
|
||||
<pre>
|
||||
CL-USER 1 > (scan-to-strings "<=|<" "<=")
|
||||
"<="
|
||||
#()
|
||||
|
||||
CL-USER 2 > (scan-to-strings "<|<=" "<=")
|
||||
"<"
|
||||
#()
|
||||
</pre>
|
||||
|
||||
<li><a class=none name="compiler-macro">CL-PPCRE</a>
|
||||
uses <a href="http://www.lispworks.com/documentation/HyperSpec/Body/03_bba.htm">compiler
|
||||
macros</a> to pre-compile scanners
|
||||
at <a href=="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_l.htm#load_time">load
|
||||
time</a> if possible. This happens if the compiler can determine
|
||||
that the regular expression (no matter if it's a string or an
|
||||
S-expression)
|
||||
is <a href="http://www.lispworks.com/documentation/HyperSpec/Body/f_consta.htm">constant</a>
|
||||
at <a href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_c.htm#compile_time">compile
|
||||
time</a> and is intended to save the time for creating scanners
|
||||
at <a href="http://www.lispworks.com/documentation/HyperSpec/Body/26_glo_e.htm#execution_time">execution
|
||||
time</a> (probably creating the same scanner over and over in a
|
||||
loop). Make sure you don't prevent the compiler from helping you.
|
||||
For example, a definition like this one is usually not a good idea:
|
||||
<pre>
|
||||
(defun regex-match (regex target)
|
||||
<font color=orange>;; don't do that!</font>
|
||||
(scan regex target))
|
||||
</pre>
|
||||
|
||||
<li>If you want to search for a substring in a large string or if
|
||||
you search for the same string very
|
||||
often, <a href="#scan"><code>SCAN</code></a> will usually be faster
|
||||
than Common
|
||||
Lisp's <a href="http://www.lispworks.com/documentation/HyperSpec/Body/f_search.htm"><code>SEARCH</code></a>
|
||||
if you <a href="#*use-bmh-matchers*">use BMH matchers</a>. However,
|
||||
this only makes sense if scanner creation time is not the
|
||||
limiting factor, i.e. if the search target is <em>very</em> large or
|
||||
if you're using the same scanner very often.
|
||||
|
||||
<li>Complementary to the last hint, <em>don't</em> use regular
|
||||
expressions for one-time searches for constant strings. That's a
|
||||
terrible waste of resources.
|
||||
|
||||
<li><a href="#*use-bmh-matchers*"><code>*USE-BMH-MATCHERS*</code></a> together with a large value for
|
||||
<a href="#*regex-char-code-limit*"><code>*REGEX-CHAR-CODE-LIMIT*</code></a>
|
||||
can lead to huge scanners.
|
||||
|
||||
<li>A character class is by default translated into a sequence of
|
||||
tests exactly as you might expect. For
|
||||
example, <code>"[af-l\\d]"</code> means to test if the character is
|
||||
equal to <code>#\a</code>, then to test if it's
|
||||
between <code>#\f</code> and <code>#\l</code>, then if it's a digit.
|
||||
There's by default no attempt to remove redundancy (as
|
||||
in <code>"[a-ge-kf]"</code>) or to otherwise optimize these tests
|
||||
for speed. However, you can play
|
||||
with <a href="#*optimize-char-classes*"><code>*OPTIMIZE-CHAR-CLASSES*</code></a>
|
||||
if you've identified character classes as bottleneck and want to
|
||||
make sure that you have <em>O(1)</em> test functions.
|
||||
|
||||
<li>If you know that the expression you're looking for is anchored,
|
||||
use anchors in your regex. This can help the engine a lot to make
|
||||
your scanners more efficient.
|
||||
|
||||
<li>In addition to anchors, constant strings at the start or end of a
|
||||
regular expression can help the engine to quickly scan a strang.
|
||||
Note that for example <code>"(a-d|aebf)"</code>
|
||||
and <code>"ab(cd|ef)"</code> are equivalent, but only the second
|
||||
form has a constant start the regex engine can recognize.
|
||||
|
||||
<li>Try to avoid alternations if possible or at least factor them
|
||||
out as in the example above.
|
||||
|
||||
<li>If neither anchors nor constant strings are in sight, maybe
|
||||
"standalone" (sometimes also called "possessive") regular
|
||||
expressions can be helpful. Try the following:
|
||||
<pre>
|
||||
(let ((target (make-string 10000 :initial-element #\a))
|
||||
(scanner-1 (create-scanner "a*\\d"))
|
||||
(scanner-2 (create-scanner "(?>a*)\\d")))
|
||||
(time (scan scanner-1 target))
|
||||
(time (scan scanner-2 target)))
|
||||
</pre>
|
||||
|
||||
<li>Consider using <a href="#create-scanner">"single-line mode"</a>
|
||||
if it makes sense for your task. By default (following Perl's
|
||||
practice), a dot means to search for any character <em>except</em>
|
||||
line breaks. In single-line mode a dot searches for <em>any</em>
|
||||
character which in some cases means that large parts of the target
|
||||
can actually be skipped. This can be vastly more efficient for
|
||||
large targets.
|
||||
|
||||
<li>Don't use capturing register groups where a non-capturing group
|
||||
would do, i.e. <em>only</em> use registers if you need to refer to
|
||||
them later. If you use a register, each scan process needs to
|
||||
allocate space for it and update its contents (possibly many times)
|
||||
until it's finished. (In Perl parlance - use <code>"(?:foo)"</code> instead of
|
||||
<code>"(foo)"</code> whenever possible.)
|
||||
|
||||
<li>In addition to what has been said in the last hint, note that
|
||||
Perl semantics force the regex engine to report the <em>last</em>
|
||||
match for each register. This implies for example
|
||||
that <code>"([a-c])+"</code> and <code>"[a-c]*([a-c])"</code> have
|
||||
exactly the same semantics but completely different performance
|
||||
characteristics. (Actually, in some cases CL-PPCRE automatically
|
||||
converts expressions from the first type into the second type.
|
||||
That's not always possible, though, and you shouldn't rely on it.)
|
||||
|
||||
<li>By default, repetitions are "greedy" in Perl (and thus in
|
||||
CL-PPCRE). This has an impact on performance and also on the actual
|
||||
outcome of a scan. Look at your repetitions and ponder if a greedy
|
||||
repetition is really what you want.
|
||||
</ul>
|
||||
|
||||
<br> <br><h3><a class=none name="ack">Acknowledgements</a></h3>
|
||||
|
||||
Although I didn't use their code, I was heavily inspired by looking at
|
||||
@ -2067,7 +2208,7 @@ me her PowerBook to test early versions of CL-PPCRE with MCL and
|
||||
OpenMCL.
|
||||
|
||||
<p>
|
||||
$Header: /usr/local/cvsrep/cl-ppcre/doc/index.html,v 1.191 2008/07/23 02:14:09 edi Exp $
|
||||
$Header: /usr/local/cvsrep/cl-ppcre/doc/index.html,v 1.195 2008/07/23 22:24:52 edi Exp $
|
||||
<p><a href="http://weitz.de/index.html">BACK TO MY HOMEPAGE</a>
|
||||
|
||||
</body>
|
||||
|
||||
42
scanner.lisp
42
scanner.lisp
@ -1,5 +1,5 @@
|
||||
;;; -*- Mode: LISP; Syntax: COMMON-LISP; Package: CL-PPCRE; Base: 10 -*-
|
||||
;;; $Header: /usr/local/cvsrep/cl-ppcre/scanner.lisp,v 1.34 2008/07/06 18:12:05 edi Exp $
|
||||
;;; $Header: /usr/local/cvsrep/cl-ppcre/scanner.lisp,v 1.35 2008/07/23 22:25:15 edi Exp $
|
||||
|
||||
;;; Here the scanner for the actual regex as well as utility scanners
|
||||
;;; for the constant start and end strings are created.
|
||||
@ -36,21 +36,21 @@
|
||||
"Auxiliary macro used by CREATE-BMH-MATCHER."
|
||||
(let ((char-compare (if case-insensitive-p 'char-equal 'char=)))
|
||||
`(lambda (start-pos)
|
||||
(declare (fixnum start-pos))
|
||||
(if (or (minusp start-pos)
|
||||
(> (the fixnum (+ start-pos m)) *end-pos*))
|
||||
nil
|
||||
(loop named bmh-matcher
|
||||
for k of-type fixnum = (+ start-pos m -1)
|
||||
then (+ k (max 1 (aref skip (char-code (schar *string* k)))))
|
||||
while (< k *end-pos*)
|
||||
do (loop for j of-type fixnum downfrom (1- m)
|
||||
for i of-type fixnum downfrom k
|
||||
while (and (>= j 0)
|
||||
(,char-compare (schar *string* i)
|
||||
(schar pattern j)))
|
||||
finally (if (minusp j)
|
||||
(return-from bmh-matcher (1+ i)))))))))
|
||||
(declare (fixnum start-pos))
|
||||
(if (or (minusp start-pos)
|
||||
(> (the fixnum (+ start-pos m)) *end-pos*))
|
||||
nil
|
||||
(loop named bmh-matcher
|
||||
for k of-type fixnum = (+ start-pos m -1)
|
||||
then (+ k (max 1 (aref skip (char-code (schar *string* k)))))
|
||||
while (< k *end-pos*)
|
||||
do (loop for j of-type fixnum downfrom (1- m)
|
||||
for i of-type fixnum downfrom k
|
||||
while (and (>= j 0)
|
||||
(,char-compare (schar *string* i)
|
||||
(schar pattern j)))
|
||||
finally (if (minusp j)
|
||||
(return-from bmh-matcher (1+ i)))))))))
|
||||
|
||||
(defun create-bmh-matcher (pattern case-insensitive-p)
|
||||
"Returns a Boyer-Moore-Horspool matcher which searches the (special)
|
||||
@ -76,15 +76,15 @@ instead. \(BMH matchers are faster but need much more space.)"
|
||||
:test test))))))
|
||||
(let* ((m (length pattern))
|
||||
(skip (make-array *regex-char-code-limit*
|
||||
:element-type 'fixnum
|
||||
:initial-element m)))
|
||||
:element-type 'fixnum
|
||||
:initial-element m)))
|
||||
(declare (fixnum m))
|
||||
(loop for k of-type fixnum below m
|
||||
if case-insensitive-p
|
||||
do (setf (aref skip (char-code (char-upcase (schar pattern k)))) (- m k 1)
|
||||
(aref skip (char-code (char-downcase (schar pattern k)))) (- m k 1))
|
||||
do (setf (aref skip (char-code (char-upcase (schar pattern k)))) (- m k 1)
|
||||
(aref skip (char-code (char-downcase (schar pattern k)))) (- m k 1))
|
||||
else
|
||||
do (setf (aref skip (char-code (schar pattern k))) (- m k 1)))
|
||||
do (setf (aref skip (char-code (schar pattern k))) (- m k 1)))
|
||||
(if case-insensitive-p
|
||||
(bmh-matcher-aux :case-insensitive-p t)
|
||||
(bmh-matcher-aux))))
|
||||
|
||||
@ -1,5 +1,5 @@
|
||||
;;; -*- Mode: LISP; Syntax: COMMON-LISP; Package: CL-PPCRE; Base: 10 -*-
|
||||
;;; $Header: /usr/local/cvsrep/cl-ppcre/specials.lisp,v 1.40 2008/07/23 02:14:06 edi Exp $
|
||||
;;; $Header: /usr/local/cvsrep/cl-ppcre/specials.lisp,v 1.41 2008/07/23 22:25:15 edi Exp $
|
||||
|
||||
;;; globally declared special variables
|
||||
|
||||
@ -120,7 +120,7 @@ where we saw repetitive patterns.
|
||||
Only used for patterns which might have zero length.")
|
||||
(declaim (simple-vector *last-pos-stores*))
|
||||
|
||||
(defvar *use-bmh-matchers* t
|
||||
(defvar *use-bmh-matchers* nil
|
||||
"Whether the scanners created by CREATE-SCANNER should use the \(fast
|
||||
but large) Boyer-Moore-Horspool matchers.")
|
||||
|
||||
|
||||
Reference in New Issue
Block a user