diff --git a/CHANGELOG b/CHANGELOG index 1091b19..788f563 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,12 +1,13 @@ Version 2.0.0 -2008-07-23 +2008-07-24 Added named properties (\p{foo}) Added Unicode support Introduced test functions for character classes Added optional test function optimization Cleaned up test suite, removed performance cruft Removed the various alternative system definitions (too much maintenance work) -Exported PARSE-STRING +Exported PARSE-STRING +Changed default value of *USE-BMH-MATCHERS* General cleanup Lots of documentation additions diff --git a/doc/index.html b/doc/index.html index 3742814..52f450f 100644 --- a/doc/index.html +++ b/doc/index.html @@ -143,6 +143,7 @@ href="http://weitz.de/regex-coach/">The Regex Coach.
  • Backslashes may confuse you...
  • AllegroCL compatibility mode +
  • Hints, comments, performance considerations
  • Acknowledgements @@ -223,8 +224,7 @@ they obviously don't make sense in Lisp.
  • \N{name} (named characters), \x{263a} (wide hex characters), \l, \u, \L, and \U -because they're actually not part of Perl's regex syntax and -(honestly) because I was too lazy - but see CL-INTERPOL. +because they're actually not part of Perl's regex syntax - but see CL-INTERPOL.
  • \X (extended Unicode), and \C (single character). But you can of course use all characters @@ -1192,20 +1192,19 @@ play dirty tricks with implementation-dependent behaviour, though.


    [Special variable]
    *use-bmh-matchers* -


    Usually, the scanners created by CREATE-SCANNER (or -implicitly by other functions and macros) will use fast Boyer-Moore-Horspool -matchers to check for constant strings at the start or end of the -regular expression. If *USE-BMH-MATCHERS* is -NIL (the default is T), the standard -function SEARCH -will be used instead. This will usually be a bit slower but can save -lots of space if you're storing many scanners. The test suite will automatically set -*USE-BMH-MATCHERS* to NIL while you're running -the default test. +

    Usually, the scanners created +by CREATE-SCANNER (or +implicitly by other functions and macros) will use the standard +function SEARCH +to check for constant strings at the start or end of the regular +expression. If *USE-BMH-MATCHERS* is true (the default +is NIL), +fast Boyer-Moore-Horspool +matchers will be used instead. This will usually be faster but +can make the scanners considerably bigger. Per BMH matcher - there +can be up to two per scanner - a fixnum array of +size *REGEX-CHAR-CODE-LIMIT* +is allocated and closed over.
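As a minimal sketch of the trade-off described above (an illustration, not part of this patch): the variable is consulted when a scanner is created, so one way to opt in to BMH matchers for a single scanner is to bind it around an explicit CREATE-SCANNER call. The name *NEEDLE-SCANNER* and the sample strings are made up for this example, and it assumes that ASDF can find CL-PPCRE and that the forms are evaluated one after another, e.g. at the REPL.

```lisp
;; Assumption: CL-PPCRE is installed where ASDF can find it.
(asdf:load-system :cl-ppcre)

;; Hypothetical scanner: the binding is only in effect while the
;; scanner is being created, so other scanners keep the new default.
(defparameter *needle-scanner*
  (let ((cl-ppcre:*use-bmh-matchers* t))
    (cl-ppcre:create-scanner "needle")))

;; The result is the same with or without BMH matchers; only the
;; speed and the size of the scanner differ.
(cl-ppcre:scan *needle-scanner* "hay needle hay")
```

Binding the variable instead of setting it globally keeps the larger scanners limited to the places where the extra speed is worth the space.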

    Note: Due to the nature of LOAD-TIME-VALUE and the compiler macro for SCAN and other functions, some @@ -1638,7 +1637,7 @@ convert a parse tree).
     

    Unicode properties

    You can add support for Unicode properties to CL-PPCRE by loading -the CL-PPCRE-UNICODE system: +the CL-PPCRE-UNICODE system (which depends on CL-UNICODE):
     (asdf:oos 'asdf:load-op :cl-ppcre-unicode)
     
    @@ -2039,6 +2038,148 @@ To use the AllegroCL compatibility mode you have to before you compile CL-PPCRE. +
     

    Hints, comments, performance considerations

    + +Here are, in no particular order, a couple of things about CL-PPCRE +and regular expressions in general that you might or might not want to +read. + +
      +
    • A lot of hackers (especially users of Perl and other scripting + languages) think that regular expressions are the greatest thing + since sliced bread and use them for almost everything. That is just + plain wrong. Other hackers (especially Lispers) tend to think that + regular expressions are the work of the devil and try to avoid them + at all cost. That's also wrong. Regular expressions are a handy + and useful addition to your toolkit which you should use when + appropriate - you should just try to figure out first if + they're appropriate for the task at hand. + +
    • If you're concerned about the string syntax of regular + expressions which can look like line noise and is really hard to + read for long expressions, consider using + CL-PPCRE's S-expression syntax + instead. It is less error-prone and you don't have to worry about + escaping characters. It is also easier to manipulate + programmatically. + +
    • For alternations, order is important. The general rule is that + the regex engine tries the alternatives from left to right and + takes the first one that leads to a match, even if a later + alternative would match more. +
      +CL-USER 1 > (scan-to-strings "<=|<" "<=")
      +"<="
      +#()
      +
      +CL-USER 2 > (scan-to-strings "<|<=" "<=")
      +"<"
      +#()
      +
      + +
    • CL-PPCRE + uses compiler + macros to pre-compile scanners + at load + time if possible. This happens if the compiler can determine + that the regular expression (no matter if it's a string or an + S-expression) + is constant + at compile + time and is intended to save the time for creating scanners + at execution + time (probably creating the same scanner over and over in a + loop). Make sure you don't prevent the compiler from helping you. + For example, a definition like this one is usually not a good idea: +
      +(defun regex-match (regex target)
      +  ;; don't do that!
      +  (scan regex target))
      +
      + +
    • If you want to search for a substring in a large string or if + you search for the same string very + often, SCAN will usually be faster + than Common + Lisp's SEARCH + if you use BMH matchers. However, + this only makes sense if scanner creation time is not the + limiting factor, i.e. if the search target is very large or + if you're using the same scanner very often. + +
    • Complementary to the last hint, don't use regular + expressions for one-time searches for constant strings. That's a + terrible waste of resources. + +
    • *USE-BMH-MATCHERS* together with a large value for + *REGEX-CHAR-CODE-LIMIT* + can lead to huge scanners. + +
    • A character class is by default translated into a sequence of + tests exactly as you might expect. For + example, "[af-l\\d]" means to test if the character is + equal to #\a, then to test if it's + between #\f and #\l, then if it's a digit. + There's by default no attempt to remove redundancy (as + in "[a-ge-kf]") or to otherwise optimize these tests + for speed. However, you can play + with *OPTIMIZE-CHAR-CLASSES* + if you've identified character classes as a bottleneck and want to + make sure that you have O(1) test functions. + +
    • If you know that the expression you're looking for is anchored, + use anchors in your regex. This can help the engine a lot to make + your scanners more efficient. + +
    • In addition to anchors, constant strings at the start or end of a + regular expression can help the engine to quickly scan a string. + Note that for example "(abcd|abef)" + and "ab(cd|ef)" are equivalent, but only the second + form has a constant start the regex engine can recognize. + +
    • Try to avoid alternations if possible or at least factor them + out as in the example above. + +
    • If neither anchors nor constant strings are in sight, maybe + "standalone" (sometimes also called "possessive") regular + expressions can be helpful. Try the following: +
      +(let ((target (make-string 10000 :initial-element #\a))
      +      (scanner-1 (create-scanner "a*\\d"))
      +      (scanner-2 (create-scanner "(?>a*)\\d")))
      +  (time (scan scanner-1 target))
      +  (time (scan scanner-2 target)))
      +
      + +
    • Consider using "single-line mode" + if it makes sense for your task. By default (following Perl's + practice), a dot means to search for any character except + line breaks. In single-line mode a dot searches for any + character which in some cases means that large parts of the target + can actually be skipped. This can be vastly more efficient for + large targets. + +
    • Don't use capturing register groups where a non-capturing group + would do, i.e. only use registers if you need to refer to + them later. If you use a register, each scan process needs to + allocate space for it and update its contents (possibly many times) + until it's finished. (In Perl parlance - use "(?:foo)" instead of + "(foo)" whenever possible.) + +
    • In addition to what has been said in the last hint, note that + Perl semantics force the regex engine to report the last + match for each register. This implies for example + that "([a-c])+" and "[a-c]*([a-c])" have + exactly the same semantics but completely different performance + characteristics. (Actually, in some cases CL-PPCRE automatically + converts expressions from the first type into the second type. + That's not always possible, though, and you shouldn't rely on it.) + +
    • By default, repetitions are "greedy" in Perl (and thus in + CL-PPCRE). This has an impact on performance and also on the actual + outcome of a scan. Look at your repetitions and ponder if a greedy + repetition is really what you want. +
    +
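A few of the hints above, sketched as code. This is an illustration under assumptions rather than a benchmark: the names *WORD-SCANNER* and COUNT-MATCHING-LINES are hypothetical, and CL-PPCRE is assumed to be loadable through ASDF. The scanner is created once instead of on every call, the regex is anchored, the group is non-capturing because its contents are never needed, and the regex is written in S-expression syntax.

```lisp
;; Assumption: CL-PPCRE is installed where ASDF can find it.
(asdf:load-system :cl-ppcre)

;; Created once when this form is evaluated, not on every call.
;; Intended to correspond to the string "^(?:foo|bar)\\d+".
(defparameter *word-scanner*
  (cl-ppcre:create-scanner
   '(:sequence :start-anchor
               (:group (:alternation "foo" "bar"))
               (:greedy-repetition 1 nil :digit-class))))

(defun count-matching-lines (lines)
  "Counts the strings in LINES that match *WORD-SCANNER*."
  (count-if (lambda (line) (cl-ppcre:scan *word-scanner* line))
            lines))

;; Only "foo42" and "bar7" start with foo or bar followed by digits.
(count-matching-lines '("foo42" "bar7" "baz1" "xfoo1"))
```

Keeping the scanner in a variable does by hand what the compiler macros do for constant regular expressions, which can be useful when the regex is only known at load time.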
     

    Acknowledgements

    Although I didn't use their code, I was heavily inspired by looking at @@ -2067,7 +2208,7 @@ me her PowerBook to test early versions of CL-PPCRE with MCL and OpenMCL.

    -$Header: /usr/local/cvsrep/cl-ppcre/doc/index.html,v 1.191 2008/07/23 02:14:09 edi Exp $ +$Header: /usr/local/cvsrep/cl-ppcre/doc/index.html,v 1.195 2008/07/23 22:24:52 edi Exp $

    BACK TO MY HOMEPAGE diff --git a/scanner.lisp b/scanner.lisp index d899bf1..342e03d 100644 --- a/scanner.lisp +++ b/scanner.lisp @@ -1,5 +1,5 @@ ;;; -*- Mode: LISP; Syntax: COMMON-LISP; Package: CL-PPCRE; Base: 10 -*- -;;; $Header: /usr/local/cvsrep/cl-ppcre/scanner.lisp,v 1.34 2008/07/06 18:12:05 edi Exp $ +;;; $Header: /usr/local/cvsrep/cl-ppcre/scanner.lisp,v 1.35 2008/07/23 22:25:15 edi Exp $ ;;; Here the scanner for the actual regex as well as utility scanners ;;; for the constant start and end strings are created. @@ -36,21 +36,21 @@ "Auxiliary macro used by CREATE-BMH-MATCHER." (let ((char-compare (if case-insensitive-p 'char-equal 'char=))) `(lambda (start-pos) - (declare (fixnum start-pos)) - (if (or (minusp start-pos) - (> (the fixnum (+ start-pos m)) *end-pos*)) - nil - (loop named bmh-matcher - for k of-type fixnum = (+ start-pos m -1) - then (+ k (max 1 (aref skip (char-code (schar *string* k))))) - while (< k *end-pos*) - do (loop for j of-type fixnum downfrom (1- m) - for i of-type fixnum downfrom k - while (and (>= j 0) - (,char-compare (schar *string* i) - (schar pattern j))) - finally (if (minusp j) - (return-from bmh-matcher (1+ i))))))))) + (declare (fixnum start-pos)) + (if (or (minusp start-pos) + (> (the fixnum (+ start-pos m)) *end-pos*)) + nil + (loop named bmh-matcher + for k of-type fixnum = (+ start-pos m -1) + then (+ k (max 1 (aref skip (char-code (schar *string* k))))) + while (< k *end-pos*) + do (loop for j of-type fixnum downfrom (1- m) + for i of-type fixnum downfrom k + while (and (>= j 0) + (,char-compare (schar *string* i) + (schar pattern j))) + finally (if (minusp j) + (return-from bmh-matcher (1+ i))))))))) (defun create-bmh-matcher (pattern case-insensitive-p) "Returns a Boyer-Moore-Horspool matcher which searches the (special) @@ -76,15 +76,15 @@ instead. 
\(BMH matchers are faster but need much more space.)" :test test)))))) (let* ((m (length pattern)) (skip (make-array *regex-char-code-limit* - :element-type 'fixnum - :initial-element m))) + :element-type 'fixnum + :initial-element m))) (declare (fixnum m)) (loop for k of-type fixnum below m if case-insensitive-p - do (setf (aref skip (char-code (char-upcase (schar pattern k)))) (- m k 1) - (aref skip (char-code (char-downcase (schar pattern k)))) (- m k 1)) + do (setf (aref skip (char-code (char-upcase (schar pattern k)))) (- m k 1) + (aref skip (char-code (char-downcase (schar pattern k)))) (- m k 1)) else - do (setf (aref skip (char-code (schar pattern k))) (- m k 1))) + do (setf (aref skip (char-code (schar pattern k))) (- m k 1))) (if case-insensitive-p (bmh-matcher-aux :case-insensitive-p t) (bmh-matcher-aux)))) diff --git a/specials.lisp b/specials.lisp index 6d8604c..f06cb8a 100644 --- a/specials.lisp +++ b/specials.lisp @@ -1,5 +1,5 @@ ;;; -*- Mode: LISP; Syntax: COMMON-LISP; Package: CL-PPCRE; Base: 10 -*- -;;; $Header: /usr/local/cvsrep/cl-ppcre/specials.lisp,v 1.40 2008/07/23 02:14:06 edi Exp $ +;;; $Header: /usr/local/cvsrep/cl-ppcre/specials.lisp,v 1.41 2008/07/23 22:25:15 edi Exp $ ;;; globally declared special variables @@ -120,7 +120,7 @@ where we saw repetitive patterns. Only used for patterns which might have zero length.") (declaim (simple-vector *last-pos-stores*)) -(defvar *use-bmh-matchers* t +(defvar *use-bmh-matchers* nil "Whether the scanners created by CREATE-SCANNER should use the \(fast but large) Boyer-Moore-Horspool matchers.")