PCRE 4.0
From: H.Merijn Brand
Date: February 25, 2003 01:08
Subject: PCRE 4.0
Message ID: 20030225100545.2B97.H.M.BRAND@hccnet.nl
FYI [ IMHO perl5 should also support point 12. ]
ChangeLog for PCRE (ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/)
------------------
Version 4.00 17-Feb-03
----------------------
1. If a comment in an extended regex that started immediately after a meta-item
extended to the end of string, PCRE compiled incorrect data. This could lead to
all kinds of weird effects. Example: /#/ was bad; /()#/ was bad; /a#/ was not.
2. Moved to autoconf 2.53 and libtool 1.4.2.
3. Perl 5.8 no longer needs "use utf8" for doing UTF-8 things. Consequently,
the special perltest8 script is no longer needed - all the tests can be run
from a single perltest script.
4. From 5.004, Perl has not included the VT character (0x0b) in the set defined
by \s. It has now been removed in PCRE. This means it isn't recognized as
whitespace in /x regexes either, which is the same as Perl. Note that the POSIX
class [:space:] *does* include VT, thereby creating a mess.
5. Added the class [:blank:] (a GNU extension from Perl 5.8) to match only
space and tab.
6. Perl 5.005 was a long time ago. It's time to amalgamate the tests that use
its new features into the main test script, reducing the number of scripts.
7. Perl 5.8 has changed the meaning of patterns like /a(?i)b/. Earlier versions
were backward compatible, and made the (?i) apply to the whole pattern, as if
/i were given. Now it behaves more logically, and applies the option setting
only to what follows. PCRE has been changed to follow suit. However, if it
finds options settings right at the start of the pattern, it extracts them into
the global options, as before. Thus, they show up in the info data.
8. Added support for the \Q...\E escape sequence. Characters in between are
treated as literals. This is slightly different from Perl in that $ and @ are
also handled as literals inside the quotes. In Perl, they will cause variable
interpolation. Note the following examples:
    Pattern            PCRE matches    Perl matches

    \Qabc$xyz\E        abc$xyz         abc followed by the contents of $xyz
    \Qabc\$xyz\E       abc\$xyz        abc\$xyz
    \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
For compatibility with Perl, \Q...\E sequences are recognized inside character
classes as well as outside them.
9. Re-organized 3 code statements in pcretest to avoid "overflow in
floating-point constant arithmetic" warnings from a Microsoft compiler. Added a
(size_t) cast to one statement in pcretest and one in pcreposix to avoid
signed/unsigned warnings.
10. SunOS4 doesn't have strtoul(). This was used only for unpicking the -o
option for pcretest, so I've replaced it by a simple function that does just
that job.
11. pcregrep was ending with code 0 instead of 2 for the commands "pcregrep" or
"pcregrep -".
12. Added "possessive quantifiers" ?+, *+, ++, and {,}+ which come from Sun's
Java package. This provides some syntactic sugar for simple cases of what my
documentation calls "once-only subpatterns". A pattern such as x*+ is the same
as (?>x*). In other words, if what is inside (?>...) is just a single repeated
item, you can use this simplified notation. Note that this makes sense only with
greedy quantifiers. Consequently, the use of the possessive quantifier forces
greediness, whatever the setting of the PCRE_UNGREEDY option.
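The equivalence described in item 12 can be tried outside PCRE. The sketch below uses Python's re module, on the assumption that it is Python 3.11 or later (the first release that accepts both the possessive and the (?>...) spellings). It illustrates the semantics only and is not PCRE code; the helper name compare() is made up for this example.

```python
import re
import sys


def compare(subject):
    """Return (greedy, possessive, atomic) full-match results for subject."""
    # Greedy x* can give one x back so that the trailing x still matches.
    greedy = re.fullmatch(r"x*x", subject) is not None
    # Possessive x*+ keeps everything it took, so the trailing x cannot match.
    possessive = re.fullmatch(r"x*+x", subject) is not None
    # (?>x*) is the "once-only subpattern" spelling of the same thing.
    atomic = re.fullmatch(r"(?>x*)x", subject) is not None
    return greedy, possessive, atomic


# Older Pythons reject the possessive and atomic syntax, so guard the demo.
if sys.version_info >= (3, 11):
    print(compare("xxx"))
```

On "xxx" only the plain greedy form succeeds; the possessive and once-only forms fail in the same way, which is the point of the shorthand.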
13. A change of greediness default within a pattern was not taking effect at
the current level for patterns like /(b+(?U)a+)/. It did apply to parenthesized
subpatterns that followed. Patterns like /b+(?U)a+/ worked because the option
was abstracted outside.
14. PCRE now supports the \G assertion. It is true when the current matching
position is at the start point of the match. This differs from \A when the
starting offset is non-zero. Used with the /g option of pcretest (or similar
code), it works in the same way as it does for Perl's /g option. If all
alternatives of a regex begin with \G, the expression is anchored to the start
match position, and the "anchored" flag is set in the compiled expression.
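Python's re module has no \G, but Pattern.match(subject, pos) anchors an attempt at pos in much the same way, so it can stand in for a leading \G to show how this differs from a start-of-subject anchor when the starting offset is non-zero. This is an analogy in another engine, not PCRE behaviour; TOKEN and tokens_from() are names invented for the sketch.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+")


def tokens_from(subject, pos=0):
    """Collect tokens that follow one another without a gap, starting at pos."""
    found = []
    while True:
        # match() only succeeds exactly at pos, like a pattern starting with \G.
        m = TOKEN.match(subject, pos)
        if m is None:
            break
        found.append(m.group())
        pos = m.end()
    return found


contiguous = tokens_from("12ab34 cd")

# '^' still means the real start of the subject, so it fails at offset 2.
at_offset = re.compile(r"^[a-z]+").match("12ab", 2)

print(contiguous, at_offset)
```

The loop stops at the space because the next attempt must begin exactly where the previous match ended.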
15. Some bugs concerning the handling of certain option changes within patterns
have been fixed. These applied to options other than (?ims). For example,
"a(?x: b c )d" did not match "XabcdY" but did match "Xa b c dY". It should have
been the other way round. Some of this was related to change 7 above.
16. PCRE now gives errors for /[.x.]/ and /[=x=]/ as unsupported POSIX
features, as Perl does. Previously, PCRE gave the warnings only for /[[.x.]]/
and /[[=x=]]/. PCRE now also gives an error for /[:name:]/ because it supports
POSIX classes only within a class (e.g. /[[:alpha:]]/).
17. Added support for Perl's \C escape. This matches one byte, even in UTF8
mode. Unlike ".", it always matches newline, whatever the setting of
PCRE_DOTALL. However, PCRE does not permit \C to appear in lookbehind
assertions. Perl allows it, but it doesn't (in general) work because it can't
calculate the length of the lookbehind. At least, that's the case for Perl
5.8.0 - I've been told they are going to document that it doesn't work in
future.
18. Added an error diagnosis for escapes that PCRE does not support: these are
\L, \l, \N, \P, \p, \U, \u, and \X.
19. Although correctly diagnosing a missing ']' in a character class, PCRE was
reading past the end of the pattern in cases such as /[abcd/.
20. PCRE was getting more memory than necessary for patterns with classes that
contained both POSIX named classes and other characters, e.g. /[[:space:]abc]/.
21. Added some code, conditional on #ifdef VPCOMPAT, to make life easier for
compiling PCRE for use with Virtual Pascal.
22. Small fix to the Makefile to make it work properly if the build is done
outside the source tree.
23. Added a new extension: a condition to go with recursion. If a conditional
subpattern starts with (?(R) the "true" branch is used if recursion has
happened, whereas the "false" branch is used only at the top level.
24. When there was a very long string of literal characters (over 255 bytes
without UTF support, over 250 bytes with UTF support), the computation of how
much memory was required could be incorrect, leading to segfaults or other
strange effects.
25. PCRE was incorrectly assuming anchoring (either to start of subject or to
start of line for a non-DOTALL pattern) when a pattern started with (.*) and
there was a subsequent back reference to those brackets. This meant that, for
example, /(.*)\d+\1/ failed to match "abc123bc". Unfortunately, it isn't
possible to check for precisely this case. All we can do is abandon the
optimization if .* occurs inside capturing brackets when there are any back
references whatsoever. (See below for a better fix that came later.)
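The example in item 25 is easy to check in any backtracking engine. The sketch below uses Python's re module purely as a stand-in: re.search() plays the part of a correct unanchored match, and re.match() shows what an engine sees if it wrongly tries offset 0 only.

```python
import re

PATTERN = r"(.*)\d+\1"
SUBJECT = "abc123bc"

# A correct search has to move on from offset 0 to find the match.
match = re.search(PATTERN, SUBJECT)
print(match.span(), match.group(1))

# Trying offset 0 only (what the faulty anchoring assumption amounts to)
# finds nothing, because no choice of group 1 starting there can be
# repeated after the digits.
anchored = re.match(PATTERN, SUBJECT)
print(anchored)
```

The only match starts at offset 1 with "bc" in the capturing brackets, so treating a leading (.*) as an implicit anchor loses it.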
26. The handling of the optimization for finding the first character of a
non-anchored pattern, and for finding a character that is required later in the
match were failing in some cases. This didn't break the matching; it just
failed to optimize when it could. The way this is done has been re-implemented.
27. Fixed typo in error message for invalid (?R item (it said "(?p").
28. Added a new feature that provides some of the functionality that Perl
provides with (?{...}). The facility is termed a "callout". The way it is done
in PCRE is for the caller to provide an optional function, by setting
pcre_callout to its entry point. Like pcre_malloc and pcre_free, this is a
global variable. By default it is unset, which disables all calling out. To get
the function called, the regex must include (?C) at appropriate points. This
is, in fact, equivalent to (?C0), and any number <= 255 may be given with (?C).
This provides a means of identifying different callout points. When PCRE
reaches such a point in the regex, if pcre_callout has been set, the external
function is called. It is provided with data in a structure called
pcre_callout_block, which is defined in pcre.h. If the function returns 0,
matching continues; if it returns a non-zero value, the match at the current
point fails. However, backtracking will occur if possible. [This was changed
later and other features added - see item 49 below.]
29. pcretest is upgraded to test the callout functionality. It provides a
callout function that displays information. By default, it shows the start of
the match and the current position in the text. There are some new data escapes
to vary what happens:
    \C+     in addition, show current contents of captured substrings
    \C-     do not supply a callout function
    \C!n    return 1 when callout number n is reached
    \C!n!m  return 1 when callout number n is reached for the mth time
30. If pcregrep was called with the -l option and just a single file name, it
output "<stdin>" if a match was found, instead of the file name.
31. Improve the efficiency of the POSIX API to PCRE. If the number of capturing
slots is less than POSIX_MALLOC_THRESHOLD, use a block on the stack to pass to
pcre_exec(). This saves a malloc/free per call. The default value of
POSIX_MALLOC_THRESHOLD is 10; it can be changed by --with-posix-malloc-threshold
when configuring.
32. The default maximum size of a compiled pattern is 64K. There have been a
few cases of people hitting this limit. The code now uses macros to handle the
storing of links as offsets within the compiled pattern. It defaults to 2-byte
links, but this can be changed to 3 or 4 bytes by --with-link-size when
configuring. Tests 2 and 5 work only with 2-byte links because they output
debugging information about compiled patterns.
33. Internal code re-arrangements:
(a) Moved the debugging function for printing out a compiled regex into
its own source file (printint.c) and used #include to pull it into
pcretest.c and, when DEBUG is defined, into pcre.c, instead of having two
separate copies.
(b) Defined the list of op-code names for debugging as a macro in
internal.h so that it is next to the definition of the opcodes.
(c) Defined a table of op-code lengths for simpler skipping along compiled
code. This is again a macro in internal.h so that it is next to the
definition of the opcodes.
34. Added support for recursive calls to individual subpatterns, along the
lines of Robin Houston's patch (but implemented somewhat differently).
35. Further mods to the Makefile to help Win32. Also, added code to pcregrep to
allow it to read and process whole directories in Win32. This code was
contributed by Lionel Fourquaux; it has not been tested by me.
36. Added support for named subpatterns. The Python syntax (?P<name>...) is
used to name a group. Names consist of alphanumerics and underscores, and must
be unique. Back references use the syntax (?P=name) and recursive calls use
(?P>name) which is a PCRE extension to the Python extension. Groups still have
numbers. The function pcre_fullinfo() can be used after compilation to extract
a name/number map. There are three relevant calls:
    PCRE_INFO_NAMEENTRYSIZE  yields the size of each entry in the map
    PCRE_INFO_NAMECOUNT      yields the number of entries
    PCRE_INFO_NAMETABLE      yields a pointer to the map.
The map is a vector of fixed-size entries. The size of each entry depends on
the length of the longest name used. The first two bytes of each entry are the
group number, most significant byte first. There follows the corresponding
name, zero terminated. The names are in alphabetical order.
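The map layout in item 36 can be sketched without the C library. The example below takes the group names and numbers from Python's re module (which uses the same (?P<name>...) syntax), lays them out as described above, and decodes them again. Both helpers are hypothetical and are not part of PCRE; padding short names with zero bytes is an assumption, since the text only says each name is zero terminated.

```python
import re


def build_name_table(groupindex):
    """Lay out a name/number map as item 36 describes (illustrative only)."""
    names = sorted(groupindex)                       # alphabetical order
    entry_size = 2 + max(len(n) for n in names) + 1  # number, longest name, NUL
    table = bytearray()
    for name in names:
        number = groupindex[name].to_bytes(2, "big")  # most significant byte first
        entry = number + name.encode("ascii") + b"\0"
        table += entry.ljust(entry_size, b"\0")       # assumed zero padding
    return bytes(table), entry_size, len(names)


def parse_name_table(table, entry_size, count):
    """Decode the fixed-size entries back into (name, number) pairs."""
    pairs = []
    for i in range(count):
        entry = table[i * entry_size:(i + 1) * entry_size]
        number = int.from_bytes(entry[:2], "big")
        name = entry[2:].split(b"\0", 1)[0].decode("ascii")
        pairs.append((name, number))
    return pairs


date = re.compile(r"(?P<year>\d{4})-(?P<month>\d\d)-(?P<day>\d\d)")
table, entry_size, count = build_name_table(date.groupindex)
print(parse_name_table(table, entry_size, count))
```

In C the three values would come from pcre_fullinfo() with the PCRE_INFO_NAMETABLE, PCRE_INFO_NAMEENTRYSIZE and PCRE_INFO_NAMECOUNT requests; only the decoding loop carries over.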
37. Make the maximum literal string in the compiled code 250 for the non-UTF-8
case instead of 255. Making it the same both with and without UTF-8 support
means that the same test output works with both.
38. There was a case of malloc(0) in the POSIX testing code in pcretest. Avoid
calling malloc() with a zero argument.
39. Change 25 above had to resort to a heavy-handed test for the .* anchoring
optimization. I've improved things by keeping a bitmap of backreferences with
numbers 1-31 so that if .* occurs inside capturing brackets that are not in
fact referenced, the optimization can be applied. It is unlikely that a
relevant occurrence of .* (i.e. one which might indicate anchoring or forcing
the match to follow \n) will appear inside brackets with a number greater than
31, but if it does, any back reference > 31 suppresses the optimization.
40. Added a new compile-time option PCRE_NO_AUTO_CAPTURE. This has the effect
of disabling numbered capturing parentheses. Any opening parenthesis that is
not followed by ? behaves as if it were followed by ?: but named parentheses
can still be used for capturing (and they will acquire numbers in the usual
way).
41. Redesigned the return codes from the match() function into yes/no/error so
that errors can be passed back from deep inside the nested calls. A malloc
failure while inside a recursive subpattern call now causes the
PCRE_ERROR_NOMEMORY return instead of quietly going wrong.
42. It is now possible to set a limit on the number of times the match()
function is called in a call to pcre_exec(). This facility makes it possible to
limit the amount of recursion and backtracking, though not in a directly
obvious way, because the match() function is used in a number of different
circumstances. The count starts from zero for each position in the subject
string (for non-anchored patterns). The default limit is, for compatibility, a
large number, namely 10 000 000. You can change this in two ways:
(a) When configuring PCRE before making, you can use --with-match-limit=n
to set a default value for the compiled library.
(b) For each call to pcre_exec(), you can pass a pcre_extra block in which
a different value is set. See 45 below.
If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
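To show why a cap on match() calls bounds backtracking, here is a toy model: a tiny recursive matcher that counts its own calls and gives up past a limit. It is not PCRE's match() function. It handles only literals, '.', and 'c*', it matches at the start of the subject only, and the names limited_match and MatchLimitExceeded are invented for the sketch.

```python
class MatchLimitExceeded(Exception):
    """Stands in for a PCRE_ERROR_MATCHLIMIT return in this toy model."""


def limited_match(pattern, subject, limit):
    """Match pattern at the start of subject; return (matched, calls)."""
    calls = 0

    def match(p, s):
        nonlocal calls
        calls += 1
        if calls > limit:
            raise MatchLimitExceeded(calls)
        if p == len(pattern):
            return True                      # whole pattern consumed

        def same(i):
            return i < len(subject) and pattern[p] in (".", subject[i])

        if p + 1 < len(pattern) and pattern[p + 1] == "*":
            end = s
            while same(end):                 # greedy: take as much as possible
                end += 1
            for stop in range(end, s - 1, -1):   # then back off one at a time
                if match(p + 2, stop):
                    return True
            return False
        return same(s) and match(p + 1, s + 1)

    matched = match(0, 0)
    return matched, calls


print(limited_match("abc", "abc", limit=100))
try:
    limited_match("a*a*a*a*b", "a" * 20, limit=1000)
except MatchLimitExceeded:
    print("gave up: match limit exceeded")
```

A simple match needs a handful of calls, while the pattern with four starred items and a final character that never matches runs past 1000 calls on just twenty characters of subject.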
43. Added a new function pcre_config(int, void *) to enable run-time extraction
of things that can be changed at compile time. The first argument specifies
what is wanted and the second points to where the information is to be placed.
The current list of available information is:
  PCRE_CONFIG_UTF8
    The output is an integer that is set to one if UTF-8 support is available;
    otherwise it is set to zero.

  PCRE_CONFIG_NEWLINE
    The output is an integer that is set to the value of the code that is used
    for newline. It is either LF (10) or CR (13).

  PCRE_CONFIG_LINK_SIZE
    The output is an integer that contains the number of bytes used for
    internal linkage in compiled expressions. The value is 2, 3, or 4. See
    item 32 above.

  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
    The output is an integer that contains the threshold above which the POSIX
    interface uses malloc() for output vectors. See item 31 above.

  PCRE_CONFIG_MATCH_LIMIT
    The output is an unsigned integer that contains the default limit of the
    number of match() calls in a pcre_exec() execution. See 42 above.
44. pcretest has been upgraded by the addition of the -C option. This causes it
to extract all the available output from the new pcre_config() function, and to
output it. The program then exits immediately.
45. A need has arisen to pass over additional data with calls to pcre_exec() in
order to support additional features. One way would have been to define
pcre_exec2() (for example) with extra arguments, but this would not have been
extensible, and would also have required all calls to the original function to
be mapped to the new one. Instead, I have chosen to extend the mechanism that
is used for passing in "extra" data from pcre_study().
The pcre_extra structure is now exposed and defined in pcre.h. It currently
contains the following fields:
    flags         a bitmap indicating which of the following fields are set
    study_data    opaque data from pcre_study()
    match_limit   a way of specifying a limit on match() calls for a specific
                  call to pcre_exec()
    callout_data  data for callouts (see 49 below)

The flag bits are also defined in pcre.h, and are

    PCRE_EXTRA_STUDY_DATA
    PCRE_EXTRA_MATCH_LIMIT
    PCRE_EXTRA_CALLOUT_DATA
The pcre_study() function now returns one of these new pcre_extra blocks, with
the actual study data pointed to by the study_data field, and the
PCRE_EXTRA_STUDY_DATA flag set. This can be passed directly to pcre_exec() as
before. That is, this change is entirely upwards-compatible and requires no
change to existing code.
If you want to pass in additional data to pcre_exec(), you can either place it
in a pcre_extra block provided by pcre_study(), or create your own pcre_extra
block.
46. pcretest has been extended to test the PCRE_EXTRA_MATCH_LIMIT feature. If a
data string contains the escape sequence \M, pcretest calls pcre_exec() several
times with different match limits, until it finds the minimum value needed for
pcre_exec() to complete. The value is then output. This can be instructive; for
most simple matches the number is quite small, but for pathological cases it
gets very large very quickly.
47. There's a new option for pcre_fullinfo() called PCRE_INFO_STUDYSIZE. It
returns the size of the data block pointed to by the study_data field in a
pcre_extra block, that is, the value that was passed as the argument to
pcre_malloc() when PCRE was getting memory in which to place the information
created by pcre_study(). The fourth argument should point to a size_t variable.
pcretest has been extended so that this information is shown after a successful
pcre_study() call when information about the compiled regex is being displayed.
48. Cosmetic change to Makefile: there's no need to have / after $(DESTDIR)
because what follows is always an absolute path. (Later: it turns out that this
is more than cosmetic for MinGW, because it doesn't like empty path
components.)
49. Some changes have been made to the callout feature (see 28 above):
(i) A callout function now has three choices for what it returns:
      0  => success, carry on matching
    > 0  => failure at this point, but backtrack if possible
    < 0  => serious error, return this value from pcre_exec()
Negative values should normally be chosen from the set of PCRE_ERROR_xxx
values. In particular, returning PCRE_ERROR_NOMATCH forces a standard
"match failed" error. The error number PCRE_ERROR_CALLOUT is reserved for
use by callout functions. It will never be used by PCRE itself.
(ii) The pcre_extra structure (see 45 above) has a void * field called
callout_data, with corresponding flag bit PCRE_EXTRA_CALLOUT_DATA. The
pcre_callout_block structure has a field of the same name. The contents of
the field passed in the pcre_extra structure are passed to the callout
function in the corresponding field in the callout block. This makes it
easier to use the same callout-containing regex from multiple threads. For
testing, the pcretest program has a new data escape
    \C*n  pass the number n (may be negative) as callout_data
If the callout function in pcretest receives a non-zero value as
callout_data, it returns that value.
50. Makefile wasn't handling CFLAGS properly when compiling dftables. Also,
there were some redundant $(CFLAGS) in commands that are now specified as
$(LINK), which already includes $(CFLAGS).
51. Extensions to UTF-8 support are listed below. These all apply when (a) PCRE
has been compiled with UTF-8 support *and* (b) pcre_compile() has been called
with the PCRE_UTF8 flag. Patterns that are compiled without that flag assume
one-byte characters throughout. Note that case-insensitive matching applies
only to characters whose values are less than 256. PCRE doesn't support the
notion of cases for higher-valued characters.
(i) A character class whose characters are all within 0-255 is handled as
a bit map, and the map is inverted for negative classes. Previously, a
character > 255 always failed to match such a class; however it should
match if the class was a negative one (e.g. [^ab]). This has been fixed.
(ii) A negated character class with a single character < 255 is coded as
"not this character" (OP_NOT). This wasn't working properly when the test
character was multibyte, either singly or repeated.
(iii) Repeats of multibyte characters are now handled correctly in UTF-8
mode, for example: \x{100}{2,3}.
(iv) The character escapes \b, \B, \d, \D, \s, \S, \w, and \W (either
singly or repeated) now correctly test multibyte characters. However,
PCRE doesn't recognize any characters with values greater than 255 as
digits, spaces, or word characters. Such characters always match \D, \S,
and \W, and never match \d, \s, or \w.
(v) Classes may now contain characters and character ranges with values
greater than 255. For example: [ab\x{100}-\x{400}].
(vi) pcregrep now has a --utf-8 option (synonym -u) which makes it call
PCRE in UTF-8 mode.
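Points (iii) and (v) of item 51 come down to treating a multibyte character as one unit. The sketch below shows the general hazard using Python's re module, comparing a character-level pattern with a naive byte-level one. It does not reproduce the original PCRE bug, whose exact symptoms the text does not describe.

```python
import re

char = "\u0100"                    # U+0100 takes two bytes in UTF-8
encoded = char.encode("utf-8")     # b'\xc4\x80'
subject = char * 2

# Character-aware matching: the quantifier repeats the whole character.
as_chars = re.fullmatch("\u0100{2,3}", subject)

# Naive byte-wise matching: {2,3} binds to the last byte only, so the
# pattern means 0xC4 followed by two or three 0x80 bytes.
as_bytes = re.fullmatch(encoded + b"{2,3}", subject.encode("utf-8"))

# A class containing a range of characters above 255.
in_class = re.fullmatch("[ab\u0100-\u0400]", "\u0250")

print(as_chars is not None, as_bytes is not None, in_class is not None)
```

The character-level pattern and the class both match, while the byte-level pattern does not, which is why repeats and classes need to know where character boundaries fall.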
52. The info request value PCRE_INFO_FIRSTCHAR has been renamed
PCRE_INFO_FIRSTBYTE because it is a byte value. However, the old name is
retained for backwards compatibility. (Note that LASTLITERAL is also a byte
value.)
53. The single man page has become too large. I have therefore split it up into
a number of separate man pages. These also give rise to individual HTML pages;
these are now put in a separate directory, and there is an index.html page that
lists them all. Some hyperlinking between the pages has been installed.
54. Added convenience functions for handling named capturing parentheses.
55. Unknown escapes inside character classes (e.g. [\M]) and escapes that
aren't interpreted therein (e.g. [\C]) are literals in Perl. This is now also
true in PCRE, except when the PCRE_EXTENDED option is set, in which case they
are faulted.
56. Introduced HOST_CC and HOST_CFLAGS which can be set in the environment when
calling configure. These values are used when compiling the dftables.c program
which is run to generate the source of the default character tables. They
default to the values of CC and CFLAGS. If you are cross-compiling PCRE,
you will need to set these values.
57. Updated the building process for Windows DLL, as provided by Fred Cox.
--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using perl-5.6.1, 5.8.0 & 633 on HP-UX 10.20 & 11.00, AIX 4.2, AIX 4.3,
WinNT 4, Win2K pro & WinCE 2.11. Smoking perl CORE: smokers@perl.org
http://archives.develooper.com/daily-build@perl.org/ perl-qa@perl.org
send smoke reports to: smokers-reports@perl.org, QA: http://qa.perl.org