
[perl #120041] regcomp.c missing parens and broken STCLASS

Karl Williamson via RT
September 29, 2014 20:19
On Tue Oct 01 02:05:19 2013, hv wrote:
> I'm still confused about the intent of this twice-repeated mantra:
>     if (! (ANYOF_FLAGS(data.start_class) & ANYOF_EMPTY_STRING)
>         && ! ssc_is_anything(data.start_class))
> Given that when the first clause is true, ssc_is_anything() immediately
> returns FALSE, isn't this in both cases the same as:
>     if (! (ANYOF_FLAGS(data.start_class) & ANYOF_EMPTY_STRING))
> ?
> I think there'd be value in adding some brief comments about the intent
> around these checks.
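
The equivalence hv describes can be checked with a small model. This is only a sketch: `model_ssc_is_anything()` is a hypothetical stand-in that assumes just the behaviour stated above (it returns false whenever the empty-string flag is clear), and the flag value and struct are illustrative, not taken from regcomp.c.

```c
#include <stdbool.h>

#define MODEL_EMPTY_STRING 0x01  /* hypothetical flag bit, not perl's value */

struct model_ssc {
    unsigned flags;
    bool matches_everything;  /* stand-in for the rest of the real test */
};

/* Assumed behaviour: return false at once when the flag is clear. */
static bool model_ssc_is_anything(const struct model_ssc *ssc)
{
    if (!(ssc->flags & MODEL_EMPTY_STRING))
        return false;
    return ssc->matches_everything;
}

/* The twice-repeated condition, as quoted. */
static bool long_form(const struct model_ssc *ssc)
{
    return !(ssc->flags & MODEL_EMPTY_STRING) && !model_ssc_is_anything(ssc);
}

/* The proposed simplification. */
static bool short_form(const struct model_ssc *ssc)
{
    return !(ssc->flags & MODEL_EMPTY_STRING);
}

/* Try every combination of the two inputs and count the cases
 * where the two forms give different answers. */
static int count_disagreements(void)
{
    int bad = 0;
    for (unsigned f = 0; f <= 1; f++) {
        for (int m = 0; m <= 1; m++) {
            struct model_ssc s = { f ? MODEL_EMPTY_STRING : 0u, m != 0 };
            if (long_form(&s) != short_form(&s))
                bad++;
        }
    }
    return bad;
}
```

Under that assumption `count_disagreements()` returns 0, so the second clause adds nothing in this model.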

This has now been changed by

commit b35552de5cea8eb47ccb046284ecb9a099430255
 Author: Karl Williamson <>
 Date:   Mon Sep 22 13:59:39 2014 -0600
     Tighten uses of regex synthetic start class
     A synthetic start class (SSC) is generated by the regular expression
     pattern compiler to give a consolidation of all the possible things that
     can match at the beginning of where a pattern can possibly match.
     For example
     requires the match to begin with either an 'a' or a 'b'.  There are no
     other possibilities.  We can set things up to quickly scan for either of
     these in the target string, and only when one of these is found do we
     need to look for 'foo'.
     There is an overhead associated with using SSCs.  If the number of
     possibilities that the SSC excludes is relatively small, it can be
     counter-productive to use them.
     This patch creates a crude sieve to decide whether to use an SSC or not.
     If the SSC doesn't exclude at least half the "likely" possibilities, it
     is discarded.  This patch is a starting point, and can be refined if
     necessary as we gain experience.
     See thread beginning with
     In many patterns, no SSC is generated; and with the advent of tries,
     SSC's have become less important, so whatever we do is not terribly

The code now reads

    if ((!(r->anchored_substr || r->anchored_utf8) || r->anchored_offset)
        && stclass_flag
        && ! (ANYOF_FLAGS(data.start_class) & SSC_MATCHES_EMPTY_STRING)
        && is_ssc_worth_it(pRExC_state, data.start_class))
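
To make the idea in the commit message concrete, here is a rough sketch. It is not the regcomp.c implementation. The start class is modelled as a 256-bit bitmap over byte values, and a hypothetical `model_ssc_worth_it()` keeps the class only when it excludes at least half of those values. The real `is_ssc_worth_it()` weighs "likely" possibilities, which are not defined in this message, so the threshold here is an assumption. A scan loop then skips to the first byte in the class, which is where a full match would be attempted. All names are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a start class: one bit per byte value. */
struct model_ssc {
    unsigned char bits[32];
};

static void model_ssc_add(struct model_ssc *ssc, unsigned char c)
{
    ssc->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

static bool model_ssc_has(const struct model_ssc *ssc, unsigned char c)
{
    return ((ssc->bits[c >> 3] >> (c & 7)) & 1u) != 0;
}

/* Count how many byte values the class accepts. */
static int model_ssc_count(const struct model_ssc *ssc)
{
    int n = 0;
    for (int c = 0; c < 256; c++)
        n += model_ssc_has(ssc, (unsigned char)c) ? 1 : 0;
    return n;
}

/* Crude sieve (assumed threshold): keep the class only if it
 * excludes at least half of the 256 byte values. */
static bool model_ssc_worth_it(const struct model_ssc *ssc)
{
    return model_ssc_count(ssc) <= 128;
}

/* Return the offset of the first byte of s that is in the class,
 * or -1 if there is none. A full match would only be tried there. */
static long model_ssc_scan(const struct model_ssc *ssc, const char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++) {
        if (model_ssc_has(ssc, (unsigned char)s[i]))
            return (long)i;
    }
    return -1;
}
```

With a zeroed class to which only 'a' and 'b' are added, the sieve keeps the class (254 of 256 values are excluded) and scanning "xyzbfoo" stops at offset 3. A class that accepts every byte excludes nothing and would be discarded.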
> I note also that the new test ends up applying a rather pessimal
> optimization:
> % ./perl -Ilib -Mre=debug -we '"" =~ /^A*\z/ or die;'
> Compiling REx "^A*\z"
> Final program:
>    1: BOL (2)
>    2: STAR (5)
>    3:   EXACT <A> (0)
>    5: EOS (6)
>    6: END (0)
> floating ""$ at 0..2147483647 (checking floating) anchored(BOL) minlen 0 
> Matching REx "^A*\z" against ""
> Found floating substr ""$ at offset 0...
> Guessed: match at offset 0
> [...]
> Hugo

Later in the thread we concluded that this optimisation was unchanged from before, and needs a different ticket.  I don't know if that ever got filed.
Karl Williamson

via perlbug:  queue: perl5 status: resolved