[perl #114356] REGEXPs have massive reference counts

From: Nicholas Clark
Date: August 1, 2012 02:49
Subject: [perl #114356] REGEXPs have massive reference counts
Message ID: rt-3.6.HEAD-11172-1343814578-509.114356-75-0@perl.org
# New Ticket Created by  Nicholas Clark 
# Please include the string:  [perl #114356]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=114356 >


At some point since perl 5.16.0, if one builds blead with -DDEBUGGING
and sets PERL_DESTRUCT_LEVEL=2 in the environment, mktables has started
taking "forever"* to run. Moreover, it seems that it "hangs" in global
destruction.

I've added a --timeout option to the bisect runner to make it easy to work
out when this started. It finds that it's this commit, merged as part of
Dave's fix of code blocks:

commit 9f141731d83a1ac6294a5580a5b11ff41490309a
Author: David Mitchell <davem@iabyn.com>
Date:   Fri Nov 4 10:12:20 2011 +0000

    Move bulk of pp_regcomp() into re_op_compile()
    
    When called, pp_regcomp() is presented with a list of SVs on the stack.
    Previously, it would perform (amongst other things):
      * overloading those SVs;
      * concatenating them;
      * detection of bare /$qr/;
      * detection of unchanged pattern;
    optionally followed by a call to the built-in or an external regexp
    compiler.
    
    Since we want to avoid premature concatenation (so that we can handle
    /$runtime(?{...})/), move all these activities from pp_regcomp() into
    re_op_compile().
    
    This makes re_op_compile() a bit cumbersome, with a large arg list,
    but I haven't found any way of only moving only a subset of the above.
    
    Note that a side-effect of this is that qr-overloading now works for all
    regex compilations, not just those reached via pp_regcomp(); in particular
    this now invokes the qr method rather than the "" method if available:
    /(??{ $overloaded_object })/


which seems crazy, but I checked, and it's true. It seems that after this
commit some SVs of type SVt_REGEXP have massively inflated reference counts,
and this results in Perl_sv_clean_all() being called tens of thousands of
times. Running mktables under gdb in 9f141731d83a1ac6^ I see this:

Creating Perl synonyms
Writing tables
Making pod file
Making test script
Updating 'mktables.lst'

Breakpoint 4, Perl_sv_clean_all () at sv.c:628
628         PL_in_clean_all = TRUE;
(gdb) finish
Run till exit from #0  Perl_sv_clean_all () at sv.c:628
0x0000000000407d78 in perl_destruct (my_perl=0x992010) at perl.c:1072
1072        while (sv_clean_all() > 2)
Value returned is $8 = 59215
(gdb) c
Continuing.

Breakpoint 4, Perl_sv_clean_all () at sv.c:628
628         PL_in_clean_all = TRUE;
(gdb) call S_visit(&Perl_sv_dump, SVt_REGEXP, 255)
$9 = 0
(gdb) 


Running it at 9f141731d83a1ac6 I get 472 lines of output (attached), which
start like this:

Creating Perl synonyms
Writing tables
Making pod file
Making test script
Updating 'mktables.lst'

Breakpoint 4, Perl_sv_clean_all () at sv.c:628
628         PL_in_clean_all = TRUE;
(gdb) finish
Run till exit from #0  Perl_sv_clean_all () at sv.c:628
0x0000000000407d78 in perl_destruct (my_perl=0x992010) at perl.c:1072
1072        while (sv_clean_all() > 2)
Value returned is $24 = 59353
(gdb) c
Continuing.

Breakpoint 4, Perl_sv_clean_all () at sv.c:628
628         PL_in_clean_all = TRUE;
(gdb) call S_visit(&Perl_sv_dump, SVt_REGEXP, 255)
SV = REGEXP(0x5678df0) at 0x57f3428
  REFCNT = 20
  FLAGS = (POK,FAKE,BREAK,pPOK)
  PV = 0x55f0580 "(?^aax:^ ( .{27}   # Don't look before the\n                                              #  indent.\n                        \\ *                   # Don't look in leading\n                                              #  blanks past the indent\n                            [^ ] .*           # Find the right-most\n                        (?:                   #  acceptable break:\n                            [ \\s = ]          # space or equal\n                            | - (?! [.0-9] )  # or non-unary minus.\n                        )                     # $1 includes the character\n                    ))"\0
  CUR = 604
  LEN = 608
  EXTFLAGS = 0x2000288 (PMf_EXTENDED,ANCH_BOL,COPY_DONE)
  INTFLAGS = 0x4
  NPARENS = 1
  LASTPAREN = 1
  LASTCLOSEPAREN = 1
  MINLEN = 29
  MINLENRET = 29
  GOFS = 0
  PRE_PREFIX = 7
  SEEN_EVALS = 0
  SUBLEN = 68
  SUBBEG = 0x56277a0 "  XPerlSpace              (Perl extension).  \\s, including beyond AS"
  ENGINE = 0x6f6a00
  MOTHER_RE = 0x0
  PAREN_NAMES = 0x0
  SUBSTRS = 0x5634930
  PPRIVATE = 0x563e4c0
  OFFS = 0x55db2f0
  QR_ANONCV = 0x0


Note that that regular expression seems to correspond to this code:

                # Otherwise fold at an acceptable break char closest to
                # the max length.  Look at just the maximal initial
                # segment of the line
                my $segment = substr($line[$i], 0, $max - 1);
                if ($segment =~
                    /^ ( .{$hanging_indent}   # Don't look before the
                                              #  indent.
                        \ *                   # Don't look in leading
                                              #  blanks past the indent
                            [^ ] .*           # Find the right-most
                        (?:                   #  acceptable break:
                            [ \s = ]          # space or equal
                            | - (?! [.0-9] )  # or non-unary minus.
                        )                     # $1 includes the character
                    )/x)

which is reached 8011 times, and matches 8001 times.

However, two very similar patterns seem to be present later, differing only
in .{27} being .{29} and .{0}, and in having different SUBLEN, SUBBEG,
SUBSTRS, PPRIVATE and OFFS, with reference counts of 1186 and 6799.
20 + 1186 + 6799 is 8005. Is that suspicious?
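
For anyone wanting something smaller than mktables to poke at, here is a
possible reduction: the same kind of runtime-interpolated /x pattern, run
in a loop with the three indent values seen in the dump (27, 29 and 0).
This is only a sketch. The input strings and the loop count of 1000 are
made up, and it has not been checked that this actually reproduces the
inflated reference counts.

```perl
use strict;
use warnings;

# Hypothetical reduction, NOT taken from mktables: only the shape of the
# pattern and the three indent values come from the report above.
my %matches;
for my $hanging_indent (27, 29, 0) {
    for my $n (1 .. 1000) {
        # Made-up input: an indent followed by some text with a break char.
        my $segment = (' ' x $hanging_indent) . "word$n = value-$n here";
        if ($segment =~
            /^ ( .{$hanging_indent}   # Don't look before the indent.
                \ *                   # Skip leading blanks past the indent
                    [^ ] .*           # Find the right-most
                (?:                   #  acceptable break:
                    [ \s = ]          # space or equal
                    | - (?! [.0-9] )  # or non-unary minus.
                )
            )/x)
        {
            $matches{$hanging_indent}++;
        }
    }
}

printf "indent %2d: %d matches\n", $_, $matches{$_} // 0 for 27, 29, 0;
```

Running it with PERL_DESTRUCT_LEVEL=2 under a -DDEBUGGING build on either
side of the commit, with the same Perl_sv_clean_all breakpoint and S_visit
call as above, should show whether the REGEXP reference counts track the
number of matches.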

I assume that a reference count is no longer being dropped when it should
be, but it's not obvious to me how the logic works, and hence whether
anything I might suggest adds more bugs than it solves.

Nicholas Clark

* seems actually to be a factor of 12 longer