develooper Front page | perl.perl5.porters | Postings from January 2014

DAVEM TPF Grant#2 December 2013 report

Thread Next
From:
Dave Mitchell
Date:
January 7, 2014 15:56
Subject:
DAVEM TPF Grant#2 December 2013 report
Message ID:
20140107155628.GC27210@iabyn.com
I spent the majority of my time last month working on a function called
Perl_re_intuit_start() in the regex engine.

This function is one of the major optimisations in the regex engine.
For each compiled pattern, it is noted what is the longest fixed and
floating strings that *must* appear in the string for the match to
succeed, along with what char class (if any) the pattern must start with.

Using these heuristics, re_intuit_start() tries to quickly scan the string
(using fast Boyer-Moore) to locate these known substrings. This allows us
to either quickly reject the string as not matching, or to find a minimum
start position in the string, before handing off to the normal NFA
automaton.

Unfortunately it's a bit flaky when it comes to handling utf8-encoded
strings, especially long ones. There are lots of bits of code that do
length calculations on bits of the string, for example along the lines of
"if the maximum start position plus the length of the floating string is
less than the end of the string less minlen chars, then..." which ends
up doing a lot of calculations of utf8 lengths of the substring between
two pointers. Which for long strings becomes grindingly slow.

Also, chunks of the code are marked with reassuring comments like:

	/* XXXX May be hopelessly wrong for UTF... */

The problem with this function being mainly an optimisation, is that as
long as it doesn't falsely reject valid strings or choose to high a
starting point, it won't fail any tests: even if it makes the match
unnecessarily slow. So errors are hard to spot.

The code's also hard to understand. (Even ignoring 'fail:' and variants,
it has 17 separate labels that can be goto'ed).

So I've spent most of last month going through this function, trying to
understand how bits of it work, fixing issues, and generally cleaning up,
refactoring and better documenting the code. I'm nowhere near finished
yet.


SUMMARY:
      5:32 RT#120426 Strange and unwarranted underflow in string-to-number
      0:17 RT#120675 Unexpected tainting via regex using locale
     38:33 RT#120692 Slow global pattern match with input from utf8
      7:01 fix smoke issues
      2:07 look at Compress-Raw-Zlib and -g3 issue
     14:14 process p5p mailbox
    ------
     67:44 TOTAL (HH::MM)

   4.4 weeks
  67.7 total hours
  15.3 average hours per week


As of 2013/12/31: since the beginning of the grant:

  11.3 weeks
 183.3 total hours
  16.2 average hours per week

There are 217 hours left on the grant.


-- 
Monto Blanco... scorchio!

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About