develooper Front page | perl.perl5.porters | Postings from July 2016

StackOverflow pathological regex

Thread Next
From:
George Greer
Date:
July 21, 2016 18:58
Subject:
StackOverflow pathological regex
Message ID:
alpine.LFD.2.20.1607211452280.24508@drei.m-l.org
Regarding the StackOverflow outage:

http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016

"The direct cause was a malformed post that caused one of our regular 
expressions to consume high CPU on our web servers. [...] The regular 
expression was: ^[\s\u200c]+|[\s\u200c]+$ Which is intended to trim 
unicode space from start and end of a line. A simplified version of the 
Regex that exposes the same issue would be \s+$ which to a human looks 
easy (“all the spaces at the end of the string”), but which means quite 
some work for a simple backtracking Regex engine."

It doesn't look like Perl's engine is quite as pathological, but it still 
iterates over every single space in the string forwards instead of 
starting at the end of the string and asking if it could ever possibly 
match:

./miniperl -Dr -e '$a=("foo \x{2022} bar ")x4_000 . (" ") x 20_000; $a =~ s/^[\s\x{200c}]+//; $a =~ s/[\s\x{200c}]+$//;'

- - - 8< - - - 8< - - -
   36 < bar > <foo >          |   1|  13:SEOL(14)
                              |   1|  failed...
                              |   0| failed...
   39 <r foo> < %x{2022} >    |   0| 1:PLUS(13)
                              |   0| ANYOF[\t\n\x0B\f\r \x85\xA0][1680 2000-200A 200C 2028-2029 202F 205F 3000] can match 1 times out of 2147483647...
   40 < foo > <%x{2022} b>    |   1|  13:SEOL(14)
                              |   1|  failed...
                              |   0| failed...
   43 <o %x{2022}> < bar foo >|   0| 1:PLUS(13)
                              |   0| ANYOF[\t\n\x0B\f\r \x85\xA0][1680 2000-200A 200C 2028-2029 202F 205F 3000] can match 1 times out of 2147483647...
   44 < %x{2022} > <bar foo > |   1|  13:SEOL(14)
                              |   1|  failed...
                              |   0| failed...
- - - 8< - - - 8< - - -

So the question is, shouldn't it for an anchored match?

(In this case an xeger isn't faster because the reversing of the 
really long string far outweighs the gain of a forward scan.)


-- 
George Greer
Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About