Hi Bram! First of all thank you for your comprehensive and very didactic answer. Sure, we'll consider your advise about using the /x modifier when posting Regexp for the sake of readability and sanity. We faced that problem when maintaining an old and very poorly programmed production script that was "freezing" using 100% of CPU without any explainable reason. After some overall debugging, when we watched the Perl's Regexp Debugger we realized that it would by some kind of "indefinite loop" at first sight due to the very long recursion that seemed to never end (even after a bunch of hours). The input data used here was very long strings taken from production environment, which possibly leaded to thousands of thousands of loops that seemed to be infinite. This regexp has a lot of very general constructions that I suppose were generated by a 'try+error' approach or by supplying a string list to some kind of automatic regexp generator (like the one present in visual-regexp) as a mean to address the problem quickly and "save" some time instead of thinking systematically about the patterns that needed to me matched (and those that shouldn't as you explained). I've showed your e-mail to some teammates here and they appreciated very much your clear explanation and the regexp debugging using Perl code inside it with the /x modifier to easily discover how many different combinations were tried against the regexp. Nice work, thank you very much! Best regards, Douglas On Tue, Apr 29, 2008 at 8:18 PM, Bram via RT <perlbug-followup@perl.org> wrote: > > We've found a bug in Perl's regular expression, which leads to > > indefinite loop with high memory and CPU usage. We have prepared a > > Test Case to reproduce the problem. > > > > > > THE PROBLEM: > > Trying to match the following regular expression: > > / > > ID\s+(\d+)\s+VID\s+(\d+)\s+JID\s+(\d+)\s+INFO:\s+FTP\s+done.*profile:\s+([^\s]+).*titulo:([^|]+).*duracao:\s+(\d+).*filesize:\s+(\d+).*md5sum:\s+(\w+).*sourcePath:\s+([^\s]+).*uploadPath:\s([^\s]+).*isRetry:\s+(\d+)/ > > > > with the following string: > > "ftp.pl[13997]: ID 2883 VID 637818 JID 3702 INFO: FTP done | profile: > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA | titulo: BBBBBBBB BBBBBBB BB BBBB > > BB BBBBBB BBBBBBBB BBB BBBBBB BBBBBBB BBBBBB B BBBBBBBB | duracao: > > 103369 | filesize: 9304367 | md5sum: 8d00621d9c60596cc79ce46d91cb1809 > > | sourcePath: /usr/local/cccccc/cccccc/cccccccccccccccccc.ccc | > > uploadPath: /dddddd/dddddddddd//2007/02/13/DDDDDDDDDDDDD_dddd.ddd | > > isRetry:" > > > > The perl loop indefinetly.... > > > Hello, > > Thank you for the bug report. > > A rather late reply/solution but better late then never... > > First, when posting a bug report with such a long regex it would be > best if you use the /x modifier. > This will make the regex easier to understand (which could have > triggered a faster reply). > (I would also advise to use the /x modifier in the code that uses this > regex.) > > Your regex: (assuming I made no errors) > > m/ID\s+ (\d+)\s+ # ID > VID\s+ (\d+)\s+ # VID > JID\s+ (\d+)\s+ # JID > INFO:\s+ # INFO > FTP\s+done .* # FTP done > profile:\s+ ([^\s]+) .* # profile: > titulo: ([^|]+) .* # titulo: > duracao:\s+ (\d+) .* # duraco: > filesize:\s+ (\d+) .* # filesize: > md5sum:\s+ (\w+) .* # md5sum: > sourcePath:\s+ ([^\s]+) .* # sourcePath: > uploadPath:\s ([^\s]+) .* # uploadPath > isRetry:\s+(\d+) # isRetry > /x > > > The regex does not run indefinnetly, it just makes a whole lot of > combinations before deciding that it can't match. > > The reason why is because of constructs like this: ([^\s+]).* (which > is the same as \S+.* - which I used in the re) > > The problem with his construct is that if a match fails it will start a > lot of backtracking between (\S+) and .* because .* is too general. > An example: > > #!/usr/bin/perl -l > > $_="abc def"; > if ( > m/^ # Start of line > \S+ # Non-spaces > .* # Everything else > $ # End of line > (?{$count++}) # Count the numbers of times we got here > (?!) # Make the regex fail > /x) {} > > print $count; > __END__ > > In this case $count will contain the number of different combinations > of the same string that were tried against the regex. > For this small string it prints 3, namely: > > m/^\S\S\S....$/; > m/^\S\S.....$/; > m/^\S......$/; > > > > Now let's expand the regex and the string: > > $_="abc defghi jkl"; > if ( > m/^ # Start of line > \S+ # Non-spaces > .* # Everything else > \S+ # Non-spaces > .* # Everything else > $ # End of line > (?{$count++}) # Count the numbers of times we got here > (?!) # Make the regex fail > /x) {} > > print $count; > __END__ > > For this small string and regex it prints 85. Meaning it tried 85 > different combinations before giving up. > > Use $_="abc defghij klm"; and you get a count of 106. > This means 21 extra combinations were tried for 1 extra character in > the input. > > > So, what is the better approach? > Change m/\S+.*/ to something where no (or atleast not as much) > backtracking happens. > This would be: m/\S+(?:\s.*)?/. > > Now what will happen is that the only valid thing after a non-space is > a space. After that any characters may occur. > This group is optional meaning the space is not required. > This limits the backtracking between \S+ and .* > > > Back to the examples: > > #!/usr/bin/perl -l > > $_="abc def"; > if ( > m/^ # Start of line > \S+ # Non-spaces > (?:\s.*)? # Everything else > $ # End of line > (?{$count++}) # Count the numbers of times we got here > (?!) # Make the regex fail > /x) {} > > print $count; > __END__ > > This prints 1. Indicating no backtracking happend. (vs 3 earlier) > > > > > #!/usr/bin/perl -l > > $_="abc defghi jkl"; > if ( > m/^ # Start of line > \S+ # Non-spaces > (?:\s.*)? # Everything else > \S+ # Non-spaces > (?:\s.*)? # Everything else > $ # End of line > (?{$count++}) # Count the numbers of times we got here > (?!) # Make the regex fail > /x) {} > > print $count; > __END__ > > This print 11. (vs 85 earlier) > Using the larger string $_="abc defghij klm"; and you get a count of > 12. (vs 106 earlier) > > Some backtracking happend because this isn't an extremely good example > (since there is nothing between the two groups). > > > Now, applying this on your regex: > > > m/ID\s+ (\d+)\s+ # ID > VID\s+ (\d+)\s+ # VID > JID\s+ (\d+)\s+ # JID > INFO:\s+ # INFO > FTP\s+done .* # FTP done > profile:\s+ (\S+) (?:\s.*)? # profile: > titulo: ([^|]+) (?:|.*)? # titulo: > duracao:\s+ (\d+) (?:\D.*)? # duraco: > filesize:\s+ (\d+) (?:\D.*)? # filesize: > md5sum:\s+ (\w+) (?:\W.*)? # md5sum: > sourcePath:\s+ (\S+) (?:\s.*)? # sourcePath: > uploadPath:\s (\S+) (?:\s.*)? # uploadPath > isRetry:\s+(\d+) # isRetry > /x > > > So, let's count the number of combinations it tried with this regex > with and the given input. > > First for success: > > $_ = "ftp.pl[13997]: ID 2883 VID 637818 JID 3702 INFO: FTP done | " . > "profile: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA | titulo: BBBBBBBB " . > "BBBBBBB BB BBBBBB BBBBBB BBBBBBBB BBB BBBBBB BBBBBBB BBBBBB B > BBBBBBBB " . > "| duracao: 103369 | filesize: 9304367 | md5sum: > 8d00621d9c60596cc79ce46d91cb1809 " . > "| sourcePath: /usr/local/cccccc/cccccc/cccccccccccccccccc.ccc | > uploadPath: " . > "/dddddd/dddddddddd//2007/02/13/DDDDDDDDDDDDD_dddd.ddd | isRetry: > 5"; > > if ( > m/ID\s+ (\d+)\s+ # ID > VID\s+ (\d+)\s+ # VID > JID\s+ (\d+)\s+ # JID > INFO:\s+ # INFO > FTP\s+done .* # FTP done > profile:\s+ (\S+) (?:\s.*)? # profile: > titulo: ([^|]+) (?:|.*)? # titulo: > duracao:\s+ (\d+) (?:\D.*)? # duraco: > filesize:\s+ (\d+) (?:\D.*)? # filesize: > md5sum:\s+ (\w+) (?:\W.*)? # md5sum: > sourcePath:\s+ (\S+) (?:\s.*)? # sourcePath: > uploadPath:\s (\S+) (?:\s.*)? # uploadPath > isRetry:\s+ > (?{$count++}) > (\d+) # isRetry > /x) { > print "ok"; > } > else { > print "not ok"; > } > print $count; > __END__ > > This prints 'ok' and 1. Meaning there was no need for backtracking at > all. > > > Now a failure: > > #!/usr/bin/perl -l > > $_ = "ftp.pl[13997]: ID 2883 VID 637818 JID 3702 INFO: FTP done | " . > "profile: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA | titulo: BBBBBBBB " . > "BBBBBBB BB BBBBBB BBBBBB BBBBBBBB BBB BBBBBB BBBBBBB BBBBBB B > BBBBBBBB " . > "| duracao: 103369 | filesize: 9304367 | md5sum: > 8d00621d9c60596cc79ce46d91cb1809 " . > "| sourcePath: /usr/local/cccccc/cccccc/cccccccccccccccccc.ccc | > uploadPath: " . > "/dddddd/dddddddddd//2007/02/13/DDDDDDDDDDDDD_dddd.ddd | isRetry: > "; > > if ( > m/ID\s+ (\d+)\s+ # ID > VID\s+ (\d+)\s+ # VID > JID\s+ (\d+)\s+ # JID > INFO:\s+ # INFO > FTP\s+done .* # FTP done > profile:\s+ (\S+) (?:\s.*)? # profile: > titulo: ([^|]+) (?:|.*)? # titulo: > duracao:\s+ (\d+) (?:\D.*)? # duraco: > filesize:\s+ (\d+) (?:\D.*)? # filesize: > md5sum:\s+ (\w+) (?:\W.*)? # md5sum: > sourcePath:\s+ (\S+) (?:\s.*)? # sourcePath: > uploadPath:\s (\S+) (?:\s.*)? # uploadPath > isRetry:\s+ > (?{$count++}) > (\d+) # isRetry > /x) { > print "ok"; > } > else { > print "not ok"; > } > print $count; > __END__ > > This prints 'not ok' and 81. So some backtracking happend but not very > much. > > > Doing the same thing with the original regex: (but a much shorter > version of the input) > > #!/usr/bin/perl -l > > $_ = "ftp.pl[13997]: ID 2883 VID 637818 JID 3702 INFO: FTP done | " . > "profile: AA | titulo: BB " . > "| duracao: 10 | filesize: 93 | md5sum: 8d " . > "| sourcePath: /u | uploadPath: " . > "/d | isRetry: "; > > if ( > m/ID\s+ (\d+)\s+ # ID 9999 > VID\s+ (\d+)\s+ # VID 9999 > JID\s+ (\d+)\s+ # JID 9999 > INFO:\s+ # INFO > FTP\s+done .* # FTP done > profile:\s+ ([^\s]+) .* # profile: > titulo: ([^|]+) .* # titulo: > duracao:\s+ (\d+) .* # duraco: > filesize:\s+ (\d+) .* # filesize: > md5sum:\s+ (\w+) .* # md5sum: > sourcePath:\s+ ([^\s]+) .* # sourcePath: > uploadPath:\s ([^\s]+) .* # uploadPath > isRetry:\s+ > (?{$count++}) > (\d+) # isRetry > /x) { > print "ok"; > } > else { > print "not ok"; > } > print $count;__END__ > > This prints 'not ok' and 256 > > Now let's increase the string by 1 character: > > $_ = "ftp.pl[13997]: ID 2883 VID 637818 JID 3702 INFO: FTP done | " . > "profile: AAA | titulo: BB " . > "| duracao: 10 | filesize: 93 | md5sum: 8d " . > "| sourcePath: /u | uploadPath: " . > "/d | isRetry: "; > > This prints 'not ok' and 384. > So an extra character after profile means 128 extra combinations were > tested. > > > This is not a bug in the re engine but the very base of the re engine. > A - not so carefuly written - regex can easily put the re engine in a > long loop since it tries every possible combination. > The one writing the regex needs to tell the engine which combinations > make sense (and thus need to be tested) and which don't need to be > tested (and can be ignored). > > > Kind regards, > > Bram > >