On 1/31/06, via RT Lukasz Debowski <perlbug-followup@perl.org> wrote:
> # New Ticket Created by Lukasz Debowski
> # Please include the string: [perl #38379]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/rt3/Ticket/Display.html?id=38379 >
>
>
>
> This is a bug report for perl from ldebowsk@ipipan.waw.pl,
> generated with the help of perlbug 1.35 running under perl v5.8.4.
>
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> Dear Recipients,
>
> I observed that Perl regular expression matching operator produces
> segmentation fault when trying to match a too long expression. I
> consider it a bug since no information is given about the cause of the
> fault and its location in the script. Compare running two simple scripts:
>
> ======= SCRIPT #1 =======
>
> ldebowsk@mises:~$ perl -e '$_="<w>ab"; while(1){ $_=$_."<11>1>a>b>1>c>d";
> if(/^<w>[^<>]+(<[01][01](>[1-9][0-9]*>[^><]+>[^><]+)+)+$/){print "$k\n";}
> $k++;}'
> 1
> 2
> ...
> 1903
> 1904
> Naruszenie ochrony pamięci (i.e. "Segmentation fault" in Polish)
> ldebowsk@mises:~$
>
> ======= SCRIPT #2 =======
>
> ldebowsk@mises:~$ perl -e "0/0;"
> Illegal division by zero at -e line 1.
> ldebowsk@mises:~$
>
> =========================
>
> I think that it would be much nicer if for script #1 a regular
> Perl error message be produced. For example, "Exceeding run-time
> memory when matching REGEXP at -e line 1".
>
> It took me too much time to find out that the too long match is the
> cause of the segmentation fault. I came across this behavior when
> using two scripts for natural language processing. The first one was
> producing a kind of part-of-speech annotated corpus out of a plain
> text and the second one was validating the format of the corpus. The
> matched expression in the validating script was exactly as in the
> if-condition of script #1. For typical language data, the matched
> expression consisted of <100 consecutive segments of type
> (>[1-9][0-9]*>[^><]+>[^><]+)+), but when I ran the part-of-speech
> annotating script on some weird data it produced unexpectedly a string
> consisting of >1000 consecutive segments of type
> (>[1-9][0-9]*>[^><]+>[^><]+)+).

I'd say you could probably avoid the segfault by reworking your
pattern. Any time you see something like

    +)+)+

in a pattern, you should think carefully about whether the pattern can
be reworked to backtrack less often.
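
One possible rework, as a rough sketch only: walk the string piece by piece with \G and /gc, so that each individual match is small and flat and the looping happens in Perl code rather than inside one nested-quantifier match. The sub name valid_line is made up for illustration, and I am assuming (not having tested on 5.8.4) that keeping each match this short is enough to stay clear of the deep recursion.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the original report).
# Accepts the same format as the pattern in script #1:
#   ^<w>[^<>]+(<[01][01](>[1-9][0-9]*>[^><]+>[^><]+)+)+$
sub valid_line {
    my ($s) = @_;
    pos($s) = 0;

    # Header: "<w>" followed by one or more non-bracket characters.
    return 0 unless $s =~ /\G<w>[^<>]+/gc;

    my $groups = 0;
    # Each group starts with "<" and two binary digits ...
    while ($s =~ /\G<[01][01]/gc) {
        my $segments = 0;
        # ... followed by one or more ">number>field>field" segments.
        $segments++ while $s =~ /\G>[1-9][0-9]*>[^<>]+>[^<>]+/gc;
        return 0 unless $segments;
        $groups++;
    }

    # Valid only if at least one group was seen and nothing is left over.
    return ($groups > 0 && pos($s) == length($s)) ? 1 : 0;
}

# Same kind of input that script #1 builds, but far past 1904 repetitions.
my $line = "<w>ab" . ("<11>1>a>b>1>c>d" x 5000);
print valid_line($line) ? "ok\n" : "bad format\n";
```

The /c modifier keeps pos() in place when a match fails, which is what lets the next \G match pick up where the previous one stopped.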

And to fix this problem the perl regex engine needs to be converted
from recursive to iterative. Which is code for "it's not getting fixed
anytime soon :-)"

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"