
strange regex coredumps *unless* under -Mre=debug

From: Tom Christiansen
Date: May 13, 2013 02:15
Subject: strange regex coredumps *unless* under -Mre=debug
Message ID: 30506.1368411281@chthon

The enclosed program works perfectly fine -- provided that
both of these are true:

    * it is running v5.16 or later
    * the -Mre=debug pragma is enabled
        (and no, use re "debug" is not good enough!)

If running an earlier release, it just doesn't work, and without the 
C<use re "debug">, you get a coredump part-way through the match.

Actually, that's not true.  You still get a coredump under
C<use re "debug"> -- it is the CLI switch -Mre=debug that makes
it work, and without that even the internal C<use re "debug">
isn't good enough!

How come those aren't equivalent??

There are several other strange behaviors that can be tickled,
as mentioned in the comments.

Am I simply asking too much, or is this an actual bug or three?

The goal is to iteratively pull out a match string all of whose codepoints
have the same Unicode Script character property, with slop for Common and
Inherited thrown in.  It seems useful, maybe even important, to be able to
do this.
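
For comparison, here is a minimal sketch of that goal written as a plain loop, with no C<(??{...})> and no regex recursion at all. It is not the enclosed program and does not explain the coredumps; the helper name C<script_runs> is hypothetical, and the only library call it relies on is C<charscript> from Unicode::UCD, which the enclosed program also uses. Common and Inherited characters are assumed to attach to whichever run is in progress.

```perl
use v5.10.1;
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Hypothetical helper (not in the original program): split a string into
# runs whose characters all share one Script, letting Common and Inherited
# characters join whichever run is currently being built.
sub script_runs {
    my ($text) = @_;
    my @runs;
    my $run = "";
    my $current;    # script of the run in progress, once one is known

    for my $char (split //, $text) {
        my $script  = charscript(ord $char) // "Unknown";
        my $neutral = $script eq "Common" || $script eq "Inherited";

        # A non-neutral character of a different script closes the run.
        if (!$neutral && defined $current && $script ne $current) {
            push @runs, $run;
            ($run, $current) = ("", undef);
        }

        $current //= $script unless $neutral;
        $run .= $char;
    }

    push @runs, $run if length $run;
    return @runs;
}

binmode STDOUT, ":utf8";
say "Got string: '$_'" for script_runs("foo1 and Πππ 語語語 done");
```

On the sample data this should split at the same points as the working C<-Mre=debug> run, since a run only ends when a character from a different non-neutral script turns up.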

Here's what it shows when it works:

    macbook% perl5.16.0 -Mre=debug ~/scriptrun | & grep Got
    DEBUG: Got peekahead character f, U+0066
    Got string: 'foo1 and '
    DEBUG: Got peekahead character Π, U+03a0
    Got string: 'Πππ '
    DEBUG: Got peekahead character 語, U+8a9e
    Got string: '語語語 '
    DEBUG: Got peekahead character d, U+0064
    Got string: 'done'

And this shows the re-interpolated strings:

    macbook% perl5.16.0 -Mre=debug ~/scriptrun | & grep re-in
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}

With C<use re "debug"> left in the program but without -Mre=debug on
the command line, it dumps core a good way into the re debugging
output.  With the C<use re "debug"> line commented out as well, it
dumps core here:

    macbook% perl5.16.0 ~/scriptrun
    DEBUG: Got peekahead character f, U+0066
    DEBUG: Scriptname is Latin
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
    Got string: 'foo1 and '
    DEBUG: Got peekahead character Π, U+03a0
    DEBUG: Scriptname is Greek
    DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
    Segmentation fault
    Exit 139

That is under v5.16.  Under blead it dumps core sooner, after printing
less output than v5.16 did:

    macbook% blead ~/scriptrun | & grep Got
    DEBUG: Got peekahead character f, U+0066
    Got string: 'foo1 and '
    DEBUG: Got peekahead character Π, U+03a0
    Segmentation fault 
    Exit 139

Very strange, all around.

Perhaps there is some obvious better way to do this, but if so,
I'm having a mental block against it.  So I tried to cobble together
something that would do the job anyway, and I keep bumping into
things that not only don't work but fail with core dumps in
any number of places.

Any ideas?  Should I file an actual bug report?

thanks,

--tom

use v5.10.1;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:std :utf8);
use utf8;

# comment out this next line and dump core!
# leave it in but omit -Mre=debug and you 
# still dump core!!
use re "debug";  

use Unicode::UCD qw(charscript);

# regex to match a string that's all of the 
# same Script=XXX type
#
my $rx = qr{
    (?= 
       [\p{Script=Common}\p{Script=Inherited}] * 
        (?<CAPTURE>  
            [^\p{Script=Common}\p{Script=Inherited}] 
        ) 
    )
    (??{
        my $capture = $+{CAPTURE};  # neither $^N nor $1 worked here
        printf "DEBUG: Got peekahead character %s, U+%04x\n", 
            $capture, ord $capture;
        my $scriptname = charscript(ord $capture);
        print "DEBUG: Scriptname is $scriptname\n";
        my $run = q([\p{Script=Common}\p{Script=Inherited}\p{Script=)
                . $scriptname
                . q(}]*);
        print "DEBUG: string to re-interpolate as regex is q{$run}\n";
        $run;
    }) 
}x;

my $data = "foo1 and Πππ 語語語 done";

$| = 1;

while ($data =~ /($rx)/g) {
   #print "Got string: '$1'\n";  # causes quick SEGV
    print "Got string: '$&'\n";
} 
