[perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match

From: Nicholas Clark
Date: January 9, 2019 14:11
Subject: [perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match
Message ID: rt-4.0.24-5616-1547043067-1700.133756-75-0@perl.org
# New Ticket Created by  Nicholas Clark 
# Please include the string:  [perl #133756]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org/Ticket/Display.html?id=133756 >


This is a bug report for perl from nick@ccl4.org,
generated with the help of perlbug 1.41 running under perl 5.29.7.


-----------------------------------------------------------------
[Please describe your issue here]

When the eval'd source code contains Ā (U+0100) in a code comment, the regex
does not match. When the code comment contains ÿ (U+00FF) instead, it does:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {
    my $got = eval "\$text =~ /$mr/g; # $char";

    if ($got) {
	print "Y\n";
    } elsif ($@) {
	print "\$\@: $@\n";
    } else {
	print "n\n";
    }
}

__END__
nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl
Y
n
Y
n



If one removes the //g flag, it matches in every case:

nicholas@dromedary-001 perl6$ cat /tmp/rule.pl-no-g
use strict;
use warnings;

my $mr = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Culprit here is the //g flag:
for my $char ("\xFF", "\x100", "\xFF", "\x100") {
    my $got = eval "\$text =~ /$mr/; # $char";

    if ($got) {
	print "Y\n";
    } elsif ($@) {
	print "\$\@: $@\n";
    } else {
	print "n\n";
    }
}

__END__
nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl-no-g
Y
Y
Y
Y



I would expect it to match every time, regardless of whether //g is
present or whether the source code is encoded as UTF-8 or octets.
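
As a rough check of which encoding the eval'd source actually ends up in, a
sketch along these lines could be used. The helper eval_source is hypothetical
(it is not part of the script above) and simply rebuilds the string passed to
eval. The sketch tries both "\x100", as written in the script, and the braced
form "\x{100}", because without braces Perl's \x escape consumes at most two
hex digits:

```perl
use strict;
use warnings;

# Hypothetical helper (not in the report): builds the same source text
# that the test script hands to eval.
sub eval_source {
    my ($pattern, $char) = @_;
    return "\$text =~ /$pattern/g; # $char";
}

my $mr = "L\xFCften Kalt";

# "\x100" is the spelling used in the script; "\x{100}" is U+0100.
my @cases = (
    [ '"\xFF"',    "\xFF"    ],
    [ '"\x100"',   "\x100"   ],
    [ '"\x{100}"', "\x{100}" ],
);

for my $case (@cases) {
    my ($label, $char) = @$case;
    my $src = eval_source($mr, $char);
    # utf8::is_utf8 reports whether the string is stored internally as UTF-8.
    printf "%-10s length=%d first=U+%04X source stored as %s\n",
        $label, length($char), ord($char),
        utf8::is_utf8($src) ? "UTF-8" : "octets";
}
```

If the "\x100" row reports a length of 2 and an octet source, the comment in
the eval'd code never contained Ā, and the two loop values would differ only in
a pair of ASCII-range bytes.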

Running with -Dr suggests that the culprit is the regex optimiser, which
reports "Regex match can't succeed, so not even tried":

[snip]

EXECUTING...

Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char  at 1
Final program:
   1: EXACT <L\x{fc}ften Kalt> (5)
   5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against "L%x{fc}ften Kalt"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..11] gave 0
  Found anchored substr "L%x{fc}ften Kalt" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Freeing REx: "L%x{fc}ften Kalt"
Y
Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char  at 1
Final program:
   1: EXACT <L\x{fc}ften Kalt> (5)
   5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against ""
Regex match can't succeed, so not even tried
Freeing REx: "L%x{fc}ften Kalt"
n
Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char  at 1
Final program:
   1: EXACT <L\x{fc}ften Kalt> (5)
   5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against "L%x{fc}ften Kalt"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..11] gave 0
  Found anchored substr "L%x{fc}ften Kalt" at offset 0 (rx_origin now 0)...
  (multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
Freeing REx: "L%x{fc}ften Kalt"
Y
Compiling REx "L%x{fc}ften Kalt"
~ tying lastbr EXACT <L\x{fc}ften Kalt> (1) to ender END (5) offset 4
rarest char  at 1
Final program:
   1: EXACT <L\x{fc}ften Kalt> (5)
   5: END (0)
anchored "L%x{fc}ften Kalt" at 0..0 (checking anchored isall) minlen 11
Matching REx "L%x{fc}ften Kalt" against ""
Regex match can't succeed, so not even tried
Freeing REx: "L%x{fc}ften Kalt"
n
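
The 'against ""' lines in the trace above would be consistent with the match
starting at the end of $text rather than at offset 0. Scalar-context //g
resumes from pos($text), and $text is shared across the loop iterations, so one
way to separate that effect from the source encoding might be the sketch below.
The helper try_match is hypothetical (not part of the original script), and the
wide character is written as "\x{100}" on the assumption that U+0100 is what
was intended:

```perl
use strict;
use warnings;

my $mr   = "L\xFCften Kalt";
my $text = "L\xFCften Kalt";

# Hypothetical helper (not in the report): runs the same eval as the
# test script, optionally clearing pos($text) first, and also returns
# the offset at which the //g match was due to start.
sub try_match {
    my ($char, $reset_pos) = @_;
    pos($text) = undef if $reset_pos;
    my $start = pos($text) // 0;
    my $got = eval "\$text =~ /$mr/g; # $char";
    die $@ if $@;
    return ($got ? "Y" : "n", $start);
}

for my $reset_pos (0, 1) {
    print $reset_pos ? "pos() cleared before each match:\n"
                     : "pos() left alone:\n";
    for my $char ("\xFF", "\x{100}", "\xFF", "\x{100}") {
        my ($result, $start) = try_match($char, $reset_pos);
        printf "  U+%04X  start=%2d  %s\n", ord($char), $start, $result;
    }
}
```

Comparing the two blocks of output, and swapping the order of the characters
in the loop, should show whether the failures follow the comment character or
the starting offset.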


This does not seem to be a regression - everything I sampled back to 5.6.0
shows the same Y/n/Y/n behaviour.

Nicholas Clark

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=low
---
Site configuration information for perl 5.29.7:

Configured by nicholas at Wed Jan  9 14:11:11 CET 2019.

Summary of my perl5 (revision 5 version 29 subversion 7) configuration:
  Commit id: 5203d63deea0ef134714a48c272a928fbbe64ce1
  Platform:
    osname=linux
    osvers=2.6.32-358.el6.x86_64
    archname=x86_64-linux
    uname='linux dromedary-001.ams6.corp.booking.com 2.6.32-358.el6.x86_64 #1 smp fri feb 22 00:31:26 utc 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args='-Dusedevel -Dcc=ccache /usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64 -Dld=/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64 -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Accflags=-DDEBUGGING -g -Doptimize=-Og -Uusethreads -Uuselongdouble -Uuse64bitall -Dprefix=~/Sandpit/snap-v5.29.6-94-g5203d63dee -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap-v5.29.6-94-g5203d63dee -de'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=define
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64'
    ccflags ='-DDEBUGGING -g -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-Og'
    cppflags='-DDEBUGGING -g -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='4.9.0'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='/usr/local/gcc49/bin/gcc -Wl,-rpath=/usr/local/gcc49/lib64'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/local/gcc49/lib /usr/local/gcc49/lib/gcc/x86_64-unknown-linux-gnu/4.9.0/include-fixed /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib /lib64 /usr/lib64 /usr/local/lib64
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.12.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.12'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -Og -L/usr/local/lib -fstack-protector-strong'


---
@INC for perl 5.29.7:
    lib
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/site_perl/5.29.7/x86_64-linux
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/site_perl/5.29.7
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/5.29.7/x86_64-linux
    /home/nicholas/Sandpit/snap-v5.29.6-94-g5203d63dee/lib/perl5/5.29.7

---
Environment for perl 5.29.7:
    HOME=/home/nicholas
    LANG (unset)
    LANGUAGE (unset)
    LC_ALL=en_GB.utf8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nicholas/bin:/opt/local/bin:/opt/local/sbin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

