
Re: [perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match

From: Karl Williamson
Date: January 9, 2019 16:50
Subject: Re: [perl #133756] //g flag on regex with UTF-8 source causes regexoptimiser to wrongly reject a match
Message ID: 90ae4b0b-cfaa-4ace-6efc-a0d5d6274dda@khwilliamson.com

On 1/9/19 7:11 AM, Nicholas Clark (via RT) wrote:
> # New Ticket Created by  Nicholas Clark
> # Please include the string:  [perl #133756]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/Ticket/Display.html?id=133756 >
> 
> 
> This is a bug report for perl from nick@ccl4.org,
> generated with the help of perlbug 1.41 running under perl 5.29.7.
> 
> 
> -----------------------------------------------------------------
> [Please describe your issue here]
> 
> For the case where the eval'd source code contains Ā in a code comment this
> doesn't match. Where the code comment is ÿ it does:
> 
> nicholas@dromedary-001 perl6$ cat /tmp/rule.pl
> use strict;
> use warnings;
> 
> my $mr = "L\xFCften Kalt";
> my $text = "L\xFCften Kalt";
> 
> # Culprit here is the //g flag:
> for my $char ("\xFF", "\x100", "\xFF", "\x100") {
>      my $got = eval "\$text =~ /$mr/g; # $char";
> 
>      if ($got) {
> 	print "Y\n";
>      } elsif ($@) {
> 	print "\$\@: $@\n";
>      } else {
> 	print "n\n";
>      }
> }
> 
> __END__
> nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl
> Y
> n
> Y
> n
> 
> 
> 
> If one removes the //g flag, it does:
> 
> nicholas@dromedary-001 perl6$ cat /tmp/rule.pl-no-g
> use strict;
> use warnings;
> 
> my $mr = "L\xFCften Kalt";
> my $text = "L\xFCften Kalt";
> 
> # Culprit here is the //g flag:
> for my $char ("\xFF", "\x100", "\xFF", "\x100") {
>      my $got = eval "\$text =~ /$mr/; # $char";
> 
>      if ($got) {
> 	print "Y\n";
>      } elsif ($@) {
> 	print "\$\@: $@\n";
>      } else {
> 	print "n\n";
>      }
> }
> 
> __END__
> nicholas@dromedary-001 perl6$ ./perl -Ilib /tmp/rule.pl-no-g
> Y
> Y
> Y
> Y
> 
> 
> 
> I would expect that it should match always, independent of whether //g is
> present, or whether the source code is encoded as UTF-8 or octets.
> 
> Running with -Dr suggests that the culprit is the regex optimiser,
> "Regex match can't succeed, so not even tried":
> 
> [snip]
> 

My gvim syntax highlighter immediately showed that "\x100" is \x10
followed by a "0".  Without that, I would have expected $char to
contain a single character: \x{100}.  The /g would cause an attempt to
match the second character, the "0" (U+0030).  I haven't investigated
further, because my guess is that this is what is going on here.  If
you say there is more to it, then I'll investigate further.
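
To make the parse concrete, here is a minimal sketch (not part of the
original report; the helper name code_points is just for illustration)
that prints what each spelling of the escape actually contains.  It only
demonstrates how the string literal is parsed and makes no claim about
what the regex optimiser does with it:

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points in a string as U+XXXX.
sub code_points {
    my ($str) = @_;
    return join " ", map { sprintf("U+%04X", ord $_) } split //, $str;
}

# Without braces, \x consumes at most two hex digits, so this is
# chr(0x10) followed by a literal "0".
my $unbraced = "\x100";

# With braces, the whole hex number is taken: one character, U+0100.
my $braced = "\x{100}";

printf "unbraced: length %d (%s)\n", length($unbraced), code_points($unbraced);
printf "braced:   length %d (%s)\n", length($braced),   code_points($braced);
```

This should report a length of 2 (U+0010 U+0030) for the unbraced form
and a length of 1 (U+0100) for the braced one.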
