[perl #129950] Some UTF-8 regular expression matches fail when read from file

From: Dan Collins via RT
Date: October 24, 2016 20:47
Subject: [perl #129950] Some UTF-8 regular expression matches fail when read from file
Message ID: rt-4.0.24-15532-1477342044-1953.129950-14-0@perl.org

On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
> On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
> > On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
> > > You can reproduce the bug with the following procedure:
> > > 1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
> > > 2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print
> > > m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
> > > Output: not matched
> > >
> > > This happens only when the string is read from a file handle and the
> > > second character is in the range of \x{80}-\x{ff}.
> > > Curiously enough, the match succeeds if the regexp is
> > > m{^a|a[\x{e3}-\x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is
> > > m{^a|a[\x{e4}-\x{e4}]$}.
> > 
> > Sorry, the bug reproduces only when the alternation is wrapped in a
> > set of parentheses, i.e. m{^(a|a\x{e4})$} etc.
> 
> Sorry again, the correct Unicode option for step 2 was -Ci.

This seems interesting:

$ perl -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
matched
$ perl -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
not matched

And with -Dr (the first attempt below used a bare -D, which only lists the debugging flags)...

dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -D -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
 Debugging flag values: (see also -d)
  p  Tokenizing and parsing (with v, displays parse stack)
  s  Stack snapshots (with v, displays all stacks)
  l  Context (loop) stack processing
  t  Trace execution
  o  Method and overloading resolution
  c  String/numeric conversions
  P  Print profiling info, source file input state
  m  Memory and SV allocation
  f  Format processing
  r  Regular expression parsing and execution
  x  Syntax tree dump
  u  Tainting checks
  H  Hash dump -- usurps values()
  X  Scratchpad allocation
  D  Cleaning up
  S  Op slab allocation
  T  Tokenising
  R  Include reference counts of dumped variables (eg when using -Ds)
  J  Do not s,t,P-debug (Jump over) opcodes within package DB
  v  Verbose: use in conjunction with other flags
  C  Copy On Write
  A  Consistency checks on internal structures
  q  quiet - currently only suppresses the 'EXECUTING' message
  M  trace smart match resolution
  B  dump suBroutine definitions, including special Blocks like BEGIN
  L  trace some locale setting information--for Perl core development
  i  trace PerlIO layer processing

EXECUTING...

matched
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -Dr -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
Compiling REx "^(a\x{e4})$"
rarest char ▒ at 1
Final program:
   1: SBOL /^/ (2)
   2: OPEN1 (4)
   4:   EXACT <a\x{e4}> (6)
   6: CLOSE1 (8)
   8: SEOL (9)
   9: END (0)
anchored "a%x{e4}"$ at 0 (checking anchored noscan) anchored(SBOL) minlen 2
Enabling $` $& $' support (0x7).

EXECUTING...

Matching REx "^(a\x{e4})$" against "a%x{e4}"
UTF-8 string...
Intuit: trying to determine minimum start position...
rarest char ▒ at 2
  Looking for check substr at fixed offset 0...
Intuit: Successfully guessed: match at offset 0
   0 <> <a%x{e4}>            |   0| 1:SBOL /^/(2)
   0 <> <a%x{e4}>            |   0| 2:OPEN1(4)
   0 <> <a%x{e4}>            |   0| 4:EXACT <a\x{e4}>(6)
   3 <a%x{e4}> <>            |   0| 6:CLOSE1(8)
   3 <a%x{e4}> <>            |   0| 8:SEOL(9)
   3 <a%x{e4}> <>            |   0| 9:END(0)
Match successful!
matched
Freeing REx: "^(a\x{e4})$"
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -Dr -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
Compiling REx "^(a|a\x{e4})$"
rarest char
 at 0
rarest char a at 0
Final program:
   1: SBOL /^/ (2)
   2: OPEN1 (4)
   4:   EXACT <a> (6)
   6:   TRIE-EXACT[\xE4] (10)
        <>
        <\344>
  10: CLOSE1 (12)
  12: SEOL (13)
  13: END (0)
anchored "a" at 0 floating ""$ at 1..2 (checking anchored noscan) anchored(SBOL) minlen 1
Enabling $` $& $' support (0x7).

EXECUTING...

Matching REx "^(a|a\x{e4})$" against "a%x{e4}"
UTF-8 string...
Intuit: trying to determine minimum start position...
rarest char
 at 0
rarest char a at 0
  Looking for check substr at fixed offset 0...
Intuit: Successfully guessed: match at offset 0
   0 <> <a%x{e4}>            |   0| 1:SBOL /^/(2)
   0 <> <a%x{e4}>            |   0| 2:OPEN1(4)
   0 <> <a%x{e4}>            |   0| 4:EXACT <a>(6)
   1 <a> <%x{e4}>            |   0| 6:TRIE-EXACT[\xE4](10)
                             |   0| matched empty string...
   1 <a> <%x{e4}>            |   0| 10:CLOSE1(12)
   1 <a> <%x{e4}>            |   0| 12:SEOL(13)
                             |   0| failed...
Match failed
not matched
Freeing REx: "^(a|a\x{e4})$"

Unicode errors aside, is the TRIE optimization getting this wrong?

-- 
Respectfully,
Dan Collins
