[perl #129950] Some UTF-8 regular expression matches fail when read from file
From: Dan Collins via RT
Date: October 24, 2016 20:47
Subject: [perl #129950] Some UTF-8 regular expression matches fail when read from file
Message ID: rt-4.0.24-15532-1477342044-1953.129950-14-0@perl.org
On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
> On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
> > On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
> > > You can reproduce the bug with the following procedure:
> > > 1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
> > > 2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print
> > > m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
> > > Output: not matched
> > >
> > > This happens only when the string is read from a file handle and the
> > > second character is in the range \x{80}-\x{ff}.
> > > Curiously enough, the match succeeds if the regexp is
> > > m{^a|a[\x{e3}-\x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is
> > > m{^a|a[\x{e4}-\x{e4}]$}.
> >
> > Sorry, the bug only reproduces when there is a set of
> > parentheses, i.e. m{^(a|a\x{e4})$} etc.
>
> Sorry again, the correct Unicode option for step 2 was -Ci.
This seems interesting:
$ perl -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
matched
$ perl -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
not matched
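And with -Dr (the first run below passes a bare -D, which only prints the list of debugging flags before executing):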
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -D -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
Debugging flag values: (see also -d)
p Tokenizing and parsing (with v, displays parse stack)
s Stack snapshots (with v, displays all stacks)
l Context (loop) stack processing
t Trace execution
o Method and overloading resolution
c String/numeric conversions
P Print profiling info, source file input state
m Memory and SV allocation
f Format processing
r Regular expression parsing and execution
x Syntax tree dump
u Tainting checks
H Hash dump -- usurps values()
X Scratchpad allocation
D Cleaning up
S Op slab allocation
T Tokenising
R Include reference counts of dumped variables (eg when using -Ds)
J Do not s,t,P-debug (Jump over) opcodes within package DB
v Verbose: use in conjunction with other flags
C Copy On Write
A Consistency checks on internal structures
q quiet - currently only suppresses the 'EXECUTING' message
M trace smart match resolution
B dump suBroutine definitions, including special Blocks like BEGIN
L trace some locale setting information--for Perl core development
i trace PerlIO layer processing
EXECUTING...
matched
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -Dr -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
Compiling REx "^(a\x{e4})$"
rarest char ▒ at 1
Final program:
1: SBOL /^/ (2)
2: OPEN1 (4)
4: EXACT <a\x{e4}> (6)
6: CLOSE1 (8)
8: SEOL (9)
9: END (0)
anchored "a%x{e4}"$ at 0 (checking anchored noscan) anchored(SBOL) minlen 2
Enabling $` $& $' support (0x7).
EXECUTING...
Matching REx "^(a\x{e4})$" against "a%x{e4}"
UTF-8 string...
Intuit: trying to determine minimum start position...
rarest char ▒ at 2
Looking for check substr at fixed offset 0...
Intuit: Successfully guessed: match at offset 0
0 <> <a%x{e4}> | 0| 1:SBOL /^/(2)
0 <> <a%x{e4}> | 0| 2:OPEN1(4)
0 <> <a%x{e4}> | 0| 4:EXACT <a\x{e4}>(6)
3 <a%x{e4}> <> | 0| 6:CLOSE1(8)
3 <a%x{e4}> <> | 0| 8:SEOL(9)
3 <a%x{e4}> <> | 0| 9:END(0)
Match successful!
matched
Freeing REx: "^(a\x{e4})$"
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -Dr -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
Compiling REx "^(a|a\x{e4})$"
rarest char
at 0
rarest char a at 0
Final program:
1: SBOL /^/ (2)
2: OPEN1 (4)
4: EXACT <a> (6)
6: TRIE-EXACT[\xE4] (10)
<>
<\344>
10: CLOSE1 (12)
12: SEOL (13)
13: END (0)
anchored "a" at 0 floating ""$ at 1..2 (checking anchored noscan) anchored(SBOL) minlen 1
Enabling $` $& $' support (0x7).
EXECUTING...
Matching REx "^(a|a\x{e4})$" against "a%x{e4}"
UTF-8 string...
Intuit: trying to determine minimum start position...
rarest char
at 0
rarest char a at 0
Looking for check substr at fixed offset 0...
Intuit: Successfully guessed: match at offset 0
0 <> <a%x{e4}> | 0| 1:SBOL /^/(2)
0 <> <a%x{e4}> | 0| 2:OPEN1(4)
0 <> <a%x{e4}> | 0| 4:EXACT <a>(6)
1 <a> <%x{e4}> | 0| 6:TRIE-EXACT[\xE4](10)
| 0| matched empty string...
1 <a> <%x{e4}> | 0| 10:CLOSE1(12)
1 <a> <%x{e4}> | 0| 12:SEOL(13)
| 0| failed...
Match failed
not matched
Freeing REx: "^(a|a\x{e4})$"
Unicode errors aside, is the TRIE optimization getting this wrong?
--
Respectfully,
Dan Collins