
Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file

From: Dave Mitchell
Date: October 25, 2016 11:45
Subject: Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file
Message ID: 20161025114528.GM3128@iabyn.com

On Tue, Oct 25, 2016 at 12:31:59PM +0200, demerphq wrote:
> On 25 October 2016 at 12:12, Dave Mitchell <davem@iabyn.com> wrote:
> > On Mon, Oct 24, 2016 at 03:57:15PM -0700, Tony Cook via RT wrote:
> >> On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
> >> > On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
> >> > > On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
> >> > > > You can reproduce the bug with the following procedure:
> >> > > > 1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
> >> > > > 2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print
> >> > > > m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
> >> > > > Output: not matched
> >> > > >
> >> > > > This happens only when the string is read from a file handle and the
> >> > > > second character is in the range of \x{80}-\x{ff}.
> >> > > > Curiously enough, the match succeeds if the regexp is m{^a|a[\x{e3}-
> >> > > > \x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is m{^a|a[\x{e4}-
> >> > > > \x{e4}]$}.
> >> > >
> >> > > Sorry, the bug only reproduces itself when there is a set of
> >> > > parentheses, i.e. m{^(a|a\x{e4})$} etc.
> >> >
> >> > Sorry again, the correct Unicode option for step 2 was -Ci.
> >>
> >> The string doesn't need to be from a file:
> >>
> >> $ ./perl -e '$_ = "a\xE4"; utf8::upgrade($_); print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
> >> not matched
> >>
> >> (blead perl)
> >>
> >> The match is failing around line 5611 of regexec.c:
> >>
> >>                 if (   trie->bitmap
> >>                     && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr)))
> >>                 {
> >>                   if (trie->states[ state ].wordnum) {
> >>                        DEBUG_EXECUTE_r(
> >>                             Perl_re_exec_indentf( aTHX_  "%smatched empty string...%s\n",
> >>                                           depth, PL_colors[4], PL_colors[5])
> >>                         );
> >>
> >> At this point nextchr has the first byte of the UTF-8 encoded \xE4 (0xc3).
> >
> > I'm looking into this as we speak.
> 
> I was going to look into it later as well. Let me know how far you get.

Not far, as it turns out. In the failing case TRIE_BITMAP_TEST() returns
false, whereas for working code such as

    $_ = "a\x64";
    print "match\n" if m{^(a|a\x{64})$};

it returns true.

At which point I got distracted and haven't looked further. You're probably
a better choice than me to take this further :-)
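
For anyone wanting to poke at this, here is a rough sketch that puts the
cases from this thread side by side. It first prints the UTF-8 octets of
"a\xE4", which shows where the 0xc3 that Tony saw in nextchr comes from,
and then runs the working \x64 pattern and the \xE4 pattern against
native and upgraded strings. The helper names utf8_octets and try_match
are made up for illustration and are not from the report. Whether the
last line says "matched" depends on the perl build it is run with, so no
output is claimed for it here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): the UTF-8 octets of a string,
# returned as two-digit hex values.
sub utf8_octets {
    my ($str) = @_;
    my $copy = $str;
    utf8::encode($copy);    # in place: characters -> UTF-8 octets
    return map { sprintf("%02x", ord($_)) } split(//, $copy);
}

# Hypothetical helper: match $str against $re, optionally after
# utf8::upgrade(), which switches the internal representation to UTF-8
# without changing the characters themselves.
sub try_match {
    my ($str, $re, $upgrade) = @_;
    my $copy = $str;
    utf8::upgrade($copy) if $upgrade;
    return $copy =~ $re ? "matched" : "not matched";
}

my $re_e4 = qr{^(a|a\x{e4})$};    # pattern from the bug report
my $re_64 = qr{^(a|a\x{64})$};    # the working ASCII-only variant

print "octets of a\\xE4:  ", join(" ", utf8_octets("a\xE4")), "\n";
print "a\\x64:            ", try_match("a\x64", $re_64, 0), "\n";
print "a\\xE4 (native):   ", try_match("a\xE4", $re_e4, 0), "\n";
print "a\\xE4 (upgraded): ", try_match("a\xE4", $re_e4, 1), "\n";
```

Running it under -Mre=debug (or with "use re 'debug';" added) should also
show the trie that the alternation is compiled into, which is the
structure the bitmap test above belongs to.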


-- 
The Enterprise's efficient long-range scanners detect a temporal vortex
distortion in good time, allowing it to be safely avoided via a minor
course correction.
    -- Things That Never Happen in "Star Trek" #21
