
Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file

From: Dave Mitchell
Date: October 25, 2016 10:12
Subject: Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file
Message ID: 20161025101230.GL3128@iabyn.com

On Mon, Oct 24, 2016 at 03:57:15PM -0700, Tony Cook via RT wrote:
> On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
> > On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
> > > On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
> > > > You can reproduce the bug with the following procedure:
> > > > 1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
> > > > 2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print
> > > > m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
> > > > Output: not matched
> > > >
> > > > This happens only when the string is read from a file handle and the
> > > > second character is in the range \x{80}-\x{ff}.
> > > > Curiously enough, the match succeeds if the regexp is
> > > > m{^a|a[\x{e3}-\x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is
> > > > m{^a|a[\x{e4}-\x{e4}]$}.
> > > 
> > > Sorry, the bug reproduces only when there is a set of
> > > parentheses, i.e. m{^(a|a\x{e4})$} etc.
> > 
> > Sorry again, the correct Unicode option for step 2 was -Ci.
> 
> The string doesn't need to be from a file:
> 
> $ ./perl -e '$_ = "a\xE4"; utf8::upgrade($_); print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
> not matched
> 
> (blead perl)
> 
> The match is failing around line 5611 of regexec.c:
> 
>                 if (   trie->bitmap
>                     && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr)))
>                 {
>                     if (trie->states[ state ].wordnum) {
>                         DEBUG_EXECUTE_r(
>                             Perl_re_exec_indentf( aTHX_  "%smatched empty string...%s\n",
>                                           depth, PL_colors[4], PL_colors[5])
>                         );
> 
> At this point nextchr has the first byte of the UTF-8 encoded \xE4 (0xc3).

I'm looking into this as we speak.

-- 
I thought I was wrong once, but I was mistaken.
