Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file

From: demerphq
Date: October 25, 2016 10:32
Subject: Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file
Message ID: CANgJU+Vm0tjLKWFcy-ww=_vCPYRtaJs4oUamJsK4d8oBd_TMeg@mail.gmail.com

On 25 October 2016 at 12:12, Dave Mitchell <davem@iabyn.com> wrote:
> On Mon, Oct 24, 2016 at 03:57:15PM -0700, Tony Cook via RT wrote:
>> On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
>> > On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
>> > > On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
>> > > > You can reproduce the bug with the following procedure:
>> > > > 1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
>> > > > 2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print
>> > > > m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
>> > > > Output: not matched
>> > > >
>> > > > This happens only when the string is read from a file handle and the
>> > > > second character is in the range of \x{80}-\x{ff}.
>> > > > Curiously enough, the match succeeds if the regexp is m{^a|a[\x{e3}-
>> > > > \x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is m{^a|a[\x{e4}-
>> > > > \x{e4}]$}.
>> > >
>> > > Sorry, the bug only reproduces when there is a set of
>> > > parentheses, i.e. m{^(a|a\x{e4})$} etc.
>> >
>> > Sorry again, the correct Unicode option for step 2 was -Ci.
>>
>> The string doesn't need to be from a file:
>>
>> $ ./perl -e '$_ = "a\xE4"; utf8::upgrade($_); print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
>> not matched
>>
>> (blead perl)
>>
>> The match is failing around line 5611 of regexec.c:
>>
>>                 if (   trie->bitmap
>>                     && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr)))
>>                 {
>>                   if (trie->states[ state ].wordnum) {
>>                        DEBUG_EXECUTE_r(
>>                             Perl_re_exec_indentf( aTHX_  "%smatched empty string...%s\n",
>>                                           depth, PL_colors[4], PL_colors[5])
>>                         );
>>
>> At this point nextchr has the first byte of the UTF-8 encoded \xE4 (0xc3).
>
> I'm looking into this as we speak.

I was going to look into it later as well. Let me know how far you get.

We used to preload the bitmap with the first byte of the Unicode (UTF-8)
representation of the string, but I guess I can leave it to you.

Let me know otherwise.

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
