Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file
From:
demerphq
Date:
October 27, 2016 12:04
Subject:
Re: [perl #129950] Some UTF-8 regular expression matches fail when read from file
Message ID:
CANgJU+Wx1j9yjM4xwOfHo421zdDuqpwCBRCFy0Zkw9cp393Prg@mail.gmail.com
On 25 October 2016 at 13:45, Dave Mitchell <davem@iabyn.com> wrote:
> On Tue, Oct 25, 2016 at 12:31:59PM +0200, demerphq wrote:
>> On 25 October 2016 at 12:12, Dave Mitchell <davem@iabyn.com> wrote:
>> > On Mon, Oct 24, 2016 at 03:57:15PM -0700, Tony Cook via RT wrote:
>> >> On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
>> >> > On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
>> >> > > On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
>> >> > > > You can reproduce the bug with the following procedure:
>> >> > > > 1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
>> >> > > > 2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print
>> >> > > > m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
>> >> > > > Output: not matched
>> >> > > >
>> >> > > > This happens only when the string is read from a file handle and the
>> >> > > > second character is in the range of \x{80}-\x{ff}.
>> >> > > > Curiously enough, the match succeeds if the regexp is m{^a|a[\x{e3}-
>> >> > > > \x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is m{^a|a[\x{e4}-
>> >> > > > \x{e4}]$}.
>> >> > >
>> >> > > Sorry, the bug only reproduces when there is a set of
>> >> > > parentheses, i.e. m{^(a|a\x{e4})$} etc.
>> >> >
>> >> > Sorry again, the correct Unicode option for step 2 was -Ci.
>> >>
>> >> The string doesn't need to be from a file:
>> >>
>> >> $ ./perl -e '$_ = "a\xE4"; utf8::upgrade($_); print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
>> >> not matched
>> >>
>> >> (blead perl)
>> >>
>> >> The match is failing around line 5611 of regexec.c:
>> >>
>> >> if ( trie->bitmap
>> >> && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr)))
>> >> {
>> >> if (trie->states[ state ].wordnum) {
>> >> DEBUG_EXECUTE_r(
>> >> Perl_re_exec_indentf( aTHX_ "%smatched empty string...%s\n",
>> >> depth, PL_colors[4], PL_colors[5])
>> >> );
>> >>
>> >> At this point nextchr has the first byte of the UTF-8 encoded \xE4 (0xc3).
>> >
>> > I'm looking into this as we speak.
>>
>> I was going to look into it later as well. Let me know how far you get.
>
> Not far as it turns out. The failing code has TRIE_BITMAP_TEST() returning
> false, while good code like
>
> $_ = "a\x64";
> print "match\n" if m{^(a|a\x{64})$};
>
> doesn't.
> At which point I got distracted and haven't looked further. You're probably
> a better choice than me to take this further :-)
Fixed. This ticket can be closed.
commit da42332b10691ba7af7550035ffc7f46c87e4e66
Author: Yves Orton <demerphq@gmail.com>
Date: Thu Oct 27 13:52:24 2016 +0200
regcomp.c: fix perl #129950 - fix firstchar bitmap under utf8 with
prefix optimisation

The trie code contains a number of sub-optimisations, one of which
extracts common prefixes from alternations, and another of which is a
bitmap of the possible matching first chars.

The bitmap needs to contain the possible first octets of the string
which the trie can match, and for codepoints which might have a different
first octet under utf8 or non-utf8 we need to register BOTH first octets.

So for instance in the pattern (?:a|a\x{E4}) we should restructure this
as a(|\x{E4}), and the bitmap for the trie should contain both \x{E4} AND
\x{C3}, as \x{C3} is the first byte of \x{E4} expressed as utf8.
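
To make the "both first octets" point concrete, here is a rough Perl sketch
of the idea. It is only a toy model, not the actual C code in regcomp.c: the
helper names first_octets() and build_bitmap() are made up for illustration,
and the bitmap is just a 256-bit vec() string. It assumes code points below
0x100, which is the range where the two representations can disagree.

```perl
use strict;
use warnings;

# Hypothetical helper: for a code point below 0x100, return its first octet
# in the non-utf8 (one byte per character) representation and in UTF-8.
sub first_octets {
    my ($cp) = @_;
    die "code point must be below 0x100\n" if $cp > 0xFF;

    my $utf8 = chr $cp;
    utf8::encode($utf8);    # in place: characters -> UTF-8 octets

    return ($cp, ord(substr($utf8, 0, 1)));
}

# Hypothetical helper: build a 256-bit "possible first octet" bitmap that
# registers BOTH representations of every starting code point.
sub build_bitmap {
    my @code_points = @_;
    my $bitmap = "\0" x 32;    # 256 bits, all clear

    for my $cp (@code_points) {
        vec($bitmap, $_, 1) = 1 for first_octets($cp);
    }
    return $bitmap;
}

# After the common prefix "a" is extracted from (?:a|a\x{E4}), the only
# non-empty alternative left starts with \x{E4}.
my $bitmap = build_bitmap(0xE4);

printf "0x%02X: %s\n", $_, vec($bitmap, $_, 1) ? "set" : "clear"
    for 0xC3, 0xE4, 0x61;
```

With only 0xE4 registered, a utf8 string whose next octet is 0xC3 fails the
bitmap test even though the character really is \x{E4}, which is the sort of
mismatch the ticket describes.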
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"