On Tue, Oct 25, 2016 at 12:31:59PM +0200, demerphq wrote:
> On 25 October 2016 at 12:12, Dave Mitchell <davem@iabyn.com> wrote:
> > On Mon, Oct 24, 2016 at 03:57:15PM -0700, Tony Cook via RT wrote:
> >> On Sun Oct 23 21:48:55 2016, manabe.hiroshi@gmail.com wrote:
> >> > On 2016-10-23 21:44:35, manabe.hiroshi@gmail.com wrote:
> >> > > On 2016-10-23 21:23:20, manabe.hiroshi@gmail.com wrote:
> >> > > > You can reproduce the bug with the following procedure:
> >> > > > 1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
> >> > > > 2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print
> >> > > > m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
> >> > > > Output: not matched
> >> > > >
> >> > > > This happens only when the string is read from a file handle and the
> >> > > > second character is in the range of \x{80}-\x{ff}.
> >> > > > Curiously enough, the match succeeds if the regexp is
> >> > > > m{^a|a[\x{e3}-\x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is
> >> > > > m{^a|a[\x{e4}-\x{e4}]$}.
> >> > >
> >> > > Sorry, the bug only reproduces itself when there is a set of
> >> > > parentheses, i.e. m{^(a|a\x{e4})$} etc.
> >> >
> >> > Sorry again, the correct unicode option for the step 2 was -Ci.
> >>
> >> The string doesn't need to be from a file:
> >>
> >> $ ./perl -e '$_ = "a\xE4"; utf8::upgrade($_); print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
> >> not matched
> >>
> >> (blead perl)
> >>
> >> The match is failing around line 5611 of regexec.c:
> >>
> >>     if ( trie->bitmap
> >>          && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr)))
> >>     {
> >>         if (trie->states[ state ].wordnum) {
> >>             DEBUG_EXECUTE_r(
> >>                 Perl_re_exec_indentf( aTHX_ "%smatched empty string...%s\n",
> >>                     depth, PL_colors[4], PL_colors[5])
> >>             );
> >>
> >> At this point nextchr has the first byte of the UTF-8 encoded \xE4 (0xc3).
> >
> > I'm looking into this as we speak.
>
> I was going to look into it later as well. Let me know how far you get.

Not far as it turns out.
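For reference, here is a small standalone sketch of the bytes involved in Tony's observation that nextchr is 0xc3 at that point. It is not taken from regexec.c: byte_bitmap, bitmap_set, bitmap_test, utf8_encode_small and lead_byte_lookup_hits are all hypothetical names made up for illustration. It only shows that a byte-indexed bitmap lookup on the UTF-8 lead byte misses if the bit was set under the raw code point 0xE4, and it makes no claim that this is what the trie code actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 256-bit bitmap, one bit per byte value. Hypothetical names,
 * not the trie structures used in regexec.c. */
typedef struct { uint8_t bits[32]; } byte_bitmap;

static void bitmap_set(byte_bitmap *bm, uint8_t b)
{
    bm->bits[b >> 3] |= (uint8_t)(1u << (b & 7));
}

static int bitmap_test(const byte_bitmap *bm, uint8_t b)
{
    return (bm->bits[b >> 3] >> (b & 7)) & 1;
}

/* Encode a code point below 0x800 as UTF-8; returns the byte count (1 or 2). */
static int utf8_encode_small(unsigned cp, uint8_t out[2])
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        out[1] = 0;
        return 1;
    }
    out[0] = (uint8_t)(0xC0 | (cp >> 6));   /* lead byte */
    out[1] = (uint8_t)(0x80 | (cp & 0x3F)); /* continuation byte */
    return 2;
}

/* Set one bit for code point cp (<= 0xFF), keyed either by the raw code
 * point or by its UTF-8 lead byte, then test the lead byte the way a
 * matcher walking an upgraded string would see it.
 * Returns 1 on a hit, 0 on a miss. */
static int lead_byte_lookup_hits(unsigned cp, int keyed_by_lead_byte)
{
    byte_bitmap bm;
    uint8_t enc[2];

    assert(cp <= 0xFF);
    memset(&bm, 0, sizeof bm);
    utf8_encode_small(cp, enc);
    bitmap_set(&bm, keyed_by_lead_byte ? enc[0] : (uint8_t)cp);
    return bitmap_test(&bm, enc[0]);
}
```

With these helpers, utf8_encode_small(0xE4, enc) yields the two bytes c3 a4; lead_byte_lookup_hits(0xE4, 0) returns 0 while lead_byte_lookup_hits(0xE4, 1) returns 1. For an ASCII character such as 0x64 both calls return 1, since the code point and its UTF-8 encoding are the same byte.
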
The failing code has TRIE_BITMAP_TEST() returning false, while good code
like

    $_ = "a\x64"; print "match\n" if m{^(a|a\x{64})$};

doesn't. At which point I got distracted and haven't looked further.
You're probably a better choice than me to take this further :-)

-- 
The Enterprise's efficient long-range scanners detect a temporal vortex
distortion in good time, allowing it to be safely avoided via a minor
course correction.
    -- Things That Never Happen in "Star Trek" #21