[perl #108638] Re: "the Unicode bug", reversed?
From:
karl williamson
Date:
January 19, 2012 12:34
Subject:
[perl #108638] Re: "the Unicode bug", reversed?
Message ID:
rt-3.6.HEAD-14510-1327005239-64.108638-75-0@perl.org
# New Ticket Created by karl williamson
# Please include the string: [perl #108638]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=108638 >
On 09/06/2011 09:28 AM, Tom Christiansen wrote:
> Summary: If you use -E, matches fail that work fine under -e. This is
> in some sense the opposite of the Unicode bug, which normally
> works the other way around.
>
> Matthew Barnett, who is implementing full casefolding in Python,
> initially reported to me these Perl bugs:
>
> However, these match:
>
> "\N{LATIN SMALL LETTER SHARP S}" =~ /ss/i
> "\N{LATIN SMALL LIGATURE LONG S T}" =~ /st/i
> "\N{LATIN SMALL LIGATURE ST}" =~ /st/i
> "\N{LATIN SMALL LETTER SHARP S}t" =~ /sst/i
>
> but these don't match:
>
> "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i
> "s\N{LATIN SMALL LIGATURE ST}" =~ /sst/i
>
> I think what might be happening is that it isn't handling the
> possibility of overlapping full case-folding.
>
> When it sees "sst" in the regex it identifies "ss" as a possible result
> of full case-folding and so adds the unfolded alternative:
>
> ss => ss|\N{LATIN SMALL LETTER SHARP S}
>
> but it then doesn't identify "st" as another possible result of full
> case-folding, so it doesn't add the unfolded alternative (either of
> them, in fact):
>
> st => st|\N{LATIN SMALL LIGATURE ST}
>
> It should be doing:
>
> sst => sst|\N{LATIN SMALL LETTER SHARP S}t|s\N{LATIN SMALL LIGATURE ST}
>
> (Again, I'm ignoring the other alternative.)
>
> And it is indeed true that those two test cases fail, under both 5.14 and blead:
>
> This is perl 5, version 14, subversion 0 (v5.14.0) built for darwin-2level
>
> This is perl 5, version 15, subversion 2 (v5.15.2-264-g87e4a53) built for darwin-2level
>
> As shown here:
>
> % perl -Mcharnames=:full -lE 'print "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
> Fail
> % perl -Mcharnames=:full -lE 'print "s\N{LATIN SMALL LIGATURE ST}" =~ /sst/i ? "Pass" : "Fail"'
> Fail
>
> However, merely change the -E to a -e, and suddenly they work!
>
> % perl -Mcharnames=:full -le 'print "s\N{LATIN SMALL LIGATURE LONG S T}" =~ /sst/i ? "Pass" : "Fail"'
> Pass
> % perl -Mcharnames=:full -le 'print "s\N{LATIN SMALL LIGATURE ST}" =~ /sst/i ? "Pass" : "Fail"'
> Pass
>
> So it looks like this is some reverse Unicode bug. Very strange.
>
> For the record, Ruby does get these right:
>
> % ruby -le 'print "s\uFB05" =~ /sst/i ? "Pass" : "Fail"'
> Pass
> % ruby -le 'print "s\uFB06" =~ /sst/i ? "Pass" : "Fail"'
> Pass
>
> Where that is:
>
> % ruby -v
> ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]
>
> Here are other, probably related issues:
>
> % perl -lE 'print "\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
> Pass
> % perl -lE 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
> Fail
> % blead -lE 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
> Fail
>
> However, unlike the earlier attempts, *those* do *not* suddenly pass if
> you use -e instead of -E:
>
> % perl -le 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
> Fail
> % blead -le 'print "\x{DF}\x{FB05}" =~ /st/i ? "Pass" : "Fail"'
> Fail
>
> See; it still fails. Very strange. They work fine in Ruby:
>
> % ruby -le 'print "\uFB05" =~ /st/i ? "Pass" : "Fail"'
> Pass
> % ruby -le 'print "\u00DF\uFB05" =~ /st/i ? "Pass" : "Fail"'
> Pass
>
> Like Perl, Ruby does *not* do partial matches of full casefolds
> (I don't think the idea makes sense), so it's not like it's going
> totally overboard with full casefolding:
>
> % perl -lE 'print "\x{DF}\x{FB05}" =~ /ssst/i ? "Pass" : "Fail"'
> Pass
> % ruby -le 'print "\u00DF\uFB05" =~ /ssst/i ? "Pass" : "Fail"'
> Pass
>
> % perl -lE 'print "\x{DF}\x{FB05}" =~ /sst/i ? "Pass" : "Fail"'
> Fail
> % ruby -le 'print "\u00DF\uFB05" =~ /sst/i ? "Pass" : "Fail"'
> Fail
>
> Which is as expected. The others aren't.
>
> --tom
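The overlapping-fold analysis quoted above can be checked without going through the regex engine at all. The following is only a sketch, assuming Perl 5.16 or later (where fc() is available); the helper same_fold() and the variable names are made up for illustration and are not part of the ticket. It folds the three characters from the report, compares whole strings after folding, and shows why /sst/ against SHARP S followed by LIGATURE LONG S T would be a partial match.

```perl
use v5.16;      # enables strict plus the fc and unicode_strings features
use warnings;

# Full case folds of the three characters named in the report.
my %fold = (
    'LATIN SMALL LETTER SHARP S'    => fc("\x{DF}"),
    'LATIN SMALL LIGATURE LONG S T' => fc("\x{FB05}"),
    'LATIN SMALL LIGATURE ST'       => fc("\x{FB06}"),
);

# Hypothetical helper: two strings are caseless-equal when their
# full case folds are identical.
sub same_fold {
    my ($x, $y) = @_;
    return fc($x) eq fc($y) ? 1 : 0;
}

# "sst" can arise as ss+t (SHARP S, then t) or as s+st (s, then a
# ligature); a matcher has to consider both ways of splitting it.
my @results = (
    same_fold("\x{DF}t",        "sst"),
    same_fold("s\x{FB05}",      "sst"),
    same_fold("s\x{FB06}",      "sst"),
    same_fold("\x{DF}\x{FB05}", "ssst"),
);

# In the fold of SHARP S + LIGATURE LONG S T, "sst" first appears at
# offset 1, which is inside the two-character fold of SHARP S. A match
# there would start in the middle of a character: a partial match.
my $partial_offset = index(fc("\x{DF}\x{FB05}"), "sst");

print join(",", @results), "\n";
print "offset of sst in folded string: $partial_offset\n";
```

Comparing folded strings is a different technique from what the regex compiler does internally; it only confirms which of the reported cases ought to be caseless matches.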
I believe that these are all now fixed in blead.
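For anyone reproducing the -E versus -e difference from the report: -E behaves like -e but implicitly enables the optional features, and that set includes unicode_strings. Whether that feature was the trigger for the failures in this ticket is an assumption here, not something stated in the report. The sketch below only shows the ordinary direction of "the Unicode bug", where a one-byte, non-UTF-8 string gets Unicode semantics only when the feature is on; the variable names are illustrative.

```perl
use strict;
use warnings;

# A one-character string holding byte 0xDF (SHARP S in Latin-1).
# It is not stored internally as UTF-8.
my $sharp_s = "\xDF";

my ($without, $with);
{
    # Roughly what a plain -e one-liner sees: native byte semantics.
    no feature 'unicode_strings';
    $without = $sharp_s =~ /\w/ ? "Pass" : "Fail";
}
{
    # Roughly what -E adds for this purpose: Unicode semantics for
    # all strings, whatever their internal encoding.
    use feature 'unicode_strings';
    $with = $sharp_s =~ /\w/ ? "Pass" : "Fail";
}

print "without unicode_strings: $without\n";
print "with unicode_strings:    $with\n";
```

The cases in this ticket went the other way round (failing with -E, passing with -e), which is why the subject line calls it reversed.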