
Re: [perl #80030] Matching upper ASCII characters from file in REpatterns

From: karl williamson
Date: December 8, 2010 21:58
Subject: Re: [perl #80030] Matching upper ASCII characters from file in REpatterns
Message ID: 4D006FFC.1010804@khwilliamson.com

SADAHIRO Tomoyuki wrote:
> On Tue, 30 Nov 2010 21:26:20 -0500
> Eric Brine <ikegami@adaelis.com> wrote:
> 
>> On Tue, Nov 30, 2010 at 4:57 PM, Jonathan Pool <perlbug-followup@perl.org> wrote:
> 
>>     print ('3. The NBS is ' . (/[\x7f-\x80]/ ? '' : 'NOT ') . 'matched by /[\7f-\x80]/' . "\n");
>>
>>
>>> With "use encoding 'utf8'" (or with both pragmas), [...] patterns 4, 5, and
>>> 6 fail instead of matching.
>>>
>> I'm not sure if that's a bug, or if it's broken by design.
>>
>> - Eric
> 
> This seems explainable and is perhaps not a bug.
> 
> According to POD of encoding.pm, 
> (see http://search.cpan.org/~dankogai/Encode-2.40/encoding.pm )
> "\xDF" under use encoding "iso 8859-7" is \x{3af} in Unicode
> "\xA4\xA1" under use encoding "euc-jp" is  \x{3041} in Unicode

I understand the rest of this post, but I don't understand the 
relevance of 8859-7 and euc-jp to the discussion.  Please enlighten me.
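
For reference, the two mappings quoted from the encoding.pm POD can be checked without the pragma by decoding the same bytes with the Encode module. This is only a sketch: it assumes Encode and its ISO 8859-7 and EUC-JP tables are available (they ship with core Perl), and the variable names are illustrative, not taken from the bug report.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Interpret raw bytes according to a named encoding, which is what the
# POD examples describe for string literals under "use encoding".
my $greek = decode('iso-8859-7', "\xDF");      # one byte in ISO 8859-7
my $kana  = decode('euc-jp',     "\xA4\xA1");  # two bytes in EUC-JP

# Print the resulting Unicode code points.
printf "iso-8859-7 \\xDF      -> U+%04X\n", ord($greek);
printf "euc-jp     \\xA4\\xA1 -> U+%04X\n", ord($kana);
```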
> 
> Then, under use encoding "utf8", U+00A0 in Unicode should be "\xC2\xA0".
> Use of "\xA0" expecting U+00A0 is wrong.
> 
> The reason why /[\x7F-\x80]/ matches U+00A0 is that /[\x7F-\x{FFFD}]/
> matches U+00A0, as \x80 is malformed as utf8 and is replaced with \x{FFFD}.

I looked at some debug info, and see that you are correct.  Jonathan, 
you said that the encoding was utf8, but \x80 is not a legal 
utf8-encoded character.  Perl should nonetheless have warned that it 
was substituting U+FFFD.
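
The explanation above can be reproduced in isolation with Encode rather than the encoding pragma. This is a rough sketch under that assumption, not the code path the pragma itself takes; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 (NO-BREAK SPACE) occupies two bytes in UTF-8: C2 A0.
my $nbsp_bytes = encode('UTF-8', "\x{A0}");
printf "U+00A0 as UTF-8: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $nbsp_bytes;

# A lone \x80 byte is malformed UTF-8. With the default CHECK value,
# Encode substitutes U+FFFD (REPLACEMENT CHARACTER) for it.
my $decoded = decode('UTF-8', "\x80");
printf "lone \\x80 decodes to U+%04X\n", ord($decoded);

# Once \x80 has become U+FFFD, the class is effectively [\x7F-\x{FFFD}],
# and that code point range contains U+00A0.
my $matches = "\x{A0}" =~ /[\x7F-\x{FFFD}]/ ? 1 : 0;
print "U+00A0 matched by [\\x7F-\\x{FFFD}]: ", ($matches ? "yes" : "no"), "\n";
```

Passing a CHECK value such as Encode::FB_CROAK or Encode::FB_WARN to decode() makes the malformed byte visible instead of silently replacing it.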
> 
> Regards,
> SADAHIRO Tomoyuki
> 

