
Re: [perl #80030] Matching upper ASCII characters from file in RE patterns

From: SADAHIRO Tomoyuki
Date: December 10, 2010 19:36
Subject: Re: [perl #80030] Matching upper ASCII characters from file in RE patterns
Message ID: 20101211123609.9529.CB027F2D@nifty.com

On Sat, 11 Dec 2010 11:47:55 +0900
SADAHIRO Tomoyuki <bqw10602@nifty.com> wrote:

> > Could the latter representation (\xc2\x80) appear in a regular-expression character class, too?
> 
> Could with perl 5.8.0, 5.8.1, 5.8.3, 5.8.8.
> Cannot with perl 5.8.9, 5.10.0, 5.10.1.
> (I didn't run with other versions.) 

Sadly, this is broken inside a character class in every version I tested:
[\xHH\xHH] is not interpreted as the multi-octet character "\xHH\xHH"
under use encoding "a-multi-octet-encoding".

In older perls (up to 5.8.8),
  /\xC2\xA0/ matched only U+00A0. (by design)
  /\xE1\x80\x80/ matched only U+1000. (by design)
  /[\xC2\xA0]/ matched U+00C2 and U+00A0. (broken)
  /[\xE1\x80\x80]/ matched U+00E1 and U+0080. (broken)

In newer perls (from 5.8.9 on),
  /\xC2\xA0/ matches only "\x{FFFD}"x2. (broken)
  /\xE1\x80\x80/ matches only "\x{FFFD}"x3. (broken)
  /[\xE1\x80\x80]/ and /[\xC2\xA0]/ match only U+FFFD. (broken)
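
For reference, the code points that those octet sequences stand for in UTF-8 can be checked independently of the regex engine. The following is only a minimal sketch using the core Encode module, without use encoding; the helper name code_points is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode an octet string as UTF-8 and
# list the resulting code points as "U+XXXX".
sub code_points {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

print code_points("\xC2\xA0"), "\n";      # U+00A0
print code_points("\xE1\x80\x80"), "\n";  # U+1000
```

These are the single characters that the "by design" cases above correspond to.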

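#!perl
use strict;
use warnings;
use charnames ':full';
use encoding 'UTF-8';  # \xHH octets in literals are read as UTF-8
print "perl $]\n";

my $u00e1 = "\N{LATIN SMALL LETTER A WITH ACUTE}"; # U+00E1

# "\xE1\x80\x80" in a string literal should be U+1000.
print "string-eq: ";
print "a\x{1000}z" eq "a\xE1\x80\x80z" ? "ok\n" : "not ok\n";
# The same octets outside a character class should match U+1000.
print "reg-exact: ";
print "a\x{1000}z" =~ /a\xE1\x80\x80z/ ? "ok\n" : "not ok\n";
# Inside a character class they should still mean U+1000 only,
print "reg-class: ";
print "a\x{1000}z" =~ /a[\xE1\x80\x80]z/ ? "ok\n" : "not ok\n";
# so the class must match neither U+00E1 (first octet taken alone)
print "  vs 00E1: ";
print "a${u00e1}z" !~ /a[\xE1\x80\x80]z/ ? "ok\n" : "not ok\n";
# nor U+FFFD (the replacement character).
print "  vs FFFD: ";
print "a\x{FFFD}z" !~ /a[\xE1\x80\x80]z/ ? "ok\n" : "not ok\n";
__END__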

perl 5.008
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008001
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008003
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008008
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008009
string-eq: ok
reg-exact: not ok
reg-class: not ok
  vs 00E1: ok
  vs FFFD: not ok

perl 5.010000
string-eq: ok
reg-exact: not ok
reg-class: not ok
  vs 00E1: ok
  vs FFFD: not ok

perl 5.010001
string-eq: ok
reg-exact: not ok
reg-class: not ok
  vs 00E1: ok
  vs FFFD: not ok
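
As a point of comparison (not something proposed in this thread), writing the code point directly as \x{1000} inside the class avoids the octet interpretation altogether. A minimal sketch, assuming the script does not load use encoding at all; the helper name class_matches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string matches a class
# that contains only U+1000, written as a code point escape.
sub class_matches {
    my ($str) = @_;
    return $str =~ /a[\x{1000}]z/ ? 1 : 0;
}

print class_matches("a\x{1000}z"), "\n";  # 1: U+1000 is in the class
print class_matches("a\x{E1}z"), "\n";    # 0: U+00E1 is not
print class_matches("a\x{FFFD}z"), "\n";  # 0: U+FFFD is not
```

This only shows the behavior the three class tests above were hoping for; whether it helps under use encoding is not covered here.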

