SADAHIRO Tomoyuki wrote:
> On Tue, 30 Nov 2010 21:26:20 -0500
> Eric Brine <ikegami@adaelis.com> wrote:
>
>> On Tue, Nov 30, 2010 at 4:57 PM, Jonathan Pool <perlbug-followup@perl.org>wrote:
>
>> print ('3. The NBS is ' . (/[\x7f-\x80]/ ? '' : 'NOT ') . 'matched by
>> /[\7f-\x80]/' . "\n");
>>
>>
>>> With "use encoding 'utf8'" (or with both pragmas), [...] patterns 4, 5, and
>>> 6 fail instead of matching.
>>>
>> I'm not sure if that's a bug, or if it's broken by design.
>>
>> - Eric
>
> This seems be able to be explained and perhaps not a bug.
>
> According to POD of encoding.pm,
> (see http://search.cpan.org/~dankogai/Encode-2.40/encoding.pm )
> "\xDF" under use encoding "iso 8859-7" is \x{3af} in Unicode
> "\xA4\xA1" under use encoding "euc-jp" is \x{3041} in Unicode

I understand the rest of this post, but I don't understand the relevance
of 8859-7 and euc-jp to the discussion. Please enlighten me.

>
> Then, under use encoding "utf8", U+00A0 in Unicode should be "\xC2\xA0".
> Use of "\xA0" expecting U+00A0 is wrong.
>
> The reason why /[\x7F-\x80]/ matches U+00A0 is that /[\x7F-\x{FFFD}]/
> matches U+00A0 as \x80 is malform as utf8 and replaced with \x{FFFD}.

I looked at some debug info and see that you are correct. Jonathan, you
said that the encoding was utf8, but \x80 is not a legal utf8-encoded
character. Perl should, however, have warned that it was substituting
U+FFFD.

>
> Regards,
> SADAHIRO Tomoyuki
>
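
The substitution described above can be reproduced without the encoding pragma. This is only a minimal sketch: it calls Encode's decode() directly, on the assumption that "use encoding 'utf8'" applies an equivalent lenient decoding to byte escapes in string and pattern literals. The variable names are illustrative and do not come from the original report.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: decode() is called without a CHECK argument, so malformed
# input is replaced with U+FFFD instead of raising an error.
my $lone_bytes = "\x80";       # a lone continuation byte, malformed as UTF-8
my $nbsp_bytes = "\xC2\xA0";   # the well-formed UTF-8 encoding of U+00A0

my $lone = decode('UTF-8', $lone_bytes);
my $nbsp = decode('UTF-8', $nbsp_bytes);

printf "\\x80 decodes to U+%04X\n", ord $lone;       # U+FFFD
printf "\\xC2\\xA0 decodes to U+%04X\n", ord $nbsp;  # U+00A0

# Once the upper bound of the class has become U+FFFD, the range covers U+00A0.
my $matched = $nbsp =~ /[\x7F-\x{FFFD}]/ ? 1 : 0;
print "The NBS is ", ($matched ? '' : 'NOT '), "matched by /[\\x7F-\\x{FFFD}]/\n";
```

If the pragma does decode literals this way, it would account for both observations in the thread: "\xA0" alone is not U+00A0 under utf8, and /[\x7F-\x80]/ silently turns into a much wider range.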