develooper Front page | perl.perl5.porters | Postings from December 2010

Re: [perl #80030] Matching upper ASCII characters from file in RE patterns

Jonathan Pool
December 10, 2010 18:14
> Jonathan, you said that the encoding was utf8, but \x80 is not a legal utf8-encoded character.  But it should have warned that it was substituting FFFD.

The script reads a line from a UTF-8-encoded file into a Perl scalar.

It then operates on the scalar.

In man perlunicode, one reads:

"Unless explicitly stated, Perl operators use character semantics for Unicode data and byte semantics for non-Unicode data. ... Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is logically just a number ranging from 0 to 2**31 or so. The Unicode code for the desired character, in hexadecimal, should be placed in the braces. For instance, a smiley face is "\x{263A}". This encoding scheme only works for all characters ...."

This documentation tells me that the way to refer to a Unicode character (once it is in a string that has been assigned to a Perl scalar) is by its Unicode codepoint, not by its UTF-8 encoding. A white smiling face has codepoint U+263a, but it has UTF-8 encoding e298ba. The documentation tells me to refer to that character with \x{263a}, not with \x{e298ba}.

As you say, \x80 is not a legal UTF-8 encoding, but it is a legal (even though unnamed) Unicode character codepoint. So on the basis of the documentation I would expect Perl to recognize it as such and not to convert \x80 to \x{fffd}.


perl -wE 'binmode STDOUT, ":utf8"; use utf8; say "\x{263a}";'
perl -wE 'binmode STDOUT, ":utf8"; use utf8; say "\x{e298ba}";'

If I'm mistaken about any of the above, I'll be grateful to be corrected.

Thanks for your help.
