> Jonathan, you said that the encoding was utf8, but \x80 is not a legal utf8-encoded character. But it should have warned that it was substituting FFFD.

The script reads a line from a UTF8-encoded file into a Perl scalar. It then operates on the scalar. In man perlunicode, one reads:

"Unless explicitly stated, Perl operators use character semantics for Unicode data and byte semantics for non-Unicode data. ... Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is logically just a number ranging from 0 to 2**31 or so. The Unicode code for the desired character, in hexadecimal, should be placed in the braces. For instance, a smiley face is "\x{263A}". This encoding scheme only works for all characters ...."

This documentation tells me that the way to refer to a Unicode character (once it is in a string that has been assigned to a Perl scalar) is by its Unicode codepoint, not by its UTF-8 encoding. A white smiling face has codepoint U+263a, but it has UTF-8 encoding e298ba. The documentation tells me to refer to that character with \x{263a}, not with \x{e298ba}.

As you say, \x80 is not a legal UTF-8 encoding, but it is a legal (even though unnamed) Unicode character codepoint. So on the basis of the documentation I would expect Perl to recognize it as such and not to convert \x80 to \x{fffd}.

Illustration:

    perl -wE 'binmode STDOUT, ":utf8"; use utf8; say "\x{263a}";'
    ☺

    perl -wE 'binmode STDOUT, ":utf8"; use utf8; say "\x{e298ba}";'
    ?????

If I'm mistaken about any of the above, I'll be grateful to be corrected.

Thanks for your help.
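
The distinction drawn above between a code point and its UTF-8 encoding can be checked with a short script. This is only a sketch, assuming the core Encode module is available; the variable names are illustrative and not taken from the script under discussion. It shows that "\x{80}" inside a Perl string is the character U+0080, while a lone byte 0x80 arriving as UTF-8 input is malformed and is replaced by U+FFFD when decoded with the default settings.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A one-character string holding code point U+263A.
my $smiley = "\x{263A}";

# Its UTF-8 encoding is the three bytes E2 98 BA.
my $smiley_hex = unpack('H*', encode('UTF-8', $smiley));

# "\x{80}" in a string literal is the character U+0080;
# encoded as UTF-8 it becomes the two bytes C2 80.
my $u80_hex = unpack('H*', encode('UTF-8', "\x{80}"));

# A lone byte 0x80 in UTF-8 *input* is malformed; with the default
# CHECK value, decode() substitutes U+FFFD for it.
my $decoded    = decode('UTF-8', "\x80");
my $decoded_cp = sprintf('U+%04X', ord($decoded));

# 0xE298BA read as a code point lies above the Unicode maximum,
# so "\x{e298ba}" does not name the smiley at all.
my $beyond_unicode = 0xE298BA > 0x10FFFF ? 1 : 0;

print "smiley as UTF-8 bytes: $smiley_hex\n";
print "U+0080 as UTF-8 bytes: $u80_hex\n";
print "byte 0x80 decoded as:  $decoded_cp\n";
print "0xE298BA beyond U+10FFFF: $beyond_unicode\n";
```

Whether a stray 0x80 in the script's input ends up as U+0080 or as U+FFFD would then depend on how the bytes were read (for example, through a decoding layer or as raw bytes), which the sketch does not try to reproduce.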