Jan Dubois wrote 2008-05-18 8:40 (-0700):
> No, "\xff" is guaranteed to have byte semantics for backwards compatibility:

"byte semantics" is a dangerous term, partly because different people use
it for different things. Some people use it to refer to functions and
operators acting on the bytes in the PV's buffer regardless of the SvUTF8
flag's state, but those functions are generally broken and in need of
repair (as announced in perl5100delta, this would break compatibility).

By default, "\xff" by itself will indeed create a string that
*internally* is a single byte 0xff.

A Perl string is a Unicode string. Or actually, a sequence of almost
arbitrary integer values that most operations ought to interpret as
Unicode codepoints. If it contains only characters < 256, it may be
"encoded as latin1" (represented as 8 bit with a straight mapping)
internally, both for efficiency and for backwards compatibility.

When strings are sent or received with system calls, that has to occur in
bytes. If a string only contains characters < 256, it can be used as a
byte string. (Note: I originally believed otherwise and was wrong.)

Still, it can be useful to write your program so that a string that will
be used as a byte string is never upgraded to UTF8 *internally*:
upgrading and downgrading it again might be a performance issue.

There should be no difference in semantics, regardless of the internal
encoding of the string. It is a bug that there is.

I believe that this snippet:

> perluniintro.pod:
> | Note that C<\x..> (no C<{}> and only two hexadecimal digits), C<\x{...}>,
> | and C<chr(...)> for arguments less than C<0x100> (decimal 256)
> | generate an eight-bit character for backward compatibility with older
> | Perls. For arguments of C<0x100> or more, Unicode characters are
> | always produced. If you want to force the production of Unicode
> | characters regardless of the numeric value, use C<pack("U", ...)>
> | instead of C<\x..>, C<\x{...}>, or C<chr()>.
is misleading. It suggests that Perl has two kinds of strings
technically, which is not true. There is a single string type with two
*internal* representations. The word *internal* is notably missing in the
quoted part of perluniintro.

Let's change "generate an eight-bit character" to "generate a string that
has an eight-bit encoding internally". In any case, CHARACTERS DO NOT
HAVE BITS. Bytes have 8 bits, characters just have a number.

> As perluniintro.pod above points out, the only reliable way to do this
> is pack("U", $codepoint). Or you can use named characters using
> charnames.pm.

If the remaining bugs in Perl (see also Unicode::Semantics) are fixed,
then there is no longer any *need* for forcing the internal encoding to
UTF8.

That said, I think that pack("U", $codepoint) is not a very good idea.
Without digressing into details, I would like to point out that it's
usually better to associate the upgrade with the buggy operator, rather
than with the string itself. So instead of:

    my $char = pack("U", $codepoint);
    ...  # perhaps lots of code here
    my $uc = uc($char);

I would suggest using:

    my $char = chr($codepoint);
    ...  # perhaps lots of code here
    utf8::upgrade($char);  # work around bug
    my $uc = uc($char);

> [perluniintro]
> | Internally, Perl currently uses either whatever the native eight-bit
> | character set of the platform (for example Latin-1) is

This is simply not true. Perl uses either latin1 or ebcdic for its
internally eight-bit strings. Not Windows-1252, for example.

> | defaulting to UTF-8, to encode Unicode strings.

defaulting to UTF-8, WITH A WARNING, for strings that could not be
downgraded, i.e. strings that contain characters > 255. The warning is
there for a reason: it says you're doing it wrong. You're forcing a
byte-incompatible string on a byte operation (system call), and forgot to
encode.
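
To make the "one string type, two internal representations" point concrete, here is a minimal sketch. It assumes a Perl where the unicode_strings feature is not enabled (so the uc() inconsistency discussed above is still visible), and uc_unicode is a hypothetical helper name used only for illustration, not something from an existing module.

```perl
use strict;
use warnings;

# The same single character, U+00E9, built the same way twice.
my $native   = chr(0xE9);   # stored internally as one byte
my $upgraded = chr(0xE9);
utf8::upgrade($upgraded);   # same character, now stored as UTF-8 internally

# utf8::is_utf8 reports the internal flag only; it says nothing about content.
printf "flags: native=%d upgraded=%d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);

# As strings, the two are indistinguishable: equal, and one character long.
print "equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "lengths: ", length($native), " ", length($upgraded), "\n";

# Hypothetical helper: tie the upgrade to the operator that needs it,
# instead of to the place where the string was created.
sub uc_unicode {
    my ($s) = @_;           # copy, so the caller's string is left untouched
    utf8::upgrade($s);      # work around bug
    return uc($s);
}

printf "uc: U+%04X\n", ord(uc_unicode($native));
```

With the upgrade placed inside the helper, the caller's string keeps its eight-bit internal encoding, so it can still be handed to a byte operation without a downgrade.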
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales@convolution.nl>
1;