
Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}

From: Nick Ing-Simmons
Date: November 14, 2000 10:18
Subject: Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}
Message ID: E13vkfC-00048G-00@serv1.is1.u-net.net

Andrew McNaughton <andrew@tki.org.nz> writes:
>> It is an 8-bit value - that is the UNICODE codepoint is < 256.
>
>The unicode codepoint may be less than 256, but in utf8 2 byte characters
>start from codepoint 128, not 256.

The design we are using for perl5.6+ is that perl strings are sequences 
of UNICODE characters. The fact that they _may_ be represented internally
as UTF-8 encoded bytes is supposed to be transparent to the perl programmer.
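
As a rough illustration of what "transparent" means here, the sketch below
stores the same character both ways and compares the results. It assumes a
modern perl (5.8.1 or later) where utf8::upgrade() and utf8::is_utf8() are
available; neither is named in this thread, they are just a convenient way
to force and inspect the internal form.

```perl
use strict;
use warnings;

# One character, code point 0xE9, which perl normally stores as a single byte.
my $native = chr(0xE9);

# Copy it and force the copy into the UTF-8 internal representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The internal representations differ ...
my $flags_differ = !utf8::is_utf8($native) && utf8::is_utf8($upgraded);

# ... but at the Perl level the two strings should be indistinguishable.
my $same_length = length($native) == length($upgraded);
my $same_ord    = ord($native) == ord($upgraded);
my $same_string = $native eq $upgraded;

print "internal forms differ: ", ($flags_differ ? "yes" : "no"), "\n";
print "same length:           ", ($same_length  ? "yes" : "no"), "\n";
print "same ord():            ", ($same_ord     ? "yes" : "no"), "\n";
print "compare equal:         ", ($same_string  ? "yes" : "no"), "\n";
```

Code that only uses length(), ord(), eq and friends never needs to know
which form it was handed.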

>
>"use utf8" is generally only required for string literals, but in that
>particular module, there is a "use bytes" statement at the top. 

use bytes; 

has very peculiar behaviour (by design). It says to expose the perl internal
representation to the perl programmer - but makes NO promises as to what 
that representation will be. There are two possible ones:
  - simple bytes 0..255
  - UTF-8 encoded chars

You can use various functions in the Encode.pm module (development track)
to find out which you got. 
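
A minimal sketch of that kind of inspection, assuming the Encode module as
it ships with current perls (Encode::is_utf8() may not match what the
development-track module offered at the time). The internal_length() helper
is hypothetical, made up for this example; it uses 'use bytes' only to peek
at the size of the internal buffer.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: number of bytes in perl's internal buffer for a string.
sub internal_length {
    my ($string) = @_;
    use bytes;               # lexically scoped to this sub body
    return length($string);
}

my $latin = chr(0xE9);       # code point below 256
my $wide  = chr(0x263A);     # code point above 255, cannot fit in one byte

for my $string ($latin, $wide) {
    printf "U+%04X: %d char, %d internal byte(s), UTF-8 encoded internally: %s\n",
        ord($string), length($string), internal_length($string),
        (Encode::is_utf8($string) ? "yes" : "no");
}
```

Both strings are one character long; only the byte count and the internal
flag tell the two representations apart.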

'use bytes' has extremely specialized and possibly transitional-only uses.

'no utf8' is perhaps a more meaningful construct.

That said, 'use bytes' is supposed to be harmless if all chars are < 256.
Its main use is to stop perl mis-interpreting binary data.

>Other than
>for string literals, perl generally assumes utf8.
>
>Output functions do need to know whether they are operating on (utf8)
>characters or bytes, and many do, including chr().

chr() is _not_ an output function. It does not need to know if it is working
on bytes, it _knows_ (by definition) it is working on characters.

print/write/send are output functions, as might be $ENV{'FOO'} = $string. 
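
To make the split concrete: the sketch below builds one character with
chr() and only picks a byte encoding at the point where bytes are needed.
It assumes the encode() function from the Encode module as found in current
perls; the choice of iso-8859-1 and UTF-8 as target encodings is just for
illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# chr() yields a character; no byte encoding has been chosen yet.
my $char = chr(0xE9);

# The encoding is decided when bytes are actually needed, e.g. for output.
my $as_latin1 = encode('iso-8859-1', $char);
my $as_utf8   = encode('UTF-8', $char);

printf "characters: %d\n", length($char);
printf "iso-8859-1: %d byte(s), hex %s\n", length($as_latin1), unpack('H*', $as_latin1);
printf "UTF-8:      %d byte(s), hex %s\n", length($as_utf8),   unpack('H*', $as_utf8);
```

The same single character becomes one byte or two depending on the encoding
asked for at the output stage, which is the point where that question
belongs.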

>
>From "perldoc perlunicode":
>
>       o   The chr() and ord() functions work on characters.
>           This is like pack("U") and unpack("U"), not like
>           pack("C") and unpack("C").  In fact, the latter are
>           how you now emulate byte-oriented chr() and ord()
>           under utf8.
>
>The problem is that chr() gets it wrong for utf8 from 128-255 under utf8
>character mode, 

chr() works as it is designed to work in perl5.7.0+.
That is, it can handle any UNICODE character. Characters 0..255 are
usually, but not always, represented as simple iso8859-1 bytes. 
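
One way to see the "not always" case, sketched for a modern perl (the
utf8::is_utf8() check is an assumption of this example, not something from
the thread): join a character below 256 to one above 255, then take it back
out again.

```perl
use strict;
use warnings;

my $e_acute = chr(0xE9);                 # normally held as one byte
my $mixed   = $e_acute . chr(0x263A);    # joined with a character above 255

my $first = substr($mixed, 0, 1);        # the same character, taken back out

printf "stand-alone: UTF-8 encoded internally? %s\n",
    (utf8::is_utf8($e_acute) ? "yes" : "no");
printf "extracted:   UTF-8 encoded internally? %s\n",
    (utf8::is_utf8($first) ? "yes" : "no");
printf "extracted is U+%04X and compares equal: %s\n",
    ord($first), ($first eq $e_acute ? "yes" : "no");
```

The extracted character is still code point 0xE9 and still compares equal
to the original, even though its internal form may have changed along the
way.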

>and arguably also with "use bytes" for values >= 256 which
>should probably produce a warning and an undef result.  

There is a _lot_ of history (=discussion, disagreement, resolution)
in what 'use bytes' should do.

I strongly suggest that you try and work with perl5-porters to make 
your code/modules work _without_ 'use bytes'.

-- 
Nick Ing-Simmons

