Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}

From: Andrew McNaughton
Date: November 14, 2000 06:10
Subject: Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}
Message ID: Pine.BSF.4.10.10011150218260.7938-100000@sub.internal.cwa.co.nz

On Tue, 14 Nov 2000, Nick Ing-Simmons wrote:

> Date: Tue, 14 Nov 2000 08:32:36 GMT
> From: Nick Ing-Simmons <nik@tiuk.ti.com>
> To: andrew@tki.org.nz
> Cc: perl5-porters@perl.org
> Subject: Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}
> 
> Andrew McNaughton <andrew@tki.org.nz> writes:
> >This is a bug report for perl from andrew@tki.org.nz,
> >generated with the help of perlbug 1.26 running under perl 5.006.
> >
> >
> >-----------------------------------------------------------------
> >[Please enter your report here]
> >
> >The following fails:
> >
> >use utf8;
> >use charnames ':full';
> >$text .= "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}";
> >
> >
> >This fails because of the final line of &charnames::charnames.  It returns an
> >8 bit value.
> 
> It is an 8-bit value - that is the UNICODE codepoint is < 256.

The Unicode code point may be less than 256, but in utf8, two-byte
characters start at code point 128, not 256.
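
To make that boundary concrete, here is a minimal sketch (not part of the
original report) that counts the bytes in the UTF-8 encoding of a few code
points on either side of 128 and 256. It assumes a perl with the Encode
module available; utf8_length is a hypothetical helper name used only for
illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes in the UTF-8 encoding of a single code point.
# (utf8_length is an illustrative helper, not an existing API.)
sub utf8_length {
    my ($codepoint) = @_;
    return length(encode('UTF-8', chr($codepoint)));
}

# 0xC4 is LATIN CAPITAL LETTER A WITH DIAERESIS, the character
# from the original bug report.
for my $cp (0x7F, 0x80, 0xC4, 0xFF, 0x100) {
    printf "U+%04X -> %d byte(s)\n", $cp, utf8_length($cp);
}
```

Code points up to 0x7F encode as one byte; 0x80 through 0xFF already need
two, the same as 0x100.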

> The problem is not with charnames as such, but rather
> the fact that perl's internal optimization of holding chars in range 0..255
> as single bytes is visible, and in particular there is as yet no way to
> tell perl that you want utf8 for _output_ ("use utf8" affects literal
> strings on _input_ and has one or two other "odd" effects).

"use utf8" is generally only required for string literals, but that
particular module (charnames) has a "use bytes" statement at the top. Other
than for string literals, perl generally assumes utf8.

Output functions do need to know whether they are operating on (utf8)
characters or bytes, and many do, including chr().

From "perldoc perlunicode":

       o   The chr() and ord() functions work on characters.
           This is like pack("U") and unpack("U"), not like
           pack("C") and unpack("C").  In fact, the latter are
           how you now emulate byte-oriented chr() and ord()
           under utf8.

The problem is that chr() gets it wrong for code points 128-255 under utf8
character mode: it returns a single byte rather than the two-byte utf8
sequence. Arguably it is also wrong under "use bytes" for values >= 256,
which should probably produce a warning and an undef result. As things
stand you get the value modulo 256.

{
 use utf8;
 print chr(256 + 65),"\n";
 print chr(128 + 65),"\n";  # wrong: comes out as a single byte
}
{
 use bytes;
 print chr(256 + 65),"\n";  # arguably wrong: value is taken modulo 256
 print chr(128 + 65),"\n";
}
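
Printed output depends on how the terminal interprets the bytes, so a
variant that reports code points instead may be easier to compare. The
following is only a sketch: the variable names are illustrative, and the
results describe whichever perl runs it, which need not match the perl
5.006 in the report.

```perl
use strict;
use warnings;

my ($char_wide, $char_high, $byte_wide, $byte_high);

{
    # Default character semantics.
    $char_wide = chr(256 + 65);
    $char_high = chr(128 + 65);
}
{
    # Byte semantics: chr() keeps only the low eight bits of its argument.
    use bytes;
    $byte_wide = chr(256 + 65);
    $byte_high = chr(128 + 65);
}

# Report each result as a number rather than as raw output.
printf "%-26s ord=%d\n", 'chr(321), characters:', ord($char_wide);
printf "%-26s ord=%d\n", 'chr(193), characters:', ord($char_high);
printf "%-26s ord=%d\n", 'chr(321), use bytes:',  ord($byte_wide);
printf "%-26s ord=%d\n", 'chr(193), use bytes:',  ord($byte_high);
```

If the "use bytes" result for 321 comes back as 65, that is the modulo 256
behaviour described above.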


Andrew McNaughton

--
Andrew McNaughton
Te Kete Ipurangi: The Online Learning Centre
andrew@tki.org.nz
Ph: 64 4 382 6500
Fax: 64 4 382 6509
Mobile: 021 323 076

PO Box 19-098
Wellington, NZ
http://www.tki.org.nz/
