Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}
From: Andrew McNaughton
Date: November 14, 2000 08:22
Subject: Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}
Message ID: Pine.BSF.4.10.10011150402580.7938-100000@sub.internal.cwa.co.nz
Here's a cleaned-up patch for charnames. It is a workaround for odd
behaviour in perl's chr function.
I found some references to chr's behaviour in the Changes file. It seems
that chr was at one point made to handle codes 128 .. 255, and that this
change was later pulled out. I can see reasons for this in terms of backwards
compatibility, but this eval trick really is a bit ugly. Perhaps we need
a widechr function?
I'm rather concerned by what's happening with the utf-8 implementation. By
trying to modify existing functions while retaining compatibility with
existing code, the semantics are getting muddled, and I expect this to
lead to a host of security problems. It is important that utf-8 text
should be cleanly utf-8. As soon as character sequences which are not
valid utf-8 start being processed by utf-8 text handlers, the ambiguities
will lead to a great many validation and security issues. I do understand
that this is difficult territory, but the only way to get through is with
a clean and consistent data model. In my view, introducing a perl-specific
text encoding scheme which behaves like utf-8 sometimes, but not at other
times, is a serious mistake.
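To make the "cleanly utf-8" point concrete, here is a minimal sketch of the difference between the two-byte UTF-8 form of a character in the 128..255 range and the same character held as a single byte. It assumes a current perl with the Encode module, which is not the 5.6 setup discussed in this thread; the helper name is_valid_utf8 is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00EB is below 256 but still needs two bytes in UTF-8 (C3 AB).
my $utf8_bytes = encode('UTF-8', "\x{EB}");
printf "U+00EB as UTF-8: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;

# The same character as one Latin-1 byte, followed by an ASCII letter.
# 0xEB announces a multi-byte sequence that never arrives, so this is
# not valid UTF-8.
my $latin1_bytes = "\xEB" . "x";

# Hypothetical helper: strict decoding. FB_CROAK makes decode() die on
# malformed input; LEAVE_SRC keeps the input buffer untouched.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

printf "two-byte form valid:    %d\n", is_valid_utf8($utf8_bytes);
printf "single-byte form valid: %d\n", is_valid_utf8($latin1_bytes);
```

A text handler that accepts both forms without checking is where the ambiguity described above comes from.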
Here's another bug, presumably arising from the same area of confusion:
use Data::Dumper;
{
    use utf8;
    print Dumper
        "\x{eb}",
        "\x{100}",
        "\x{eb}" . "\x{100}";
}
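The message does not show what output the snippet above produced. As a point of comparison, this is a small sketch of the result one would expect from the concatenation, printed as code points so it does not depend on the terminal. It assumes a current perl, not the 5.6 build under discussion, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $narrow = "\x{eb}";     # code point below 256
my $wide   = "\x{100}";    # code point above 255
my $joined = $narrow . $wide;

# Show each string as a character count plus its code points, rather
# than as raw bytes.
for my $str ($narrow, $wide, $joined) {
    printf "%d char(s): %s\n", length($str),
        join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}
```

With consistent character semantics, the joined string is two characters long, with the first still U+00EB.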
Andrew
--- charnames.pm.orig Tue Nov 14 15:04:03 2000
+++ charnames.pm Tue Nov 14 15:11:38 2000
@@ -38,7 +38,10 @@
my $fname = substr $txt, $off[0] + 2, $off[1] - $off[0] - 2;
die "Character 0x$hex with name '$fname' is above 0xFF";
}
- return chr $ord;
+ else {
+ use utf8;
+ return eval sprintf('"\x{%x}"',$ord);
+ }
}
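For readers who want to try the technique in the patch outside charnames.pm, here is a rough sketch. It wraps the eval form in a helper and compares it with chr and with pack("U"), which the perlunicode text quoted below names as the character-oriented counterpart. The helper names char_via_eval and char_via_pack are made up for illustration, and the sketch assumes a current perl, where the three forms should agree; the 5.6 behaviour reported in this thread is exactly that chr did not.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the patched return statement: build a
# "\x{...}" literal as source text and let eval turn it into a string.
sub char_via_eval {
    my ($ord) = @_;
    return eval sprintf('"\x{%x}"', $ord);
}

# Hypothetical alternative that avoids string eval: pack("U") builds a
# character from its code point.
sub char_via_pack {
    my ($ord) = @_;
    return pack('U', $ord);
}

for my $ord (0x41, 0xC4, 0x100) {
    my ($e, $p, $c) = (char_via_eval($ord), char_via_pack($ord), chr $ord);
    printf "U+%04X: eval %s pack, eval %s chr\n", $ord,
        ($e eq $p ? 'eq' : 'ne'), ($e eq $c ? 'eq' : 'ne');
}
```

Whether pack("U") would have been a usable replacement inside charnames.pm on 5.6 is not something this thread establishes.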
On Wed, 15 Nov 2000, Andrew McNaughton wrote:
> Date: Wed, 15 Nov 2000 03:06:19 +1300 (NZDT)
> From: Andrew McNaughton <andrew@tki.org.nz>
> To: Nick Ing-Simmons <nik@tiuk.ti.com>
> Cc: perl5-porters@perl.org
> Subject: Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}
>
> On Tue, 14 Nov 2000, Nick Ing-Simmons wrote:
>
> > Date: Tue, 14 Nov 2000 08:32:36 GMT
> > From: Nick Ing-Simmons <nik@tiuk.ti.com>
> > To: andrew@tki.org.nz
> > Cc: perl5-porters@perl.org
> > Subject: Re: [ID 20001114.001] use utf8;use charnames; is incorrect for \x{80}-\x{FF}
> >
> > Andrew McNaughton <andrew@tki.org.nz> writes:
> > >This is a bug report for perl from andrew@tki.org.nz,
> > >generated with the help of perlbug 1.26 running under perl 5.006.
> > >
> > >
> > >-----------------------------------------------------------------
> > >[Please enter your report here]
> > >
> > >The following fails:
> > >
> > >use utf8;
> > >use charnames ':full';
> > >$text .= "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}";
> > >
> > >
> > >This fails because of the final line of &charnames::charnames. It returns an
> > >8 bit value.
> >
> > It is an 8-bit value - that is the UNICODE codepoint is < 256.
>
> The unicode codepoint may be less than 256, but in utf8 2 byte characters
> start from codepoint 128, not 256.
>
> > The problem is not with charnames as such, but rather
> > the fact that perl's internal optimization of holding chars in range 0..255
> > as single bytes is visible, and in particular there is as yet no way to
> > tell perl that you want utf8 for _output_ ("use utf8" affects literal
> > strings on _input_ and has one or two other "odd" effects).
>
> "use utf8" is generally only required for string literals, but in that
> particular module, there is a "use bytes" statement at the top. Other than
> for string literals, perl generally assumes utf8.
>
> Output functions do need to know whether they are operating on (utf8)
> characters or bytes, and many do, including chr().
>
> From "perldoc perlunicode":
>
> o The chr() and ord() functions work on characters.
> This is like pack("U") and unpack("U"), not like
> pack("C") and unpack("C"). In fact, the latter are
> how you now emulate byte-oriented chr() and ord()
> under utf8.
>
> The problem is that chr() gets it wrong for utf8 from 128-255 under utf8
> character mode, and arguably also with "use bytes" for values >= 256 which
> should probably produce a warning and an undef result. As things
> stand you get a value modulo 256.
>
> {
>     use utf8;
>     print chr(256 + 65),"\n";
>     print chr(128 + 65),"\n";   # wrong
> }
> {
>     use bytes;
>     print chr(256 + 65),"\n";   # arguably wrong
>     print chr(128 + 65),"\n";
> }
>
>
> Andrew McNaughton
>
> --
> Andrew McNaughton
> Te Kete Ipurangi: The Online Learning Centre
> andrew@tki.org.nz
> Ph: 64 4 382 6500
> Fax: 64 4 382 6509
> Mobile: 021 323 076
>
> PO Box 19-098
> Wellington, NZ
> http://www.tki.org.nz/
>
>
--
Andrew McNaughton
Te Kete Ipurangi: The Online Learning Centre
andrew@tki.org.nz
Ph: 64 4 382 6500
Fax: 64 4 382 6509
Mobile: 021 323 076
PO Box 19-098
Wellington, NZ
http://www.tki.org.nz/