[ID 20000725.001] Possible UTF8 bug?
From: Philip Hazel
Date: July 25, 2000 03:44
Subject: [ID 20000725.001] Possible UTF8 bug?
Message ID: Pine.SOL.4.21.0007251126080.10570-100000@draco.cus.cam.ac.uk
Hello,
Either I've misunderstood, or there is a problem with UTF-8 strings using
the \x{} notation in Perl v5.6.0. The program below is supposed to test
a method of extracting a list of integer character values from a string.
Strings using the \x{} notation do not seem to create valid UTF-8 when
the value is less than 256 unless you ensure that they are compiled
under "use utf8". The documentation suggests that this is not necessary.
The program below gives the error
    Malformed UTF-8 character at ... line 12.
for the first three example strings. The use of substr + ord works in
these cases, but not in the fourth case, where a UTF-8 string does seem
to be created.
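The "Malformed UTF-8" message is what one would expect if a string holding a single byte such as 0xC2 were handed to a UTF-8 decoder: in UTF-8, every code point from 0x80 to 0xFF takes two bytes, so a lone 0xC2 or 0xEC is a lead byte with no continuation byte, and a lone 0x80 is a continuation byte with no lead byte. The following is a minimal sketch of those encodings, assuming a modern Perl with the core Encode module rather than 5.6.0; the helper name utf8_hex is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 encoding of one code point
# as a string of space-separated hex bytes.
sub utf8_hex {
    my ($cp) = @_;
    my $bytes = encode('UTF-8', chr($cp));
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The five code points used by the test program in this report.
printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0x41, 0xC2, 0xEC, 0x80, 0x263A;
# U+0041 -> 41
# U+00C2 -> c3 82
# U+00EC -> c3 ac
# U+0080 -> c2 80
# U+263A -> e2 98 ba
```

Only "A" has the same representation as a single byte and as UTF-8, which fits the observation that the first test string is the only low-valued one that cannot go wrong.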
If you stick "use utf8" at the start of the program, the strings appear
to be created OK, and unpack unpacks correctly, but the combination of
substr and ord now gives different (and incorrect) answers for the
low-valued characters as well as the high-valued one.
Looks to me like:
(1) \x{} is creating UTF-8 only when the value is > 255.
(2) Either substr or ord is broken on UTF-8 strings.
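For comparison, here is a sketch of the same extraction written so that it works on characters and does not depend on how the string is stored internally. It assumes a modern Perl (utf8::is_utf8 is not available in 5.6.0), and the helper name code_points is hypothetical. The flag printed in the last column only reports the internal storage and is shown to illustrate point (1).

```perl
use strict;
use warnings;

# Hypothetical helper: return the code point of every character in a string.
sub code_points {
    my ($str) = @_;
    return map { ord($_) } split //, $str;
}

for my $s ("A", "\x{c2}", "\x{ec}", "\x{80}", "\x{263A}") {
    printf "%-8s stored as UTF-8 internally: %s\n",
        join(' ', map { sprintf 'U+%04X', $_ } code_points($s)),
        utf8::is_utf8($s) ? 'yes' : 'no';
}
```

Under these assumptions the character values come out the same whether or not the string is held as UTF-8 internally, which is the behaviour the report expects from substr and ord.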
Regards,
Philip
--
Philip Hazel University of Cambridge Computing Service,
ph10@cus.cam.ac.uk Cambridge, England. Phone: +44 1223 334714.
#! /bin/perl
sub pr {
    printf("\n---- %s ----\n", $_[1]);
    for ($i = 0; $i < length($_[0]); $i++)
    {
        $s = substr($_[0], $i, 1);
        printf("CH = \\x{%x} = %d\n", ord $s, ord $s);
    }
    @p = unpack('U', $_[0]);
    printf("U = 0x%x = %d\n", $p[0], $p[0]);
}
&pr("A", "A");
&pr("\x{c2}", "\\x{c2}");       # These will work if enclosed
&pr("\x{ec}", "\\x{ec}");       # between use utf8 and use bytes
&pr("\x{80}", "\\x{80}");       #
&pr("\x{263A}", "\\x{263A}");   # This works always