develooper Front page | perl.perl5.porters | Postings from July 2000

[ID 20000725.001] Possible UTF8 bug?

Philip Hazel
July 25, 2000 03:44

Either I've misunderstood, or there is a problem with UTF-8 strings using 
the \x{} notation in Perl v5.6.0. The program below is supposed to test
a method of extracting a list of integer character values from a string.

Strings using the \x{} notation do not seem to create valid UTF-8 when
the value is less than 256 unless you ensure that they are compiled
under "use utf8". The documentation suggests that this is not necessary.
The program below gives the error

Malformed UTF-8 character at ... line 12.

for the first three example strings. The use of substr + ord works in 
these cases, but not in the fourth case, where a UTF-8 string does seem 
to be created.

If you put "use utf8" at the start of the program, the strings appear
to be created OK, and unpack unpacks correctly, but the combination of
substr and ord now gives different (and incorrect) answers for the
low-valued characters as well as the high-valued one.

Looks to me like:  

  (1) \x{} is creating UTF-8 only when the value is > 255.
  (2) Either substr or ord is broken on UTF-8 strings.
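
For reference, here is a minimal sketch (not part of the original report) of what well-formed UTF-8 looks like for the five test characters. It assumes a later Perl (5.8.1 or newer) in which utf8::encode is available, and the helper name utf8_hex is made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 octets of one code point as hex.
# Assumes Perl 5.8.1 or later, where utf8::encode is always available.
sub utf8_hex {
    my ($cp) = @_;
    my $octets = chr($cp);
    utf8::encode($octets);    # characters -> UTF-8 octets, in place
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $octets;
}

# The same five characters as in the test program.
printf "\\x{%x} -> %s\n", $_, utf8_hex($_) for 0x41, 0x80, 0xc2, 0xec, 0x263A;
```

Every value from 0x80 to 0xff needs two octets in UTF-8, so a lone octet 0xc2, 0xec or 0x80 is not valid UTF-8 by itself. That would be consistent with the "Malformed UTF-8 character" message if those three strings are being stored as single bytes.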

Philip Hazel            University of Cambridge Computing Service,      Cambridge, England. Phone: +44 1223 334714.

#! /bin/perl

sub pr {
printf("\n---- %s ----\n", $_[1]);

for ($i = 0; $i < length($_[0]); $i++)   # each character via substr + ord
  {
  $s = substr($_[0], $i, 1);
  printf("CH = \\x{%x} = %d\n", ord $s, ord $s);
  }

@p = unpack('U', $_[0]);                 # first character via unpack 'U'
printf("U = 0x%x = %d\n", $p[0], $p[0]);
}

&pr("A", "A");

&pr("\x{c2}", "\\x{c2}");       # These will work if enclosed
&pr("\x{ec}", "\\x{ec}");       # between use utf8 and use bytes
&pr("\x{80}", "\\x{80}");       #

&pr("\x{263A}", "\\x{263A}");   # This works always
