Re: [ID 20010628.002] uc (and lc) of same character differs if it is utf8 encoded

From: Jarkko Hietaniemi
Date: August 15, 2001 18:50
Subject: Re: [ID 20010628.002] uc (and lc) of same character differs if it is utf8 encoded
Message ID: 20010815204958.Y10151@chaos.wustl.edu

> #!/usr/local/bin/perl -w
> 
> {
>   my ($e_accute_utf) = my ($e_accute) = chr 0xE9;
>   $e_accute_utf .= chr 300;
>   chop $e_accute_utf;
>   my $E_accute = uc $e_accute;
>   my $E_accute_utf = uc $e_accute_utf;
> 
>   if ($e_accute_utf eq $e_accute) {
>     print "ok\n";
>   } else {
>     print "not ok # '$e_accute_utf' ne '$e_accute'\n";
>   }
>   if ($E_accute_utf eq $E_accute) {
>     print "ok # '$E_accute_utf' eq '$E_accute'\n";
>   } else {
>     print "not ok # '$E_accute_utf' ne '$E_accute'\n";

Whether this works is locale-dependent: $E_accute is
uc $e_accute, $e_accute is a pure 8-bit character, and
whether uc upcases $e_accute to $E_accute depends
on the locale settings.

For example, in my Finnish locale the test fails, since
$E_accute stays lowercase.  But switching the locale helps:

LC_ALL=fr_FR.ISO8859-1 ./perl -Ilib -Mlocale t1
ok
ok # 'É' eq 'É'
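The locale dependence can be seen in isolation with a small sketch. The
helper name uc_e9_under_locale is made up for illustration, and it assumes
the POSIX module's setlocale is available; the fr_FR.ISO8859-1 locale may
well not be installed on a given system, in which case it is skipped.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: switch LC_CTYPE to $name and return the code point
# that uc() yields for the 8-bit character 0xE9, or undef if the locale
# is not installed on this system.
sub uc_e9_under_locale {
    my ($name) = @_;
    return undef unless defined setlocale(LC_CTYPE, $name);
    use locale;                 # lexically scoped: uc() now consults LC_CTYPE
    my $e_acute = chr 0xE9;     # plain 8-bit string, not UTF-8 encoded
    return ord(uc $e_acute);
}

for my $name ("C", "fr_FR.ISO8859-1") {
    my $cp = uc_e9_under_locale($name);
    if (defined $cp) {
        printf "%-16s uc(0xE9) = 0x%02X\n", $name, $cp;
    } else {
        printf "%-16s not available here\n", $name;
    }
}
```

In the C locale the character should come back unchanged as 0xE9; in a
Latin-1 locale that knows about e-acute, 0xC9 would be expected instead.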

The $...utf version works because it obeys the Unicode case-mapping
rules, but the fact that it was mapped to Unicode correctly in the first
place is purely incidental: the 0xE9 happened to be Latin-1, which happens
to occupy the lowest 256 code points of Unicode.
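A minimal sketch of the Latin-1 coincidence, using utf8::upgrade in place
of the append-and-chop trick from the quoted test (that swap is an
assumption of equivalence for this purpose, and utf8::upgrade needs a
perl that provides it). No locale pragma is in effect here.

```perl
use strict;
use warnings;

my $bytes = chr 0xE9;        # 8-bit string, stored as a single byte
my $chars = $bytes;
utf8::upgrade($chars);       # same character, now UTF-8 encoded internally

# The two strings still compare equal: only the internal encoding differs.
printf "equal: %s\n", ($bytes eq $chars ? "yes" : "no");

# The upgraded copy follows Unicode case rules, where U+00E9 maps to U+00C9.
printf "uc(upgraded) = U+%04X\n", ord(uc $chars);

# What the 8-bit copy yields depends on the perl version and on pragmas
# such as locale, so no expected value is stated for it here.
printf "uc(8-bit)    = U+%04X\n", ord(uc $bytes);
```

The upgrade only gives the right answer because the byte 0xE9 is read as
code point U+00E9; a byte from some other 8-bit character set would be
mapped to the wrong Unicode character in the same way.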

Summary: the bug cannot be solved without creative application of
high-yield explosives to locales.

>   }
> }

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
