perl.perl5.porters | Postings from April 2013

Re: [perl #117787] use locale;" breaks \w on matching c-cedilla,o-diaeresis and u-diaeresis under tr_TR.utf8 and de_DE.utf8 locales

From:
Karl Williamson
Date:
April 29, 2013 20:27
Subject:
Re: [perl #117787] use locale;" breaks \w on matching c-cedilla,o-diaeresis and u-diaeresis under tr_TR.utf8 and de_DE.utf8 locales
Message ID:
517ED7A1.8020500@khwilliamson.com
On 04/28/2013 11:22 AM, Dominic Hargreaves (via RT) wrote:
> # New Ticket Created by  Dominic Hargreaves
> # Please include the string:  [perl #117787]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=117787 >
>
>
>
> This is a bug report for perl from dom@earth.li,
> generated with the help of perlbug 1.39 running under perl 5.17.12.
>
> From <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=529305>:
>
> ----------------
> Showcase:
> (requires installing tr_TR.utf8 and de_DE.utf8 locales via 'dpkg-reconfigure
> locales' or installing locales-all package)
>
>   #!/usr/bin/perl
>   use strict;
>   use warnings;
>   use POSIX qw(setlocale LC_ALL);
>   setlocale(LC_ALL, "tr_TR.utf8");
>   print "Locale is ", setlocale(LC_ALL), "\n";
>
>   use locale;
>   use utf8;
>   binmode STDOUT, ":utf8";
>
>   print "$_ is " . ( /\w/ ? "" : "not " ) . "a word character\n"
>      for qw( ç ö ş ü ğ ı İ );
>
> The output is
>
>   Locale is tr_TR.utf8
>   ç is not a word character
>   ö is not a word character
>   ş is a word character
>   ü is not a word character
>   ğ is a word character
>   ı is a word character
>   İ is a word character
>
> Looking (with my uneducated eyes) in /usr/share/i18n/locales/tr_TR it seems
> that at least c-cedilla (U00E7 in small caps and U00C7 in caps) shall be
> treated as an "alpha" character so the problem seems to be in perl's
> interpretation.
> ----------------
>
> This is reproducible with 8b3945e7b7b7ae6fd2369864ebe169bd9a91cf4e
> (current blead) and has been the case since at least 5.8.8.

I tracked this down, and it appears to me to be a bug in the C library 
isalnum() function.  The suppliers might argue that it is intentional, 
but if so, it certainly isn't documented properly.

I'm surmising here.  What I think is going on is that under a
UTF-8 locale, isalnum() (and its brethren) will return true only for
invariant characters, that is, characters in the ASCII range.

To find out whether a character above ASCII is an alnum, one must use
iswalnum() instead.  There is no provision in Perl to do this.  Attached
is a C program that demonstrates this on my old Ubuntu 10.10 system.
Under the de locale, isalnum() returns true only for the ASCII alnums,
but iswalnum() returns true for the whole range.

Perl assumes that isalnum() will work properly on any character whose 
ordinal is 0-255.  This turns out to be wrong.  I don't see how the 
suppliers of the C library could say that their implementation is 
correct; yet they have made equally absurd claims in the past.

It would probably be a lot of work for Perl to change to also use the C
wide character classification functions.  But I will take this
opportunity to revive my proposal from a year ago to treat locales whose
names end in UTF-8 as UTF-8 for purposes of character classification
and collation:
http://markmail.org/message/q4vorzd2xcxbm43y

That would fix this bug as a side effect, and is quite easy to implement.

The objections to last year's proposal all seem to me to stem from 
misunderstanding it, and from not wanting to encourage the use of a 
broken paradigm, locales, by fixing them.  I don't consider the latter 
to be a valid objection.


