
Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)

From: karl williamson
Date: May 22, 2010 13:19
Subject: Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)
Message ID: 4BF83C0B.1050804@khwilliamson.com

Burak Gürsoy wrote:
>> -----Original Message-----
>> From: karl williamson [mailto:public@khwilliamson.com]
>> Sent: Sunday, March 07, 2010 7:10 PM
>> To: Burak Gürsoy
>> Cc: perl5-porters@perl.org
>> Subject: Re: Is lc(\x{130}) -> i\x{307} a bug?
>>
> 
>> Perl is working correctly according to the Unicode standard.
> 
> Ok.
> 
>> The inclusion of U+0307 is the correct Unicode mapping for languages
>> other than Turkish and Azerbaijani.  The mapping should be just to 'i'
> 
> It's Turkish in my case :)
> 
>> for those two languages, but it does not preserve canonical equivalence
>> without further processing.  Perl currently doesn't do this, nor does
>> Perl currently support locale handling of code points beyond 0xFF.
>>
>> Perl is unlikely to add such support, as Unicode itself has moved away
>> from defining locale dependent mappings.  They still define this one and
>> a few others that were included very early on in Unicode, but aren't
>> adding new ones.  Instead they have a CLDR project for locale data.  I
>> know next to nothing about that.
> 
> I'll check that. And an OffTopic question if you don't mind: any ideas 
> on the decision in perl6 on this matter?
> 
>> These mappings of U+0130 have been very problematic and have caused
>> significant consternation over the years, but they (and we) are stuck
>> with it now.
>>
>> I thought there might be a CPAN module that changed the behavior to suit
>> these two languages, but I just searched there and didn't see anything.
> 
> Well... defining a ToLower() seems to be the remedy for this issue:
> 
> sub ToLower {
> return <<"RANGE";
> 0049\t\t0131
> 0130\t\t0069
> RANGE
> }
> 
> Too trivial to wrap inside a module I guess :) A module related to this
> must also handle the sorting, etc. However, since ToLower/ToUpper
> is by-passed for non-unicode-looking strings, the range trick will not work
> for things like uc('i')/lc('I') (for Turkish). Only way to make a reliable
> Turkish locale dependent thingy seems to be a combination of ToLower/ToUpper
> and pre-process the string before passing to uc/lc with s/// (or tr///).
> And this is a hack unfortunately.
> 
> Thanks,
> Burak
> 

I've done some more research on this matter, and have found some 
work-arounds.  First of all, if you have just a few code points whose 
casing you want to override, you can do the following:

use Config;   # provides %Config, including the privlib path

my $upper = "$Config{privlib}/unicore/To/Upper.pl";

sub ToUpper {
     my $official = do $upper;             # load the standard mappings
     $utf8::ToSpecUpper{'i'} = "\x{0130}"; # override official
     return $official;
}

This keeps one from having to import by hand the standard mappings for 
each new Unicode version.  Since this function is called once and then 
cached, performance shouldn't be a problem (well, any more of a problem 
than currently).

To get around the problem of the source having to be in utf8 for this to 
be called, one can do:

use subs qw(uc ucfirst lc lcfirst);

sub uc($) {
     my $string = shift;
     utf8::upgrade($string);
     return CORE::uc($string);
}

as long as all the calls to uc are in the same file.  I believe, but 
haven't tried, that you can use the new pluggable keywords feature to 
extend this to other files, see "PL_keyword_plugin" in perlapi, if you 
write an XS module.
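
As an illustration of what the utf8::upgrade call in the wrapper does: it changes only the internal representation of the string, not its characters. A minimal sketch (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $string = "i";                        # plain ASCII literal, not flagged utf8
my $before = utf8::is_utf8($string) ? 1 : 0;

utf8::upgrade($string);                  # switch the internal storage to utf8
my $after  = utf8::is_utf8($string) ? 1 : 0;

# The characters are unchanged; only the internal flag differs.
my $same = ($string eq "i") ? 1 : 0;

print "before=$before after=$after same=$same\n";
```

Because the string compares equal before and after, the wrapper can upgrade its argument without affecting callers.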

But lowercasing in Turkish and Azeri has two extra, context-dependent 
rules defined in the Unicode standard.  (I haven't looked at the locale 
repository.)  One of those rules gets rid of the \x{307}, as the subject 
of your posts indicates.  You can put those in the lc() function.

sub lc($) {
     my $string = shift;
     utf8::upgrade($string);

     # Unless an I is before a dot_above, it turns into a dotless i.
     $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;

     # But when the I is followed by a dot_above, remove the dot_above so
     # the end result will be i.
     $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
     return CORE::lc($string);
}

I took those rules from the Unicode standard, 5.2 section 3.13.  Note 
that there is an issue in Perl with context-dependent case changing: 
one can write "FOO_\LBAR\E_MORE_FOO", and the only context lc sees is 
'BAR', without looking at the _MORE_FOO.  That shouldn't be a problem 
here, as one shouldn't have the \E between a character and its combining 
marks, but it could happen.
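
The two rules, and the effect of separating a letter from its combining mark, can be tried out with a standalone sketch. The name turkish_lc is hypothetical, and the direct substitution for U+0130 is an assumption made here so that the example does not depend on a ToLower override:

```perl
use strict;
use warnings;

# Hypothetical helper: applies the two context rules, then plain lowercasing.
sub turkish_lc {
    my $string = shift;
    utf8::upgrade($string);

    # An I not followed by a dot_above becomes dotless i (U+0131).
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;

    # An I followed by a dot_above: drop the dot_above, leaving plain i.
    $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;

    # Assumption: map U+0130 directly rather than through ToLower.
    $string =~ s/\x{0130}/i/g;

    return CORE::lc($string);
}

# The I and its mark are seen together, so the result is a plain i.
my $whole = turkish_lc("I\x{0307}");

# The mark is appended after lowercasing, so the I was seen without context.
my $split = turkish_lc("I") . "\x{0307}";

# Show the code points rather than the raw characters.
print join(" ", map { sprintf "U+%04X", ord } split //, $whole), "\n";
print join(" ", map { sprintf "U+%04X", ord } split //, $split), "\n";
```

The second result keeps the dot_above on a dotless i, which is the kind of outcome to expect when lowercasing sees only part of the sequence.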

For completeness, here is ToLower():

my $lower = "$Config{privlib}/unicore/To/Lower.pl";

sub ToLower {
     my $official = do $lower;
     $utf8::ToSpecLower{"\xc4\xb0"} = "i";  # key is U+0130 encoded as utf8
     return $official;
}

Note that the key to the hash must be in utf8.  This would differ on 
utf-ebcdic.  ToTitle is essentially ToUpper.  lcfirst and ucfirst 
correspond to lc and uc.
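
To see where the "\xc4\xb0" key comes from, one can encode U+0130 and inspect the bytes. A small sketch (variable names are illustrative only, and it applies to ASCII platforms, not utf-ebcdic):

```perl
use strict;
use warnings;

my $key = "\x{0130}";      # LATIN CAPITAL LETTER I WITH DOT ABOVE
utf8::encode($key);        # in place: one character becomes its utf8 bytes

# Compare against the literal key used in ToLower.
my $matches = ($key eq "\xc4\xb0") ? 1 : 0;

# Print the bytes in hex.
my $hex = join " ", map { sprintf "%02X", ord } split //, $key;
print "$hex matches=$matches\n";
```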

I believe that this outlines a complete Turkish case changing 
implementation in Perl 5.  I have yet to look into whether there is a way 
to override case-insensitive regex matching, and I'm not planning to look 
at collation.



