Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)
From:
karl williamson
Date:
May 22, 2010 13:19
Subject:
Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)
Message ID:
4BF83C0B.1050804@khwilliamson.com
Burak Gürsoy wrote:
>> -----Original Message-----
>> From: karl williamson [mailto:public@khwilliamson.com]
>> Sent: Sunday, March 07, 2010 7:10 PM
>> To: Burak Gürsoy
>> Cc: perl5-porters@perl.org
>> Subject: Re: Is lc(\x{130}) -> i\x{307} a bug?
>>
>
>> Perl is working correctly according to the Unicode standard.
>
> Ok.
>
>> The inclusion of U+0307 is the correct Unicode mapping for languages
>> other than Turkish and Azerbaijani. The mapping should be just to 'i'
>
> It's Turkish in my case :)
>
>> for those two languages, but it does not preserve canonical equivalence
>> without further processing. Perl currently doesn't do this, nor does
>> Perl currently support locale handling of code points beyond 0xFF.
>>
>> Perl is unlikely to add such support, as Unicode itself has moved away
>> from defining locale dependent mappings. They still define this one and
>> a few others that were included very early on in Unicode, but aren't
>> adding new ones. Instead they have a CLDR project for locale data. I
>> know next to nothing about that.
>
> I'll check that. And an OffTopic question if you don't mind: any ideas
> on the decision in perl6 on this matter?
>
>> These mappings of U+0130 have been very problematic and have caused
>> significant consternation over the years, but they (and we) are stuck
>> with it now.
>>
>> I thought there might be a CPAN module that changed the behavior to suit
>> these two languages, but I just searched there and didn't see anything.
>
> Well... defining a ToLower() seems to be the remedy for this issue:
>
> sub ToLower {
> return <<"RANGE";
> 0049\t\t0131
> 0130\t\t0069
> RANGE
> }
>
> Too trivial to wrap inside a module I guess :) A module related to this
> must also handle the sorting, etc. However, since ToLower/ToUpper
> is by-passed for non-unicode-looking strings, the range trick will not work
> for things like uc('i')/lc('I') (for Turkish). Only way to make a reliable
> Turkish locale dependent thingy seems to be a combination of ToLower/ToUpper
> and pre-process the string before passing to uc/lc with s/// (or tr///).
> And this is a hack unfortunately.
>
> Thanks,
> Burak
>
I've done some more research on this matter, and have found some
work-arounds. First of all, if you only need to override the casing of
a few code points, you can do the following:
    use Config;    # for %Config

    my $upper = "$Config{privlib}/unicore/To/Upper.pl";
    sub ToUpper {
        my $official = do $upper;                 # load the standard mappings
        $utf8::ToSpecUpper{'i'} = "\x{0130}";     # override official
        return $official;
    }
This keeps one from having to import by hand the standard mappings for
each new Unicode version. Since this function is called once and then
cached, performance shouldn't be a problem (well, any more of a problem
than currently).
To get around the requirement that the string be in utf8 for this
override to be called, one can do:
    use subs qw(uc ucfirst lc lcfirst);
    sub uc($) {
        my $string = shift;
        utf8::upgrade($string);
        return CORE::uc($string);
    }
as long as all the calls to uc are in the same file. I believe, but
haven't tried, that you can use the new pluggable keywords feature to
extend this to other files, see "PL_keyword_plugin" in perlapi, if you
write an XS module.
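
In case the ToUpper hook is not available on a given perl, roughly the same effect can be had by preprocessing the string before handing it to CORE::uc. What follows is only a sketch: the name tr_uc is made up for illustration, and it relies on the default Unicode mapping of U+0131 (dotless i) to I.

```perl
use strict;
use warnings;

# tr_uc is a hypothetical name, not part of any module.
sub tr_uc {
    my ($string) = @_;
    utf8::upgrade($string);        # make sure CORE::uc uses Unicode rules
    $string =~ s/i/\x{0130}/g;     # dotted i -> capital I with dot above
    return CORE::uc($string);      # U+0131 uppercases to plain I by default
}

my $city = tr_uc("istanbul");      # "\x{130}STANBUL"
```

Because it is an ordinary function, it can be called from any file, at the cost of not being picked up by plain uc().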
But lowercasing in Turkish and Azeri has two extra, context-dependent
rules defined in the Unicode standard. (I haven't looked at the locale
repository.) One of those rules gets rid of the \x{307}, as the subject
of your posts indicates. You can put those rules in the lc() function:
    sub lc($) {
        my $string = shift;
        utf8::upgrade($string);
        # Unless an I is before a dot_above, it turns into a dotless i.
        $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
        # But when the I is followed by a dot_above, remove the dot_above so
        # the end result will be i.
        $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
        return CORE::lc($string);
    }
I took those rules from the Unicode standard, 5.2 section 3.13. Note
that there is an issue in Perl with context-dependent case changing, as
one can use "FOO_\LBAR\E_MORE_FOO", and the only context lc sees is
'BAR', without looking at the _MORE_FOO. That shouldn't be a problem
here, as one shouldn't have the \E between a character and its combining
marks, but it could happen.
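
To make the \L...\E point concrete, here is a small sketch using only default Perl casing (no Turkish overrides). The variable names are just for illustration.

```perl
use strict;
use warnings;

# \L...\E lowercases only the text between the two escapes.
my $mixed = "FOO_\LBAR\E_MORE_FOO";    # "FOO_bar_MORE_FOO"

# The awkward case: \E sits between an I and its combining dot above,
# so the casing step is handed a bare "I" and never sees the U+0307.
my $split = "\LI\E\x{0307}";           # "i" followed by U+0307
```

A context-sensitive rule applied at that point would wrongly treat the I as one with no dot above following it.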
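For completeness, here is ToLower():
    # $lower is set by analogy with $upper above; needs "use Config;" too
    my $lower = "$Config{privlib}/unicore/To/Lower.pl";
    sub ToLower {
        my $official = do $lower;
        $utf8::ToSpecLower{"\xc4\xb0"} = "i";    # key is U+0130 in utf8
        return $official;
    }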
Note that the key to the hash must be in utf8. This would differ on
utf-ebcdic. ToTitle is essentially ToUpper. lcfirst and ucfirst
correspond to lc and uc.
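
As a quick check on that hash key, "\xc4\xb0" is simply U+0130 encoded as UTF-8 octets. A minimal sketch, with variable names chosen only for illustration:

```perl
use strict;
use warnings;

my $key = "\x{0130}";              # LATIN CAPITAL LETTER I WITH DOT ABOVE
utf8::encode($key);                # in place: characters -> UTF-8 octets
my $same = $key eq "\xc4\xb0";     # true: the two octets are C4 B0
```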
I believe that this outlines a complete Turkish case changing
implementation on Perl 5. I have yet to look at whether there is a way
to override case-insensitive regex matching, and I'm not planning to
look at collation.
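
The lowercasing rules above can also be collected into one ordinary function that does not depend on the ToLower hook. This is a sketch under assumptions, not a tested implementation for production use: the name tr_lc is made up, and the extra substitution for U+0130 stands in for the ToLower override.

```perl
use strict;
use warnings;

# tr_lc is a hypothetical name, not part of any module.
sub tr_lc {
    my ($string) = @_;
    utf8::upgrade($string);
    # An I not followed by a dot above (U+0307) becomes dotless i.
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{0131}/gx;
    # An I followed by a dot above loses the dot and becomes i.
    $string =~ s/I ( [^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
    # Stand-in for the ToLower override: U+0130 maps to plain i.
    $string =~ s/\x{0130}/i/g;
    return CORE::lc($string);
}

my $plain  = tr_lc("ISPARTA");             # "\x{131}sparta"
my $dotted = tr_lc("\x{0130}STANBUL");     # "istanbul"
my $marked = tr_lc("I\x{0307}");           # "i"
```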