Re: Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)
From: karl williamson
Date: May 24, 2010 22:05
Subject: Re: Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)
Message ID: 4BFB5A42.1000802@khwilliamson.com
Burak Gürsoy wrote:
>> -----Original Message-----
>> From: karl williamson [mailto:public@khwilliamson.com]
>> Sent: Saturday, May 22, 2010 11:18 PM
>> To: Burak Gürsoy
>> Cc: perl5-porters@perl.org; David Nicol; Tom Christiansen; Rafael
>> Garcia-Suarez
>> Subject: Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a
>> bug?)
>>
>
> [snip]
>
>> I've done some more research on this matter, and have found some
>> work-arounds. First of all, if one has just a few code points to
>> override the casing of, you can do the following:
>
> Hi Karl,
>
>> use Config;
>> my $upper = "$Config{privlib}/unicore/To/Upper.pl";
>>
>> sub ToUpper {
>>     my $official = do $upper;
>>     $utf8::ToSpecUpper{'i'} = "\x{0130}"; # override official
>>     return $official;
>> }
>
> Thanks for taking the trouble to test these. However, I have some issues.
> First of all, the %utf8::* hashes do not seem to do the trick. As you showed
> below, utf8::upgrade() does indeed work to bypass the utf8 requirement, but
> I still have to do this:
>
> sub ToUpper {
> return <<"FOO";
> 0131\t\t0049
> 0069\t\t0130
> FOO
> }
>
> to get the correct results from uc(). I can't see what
> $utf8::ToSpecUpper{'i'} does. Your ToUpper() returns something else; is
> that used instead of the hash when uc() operates on the parameter?
I'm sorry, I forgot to mention the following: getting $utf8::ToSpecUpper{'i'}
to work requires a bug fix that I have locally, but which hasn't yet
been applied to blead, much less released in 5.12. That fix is [perl #75098].
>
> And btw, the usage of ToTitle() and ToFold() is not documented IIRC, but I
> expected them to work. However, defining them does not seem to have any
> effect (to alter [ul]cfirst()).
ToTitle is documented in perlunicode.pod, but that documentation contains
some erroneous information. [perl #75314] fixes the pod, but is not in blead.
lcfirst() uses ToLower, so there is no need for an extra function.
ToFold doesn't work: there is currently no way to override case-insensitive
matching. It's not clear at the moment what can be done in that direction.
Turkish clearly demonstrates why that would be useful. As 5.14 development
progresses, we'll know more. I'll tell Yves, who has some ideas for making
folding better but has not had time to work much on them, that it would be
nice if there were a way to override the standard mappings.
>
>> This keeps one from having to import by hand the standard mappings for
>> each new Unicode version. Since this function is called once and then
>> cached, performance shouldn't be a problem (well, any more of a problem
>> than currently).
>>
>> To get around the problem of the source having to be in utf8 for this
>> to be called, one can do:
>>
>> use subs qw(uc ucfirst lc lcfirst);
>>
>> sub uc($) {
>>     my $string = shift;
>>     utf8::upgrade($string);
>>     return CORE::uc($string);
>> }
>>
>> as long as all the calls to uc are in the same file. I believe, but
>> haven't tried, that you can use the new pluggable keywords feature to
>> extend this to other files, see "PL_keyword_plugin" in perlapi, if you
>> write an XS module.
>>
>> But lowercasing in Turkish and Azeri has two extra, context-dependent
>> rules defined in the Unicode standard. (I haven't looked at the locale
>> repository.) One of those rules gets rid of the \x{307}, as the
>> subject of your posts indicates. You can put those in the lc() function.
>>
>> sub lc($) {
>>     my $string = shift;
>>     utf8::upgrade($string);
>>
>>     # Unless an I is before a dot_above, it turns into a dotless i.
>>     $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
>>
>>     # But when the I is followed by a dot_above, remove the dot_above so
>>     # the end result will be i.
>>     $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
>>     return CORE::lc($string);
>> }
>
> I believe this has to be:
>
> sub lc($) {
>     my $string = shift;
>     utf8::upgrade($string);
>     $string = CORE::lc($string);
>     # Unless an I is before a dot_above, it turns into a dotless i.
>     $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
>
>     # But when the I is followed by a dot_above, remove the dot_above so
>     # the end result will be i.
>     $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
>
>     return $string;
> }
The context-sensitive substitutions have to come before the call to
CORE::lc, as afterwards the 'I' will already have been made into an 'i',
and so the patterns will never match. I don't understand why you thought
there was a problem with the way it was.
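To make the ordering point concrete, here is a minimal sketch. The two substitutions are the ones from the quoted lc() above; the helper names (apply_dot_rules, rules_then_lc, lc_then_rules) are made up for this illustration, and the built-in lc is called directly rather than through an override.

```perl
use strict;
use warnings;

# The two context rules from the quoted lc(), unchanged.
sub apply_dot_rules {
    my $string = shift;
    utf8::upgrade($string);
    # An I that is not before a dot_above becomes a dotless i.
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
    # An I followed by a dot_above loses the dot_above and becomes i.
    $string =~ s/I ( [^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
    return $string;
}

# Rules first, then the generic lowercasing (the order I posted).
sub rules_then_lc {
    my $string = shift;
    return lc(apply_dot_rules($string));
}

# Generic lowercasing first: every I is already an i, so the
# patterns find nothing to match.
sub lc_then_rules {
    my $string = shift;
    return apply_dot_rules(lc($string));
}

my $good = rules_then_lc("ISIK");   # "\x{131}s\x{131}k"
my $bad  = lc_then_rules("ISIK");   # "isik"
```

With the rules applied last, "I\x{0307}" also keeps its dot_above ("i\x{0307}"), whereas the rules-first order yields a plain "i".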
>
> Still seems to have a problem with "İ" though. I'll do more tests.
>
> Where is that \p{ccc=...} stuff documented? I'm not familiar with it.
In perluniprops.pod, which will refer you to
http://www.unicode.org/reports/tr44 for more detail.
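As a rough illustration of what the character class in those patterns does, the sketch below checks a few marks against it. ccc is the Canonical_Combining_Class property; the variable names and the sample characters are chosen just for this example.

```perl
use strict;
use warnings;

# Characters the rules may skip over between the I and the dot_above:
# anything that is neither a starter (ccc=0) nor a mark placed above
# the base character (ccc=Above).
my $skippable = qr/[^\p{ccc=0}\p{ccc=Above}]/;

my $dot_above = "\x{0307}";   # COMBINING DOT ABOVE,    ccc=Above
my $dot_below = "\x{0323}";   # COMBINING DOT BELOW,    ccc=Below
my $acute     = "\x{0301}";   # COMBINING ACUTE ACCENT, ccc=Above

my $below_skippable  = $dot_below =~ /\A$skippable\z/ ? 1 : 0;   # 1
my $above_skippable  = $dot_above =~ /\A$skippable\z/ ? 1 : 0;   # 0
my $letter_skippable = "S"        =~ /\A$skippable\z/ ? 1 : 0;   # 0

# A mark below the I does not separate it from its dot_above ...
my $reached = "I$dot_below$dot_above" =~ /I $skippable* \x{0307}/x ? 1 : 0;   # 1
# ... but another mark above it does.
my $blocked = "I$acute$dot_above"     =~ /I $skippable* \x{0307}/x ? 1 : 0;   # 0
```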
>
> Note: I've tested the codes with Strawberry 5.12.0 on Windows Vista 32bit.
>
>> I took those rules from the Unicode standard, 5.2 section 3.13. Note
>> that there is an issue in Perl with context-dependent case changing, as
>> one can use "FOO_\LBAR\E_MORE_FOO", and the only context lc sees is
>> 'BAR', without looking at the _MORE_FOO. That shouldn't be a problem
>> here, as one shouldn't have the \E between a character and its combined
>> marks, but it could happen.
>>
>> For completeness, here is ToLower()
>>
>> sub ToLower {
>>     my $official = do $lower;
>>     $utf8::ToSpecLower{"\xc4\xb0"} = "i";
>>     return $official;
>> }
>>
>> Note that the key to the hash must be in utf8. This would differ on
>> utf-ebcdic. ToTitle is essentially ToUpper. lcfirst and ucfirst
>> correspond to lc and uc.
>>
>> I believe that this outlines a complete Turkish case changing
>> implementation on Perl5. I have yet to look at if there is a way to
>> override case-insensitive regex matching; and I'm not planning to look
>> at collation.
>
>
Until 5.14 comes out, the easiest way to overcome the limitations is to
copy the contents of the official To/Foo.pl files into your ToFoo
functions, hand-changing the few applicable entries. Then by using the
'use subs' idea, you should be able to get it all to work.
I have a patch, now in RFC stage, that apparently gets rid of the need
for 'use subs'. To test it, I wrote a test script, which is a complete
implementation of the Turkish changes I outlined. I'm attaching that.
There are specifics in it for the Perl test environment that have to be
changed (for example, to use Config instead of the paths it has). Also,
since the patches aren't in 5.12, 'use subs' is needed for all 4
routines, and the table returned by the functions has to be the complete
list of all mappings. But I think it might be helpful nonetheless.
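For reference, here is a rough sketch of what 'use subs' for all 4 routines can look like. It is not the attached test script. It makes the Turkish-specific replacements inside the wrappers themselves instead of going through ToUpper/ToLower tables, so the handling of i and \x{130} in the wrappers is an assumption of this sketch that stands in for the table entries. As noted earlier, it only affects calls made in the same file.

```perl
use strict;
use warnings;
use subs qw(uc ucfirst lc lcfirst);

sub lc {
    my $string = shift;
    utf8::upgrade($string);
    # An I that is not before a dot_above becomes a dotless i.
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
    # An I followed by a dot_above loses the dot_above and becomes i.
    $string =~ s/I ( [^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
    # Assumption: stands in for the ToLower entry for \x{130}.
    $string =~ s/\x{130}/i/g;
    return CORE::lc($string);
}

sub uc {
    my $string = shift;
    utf8::upgrade($string);
    # Assumption: stands in for the ToUpper entry for i.
    $string =~ s/i/\x{130}/g;
    return CORE::uc($string);
}

sub ucfirst {
    my $string = shift;
    utf8::upgrade($string);
    $string =~ s/\Ai/\x{130}/;
    return CORE::ucfirst($string);
}

sub lcfirst {
    my $string = shift;
    utf8::upgrade($string);
    # Lowercase the first character together with its combining marks,
    # so that the dot_above context is visible to lc().
    return $string unless $string =~ /\A(\X)(.*)\z/s;
    my ($first, $rest) = ($1, $2);
    return lc($first) . $rest;
}

my $lower = lc("ISIK");     # "\x{131}s\x{131}k"
my $upper = uc("izmir");    # "\x{130}ZM\x{130}R"
```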