
RE: Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)

From: Burak Gürsoy
Date: May 25, 2010 02:09
Subject: RE: Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)
Message ID: 000901cafb9a$977716a0$c66543e0$@net

> -----Original Message-----
> From: karl williamson [mailto:public@khwilliamson.com]
> Sent: Saturday, May 22, 2010 11:18 PM
> To: Burak Gürsoy
> Cc: perl5-porters@perl.org; David Nicol; Tom Christiansen; Rafael Garcia-Suarez
> Subject: Running Perl 5 on Turkish (Was : Is lc(\x{130}) -> i\x{307} a bug?)
> 

[snip]

> 
> I've done some more research on this matter, and have found some
> work-arounds.  First of all, if one has just a few code points to
> override the casing of, you can do the following:

Hi Karl,
 
> use Config;  # exports %Config
> my $upper = "$Config{privlib}/unicore/To/Upper.pl";
> 
> sub ToUpper {
>      my $official = do $upper;
>      $utf8::ToSpecUpper{'i'} = "\x{0130}"; # override official
>      return $official;
> }

Thanks for taking the trouble to test these. However, I have some issues.
First of all, the %utf8::* hashes do not seem to do the trick. As you showed
below, utf8::upgrade() does indeed work to bypass the utf8 requirement, but
I still have to do this:

sub ToUpper {
return <<"FOO";
0131\t\t0049
0069\t\t0130
FOO
}

to get the correct results from uc(). I can't see what
$utf8::ToSpecUpper{'i'} does. Is it that your ToUpper() returns something
else, and that is used instead of the hash while uc() operates on its
parameter?

And btw, the usage of ToTitle() and ToFold() is not documented IIRC, but I 
expected them to work. However, defining them does not seem to have any
effect (to alter [ul]cfirst()).
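
One way to sidestep the ToUpper()/ToTitle() hooks altogether is to apply the
Turkish rule explicitly before calling the built-ins. What follows is only a
sketch under that assumption: tr_uc and tr_ucfirst are hypothetical names,
not anything from this thread or from Perl itself, and a plain substitution
is swapped in for the override mechanism.

```perl
use strict;
use warnings;

# Hypothetical helpers (assumption, not from the thread): Turkish-aware
# uppercasing done explicitly, without ToUpper()/ToTitle() overrides.
sub tr_uc {
    my ($string) = @_;
    utf8::upgrade($string);

    # Dotted small i becomes capital I with dot above (U+0130).
    $string =~ s/i/\x{0130}/g;

    # The built-in already maps dotless i (U+0131) to "I".
    return CORE::uc($string);
}

sub tr_ucfirst {
    my ($string) = @_;
    utf8::upgrade($string);

    # Only the first character is affected.
    $string =~ s/\Ai/\x{0130}/;

    return CORE::ucfirst($string);
}
```

Since the wrappers have their own names, they do not depend on "use subs"
and can be called from any file that imports them.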

> This keeps one from having to import by hand the standard mappings for
> each new Unicode version.  Since this function is called once and then
> cached, performance shouldn't be a problem (well, any more of a problem
> than currently).
> 
> To get around the problem of the source having to be in utf8 for this to
> be called, one can do:
> 
> use subs qw(uc ucfirst lc lcfirst);
> 
> sub uc($) {
>      my $string = shift;
>      utf8::upgrade($string);
>      return CORE::uc($string);
> }
> 
> as long as all the calls to uc are in the same file.  I believe, but
> haven't tried, that you can use the new pluggable keywords feature to
> extend this to other files, see "PL_keyword_plugin" in perlapi, if you
> write an XS module.
> 
> But lowercasing in Turkish and Azeri has two extra, context-dependent
> rules defined in the Unicode standard.  (I haven't looked at the locale
> repository.)  One of those rules gets rid of the \x{307}, as the subject
> of your posts indicates.  You can put those in the lc() function.
> 
> sub lc($) {
>      my $string = shift;
>      utf8::upgrade($string);
> 
>      # Unless an I is before a dot_above, it turns into a dotless i.
>      $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
> 
>      # But when the I is followed by a dot_above, remove the dot_above so
>      # the end result will be i.
>      $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
>      return CORE::lc($string);
> }

I believe this has to be:

sub lc($) {
     my $string = shift;
     utf8::upgrade($string);
     $string = CORE::lc($string);
     # Unless an I is before a dot_above, it turns into a dotless i.
     $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;

     # But when the I is followed by a dot_above, remove the dot_above so
     # the end result will be i.
     $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;

     return $string;
}
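
Note that once CORE::lc() has run there is no capital "I" left in the string,
so substitutions that look for "I" need to come before it. A minimal sketch
along those lines follows. It folds in the U+0130 case directly, on the
assumption that no ToLower() override is in effect. tr_lc is a hypothetical
name, and the two rules are the ones Karl quotes from the Unicode standard.

```perl
use strict;
use warnings;

# Hypothetical helper (assumption, not from the thread): Turkish-aware
# lowercasing with every "I" rule applied BEFORE the built-in lc().
sub tr_lc {
    my ($string) = @_;
    utf8::upgrade($string);

    # Capital I with dot above (U+0130) lowercases to a plain "i".
    $string =~ s/\x{0130}/i/g;

    # "I" followed by a combining dot above (U+0307), possibly after other
    # marks that are not of class Above: drop the dot, yielding "i".
    $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]*) \x{0307}/i$1/gx;

    # Every remaining "I" becomes dotless i (U+0131).
    $string =~ s/I/\x{0131}/g;

    # The built-in handles all other characters.
    return CORE::lc($string);
}
```

The order of the two "I" substitutions is swapped relative to the quoted
version, which removes the need for the negative lookahead.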

Still seems to have a problem with "İ" though. I'll do more tests.

Where is that \p{ccc=...} stuff documented? I'm not familiar with it.
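
For reference, ccc is the short name of Unicode's Canonical_Combining_Class
property; the property names Perl accepts are listed in perluniprops. A small
sketch of what the character class in the substitutions above matches,
assuming the core Unicode::Normalize module is available (the variable names
are made up for illustration):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCombinClass);

# Numeric combining class of a few code points.
my $ccc_letter    = getCombinClass(0x0061);  # LATIN SMALL LETTER A
my $ccc_dot_above = getCombinClass(0x0307);  # COMBINING DOT ABOVE
my $ccc_dot_below = getCombinClass(0x0323);  # COMBINING DOT BELOW

# The class used in the substitutions, spelled as in the thread: a
# character that is neither class 0 (a starter) nor class Above.
my $skippable = qr/[^\p{ccc=0}\p{ccc=Above}]/;

my $dot_below_skippable = "\x{0323}" =~ /\A$skippable\z/ ? 1 : 0;
my $dot_above_skippable = "\x{0307}" =~ /\A$skippable\z/ ? 1 : 0;
my $letter_skippable    = "a"        =~ /\A$skippable\z/ ? 1 : 0;
```

In other words, the regexes let a mark such as a dot below sit between the
"I" and the dot above, but stop at another base letter or another mark that
sits above the letter.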

Note: I've tested the code with Strawberry Perl 5.12.0 on 32-bit Windows Vista.

> I took those rules from the Unicode standard, 5.2 section 3.13.  Note
> that there is an issue in Perl with context-dependent case changing, as
> one can use "FOO_\LBAR\E_MORE_FOO", and the only context lc sees is
> 'BAR',  without looking at the _MORE_FOO.  That shouldn't be a problem
> here, as one shouldn't have the \E between a character and its combined
> marks, but it could happen.
> 
> For completeness, here is ToLower()
> 
> my $lower = "$Config{privlib}/unicore/To/Lower.pl";
> 
> sub ToLower {
>      my $official = do $lower;
>      $utf8::ToSpecLower{"\xc4\xb0"} = "i";
>      return $official;
> }
> 
> Note that the key to the hash must be in utf8.  This would differ on
> utf-ebcdic.  ToTitle is essentially ToUpper.  lcfirst and ucfirst
> correspond to lc and uc.
> 
> I believe that this outlines a complete Turkish case changing
> implementation on Perl5.  I have yet to look at whether there is a way to
> override case-insensitive regex matching; and I'm not planning to look
> at collation.

