
RE: Is lc(\x{130}) -> i\x{307} a bug?

From: Burak Gürsoy
Date: March 12, 2010 05:22
Subject: RE: Is lc(\x{130}) -> i\x{307} a bug?
Message ID: 000101cac176$025e0be0$071a23a0$@net

> -----Original Message-----
> From: karl williamson [mailto:public@khwilliamson.com]
> Sent: Sunday, March 07, 2010 7:10 PM
> To: Burak Gürsoy
> Cc: perl5-porters@perl.org
> Subject: Re: Is lc(\x{130}) -> i\x{307} a bug?
> 

> Perl is working correctly according to the Unicode standard.

Ok.

> The inclusion of U+0307 is the correct Unicode mapping for languages
> other than Turkish and Azerbaijani.  The mapping should be just to 'i'

It's Turkish in my case :)

> for those two languages, but it does not preserve canonical equivalence
> without further processing.  Perl currently doesn't do this, nor does
> Perl currently support locale handling of code points beyond 0xFF.
> 
> Perl is unlikely to add such support, as Unicode itself has moved away
> from defining locale dependent mappings.  They still define this one
> and a few others that were included very early on in Unicode, but
> aren't adding new ones.  Instead they have a CLDR project for locale
> data.  I know next to nothing about that.

I'll check that. And an off-topic question, if you don't mind: any idea
what has been decided for Perl 6 on this matter?

> These mappings of U+0130 have been very problematic and have caused
> significant consternation over the years, but they (and we) are stuck
> with it now.
> 
> I thought there might be a CPAN module that changed the behavior to
> suit these two languages, but I just searched there and didn't see
> anything.

Well... defining a ToLower() seems to be the remedy for this issue:

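# User-defined lowercase mapping for Turkish. Each heredoc line is:
# source code point (hex), two tabs, target code point (hex).
# 0049 (I) -> 0131 (dotless i); 0130 (dotted capital I) -> 0069 (i)
sub ToLower {
    return <<"RANGE";
0049\t\t0131
0130\t\t0069
RANGE
}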

Too trivial to wrap inside a module, I guess :) A module for this would
also have to handle sorting, etc. However, since ToLower/ToUpper is
bypassed for strings that don't look like Unicode, the range trick will
not work for things like uc('i')/lc('I') (for Turkish). The only way to
get reliable Turkish locale-dependent behavior seems to be combining
ToLower/ToUpper with pre-processing the string with s/// (or tr///)
before passing it to uc/lc. And this is a hack, unfortunately.
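
A minimal sketch of the pre-processing half of that idea, for
illustration only. The helper names tr_lc and tr_uc are hypothetical
(not core Perl, not an existing CPAN module), and the sketch assumes
that only the precomposed characters matter: it swaps the four Turkish
i-letters with tr/// first and then lets the built-in lc/uc handle
everything else, so it does not depend on ToLower/ToUpper at all.

```perl
use strict;
use warnings;

# Hypothetical helpers: Turkish-aware wrappers around lc/uc.
# They remap the dotted/dotless i pairs before the built-in runs.

sub tr_lc {
    my ($s) = @_;
    # I -> U+0131 (dotless i), U+0130 (dotted capital I) -> i
    $s =~ tr/I\x{130}/\x{131}i/;
    return lc $s;
}

sub tr_uc {
    my ($s) = @_;
    # i -> U+0130 (dotted capital I), U+0131 (dotless i) -> I
    $s =~ tr/i\x{131}/\x{130}I/;
    return uc $s;
}
```

Usage would look like tr_lc("\x{130}STANBUL") or tr_uc("istanbul").
Because tr/// runs before lc/uc, this also covers plain byte strings
such as 'i' and 'I', which is the case the ToLower trick misses.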

Thanks,
Burak

