
RE: Is lc(\x{130}) -> i\x{307} a bug?

Burak Gürsoy
March 12, 2010 05:22
> -----Original Message-----
> From: karl williamson
> Sent: Sunday, March 07, 2010 7:10 PM
> To: Burak Gürsoy
> Cc:
> Subject: Re: Is lc(\x{130}) -> i\x{307} a bug?

> Perl is working correctly according to the Unicode standard.


> The inclusion of U+0307 is the correct Unicode mapping for languages
> other than Turkish and Azerbaijani.  The mapping should be just to 'i'

It's Turkish in my case :)

> for those two languages, but it does not preserve canonical equivalence
> without further processing.  Perl currently doesn't do this, nor does
> Perl currently support locale handling of code points beyond 0xFF.
> Perl is unlikely to add such support, as Unicode itself has moved away
> from defining locale dependent mappings.  They still define this one
> and a few others that were included very early on in Unicode, but aren't
> adding new ones.  Instead they have a CLDR project for locale data.  I
> know next to nothing about that.
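
To make the point about canonical equivalence concrete, here is a minimal sketch (variable names are illustrative only) that prints what lc() returns for U+0130 and checks that the decomposed form "I" + U+0307 lowercases to the same string. It assumes the default, non-Turkic mapping and uses the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $dotted_I = "\x{130}";       # LATIN CAPITAL LETTER I WITH DOT ABOVE
my $lower    = lc $dotted_I;    # default (non-Turkic) full case mapping

# Print the code points of the result: 0069 0307
print join(' ', map { sprintf '%04X', ord } split //, $lower), "\n";

# U+0130 decomposes canonically to "I" + U+0307, so lowercasing either
# form yields the same string under the default mapping.
my $same = lc(NFD($dotted_I)) eq $lower;
print $same ? "both forms agree\n" : "forms differ\n";
```

If U+0130 mapped to a bare 'i' while the decomposed form still lowercased to "i" + U+0307, the two forms would no longer agree, which is the extra processing the Turkic rule needs.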

I'll check that. And an off-topic question, if you don't mind: any idea
what has been decided in Perl 6 on this matter?

> These mappings of U+0130 have been very problematic and have caused
> significant consternation over the years, but they (and we) are stuck
> with it now.
> I thought there might be a CPAN module that changed the behavior to
> suit these two languages, but I just searched there and didn't see
> anything.

Well... defining a ToLower() seems to be the remedy for this issue:

sub ToLower {
return <<"RANGE";

Too trivial to wrap inside a module, I guess :) A module for this would
also have to handle sorting, etc. However, since ToLower/ToUpper are
bypassed for strings that do not look like Unicode to Perl, the range trick
will not work for things like uc('i')/lc('I') (for Turkish). The only way to
get reliable Turkish locale-dependent casing seems to be to combine
ToLower/ToUpper with pre-processing the string with s/// (or tr///) before
passing it to uc/lc. And that is a hack, unfortunately.
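
The pre-processing half of that idea could look roughly like the sketch below. The names tr_lc and tr_uc are hypothetical (they are not from any module), and the sketch only covers the four dotted/dotless i characters; it does nothing about sorting or titlecase. It does not rely on ToLower/ToUpper at all:

```perl
use strict;
use warnings;

# Hypothetical helper: Turkish-style lowercase.
sub tr_lc {
    my ($s) = @_;
    utf8::upgrade($s);               # force Unicode rules for 0x80-0xFF too
    # "I" -> U+0131 (dotless i), U+0130 (dotted capital I) -> "i"
    $s =~ tr/I\x{130}/\x{131}i/;
    return lc $s;                    # the built-in handles everything else
}

# Hypothetical helper: Turkish-style uppercase.
sub tr_uc {
    my ($s) = @_;
    utf8::upgrade($s);
    # "i" -> U+0130 (dotted capital I), U+0131 (dotless i) -> "I"
    $s =~ tr/i\x{131}/\x{130}I/;
    return uc $s;
}

binmode STDOUT, ':encoding(UTF-8)';
print tr_lc("\x{130}STANBUL"), "\n";   # lowercase with a plain dotted i
print tr_uc("istanbul"), "\n";         # uppercase starting with U+0130
```

Because the swap happens before lc/uc see the string, it also works for plain ASCII input such as uc('i') and lc('I'), which is exactly the case the range trick misses.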

