Re: Is lc(\x{130}) -> i\x{307} a bug?

From: karl williamson
Date: March 7, 2010 09:10
Subject: Re: Is lc(\x{130}) -> i\x{307} a bug?
Message ID: 4B93DDF5.2040208@khwilliamson.com

Burak Gürsoy wrote:
> Hello porters,
> 
> I'm not sure if this is a bug, or some known-unsupported-thing or
> something else, so I'm asking in here first before opening a ticket.
> 
> Lower-casing \x{130} (LATIN CAPITAL LETTER I WITH DOT ABOVE) gives me
> "i\x{307}" instead of just "i".
> That junk character at the end seems to be named COMBINING DOT ABOVE.
> This seems weird to me. And locale does not seem to have any effect.
> What do you think?
> 
> Cheers,
> Burak
> 

Perl is working correctly according to the Unicode standard.

The inclusion of U+0307 is the correct Unicode mapping for languages
other than Turkish and Azerbaijani.  For those two languages the mapping
should be just to 'i', but that shorter mapping does not preserve
canonical equivalence without further processing.  Perl currently
doesn't apply the language-specific mapping, nor does it currently
support locale handling of code points above 0xFF.
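
To make the above concrete, here is a small sketch of what lc() does
with U+0130 and why the longer mapping keeps canonical equivalence.  It
assumes Unicode::Normalize (shipped with Perl) is available;
code_points() is just a helper name made up for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Illustrative helper (not a core function): render a string as a
# space-separated list of code points.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $dotted_I = "\x{130}";   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The default (language-neutral) lowercase mapping keeps the dot as a
# separate combining character.
print "lc(U+0130)      = ", code_points(lc $dotted_I), "\n";

# U+0130 is canonically equivalent to "I" followed by U+0307.
print "NFD(U+0130)     = ", code_points(NFD($dotted_I)), "\n";

# Lowercasing either form gives canonically equivalent results, which
# would not hold if U+0130 mapped to a bare "i".
my $same = NFD(lc $dotted_I) eq lc(NFD($dotted_I)) ? "yes" : "no";
print "equivalence kept: $same\n";
```

The first line printed should be "U+0069 U+0307", i.e. 'i' followed by
COMBINING DOT ABOVE, which is what Burak is seeing.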

Perl is unlikely to add such support, as Unicode itself has moved away
from defining locale-dependent mappings.  It still defines this one and
a few others that were included very early on, but it isn't adding new
ones.  Instead there is the CLDR project for locale data, which I know
next to nothing about.

These mappings of U+0130 have been very problematic and have caused
significant consternation over the years, but they (and we) are stuck
with them now.

I thought there might be a CPAN module that changed the behavior to suit 
these two languages, but I just searched there and didn't see anything.
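
In the absence of such a module, something along these lines might be
enough for Turkish or Azerbaijani text.  This is only a sketch:
lc_dotted_i_turkic() is a hypothetical name, not anything in core or on
CPAN, and it handles only U+0130.  It leaves a plain 'I' to the default
mapping, so it does not produce the dotless U+0131 that full
Turkish-style lowercasing would.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase a string, mapping U+0130 to a plain
# 'i' instead of the default 'i' followed by U+0307.
sub lc_dotted_i_turkic {
    my ($str) = @_;
    $str =~ s/\x{130}/i/g;   # replace the dotted capital before lc() sees it
    return lc $str;          # everything else gets the default mapping
}

# \x{130}ZM\x{130}R is the city name Izmir written with dotted capitals.
print lc_dotted_i_turkic("\x{130}ZM\x{130}R"), "\n";   # prints "izmir"
```

Doing the substitution before calling lc() means the caller has to know
the text is in one of those two languages; nothing here looks at the
locale.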
