Burak Gürsoy wrote:
> Hello porters,
>
> I'm not sure if this is a bug, or some known-unsupported-thing or
> something else, so I'm asking in here first before opening a ticket.
>
> Lower-casing \x{130} (LATIN CAPITAL LETTER I WITH DOT ABOVE) gives me
> "i\x{307}" instead of just "i".
> That junk character at the end seems to be named COMBINING DOT ABOVE.
> This seems weird to me. And locale does not seem to have any effect.
> What do you think?
>
> Cheers,
> Burak
>

Perl is working correctly according to the Unicode standard. The inclusion of U+0307 is the correct Unicode mapping for languages other than Turkish and Azerbaijani. For those two languages the mapping should be just to 'i', but that mapping does not preserve canonical equivalence without further processing. Perl currently doesn't do this, nor does Perl currently support locale handling of code points beyond 0xFF.

Perl is unlikely to add such support, as Unicode itself has moved away from defining locale-dependent mappings. They still define this one and a few others that were included very early on in Unicode, but aren't adding new ones. Instead they have a CLDR project for locale data. I know next to nothing about that.

These mappings of U+0130 have been very problematic and have caused significant consternation over the years, but they (and we) are stuck with them now.

I thought there might be a CPAN module that changed the behavior to suit these two languages, but I just searched there and didn't see anything.
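
For anyone who needs the Turkish or Azerbaijani result in the meantime, something along these lines might do as a stopgap. This is only a minimal sketch, not a core feature and not an existing CPAN module: the name lc_tr is made up for illustration, it assumes the input is already a decoded character string, and it handles only precomposed U+0130, plain 'I', and 'I' immediately followed by U+0307. It does not implement the full context rules from Unicode's SpecialCasing data (for example, other combining marks sitting between 'I' and the dot).

```perl
use strict;
use warnings;

# Hypothetical helper (name and approach are assumptions for illustration).
# Lower-cases a decoded character string using the Turkish/Azerbaijani
# treatment of dotted and dotless I, then defers to Perl's default lc().
sub lc_tr {
    my ($str) = @_;
    $str =~ s/I\x{307}/i/g;          # decomposed dotted capital I -> i
    $str =~ tr/\x{130}I/i\x{131}/;   # U+0130 -> i, I -> U+0131 (dotless i)
    return lc $str;                  # everything else: default Unicode mapping
}

# %vX prints the code points of a string in hex, joined by dots.
printf "%vX\n", lc "\x{130}";        # default mapping: 69.307
printf "%vX\n", lc_tr("\x{130}");    # Turkish/Azerbaijani mapping: 69
```

The translation runs before lc() so that the default mapping never sees U+0130 or a bare 'I'; all other characters still go through Perl's normal Unicode lower-casing.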