
Re: Running Perl 5 on Turkish (Was: Is lc(\x{130}) -> i\x{307} a bug?)

karl williamson
June 3, 2010 19:49
karl williamson wrote:
> demerphq wrote:
>> On 26 May 2010 21:38, karl williamson wrote:
>>> demerphq wrote:
>>>> 2010/5/25 karl williamson:
>>>>> ToFold doesn't work.  There is currently no way to override case
>>>>> insensitive matching.  It's not clear at the moment what can be done
>>>>> in that direction.  Turkish is clearly a demonstration of why that
>>>>> would be something useful.  As 5.14 development progresses, we'll
>>>>> know more.  I'll tell Yves who has some ideas of how to make folding
>>>>> better, but has not had time to work much on them, that it would be
>>>>> nice if there were a way to override the standard mappings.
>>>> This is a design error in Unicode. Probably the *best* option would be
>>>> to petition them to create a "lowercase dotless i" (that has a dot of
>>>> course).
>>> I don't know the history of this.  I'm not sure that it is a design
>>> error, but certainly they've made those.  It could have been because
>>> of compatibility with an existing standard.  I just don't know.  I do
>>> know it has caused them many headaches, and they're not about to
>>> revisit it, probably ever.  They decided there was no good solution
>>> and changed in something like version 3.1 or 3.2 to the current one,
>>> as the least awful.  They continue to pretend that their case folding
>>> is not locale-dependent, but it is in this one instance.
>> Yes, which is why I suspect there might be the possibility to get them
>> to move on the subject.
>> Basically it completely breaks round tripping Turkish script.
> Except if you have a Turkish locale, it works.
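
The round-tripping point above can be illustrated with a minimal sketch. It is written in Python rather than Perl, on the assumption that Python's `str.upper()` and `str.lower()` apply the same locale-independent Unicode default mappings being discussed. The helpers `turkish_lower` and `turkish_upper` are hypothetical names introduced here; they are a simplified stand-in for Turkish-tailored casing, not how Perl or any locale library implements it.

```python
# Sketch only: default Unicode case mappings vs. a simplified Turkish tailoring.
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def turkish_lower(text):
    """Hypothetical helper: map I -> dotless i and dotted capital I -> i,
    then apply the default lowercase mapping to everything else."""
    text = text.replace("I", DOTLESS_SMALL_I)
    text = text.replace(DOTTED_CAPITAL_I, "i")
    return text.lower()


def turkish_upper(text):
    """Hypothetical helper: map i -> dotted capital I and dotless i -> I,
    then apply the default uppercase mapping to everything else."""
    text = text.replace("i", DOTTED_CAPITAL_I)
    text = text.replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def default_round_trip(text):
    """Uppercase then lowercase using the locale-independent mappings."""
    return text.upper().lower()


def turkish_round_trip(text):
    """Uppercase then lowercase using the simplified Turkish tailoring."""
    return turkish_lower(turkish_upper(text))
```

With the default mappings, a word made of dotless letters such as "ılık" does not come back unchanged from `default_round_trip`, while `turkish_round_trip` returns it intact. The sketch ignores combining sequences such as `i` followed by U+0307, which a real tailoring would also have to handle.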

I found an FAQ about this on the Unicode site:
Q: Why aren't there extra characters to support locale-independent 
casing for Turkish?

A: The fact is that there is too much data coded in 8859-9 (with 0xDD = İ
and 0x49 = I) which contains both Turkish and non-Turkish text. Transcoding this data 
to Unicode would be intolerably difficult if it all had to be tagged 
first to sort out which 0x49 characters are ordinary "I" and which are 
CAPITAL LETTER DOTLESS I. Better to accept the compromise and get on 
with moving to Unicode. Moreover, there is a strong doubt that users 
will "get it right" in future either when they enter new characters. [JC]
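
The transcoding argument in the FAQ answer can be sketched as follows, assuming Python's standard `iso8859_9` codec as a stand-in for any 8859-9 to Unicode converter. The function name `decode_latin5` and the sample byte strings are made up for illustration.

```python
# Sketch only: byte 0x49 in ISO 8859-9 carries no hint about its language.

def decode_latin5(raw):
    """Decode ISO 8859-9 (Latin-5) bytes to a Unicode string."""
    return raw.decode("iso8859_9")


def code_points(text):
    """Return the code points of a string as 'U+XXXX' labels."""
    return ["U+%04X" % ord(ch) for ch in text]


# The four "i" letters of the Turkish alphabet as 8859-9 bytes:
# 0x49 = I, 0xDD = capital I with dot above, 0xFD = dotless i, 0x69 = i.
four_letters = decode_latin5(bytes([0x49, 0xDD, 0xFD, 0x69]))

# A Turkish word and an English word that both contain byte 0x49.
turkish_word = decode_latin5(bytes([0x44, 0x49, 0xDE]))        # D, I, S with cedilla
english_word = decode_latin5(bytes([0x44, 0x49, 0x53, 0x4B]))  # D, I, S, K

print(code_points(four_letters))
print(code_points(turkish_word), code_points(english_word))
```

Both words decode their second byte to the same U+0049. A converter that wanted to emit a separate "Turkish capital dotless I" would therefore need language tagging that the byte stream does not contain, which is the difficulty the answer describes.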

I don't know if I buy the explanation; but it sounds like it's futile to 
get them to change.
