karl williamson wrote:
> demerphq wrote:
>> On 26 May 2010 21:38, karl williamson <public@khwilliamson.com> wrote:
>>> demerphq wrote:
>>>> 2010/5/25 karl williamson <public@khwilliamson.com>:
>>>>> ToFold doesn't work. There is currently no way to override case
>>>>> insensitive matching. It's not clear at the moment what can be
>>>>> done in that direction. Turkish is clearly a demonstration of why
>>>>> that would be something useful. As 5.14 development progresses,
>>>>> we'll know more. I'll tell Yves, who has some ideas of how to make
>>>>> folding better but has not had time to work much on them, that it
>>>>> would be nice if there were a way to override the standard
>>>>> mappings.
>>>> This is a design error in Unicode. Probably the *best* option would be
>>>> to petition them to create a "lowercase dotless i" (that has a dot of
>>>> course).
>>> I don't know the history of this. I'm not sure that it is a design
>>> error, but certainly they've made those. It could have been because
>>> of compatibility with an existing standard. I just don't know. I do
>>> know it has caused them many headaches, and they're not about to
>>> revisit it, probably ever. They decided there was no good solution
>>> and changed in something like version 3.1 or 3.2 to the current one,
>>> as the least awful. They continue to pretend that their case folding
>>> is not locale-dependent, but it is in this one instance.
>>
>> Yes, which is why I suspect there might be the possibility to get them
>> to move on the subject.
>>
>> Basically it completely breaks round tripping Turkish script.
>
> Except if you have a Turkish locale, it works.

I found an FAQ about this on the Unicode site:

Q: Why aren't there extra characters to support locale-independent
casing for Turkish?
A: The fact is that there is too much data coded in 8859-9 (with 0xDD =
LATIN CAPITAL LETTER I WITH DOT and 0xFD = LATIN SMALL LETTER DOTLESS I)
which contains both Turkish and non-Turkish text. Transcoding this data
to Unicode would be intolerably difficult if it all had to be tagged
first to sort out which 0x49 characters are ordinary "I" and which are
CAPITAL LETTER DOTLESS I. Better to accept the compromise and get on
with moving to Unicode. Moreover, there is a strong doubt that users
will "get it right" in future either when they enter new characters.
[JC]

I don't know if I buy the explanation; but it sounds like it's futile to
get them to change.
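
For anyone who wants to poke at the two points above (the 8859-9 byte values from the FAQ, and the round-tripping complaint), here is a minimal sketch. It is in Python rather than Perl purely for illustration; the helper names are hypothetical, and it assumes Python's built-in `iso8859_9` codec and the locale-independent `str.upper()` / `str.lower()` behave as documented. It is not how Perl implements folding.

```python
# Illustration only: the names below are made up for this sketch.
CODEC = "iso8859_9"  # Python's built-in name for ISO 8859-9 (Latin-5)


def decode_latin5(raw: bytes) -> str:
    """Transcode ISO 8859-9 bytes to a Unicode string."""
    return raw.decode(CODEC)


def round_trips(ch: str) -> bool:
    """True if upper-casing then lower-casing gives back the original,
    using the default (locale-independent) case mappings."""
    return ch.upper().lower() == ch


# The two Turkish-specific bytes named in the FAQ get their own code points.
dotted_capital = decode_latin5(b"\xdd")  # U+0130, capital I with dot above
dotless_small = decode_latin5(b"\xfd")   # U+0131, small dotless i

# Byte 0x49 always becomes plain U+0049, whether the surrounding text is
# Turkish or not; the byte carries nothing that could tell the two apart.
plain_capital = decode_latin5(b"\x49")

# Without a Turkish locale, dotless i upper-cases to plain "I", which then
# lower-cases to dotted "i", so the original letter is lost.
ascii_ok = round_trips("i")
dotless_ok = round_trips(dotless_small)
```

Under those assumptions `ascii_ok` comes out true and `dotless_ok` false, which is the round-tripping breakage mentioned earlier in the thread. A Turkish-aware mapping would have to be applied on top of the default one.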