develooper Front page | perl.perl5.porters | Postings from May 2010

Re: Running Perl 5 on Turkish (Was: Is lc(\x{130}) -> i\x{307} a bug?)

karl williamson
May 26, 2010 16:02
demerphq wrote:
> On 26 May 2010 21:38, karl williamson <> wrote:
>> demerphq wrote:
>>> 2010/5/25 karl williamson <>:
>>>> ToFold doesn't work.  There is currently no way to override case
>>>> insensitive matching.  It's not clear at the moment what can be done in
>>>> that direction.  Turkish is clearly a demonstration of why that would be
>>>> something useful.  As 5.14 development progresses, we'll know more.  I'll
>>>> tell Yves who has some ideas of how to make folding better, but has not
>>>> had time to work much on them, that it would be nice if there were a way
>>>> to override the standard mappings.
>>> This is a design error in Unicode. Probably the *best* option would be
>>> to petition them to create a "lowercase dotless i" (that has a dot of
>>> course).
>> I don't know the history of this.  I'm not sure that it is a design error,
>> but certainly they've made those.  It could have been because of
>> compatibility with an existing standard.  I just don't know.  I do know it
>> has caused them many headaches, and they're not about to revisit it,
>> probably ever.  They decided there was no good solution and changed in
>> something like version 3.1 or 3.2 to the current one, as the least awful.
>>  They continue to pretend that their case folding is not locale-dependent,
>> but it is in this one instance.
> Yes, which is why I suspect there might be the possibility to get them
> to move on the subject.
> Basically it completely breaks round tripping Turkish script.

Except if you have a Turkish locale, it works.
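
To make the round-tripping point concrete, here is a minimal sketch. It is in Python rather than Perl, chosen only because its built-in str.lower() and str.upper() apply Unicode's default, locale-independent mappings. The helpers tr_lower and tr_upper are hypothetical names made up for this illustration. They hard-code only the two Turkish letter pairs and ignore the remaining Turkish rules in Unicode's SpecialCasing.txt, so they only approximate what a Turkish locale does.

```python
# Default (locale-independent) Unicode case mappings, as applied by str methods.
word = "D\u0130YARBAKIR"   # capital I with dot above, later a plain capital I

default_round_trip = word.lower().upper()

# Hypothetical Turkish-tailored helpers: remap the four i-letters first,
# then let the default mappings handle every other character.
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})

def tr_lower(text):
    return text.translate(_TR_LOWER).lower()

def tr_upper(text):
    return text.translate(_TR_UPPER).upper()

turkish_round_trip = tr_upper(tr_lower(word))

print("default round trip preserved:", default_round_trip == word)
print("Turkish round trip preserved:", turkish_round_trip == word)
```

With the default mappings, U+0130 comes back as 'I' followed by U+0307 COMBINING DOT ABOVE, so the original string is not recovered. With the tailored pairs it is.
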
>>> It appears this worked with latin-sharp-ess, as there is now a
>>> capitalized and lower case version, even though the letter has always
>>> been considered to be "lowercase that is used in title case script".
>> I do know a little about the history of this.  They generally require
>> evidence that something actually exists in the wild before they will
>> consider most things.  And in fact, someone showed the Unicode folks that E.
>> German newspapers from around 50 years ago were using this uppercase letter.
>>  The proposal was initially rejected, but revived with more evidence, then
>> accepted.  So it wasn't because someone said "wouldn't it be nice, because
>> this is causing us all sorts of implementation hassles", it was because
>> there was documented evidence that  the character, different from all other
>> characters, really existed.
> The thing is tho, latin-sharp-ess at least in German, was/is a ligature.
> It was _never_ _ever_ an uppercase letter, it is/was a lower case
> ligature (of sz) that had no uppercase equivalent so that in signs,
> which are normally uppercase, it was _also_ used.
> So even if it /was/ used in signs it was never uppercase.

I've heard that the signs were changing in Germany to use it.
> And IMO this is similar to the case with Turkish dotless I.
> No doubt Burak can find loads of photos of Turkish signs using the
> lowercase i, and since we know the Turkish rule for uppercasing is
> "special" we thus can prove that the dotless I should have a lower
> case equivalent.

I'm sure there are plenty of street signs in Turkey with a dotless i,
but that already is in Unicode as U+0131, LATIN SMALL LETTER DOTLESS I,
and its capital is U+0049 'I' (which as you can see doesn't have a
dot).  Instead, in Turkish, the capital of U+0069 'i' with a dot is
U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.  It's English and various
other European languages that lose their dot when capitalized.  What
should have happened (in my mostly uneducated opinion) is that they
should not have used U+0069 as the lower case of U+0130; instead there
should have been a new letter that looks just like U+0069 'i', but
whose upper case was U+0130.  I think there is too much code that would
break if they were to change it now, and that's why we're stuck.
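
The four code points involved can be listed with their default mappings. This is a small Python sketch, not Perl; it relies on Python's str methods following Unicode's default case mappings, and describe is a helper name invented here.

```python
import unicodedata

def describe(text):
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

default_mappings = {
    "upper(U+0131)": "\u0131".upper(),   # dotless small i
    "upper(U+0069)": "i".upper(),        # ordinary small i
    "lower(U+0049)": "I".lower(),        # ordinary capital I
    "lower(U+0130)": "\u0130".lower(),   # capital I with dot above
}

for label, result in default_mappings.items():
    print(label, "->", describe(result))
```

Both small letters uppercase to U+0049, and U+0130 lowercases to two code points, U+0069 followed by U+0307, which is the i\x{307} from the subject line.
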

And these days, an issue with doing non-unification (as Unicode calls
this) is that you have the same glyph meaning separate things, and that
leads to possible spoofing attacks.  The Cyrillic and Greek alphabets
aren't unified with Latin, and so there are glyphs that look identical,
so you can have a URL that looks like 'paypal', but it's all Cyrillic,
so if you click on that URL, you would get a malware imitator of the
famous paypal.  (This is not a Russian word.)  Or the issue now
documented in perlrecharclass.pod, with various digits looking very much
like the Western European digits, but with different values, so someone
could slip in a different \d that makes the number displayed look like a
smaller amount than it is.
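
Both spoofing problems can be shown in a few lines. This is a Python sketch, not Perl; Python's re module also treats \d as Unicode-aware by default, which is an analogy to the Perl behaviour and not a description of it. The lookalike string and the script_words helper are made up for this illustration, and the choice of BENGALI DIGIT FOUR as an 8 lookalike follows my recollection of the perlrecharclass example.

```python
import re
import unicodedata

def script_words(text):
    """First word of each character's Unicode name, a rough stand-in for its script."""
    return [unicodedata.name(ch).split()[0] for ch in text]

latin = "paypal"
# Cyrillic letters chosen to resemble p-a-y-p-a-l (illustrative choice).
lookalike = "\u0440\u0430\u0443\u0440\u0430\u04cf"

print("strings equal:", latin == lookalike)
print(set(script_words(latin)), set(script_words(lookalike)))

bengali_four = "\u09ea"   # BENGALI DIGIT FOUR, which resembles an ASCII 8
print("Unicode \\d matches:", bool(re.fullmatch(r"\d", bengali_four)))
print("ASCII-only \\d matches:", bool(re.fullmatch(r"\d", bengali_four, re.ASCII)))
print("numeric value:", int(bengali_four))
```

Restricting \d to ASCII, or validating digits against an explicit [0-9], is one way to keep a lookalike digit from being accepted as a number.
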

> Anyway, I suppose getting the Unicode group to change things is
> unlikely, but really that is the right solution (from the point of
> view of folding).
>> (They haven't always been so tight.  There is a Unicode Tibetan character
>> that means -1/2.  I was curious why, of all the languages in the world,
>> Tibetan would be the only one that had thought to have a single letter
>> stand for a negative number, and a fraction at that!  It turns out there
>> is no evidence that this has ever existed.  There was a stamp issued in
>> Tibet in the 1930s, I believe, whose value was, I think, 7 - .5 coins
>> (whatever the currency was).  Someone in Unicode used the rule that meant
>> "subtract 1/2" to extrapolate back to create characters for 5.5, 4.5, ...
>> -0.5.  If you're curious, you can google it, as I did.)
> Yikes. :-)
>>> I'll think on a technical solution, but I must admit the plans I've
>>> been toying with go the other way, in that they would probably be
>>> compiled at perl build time.
>> The only thing that comes to me is a 'use re' option.
> Yeah, it would hook in there, but still, deferring compilation of fold
> tables to run time is not the way I wanted to go. I suppose we have no
> choice tho.

We could always punt, as we have so far on this. :)
> Yves
