perl.perl5.porters | Postings from May 2010

Re: Running Perl 5 on Turkish (Was: Is lc(\x{130}) -> i\x{307} a bug?)

karl williamson
May 26, 2010 12:39
demerphq wrote:
> 2010/5/25 karl williamson <>:
>> ToFold doesn't work.  There is currently no way to override case insensitive
>> matching.  It's not clear at the moment what can be done in that direction.
>>  Turkish is clearly a demonstration of why that would be something useful.
>>  As 5.14 development progresses, we'll know more. I'll tell Yves who has
>> some ideas of how to make folding better, but has not had time to work much
>> on them, that it would be nice if there were a way to override the standard
>> mappings.
> This is a design error in Unicode. Probably the *best* option would be
> to petition them to create a "lowercase dotless i" (that has a dot of
> course).

I don't know the history of this.  I'm not sure that it is a design 
error, but certainly they've made those.  It could have been because of 
compatibility with an existing standard.  I just don't know.  I do know 
it has caused them many headaches, and they're not about to revisit it, 
probably ever.  In something like version 3.1 or 3.2 they decided there 
was no good solution and changed to the current mapping as the least 
awful one.  They continue to pretend that their case folding is not 
locale-dependent, but it is in this one instance.
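
To make the locale dependence concrete, here is a minimal sketch in Python rather than Perl, since the point is about the Unicode default mappings and not about any Perl internals.  The names lower_default, lower_turkic and TURKIC_LOWER are hypothetical, made up for illustration.  The Turkish rule is reduced to the two unconditional mappings and ignores the context-sensitive handling of a following combining dot above, so treat it as an assumption-laden illustration, not a complete tailoring.

```python
# Sketch: default (locale-independent) lowercasing versus a simplified
# Turkish tailoring.  Helper names are hypothetical, for illustration only.

# Simplified Turkic rules: I -> dotless i (U+0131), I-with-dot (U+0130) -> i.
# Assumption: the conditional rules for U+0307 after I are left out.
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}


def lower_default(s):
    """Lowercase with the Unicode default full case mapping."""
    return s.lower()


def lower_turkic(s):
    """Apply the two Turkic mappings first, then the default mapping."""
    return "".join(TURKIC_LOWER.get(ch, ch) for ch in s).lower()


if __name__ == "__main__":
    # Default mapping: U+0130 becomes "i" followed by U+0307.
    print([hex(ord(c)) for c in lower_default("\u0130")])
    # Turkic mapping: U+0130 becomes a plain "i".
    print([hex(ord(c)) for c in lower_turkic("\u0130")])
    # Turkic mapping: "I" becomes dotless i (U+0131), not "i".
    print([hex(ord(c)) for c in lower_turkic("I")])
```

The default mapping can only round-trip U+0130 by keeping the dot as a separate combining character, which is where the i\x{307} in the subject line comes from.  A Turkish user needs the second function's behaviour instead, and no single locale-independent table gives both.
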
> It appears this worked with latin-sharp-ess, as there is now a
> capitalized and lower case version even though the letter has always
> been considered to be "lowercase that is used in title case script".

I do know a little about the history of this.  They generally require 
evidence that something actually exists in the wild before they will 
consider most things.  And in fact, someone showed the Unicode folks 
that E. German newspapers from around 50 years ago were using this 
uppercase letter.  The proposal was initially rejected, but revived with 
more evidence, then accepted.  So it wasn't because someone said 
"wouldn't it be nice, because this is causing us all sorts of 
implementation hassles", it was because there was documented evidence 
that the character, different from all other characters, really existed.

(They haven't always been so strict.  There is a Unicode Tibetan 
character that means -1/2.  I was curious why, of all the languages in 
the world, Tibetan would be the only one to have a single character 
stand for a negative number, and a fraction at that!  It turns out 
there is no evidence that it has ever existed.  There was a stamp 
issued in Tibet in the 1930's, I believe, whose value was, I think, 
7 - .5 coins (whatever the currency was).  Someone in Unicode used the 
rule that meant "subtract 1/2" to extrapolate back and create 
characters for 5.5, 4.5, ... -0.5.  If you're curious, you can google 
it, as I did.)
> I'll think on a technical solution, but I must admit the plans I've
> been toying with go the other way, in that they would probably be
> compiled at perl build time.

The only thing that comes to me is a 'use re' option.

> cheers,
> Yves
