Burak Gürsoy wrote:
>> -----Original Message-----
>> From: karl williamson [mailto:public@khwilliamson.com]
>> Sent: Sunday, March 07, 2010 7:10 PM
>> To: Burak Gürsoy
>> Cc: perl5-porters@perl.org
>> Subject: Re: Is lc(\x{130}) -> i\x{307} a bug?
>>
>
>> Perl is working correctly according to the Unicode standard.
>
> Ok.
>
>> The inclusion of U+0307 is the correct Unicode mapping for languages
>> other than Turkish and Azerbaijani.  The mapping should be just to 'i'
>
> It's Turkish in my case :)
>
>> for those two languages, but it does not preserve canonical equivalence
>> without further processing.  Perl currently doesn't do this, nor does
>> Perl currently support locale handling of code points beyond 0xFF.
>>
>> Perl is unlikely to add such support, as Unicode itself has moved away
>> from defining locale dependent mappings.  They still define this one and
>> a few others that were included very early on in Unicode, but aren't
>> adding new ones.  Instead they have a CLDR project for locale data.  I
>> know next to nothing about that.
>
> I'll check that. And an OffTopic question if you don't mind: any ideas
> on the decision in perl6 on this matter?

I personally know nothing about this.

>
>> These mappings of U+0130 have been very problematic and have caused
>> significant consternation over the years, but they (and we) are stuck
>> with it now.
>>
>> I thought there might be a CPAN module that changed the behavior to suit
>> these two languages, but I just searched there and didn't see anything.
>
> Well... defining a ToLower() seems to be the remedy for this issue:
>
> sub ToLower {
>     return <<"RANGE";
> 0049\t\t0131
> 0130\t\t0069
> RANGE
> }

It has been my experience that just defining these two mappings causes
all other casing to become undefined.  That is, lc('A') now yields 'A',
unless you have a mapping for it in your function.  If that isn't your
experience, let me know.

To get all the other mappings you should take the files in
lib/unicore/To and put them into your ToXXX functions, changing the
entries you need to.  A very unideal method.

>
> Too trivial to wrap inside a module I guess :) A module related to this
> must also handle the sorting, etc. However, since ToLower/ToUpper
> is by-passed for non-unicode-looking strings, the range trick will not work
> for things like uc('i')/lc('I') (for Turkish). Only way to make a reliable
> Turkish locale dependent thingy seems to be a combination of ToLower/ToUpper
> and pre-process the string before passing to uc/lc with s/// (or tr///).
> And this is a hack unfortunately.

I would like to figure out a way to extend this capability so it is more
usable, and to work on non-unicode looking strings.  But I haven't come
up with one that doesn't unduly penalize (that is, slow down) the vast
majority of programs that don't use it.  We actually haven't been
certain that anyone does use this feature.  I remember saying to myself,
and perhaps even writing, that Turkish would be a likely candidate for
its use.

I don't know about the sorting, etc.

>
> Thanks,
> Burak
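
For reference, the pre-processing idea quoted above (fix up the dotted
and dotless i cases with tr/// first, then hand the string to the
built-in lc/uc) might look something like the sketch below.  The names
tr_lc and tr_uc are made up for illustration and do not come from any
module.  The utf8::upgrade call is an assumption about how to get
Unicode casing rules for strings that are not already stored as UTF-8.
The sketch only covers the four single code points; it does not deal
with the decomposed form i + U+0307 or with sorting.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of core Perl or any CPAN module.
# They expect decoded character strings, not raw UTF-8 bytes.

sub tr_lc {
    my ($str) = @_;
    # Force the internal UTF-8 representation so lc() applies Unicode
    # rules even to strings that only contain code points below 0x100.
    utf8::upgrade($str);
    # Turkish/Azerbaijani exceptions:
    #   I (U+0049)             -> dotless i (U+0131)
    #   dotted capital I (U+0130) -> i (U+0069)
    $str =~ tr/I\x{130}/\x{131}i/;
    # Everything else uses the default Unicode mapping.
    return lc $str;
}

sub tr_uc {
    my ($str) = @_;
    utf8::upgrade($str);
    # Turkish/Azerbaijani exceptions:
    #   i (U+0069)         -> dotted capital I (U+0130)
    #   dotless i (U+0131) -> I (U+0049)
    $str =~ tr/i\x{131}/\x{130}I/;
    return uc $str;
}
```

With these, tr_uc('i') gives "\x{130}" and tr_lc('I') gives "\x{131}",
while a plain lc('A') elsewhere in the program is left alone because
nothing global is overridden.  It is still the hack described above:
every caller has to remember to use the wrappers instead of lc/uc.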