Resurrected RFC: Handling utf8 locales

From: Karl Williamson
Date: December 10, 2013 21:27
Subject: Resurrected RFC: Handling utf8 locales
Message ID: 52A7871C.5050809@khwilliamson.com

I haven't given up on this proposal.  To refresh your memory, the 
proposal is for Perl to check if the current locale is a UTF-8 one, and 
if so, treat strings for LC_CTYPE purposes like strings normally are in 
Perl, and without looking at the actual locale data.  This works because 
UTF-8 is an underlying Perl string data type.  The original thread was at
http://markmail.org/message/q4vorzd2xcxbm43y

I reiterated this proposal in the discussion of
https://rt.perl.org/rt3/Ticket/Display.html?id=117787

(which this would fix) and got no responses.  I have a branch which has 
it mostly implemented, but the bitrot needs to be cleaned up.

I have a further proposal: to use wcsxfrm() for LC_COLLATE on machines 
that have it.  Unicode publishes high-quality POSIX locale definitions, 
and this would use them to avoid the need for, and the slowdown of, 
Unicode::Collate in many cases.
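
For illustration only, here is a minimal C sketch of how collation via 
wcsxfrm() might look.  It is not taken from the branch mentioned above; 
make_sort_key() and collate_wide() are hypothetical names, and the sketch 
assumes the strings have already been converted to wchar_t and that the 
caller has set LC_COLLATE as desired.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not from Perl's source): returns a malloc'd
 * collation key for s under the current LC_COLLATE locale, or NULL if
 * the allocation fails.  The caller frees the result. */
wchar_t *make_sort_key(const wchar_t *s)
{
    assert(s != NULL);

    /* With a zero-length buffer, wcsxfrm() only reports how many wide
     * characters the key needs, excluding the terminating null. */
    size_t len = wcsxfrm(NULL, s, 0);

    wchar_t *key = malloc((len + 1) * sizeof *key);
    if (key != NULL)
        wcsxfrm(key, s, len + 1);
    return key;
}

/* Hypothetical helper: compares two wide strings by their collation
 * keys.  Returns <0, 0 or >0 like wcscmp().  If a key cannot be
 * allocated, it falls back to a plain code point comparison. */
int collate_wide(const wchar_t *a, const wchar_t *b)
{
    wchar_t *ka = make_sort_key(a);
    wchar_t *kb = make_sort_key(b);

    int result = (ka != NULL && kb != NULL) ? wcscmp(ka, kb)
                                            : wcscmp(a, b);
    free(ka);
    free(kb);
    return result;
}
```

A real sort would presumably compute each key once and reuse it across 
comparisons, rather than rebuilding both keys on every call as this 
sketch does.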

To summarize the proposal:
When Perl does a locale-sensitive operation within the scope of 'use 
locale' it would check if the locale is a UTF-8 one or not.  If not, it 
would behave as it currently does.  Under a UTF-8 locale, for LC_CTYPE 
operations, it would behave as if it weren't under 'use locale'.  Thus 
the LC_CTYPE operations within a UTF-8 locale are indistinguishable from 
non-locale operations.  (This means there's not much to implement, as we 
are just using existing code paths for the most part.)  For LC_COLLATE 
operations under UTF-8 locales, the wide character transform would be 
used on platforms where it is available.  This is slower than the 
existing code but gives much better results; currently things just don't 
work at all under these locales, as Tom Christiansen has lamented.

To be clear, Perl has never claimed to support non-8-bit locales, so this 
is an enhancement.  But on Linux, at least, most of our users these days 
seem to be using these unsupported locales, so it seems right that we 
should support them, especially as the implementation cost is not high.
