develooper Front page | perl.perl5.porters | Postings from December 2010

RFC: Fixing locale and utf8; some backward incompatibility issues

karl williamson
December 31, 2010 13:47
It turns out that Perl has two disparate methods for dealing with locale 
and utf8.

One way is to lose localeness when a scalar is changed to utf8, so that 
whatever character is at ordinal X suddenly is assumed to be the 
character that Unicode thinks is at that position. I believe the 
injunctions in the documentation against mixing localeness and utf8 stem 
from this broken behavior.
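
A small sketch of why this loses information, assuming purely for illustration that the locale's code set is ISO-8859-15 (any 8-bit code set that differs from Latin-1 would do). It uses only the core Encode and charnames modules and does not touch the system locale:

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Assumption for illustration: the locale's code set is ISO-8859-15,
# where byte 0xA4 is the euro sign.
my $byte = "\xA4";

# What the byte means in that code set.
my $in_locale   = decode('iso-8859-15', $byte);
my $locale_name = charnames::viacode(ord $in_locale);

# What the "lose localeness" method assumes: ordinal 0xA4 is simply
# Unicode code point U+00A4.
my $unicode_name = charnames::viacode(0xA4);

print "in the locale: $locale_name\n";     # EURO SIGN
print "assumed:       $unicode_name\n";    # CURRENCY SIGN
```

The same ordinal names two different characters, which is the mismatch the first method silently introduces.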

For a long time I thought that this was the only method that Perl used, 
and it bothered me because it is wrong.  Eventually I hit on a better 
solution, only to discover that in other places Perl already does exactly 
what I had thought of.  That method is to treat characters in the latin1 
range as if they were in their locale, even if encoded in utf8, and to 
treat characters above the latin1 range as Unicode.
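
The rule of the second method can be modeled in a few lines. This is only a sketch, not the code perl uses internally: is_alpha_hybrid and the $c_like callback are hypothetical names, and the callback stands in for whatever the locale's isalpha() would report.

```perl
use strict;
use warnings;

# Hypothetical helper, not a perl internal: decide whether a code point
# is alphabetic under the second method.
#   $cp              - the code point
#   $locale_is_alpha - callback standing in for the locale's isalpha()
sub is_alpha_hybrid {
    my ($cp, $locale_is_alpha) = @_;

    # Latin1 range: ask the locale, however the string is encoded.
    return $locale_is_alpha->($cp) ? 1 : 0 if $cp < 256;

    # Above latin1: use the Unicode definition.
    return chr($cp) =~ /\p{Alphabetic}/ ? 1 : 0;
}

# Stand-in for a locale that only knows the ASCII letters (like "C").
my $c_like = sub {
    my ($cp) = @_;
    return ($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A);
};

for my $cp (0x41, 0xE9, 0x3B1, 0x2603) {
    printf "U+%04X: %d\n", $cp, is_alpha_hybrid($cp, $c_like);
}
# U+0041: 1   (the locale says alphabetic)
# U+00E9: 0   (the locale says no, even though Unicode would say yes)
# U+03B1: 1   (above latin1, Unicode decides)
# U+2603: 0
```

The encoding of the string never enters into the decision; only the code point's range does.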

I propose to convert the code that doesn't use the second method to do 
so.  This presents various backward compatibility issues, as behavior 
will change, though probably for the better.  The biggest change is 
that no \p{} properties would apply to latin1 characters.  If you think 
about it, that is how it should be, as we don't really know that 0x41 
represents an alphabetic in the locale, for example.  One should not be 
using Unicode properties in locales; instead one should be using the 
[:posix:] ones or \s, \w, \d.  Nor should one be using \h or \v.  I 
propose to output a warning when \p{}, \h, \v, or \R is used with 
locale; the warning would say that the construct only applies to code 
points above 255.

Similarly, \N{} can't legitimately be used under locale for code points 
in the latin1 range.  I propose to output a warning when this happens.

The current behavior is demonstrably broken, as /\s/ uses the better 
approach, and /[\s]/ uses the worse approach.
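
A probe along the following lines can show whether a given perl exhibits the discrepancy. It is a sketch only: the result depends on the perl version and on whichever locale the environment selects, so no output is promised, and the @mismatch and $checked variables are illustrative names.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Take the locale from the environment; which one that is, and whether
# it is installed, is system-specific.
setlocale(LC_CTYPE, '') or warn "could not set LC_CTYPE from environment\n";

my @mismatch;
my $checked = 0;

{
    use locale;
    for my $ord (0 .. 255) {
        my $ch = chr $ord;
        utf8::upgrade($ch);    # same character, now stored as utf8

        my $bare    = $ch =~ /\s/   ? 1 : 0;
        my $bracket = $ch =~ /[\s]/ ? 1 : 0;
        push @mismatch, sprintf("0x%02X: /\\s/=%d /[\\s]/=%d",
                                $ord, $bare, $bracket)
            if $bare != $bracket;
        $checked++;
    }
}

# Any line printed here is a code point on which the two forms disagree.
print "$_\n" for @mismatch;
```

If the two forms are consistent on the perl being tested, the script prints nothing.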

These changes would mean that locale and utf8 could work together 
reasonably for the first time.
