perl.perl5.porters | Postings from October 2014

RFC: What should Perl do if the locale is set to something it can't handle?

From: Karl Williamson
Date: October 9, 2014 02:54
Subject: RFC: What should Perl do if the locale is set to something it can't handle?
Message ID: 5435F8BE.7090702@khwilliamson.com

Some locales can't be dealt with very well by Perl.

For example, some ISO 646 7-bit locales turn some or many of Perl's
metacharacters into \w ones.  Some of them do this for the backslash;
in the Finnish variant it becomes "Ö" (U+00D6).  If these are used in
double-quoted strings or regex patterns, things aren't going to work
out too well.  Most of these at least leave the ASCII letters and
digits alone, though 7-bit Arabic, ASMO 449, changes the letters as
well.  My source is
http://en.wikipedia.org/wiki/ISO/IEC_646
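
To make the problem concrete, here is a small sketch of how the
Finnish/Swedish variant reinterprets bytes that Perl source relies on.
The table is deliberately partial (only the six letter positions), and
the function names are made up for this illustration.

```c
#include <stddef.h>

/* Partial, illustrative table (an assumption of this sketch): only the six
 * letter positions of the Finnish/Swedish ISO 646 variant are listed.
 * Every other byte is passed through unchanged, which is a simplification. */
unsigned int iso646_fi_to_unicode(unsigned char byte)
{
    switch (byte) {
    case 0x5B: return 0x00C4; /* '[' is displayed as Ä */
    case 0x5C: return 0x00D6; /* '\' is displayed as Ö */
    case 0x5D: return 0x00C5; /* ']' is displayed as Å */
    case 0x7B: return 0x00E4; /* '{' is displayed as ä */
    case 0x7C: return 0x00F6; /* '|' is displayed as ö */
    case 0x7D: return 0x00E5; /* '}' is displayed as å */
    default:   return byte;
    }
}

/* Returns 1 if the byte is one of the six positions remapped to a letter. */
int iso646_fi_is_remapped(unsigned char byte)
{
    return iso646_fi_to_unicode(byte) != (unsigned int)byte;
}

/* Counts the bytes in a NUL-terminated source fragment that such a locale
 * would treat as letters instead of punctuation. */
size_t count_remapped_bytes(const char *src)
{
    size_t n = 0;

    for (; *src != '\0'; src++) {
        if (iso646_fi_is_remapped((unsigned char)*src))
            n++;
    }
    return n;
}
```

Under this table, the backslash, both brackets and both braces in a
pattern such as \d+[a-z]{2} would all be classed as letters.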

I don't think these are very likely to be used these days.  Encode has
not found it worthwhile to make translation tables for them, for
example.  These were essentially replaced 20 years ago by the ISO 8859
series of 8-bit locales, which are all supersets of ASCII.  I don't
think we really need to be worried about ISO 646 (though a similar
situation occurs nowadays on EBCDIC platforms).

Starting in v5.20, Perl handles UTF-8 locales, but that is the only
multi-byte locale that is ever likely to be handled by Perl.  Other
multi-byte locales are in regular use these days, as I understand it,
mostly if not entirely for East Asian languages, such as Chinese.
Examples are Big5 and the various JIS ones.

On platforms that have C99 or POSIX.1-2001, we have the capability of
detecting such a multi-byte locale.  All the ones I've come across are
supersets of ASCII.  If a program is run in such a locale, it seems
likely it is going to have weird problems.
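
One plausible way to do that detection, as a minimal sketch: treat the
locale as multi-byte if MB_CUR_MAX is greater than 1, and then ask
nl_langinfo(CODESET) whether the codeset is some spelling of UTF-8.
This is not necessarily the check Perl itself would use, and the two
function names are invented for the example.

```c
#include <langinfo.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns 1 if name is a spelling of UTF-8 such as "UTF-8", "utf8" or
 * "UTF8".  ASCII case folding is done by hand so that the comparison does
 * not itself depend on the locale being examined. */
int codeset_is_utf8(const char *name)
{
    const char *want = "UTF8";

    for (; *name != '\0'; name++) {
        char c = *name;

        if (c == '-')
            continue;               /* the hyphen is optional */
        if (c >= 'a' && c <= 'z')
            c = (char)(c - 'a' + 'A');
        if (c != *want)
            return 0;               /* mismatch, or name is too long */
        want++;
    }
    return *want == '\0';           /* all of "UTF8" must be consumed */
}

/* Returns 1 if the current LC_CTYPE locale is multi-byte but its codeset
 * is not UTF-8, and 0 otherwise.  The caller is expected to have called
 * setlocale() already. */
int locale_is_non_utf8_multibyte(void)
{
    if (MB_CUR_MAX <= 1)
        return 0;                   /* single-byte locale */
    return !codeset_is_utf8(nl_langinfo(CODESET));
}
```

A caller would typically run setlocale(LC_CTYPE, "") first and emit the
warning when locale_is_non_utf8_multibyte() returns 1.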

I'm wondering if it would be a good thing to raise a warning on
C99/POSIX.1-2001 platforms when we are switched into a multi-byte locale
that isn't UTF-8.

This warning would be more for the user than the developer.  The latter
likely has no idea that her/his code is going to be used in this manner.
The warning would caution the user that things aren't likely to work
out well.

We could warn on the 646 (and similar) ones by seeing, when we switch
into one, if a \w character also matches one of the metacharacters.  We
already go through all \w characters individually, calculating their
folds, at the time of the switch anyway.
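
A rough sketch of such a check follows.  It goes in the opposite
direction from the loop described above: it walks a list of
metacharacters and asks the C library whether the current locale
classes each one as alphanumeric.  The metacharacter list and the
function name are assumptions made for this example, not Perl's own
tables.

```c
#include <ctype.h>
#include <stddef.h>

/* Collects the metacharacter bytes that isalnum() accepts in the current
 * LC_CTYPE locale.  Up to out_size - 1 of them are stored in out, which is
 * always NUL-terminated when out_size > 0.  Returns the number found. */
size_t metachars_classed_as_word(char *out, size_t out_size)
{
    /* Hypothetical list for illustration; not taken from the Perl source. */
    static const char metachars[] = "\\^$.|?*+()[]{}@%&#\"'`~";
    size_t found = 0;
    size_t i;

    for (i = 0; metachars[i] != '\0'; i++) {
        unsigned char c = (unsigned char)metachars[i];

        if (isalnum(c)) {
            if (out != NULL && found + 1 < out_size)
                out[found] = (char)c;
            found++;
        }
    }
    if (out != NULL && out_size > 0)
        out[found < out_size ? found : out_size - 1] = '\0';
    return found;
}
```

In the "C" locale, or in any locale that is a superset of ASCII, the
result should be 0.  A nonzero result after setlocale(LC_CTYPE, "")
would be the condition to warn on.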

Perhaps such a warning would be of a new category, say "locale". 
Another potential warning in this category is the one Aristotle proposed 
in http://nntp.perl.org/group/perl.perl5.porters/211909


