
RFC: Proposal for fixing user-defined casing issues

karl williamson
September 16, 2010 13:18
Summary:  Deprecate current behavior and supply a better-behaved 
alternative in 5.14.  Remove current behavior in 5.16, speeding up utf8 
case changing.

Current situation:  It is possible to override the system case change 
operations (upper, lower, title) by defining subroutines ToUpper, 
ToLower, and ToTitle.  These are effective only on utf8-encoded strings. 
  They work for both the builtins like uc(), and the string-inlined 
equivalents like \U.  The mechanism is not well-behaved by modern Perl 
standards.  Here's its documentation from perlunicode.pod:

"All non-threaded programs have exactly one uppercasing
behavior, one lowercasing behavior, and one titlecasing behavior in
effect for utf8-encoded strings for the duration of the program.  Each
of these behaviors is irrevocably determined the first time the
corresponding function is called to change a utf8-encoded string's case.
If a corresponding C<To-> function has been defined in the package that
makes that first call, the mapping defined by that function will be the
mapping used for the duration of the program's execution across all
packages and scopes.  If no corresponding C<To-> function has been
defined in that package, the standard official mapping will be used for
all packages and scopes, and any corresponding C<To-> function anywhere
will be ignored.  Threaded programs have similar behavior.  If the
program's casing behavior has been decided at the time of a thread's
creation, the thread will inherit that behavior.  But, if the behavior
hasn't been decided, the thread gets to decide for itself, and its
decision does not affect other threads nor its creator."

I have thought quite a bit about how to get this to work better, and 
have come to the conclusion that it is so broken that it's not worth trying.

Instead, I propose to deprecate it.  If you want to override case 
changing, you can just override uc() and its cousins to do what you 
want.  The ToUpper() mechanism is not sufficient to do 
context-sensitive casing anyway, whereas the overridden uc() 
functions are.  Examples for Turkish are given in perlunicode.pod. 
Overriding uc() etc. also allows you to fix things so that the utf8ness 
doesn't matter.
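
As a rough illustration of overriding uc() within a single package, here is 
a minimal sketch.  The package name TurkishCase, the helper shout(), and the 
one-line Turkish rule are assumptions made for this example; this is not the 
example from perlunicode.pod, and real Turkish casing needs more rules.

```perl
use strict;
use warnings;

package TurkishCase;    # hypothetical package name

# Mark uc() as a user-defined sub in this package, so that calls
# compiled here resolve to it rather than to the builtin.
use subs 'uc';

sub uc {
    my ($string) = @_;
    # Simplified Turkish rule (an assumption for this sketch):
    # map dotted lowercase i to U+0130 before the standard mapping runs.
    $string =~ s/i/\x{130}/g;
    return CORE::uc($string);    # the builtin handles everything else
}

# Hypothetical helper: code compiled in this package sees the override.
sub shout { return uc($_[0]) }

package main;

binmode(STDOUT, ':encoding(UTF-8)');
print TurkishCase::shout("iki"), "\n";
```

Because the substitution operates on characters, the result is the same 
whether or not the input string happens to be utf8-encoded.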

But there is a bug in the uc() mechanism which would have to be 
fixed.  Overriding these does not affect \U.  (Similarly, an 
overridden quotemeta() should be called by \Q.)  I can work on a patch 
for this.

If someone did really want to override casing outside the current 
package, they could use '*CORE::GLOBAL::uc = sub {...}' to do so.
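
A minimal sketch of such a global override follows.  The package name 
Some::Other::Package, the helper shout(), and the simplified Turkish rule 
are assumptions for illustration only.  The assignment is placed in a BEGIN 
block because the override has to be in place before the calling code is 
compiled.

```perl
use strict;
use warnings;

BEGIN {
    no warnings 'once';
    # Install a replacement for uc() that is visible in every package
    # compiled after this point.
    *CORE::GLOBAL::uc = sub {
        my ($string) = @_;
        # Simplified Turkish rule (an assumption for this sketch).
        $string =~ s/i/\x{130}/g;
        return CORE::uc($string);    # fall through to the builtin
    };
}

package Some::Other::Package;    # hypothetical, defines no uc() itself

sub shout { return uc($_[0]) }   # resolves to the global override

package main;

binmode(STDOUT, ':encoding(UTF-8)');
print Some::Other::Package::shout("iki"), "\n";
```

Note that the replacement sub here takes its argument explicitly; unlike the 
builtin, it does not default to $_ when called without arguments.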

Currently, there is #ifdef'd-out code in uc() etc that, if enabled, 
would use the existing casing tables compiled into Perl for the first 
256 code points, utf8 or not.  This code is not used because of the 
possibility of user-defined case overrides.  Once the ToUpper() etc 
mechanism is removed, this code could be enabled, speeding up casing for 
Western languages, as it would not have to go out to the tables stored 
on disk for casing all utf8 strings.  I could also revise code that I've 
already written, so that it would bypass this step if there is no 
possibility that there is user-defined casing.  But this is throwaway 
work, so I would propose not doing it, and just waiting until 5.16 instead.

