RFC: Proposal for fixing user-defined casing issues
From: karl williamson
Date: September 16, 2010 13:18
Subject: RFC: Proposal for fixing user-defined casing issues
Message ID: 4C927BA5.6070800@khwilliamson.com
Summary: Deprecate current behavior and supply a better-behaved
alternative in 5.14. Remove current behavior in 5.16, speeding up utf8
case changing.
Current situation: It is possible to override the system case change
operations (upper, lower, title) by defining subroutines ToUpper,
ToLower, and ToTitle. These are effective only on utf8-encoded strings.
They work for both the builtins like uc(), and the string-inlined
equivalents like \U. The mechanism is not well-behaved by modern Perl
standards. Here's its documentation from perlunicode.pod:
"All non-threaded programs have exactly one uppercasing
behavior, one lowercasing behavior, and one titlecasing behavior in
effect for utf8-encoded strings for the duration of the program. Each
of these behaviors is irrevocably determined the first time the
corresponding function is called to change a utf8-encoded string's case.
If a corresponding C<To-> function has been defined in the package that
makes that first call, the mapping defined by that function will be the
mapping used for the duration of the program's execution across all
packages and scopes. If no corresponding C<To-> function has been
defined in that package, the standard official mapping will be used for
all packages and scopes, and any corresponding C<To-> function anywhere
will be ignored. Threaded programs have similar behavior. If the
program's casing behavior has been decided at the time of a thread's
creation, the thread will inherit that behavior. But, if the behavior
hasn't been decided, the thread gets to decide for itself, and its
decision does not affect other threads nor its creator."
I have thought quite a bit about how to get this to work better, and
have come to the conclusion that it is so broken that it's not worth trying.
Instead, I propose to deprecate it. If you want to override case
changing, you can just override uc() and its cousins to do what you
want. The ToUpper() et al. mechanism is not sufficient to do
context-sensitive casing anyway, whereas the overridden uc() et al.
functions are. Examples for Turkish are given in perlunicode.pod.
Overriding uc() etc. also allows you to fix things so that the utf8ness
of the string doesn't matter.
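As an illustration of overriding uc() and lc() within a single package, here is a minimal sketch for the Turkish dotted/dotless i. It is not the perlunicode.pod example itself; the package name My::Turkish and the helpers shout() and whisper() are hypothetical, and the mapping covers only the two letters that differ from the default.

```perl
use strict;
use warnings;

{
    package My::Turkish;    # hypothetical package name

    # 'use subs' predeclares the names as if they had been imported,
    # which is what lets a user sub override a builtin in this package.
    use subs qw(uc lc);

    sub uc {
        my $s = @_ ? $_[0] : $_;
        $s =~ s/i/\x{130}/g;    # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
        return CORE::uc($s);    # everything else uses the builtin mapping
    }

    sub lc {
        my $s = @_ ? $_[0] : $_;
        $s =~ s/\x{130}/i/g;    # dotted capital I -> i
        $s =~ s/I/\x{131}/g;    # I -> LATIN SMALL LETTER DOTLESS I
        return CORE::lc($s);
    }

    # Bare uc()/lc() calls compiled in this package reach the subs above.
    sub shout   { return uc($_[0]) }
    sub whisper { return lc($_[0]) }
}

my $upper = My::Turkish::shout("istanbul");
my $lower = My::Turkish::whisper("ISTANBUL");
```

Because the substitutions run before the builtin is called, the result is the same whether or not the input string was utf8-encoded to begin with.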
But there is a bug in the uc() et al. mechanism which would have to be
fixed. Overriding these does not affect \U et al. (Similarly, an
overridden quotemeta() should be called by \Q.) I can work on a patch
for this.
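One way to check how a given perl behaves here is to count calls to the override. The sketch below is only a probe, not a statement about any particular release; the package My::Counting and the counter $calls are hypothetical names.

```perl
use strict;
use warnings;

our $calls = 0;    # incremented each time the override runs

{
    package My::Counting;    # hypothetical package name
    use subs 'uc';

    sub uc {
        $main::calls++;
        return CORE::uc(@_ ? $_[0] : $_);
    }

    sub via_function { my $s = shift; return uc($s) }     # explicit uc()
    sub via_escape   { my $s = shift; return "\U$s" }     # inlined \U
}

$calls = 0;
my $from_function  = My::Counting::via_function("abc");
my $function_calls = $calls;

$calls = 0;
my $from_escape  = My::Counting::via_escape("abc");
my $escape_calls = $calls;

print "uc() reached the override $function_calls time(s)\n";
print "\\U reached the override $escape_calls time(s)\n";
```

If the second count is zero, \U is bypassing the override on that perl, which is the behavior described above.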
If someone really did want to override casing outside the current
package, they could use the '*CORE::GLOBAL::uc = sub {...}' mechanism to do so.
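A minimal sketch of the global form follows, assuming the override is installed in a BEGIN block so that it is in place before the calling code is compiled. The package Some::Other and its shout() sub are hypothetical, and the Turkish-style mapping is just a stand-in for whatever behavior is wanted.

```perl
use strict;
use warnings;

BEGIN {
    no warnings 'once';
    # Installed at compile time; code compiled before this point
    # keeps using the builtin.
    *CORE::GLOBAL::uc = sub {
        my $s = @_ ? $_[0] : $_;
        $s =~ s/i/\x{130}/g;     # example mapping only
        return CORE::uc($s);     # CORE::uc is always the real builtin
    };
}

{
    package Some::Other;         # hypothetical, unrelated package
    sub shout { return uc($_[0]) }
}

my $result = Some::Other::shout("istanbul");
```

Since this changes uc() for every package compiled afterwards, it is considerably more intrusive than the per-package override.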
Currently, there is #ifdef'd-out code in uc() etc that, if enabled,
would use the existing casing tables compiled into Perl for the first
256 code points, utf8 or not. This code is not used because of the
possibility of user-defined case overrides. Once the ToUpper() etc.
mechanism is removed, this code could be enabled, speeding up casing for
Western languages, as it would not have to go out to the tables stored
on disk for casing all utf8 strings. I could also revise code that I've
already written, so that it would bypass this step if there is no
possibility that there is user-defined casing. But this is throwaway
work, so I would propose not doing it, and just waiting until 5.16 instead.
Comments?