RFC: adding hooks for overriding qr//i
From: Karl Williamson
Date: July 30, 2014 19:21
Subject: RFC: adding hooks for overriding qr//i
Message ID: 53D945BA.8020203@khwilliamson.com
The Unicode::Casing module on CPAN allows someone to override the
default case-changing behavior of Perl, which follows Unicode. For
example, the default Unicode rules are inadequate for changing the case
of Turkish and Lithuanian text, and so this module can be used for
that. There are 4 cases: lower, upper, title, and fold. The last is
used for case-insensitive pattern matching. The module works by
redirecting calls to the lc, uc, ucfirst (titlecase), and fc Perl ops
to ones it generates based on parameters passed to it by the user,
using the hooks to the core introduced by Zefram.
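
To make the Turkish case concrete, here is a minimal sketch using only
the core ops, with no Unicode::Casing loaded. It shows the default,
language-independent mappings that such an override would replace. The
variable names are illustrative only.

```perl
use strict;
use warnings;
use feature 'unicode_strings';

# Default Unicode mappings used by the core ops, regardless of language.
my $upper_i = uc("i");   # gives "I"; Turkish expects U+0130 (capital I with dot)
my $lower_I = lc("I");   # gives "i"; Turkish expects U+0131 (dotless small i)

printf "uc('i') = U+%04X\n", ord $upper_i;
printf "lc('I') = U+%04X\n", ord $lower_I;
```

A Turkish-aware override would have to return U+0130 and U+0131
respectively for these two calls.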
However, it does not override /i regex matching, as there are no hooks
available to do that. I propose adding hooks for this. I anticipate
doing this by having the regex compiler look at the hints hash.
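
For readers unfamiliar with the hints hash, the following is a rough
sketch of how a lexically scoped pragma records a setting in %^H and
how code can later ask whether it was in effect, along the lines of
what perlpragma describes. The package name MyFold and the key
'MyFold/enabled' are hypothetical; this is not the proposed hook and
not how Unicode::Casing is implemented.

```perl
use strict;
use warnings;

package MyFold;    # hypothetical pragma name, for illustration only

# Runs at compile time (via "use MyFold" or a BEGIN block) and records
# a flag in the lexically scoped hints hash.
sub import   { $^H{'MyFold/enabled'} = 1 }
sub unimport { $^H{'MyFold/enabled'} = 0 }

# Reports whether the flag was set where the calling statement was compiled.
sub in_effect {
    my $hints = (caller(0))[10];
    return $hints && $hints->{'MyFold/enabled'} ? 1 : 0;
}

package main;

my ($inside, $outside);
{
    BEGIN { MyFold->import }          # same effect as "use MyFold;" here
    $inside = MyFold::in_effect();
}
$outside = MyFold::in_effect();       # the flag does not leak out of the block

print "inside: $inside, outside: $outside\n";
```

Note that values stored in %^H are limited to simple scalars, so a real
design would presumably store some identifier for the override rather
than the override code itself.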
I haven't worked out the best way to do this, but my current guess is
that it would be too hard to allow multi-character folds in the
overriding definitions, because the current core implementation does
not really lend itself to customizing these.
Doing this would allow true (except for new multi-char folds) overriding
of casing rules, without the current /i exception.
Another example of a potential use of this is the controversial nature
of Unicode having the German sharp s matching ss under /i. One could
use U::C to override this, or someone could write a different CPAN
module to use the same hooks if they don't like U::C, or have somewhat
different needs.
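
As a small illustration of the sharp s behavior in question, the
sketch below uses only core Perl (5.14 or later for the /u modifier).
The variable names are made up for the example.

```perl
use strict;
use warnings;

# /u forces Unicode rules, under which U+00DF (sharp s) folds to "ss".
my $pattern_sharp_s = ("STRASSE"    =~ /\Astra\xDFe\z/iu) ? 1 : 0;
my $string_sharp_s  = ("stra\xDFe"  =~ /\Astrasse\z/iu)   ? 1 : 0;

print "sharp s in pattern matches SS in string: $pattern_sharp_s\n";
print "ss in pattern matches sharp s in string: $string_sharp_s\n";
```

This is a multi-character fold: one character in the pattern or string
corresponds to two in the other, which is the kind of mapping that
would be hard to make customizable.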
Most things now in regex patterns are specified at compile time and are
immutable if the pattern is executed in a different context or
interpolated into another regex in another place. For example, a
pattern compiled under 'use feature "unicode_strings"' will match the same
regardless of whether the pattern match is executed under such scope or
not. It would be easier if that were not the case here. That is, the
override would be in effect only if the compilation and execution were
both done within the lexical scope of the pragma. The only way to get
around this that I can think of is to attach the overrides to the regex
so that it would use them at execution time. To do that properly so
that two regexes with different overrides could be combined requires
that the overrides get stringified along with the rest of the regex.
That means, I think, they must become specifiable as infix regex
modifiers. And that means that we would have to allow for long regex
modifiers. That's another discussion that I think we should soon have,
but I wanted to put this out there first.
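
The existing behavior described above can be seen with a short sketch
using only core Perl (5.14 or later). It shows that the rules chosen at
compile time are carried in the stringified form of the regex, which is
why overrides would also need a modifier spelling in order to survive
interpolation. Variable names are illustrative only.

```perl
use strict;
use warnings;

my $re;
{
    use feature 'unicode_strings';
    $re = qr/\w/;                  # compiled with Unicode (/u) rules
}

# From here on we are outside the scope of the feature.
my $e_acute = "\xE9";              # native (non-UTF-8) string holding U+00E9

my $precompiled  = ($e_acute =~ $re)        ? 1 : 0;  # rules fixed at compile time
my $plain        = ($e_acute =~ /\w/)       ? 1 : 0;  # default rules apply here
my $interpolated = ($e_acute =~ /\A$re\z/)  ? 1 : 0;  # modifiers travel with $re

print "stringified: $re\n";
print "precompiled=$precompiled plain=$plain interpolated=$interpolated\n";
```

The stringified form carries the u modifier inline, so the interpolated
pattern keeps matching under Unicode rules even though the enclosing
pattern was compiled without them.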
I have an idea for similar functionality that would allow
user-specified overrides of the properties of Private-Use Unicode code
points.
Currently, Perl allows one to specify a name for them, but that is it.
It would be nice if one could say that a given code point is \w, etc.
Again this should be in a different discussion thread, but I mention it
here, because to implement it properly would again require long regex
modifiers, and so that is another bit of functionality that would
benefit from those.
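
As a sketch of what is possible today, the following uses the custom
alias facility of the core charnames pragma (Perl 5.14 or later) to
name a Private-Use code point. The alias "myglyph" and the choice of
U+E000 are arbitrary, made up for this example.

```perl
use strict;
use warnings;
use charnames ":full", ":alias" => { myglyph => 0xE000 };

my $char    = "\N{myglyph}";                 # resolved at compile time
my $is_word = ($char =~ /\w/) ? 1 : 0;       # Private-Use code points are not \w

printf "code point: U+%04X, matches \\w: %d\n", ord($char), $is_word;
```

The name can be assigned, but there is currently no way to say that
this code point should match \w or any other property.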