
RFC: adding hooks for overriding qr//i

From:
Karl Williamson
Date:
July 30, 2014 19:21
Subject:
RFC: adding hooks for overriding qr//i
Message ID:
53D945BA.8020203@khwilliamson.com
The Unicode::Casing module on CPAN allows someone to override the 
default case-changing behavior of Perl, which follows Unicode.  For 
example, Unicode is inadequate for case changing Turkish and Lithuanian, 
and so this module can be used for that.  There are 4 cases: lower, 
upper, title, and fold.  The last is used for case-insensitive pattern 
matching.  The module works by redirecting calls to the lc, uc, ucfirst 
(title case), and fc Perl ops to ones it generates based on parameters 
passed to it by the user, using the hooks to the core introduced by Zefram.
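
As a rough illustration of the kind of functions a user might hand to 
the module, here is a minimal sketch of Turkish-style lower and upper 
casing in plain Perl.  The names turkish_lc and turkish_uc are made up 
for this example, and the 'use Unicode::Casing' line appears only as a 
comment, on the assumption that the module is installed and takes code 
references keyed by op name.

```perl
use strict;
use warnings;

# Hypothetical helper: Turkish lowercasing.  Dotted capital I (U+0130)
# becomes plain "i"; plain "I" becomes dotless i (U+0131).
sub turkish_lc {
    my $s = shift;
    $s =~ s/\x{130}/i/g;
    $s =~ s/I/\x{131}/g;
    return CORE::lc($s);    # everything else follows the default rules
}

# Hypothetical helper: Turkish uppercasing.  Plain "i" becomes dotted
# capital I (U+0130).
sub turkish_uc {
    my $s = shift;
    $s =~ s/i/\x{130}/g;
    return CORE::uc($s);    # default rules already map U+0131 to "I"
}

# Assumed usage with the CPAN module (not run here):
#   use Unicode::Casing lc => \&turkish_lc, uc => \&turkish_uc;

binmode STDOUT, ':encoding(UTF-8)';
print turkish_lc("ISPARTA"), "\n";
print turkish_uc("izmir"), "\n";
```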

However, it does not override /i regex matching, as there are no hooks 
available to do that.  I propose adding hooks for this.  I anticipate 
doing this by having the regex compiler look at the hints hash.
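
For readers less familiar with the hints hash, the following sketch 
shows how a lexically scoped setting can be stored in %^H at compile 
time and read back later.  It only illustrates the Perl-level 
mechanism; the pragma name "myfold", the hint key, and the helper sub 
are all hypothetical, and nothing here touches the regex compiler.

```perl
use strict;
use warnings;

# Hypothetical pragma, defined inline so the example is self-contained.
BEGIN {
    package myfold;
    sub import   { $^H{'myfold/locale'} = $_[1] // 'default' }
    sub unimport { delete $^H{'myfold/locale'} }
    $INC{'myfold.pm'} = 1;    # pretend the file has already been loaded
}

# Report which override (if any) was in scope where the caller was compiled.
sub fold_locale_in_scope {
    my $hints = (caller(0))[10];    # the caller's compile-time hints hash
    return $hints ? $hints->{'myfold/locale'} : undef;
}

my @seen;
{
    use myfold 'tr';
    push @seen, fold_locale_in_scope() // 'none';
}
push @seen, fold_locale_in_scope() // 'none';

print "@seen\n";    # tr none
```

The setting is visible only to code compiled inside the block, which is 
the scoping a regex-compiler hook would presumably inherit.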

I haven't worked out what I think the best way to do this is, but my 
guess now is that it would be too hard to allow multi-character folds in 
the overriding definitions.  This is because the current core 
implementation does not really lend itself to customizing these.

Doing this would allow true (except for new multi-char folds) overriding 
of casing rules, without the current /i exception.

Another example of a potential use of this is the controversial nature 
of Unicode having the German sharp s matching ss under /i.  One could 
use U::C to override this, or someone could write a different CPAN 
module to use the same hooks if they don't like U::C, or have somewhat 
different needs.
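
The sharp s behavior can be seen with core Perl alone.  The sketch 
below assumes Perl 5.16 or later (for fc); the helper 
same_ignoring_simple_case is hypothetical and merely stands in for what 
an override might achieve.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);

my $word = "stra\x{DF}e";    # U+00DF is LATIN SMALL LETTER SHARP S

# Default behavior: the full case fold of U+00DF is "ss", so /i matches.
my $default_match = $word =~ /\Astrasse\z/i ? 1 : 0;

# Hypothetical stricter comparison: lc() leaves U+00DF alone, so sharp s
# stays distinct from "ss".
sub same_ignoring_simple_case {
    my ($x, $y) = @_;
    return lc($x) eq lc($y) ? 1 : 0;
}

print "fold of word:   ", fc($word), "\n";                 # strasse
print "default /i:     $default_match\n";                  # 1
print "stricter check: ",
      same_ignoring_simple_case($word, "strasse"), "\n";   # 0
```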

Most things now in regex patterns are specified at compile time and are 
immutable if the pattern is executed in a different context or 
interpolated into another regex in another place.  For example, a 
pattern compiled under 'use feature "unicode_strings"' will match the same 
regardless of whether the pattern match is executed under such scope or 
not.  It would be easier if that were not the case here.  That is, the 
override would be in effect only if the compilation and execution were 
both done within the lexical scope of the pragma.  The only way to get 
around this that I can think of is to attach the overrides to the regex 
so that it would use them at execution time.  To do that properly so 
that two regexes with different overrides could be combined requires 
that the overrides get stringified along with the rest of the regex. 
That means, I think, they must become specifiable as infix regex 
modifiers.  And that means that we would have to allow for long regex 
modifiers.  That's another discussion that I think we should soon have, 
but I wanted to put this out there first.
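
To illustrate the point about stringification: the existing 
single-letter modifiers already travel with a compiled pattern as an 
inline (?^...) group, which is what lets fragments compiled with 
different flags be combined.  A minimal sketch using only core Perl 
(5.14 or later assumed); no override modifier exists today, so only the 
current /i and /u flags are shown.

```perl
use strict;
use warnings;

my $ci = qr/perl/i;       # case-insensitive fragment
my $cs = qr/Porters/;     # case-sensitive fragment

my $u;
{
    use feature 'unicode_strings';
    $u = qr/\w+/;         # compiled with Unicode rules
}

print "$ci\n";            # (?^i:perl)
print "$u\n";             # (?^u:\w+)

# Each fragment keeps its own flags after interpolation.
my $both = qr/$ci $cs/;
print "$both\n";          # (?^:(?^i:perl) (?^:Porters))

print "PERL Porters" =~ /\A$both\z/ ? "match\n" : "no match\n";   # match
print "perl porters" =~ /\A$both\z/ ? "match\n" : "no match\n";   # no match
```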

I have an idea for similar functionality that would allow 
user-specifiable overrides of the properties of Private-Use Unicode code 
points. 
Currently, Perl allows one to specify a name for them, but that is it. 
It would be nice if one could say that a given code point is \w, etc. 
Again this should be in a different discussion thread, but I mention it 
here, because to implement it properly would again require long regex 
modifiers, and so that is another bit of functionality that would 
benefit from those.
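
For reference, this is roughly what the current state looks like: a 
Private-Use code point can be given a name through charnames, but its 
properties stay fixed.  The alias name pua_letter and the choice of 
U+E000 are arbitrary and purely illustrative.

```perl
use strict;
use warnings;
use charnames ":full", ":alias" => {
    pua_letter => 0xE000,    # hypothetical name for a Private-Use code point
};

my $pua = "\N{pua_letter}";

printf "code point: U+%04X\n", ord($pua);                       # U+E000
print  "matches \\w:     ", ($pua =~ /\w/      ? 1 : 0), "\n";  # 0
print  "matches \\p{Co}: ", ($pua =~ /\p{Co}/  ? 1 : 0), "\n";  # 1
```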


