
[perl #108164] Re: regex property extensions: \p{X-Confusable=A} from UTS#39

From: Karl Williamson
Date: November 23, 2012 16:04
Subject: [perl #108164] Re: regex property extensions: \p{X-Confusable=A} from UTS#39
Message ID: 50B00EAA.2030700@khwilliamson.com

On 01/13/2012 08:26 AM, Tom Christiansen wrote:
> Currently, there is no (reasonable) way for the user to implement
> properties like \p{X-Confusable=A} (that is, from UTS#39) on their own.
>
> I feel this is a bug; hence, this filing.
>
> Here are issues blocking the user-level implementation of such a scheme:
>
>   *  The super-annoying new restriction that all user-defined properties *must*
>      start with /^I[sn]/ for them to be paid any attention to.
>
>   *  There is no way to have "parameterized" \p{NAME=VALUE} user properties, even
>      when the NAME is an X-foo user name (let alone an X-VALUE user value for an
>      existing property.) Consider how X-Confusable=VALUE needs to be able to
>      take at a minimum, an arbitrary code point, and in fact probably an
>      arbitrary string, as its value.
>
>   *  Apropos locating user-defined properties, there may be concerns about which
>      package the pattern was compiled in versus which one it is executed in,
>      along with the related issue of serialization needed for qr// recompilation.
>
> Because it is not possible for the user to do this for himself, I
> necessarily request that it be fully implemented in the core for v5.18.
>
> Currently only user-defined binary properties are allowed, which is not good
> enough, because it's nuts to expect people to write a \p{Is_X-Confusable__A}
> binary property or similar ridiculousness.  Even worse, you'd have to have a
> special function for *EVERY POSSIBLE UNICODE CODE POINT*, and you could never
> do full strings.  You surely do not want a hundred thousand things in the
> symbol table -- or a million -- nor do you want a hundred thousand little
> "XConfus" *.pl files, either.
>
> Yes, that's asking a great deal, but we are given no choice: currently only
> the core can do this because of these bugs related to user properties.
>
> Therefore a perfectly reasonable alternative to implementing it in the core
> is *TO MAKE IT POSSIBLE* for a user to implement it as a module outside the
> core.  I would actually prefer that solution.  But right now, bugs get in
> the way, so an in-core implementation tracking UTS#39 is the only way to do
> this under current technology.
>
> See http://stackoverflow.com/a/8841591/471272 for elaboration of the
> "confusable" issue and proposed property, including how this relates
> to UTS#39.
>
> --tom
>

I have come to the following conclusions (open to debate) after having
thought about this issue for some time:

From UTS39:

"To see whether two strings X and Y are confusable according to a given 
table (abbreviated as X ≅ Y), an implementation uses a transform of X 
called a skeleton(X) defined by:

     1. Converting X to NFD format, as described in [UAX15].
     2. Successively mapping each source character in X to the target
        string according to the specified data table.
     3. Reapplying NFD.

The resulting strings skeleton(X) and skeleton(Y) are then compared. If 
they are identical (codepoint-for-codepoint), then X ≅ Y according to 
the table."

It seems to me that this should be implemented first through a CPAN
module rather than through a regular expression extension.

Effectively, what is being asked for by UTS39 is applying a new kind of 
normalization form (Normalization Form Confusable) or a new kind of 
casefold (Confusable case).

A module could be easily written that takes the data files from UTS39 
and creates a function, call it skeleton(), that implements their 
transform, much like Brian Fraser added a module that implements 
foldcase, fc().  Or there could be multiple functions to handle the
single- vs. multiple-script confusables, etc.
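
As a rough sketch of what such a function might look like, assuming the
three-step transform quoted above: the small %CONFUSABLE hash below is a
hand-picked stand-in for the real UTS39 data table, and confusable() is a
hypothetical helper name, not anything UTS39 or an existing module defines.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Illustrative stand-in for the UTS39 data table.  A real module would
# build this hash from the data files that accompany UTS39.
my %CONFUSABLE = (
    "\x{0430}" => "a",    # CYRILLIC SMALL LETTER A
    "\x{0435}" => "e",    # CYRILLIC SMALL LETTER IE
    "\x{043E}" => "o",    # CYRILLIC SMALL LETTER O
    "\x{0440}" => "p",    # CYRILLIC SMALL LETTER ER
    "\x{0441}" => "c",    # CYRILLIC SMALL LETTER ES
);

# skeleton(X): apply NFD, map each character through the table
# (unmapped characters pass through unchanged), then reapply NFD.
sub skeleton {
    my ($str) = @_;
    my $mapped = '';
    for my $char (split //, NFD($str)) {
        $mapped .= exists $CONFUSABLE{$char} ? $CONFUSABLE{$char} : $char;
    }
    return NFD($mapped);
}

# Hypothetical helper: X and Y are confusable under this table if and
# only if their skeletons are identical, code point for code point.
sub confusable {
    my ($x, $y) = @_;
    return skeleton($x) eq skeleton($y);
}
```

With this toy table, confusable("\x{0441}\x{043E}\x{0440}\x{0435}", "cope")
is true, because the four Cyrillic letters map to the Latin "cope".  Note
that the comparison happens on ordinary strings after the transform, so
nothing here needs the regex engine.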

This would be far easier than extending the regex engine to do this, and
I think would be better.  The Perl regex engine has never dealt with
normalization, and getting normalization right is critical in this
application.  It is likely that the regex engine will never deal with
it, for reasons of which Tom is well aware.  Those same reasons caused
Unicode to admit that their previous exhortations to do so cannot
succeed, and to retract those portions of their regex document, UTS 18.

After gaining experience with the CPAN module, if there were some subset 
that made sense to implement in the regex engine, we could do so at that 
time.

I believe that the mixed-script detection parts of UTS39 are something 
that warrants a regular expression extension, and I plan to eventually 
implement something to do this.



