
Re: Basic implementation of script runs is available

From: Karl Williamson
Date: December 16, 2017 07:17
Subject: Re: Basic implementation of script runs is available
Message ID: d254396f-97eb-e9fb-de9d-0492dafa2585@khwilliamson.com

A new version of the script run feature is now available at

https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-scriptrun

This uses the syntax proposed by Zefram.  I have barely tested it, and 
would appreciate people trying it out.

Documentation is in perlre, under "Script Runs".

It was actually not hard to implement, so far.  What's missing AFAIK is 
conversion to using the better Script_Extensions property.

Since I don't understand the regex compiler's optimizer, I would 
especially appreciate it if someone who does could tell me about any 
considerations I may have overlooked in regcomp.c.
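
For anyone who wants something concrete to try, here is a minimal sketch.
It assumes the construct is spelled (*script_run:...), as perlre documents
it for released perls (5.28 and later), and that the
experimental::script_run warnings category exists; the syntax on the branch
may differ.  is_script_run is a hypothetical helper name, not part of core.

```perl
use v5.28;
use warnings;
use utf8;                                   # the sample strings are UTF-8 literals
no warnings 'experimental::script_run';     # assumption: category exists (5.28+)

# Hypothetical helper: true if the entire string forms a single script run.
sub is_script_run {
    my ($text) = @_;
    return $text =~ /\A(*script_run:.*)\z/s ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';

# "paypal" is all Latin; the second string is Cyrillic except the final 'l'.
for my $s ("paypal", "раураl") {
    say "$s: ", is_script_run($s) ? "single script run" : "mixed scripts";
}
```

Anchoring with \A and \z forces the script run to cover the whole string,
which is the usual way to use it for spoof checking.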


On 11/07/2017 12:06 AM, Karl Williamson wrote:
> In case you want to play around with it.
> 
> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-scriptrun 
> 
> 
> This does not yet use the syntax that the final form will have, but it 
> was far easier to implement this way so that it can be tried out.
> 
> To use, you have to temporarily use \M as a zero-length assertion that 
> the most recent previous capturing group is a script run.  There is 
> another thread in this list discussing the real eventual syntax.
> 
> The feature will be marked as experimental for at least the first release.
> 
> Here is an example:
> 
> ./perl -Ilib -Dr -le 'use utf8; "раураl" =~ qr/^(.*)\M$/'
> 
> fails because the text is Cyrillic except for the final 'l'; whereas
> 
> ./perl -Ilib -Dr -le 'use utf8; "paypal" =~ qr/^(.*)\M$/'
> 
> succeeds because paypal is all ASCII Latin script.
> 
> Here are some details:
> 
> This temporarily uses the Unicode plain script property, rather than the 
> better Script Extension property.
> 
> The ASCII 0-9 digits are used all over the world.  Some scripts have 
> more than one set of 10 digits.  To cope with these realities, the 
> script run assertion passes only if all digits within it are in the same 
> set of 10 digits.
> 
> Otherwise, characters belonging to the Common script are accepted within 
> another script.  This allows things like colon, etc. to be mixed in. 
> This may have to be tweaked.  Emoji are in the Common script.  Do we 
> want them to be able to be mixed with Cyrillic or Hieroglyphics?  I 
> don't know.
> 
> It won't pass a sequence of code points that are unassigned (and hence 
> not in a script) except if the sequence is a single character.
> 
> It also currently doesn't accept an Inherited script character in the 
> first position.  That probably will change to allow sequences of 
> entirely inherited characters.  Inherited tends to be combining marks, 
> and they inherit the script of the character they combine with.  It 
> doesn't make much sense to have the first character in a run be such a 
> thing, as there is nothing there yet for it to modify.
> 
> There is some concern that this could go quadratic.  That is true, but 
> not on well-formed input.  An attacker could create a DoS if they could 
> feed text that they know will cause a lot of backtracking.  The 
> pattern's author needs to be aware of this possibility.
> 
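
The digit and Common-script rules quoted above can be exercised with a
small sketch like the following.  It assumes the (*script_run:...) spelling
documented in perlre for released perls (5.28 and later); the branch, which
still uses the plain Script property, may behave differently on some
inputs.  is_script_run is a hypothetical helper name.

```perl
use v5.28;
use warnings;
no warnings 'experimental::script_run';     # assumption: category exists (5.28+)

# Hypothetical helper: true if the entire string forms a single script run.
sub is_script_run {
    my ($text) = @_;
    return $text =~ /\A(*script_run:.*)\z/s ? 1 : 0;
}

my @samples = (
    [ "room 101",      "Latin letters, Common space, ASCII digits" ],
    [ "10:30",         "Common characters only" ],
    [ "1\x{0663}",     "ASCII 1 followed by ARABIC-INDIC DIGIT THREE" ],
);

for my $sample (@samples) {
    my ($text, $description) = @$sample;
    printf "%-45s %s\n", $description,
        is_script_run($text) ? "script run" : "not a script run";
}
```

The last sample mixes digits from two different sets of ten, which is the
case the digit rule is meant to reject.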
