Basic implementation of script runs is available

From: Karl Williamson
Date: November 7, 2017 07:07
Subject: Basic implementation of script runs is available
Message ID: 19d8472f-0e88-c9ea-6519-0e957e046dae@khwilliamson.com

A basic implementation of script runs is available on this branch, in
case you want to play around with it:

https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-scriptrun

This does not use the syntax that the final form will have, but this
version was far easier to implement, so it can be tried out now.

To use it, you temporarily have to write \M, which acts as a zero-length
assertion that the most recent previous capturing group is a script run.
Another thread on this list is discussing the real eventual syntax.

The feature will be marked as experimental for at least the first release.

Here is an example:

./perl -Ilib -Dr -le 'use utf8; "раураl" =~ qr/^(.*)\M$/'

fails because the text is Cyrillic except for the final 'l'; whereas

./perl -Ilib -Dr -le 'use utf8; "paypal" =~ qr/^(.*)\M$/'

succeeds because paypal is all ASCII Latin script.
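
The difference between the two strings can be seen without the branch by listing each character's Script property. This is only a sketch using the stock Unicode::UCD module: `char_scripts` is a hypothetical helper name, and the escaped spelling of the spoofed string is an assumption about which Cyrillic letters the example uses.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: the Script property value of each character in $str.
sub char_scripts {
    my ($str) = @_;
    return map { charscript(ord $_) } split //, $str;
}

# Assumed spelling of the spoofed string:
# Cyrillic er, a, u, er, a, followed by a Latin "l".
my $spoof = "\x{440}\x{430}\x{443}\x{440}\x{430}l";

print join(' ', char_scripts($spoof)),   "\n";
print join(' ', char_scripts('paypal')), "\n";
```

The first line printed mixes two script names, while the second contains only one, which is the distinction the \M assertion is testing.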

Here are some details:

For now this uses the plain Unicode Script property, rather than the
better Script_Extensions property.

The ASCII 0-9 digits are used all over the world.  Some scripts have 
more than one set of 10 digits.  To cope with these realities, the 
script run assertion passes only if all digits within it are in the same 
set of 10 digits.
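
The digit rule can be approximated in stock Perl as follows. This is a sketch, not the code in the branch: `digits_in_one_set` is a hypothetical name, and the approach assumes each set of ten decimal digits occupies ten consecutive code points, so that subtracting a digit's value from its code point identifies the set's zero.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Hypothetical helper: true if every decimal digit in $str comes from the
# same set of ten digits.  Non-digit characters are ignored.
sub digits_in_one_set {
    my ($str) = @_;
    my %zeros;
    for my $ch (split //, $str) {
        next unless $ch =~ /\A\d\z/;        # any Unicode decimal digit
        # Assumption: a set's digits are contiguous, so this is its zero.
        $zeros{ ord($ch) - num($ch) } = 1;
    }
    return keys(%zeros) <= 1;
}

print digits_in_one_set("123")         ? "same set\n" : "mixed sets\n";
print digits_in_one_set("12\x{663}")   ? "same set\n" : "mixed sets\n";
```

The second call mixes ASCII digits with ARABIC-INDIC DIGIT THREE (U+0663), so the helper reports more than one set.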

Otherwise, characters belonging to the Common script are accepted within
another script.  This allows punctuation such as a colon to be mixed in.
This may have to be tweaked.  Emoji are in the Common script.  Do we
want them to be able to be mixed with Cyrillic or Hieroglyphics?  I
don't know.

It won't pass a sequence of unassigned code points (which are hence not
in any script) unless the sequence is a single character.

It also currently doesn't accept an Inherited script character in the
first position.  That probably will change to allow sequences made up
entirely of Inherited characters.  Inherited characters tend to be
combining marks, which inherit the script of the character they combine
with.  It doesn't make much sense for the first character in a run to be
such a thing, as there is nothing there yet for it to modify.
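
The rules for Common, unassigned, and Inherited characters can be put together in a rough approximation. This is a sketch under stated assumptions and not the implementation in the branch: `looks_like_script_run` is a hypothetical name, the digit rule is left out, the empty string is arbitrarily treated as passing, and a lone Inherited character is assumed to fail because it is in the first position.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical approximation of the rules described above.
sub looks_like_script_run {
    my ($str) = @_;
    my @scripts = map { charscript(ord $_) } split //, $str;

    return 1 unless @scripts;                   # assumption: empty passes
    return 0 if $scripts[0] eq 'Inherited';     # nothing yet to inherit from
    return 1 if @scripts == 1;                  # a lone character passes

    my $run_script;
    for my $s (@scripts) {
        return 0 if $s eq 'Unknown';            # unassigned code point
        next if $s eq 'Common' || $s eq 'Inherited';
        $run_script //= $s;                     # first real script seen
        return 0 if $s ne $run_script;          # a second script: not a run
    }
    return 1;
}

for my $str ('paypal', 'a:b', "\x{301}a", "a\x{430}") {
    printf "%vX => %s\n", $str, looks_like_script_run($str) ? 'pass' : 'fail';
}
```

Unicode::UCD reports unassigned code points as "Unknown", which is how the sketch detects them.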

There is some concern that matching could take quadratic time.  That is
true, but not on well-formed input.  An attacker could mount a
denial-of-service (DoS) attack if they could feed in text that they know
will cause a lot of backtracking.  The pattern's author needs to be
aware of this possibility.
