A new version of the script run feature is now available at

https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-scriptrun

This uses the syntax proposed by Zefram. I have barely tested it, and
would appreciate people trying it out. Documentation is in perlre under
"Script Runs".

It was actually not hard to implement, so far. What's missing AFAIK is
conversion to using the better Script_Extensions property.

Since I don't understand the regex compiler optimizer, I would
especially appreciate it if someone who knows something about it could
tell me about considerations I may have overlooked in regcomp.c.

On 11/07/2017 12:06 AM, Karl Williamson wrote:
> In case you want to play around with it.
>
> https://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-scriptrun
>
>
> This does not have the syntax that the final form will have, but it was
> far easier to implement so that it can be tried out.
>
> To use, you have to temporarily use \M as a zero-length assertion that
> the most recent previous capturing group is a script run. There is
> another thread in this list discussing the real eventual syntax.
>
> The feature will be marked as experimental for at least the first release.
>
> Here is an example:
>
> ./perl -Ilib -Dr -le 'use utf8; "раураl" =~ qr/^(.*)\M$/'
>
> fails because the text is Cyrillic except for the final 'l'; whereas
>
> ./perl -Ilib -Dr -le 'use utf8; "paypal" =~ qr/^(.*)\M$/'
>
> succeeds because paypal is all ASCII Latin script.
>
> Here are some details:
>
> This temporarily uses the Unicode plain Script property, rather than the
> better Script_Extensions property.
>
> The ASCII 0-9 digits are used all over the world. Some scripts have
> more than one set of 10 digits. To cope with these realities, the
> script run assertion passes only if all digits within it are in the same
> set of 10 digits.
>
> Otherwise, characters belonging to the Common script are accepted within
> another script. This allows things like colon, etc.
> to be mixed in.
> This may have to be tweaked. Emoji are in the Common script. Do we
> want them to be able to be mixed with Cyrillic or Hieroglyphics? I
> don't know.
>
> It won't pass a sequence of code points that are unassigned (and hence
> not in a script) except if the sequence is a single character.
>
> It also currently doesn't accept an Inherited script character in the
> first position. That probably will change to allow sequences of
> entirely Inherited characters. Inherited characters tend to be combining
> marks, and they inherit the script of the character they combine with. It
> doesn't make much sense to have the first character in a run be such a
> thing, as there is nothing there yet for it to modify.
>
> There is some concern that this could go quadratic. That is true, but
> not on well-formed input. An attacker could create a DoS if they could
> feed text that they know will cause a lot of backtracking. The
> pattern's author needs to be aware of this possibility.
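
For anyone who wants to reason about the rules quoted above without building the branch, here is a rough sketch of three of them: the same-digit-set rule, the acceptance of Common characters inside another script, and the rejection of an Inherited character in first position. It is written in Python for illustration only and is not the code in regexec.c. The names TOY_SCRIPTS, toy_script and is_script_run are made up for this sketch, and the script table is a small hand-written stand-in for the real Unicode Script property that covers only the characters used in the examples. The unassigned-code-point rule is left out.

```python
import unicodedata

# Hypothetical toy table standing in for the Unicode Script property.
# Only the ranges needed for the examples below are listed.
TOY_SCRIPTS = [
    (0x0041, 0x005A, "Latin"),      # A-Z
    (0x0061, 0x007A, "Latin"),      # a-z
    (0x0300, 0x036F, "Inherited"),  # combining diacritical marks
    (0x0410, 0x044F, "Cyrillic"),   # basic Cyrillic letters
    (0x0660, 0x0669, "Arabic"),     # Arabic-Indic digits
]


def toy_script(ch):
    """Return the toy script name for a single character."""
    cp = ord(ch)
    for lo, hi, name in TOY_SCRIPTS:
        if lo <= cp <= hi:
            return name
    if cp < 0x80:
        # Remaining ASCII (digits, punctuation, space) is Common.
        return "Common"
    raise ValueError(f"U+{cp:04X} is outside the toy table")


def is_script_run(text):
    """Sketch of the quoted rules; an assumption-laden toy, not Perl's code."""
    if not text:
        return True
    # An Inherited character has nothing to modify in first position.
    if toy_script(text[0]) == "Inherited":
        return False

    run_script = None   # first non-Common, non-Inherited script seen
    digit_zero = None   # code point of the zero of the first digit set seen
    for ch in text:
        # All decimal digits must come from the same set of ten.
        if unicodedata.category(ch) == "Nd":
            zero = ord(ch) - unicodedata.decimal(ch)
            if digit_zero is None:
                digit_zero = zero
            elif zero != digit_zero:
                return False

        script = toy_script(ch)
        # Common and Inherited characters mix with any script.
        if script in ("Common", "Inherited"):
            continue
        if run_script is None:
            run_script = script
        elif script != run_script:
            return False
    return True


if __name__ == "__main__":
    samples = [
        "paypal",                             # all Latin
        "\u0440\u0430\u0443\u0440\u0430l",    # Cyrillic, then Latin 'l'
        "a:b",                                # Common colon inside Latin
        "12\u0663",                           # ASCII and Arabic-Indic digits
        "\u0301a",                            # Inherited mark in first position
    ]
    for s in samples:
        print(ascii(s), is_script_run(s))
```

The two strings from the quoted example behave as described there: the all-Latin "paypal" is accepted, and the mostly Cyrillic look-alike ending in a Latin 'l' is rejected. A real implementation would consult the full Unicode tables (and eventually Script_Extensions) rather than a table like this one.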