
RFC: implementing script runs

From: Karl Williamson
Date: September 25, 2014 05:17
Subject: RFC: implementing script runs
Message ID: 5423A557.4030300@khwilliamson.com

Unicode defines a "script run" to be contiguous characters from the same 
script, like all Latin or all Greek.

These can be important for security, because mixing look-alike
characters from different scripts is a common spoofing technique.  See
http://www.unicode.org/reports/tr36/

It seems to me that Perl should offer an easy way to specify that a
regex pattern element should match only a script run.  I'm proposing the
only currently illegal syntax I'm aware of that is easy to type; other
suggestions are welcome.

The idea I had is to have an extra '*' following a quantifier mean that
the match must be a script run.  For example, qr/\w+*/ would match all
the consecutive word characters that are in the same script as the first
one found.
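
To make the intended behavior concrete, here is a rough sketch of how
the proposed qr/\w+*/ could be approximated with syntax that already
exists.  The helper name first_word_run and the fixed list of three
scripts are assumptions made for illustration; the proposal itself would
cover every script, and this sketch does not handle Common or Inherited
characters (so a run cannot start with 0-9 here).

```perl
use strict;
use warnings;

# Illustrative only: a fixed, hand-picked list of scripts.
my @scripts = qw(Latin Greek Cyrillic);

# For each script, match one or more word characters that are all in
# that script.  A character has exactly one Script value, so the first
# character decides which alternative can match.
my $alternation =
    join '|', map { "(?:(?=\\p{Script=$_})\\w)+" } @scripts;
my $word_run = qr/(?:$alternation)/;

# Hypothetical helper: return the first single-script run of word
# characters in $text, or undef if there is none.
sub first_word_run {
    my ($text) = @_;
    return $text =~ /($word_run)/ ? $1 : undef;
}

# Latin "abc" followed by Greek alpha, beta: the run stops at the
# script boundary.
my $latin = first_word_run("abc\x{3b1}\x{3b2}");

# A Latin "p" followed by Cyrillic "a" (U+0430): only "p" is returned.
my $spoof = first_word_run("p\x{430}ypal");
```

A plain \w+ would have matched the whole mixed-script string in both
cases, which is exactly what the extra '*' is meant to prevent.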

In the case of digits, not only should they be from the same script,
they should also be from the same group of 10 consecutive digits.  There
are a couple of cases where a script has multiple ways of specifying the
10 digits.  Arabic, for example, has two different sets of digits
(U+0660..U+0669 and U+06F0..U+06F9).  I learned recently that one set is
used by the Sunnis and the other by the Shiites.
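
The digit rule can be sketched as a standalone check.  The helper name
same_digit_set is made up for this example, and the code assumes that
each set of ten decimal digits occupies consecutive code points in
ascending order, so that subtracting a digit's value from its code point
yields the code point of that set's zero.  It uses num() from
Unicode::UCD on one character at a time.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Hypothetical helper: true if $digits consists only of decimal digits
# and all of them come from the same set of ten.
sub same_digit_set {
    my ($digits) = @_;
    return 0 unless $digits =~ /\A\d+\z/;

    my %zeros;
    for my $ch (split //, $digits) {
        # Code point of the '0' belonging to this digit's set, under
        # the contiguity assumption stated above.
        $zeros{ ord($ch) - num($ch) } = 1;
    }
    return keys(%zeros) == 1 ? 1 : 0;
}

# Both digits from U+0660..U+0669: accepted.
my $ok = same_digit_set("\x{661}\x{662}");

# One digit from each of the two Arabic sets: rejected, even though
# both are in the Arabic script.
my $mixed = same_digit_set("\x{661}\x{6F2}");
```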

The Common script in Unicode is used to mean that the character is in
widespread use across many scripts.  0-9 are in the Common script, as
are most punctuation and symbols.  The Inherited script is used for
characters that don't stand on their own, but modify other characters.
This includes the combining accents and the like.  I think that a script
run should include not only characters in the script of its first
character, but also any contiguous Inherited characters.  I'm leaning
toward including contiguous Common ones as well, but am less certain.
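
The two candidate policies can be compared side by side with a small
sketch.  The names script_run_pattern, first_run and allow_common are
invented for this example and are not part of the proposal.  The pattern
requires the run to start with a character of the named script, then
accepts Inherited characters and, optionally, Common ones.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern for a run that starts in
# $script and continues through $script and Inherited characters,
# plus Common characters when allow_common is set.
sub script_run_pattern {
    my ($script, %opt) = @_;
    my $first = "\\p{Script=$script}";
    my $rest  = "[\\p{Script=$script}\\p{Script=Inherited}"
              . ($opt{allow_common} ? "\\p{Script=Common}" : "")
              . "]";
    return qr/$first$rest*/;
}

# Hypothetical helper: return the first match of $re in $text.
sub first_run {
    my ($re, $text) = @_;
    return $text =~ /($re)/ ? $1 : undef;
}

my $strict = script_run_pattern('Latin');
my $loose  = script_run_pattern('Latin', allow_common => 1);

# "cafe" + combining acute accent (U+0301, Inherited) + Greek alpha:
# the accent stays in the run under either policy.
my $accented = first_run($strict, "cafe\x{301}\x{3b1}");

# "abc-1" + Greek alpha: '-' and '1' are Common, so the run is "abc"
# under the strict policy and "abc-1" when Common is allowed.
my $without_common = first_run($strict, "abc-1\x{3b1}");
my $with_common    = first_run($loose,  "abc-1\x{3b1}");
```

Allowing Common lets a run extend across digits and punctuation, which
is convenient but also widens what a single run can contain; that is the
trade-off behind the uncertainty expressed above.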


