
Re: RFC: implementing script runs

Leon Timmermans
September 26, 2014 15:07
On Thu, Sep 25, 2014 at 7:17 AM, Karl Williamson wrote:

> Unicode defines a "script run" to be contiguous characters from the same
> script, like all Latin or all Greek.
> These can be important for security.  See
> It seems to me that Perl should offer an easy way to specify that a regex
> pattern element should match only a script run.  I'm proposing the only
> currently illegal syntax I'm aware of that is easy to type; other
> suggestions welcome.
> The idea I had is to have an extra '*' following the quantifier mean to
> use a script run.  For example, qr/\w+*/ would match all the consecutive
> word characters that are in the same script as the first one found.
> In the case of digits, not only should they be from the same script but
> from the same group of consecutive 10 digits.  There are a couple of cases
> where a script has multiple ways of specifying the 10 digits. Arabic, for
> example, has two different sets of digits.  I learned recently that one set
> is used by the Sunnis and the other by the Shiites.
> The Common script in Unicode is used to mean that the character is in
> wide-spread use across many scripts.  0-9 are in the Common script, as is
> most punctuation and symbols.  The Inherited script is used for characters
> that don't stand on their own, but modify other characters. This includes
> the combining accents and the like.  I think that a script run should
> include not only the script of the first character in it, but also any
> contiguous Inherited characters.  I'm leaning to including contiguous
> Common ones as well, but am less certain.

I agree this feature is desirable, but I disagree that this is a desirable
syntax. For a moment I thought about a \i+ that means "multiple characters of
the same script", but then I realized it may be more sensible to have a \w\j*,
where \j means "character of the previous alphabet". I suppose this
allows for various variations wrt Inherited and Common characters too.
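
To make the semantics under discussion concrete, here is a rough sketch in
plain Perl of what a "\w followed by characters of the same script" match
could do. Neither \w+* nor \j exists; the helper name leading_script_run and
its $allow_common flag are hypothetical and only illustrate the idea. The one
real API used is charscript() from the core Unicode::UCD module. Inherited
characters always continue the run, and Common characters do so only on
request, mirroring the open question in the quoted proposal.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper, not part of Perl: return the leading run of word
# characters in $str that all belong to one script. Inherited characters
# (combining marks and the like) never end the run. Common characters
# are tolerated only when $allow_common is true.
sub leading_script_run {
    my ($str, $allow_common) = @_;
    my ($run, $script) = ('', undef);
    for my $ch (split //, $str) {
        last unless $ch =~ /\w/;
        my $s = charscript(ord $ch) // 'Unknown';
        if ($s eq 'Inherited' || ($allow_common && $s eq 'Common')) {
            # Script-neutral character: keep it without fixing the script.
        }
        elsif (!defined $script) {
            $script = $s;    # first script-specific character sets the script
        }
        elsif ($s ne $script) {
            last;            # script changed, so the run ends here
        }
        $run .= $ch;
    }
    return $run;
}

# "payp" + CYRILLIC SMALL LETTER A (U+0430) + "l": the run stops at the
# Cyrillic letter, so only the Latin prefix "payp" is returned.
my $prefix = leading_script_run("payp\x{430}l");
```

With $allow_common false, "abc123" yields only "abc", because the ASCII digits
are in the Common script; passing a true second argument lets them join the
run. A regex-level feature would presumably make that choice once, which is
exactly the part of the proposal that is still undecided.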

