perl.perl5.porters | Postings from September 2014

Re: RFC: implementing script runs

From: Leon Timmermans
Date: September 26, 2014 15:07
Subject: Re: RFC: implementing script runs
Message ID: CAHhgV8htv3bwDjDYbXcaY0Dbhagfo-_-5U=G0GuMomDvmVTrAw@mail.gmail.com

On Thu, Sep 25, 2014 at 7:17 AM, Karl Williamson <public@khwilliamson.com>
wrote:

> Unicode defines a "script run" to be contiguous characters from the same
> script, like all Latin or all Greek.
>
> These can be important for security.  See
> http://www.unicode.org/reports/tr36/
>
> It seems to me that Perl should offer an easy way to specify that a regex
> pattern element should match only a script run.  I'm proposing the only
> currently illegal syntax I'm aware of that is easy to type; other
> suggestions welcome.
>
> The idea I had is to have an extra '*' following the quantifier mean to
> use a script run.  For example, qr/\w+*/ would match all the consecutive
> word characters that are in the same script as the first one found.
>
> In the case of digits, not only should they be from the same script but
> from the same group of consecutive 10 digits.  There are a couple of cases
> where a script has multiple ways of specifying the 10 digits. Arabic, for
> example, has two different sets of digits.  I learned recently that one set
> is used by the Sunnis and the other by the Shiites.
>
> The Common script in Unicode is used to mean that the character is in
> wide-spread use across many scripts.  0-9 are in the Common script, as is
> most punctuation and symbols.  The Inherited script is used for characters
> that don't stand on their own, but modify other characters. This includes
> the combining accents and the like.  I think that a script run should
> include not only the script of the first character in it, but also any
> contiguous Inherited characters.  I'm leaning toward including contiguous
> Common ones as well, but am less certain.
>

I agree this feature is desirable, but I disagree that this is a desirable
syntax. For a moment I thought about a \i+ that means "multiple characters
of the same script", but then I realized it may be more sensible to have a
\w\j*, where \j means "character of the previous alphabet". I suppose this
allows for various variations wrt Inherited and Common characters too.
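
To make the \w\j* idea concrete, here is a minimal sketch that emulates it in
plain Perl, without any new regex syntax. It is only an illustration under
stated assumptions: \j does not exist, the helper names leading_script_run and
match_word_run are hypothetical, and the policy of letting Common and
Inherited characters join any run is just one of the variations discussed
above. Script lookup uses charscript() from the core Unicode::UCD module.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: return the leading script run of $str.
# The run takes its script from the first character that is neither
# Common nor Inherited. Common and Inherited characters are accepted
# anywhere in the run (an assumed policy, not a settled one).
sub leading_script_run {
    my ($str) = @_;
    my $run_script;
    my $len = 0;
    for my $ch (split //, $str) {
        my $script = charscript(ord $ch) // 'Unknown';
        if ($script ne 'Common' && $script ne 'Inherited') {
            $run_script //= $script;              # first "real" script wins
            last if $script ne $run_script;       # script changed: run ends
        }
        $len++;
    }
    return substr($str, 0, $len);
}

# Hypothetical emulation of \w\j*: find the first stretch of word
# characters, then keep only its leading script run.
sub match_word_run {
    my ($str) = @_;
    return unless $str =~ /(\w+)/;
    return leading_script_run($1);
}

# "payp" in Latin followed by CYRILLIC SMALL LETTER A (U+0430) and "l".
print match_word_run("payp\x{430}l"), "\n";   # prints "payp"
```

A plain \w+ would match the whole mixed-script string here; the run stops at
the first character whose script differs from the one that started it.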

Leon
