perl.perl5.porters | Postings from September 2014

Re: RFC: implementing script runs

From: Abigail
Date: September 25, 2014 09:17
Subject: Re: RFC: implementing script runs
Message ID: 20140925091705.GB15111@almanda.fritz.box

On Wed, Sep 24, 2014 at 11:17:11PM -0600, Karl Williamson wrote:
> Unicode defines a "script run" to be contiguous characters from the same  
> script, like all Latin or all Greek.
>
> These can be important for security.  See
> http://www.unicode.org/reports/tr36/
>
> It seems to me that Perl should offer an easy way to specify that a  
> regex pattern element should match only a script run.  I'm proposing the  
> only currently illegal syntax I'm aware of that is easy to type;  
> other suggestions welcome.
>
> The idea I had is to have an extra '*' following the quantifier mean to  
> use a script run.  For example, qr/\w+*/ would match all the consecutive  
> word characters that are in the same script as the first one found.

I think that's a great idea, and it will make \d+ and \w+ slightly less
useless.
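
To make the proposed semantics concrete, here is a rough emulation in
today's Perl of what /\w+*/ would return. It is only a sketch: the helper
name first_word_script_run is made up for illustration, and it ignores
the Common/Inherited question discussed further down. It relies on
charscript() from Unicode::UCD to look up the script of the first word
character.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper, not part of the proposal: return the first run of
# word characters that share the script of the first word character found.
sub first_word_script_run {
    my ($text) = @_;
    return unless $text =~ /(\w)/;
    my $start  = $-[1];                 # offset of the first word character
    my $script = charscript(ord $1);    # e.g. "Latin" or "Greek"

    # Keep consuming characters that are both \w and in that same script.
    my ($run) = substr($text, $start)
        =~ /\A((?:(?=\w)\p{Script=$script})+)/;
    return $run;
}

# Latin letters followed directly by GREEK SMALL LETTER ALPHA, BETA, GAMMA.
my $run = first_word_script_run("abc\x{3B1}\x{3B2}\x{3B3} def");
# $run is "abc": a plain /\w+/ would have run on into the Greek letters.
```

A plain /\w+/ on the same string matches all six letters, which is the
kind of mixed-script token the proposal wants to make easy to reject.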

> In the case of digits, not only should they be from the same script but  
> from the same group of consecutive 10 digits.  There are a couple of  
> cases where a script has multiple ways of specifying the 10 digits.  
> Arabic, for example, has two different sets of digits.  I learned  
> recently that one set is used by the Sunnis and the other by the Shiites.

I did not know that, but forcing digits to be from the same consecutive
group seems like a good idea.
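
The "same group of 10" rule can be approximated today as well. The sketch
below assumes that decimal digits sit in contiguous blocks of ten code
points, so subtracting a digit's numeric value from its code point gives
the code point of that block's zero. The helper name same_set_digits is
invented for this example; num() is the existing Unicode::UCD function.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Hypothetical helper: return the leading digits of $text that all come
# from the same block of ten consecutive digits as the first one.
sub same_set_digits {
    my ($text) = @_;
    my $run = '';
    my $zero;                               # code point of "0" in the first digit's set
    for my $ch (split //, $text) {
        last unless $ch =~ /\A\d\z/;
        my $this_zero = ord($ch) - num($ch);
        $zero = $this_zero unless defined $zero;
        last if $this_zero != $zero;        # digit from a different set: stop
        $run .= $ch;
    }
    return $run;
}

# U+0661, U+0662 are ARABIC-INDIC digits; U+06F3 is EXTENDED ARABIC-INDIC.
my $digits = same_set_digits("\x{661}\x{662}\x{6F3}");
# $digits holds only the first two characters.
```

Note that num() already refuses a multi-character string whose digits
come from different sets (it returns undef), so it can serve as an
after-the-fact check on whatever /\d+/ matched.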

> The Common script in Unicode is used to mean that the character is in  
> wide-spread use across many scripts.  0-9 are in the Common script, as  
> is most punctuation and symbols.  The Inherited script is used for  
> characters that don't stand on their own, but modify other characters.  
> This includes the combining accents and the like.  I think that a script  
> run should include not only the script of the first character in it, but  
> also any contiguous Inherited characters.  I'm leaning to including  
> contiguous Common ones as well, but am less certain.

I've no opinion about that.
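
For anyone who wants to see what is at stake, the following sketch looks
up the script of a few characters with charscript() and then builds a
Latin-only pattern that tolerates Inherited characters but not Common
ones. The variable names are just for illustration, and whether Common
should be let in is exactly the open question above.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

my %script_of = (
    'LATIN SMALL LETTER E'   => charscript(0x0065),   # Latin
    'DIGIT ZERO'             => charscript(0x0030),   # Common
    'COMMA'                  => charscript(0x002C),   # Common
    'COMBINING ACUTE ACCENT' => charscript(0x0301),   # Inherited
);

# A Latin run that also accepts contiguous Inherited characters.
my $latin_run = qr/\A(?:\p{Script=Latin}|\p{Script=Inherited})+\z/;

my $accented   = "cafe\x{301}" =~ $latin_run;   # true: the accent is Inherited
my $with_digit = "cafe1"       =~ $latin_run;   # false: "1" is Common
```

Leaving Inherited out would break a decomposed "é" in two, which is
presumably why including it is the less controversial half.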


What I do wonder is what the following are going to mean:

    /\D+*/
    /a+*/                                   # Same as /a+/ I presume
    /(\p{Script:Greek}\p{Script:Thai})+*/   # Never match?
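
My guess at the third case, spelled out with what exists today: the
ordinary pattern matches a Greek letter followed by a Thai one, but the
matched text can never pass a single-script check. The checker below is
a hypothetical stand-in for whatever the real implementation would do,
and it assumes Common and Inherited characters are compatible with any
script.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical check: true if all characters belong to at most one script,
# not counting Common and Inherited (an assumption, see above).
sub is_single_script {
    my ($text) = @_;
    my %seen;
    for my $ch (split //, $text) {
        my $script = charscript(ord $ch);
        next if $script eq 'Common' || $script eq 'Inherited';
        $seen{$script} = 1;
    }
    return keys(%seen) <= 1;
}

# GREEK SMALL LETTER ALPHA followed by THAI CHARACTER KO KAI.
my $mixed = "\x{3B1}\x{E01}";

my $plain_match = $mixed =~ /\A(\p{Script:Greek}\p{Script:Thai})+\z/;  # true
my $as_run      = $plain_match && is_single_script($mixed);            # false
```

By the same reasoning /a+*/ could only ever differ from /a+/ if "a" were
somehow in more than one script, which it is not.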

Can you combine that with the possessive or stingy (non-greedy) modifiers?
If so, one will finally be able to write:

    /\d+?*/ 

(or should that be written as /\d+*?/? Or either way?).
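
As far as I can tell, every one of these spellings is a compile-time
error at the moment, so the syntax is free either way. A quick sketch to
check that on your own perl (the patterns are compiled at run time so
the failure can be trapped; %error_for is just a name picked for this
example):

```perl
use strict;
use warnings;

my %error_for;
for my $pattern ('\d+*', '\d+?*', '\d+*?', '\d++*') {
    # qr// on an interpolated string compiles at run time, so eval can
    # catch the failure instead of the whole script refusing to compile.
    my $compiled = eval { qr/$pattern/ };
    $error_for{$pattern} = defined $compiled ? '' : $@;
}

for my $pattern (sort keys %error_for) {
    my $status = $error_for{$pattern} ? 'rejected' : 'compiles';
    print "$pattern: $status\n";
}
```

On the perls I have tried, each message starts with "Nested quantifiers
in regex".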



Abigail
