RFC: Make \p{Foo} match \p{scx:foo} instead of \p{sc:foo}

Karl Williamson
June 24, 2016 21:27
The Script_Extensions property (scx) is an improved version of the 
Script property (sc).  The latter would not exist if they had thought of 
Script_Extensions earlier.

All Unicode properties are bipartite; you are supposed to give a 
property name, then a colon (or equals) then the value.  But Perl 
creates single-value synonyms for many properties, and Unicode actually 
encourages this.

Script names are one case of this.  If you say \p{Foo} where "Foo" is 
the name of a script, like Latin or Greek, Perl assumes you meant 
\p{Script=Foo}.  I'm proposing to change that assumption to 
\p{Script_Extensions=Foo}.
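
As a minimal sketch of the compound and single-value forms described above, the following compares the two spellings for a General_Category value and for a script name.  The sub names are made up for illustration.  The Greek letter used here gives the same answer whether the single form is read as Script or as Script_Extensions.

```perl
use strict;
use warnings;

# Compound form: property name, '=' (a colon also works), then the value.
sub is_upper_compound { return $_[0] =~ /\A\p{General_Category=Uppercase_Letter}\z/ ? 1 : 0 }
# Single form: Perl's one-word synonym for the same property and value.
sub is_upper_single   { return $_[0] =~ /\A\p{Lu}\z/ ? 1 : 0 }

# The same pairing for a script name.
sub is_greek_compound { return $_[0] =~ /\A\p{Script=Greek}\z/ ? 1 : 0 }
sub is_greek_single   { return $_[0] =~ /\A\p{Greek}\z/ ? 1 : 0 }

my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA
printf "A:     compound=%d single=%d\n",
    is_upper_compound('A'), is_upper_single('A');
printf "alpha: compound=%d single=%d\n",
    is_greek_compound($alpha), is_greek_single($alpha);
```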

The principal difference between the two properties is that in the old 
Script property (which Unicode keeps around for backwards 
compatibility), if a character occurs in more than one script, it is 
placed into a pseudo-script named "Common".  Script_Extensions differs 
in that a character goes into "Common" only if it really is common to a 
lot of scripts.  The numerals 0-9 and some punctuation are properly in 
"Common".  But for characters that are used in only a few scripts, scx 
places the character in each of those scripts.  For example the two 
major Japanese scripts have some characters that occur in either, but in 
no other of the world's scripts.  A bunch of scripts from India have 
some characters in common, but you wouldn't see them anywhere else.
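
The difference can be seen by asking both properties about the same characters.  This is only a sketch: scripts_of() is a hypothetical helper, the candidate list is an arbitrary handful of scripts, and it assumes a Perl recent enough to know the Script_Extensions property.  U+30FC is the prolonged sound mark shared by the two Japanese kana scripts, and U+0964 is the danda shared by a number of Indic scripts.

```perl
use strict;
use warnings;

# Return which of a few candidate scripts a character belongs to
# under the given property ('Script' or 'Script_Extensions').
sub scripts_of {
    my ($prop, $char) = @_;
    my @candidates = qw(Common Hiragana Katakana Devanagari Latin);
    return grep {
        my $pattern = '\A\p{' . $prop . '=' . $_ . '}\z';
        $char =~ /$pattern/;
    } @candidates;
}

# U+30FC prolonged sound mark, U+0964 danda, U+0030 digit zero
for my $cp (0x30FC, 0x0964, 0x0030) {
    my $char = chr $cp;
    printf "U+%04X  sc: %-8s scx: %s\n", $cp,
        join(',', scripts_of('Script', $char)),
        join(',', scripts_of('Script_Extensions', $char));
}
```

The digit stays in "Common" under both properties, while the other two characters move out of "Common" and into the scripts that actually use them.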

For code that needs to care, it is far easier to use the scx property 
than the sc property.  And it seems to me the Perl default should be for 
the easiest to use, DWIM, property, provided it doesn't break too much 
existing code.  Making this change would automatically fix most code 
that isn't smart enough to take into account the subtleties of how 
scripts work.  Careful code using the plain sc property also needs to 
accept certain other characters, or it may simply accept anything in 
"Common" as well, which yields less desirable results and potential security 
issues.  The scx property has had experts look at the scripts and 
determine which characters are used in more than one script; it's going 
to be better than the average programmer is even aware of.  There are 
fewer than 10 modules on CPAN that use the Common script explicitly 
(and none use the Inherited script, which is also slightly affected by 
this change), judging by a search for \\p\{(common|zyyy|zinh|inherited)}.

Looking at those cases, it looks like this change would make most of 
them work better.
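
To illustrate the earlier point about careful code, here is a sketch of three ways to check that a string is written in Hiragana.  The sub names and the sample strings are invented for the example.  The first string is a Hiragana word containing the prolonged sound mark U+30FC; the second mixes one Hiragana letter with ASCII punctuation and a digit.

```perl
use strict;
use warnings;

# Strict sc test: every character must have Script=Hiragana.
sub hira_sc_only   { return $_[0] =~ /\A\p{Script=Hiragana}+\z/ ? 1 : 0 }
# Usual workaround with sc: also let through everything in "Common".
sub hira_sc_common { return $_[0] =~ /\A[\p{Script=Hiragana}\p{Script=Common}]+\z/ ? 1 : 0 }
# scx test: characters that Script_Extensions assigns to Hiragana.
sub hira_scx       { return $_[0] =~ /\A\p{Script_Extensions=Hiragana}+\z/ ? 1 : 0 }

my $word   = "\x{3089}\x{30FC}\x{3081}\x{3093}";   # Hiragana word with U+30FC
my $markup = "\x{3042}<1>";                        # Hiragana letter plus "<1>"

for my $case ([word => $word], [markup => $markup]) {
    my ($name, $str) = @$case;
    printf "%-6s sc only=%d  sc+Common=%d  scx=%d\n", $name,
        hira_sc_only($str), hira_sc_common($str), hira_scx($str);
}
```

The strict sc test rejects the legitimate word, the sc-plus-Common test accepts the string with angle brackets, and the scx test accepts the first and rejects the second.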
