Re: RFC: Make \p{Foo} match \p{scx:foo} instead of \p{sc:foo}

Karl Williamson
July 1, 2016 04:23
On 06/29/2016 01:46 PM, Sawyer X wrote:
> [Top-posted]
> This seems to me like a reasonable and desired change. \p{Script=Foo}
> will still work and \p{Foo} will find a more accurate location based on
> Unicode's better understanding.
> I want to note that the "breakage" Karl mentions affects anyone who uses
> \p{Foo} expecting it to mean \p{Script=Foo}, where a shared character
> falls under Common. With scx, \p{Foo} would translate to
> \p{Script_Extensions=Foo}, and that character might not be in Common
> anymore (because it has presumably found a better, more accurate
> location).
> Karl++, as usual.
> On 06/24/2016 11:27 PM, Karl Williamson wrote:
>> The Script_Extensions property (scx) is an improved version of the
>> Script property (sc).  The latter would not exist if Unicode had
>> thought of Script_Extensions earlier.
>> All Unicode properties are bipartite; you are supposed to give a
>> property name, then a colon (or equals) then the value.  But Perl
>> creates single-value synonyms for many properties, and Unicode
>> actually encourages this.
>> Script names are one case of this.  If you say \p{Foo} where "Foo" is
>> the name of a script, like Latin or Greek, Perl assumes you meant
>> \p{Script=Foo}.  I'm proposing to change that assumption to
>> \p{Script_Extensions=Foo}.
>> The principal difference between the two properties is that in the
>> old Script property (which Unicode keeps around for backwards
>> compatibility), if a character occurs in more than one script, it is
>> placed into a pseudo-script named "Common".  Script_Extensions differs
>> in that a character goes into "Common" only if it really is common to
>> a lot of scripts.  The numerals 0-9 and some punctuation are properly
>> in "Common".  But for characters that are used in only a few scripts,
>> scx places the character in each of those scripts.  For example the
>> two major Japanese scripts have some characters that occur in either,
>> but in no other of the world's scripts.  A bunch of scripts from India
>> have some characters in common, but you wouldn't see them anywhere else.
>> For code that needs to care, it is far easier to use the scx property
>> than the sc property.  And it seems to me the Perl default should be
>> for the easiest to use, DWIM, property, provided it doesn't break too
>> much existing code.  Making this change would automatically fix most
>> code that isn't smart enough to take into account the subtleties of
>> how scripts work.  Careful code using the plain sc property needs to
>> also accept certain other characters, or it may just also accept
>> anything in "Common", which yields less desirable results, and
>> potential security issues.  For the scx property, experts have looked
>> at the scripts and determined which characters are used in more than
>> one script; the result is going to be better than anything the average
>> programmer would come up with.  There are fewer than 10 modules on
>> CPAN that use the Common script explicitly (and none use the Inherited
>> script, which is also slightly affected by this change).
>> Looking at those cases, it looks like this change would make most of
>> them work better.

Now pushed as 48791bf1d9612a84d71edc00af8610da1a6cf34b
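
To make the sc/scx distinction described in the quoted proposal concrete, here is a minimal sketch (not taken from the pushed commit) that checks a few code points against both properties. It uses only the explicit two-part forms, \p{Script=...} and \p{Script_Extensions=...}, so it should not depend on how a given Perl resolves the short \p{Foo} form. The helper name classify and the choice of sample characters are assumptions made for illustration; it also assumes a Perl recent enough to know the Script_Extensions property.

```perl
use strict;
use warnings;

# Hypothetical helper: report how one character is classified under the
# Script (sc) and Script_Extensions (scx) properties, for the values
# Common and Hiragana only.
sub classify {
    my ($ch) = @_;
    return {
        sc_common    => ($ch =~ /\p{Script=Common}/              ? 1 : 0),
        sc_hiragana  => ($ch =~ /\p{Script=Hiragana}/            ? 1 : 0),
        scx_common   => ($ch =~ /\p{Script_Extensions=Common}/   ? 1 : 0),
        scx_hiragana => ($ch =~ /\p{Script_Extensions=Hiragana}/ ? 1 : 0),
    };
}

# Sample characters (an illustrative choice):
#   U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK (shared by the two kana scripts)
#   U+3042 HIRAGANA LETTER A                      (Hiragana only)
#   U+0030 DIGIT ZERO                             (used across many scripts)
for my $cp (0x30FC, 0x3042, 0x0030) {
    my $r = classify(chr $cp);
    printf "U+%04X  sc: Common=%d Hiragana=%d  |  scx: Common=%d Hiragana=%d\n",
        $cp, @{$r}{qw(sc_common sc_hiragana scx_common scx_hiragana)};
}
```

Going by the Unicode data files, U+30FC should come out as Common under sc but as Hiragana (and not Common) under scx, while DIGIT ZERO stays in Common under both. That is the behavior the proposal describes: a character shared by only a few scripts moves out of the catch-all "Common" and into each script that uses it.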
