RFC: Forbid Unicode GCB non-starters as delimiters

From: Karl Williamson
Date: December 12, 2016 15:23
Subject: RFC: Forbid Unicode GCB non-starters as delimiters
Message ID: 175ed1ae-3ec6-f20f-1454-65eaf92447c9@khwilliamson.com

A grapheme cluster is a sequence of characters that looks to a native
speaker like a single character.  For example, 'n' + Combining Tilde
looks like a single ñ.  A GCB (grapheme cluster boundary) is the
position between two grapheme clusters.
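
To make the 'n' + Combining Tilde example concrete, here is a minimal
sketch in Python (chosen only because its standard library ships a
Unicode database; it is not how perl itself handles this).  It shows
that the sequence is two code points that compose to a single ñ.

```python
import unicodedata

# 'n' followed by U+0303 COMBINING TILDE: two code points, one grapheme cluster.
decomposed = "n\u0303"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                   # number of code points before composition
print(len(composed))                     # number of code points after NFC composition
print(unicodedata.name(decomposed[1]))   # name of the second code point
print(unicodedata.combining("\u0303"))   # nonzero combining class marks a combining character
```

Both forms display the same way, which is why a reader cannot tell from
the rendered text where one code point ends and the next begins.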

Currently perl allows things like a combining tilde to be delimiters,
such as string delimiters.  This causes visual confusion: the tilde is
displayed attached to the character before it, and does not appear to
delimit anything.

I propose that we deprecate the case where a delimiter is not separated 
from the character before it by a GCB.  This would be the first step in 
eventually allowing entire grapheme clusters to be delimiters, as 
someone using them would expect if they didn't know the internals.
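
A rough sketch of the kind of check this implies, again in Python.  It
is an approximation under a stated assumption: the general categories
Mn, Mc and Me stand in for "not preceded by a GCB".  The real rule
would use the Grapheme_Cluster_Break property from UAX #29, which
Python's standard library does not expose.  The names
attaches_to_previous and check_delimiter are hypothetical.

```python
import unicodedata

# Assumption: mark categories approximate "GCB non-starter".
# This is not the full UAX #29 rule set (for example, it ignores ZWJ).
MARK_CATEGORIES = {"Mn", "Mc", "Me"}

def attaches_to_previous(ch: str) -> bool:
    """Return True if ch is a mark that normally renders on the preceding character."""
    return unicodedata.category(ch) in MARK_CATEGORIES

def check_delimiter(ch: str) -> str:
    """Hypothetical check mirroring the proposed deprecation."""
    if attaches_to_previous(ch):
        name = unicodedata.name(ch, "<unnamed>")
        return "deprecated: U+%04X %s" % (ord(ch), name)
    return "ok"

print(check_delimiter("/"))        # an ordinary ASCII delimiter
print(check_delimiter("\u0303"))   # combining tilde used as a delimiter
```

An ordinary delimiter such as '/' passes, while the combining tilde is
flagged because it would be drawn on top of whatever precedes it.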

One issue is unassigned code points.  When finally assigned, one
could become a combining mark, and code that used it as a delimiter
would no longer compile.  We could solve this either by forbidding
unassigned code points as delimiters (unless we know that the code point
can never be assigned, as with non-characters or above-Unicode code
points) or, more likely, by just cautioning that use of unassigned code
points as delimiters is at your own risk.
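
The three kinds of code point distinguished above can be told apart
mechanically.  The following Python sketch illustrates the
classification; the function names are hypothetical, and the
"unassigned" answer depends on which Unicode version the running
interpreter was built with, which is exactly the instability described
above.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points Unicode guarantees will never be assigned."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def delimiter_stability(cp: int) -> str:
    """Classify a code point by whether its properties could change later."""
    if cp > 0x10FFFF:
        return "above-Unicode"   # outside Unicode, so never assigned
    if is_noncharacter(cp):
        return "noncharacter"    # permanently reserved, never assigned
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"      # may become a combining mark in a later version
    return "assigned"            # includes private-use and surrogate code points

for cp in (0x41, 0xFDD0, 0x50000, 0x110000):
    print("U+%04X" % cp, delimiter_stability(cp))
```

Only the "unassigned" class carries the risk: the other two can never
acquire combining behaviour, and assigned characters have stable
properties.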
