perl.perl5.porters | Postings from November 2009

What should \X match?

From: karl williamson
Date: November 27, 2009 08:45
Subject: What should \X match?
Message ID: 4B1001D0.6050903@khwilliamson.com

I believe the current definition of \X is flawed.  First of all it isn't 
the Unicode concept it purports to be.

\X is defined as qr/(?>\PM\pM*)/, and in several places the
documentation says that this is a Unicode "combining character
sequence".  The current definition for that concept is
qr/ {base}? ( \pM | \N{ZWJ} | \N{ZWNJ} )+ /x, where {base} is something
like, but not exactly, \PM.  Assume for the sake of argument that it is
exactly \PM.  Note that the base is optional in the Unicode definition,
but not in the Perl one.  Further note that the \pM is optional in the
Perl definition, but not in the Unicode one.  This means, for example,
that \X matches 'A', but a 'combining character sequence' does not.
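
To make the difference concrete, here is a minimal sketch that puts the two definitions side by side.  It takes {base} to be exactly \PM, as assumed above, and spells ZWJ and ZWNJ as \x{200D} and \x{200C}.  The helper names is_legacy_x and is_ccs are made up for this illustration.

```perl
use strict;
use warnings;

# The legacy definition of \X quoted above.
my $legacy_x = qr/(?>\PM\pM*)/;

# Combining character sequence, ASSUMING {base} is exactly \PM.
# U+200D is ZWJ and U+200C is ZWNJ.
my $ccs = qr/\PM?(?:\pM|\x{200D}|\x{200C})+/;

# True if the whole string is one match of the legacy pattern.
sub is_legacy_x {
    my ($str) = @_;
    return $str =~ /\A$legacy_x\z/ ? 1 : 0;
}

# True if the whole string is one combining character sequence.
sub is_ccs {
    my ($str) = @_;
    return $str =~ /\A$ccs\z/ ? 1 : 0;
}

my $bare     = "A";
my $accented = "A\x{301}";    # A followed by COMBINING ACUTE ACCENT

printf "bare A:     legacy=%d ccs=%d\n", is_legacy_x($bare),     is_ccs($bare);
printf "A + accent: legacy=%d ccs=%d\n", is_legacy_x($accented), is_ccs($accented);
```

Under these assumptions the bare 'A' should satisfy only the legacy pattern, while the accented form satisfies both.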

In other places, the Perl documentation says that \X is an
"extended Unicode combining character sequence".  I don't know what this
means.  Unicode has an "extended combining character sequence", which is
similar to the regular one but includes Hangul (Korean) syllables;
again, that is not the Perl definition.  Perhaps what is meant is that Perl
has extended (modified) the Unicode concept to be more like what it wanted.

The Perl documentation says "\X matches quite well what normal 
(non-Unicode-programmer) usage would consider a single character", that 
is, a logical character.  As Unicode TR29 says, "It is important to
recognize that what the user thinks of as a "character"—a basic unit of 
a writing system for a language—may not be just a single Unicode code 
point. Instead, that basic unit may be made up of multiple Unicode code 
points. To avoid ambiguity with the computer use of the term character, 
this is called a user-perceived character. For example, “G” + 
acute-accent is a user-perceived character: users think of it as a 
single character, yet is actually represented by two Unicode code 
points. These user-perceived characters are approximated by what is 
called a grapheme cluster, which can be determined programmatically."

What this means, I believe, is that \X should be a "grapheme cluster" 
instead of something like a "combining character sequence".  And I
propose to change it to be so.  Actually, I propose to change it to be 
the latest type of grapheme cluster, the "extended grapheme cluster". 
The "combining character sequence" is intended by Unicode to be used
for normalization purposes, not to define logical characters.

(Unicode did not always have the concept of the grapheme cluster 
defined, but it does now, and it appears to me to be what \X is supposed 
to mean, and in the areas that Unicode has examined, to use Jarkko's 
words, "Unicode knows better than Perl".)

What are the implications of changing?  It turns out that the \pM*
component of the Perl definition of \X is almost the same as the
corresponding part of an extended grapheme cluster.  The difference is
that the Unicode definition includes 11 more characters than the current
Perl one: ZWJ, ZWNJ, two Japanese characters that should be marks but
aren't because they were brought in with characteristics based on a
pre-existing standard, 4 Thai characters, and 3 Laotian characters.  So
I think this part of the definition change should not adversely affect
any existing code.

The principal difference is the beginning component.  The Perl 
definition can fail to match input if the next character is a mark.  The 
Unicode definition is guaranteed to match at least one character.  And 
this seems like a bug in the Perl definition to me.  \X is like a 
logical '.'  '.' always matches a character; therefore so should \X, and 
Unicode agrees.  It is rare to have a mark be in isolation, but it can 
happen.  The Standard gives the example of text talking about a mark.

The Unicode definition also forbids the splitting of a CR LF sequence.
As far as I know, these sequences rarely occur in Perl because of the
input processing, but one can certainly create a string containing one.
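
The two behaviours just described can be checked against the legacy pattern directly.  This is only a sketch: it spells out qr/(?>\PM\pM*)/ by hand rather than using the built-in \X, so that the result does not depend on which definition a given perl implements.  The helper names are made up for this illustration.

```perl
use strict;
use warnings;

# The legacy definition of \X, written out explicitly.
my $legacy_x = qr/(?>\PM\pM*)/;

# True if the legacy pattern can match at the very start of the string.
sub legacy_matches_at_start {
    my ($str) = @_;
    return $str =~ /\A$legacy_x/ ? 1 : 0;
}

# Number of separate matches the legacy pattern finds in the string.
sub legacy_count {
    my ($str) = @_;
    my $n = () = $str =~ /$legacy_x/g;
    return $n;
}

my $lone_mark = "\x{301}abc";   # text that starts with an isolated mark
my $crlf      = "\r\n";         # a CR LF pair

printf "matches at start of mark-initial text: %d\n",
    legacy_matches_at_start($lone_mark);
printf "pieces found in a CR LF pair: %d\n", legacy_count($crlf);
```

With the legacy pattern, the first line should report no match at the start of the mark-initial text, and the CR LF pair should come out as two separate pieces.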

I started to have a side discussion with Tom about this, but now think 
the wider community should be involved.  If people feel this would break 
existing code, we could add the capability to revert to the old 
semantics, by adding another switch to the legacy pragma.

For your information, the Unicode extended grapheme cluster definition,
using their terminology, is reproduced below.  For further information,
see http://www.unicode.org/reports/tr29/.

Comments


( CRLF
| Prepend* ( Hangul-syllable | !Control )
   ( Grapheme_Extend | Spacing_Mark)*
| . )

Prepend matches 5 Thai and 5 Lao characters that behave weirdly in other
ways as well.  Control is not just the control characters; it includes a
few other things as well, to make the definition work out.  The
combination of Grapheme_Extend or'd with Spacing_Mark is the same as \pM
plus the 11 characters I mentioned above.
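
As a rough illustration of how the grammar above reads as a Perl pattern, here is a deliberately simplified sketch.  It is not the full TR29 definition: Prepend and Hangul-syllable are left out, Control is approximated by \p{Cc}, and Spacing_Mark by \p{Mc}.  The names $egc_sketch and sketch_clusters are made up for this illustration.

```perl
use strict;
use warnings;

# Simplified sketch of the grammar quoted above.
# ASSUMPTIONS: Prepend and Hangul-syllable are omitted, Control is
# approximated by \p{Cc}, and Spacing_Mark by \p{Mc}.
my $egc_sketch = qr/
    (?>
        \r\n                              # CRLF is kept together
      | (?!\p{Cc}) .                      # any non-control base
        [\p{Grapheme_Extend}\p{Mc}]*      # followed by its extenders
      | .                                 # fallback: any one character
    )
/xs;

# Split a string into the pieces the sketch recognises.
# Call in list context.
sub sketch_clusters {
    my ($str) = @_;
    return $str =~ /$egc_sketch/g;
}

# e + COMBINING ACUTE ACCENT, then CR LF, then x
my @clusters = sketch_clusters("e\x{301}\r\nx");
printf "pieces: %d\n", scalar @clusters;

# An isolated mark is still matched, one character long.
my @lone = sketch_clusters("\x{301}");
printf "pieces for a lone mark: %d\n", scalar @lone;
```

Under these simplifications the first string should come out as three pieces (the accented e, the CR LF pair, and x), and the lone mark as one piece, which is the "always matches at least one character" property discussed earlier.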
