What should \X match?
From:
karl williamson
Date:
November 27, 2009 08:45
Subject:
What should \X match?
Message ID:
4B1001D0.6050903@khwilliamson.com
I believe the current definition of \X is flawed. First of all it isn't
the Unicode concept it purports to be.
\X is defined as qr/(?>\PM\pM*)/, and in several places the
documentation says that this is a Unicode "combining character
sequence". The current definition for that concept is
qr/ {base}? ( \pM | \N{ZWJ} | \N{ZWNJ} )+ /x, where {base} is something
like, but not exactly, \PM. Assume for the sake of argument that it is
exactly \PM. Note that the base is optional in the Unicode definition,
but not in the Perl one. Further note that the \pM is optional in the
Perl definition, but not in the Unicode one. This means, for example,
that \X matches 'A', but a 'combining character sequence' does not.
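The difference between the two definitions can be checked directly. The
following is only a sketch: it takes {base} to be exactly \PM, as
assumed above, writes ZWJ and ZWNJ as \x{200D} and \x{200C}, and uses a
helper name (whole_match) invented for this illustration.

```perl
use strict;
use warnings;

# Perl's documented definition of \X at the time of writing.
my $old_x = qr/(?>\PM\pM*)/;

# Combining character sequence, with {base} assumed to be exactly \PM.
# \x{200D} is ZWJ and \x{200C} is ZWNJ.
my $ccs = qr/\PM?(?:\pM|\x{200D}|\x{200C})+/;

# Hypothetical helper: 1 if $re matches the whole of $str, else 0.
sub whole_match {
    my ($re, $str) = @_;
    return $str =~ /\A$re\z/ ? 1 : 0;
}

my @samples = (
    [ 'A alone',          "A"        ],
    [ 'A + acute accent', "A\x{301}" ],
    [ 'acute alone',      "\x{301}"  ],
);

for my $s (@samples) {
    my ($label, $str) = @$s;
    printf "%-18s old \\X: %d   CCS: %d\n",
        $label, whole_match($old_x, $str), whole_match($ccs, $str);
}
```

Under these assumptions, 'A' alone is matched only by the old \X, the
isolated accent only by the combining character sequence pattern, and
'A' followed by the accent by both.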
In other places, the Perl documentation says that \X is an "extended
Unicode combining character sequence". I don't know what this means.
Unicode has an "extended combining character sequence", which is
similar to the regular one but includes Hangul (Korean) syllables;
again, it's not the Perl definition. Perhaps what is meant is that Perl
has extended (modified) the Unicode concept to be more like what it wanted.
The Perl documentation says "\X matches quite well what normal
(non-Unicode-programmer) usage would consider a single character", that
is, a logical character. As Unicode TR29 says, "It is important to
recognize that what the user thinks of as a "character"—a basic unit of
a writing system for a language—may not be just a single Unicode code
point. Instead, that basic unit may be made up of multiple Unicode code
points. To avoid ambiguity with the computer use of the term character,
this is called a user-perceived character. For example, “G” +
acute-accent is a user-perceived character: users think of it as a
single character, yet is actually represented by two Unicode code
points. These user-perceived characters are approximated by what is
called a grapheme cluster, which can be determined programmatically."
What this means, I believe, is that \X should be a "grapheme cluster"
instead of something like a "combining character sequence". And I
propose to change it to be so. Actually, I propose to change it to be
the latest type of grapheme cluster, the "extended grapheme cluster".
The "combining character sequence" is intended by Unicode to be used
for normalization purposes, not to define logical characters.
(Unicode did not always have the concept of the grapheme cluster
defined, but it does now, and it appears to me to be what \X is supposed
to mean. And in the areas that Unicode has examined, to use Jarkko's
words, "Unicode knows better than Perl".)
What are the implications of changing? It turns out that the \pM*
component of the Perl definition of \X is almost the same as the
corresponding part of an extended grapheme cluster. The difference is
that the Unicode definition includes 11 more characters than the current
Perl one: ZWJ, ZWNJ, two Japanese characters that should be marks but
aren't because they were brought in with characteristics based on a
pre-existing standard, 4 Thai characters, and 3 Lao characters. So I
think this part of the definition change should not adversely affect any
existing code.
The principal difference is the beginning component. The Perl
definition can fail to match input if the next character is a mark. The
Unicode definition is guaranteed to match at least one character. And
this seems like a bug in the Perl definition to me. \X is like a
logical '.'. And '.' always matches a character; therefore so should \X,
and Unicode agrees. It is rare to have a mark in isolation, but it can
happen. The Standard gives the example of text talking about a mark.
The Unicode definition also forbids the splitting of a CR LF sequence.
As far as I know, this sequence rarely occurs in Perl because of the
input processing, but one can certainly create a string that contains it.
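Both problems show up when the current definition is used to walk
through a string. This is a minimal sketch; the helper names (pieces,
show) and the sample string are made up for the illustration.

```perl
use strict;
use warnings;

# Perl's documented definition of \X at the time of writing.
my $old_x = qr/(?>\PM\pM*)/;

# Hypothetical helper: the substrings consumed by successive matches.
sub pieces {
    my ($re, $str) = @_;
    return $str =~ /($re)/g;
}

# Hypothetical helper: render a string as space-separated code points.
sub show {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $str);
}

# An isolated acute accent, then "a", then CR LF.
my $text = "\x{301}a\r\n";

print "[", show($_), "]\n" for pieces($old_x, $text);
```

The leading accent is never matched, so it is silently skipped, and the
CR and LF come out as two separate pieces rather than one.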
I started to have a side discussion with Tom about this, but now think
the wider community should be involved. If people feel this would break
existing code, we could add the capability to revert to the old
semantics, by adding another switch to the legacy pragma.
For your information, the Unicode extended grapheme cluster definition,
using their terminology, is reproduced below. For further information,
see http://www.unicode.org/reports/tr29/.
Comments
( CRLF
| Prepend* ( Hangul-syllable | !Control )
( Grapheme_Extend | Spacing_Mark)*
| . )
Prepend matches 5 Thai and 5 Lao characters that behave weirdly in other
ways as well. Control is not just the control characters, but also a few
other things that are needed to make the rules work out. The combination
of Grapheme_Extend or'd with Spacing_Mark is the same as \pM plus the 11
characters I mentioned above.
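To make the shape of that rule concrete, here is a deliberately
simplified sketch of it as a Perl pattern. It is not the real
definition: Control is approximated by \p{Cc}, the extender class by
\pM plus ZWNJ and ZWJ only, and the Prepend and Hangul-syllable parts
are left out. The names egc_sketch and clusters are invented here.

```perl
use strict;
use warnings;

# Simplified stand-in for the rule above. Assumptions: Control is
# approximated by \p{Cc}; the extender class is \pM plus ZWNJ and ZWJ;
# Prepend and Hangul syllables are omitted.
my $egc_sketch = qr{
    (?> \r\n                                # CRLF stays together
      | [^\p{Cc}] [\pM\x{200C}\x{200D}]*    # non-control start, then extenders
      | .                                   # otherwise one code point
    )
}xs;

# Hypothetical helper: split a string into clusters under the sketch.
sub clusters {
    my ($str) = @_;
    return $str =~ /($egc_sketch)/g;
}

# Isolated accent, "a" plus accent, CR LF, then a tab.
my @c = clusters("\x{301}a\x{301}\r\n\t");
printf "%d clusters\n", scalar @c;
```

With this sketch every code point of the input ends up in some cluster:
the isolated accent forms a cluster of its own, the CR LF pair is kept
as one unit, and the final alternative catches the tab.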