Re: [perl #99928] \R doesn't match correctly
From:
Tom Christiansen
Date:
October 27, 2011 11:55
Subject:
Re: [perl #99928] \R doesn't match correctly
Message ID:
9941.1319741735@chthon
>> Tom Christiansen has persuaded me that this is not a bug; that it is
>> working correctly. The reason is that Unicode considers CR-LF to be a
>> unit, and so they should not be separated, even if it causes a match to
>> fail, that with backtracking would succeed.
> Makes sense to me.
> Does \X use (?>...) too?
Yes. It has to, and for just the same reason. A close reading of tr18
will convince you that putting \p{Any} to either side of \X must under
no circumstances ever permit \X to become unaligned with respect to extended
grapheme cluster boundaries. Here's the text:
    Syntax  Description
    \b{g}   Zero-width match at a Unicode extended grapheme cluster boundary
    \b{w}   Zero-width match at a Unicode word boundary.
            Note that this is different than \b alone, which corresponds
            to \w and \W. See Annex C: Compatibility Properties.
    \b{l}   Zero-width match at a Unicode line break boundary
    \b{s}   Zero-width match at a Unicode sentence boundary

    Thus \X is equivalent to .+?\b{g}; proceed the minimal number of
    characters (but at least one) to get to the next extended grapheme
    cluster boundary.
Although they could have been clearer there, to me the only valid
reading of that text is that \X is not supposed to break up
graphemes, to match on a non-(grapheme boundary), or on a grapheme
nonboundary, or whatever you want to call it.
\X can *only* match a lone grapheme extend character if that character
has a linebreak before it or is at the start of the string. This is the
degenerate extender case. That's the only way you can get a grapheme
cluster boundary in front of a grapheme extender.
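To make the degenerate case concrete, here is a small sketch. The sample
strings and variable names are made up for illustration, and it assumes a
perl whose \X follows extended grapheme clusters (5.12 or later, as far as
I know).

```perl
use strict;
use warnings;

# A combining acute accent with nothing in front of it: there is a
# cluster boundary at the start of the string, so \X takes the mark alone.
my $lone = "\x{301}abc";
my ($first) = $lone =~ /\A(\X)/;
printf "start of string: %d code point(s)\n", length $first;

# The same mark right after a linebreak: the newline is its own cluster,
# so the mark again has a boundary in front of it.
my @clusters = "a\n\x{301}b" =~ /\X/g;
printf "after linebreak: %d clusters\n", scalar @clusters;

# After a base character the mark is absorbed into that base's cluster.
my @joined = "e\x{301}" =~ /\X/g;
printf "after a base:    %d cluster(s)\n", scalar @joined;
```

The middle case should split into "a", the newline, the lone accent, and
"b"; the last case should stay a single cluster.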
Here's my latest stab at writing about this. Karl, please feel free to
insert this, whether verbatim or distantly derived, into perlre if you
think that would help people understand.
The C<\X> metasymbol matches a character in a more extended sense. It
matches a string of one or more Unicode characters known as a “grapheme
cluster”. It’s meant to grab several characters in a row that together
represent a single glyph to the user. Typically it’s a base character
followed by combining diacritics like cedillas or diaereses that combine
with that base character to form one logical unit. It can also be any
Unicode linebreak sequence including C<"\r\n">, and, because one doesn’t
apply marks to linebreaks, it can even be a lone mark at the start of the
string or line.
Perl’s original C<\X> worked mostly like C<(?>\PM\pM*)>, but that doesn’t
work out so well, since Unicode refined its notion of grapheme clusters.
Its actual definition is complicated, but this is close enough:
    (?> \R
      | \p{Grapheme_Base} \p{Grapheme_Extend}*
      | \p{Grapheme_Extend}
    )
The point is that C<\X> matches one user-visible character (grapheme) even
if it takes several programmer-visible characters (codepoints) to do so. The
length of the string matched by C</\X/> could exceed one character if the
C<\R> in the pseudo-expansion above matched a CRLF pair, or if a grapheme
base character were followed by one or more grapheme extend characters.
N<Usually combining marks; currently the only non-mark grapheme
extend characters are S<ZERO WIDTH NON-JOINER>, S<ZERO WIDTH
JOINER>, S<HALFWIDTH KATAKANA VOICED SOUND MARK>, and
S<HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK>.>
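A rough illustration of the length point, not part of the proposed pod:
each hypothetical sample string below is meant to be one user-visible
character, and the loop compares the number of code points in the string
with the number that a single \X consumes. Again this assumes a perl with
the extended-grapheme-cluster \X.

```perl
use strict;
use warnings;

# Hypothetical samples; each should be a single grapheme cluster.
my %sample = (
    'CRLF pair'             => "\r\n",
    'e + acute'             => "e\x{301}",
    'e + acute + dot below' => "e\x{301}\x{323}",
    'plain e'               => "e",
);

for my $name (sort keys %sample) {
    my $str = $sample{$name};
    my ($cluster) = $str =~ /\A(\X)/;    # one \X from the start
    printf "%-22s %d code point(s), \\X matched %d\n",
        $name, length $str, length $cluster;
}
```

If the pseudo-expansion is a fair approximation, both numbers should agree
on every row, with only the "plain e" row showing a length of one.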
The possessive group means C<\X> can’t change its mind once it’s found a
base character with any extend characters after it. For example, C</\X.\z/>
can never match “C<cafe\x{301}>”, where U+0301 is S<COMBINING ACUTE ACCENT>,
because C<\X> cannot be backtracked into.
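The no-backtracking behaviour can be checked with a short sketch like the
following. The variable names are illustrative; the second and third
patterns use the old-style C<\PM\pM*> approximation purely for contrast,
and the last two lines show the analogous C<\R> case with a CRLF pair.

```perl
use strict;
use warnings;

my $str = "cafe\x{301}";    # "cafe" followed by COMBINING ACUTE ACCENT

# \X keeps "e" and U+0301 together, so "." has nothing left to match.
my $with_X     = $str =~ /\X.\z/          ? 1 : 0;

# The old-style expansion inside an atomic group: same outcome.
my $atomic     = $str =~ /(?>\PM\pM*).\z/ ? 1 : 0;

# Without (?>...) the engine can give the accent back to ".".
my $backtracks = $str =~ /(?:\PM\pM*).\z/ ? 1 : 0;

# \R is atomic as well: it takes CR and LF together and will not
# release the LF for a following \n.
my $R_then_lf  = "\r\n" =~ /\A\R\n\z/ ? 1 : 0;
my $R_alone    = "\r\n" =~ /\A\R\z/   ? 1 : 0;

print "X=$with_X atomic=$atomic backtracks=$backtracks ",
      "R_then_lf=$R_then_lf R_alone=$R_alone\n";
# prints: X=0 atomic=0 backtracks=1 R_then_lf=0 R_alone=1
```

Only the non-atomic group finds a match on the accented string, which is
exactly the kind of match that would split a grapheme.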
Pseudopod legend for where at variance from normal pod:
* The N<> tag above is for a footnote. In unpseudopod it could just
be a parenthetical statement.
* Those S<> tags are not unbreakable spaces here; they select
the font's "small capitals" feature. That's because one is supposed
to use small capitals to typeset names from the combined namespace
of Unicode named characters, named aliases, and name sequences.
Hope this helps.
--tom
PS: Suggested revisions are welcome -- if you *hurry*. :)