perl.perl5.porters | Postings from October 2011

Re: [perl #99928] \R doesn't match correctly

From: Tom Christiansen
Date: October 27, 2011 11:55
Subject: Re: [perl #99928] \R doesn't match correctly
Message ID: 9941.1319741735@chthon

>> Tom Christiansen has persuaded me that this is not a bug; that it is
>> working correctly.  The reason is that Unicode considers CR-LF to be a
>> unit, and so they should not be separated, even if it causes a match to
>> fail, that with backtracking would succeed.

> Makes sense to me.

> Does \X use (?>...) too?

Yes.  It has to, and for just the same reason.  A close reading of tr18
will convince you that putting \p{Any} to either side of \X must under
no circumstances ever permit \X to become unaligned with respect to extended
grapheme cluster boundaries.  Here's the text:

    Syntax   Description

    \b{g}    Zero-width match at a Unicode extended grapheme cluster boundary
    \b{w}    Zero-width match at a Unicode word boundary.
             Note that this is different than \b alone, which corresponds to \w and \W.
             See Annex C: Compatibility Properties.
    \b{l}    Zero-width match at a Unicode line break boundary
    \b{s}    Zero-width match at a Unicode sentence boundary

    Thus \X is equivalent to .+?\b{g}; proceed the minimal number of
    characters (but at least one) to get to the next extended grapheme
    cluster boundary.

Although they could have been clearer there, to me the only valid
reading of that text is that \X is not supposed to break up
graphemes, that is, to stop at a non-(grapheme boundary), or at a
grapheme nonboundary, or whatever you want to call it.
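
For anyone who wants to check the tr18 equivalence against a real perl,
here is a minimal sketch.  It assumes perl 5.22 or later, where the
grapheme cluster boundary assertion is spelled \b{gcb} (a later addition,
not something available when this message was written).  The helper name
clusters_via is hypothetical and used only for illustration.

```perl
use v5.22;      # assumption: \b{gcb} needs perl 5.22 or later
use warnings;

# Hypothetical helper: return every successive match of $regex in $string.
sub clusters_via {
    my ($string, $regex) = @_;
    return $string =~ /($regex)/g;
}

# "cafe" + COMBINING ACUTE ACCENT, followed by a CRLF pair.
my $text = "cafe\x{301}\r\n";

# Split once with \X and once with the tr18 spelling .+?\b{g},
# written here as .+?\b{gcb} with /s so that "." can match CR and LF.
my @by_X   = clusters_via($text, qr/\X/);
my @by_gcb = clusters_via($text, qr/.+?\b{gcb}/s);

# Print lengths in code points rather than the pieces themselves,
# which avoids "Wide character" warnings on output.
say "via \\X:          ", join ",", map { length($_) } @by_X;
say "via .+?\\b{gcb}:  ", join ",", map { length($_) } @by_gcb;
```

Both splits should agree piece for piece: the accent stays with its "e"
and the CRLF pair stays together.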

\X can *only* match a grapheme extend character if that character has a
linebreak before it or is at the start of the string.  This is the
degenerate extender case. That's the only way you can get a grapheme
cluster boundary in front of a grapheme extender.
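
The degenerate extender case is easy to see by looking at how long each
\X match is.  This is only a sketch, assuming a perl recent enough that
\X follows extended grapheme clusters; cluster_lengths is a hypothetical
helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: length in code points of each successive \X match.
sub cluster_lengths {
    my ($string) = @_;
    return map { length($_) } $string =~ /(\X)/g;
}

# U+0301 is COMBINING ACUTE ACCENT, a grapheme extend character.
my @at_start    = cluster_lengths("\x{301}a");      # mark at start of string
my @after_break = cluster_lengths("a\n\x{301}b");   # mark right after a linebreak
my @after_base  = cluster_lengths("a\x{301}b");     # mark after a base character

print "mark at start of string: @at_start\n";
print "mark after a linebreak:  @after_break\n";
print "mark after a base:       @after_base\n";
```

In the first two strings the mark has nothing to attach to, so it comes
out as a cluster of its own; in the third it is absorbed into the
cluster that starts at "a".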

Here's my latest stab at writing about this.  Karl, please feel free to
insert this, whether verbatim or distantly derived, into perlre if you
think that would help people understand.

    The C<\X> metasymbol matches a character in a more extended sense. It
    matches a string of one or more Unicode characters known as a “grapheme
    cluster”. It’s meant to grab several characters in a row that together
    represent a single glyph to the user.  Typically it’s a base character
    followed by combining diacritics like cedillas or diaereses that combine
    with that base character to form one logical unit. It can also be any
    Unicode linebreak sequence including C<"\r\n">, and, because one doesn’t
    apply marks to linebreaks, it can even be a lone mark at the start of the
    string or line.

    Perl’s original C<\X> worked mostly like C<(?>\PM\pM*)>, but that no
    longer works out so well now that Unicode has refined its notion of
    grapheme clusters.  The actual definition of a grapheme cluster is
    complicated, but this is close enough:

         (?> \R
           | \p{Grapheme_Base} \p{Grapheme_Extend}*
           | \p{Grapheme_Extend}
         )

    The point is that C<\X> matches one user-visible character (grapheme) even
    if it takes several programmer-visible characters (codepoints) to do so. The
    length of the string matched by C</\X/> could exceed one character if the
    C<\R> in the pseudo-expansion above matched a CRLF pair, or if a grapheme
    base character were followed by one or more grapheme extend characters.
        N<Usually combining marks; currently the only non-mark grapheme
         extend characters are S<ZERO WIDTH NON-JOINER>, S<ZERO WIDTH
         JOINER>, S<HALFWIDTH KATAKANA VOICED SOUND MARK>, and
         S<HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK>.>
    The possessive group means C<\X> can’t change its mind once it’s found a
    base character with any extend characters after it.  For example, C</\X.\z/>
    can never match “C<cafe\x{301}>”, where U+0301 is S<COMBINING ACUTE ACCENT>,
    because C<\X> cannot be backtracked into.
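
To make the no-backtracking point concrete, here is a small sketch that
runs the example from the draft text and contrasts it with the same
pattern minus the possessive group.  The variable names are my own, and
the $approx pattern is just the "close enough" pseudo-expansion quoted
above, not perl's real definition of \X.

```perl
use strict;
use warnings;

my $cafe = "cafe\x{301}";   # U+0301 COMBINING ACUTE ACCENT follows the "e"

# \X grabs "e" plus its accent as one unit and will not hand the accent
# back, so there is nothing left for "." to match before \z.
my $with_X = ($cafe =~ /\X.\z/) ? 1 : 0;

# The old-style pattern written WITHOUT the possessive group can
# backtrack: \pM* gives up the accent and "." then matches it.
my $without_atomic = ($cafe =~ /(?:\PM\pM*).\z/) ? 1 : 0;

# Put the possessive group back and it fails the same way \X does.
my $with_atomic = ($cafe =~ /(?>\PM\pM*).\z/) ? 1 : 0;

# \R treats CRLF the same way: the pair is never split.
my $crlf_split = ("\r\n" =~ /\A\R\n\z/) ? 1 : 0;
my $crlf_whole = ("\r\n" =~ /\A\R\z/)   ? 1 : 0;

# The "close enough" pseudo-expansion from the text, compared with \X.
my $approx = qr/(?> \R
                  | \p{Grapheme_Base} \p{Grapheme_Extend}*
                  | \p{Grapheme_Extend}
                )/x;
my $sample      = "cafe\x{301}\r\n";
my @x_lens      = map { length($_) } $sample =~ /(\X)/g;
my @approx_lens = map { length($_) } $sample =~ /($approx)/g;

print "\\X.\\z matches:              $with_X\n";
print "(?:\\PM\\pM*).\\z matches:     $without_atomic\n";
print "(?>\\PM\\pM*).\\z matches:     $with_atomic\n";
print "\\A\\R\\n\\z matches CRLF:      $crlf_split\n";
print "\\A\\R\\z matches CRLF:        $crlf_whole\n";
print "cluster lengths via \\X:     @x_lens\n";
print "cluster lengths via approx: @approx_lens\n";
```

Only the non-possessive variant finds a match on the accented string,
and it does so by tearing the accent off its base character.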

Pseudopod legend for where at variance from normal pod:

 *  The N<> tag above is for a footnote.  In unpseudopod it could just 
    be a parenthetical statement.

 *  Those S<> tags are not unbreakable spaces here; they select
    the font's "small capitals" feature.  That's because one is supposed
    to use small capitals to typeset names from the combined namespace
    of Unicode named characters, named aliases, and named sequences.

Hope this helps.

--tom

    PS: Suggested revisions are welcome -- if you *hurry*. :)
