
Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"

From: Tom Christiansen
Date: November 28, 2013 15:40
Subject: Re: RFC: What to do about warning: "Code point 0xFOO is not Unicode, all \p{} matches fail; all \P{} matches succeed"
Message ID: 24847.1385653199@chthon

Karl, thanks for all the work.

I have only one other comment.  The \p{Any} property is not a Perl-
special, as I think some people (not you) may have thought.  It is
required by tr18’s RL1.2 on required properties:

    ‘Any’ matches all code points. This could also be captured with
    [\u{0}-\u{10FFFF}]. In some regular expression languages, \p{Any} may
    be expressed by a period, but that may exclude newline characters.
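
The difference from a period is easy to check in Perl itself. What follows is only a minimal sketch: the helper name matches_one is made up for illustration, and it tests single code points inside the Unicode range only.

```perl
use strict;
use warnings;

# Hypothetical helper: does the one-character string chr($cp)
# match the given pattern in its entirety?
sub matches_one {
    my ($re, $cp) = @_;
    return chr($cp) =~ /\A$re\z/ ? 1 : 0;
}

my $any = qr/\p{Any}/;
my $dot = qr/./;        # no /s flag, so a newline is excluded

for my $cp (0x00, 0x0A, 0x41, 0x10FFFF) {
    printf "U+%04X  \\p{Any}=%d  dot=%d\n",
        $cp, matches_one($any, $cp), matches_one($dot, $cp);
}
```

The only row where the two columns should disagree is U+000A, which is the newline caveat the tr18 text mentions.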

I’d swear that once upon a time there was some notion that \p{Any}
might match a locale-tailored grapheme cluster.  Those are the ones
from the CLDR locales used in collation, like how “ch” and “ll” are
each a single collation grapheme cluster in traditional Spanish, as
well as “ñ”. (Hungarian and Irish Gaelic each offer many more such
examples.)

However, I can no longer find any mention of such a thing in the Level 3
section of tr18.  Instead they’ve replaced it with things like

    \X{locale-id}

or even 

    \T{locale-id}
        ... \X ...
    \E

So for example, I quote from tr18 v17:

    For example, an implementation could interpret \X{es-u-co-trad} as
    matching a collation grapheme cluster for a traditional Spanish
    ordering, or use a switch to change the meaning of \X during some 
    span of the regular expression.

So I guess they have reneged on \p{Any} ever someday possibly matching
something whose length in code points is other than one.

There is of course a potential problem with the \X{locale_id} syntax
conflicting with the normal \X{m,n} \X{m,} \X{m} syntax, where m and 
n are integers for a repetition count. 

But I think that if we restrict locale ids to legal identifiers
(possibly even ASCII-only), and which therefore cannot begin with a(n
ASCII) digit, then we would be fine.
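
To make that rule concrete, here is a rough sketch of how the text between the braces could be told apart. Nothing like this exists in perl; the function name classify_brace and the exact patterns are assumptions made purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what the text between the braces of a
# postfix \X{...} would mean under the restriction suggested above.
sub classify_brace {
    my ($inner) = @_;
    # {m}, {m,} and {m,n} repetition counts always start with a digit.
    return 'quantifier' if $inner =~ /\A[0-9]+(?:,[0-9]*)?\z/;
    # An ASCII-only identifier can never start with a digit.
    return 'locale'     if $inner =~ /\A[A-Za-z_][A-Za-z0-9_]*\z/;
    return 'invalid';
}

print "$_ => ", classify_brace($_), "\n"
    for '4', '2,', '2,5', 'es', 'es__traditional', 'es-u-co-trad';
```

Note that es-u-co-trad, the form tr18 uses in its own example, is not an identifier in this sense because of the hyphens. A real rule would have to allow them, although a hyphenated id still could not be mistaken for a repetition count.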

However, I have the feeling that this thought isn’t fully baked yet, 
if you know what I mean.

For example, first look at these two sets, where I’ll use the same locale
names as the Unicode::Collate::Locale module uses:

    es:
        a b c    d e f g h i j k l    m n ñ o p q r s t u v w x y z
    es__traditional:
        a b c ch d e f g h i j k l ll m n ñ o p q r s t u v w x y z

I can imagine both postfix bracketing styles coëxisting:

    \X{es__traditional}{4}

which would match words like “chico” and “niño”, whereas

    \X{es}{4}

would only match the second one, word breaks notwithstanding.
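
The cluster counts behind that claim can be checked with a small sketch. It only emulates the idea: the digraph table and the function name count_clusters are invented for illustration, and this is not how a real \X{locale} would be implemented.

```perl
use strict;
use warnings;

# Hypothetical tailoring table: digraphs that count as a single
# collation grapheme cluster in each locale.
my %DIGRAPHS = (
    es              => [],
    es__traditional => [ 'ch', 'll' ],
);

# Count clusters by trying the locale's digraphs first and falling
# back to an ordinary \X extended grapheme cluster.
sub count_clusters {
    my ($locale, $word) = @_;
    my $alt = join '|', map { quotemeta($_) } @{ $DIGRAPHS{$locale} || [] };
    my $re  = length($alt) ? qr/$alt|\X/i : qr/\X/;
    my @clusters = $word =~ /$re/g;
    return scalar @clusters;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $word ('chico', "ni\x{F1}o") {
    for my $locale ('es', 'es__traditional') {
        printf "%-16s %-6s %d\n", $locale, $word, count_clusters($locale, $word);
    }
}
```

Under these assumptions “chico” is five clusters in es but four in es__traditional, while “niño” is four in both, which is why only the traditional pattern with {4} would cover both words.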

They like this for tailored word breaks:

    \b{w}       # word boundary
    \b{w:es}    # Spanish boundary

So I guess you could have (in /x mode)

    \b{w:es} \X{es}{4} \b{w:es}

Or equivalently, allowing the locale to distribute to both
the boundaries and the clusters:

    \T{es} \b{w} \X{4} \b{w} \E

The problem with this conflicting notion of \X is that you can have an
acute accent on any vowel in Spanish, but a diaeresis over the ‘u’
only.  Does that mean a borrowed word like “Noël” would fail?   I
don’t know.  It’s even harder when dealing with Old Spanish, where
many words now spelled with c or z were then spelled with ç, as many
still are in Portuguese.  For example, the name of the epic Spanish
poem when it was written was “El Cantar de Myo Çid”, so it would be
rude to discard the Ç.

I do like the Level 3 ideas, but even they recognize that there has to
be a way to turn it off at times:

    There must be some sort of syntax that will allow Level 3 support
    to be turned on and off, for two reasons. Level 3 support may be
    considerably slower than Level 2, and most regular expressions may
    require Level 1 or Level 2 matches to work properly. The syntax
    should also specify the particular locale or other tailoring
    customization that the pattern was designed for, because tailored
    regular expression patterns are usually quite specific to the
    locale, and will generally not work across different locales.

They’ve also been retracting portions that they realized were impossible or
at least infeasible, so it’s something of a moving target.  For example, they’ve
retracted both RL3.4 Tailored Loose Matches and RL3.5 Tailored Ranges.  

A tailored range would be a square-bracketed character class that includes 
national collation sequences.  On Linux, even in the en_US.UTF-8 locale,
GNU grep does weird stuff.  For example, this matches:

    $ perl -CS -le 'print "\xF1"' | grep '^[a-z]$'
    ñ

Isn’t that bizarre?

And I have no idea why or how.  They are doing something very odd with
ranges, and I don’t know at what level it is happening either. I could
imagine that matching in a Spanish locale, but it surprises me that it
matches in an English one (perhaps from imports like “jalapeño”, but I
rather doubt it).

In any event, tr18 dumped tailored ranges, so we shouldn’t ever have
to worry about that.  We of course don’t do any such thing, no matter
how hard you ask Perl:

    $ perl -CS -le 'print "\xF1" =~ /^[a-z]$/ || "FAIL"'
    FAIL

    $ perl -CS -Mlocale -le 'print "\xF1" =~ /^[a-z]$/ || "FAIL"'
    FAIL
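
For contrast, here is roughly what a range test by collation order, rather than by code point, would look like. This is only a sketch using the Unicode::Collate module with its default, untailored ordering. The two helper names are made up for illustration, and nothing here is meant as an account of what GNU grep actually does.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;   # default UCA ordering, no locale tailoring

# Hypothetical helper: true if $ch sorts between $lo and $hi
# under the collator's ordering.
sub in_collation_range {
    my ($ch, $lo, $hi) = @_;
    return $collator->cmp($lo, $ch) <= 0 && $collator->cmp($ch, $hi) <= 0;
}

# Hypothetical helper: true if $ch lies between $lo and $hi by
# code point, which is how Perl treats [a-z].
sub in_codepoint_range {
    my ($ch, $lo, $hi) = @_;
    return ord($lo) <= ord($ch) && ord($ch) <= ord($hi);
}

my $enye = "\x{F1}";
print "collation:  ", (in_collation_range($enye, 'a', 'z') ? "in" : "out"), "\n";
print "code point: ", (in_codepoint_range($enye, 'a', 'z') ? "in" : "out"), "\n";
```

By collation order ñ falls between n and o, so it lands inside a to z, whereas by code point U+00F1 is well past z.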

So grep and perl differ. Frankly, that’s fine by me.  And as noted,
tr18 dumped tailored ranges, so we shouldn’t ever have to worry about
that, I *hope*.

--tom
