perl.perl5.porters | Postings from December 2011

Re: Solving the *real* Dot Problem

From: Karl Williamson
Date: December 23, 2011 13:40
Subject: Re: Solving the *real* Dot Problem
Message ID: 4EF4F502.7080600@khwilliamson.com

On 12/23/2011 01:22 PM, Brian Fraser wrote:
> On Wed, Jul 6, 2011 at 10:33 PM, Tom Christiansen <tchrist@perl.com> wrote:
>
>     THESIS:  Perl’s /./ is fundamentally broken, be it (?s:.) or (?-s:.).
>              It’s long past time we fast‐forward a few decades in how we
>              think about all this.
>
>     SUMMARY: Perl needs to stop making it so easy to do the wrong thing here,
>              and instead start making it easy to do the right thing.  Let’s
>              stop wasting time/brain/etc diddling around with a 1980s‐style
>              ASCII solution in our Unicode world of the 2010s and beyond!
>
>     Zsbán Ambrus <ambrus@math.bme.hu> wrote
>     on Wed, 06 Jul 2011 20:58:57 +0200:
>
>      > On Wed, Jul 6, 2011 at 1:48 AM, Jesse Vincent <jesse@fsck.com> wrote:
>
>      > [On the new \N regex escape that matches any one character except \n.]
>
>      >> Is it being used? (Are folks cpanning modules that use it?)
>
>      > It may get more use once perl 5.14 spreads, because there you can 'use
>      > re "/s";' to make the dot have the more useful meaning and then \N has
>      > the occasionally useful meaning.  Further, if 'use 5.016;' enabled
>      > 'use re "/s";' by default, it would see even more use.
>
>     Upgrading the status of \N from experimental to something more solid is
>     a timely and necessary, but sadly insufficient step, toward solving the
>     Dot Problem.  Diddling around with . and \N and such ignores the *real*
>     issue: that those are ASCII thingamaboogers — but Perl needs Unicode
>     ones.
>
>
>     By the Dot Problem, I mean a regex metacharacter matching just “one” of
>     “anything”, for a broad sense of anything but a narrow sense of one.
>
>       {  NB: I am not referring to a literal FULL STOP nor its 3 other NFKD or
>             \p{SB=AT} aliases, let alone the \p{SB=ST} stuff.   Use NFKD eq “.”
>             or /\p{SB=AT}/ if that’s the sort of literal dot you want.  }
>
>     Here are 5 possible meanings for dot.  I start with the original and *LEAST
>     USEFUL OF ALL POSSIBLE MEANINGS*, and progress to the most useful ones, the
>     ones that I think people should usually be using these days:
>
>         1 = no  re /s       (traditional and annoying)
>         2 = use re /s       (necessary but insufficient)
>         3 = \V              (improved #1)
>         4 = \X              (improved #2)
>         5 = \X unless \R    (improved #2, #3)
>
>     See?  How often do you guys write the *wrong* one of those?
>     If you are like most of us, almost always.  And that’s a problem.
>
>     What we really need are dots that mean something OTHER than just 1 or 2.
>     1 and 2 are from the bad, old pre‐Unicode days.  They are not only of
>     limited utility today, they can actually be harmful, because they
>     break text!
>
>     For most of my work, “.” is simply wrong, not because of the pitiably
>     insignificant newline issue, but rather because it can destroy graphemes.
>     So, in fact, does 3, although that at least stops being idiotic about
>     linebreak stuff.  Whether I want 4 or 5 depends, but I certainly nearly
>     never want 1 *or* 2 — which is all anyone is even talking about right now.
>
>     Here is each of those same 1–5, sometimes now with a tiny bit of
>     elaboration. While I go through these, please be thinking not just
>     about Huffman optimization, but also about having sane defaults.
>
>       1. Any one code point except for a linefeed (with weirdness on old Macs).
>          These are all currently exactly equivalent:
>
>              1a:   (?-s:.)
>             =1b:   \N
>             =1c:   [^\x{0A}]
>             =1d:   [^\N{LINE FEED}]
>
>       2. Any one code point whatsoblinkingever. These are both
>          exactly equivalent:
>
>              2a:   (?s:.)
>             =2b:   \p{Any}
>
>       3. Any one code point that lacks the Vert_Space binary property.
>          These are all (currently?) exactly equivalent:
>
>              3a.   \V
>             =3b.   \P{Vert_Space}
>             =3c.   [^\p{LB=CR}\p{LB=LF}\p{LB=NL}\p{LB=BK}]
>
>          BTW, here are the linebreak properties of the 7 \v code points:
>
>               LB=LF  Line_Break=Line_Feed         U+000A  LINE FEED (LF)
>               LB=BK  Line_Break=Mandatory_Break   U+000B  LINE TABULATION
>               LB=BK  Line_Break=Mandatory_Break   U+000C  FORM FEED (FF)
>               LB=CR  Line_Break=Carriage_Return   U+000D  CARRIAGE RETURN (CR)
>               LB=NL  Line_Break=Next_Line         U+0085  NEXT LINE (NEL)
>               LB=BK  Line_Break=Mandatory_Break   U+2028  LINE SEPARATOR
>               LB=BK  Line_Break=Mandatory_Break   U+2029  PARAGRAPH SEPARATOR
>
>           We’re eventually going to have to do something a huge whole lot
>           smarter with line breaks (UAX#14), word breaks (UAX#29), and
>           more, but I’m for now deferring the discussion of \b{line},
>           \b{word}, &c.
>
>       4:  Any one grapheme (=EGC), including even a single CRLF:
>
>              4a.  \X
>             =4b:  /* see "case CLUMP" in perl/regexec.c; go ahead, I dare ya! */
>
>       5:  Any grapheme except for \R, which being itself (?:\x0D\x0A|\v),
>           gets rid of the CRLFs and the verticals:
>
>              5.  (?!\R)\X
>
>     Just how all this works with UAX#14 or \b{LINE} or whatever, not to
>     mention Perl5’s completely broken version of “^” and “$” (which AHEM!
>     both Java7 and Perl6 got/get/shall-get right — at least if you compile
>     stuff with the right flags) I don’t know.  But please please do not think
>     about addressing the Dot Problem without understanding all these issues.
>
>       { I am delaying for now an exact syntactic proposal, although I have
>        several concrete ideas about how to go about this.  Joyfully, none
>        requires dinking around with silly /modifiers.  Rather, they involve
>        certain regex‐embedded pragmas.  (I am *so* done with single‐letter
>        identifiers, hello!)  We need these for a lot more than just this,
>        too.  That’s a topic for another day.  Several proposals are pending.
>        More on that later.  Sometime.  It won’t be the /dual route, I promise. }
>
>     Meanwhile, before rushing in where angels fear to tread, let’s please step
>     back and evaluate the original sense of “.” (and probably also of “^” and
>     “$” too).  Deduce those principles and apply them to today’s world, not
>     yesterday’s. An ASCII‐only solution isn’t worth the cost tradeoff.  The Dot
>     Problem will never be solved until people start thinking in Unicode not
>     ASCII. Otherwise you’ll “solve” the “wrong” “problem”.
>
>     --tom
>
>
>
> So uh.
> I'm reviving this because I just found something interesting and
> somewhat tangentially related.
> (?^s:.) and \p{Any} are not equivalent!
> The dot really does mean "any character", while \p{Any} means "any
> Unicode character." Watch:
>
> use 5.010;    # enables say
>
> for my $re ( qr/\p{Any}/, qr/./s ) {
>      my $matched = () = "a\x{FFFF_FFF}b" =~ /$re/g;
>      say "$re matched $matched times";
> }
>
> With warnings on, the \p{Any} will throw a warning about non-Unicode,
> and match twice. Meanwhile, the dot will not throw a warning, and match
> three times.
> I don't think that this is a bug, since \p{} is a _Unicode_ property,
> and those aren't Unicode code points, but it was still surprising.
>
> (I came upon this whilst reinventing a wheel at work: We needed to
> roundtrip a list to string, and then back to list, and for reasons I
> couldn't quite grasp, storing the list somewhere wasn't possible. This
> is one of the abominations that came out.)
>
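
For anyone who wants to see the five candidate meanings of dot from the
quoted message side by side, here is a minimal sketch.  It is not from the
thread: the sample string, the count_matches() helper and the @dots table
are made up for illustration, and it assumes perl 5.12 or later (for \N as
a character class and for \X treating CRLF as one grapheme).

```perl
use strict;
use warnings;

# Count how many times $re matches in $text, using the same
# list-context //g idiom as the quoted snippet.
sub count_matches {
    my ($re, $text) = @_;
    my $count = () = $text =~ /$re/g;
    return $count;
}

# Hypothetical sample: "e" + COMBINING ACUTE ACCENT, then CRLF, then "x".
# That is five code points but only three graphemes.
my $text = "e\x{301}\r\nx";

# The five meanings, labelled with the numbers used in the quoted list.
my @dots = (
    [ '1  (?-s:.)'  => qr/(?-s:.)/  ],
    [ '2  (?s:.)'   => qr/(?s:.)/   ],
    [ '3  \V'       => qr/\V/       ],
    [ '4  \X'       => qr/\X/       ],
    [ '5  (?!\R)\X' => qr/(?!\R)\X/ ],
);

for my $dot (@dots) {
    my ($label, $re) = @$dot;
    printf "%-12s matched %d times\n", $label, count_matches($re, $text);
}
```

On this sample I would expect counts of 4, 5, 3, 3 and 2.  Meanings 1 to 3
each split the accented "e" into two matches, which is the grapheme breakage
the quoted message complains about; 4 keeps it whole and counts the CRLF
once, and 5 additionally skips the line ending.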

This text from perlrecharclass comes to mind:

Unicode properties are defined (surprise!) only on Unicode code points.
A warning is raised and all matches fail on non-Unicode code points
(those above the legal Unicode maximum of 0x10FFFF).  This can be
somewhat surprising,

  chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/      # Fails.
  chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/     # Also fails!

Even though these two matches might be thought of as complements, they
are so only on Unicode code points.
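
A small sketch for trying that out yourself.  The hex_digit_results()
helper is hypothetical, and the 'non_unicode' warnings category assumes
perl 5.14 or later.  The behaviour for the above-Unicode code point is what
the documentation quoted here describes; later perl releases may treat such
code points differently, so the script prints what it finds rather than
assuming an answer.

```perl
use strict;
use warnings;
no warnings 'non_unicode';    # silence the above-Unicode warning for this demo

# Returns two flags for a single character: whether it matches the
# property, and whether it matches the property's stated complement.
sub hex_digit_results {
    my ($char) = @_;
    my $is     = $char =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
    my $is_not = $char =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;
    return ($is, $is_not);
}

# "A" is a hex digit, "G" is not, and 0x110000 is above the Unicode maximum.
for my $cp (0x41, 0x47, 0x110000) {
    my ($is, $is_not) = hex_digit_results(chr $cp);
    printf "U+%04X  =True: %d  =False: %d\n", $cp, $is, $is_not;
}
```

For the two ordinary characters exactly one of the flags is set.  If the
last line shows 0 for both, you are seeing the "both fail" case from the
documentation.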


