
Solving the *real* Dot Problem (was: Is 5.16 the time to remove \N, the complement of \n, from being experimental?)

From: Tom Christiansen
Date: July 6, 2011 18:33
Subject: Solving the *real* Dot Problem (was: Is 5.16 the time to remove \N, the complement of \n, from being experimental?)
Message ID: 26464.1310002397@chthon

THESIS:  Perl’s /./ is fundamentally broken, be it (?s:.) or (?-s:.).
         It’s long past time we fast‐forward a few decades in how we 
         think about all this.

SUMMARY: Perl needs to stop making it so easy to do the wrong thing here,
         and instead start making it easy to do the right thing.  Let’s 
         stop wasting time/brain/etc diddling around with a 1980s‐style
         ASCII solution in our Unicode world of the 2010s and beyond!

Zsbán Ambrus <ambrus@math.bme.hu> wrote on Wed, 06 Jul 2011 20:58:57 +0200: 

> On Wed, Jul 6, 2011 at 1:48 AM, Jesse Vincent <jesse@fsck.com> wrote:

> [On the new \N regex escape that matches any one character except \n.]

>> Is it being used? (Are folks cpanning modules that use it?)

> It may get more use once perl 5.14 spreads, because there you can 'use
> re "/s";' to make the dot have the more useful meaning and then \N has
> the occasionally useful meaning.  Further, if 'use 5.016;' enabled
> 'use re "/s";' by default, it would see even more use.

Upgrading the status of \N from experimental to something more solid is
a timely and necessary, but sadly insufficient, step toward solving the
Dot Problem.  Diddling around with . and \N and such ignores the *real*
issue: that those are ASCII thingamaboogers — but Perl needs Unicode ones.


By the Dot Problem, I mean a regex metacharacter matching just “one” of 
“anything”, for a broad sense of anything but a narrow sense of one.

 {  NB: I am not referring to a literal FULL STOP nor its 3 other NFKD or
        \p{SB=AT} aliases, let alone the \p{SB=ST} stuff.   Use NFKD eq “.”
        or /\p{SB=AT}/ if that’s the sort of literal dot you want.  }

Here are 5 possible meanings for dot.  I start with the original and *LEAST
USEFUL OF ALL POSSIBLE MEANINGS*, and progress to the most useful ones, the
ones that I think people should usually be using these days:

    1 = no  re /s       (traditional and annoying)
    2 = use re /s       (necessary but insufficient)
    3 = \V              (improved #1)
    4 = \X              (improved #2)
    5 = \X unless \R    (improved #2, #3)

See?  How often do you guys write the *wrong* one of those?  
If you are like most of us, almost always.  And that’s a problem.

What we really need are dots that mean something OTHER than just 1 or 2.
1 and 2 are from the bad, old pre‐Unicode days.  They are not only of 
limited utility today, they can actually be harmful, because they break text!

For most of my work, “.” is simply wrong, not because of the pitiably
insignificant newline issue, but rather because it can destroy graphemes.
So, in fact, does 3, although that at least stops being idiotic about
linebreak stuff.  Whether I want 4 or 5 depends, but I certainly nearly 
never want 1 *or* 2 — which is all anyone is even talking about right now.
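
To make the grapheme complaint concrete, here is a minimal sketch. The sample string is made up for illustration, and the patterns are simply the spellings from the table above, with (?!\R)\X standing in for “\X unless \R”:

```perl
use strict;
use warnings;

# Hypothetical sample: "a", a CRLF pair, then "e" followed by
# U+0301 COMBINING ACUTE ACCENT (two code points, one grapheme), then LF.
my $text = "a\x{0D}\x{0A}e\x{0301}\x{0A}";

my @m2 = $text =~ /./sg;          # meaning 2: every code point on its own
my @m4 = $text =~ /\X/g;          # meaning 4: one grapheme per match
my @m5 = $text =~ /(?!\R)\X/g;    # meaning 5: graphemes that are not linebreaks

# Meaning 2 splits the CRLF and tears the accent off its base letter;
# meaning 4 keeps both together; meaning 5 also drops the linebreaks.
printf "2: %d pieces, 4: %d pieces, 5: %d pieces\n",
    scalar @m2, scalar @m4, scalar @m5;
# prints "2: 6 pieces, 4: 4 pieces, 5: 2 pieces"
```

Anything that then reverses, truncates, or counts the pieces from meaning 2 is working on half a character.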

Here is each of those same 1–5, sometimes now with a tiny bit of
elaboration. While I go through these, please be thinking not just 
about Huffman optimization, but also about having sane defaults.

  1. Any one code point except for a linefeed (with weirdness on old Macs).
     These are all currently exactly equivalent:

         1a:   (?-s:.)             
        =1b:   \N
        =1c:   [^\x{0A}] 
        =1d:   [^\N{LINE FEED}] 

  2. Any one code point whatsoblinkingever. These are both 
     exactly equivalent:

         2a:   (?s:.)
        =2b:   \p{Any}

  3. Any one code point that lacks the Vert_Space binary property.
     These are all (currently?) exactly equivalent:

         3a.   \V
        =3b.   \P{Vert_Space}
        =3c.   [^\p{LB=CR}\p{LB=LF}\p{LB=NL}\p{LB=BK}]

     BTW, here are the linebreak properties of the 7 \v code points:

          LB=LF  Line_Break=Line_Feed         U+000A  LINE FEED (LF)
          LB=BK  Line_Break=Mandatory_Break   U+000B  LINE TABULATION
          LB=BK  Line_Break=Mandatory_Break   U+000C  FORM FEED (FF)
          LB=CR  Line_Break=Carriage_Return   U+000D  CARRIAGE RETURN (CR)
          LB=NL  Line_Break=Next_Line         U+0085  NEXT LINE (NEL)
          LB=BK  Line_Break=Mandatory_Break   U+2028  LINE SEPARATOR
          LB=BK  Line_Break=Mandatory_Break   U+2029  PARAGRAPH SEPARATOR

      We’re eventually going to have to do something a huge whole lot
      smarter with line breaks (UAX#14), word breaks (UAX#29), and
      more, but I’m for now deferring the discussion of \b{line},
      \b{word}, &c.

  4.  Any one grapheme (=EGC), including even a single CRLF:

         4a.  \X
        =4b:  /* see "case CLUMP" in perl/regexec.c; go ahead, I dare ya! */

  5.  Any grapheme except for \R, which being itself (?>\x0D\x0A|\v),
      gets rid of the CRLFs and the verticals:

         5.  (?!\R)\X
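
The “exactly equivalent” claims for meanings 1 through 3 can be spot-checked rather than taken on faith. What follows is only a sketch: the sample range, the variable names, and the choice to test one code point at a time are my assumptions, and agreement on a sample is evidence, not proof.

```perl
use strict;
use warnings;
use charnames ':full';   # lets \N{LINE FEED} resolve on older perls

# Each meaning maps to the spellings the list above calls equivalent.
my %spellings = (
    1 => [ qr/\A(?-s:.)\z/, qr/\A\N\z/,
           qr/\A[^\x{0A}]\z/, qr/\A[^\N{LINE FEED}]\z/ ],
    2 => [ qr/\A(?s:.)\z/, qr/\A\p{Any}\z/ ],
    3 => [ qr/\A\V\z/, qr/\A\P{Vert_Space}\z/,
           qr/\A[^\p{LB=CR}\p{LB=LF}\p{LB=NL}\p{LB=BK}]\z/ ],
);

# Assumed sample: U+0000..U+2FFF (well below the surrogates)
# plus one code point outside the BMP.
my @sample = (0x0000 .. 0x2FFF, 0x1F600);

my (%disagreements, %rejected);
for my $n (sort keys %spellings) {
    $disagreements{$n} = 0;
    $rejected{$n}      = [];
    for my $cp (@sample) {
        my $ch       = chr $cp;
        my @verdicts = map { $ch =~ $_ ? 1 : 0 } @{ $spellings{$n} };
        # Count code points on which the spellings do not all agree.
        $disagreements{$n}++ if grep { $_ != $verdicts[0] } @verdicts;
        # Record what the first spelling refuses to match.
        push @{ $rejected{$n} }, sprintf('U+%04X', $cp) unless $verdicts[0];
    }
    printf "meaning %s: %d disagreement(s), rejects [%s]\n",
        $n, $disagreements{$n}, join(' ', @{ $rejected{$n} });
}
```

On a perl that behaves as described above, meaning 1 should reject only U+000A, meaning 2 nothing at all, and meaning 3 exactly the seven \v code points from the table.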

Just how all this works with UAX#14 or \b{LINE} or whatever, not to
mention Perl5’s completely broken version of “^” and “$” (which AHEM!
both Java7 and Perl6 got/get/shall-get right, at least if you compile stuff
with the right flags), I don’t know.  But please please do not think about
addressing the Dot Problem without understanding all these issues.
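
One way to see the same ASCII bias in the anchors, sketched under an assumption: the paragraph above does not spell out what is broken about “^” and “$”, so this only shows that /m treats nothing but LF as a line boundary, while \R knows about the rest. The sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical sample: four "lines" separated by U+2028 LINE SEPARATOR,
# a bare CR, and an LF.
my $text = "one\x{2028}two\x{0D}three\x{0A}four";

# Under /m, "^" matches at the start of the string and after each LF only.
my $caret_starts = () = $text =~ /^/mg;

# Splitting on \R recognizes every kind of linebreak listed earlier.
my @lines = split /\R/, $text;

printf "^ sees %d line start(s); \\R sees %d line(s)\n",
    $caret_starts, scalar @lines;
# prints "^ sees 2 line start(s); \R sees 4 line(s)"
```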

 { I am delaying for now an exact syntactic proposal, although I have
   several concrete ideas about how to go about this.  Joyfully, none
   requires dinking around with silly /modifiers.  Rather, they involve
   certain regex‐embedded pragmas.  (I am *so* done with single‐letter
   identifiers, hello!)  We need these for a lot more than just this,
   too.  That’s a topic for another day.  Several proposals are pending.
   More on that later.  Sometime.  It won’t be the /dual route, I promise. }

Meanwhile, before rushing in where angels fear to tread, let’s please step
back and evaluate the original sense of “.” (and probably also of “^” and
“$” too).  Deduce those principles and apply them to today’s world, not
yesterday’s. An ASCII‐only solution isn’t worth the cost tradeoff. The Dot
Problem will never be solved until people start thinking in Unicode not
ASCII. Otherwise you’ll “solve” the “wrong” “problem”.

--tom
