develooper Front page | perl.perl5.porters | Postings from December 2011

Re: Solving the *real* Dot Problem (was: Is 5.16 the time to remove \N, the complement of \n, from being experimental?)

From: Brian Fraser
Date: December 23, 2011 12:23
Subject: Re: Solving the *real* Dot Problem (was: Is 5.16 the time to remove \N, the complement of \n, from being experimental?)
Message ID: CA+nL+na8R402FTnxW6o7AzoBaB7b1gVVLaCE5SG4DgQfDTN41g@mail.gmail.com

On Wed, Jul 6, 2011 at 10:33 PM, Tom Christiansen <tchrist@perl.com> wrote:

> THESIS:  Perl’s /./ is fundamentally broken, be it (?s:.) or (?-s:.).
>         It’s long past time we fast‐forward a few decades in how we
>         think about all this.
>
> SUMMARY: Perl needs to stop making it so easy to do the wrong thing here,
>         and instead start making it easy to do the right thing.  Let’s
>         stop wasting time/brain/etc diddling around with a 1980s‐style
>         ASCII solution in our Unicode world of the 2010s and beyond!
>
> Zsbán Ambrus <ambrus@math.bme.hu> wrote on Wed, 06 Jul 2011 20:58:57
> +0200:
>
> > On Wed, Jul 6, 2011 at 1:48 AM, Jesse Vincent <jesse@fsck.com> wrote:
>
> > [On the new \N regex escape that matches any one character except \n.]
>
> >> Is it being used? (Are folks cpanning modules that use it?)
>
> > It may get more use once perl 5.14 spreads, because there you can 'use
> > re "/s";' to make the dot have the more useful meaning and then \N has
> > the occasionally useful meaning.  Further, if 'use 5.016;' enabled
> > 'use re "/s";' by default, it would see even more use.
>
> Upgrading the status of \N from experimental to something more solid is
> a timely and necessary, but sadly insufficient step, toward solving the
> Dot Problem.  Diddling around with . and \N and such ignores the *real*
> issue: that those are ASCII thingamaboogers — but Perl needs Unicode ones.
>
>
> By the Dot Problem, I mean a regex metacharacter matching just “one” of
> “anything”, for a broad sense of anything but a narrow sense of one.
>
>  {  NB: I am not referring to a literal FULL STOP nor its 3 other NFKD or
>        \p{SB=AT} aliases, let alone the \p{SB=ST} stuff.   Use NFKD eq “.”
>        or /\p{SB=AT}/ if that’s the sort of literal dot you want.  }
>
> Here are 5 possible meanings for dot.  I start with the original and *LEAST
> USEFUL OF ALL POSSIBLE MEANINGS*, and progress to the most useful ones, the
> ones that I think people should usually be using these days:
>
>    1 = no  re /s       (traditional and annoying)
>    2 = use re /s       (necessary but insufficient)
>    3 = \V              (improved #1)
>    4 = \X              (improved #2)
>    5 = \X unless \R    (improved #2, #3)
>
> See?  How often do you guys write the *wrong* one of those?
> If you are like most of us, almost always.  And that’s a problem.
>
> What we really need are dots that mean something OTHER than just 1 or 2.
> 1 and 2 are from the bad, old pre‐Unicode days.  They are not only of
> limited utility today, they can actually be harmful, because they break
> text!
>
> For most of my work, “.” is simply wrong, not because of the pitiably
> insignificant newline issue, but rather because it can destroy graphemes.
> So, in fact, does 3, although that at least stops being idiotic about
> linebreak stuff.  Whether I want 4 or 5 depends, but I certainly nearly
> never want 1 *or* 2 — which is all anyone is even talking about right now.
>
> Here is each of those same 1–5, sometimes now with a tiny bit of
> elaboration. While I go through these, please be thinking not just
> about Huffman optimization, but also about having sane defaults.
>
>  1. Any one code point except for a linefeed (with weirdness on old Macs).
>     These are all currently exactly equivalent:
>
>         1a:   (?-s:.)
>        =1b:   \N
>        =1c:   [^\x{0A}]
>        =1d:   [^\N{LINE FEED}]
>
>  2. Any one code point whatsoblinkingever. These are both
>     exactly equivalent:
>
>         2a:   (?s:.)
>        =2b:   \p{Any}
>
>  3. Any one code point that lacks the Vert_Space binary property.
>     These are all (currently?) exactly equivalent:
>
>         3a.   \V
>        =3b.   \P{Vert_Space}
>        =3c.   [^\p{LB=CR}\p{LB=LF}\p{LB=NL}\p{LB=BK}]
>
>     BTW, here are the linebreak properties of the 7 \v code points:
>
>          LB=LF  Line_Break=Line_Feed         U+000A  LINE FEED (LF)
>          LB=BK  Line_Break=Mandatory_Break   U+000B  LINE TABULATION
>          LB=BK  Line_Break=Mandatory_Break   U+000C  FORM FEED (FF)
>          LB=CR  Line_Break=Carriage_Return   U+000D  CARRIAGE RETURN (CR)
>          LB=NL  Line_Break=Next_Line         U+0085  NEXT LINE (NEL)
>          LB=BK  Line_Break=Mandatory_Break   U+2028  LINE SEPARATOR
>          LB=BK  Line_Break=Mandatory_Break   U+2029  PARAGRAPH SEPARATOR
>
>      We’re eventually going to have to do something a huge whole lot
>      smarter with line breaks (UAX#14), word breaks (UAX#29), and
>      more, but I’m for now deferring the discussion of \b{line},
>      \b{word}, &c.
>
>  4.  Any one grapheme (=EGC), including even a single CRLF:
>
>         4a.  \X
>        =4b:  /* see "case CLUMP" in perl/regexec.c; go ahead, I dare ya! */
>
>  5.  Any grapheme except for \R, which being itself (?>\x0D\x0A|\v),
>      gets rid of the CRLFs and the verticals:
>
>         5.  (?!\R)\X
>
> Just how all this works with UAX#14 or \b{LINE} or whatever, not to
> mention Perl5’s completely broken version of “^” and “$” (which AHEM!
> both Java7 and Perl6 got/get/shall-get right — at least if you compile stuff
> with the right flags) I don’t know.  But please please do not think about
> addressing the Dot Problem without understanding all these issues.
>
>  { I am delaying for now an exact syntactic proposal, although I have
>   several concrete ideas about how to go about this.  Joyfully, none
>   requires dinking around with silly /modifiers.  Rather, they involve
>   certain regex‐embedded pragmas.  (I am *so* done with single‐letter
>   identifiers, hello!)  We need these for a lot more than just this,
>   too.  That’s a topic for another day.  Several proposals are pending.
>   More on that later.  Sometime.  It won’t be the /dual route, I promise. }
>
> Meanwhile, before rushing in where angels fear to tread, let’s please step
> back and evaluate the original sense of “.” (and probably also of “^” and
> “$” too).  Deduce those principles and apply them to today’s world, not
> yesterday’s. An ASCII‐only solution isn’t worth the cost tradeoff. The Dot
> Problem will never be solved until people start thinking in Unicode not
> ASCII. Otherwise you’ll “solve” the “wrong” “problem”.
>
> --tom
>
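
The list of seven \v code points in the quoted message can be checked
directly. This is a minimal sketch, assuming Perl 5.10 or later (where \v
exists); the scan limit of U+2FFF and the variable name @vertical are
illustrative choices, not anything from the thread.

```perl
use strict;
use warnings;

# Collect every code point up to U+2FFF whose character matches \v.
# The upper bound is arbitrary; it only needs to cover U+2028 and U+2029.
my @vertical = grep { chr($_) =~ /\v/ } 0 .. 0x2FFF;

# Print each one in the usual U+XXXX notation.
printf "U+%04X\n", $_ for @vertical;
```

The printed code points can then be compared against the table in the
quoted text.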

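The five meanings of dot listed in the quoted message can be compared side
by side by counting matches on one small string. This is a sketch, assuming
Perl 5.14 or later (where \X matches an extended grapheme cluster); the
sample text, the labels, and the helper count_matches are made up for
illustration.

```perl
use strict;
use warnings;

# Sample text: "e" with two combining marks, one CRLF pair, then "x".
# That is six code points but only three graphemes.
my $text = "e\x{301}\x{323}\r\nx";

# The five candidate "dots" from the quoted list, in the same order.
my @dots = (
    [ '1: (?-s:.)'  => qr/(?-s:.)/  ],
    [ '2: (?s:.)'   => qr/(?s:.)/   ],
    [ '3: \V'       => qr/\V/       ],
    [ '4: \X'       => qr/\X/       ],
    [ '5: (?!\R)\X' => qr/(?!\R)\X/ ],
);

# Count non-overlapping matches of $re in $str.
sub count_matches {
    my ($re, $str) = @_;
    my $n = () = $str =~ /$re/g;
    return $n;
}

for my $dot (@dots) {
    my ($label, $re) = @$dot;
    printf "%-14s matched %d time(s)\n", $label, count_matches($re, $text);
}
```

On this input the first three patterns count code points, so they split the
"e" from its combining marks, while the last two count graphemes and differ
only in whether the CRLF pair counts as a match.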

So uh.
I'm reviving this because I just found something interesting and somewhat
tangentially related.
(?^s:.) and \p{Any} are not equivalent!
The dot really does mean "any character", while \p{Any} means "any Unicode
character." Watch:

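use 5.014;      # turns on say() and strict
use warnings;   # needed to see the non-Unicode warning

for my $re ( qr/\p{Any}/, qr/./s ) {
    # Assigning the /g match to () in list context counts the matches.
    my $matched = () = "a\x{FFFF_FFF}b" =~ /$re/g;
    say "$re matched $matched times";
}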

With warnings on, the \p{Any} will throw a warning about non-Unicode, and
match twice. Meanwhile, the dot will not throw a warning, and match three
times.
I don't think that this is a bug, since \p{} is a _Unicode_ property, and
those aren't Unicode code points, but it was still surprising.
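
The same distinction can be seen without depending on how a particular Perl
release treats \p{Any} or its warning. This is a minimal sketch: the
explicit range [\x{0}-\x{10FFFF}] stands in for "any Unicode code point",
which is an assumption of the sketch rather than something taken from the
test above, and the variable names are illustrative.

```perl
use strict;
use warnings;

# "a", one code point far above Unicode's maximum (U+10FFFF), then "b".
my $text = "a\x{FFFFFFF}b";

# Dot under /s counts every code point, Unicode or not.
my $dot_count = () = $text =~ /./sg;

# An explicit class covering only Unicode's code space, 0 .. 0x10FFFF.
my $unicode_count = () = $text =~ /[\x{0}-\x{10FFFF}]/g;

# The characters that fall outside that range, found by code point value.
my @above = grep { ord($_) > 0x10FFFF } split //, $text;

printf "dot: %d, Unicode range: %d, above Unicode: %d\n",
    $dot_count, $unicode_count, scalar @above;
printf "non-Unicode code point: 0x%X\n", ord($_) for @above;
```

Only the code point values are printed, since writing the non-Unicode
character itself to a filehandle may produce a warning of its own.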

(I came upon this whilst reinventing a wheel at work: We needed to
roundtrip a list to string, and then back to list, and for reasons I
couldn't quite grasp, storing the list somewhere wasn't possible. This is
one of the abominations that came out.)
