Front page | perl.perl5.porters |
Postings from December 2011
Re: Solving the *real* Dot Problem (was: Is 5.16 the time to remove\N, the complement of \n, from being experimental?)
From:
Brian Fraser
Date:
December 23, 2011 12:23
Subject:
Re: Solving the *real* Dot Problem (was: Is 5.16 the time to remove\N, the complement of \n, from being experimental?)
Message ID:
CA+nL+na8R402FTnxW6o7AzoBaB7b1gVVLaCE5SG4DgQfDTN41g@mail.gmail.com
On Wed, Jul 6, 2011 at 10:33 PM, Tom Christiansen <tchrist@perl.com> wrote:
> THESIS: Perl’s /./ is fundamentally broken, be it (?s:.) or (?-s:.).
> It’s long past time we fast‐forward a few decades in how we
> think about all this.
>
> SUMMARY: Perl needs to stop making it so easy to do the wrong thing here,
> and instead start making it easy to do the right thing. Let’s
> stop wasting time/brain/etc diddling around with a 1980s‐style
> ASCII solution in our Unicode world of the 2010s and beyond!
>
> Zsbán Ambrus <ambrus@math.bme.hu> wrote on Wed, 06 Jul 2011 20:58:57
> +0200:
>
> > On Wed, Jul 6, 2011 at 1:48 AM, Jesse Vincent <jesse@fsck.com> wrote:
>
> > [On the new \N regex escape that matches any one character except \n.]
>
> >> Is it being used? (Are folks cpanning modules that use it?)
>
> > It may get more use once perl 5.14 spreads, because there you can 'use
> > re "/s";' to make the dot have the more useful meaning and then \N has
> > the occasionally useful meaning. Further, if 'use 5.016;' enabled
> > 'use re "/s";' by default, it would see even more use.
>
> Upgrading the status of \N from experimental to something more solid is
> a timely and necessary, but sadly insufficient step, toward solving the
> Dot Problem. Diddling around with . and \N and such ignores the *real*
> issue: that those are ASCII thingamaboogers — but Perl needs Unicode ones.
>
>
> By the Dot Problem, I mean a regex metacharacter matching just “one” of
> “anything”, for a broad sense of anything but a narrow sense of one.
>
> { NB: I am not referring to a literal FULL STOP nor its 3 other NFKD or
> \p{SB=AT} aliases, let alone the \p{SB=ST} stuff. Use NFKD eq “.”
> or /\p{SB=AT}/ if that’s the sort of literal dot you want. }
>
> Here are 5 possible meanings for dot. I start with the original and *LEAST
> USEFUL OF ALL POSSIBLE MEANINGS*, and progress to the most useful ones, the
> ones that I think people should usually be using these days:
>
> 1 = no re /s (traditional and annoying)
> 2 = use re /s (necessary but insufficient)
> 3 = \V (improved #1)
> 4 = \X (improved #2)
> 5 = \X unless \R (improved #2, #3)
>
> See? How often do you guys write the *wrong* one of those?
> If you are like most of us, almost always. And that’s a problem.
>
> What we really need are dots that mean something OTHER than just 1 or 2.
> 1 and 2 are from the bad, old pre‐Unicode days. They are not only of
> limited utility today, they can actually be harmful, because they break
> text!
>
> For most of my work, “.” is simply wrong, not because of the pitiably
> insignificant newline issue, but rather because it can destroy graphemes.
> So, in fact, does 3, although that at least stops being idiotic about
> linebreak stuff. Whether I want 4 or 5 depends, but I certainly nearly
> never want 1 *or* 2 — which is all anyone is even talking about right now.
>
> Here is each of those same 1–5, sometimes now with a tiny bit of
> elaboration. While I go through these, please be thinking not just
> about Huffman optimization, but also about having sane defaults.
>
> 1. Any one code point except for a linefeed (with weirdness on old Macs).
> These are all currently exactly equivalent:
>
> 1a: (?-s:.)
> =1b: \N
> =1c: [^\x{0A}]
> =1d: [^\N{LINE FEED}]
>
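A quick way to check the equivalence claimed for meaning 1 is to count matches over a string that mixes a linefeed with other line terminators. This is only a sketch: the sample string and the variable names are made up for illustration, and 1d is left out so that nothing depends on how a given perl loads character names.

```perl
use strict;
use warnings;

# Sample text (an assumption for illustration): six code points, one of
# which is LINE FEED; the others include CR, NEL and LINE SEPARATOR.
my $text = "a\nb\r\x{85}\x{2028}";

# Forms 1a, 1b and 1c from the list above.
my %form = (
    '(?-s:.)'   => qr/(?-s:.)/,
    '\N'        => qr/\N/,
    '[^\x{0A}]' => qr/[^\x{0A}]/,
);

my %count;
for my $name (sort keys %form) {
    # Assigning to () forces list context so every match is counted.
    my $n = () = $text =~ /$form{$name}/g;
    $count{$name} = $n;
    print "$name matched $n times\n";
}
```

Each form should report the same count, because only the linefeed is excluded; CR, NEL and LINE SEPARATOR all still match.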
> 2. Any one code point whatsoblinkingever. These are both
> exactly equivalent:
>
> 2a: (?s:.)
> =2b: \p{Any}
>
> 3. Any one code point that lacks the Vert_Space binary property.
> These are all (currently?) exactly equivalent:
>
> 3a. \V
> =3b. \P{Vert_Space}
> =3c. [^\p{LB=CR}\p{LB=LF}\p{LB=NL}\p{LB=BK}]
>
> BTW, here are the linebreak properties of the 7 \v code points:
>
> LB=LF  Line_Break=Line_Feed        U+000A  LINE FEED (LF)
> LB=BK  Line_Break=Mandatory_Break  U+000B  LINE TABULATION
> LB=BK  Line_Break=Mandatory_Break  U+000C  FORM FEED (FF)
> LB=CR  Line_Break=Carriage_Return  U+000D  CARRIAGE RETURN (CR)
> LB=NL  Line_Break=Next_Line        U+0085  NEXT LINE (NEL)
> LB=BK  Line_Break=Mandatory_Break  U+2028  LINE SEPARATOR
> LB=BK  Line_Break=Mandatory_Break  U+2029  PARAGRAPH SEPARATOR
>
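The seven code points in the table can be checked against \v and \V directly. The following is a minimal sketch; the array names are illustrative and not taken from the original message.

```perl
use strict;
use warnings;

# The seven vertical-space code points listed in the table above.
my @vertical = (0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029);

# Each one should match \v as a whole one-character string...
my @matches_v = grep { chr($_) =~ /\A\v\z/ } @vertical;

# ...and therefore none should match its complement, \V.
my @matches_big_v = grep { chr($_) =~ /\A\V\z/ } @vertical;

printf "%d of %d match \\v, %d match \\V\n",
    scalar @matches_v, scalar @vertical, scalar @matches_big_v;
```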
> We’re eventually going to have to do something a huge whole lot
> smarter with line breaks (UAX#14), word breaks (UAX#29), and
> more, but I’m for now deferring the discussion of \b{line},
> \b{word}, &c.
>
> 4: Any one grapheme (=EGC), including even a single CRLF:
>
> 4a. \X
> =4b: /* see "case CLUMP" in perl/regexec.c; go ahead, I dare ya! */
>
> 5: Any grapheme except for \R, which being itself (?:\x0D\x0A|\v),
> gets rid of the CRLFs and the verticals:
>
> 5. (?!\R)\X
>
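To make the difference between the five meanings concrete, here is a small sketch that counts what each one matches in the same string. The sample text (a decomposed "é", a CRLF, and a "z") and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# "e" + COMBINING ACUTE ACCENT, then CR LF, then "z": five code points.
my $text = "e\x{301}\r\nz";

my @dot       = $text =~ /./g;          # meaning 1: code points except LF
my @dot_s     = $text =~ /./gs;         # meaning 2: every code point
my @not_vert  = $text =~ /\V/g;         # meaning 3: non-vertical code points
my @graphemes = $text =~ /\X/g;         # meaning 4: graphemes, CRLF counts as one
my @no_breaks = $text =~ /(?!\R)\X/g;   # meaning 5: graphemes that are not linebreaks

printf "1: %d  2: %d  3: %d  4: %d  5: %d\n",
    scalar @dot, scalar @dot_s, scalar @not_vert,
    scalar @graphemes, scalar @no_breaks;
# prints "1: 4  2: 5  3: 3  4: 3  5: 2"
```

Meanings 1 to 3 all split the accent from its base letter; only 4 and 5 keep "e" and U+0301 together as one unit.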
> Just how all this works with UAX#14 or \b{LINE} or whatever, not to
> mention Perl5’s completely broken version of “^” and “$” (which AHEM!
> both Java7 and Perl6 got/get/shall-get right, at least if you compile stuff
> with the right flags) I don’t know. But please please do not think about
> addressing the Dot Problem without understanding all these issues.
>
> { I am delaying for now an exact syntactic proposal, although I have
> several concrete ideas about how to go about this. Joyfully, none
> requires dinking around with silly /modifiers. Rather, they involve
> certain regex‐embedded pragmas. (I am *so* done with single‐letter
> identifiers, hello!) We need these for a lot more than just this,
> too. That’s a topic for another day. Several proposals are pending.
> More on that later. Sometime. It won’t be the /dual route, I promise. }
>
> Meanwhile, before rushing in where angels fear to tread, let’s please step
> back and evaluate the original sense of “.” (and probably also of “^” and
> “$” too). Deduce those principles and apply them to today’s world, not
> yesterday’s. An ASCII‐only solution isn’t worth the cost tradeoff. The Dot
> Problem will never be solved until people start thinking in Unicode not
> ASCII. Otherwise you’ll “solve” the “wrong” “problem”.
>
> --tom
>
So uh.
I'm reviving this because I just found something interesting and somewhat
tangentially related.
(?^s:.) and \p{Any} are not equivalent!
The dot really does mean "any character", while \p{Any} means "any Unicode
character." Watch:
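    use 5.014;      # enables say() and the (?^...) regex syntax
    use warnings;   # needed to see the non-Unicode warning

    for my $re ( qr/\p{Any}/, qr/./s ) {
        # count the matches by assigning the list to () in scalar context
        my $matched = () = "a\x{FFFF_FFF}b" =~ /$re/g;
        say "$re matched $matched times";
    }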
With warnings on, \p{Any} warns about the non-Unicode code point and
matches twice. The dot, meanwhile, does not warn and matches three
times.
I don't think that this is a bug, since \p{} is a _Unicode_ property, and
those aren't Unicode code points, but it was still surprising.
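For anyone who wants to spot such characters without going through \p{}, one option is to compare ordinals against U+10FFFF, the top of the Unicode range. This is just a sketch; the variable names are mine, and it prints only ordinals so that no output-layer warning gets in the way.

```perl
use strict;
use warnings;

# Same test string as above: "a", one code point far above U+10FFFF, "b".
my $str = "a\x{FFFFFFF}b";

# The dot with /s sees every code point, Unicode or not.
my $dot_count = () = $str =~ /./gs;

# Anything with an ordinal above 0x10FFFF lies outside Unicode.
my @above = grep { ord($_) > 0x10FFFF } split //, $str;

printf "dot matched %d times, %d above-Unicode code point(s): %s\n",
    $dot_count, scalar @above,
    join ", ", map { sprintf "0x%X", ord } @above;
```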
(I came upon this whilst reinventing a wheel at work: We needed to
roundtrip a list to string, and then back to list, and for reasons I
couldn't quite grasp, storing the list somewhere wasn't possible. This is
one of the abominations that came out.)