
Re: [PATCH] Add open "|-" and open "-|" to perlopentut

From: Abigail
Date: August 26, 2008 16:39
Subject: Re: [PATCH] Add open "|-" and open "-|" to perlopentut
Message ID: 20080826233909.GB19684@almanda

[ Reply-To set, this discussion is becoming off-topic for p5p ]

On Tue, Aug 26, 2008 at 04:20:48PM -0600, Tom Christiansen wrote:
> 
> [...]
>
> And its output:
> 
>     "a" =~ /\ba\b/ == 1
>     "a" =~ /\Ba\B/ == 0
> 
>     "=" =~ /\b=\b/ == 0
>     "=" =~ /\B=\B/ == 1
> 
>     "b" =~ /\bb\b/ == 1
>     "b" =~ /\Bb\B/ == 0
> 
>     "&" =~ /\b&\b/ == 0
>     "&" =~ /\B&\B/ == 1
> 
>     "c" =~ /\bc\b/ == 1
>     "c" =~ /\Bc\B/ == 0

Oh, but it goes much deeper than that.

My recommendation nowadays is to think really, really hard before using
\w, \W, \d, \D, \s, \S, \b, and \B; you almost never want to use them.
Which characters they match depends on whether the source string is in
UTF-8 format or not. And if it isn't, on whether the pattern is in UTF-8
format. But not always. And if neither is in UTF-8 format, it depends on
whether you are using a locale, and on what that locale says.

Here we match "ê" (\x{EA}, \N{LATIN SMALL LETTER E WITH CIRCUMFLEX})
eight times against \w. Try to guess which ones match, and which ones
don't. (\x{263A} is "WHITE SMILING FACE").


    use 5.010;                   # for say (or run with perl -E)
    $a = $b = $c = "ê";          # source assumed Latin-1, no "use utf8"
    say $a =~ /\w/               ? "Yes 1" : "No 1";
    utf8::upgrade $a;
    say $a =~ /\w/               ? "Yes 2" : "No 2";
    say $b =~ /\w|A/             ? "Yes 3" : "No 3";
    say $b =~ /\w|\x{263A}/      ? "Yes 4" : "No 4";
    say $c =~ /[\wA]/            ? "Yes 5" : "No 5";
    say $c =~ /[\w\x{263A}]/     ? "Yes 6" : "No 6";
    $d = "ê\x{263A}"; chop $d;
    say $d =~ /\w/               ? "Yes 7" : "No 7";
    {
        use locale;
        say $b =~ /\w/           ? "Yes 8" : "No 8";
    }

    __END__

You're much better off using [a-zA-Z0-9_], [\p{LC}\p{Nd}_], or
something else that is explicit and independent of UTF-8 flags and
locales than using \w. Otherwise, it'll bite you.
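
A minimal sketch of that point, assuming neither "use locale" nor the
unicode_strings feature is in effect. The helper name same_either_way is
made up for this illustration; it checks whether a pattern gives the same
answer for a string before and after utf8::upgrade.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original post.
sub same_either_way {
    my ($re, $str) = @_;
    my $native   = $str;           # stays in the native 8-bit format
    my $upgraded = $str;
    utf8::upgrade($upgraded);      # same characters, UTF-8 internally
    my $r1 = ($native   =~ $re) ? 1 : 0;
    my $r2 = ($upgraded =~ $re) ? 1 : 0;
    return $r1 == $r2 ? 1 : 0;
}

my $e_circ = "\x{EA}";             # "ê" written as an escape, so the
                                   # encoding of the source file is moot

for my $re (qr/[a-zA-Z0-9_]/, qr/\w/) {
    printf "%-20s %s\n", $re,
        same_either_way($re, $e_circ) ? "stable"
                                      : "depends on the UTF-8 flag";
}
```

The explicit class reports "stable". What \w reports depends on the perl
version and the pragmas in effect, which is exactly the problem.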

> [...]
>
> Worse still, it sometimes works for reasons far from what they appear,
> as in this very non-parallel pair:
> 
>     % perl -WE 'say $& if "[" =~ /[[]/'
>     [
> 
>     % perl -WE 'say $& if "]" =~ /[]]/'
>     ]

Ah, yes. When I explain character classes, at some point I usually show

    /[][]/

which isn't two empty character classes, but a character class matching
either ] or [.
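
As a quick check of that reading (a sketch; the test strings are only
illustrative):

```perl
use strict;
use warnings;

# A ] directly after the opening [ is taken literally, so the class
# contains the two characters ] and [.
for my $ch ('[', ']', 'a') {
    printf "%s %s\n", $ch, $ch =~ /[][]/ ? "matches" : "does not match";
}
```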


> I'm not saying not to use his advice here; Damian's written enough regex
> code that he must have found it helpful for him.  I'm just saying to know
> when the trick works and when (and how) it fails.  On the other hand,
> backslashing instead always works.  But I wholly agree that it's terribly
> ugly and risks confusion.  That's why we have all these pick-your-own-quote
> constructs in q, qq, qx, qr, s, m, tr, and y, plus any others you've added
> while I wasn't looking. :-)

Backslashing always works, but I still prefer using [] in many cases.
I do write a lot of regexp code as well, and I tend to build my regexes
out of small building blocks. But if the building blocks consist of
strings, you need to double the backslashes. Although, if you use '', it
doesn't matter.

    warn $& if "?" =~ /\?/;   # Match
    warn $& if "?" =~ "\?";   # Regexp syntax error.
    warn $& if "?" =~ "\\?";  # Match.
    warn $& if "?" =~ '\\?';  # Match.
    warn $& if "?" =~ '\?';   # Match (!).

OTOH, if you write [?], it doesn't matter what the quotes are:

    warn $& if "?" =~ /[?]/;  # Match.
    warn $& if "?" =~ "[?]";  # Match.
    warn $& if "?" =~ '[?]';  # Match.

So I prefer [] to backslashing; then I don't have to check whether it's
a qr, qq or q construct. It always acts the same.
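
A small sketch of what that buys you when assembling a pattern from
pieces. The variable names and the test string are made up for the
example; each piece is quoted differently on purpose.

```perl
use strict;
use warnings;

# Hypothetical building blocks, each written with a different quoting
# construct. The [] form reads the same in all three.
my $plus  = "[+]";      # double-quoted
my $star  = '[*]';      # single-quoted
my $paren = qr/[(]/;    # qr

my $re = "a$plus$star$paren";   # assembled into one pattern string

print "matched: $&\n" if "xa+*(y" =~ /$re/;
```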

Granted, it won't work for ^, but I find I seldom have to match a ^
(matching parens, stars or plusses seems to be far more common). And
matching \ remains problematic. I've recently written code that looks
like:

    my $open = "!(?!\\\\)";  # Matches a ! not followed by a \

Using [] doesn't reduce the number of backslashes needed. qr would reduce
the number of backslashes, but building your regexp from qr constructs
instead of qq? constructs makes for a much longer (and slower) regexp.
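
To make the trade-off concrete, here is a sketch comparing the two
spellings. The names $open_str and $open_qr are illustrative only; the
exact wrapper a qr object stringifies to differs between perl versions,
so the sketch prints lengths rather than assuming a particular form.

```perl
use strict;
use warnings;

my $open_str = "!(?!\\\\)";    # the string is  !(?!\\)
my $open_qr  = qr/!(?!\\)/;    # same pattern, half the backslashes

for my $text ('!x', '!\\x') {  # the second string is  !\x
    printf "%-4s str:%s qr:%s\n", $text,
        ($text =~ $open_str ? "match" : "no match"),
        ($text =~ $open_qr  ? "match" : "no match");
}

# Interpolating a qr object drags its (?...:...) wrapper along, so the
# assembled pattern text grows with every piece.
printf "%d vs %d characters\n",
    length("a$open_str"), length("a$open_qr");
```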

And due to Yves's work, [?] is about as fast as \? in 5.10.



Abigail
^L
"ê" matches \w in cases 2, 6, and 7, and doesn't match in 1, 3, 4, and 5.
For case 8, it will depend on the setting of LC_CTYPE whether it will match.


