Re: [perl #127670] New type of 'word boundary' - true when not in the middle of a word

From: demerphq
Date: March 8, 2016 09:52
Subject: Re: [perl #127670] New type of 'word boundary' - true when not in the middle of a word
Message ID: CANgJU+XY0hszoxCu3aiFT4MP+JNGBf2CrbiCTUKT3_YSDnEUdw@mail.gmail.com

On 7 March 2016 at 19:16, Ed Avis <perlbug-followup@perl.org> wrote:
> # New Ticket Created by  "Ed Avis"
> # Please include the string:  [perl #127670]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt.perl.org/Ticket/Display.html?id=127670 >
>
>
>
> This is a bug report for perl from eda@waniasset.com,
> generated with the help of perlbug 1.40 running under perl 5.22.1.
>
>
> -----------------------------------------------------------------
> [Please describe your issue here]
>
> When doing a search-and-replace you may wrap the regular expression in
> \b anchors to stop it matching in the middle of a word.  s/red/green/g
> will change credit to cgreenit but s/\bred\b/green/g does not have
> this bug.
>
> However, you may not know ahead of time whether your source regexp is
> itself a word.  If you unconditionally wrap it in \b anchors then that
> in turn will break if the start or end is not a word character.
>
>     /\b x[(][)] \b/x     # will fail to match 'x()-1' or 'x()'
>
> What you need to do instead is something like
>
>     use feature 'say';
>     say 'please enter source and replacement strings:';
>     chomp (my $source = <>);
>     chomp (my $replacement = <>);
>     while (<>) {
>         s/(?:\b|(?!\w))\Q$source\E(?:\b|(?<!\w))/$replacement/g
>           && print "replaced: $_";
>     }
>
> These (?:\b|(?!\w)) and (?:\b|(?<!\w)) incantations are useful
> enough that they deserve their own anchor.

I don't know about that. It is not clear to me that this *is* actually
so useful or commonplace, or so complete a solution to the underlying
problem, that it is worth spending an escape sequence on.

On the other hand, a better solution for this would be useful.
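
To make the difference concrete, here is a minimal sketch comparing plain
\b wrapping with the workaround quoted above, using \b and \w with their
usual regex meaning. The variable and sub names ($before, $after,
replace_with_b, replace_not_mid_word) are invented for this illustration
and do not come from the report.

```perl
use strict;
use warnings;

# Assumed names for illustration; the report does not name these.
# Each pattern succeeds at any point that is not strictly inside a word.
my $before = qr/(?:\b|(?!\w))/;     # goes in front of the literal
my $after  = qr/(?:\b|(?<!\w))/;    # goes behind the literal

# Replace $source only where it is wrapped in plain \b anchors.
sub replace_with_b {
    my ($text, $source, $replacement) = @_;
    $text =~ s/\b\Q$source\E\b/$replacement/g;
    return $text;
}

# Replace $source only where it does not start or end mid-word.
sub replace_not_mid_word {
    my ($text, $source, $replacement) = @_;
    $text =~ s/$before\Q$source\E$after/$replacement/g;
    return $text;
}

print replace_with_b('x()-1', 'x()', 'y'), "\n";               # x()-1 (no match)
print replace_not_mid_word('x()-1', 'x()', 'y'), "\n";         # y-1
print replace_not_mid_word('credit', 'red', 'green'), "\n";    # credit
print replace_not_mid_word('a red car', 'red', 'green'), "\n"; # a green car
```

With a source string that begins and ends in word characters the two subs
behave the same; they only diverge when the source starts or ends with a
non-word character such as ')'.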

> Rather than matching only
> at a word boundary, it would match only at a point that is not in the
> middle of a word.  That could be a word boundary or it could just be
> some point in between two non-word characters.  In other words the
> new anchor matches
>
>   - at start of string
>   - at end of string
>   - when either or both of the surrounding characters are \W
>
> (Subjective experience: this has come up a couple of times, and the
> 'solution' of wrapping a regexp in \b anchors is obvious and only
> subtly wrong, so I do think this would help avoid a common regular
> expression bug, and falls under "easy things should be easy".)
>
> FWIW, the different definition that \b{wb} uses means that it does not
> suffer from this problem.  Normally you can wrap an arbitrary regexp
> in \b{wb} anchors and it will match only when not partway through
> a word at start or end.  So this might argue for steering users
> towards \b{wb} instead of \b.

That is interesting. I would like to hear more opinions from people on
how important they think this is.

Yves
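
For readers who want to try the \b{wb} suggestion from the report, the
following is a small sketch, assuming Perl 5.22 or later (the release that
added \b{wb}). The helper wrapped() and the variable names are made up for
this example.

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} needs Perl 5.22 or later

# Hypothetical helper: wrap a literal string in the same anchor on both sides.
sub wrapped {
    my ($anchor, $literal) = @_;
    return qr/$anchor\Q$literal\E$anchor/;
}

my $plain_b    = qr/\b/;       # classic \w/\W boundary
my $unicode_wb = qr/\b{wb}/;   # Unicode word boundary

for my $case (['x()-1', 'x()'], ['credit', 'red'], ["don't", 'don']) {
    my ($text, $literal) = @$case;
    printf "%-6s in %-6s  \\b: %s  \\b{wb}: %s\n",
        $literal, $text,
        ($text =~ wrapped($plain_b,    $literal) ? 'match' : 'no match'),
        ($text =~ wrapped($unicode_wb, $literal) ? 'match' : 'no match');
}
```

The first two cases behave as the report describes: \b{wb} finds 'x()' in
'x()-1' where \b does not, and neither matches 'red' inside 'credit'. The
third case shows that \b{wb} is not identical to the proposed anchor: it
follows the Unicode word-boundary rules, which keep the apostrophe in
"don't" inside the word, so 'don' is not matched there.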
