develooper Front page | perl.perl5.porters | Postings from May 2021

Re: Revisiting trim

Thread Previous | Thread Next
From:
demerphq
Date:
May 29, 2021 07:37
Subject:
Re: Revisiting trim
Message ID:
CANgJU+XFMbugbQUxWRNgc+O90F2SJrBeRe-OUHMoU4r=4Txq=g@mail.gmail.com
On Fri, 28 May 2021 at 12:02, André Warnier (tomcat/perl) <aw@ice-sa.com> wrote:
> $stripped_line =~ s/^\s+//; $stripped_line =~ /\s+$//; # or only one of those, depends

This is as correct a way to do it as you can do in perl regex. I'd
probably replace the $ with \z to be absolutely clear on my intent.  I
had to go double check the behavior of $ here, where \z is
unambiguous.

The point here is that people often write this:

$stripped_line=~/^\s+|\s+$/g;

which causes the regex engine to perform the scan through the string
in a really inefficient way. Your splitting it into two calls avoids
the main mistake that people make.

But this question also illustrates the problem here. The regex engine
doesn't know how to go backwards. Even for the split form of the regex
the *second* regex, the one that does the rtrim() functionality, is
the problem performance wise. The regex engine will do a scan of the
whole string, every time it finds a space character it will scan
forward until it find either a non-string, or the end of the string.
There is some cleverness in the engine to make this case not be
quadratic, but its not far off. The run time will be proportional to
the length of the string and number of space nonspace sequences it
contains.

So the reason to add trimmed() to the language at an optimization
level is that while its hard to teach the regex engine to go
backwards, its not hard to create a custom dfa or similar logic that
scans through the string from the right and finds the rightmost
non-space character in the string. For instance even doing a naïve
implementation of using the utf8-skip-backwards-one-character logic
would be O(N) where N is the number of characters at the end of the
string.

This performance issue with rtrim() I would argue supports your point,
adding trim() without rtrim() is to a certain extent a missed
opportunity. Stripping whitespace from the end of the string will
still be inefficient and difficult to read. Eg, consider I would call
myself a regex expert, but every time someone posts this pattern with
$ in it I have to double check the rules. Making people use an
inefficient and cryptic regex for a common task seems undesirable.
The cryptic argument applies for ltrim(), but that at least *is*
efficient in the regex engine.

cheers
Yves

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About