perl.perl5.porters | Postings from May 2021

Re: Revisiting trim

Karl Williamson
May 30, 2021 21:21
On 5/29/21 1:37 AM, demerphq wrote:
> On Fri, 28 May 2021 at 12:02, André Warnier (tomcat/perl) <> wrote:
>> $stripped_line =~ s/^\s+//; $stripped_line =~ s/\s+$//; # or only one of those, depends
> This is as correct a way to do it as you can do in perl regex. I'd
> probably replace the $ with \z to be absolutely clear on my intent.  I
> had to go double check the behavior of $ here, where \z is
> unambiguous.
> The point here is that people often write this:
> $stripped_line =~ s/^\s+|\s+$//g;
> which causes the regex engine to perform the scan through the string
> in a really inefficient way. Your splitting it into two calls avoids
> the main mistake that people make.
> But this question also illustrates the problem here. The regex engine
> doesn't know how to go backwards. Even for the split form of the regex
> the *second* regex, the one that does the rtrim() functionality, is
> the problem performance wise. The regex engine will do a scan of the
> whole string; every time it finds a space character it will scan
> forward until it finds either a non-space character or the end of the
> string. There is some cleverness in the engine to make this case not be
> quadratic, but it's not far off. The run time will be proportional to
> the length of the string and the number of space/non-space sequences it
> contains.
> So the reason to add trimmed() to the language at an optimization
> level is that while it's hard to teach the regex engine to go
> backwards, it's not hard to create a custom DFA or similar logic that
> scans through the string from the right and finds the rightmost
> non-space character in the string. For instance, even a naïve
> implementation using the utf8-skip-backwards-one-character logic
> would be O(N), where N is the number of whitespace characters at the
> end of the string.
> This performance issue with rtrim() I would argue supports your point,
> adding trim() without rtrim() is to a certain extent a missed
> opportunity. Stripping whitespace from the end of the string will
> still be inefficient and difficult to read. E.g., consider that I would call
> myself a regex expert, but every time someone posts this pattern with
> $ in it I have to double check the rules. Making people use an
> inefficient and cryptic regex for a common task seems undesirable.
> The cryptic argument applies for ltrim(), but that at least *is*
> efficient in the regex engine.
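
The right-to-left scan described in the quoted text can be sketched in C as follows. This is only an illustration under stated assumptions, not code from the Perl core: rtrim_len() and is_ascii_space() are hypothetical names, and only ASCII whitespace is handled, whereas \s can also match non-ASCII whitespace under Unicode rules. Because ASCII bytes never occur inside a multi-byte UTF-8 sequence, a plain byte-wise scan is safe for this restricted case.

```c
#include <stddef.h>

/* Hypothetical helper: true for space, \t, \n, \v, \f and \r (bytes 9..13). */
static int is_ascii_space(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/*
 * Hypothetical rtrim: return the length of s once trailing ASCII
 * whitespace is dropped.  The loop starts at the end and stops at the
 * rightmost non-space byte, so the work is proportional to the number
 * of trailing whitespace bytes, not to the length of the string.
 */
size_t rtrim_len(const char *s, size_t len)
{
    while (len > 0 && is_ascii_space((unsigned char)s[len - 1]))
        len--;
    return len;
}
```

Interior runs of whitespace are never visited, which is the difference from the s/\s+$// form discussed above.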

Maybe you and I should have a chat about what can and should be done to 
improve the matching speed of right-anchored patterns.

I suppose it is theoretically possible to create reverse 
Perl_re_intuit_start() and S_find_byclass() functions, if one could wrap 
one's mind around that, though the libc support is limited.  But I could 
be wrong about the feasibility and it would be more work than anyone 
would care to undertake.

But there are things that could be done.  It had never occurred to me 
before that the hop_back functions could be called with large numbers. 
Backing up in a UTF-8 string could be improved by a factor of 8 by doing 
per-word operations.  (You load a whole word.  One can isolate and count 
the continuation bytes in it by shifting, masking, and similar operations. 
Everything that isn't a continuation byte marks the start of a character.) 
Similarly, functions like S_find_next_masked() could have a 
corresponding reversed version, though slower on UTF-8 than the forward 
because of the forward bias of UTF-8.
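
A rough sketch of the per-word idea, assuming 64-bit words and well-formed UTF-8. The names starts_in_word() and hop_back_sketch() are made up for illustration and are not the core hop_back functions; the sketch only shows how continuation bytes (bit pattern 10xxxxxx) might be counted eight at a time while backing up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the bytes in an 8-byte word that are NOT UTF-8 continuation bytes. */
static unsigned starts_in_word(uint64_t w)
{
    const uint64_t low_bits = 0x0101010101010101ULL;
    /* A continuation byte has bit 7 set and bit 6 clear: move both tests
     * down to bit 0 of each byte and combine them. */
    uint64_t cont = (w >> 7) & (~w >> 6) & low_bits;
    /* The multiply sums the eight per-byte flags into the top byte. */
    return 8u - (unsigned)((cont * low_bits) >> 56);
}

/*
 * Hypothetical hop-back: starting at byte offset pos (a character
 * boundary) in s, move back n characters and return the new offset,
 * or 0 if the start of the buffer is reached first.
 */
size_t hop_back_sketch(const unsigned char *s, size_t pos, size_t n)
{
    /* Whole words: consume a word only if it holds fewer than n
     * character starts, so at least one character remains for the
     * byte-wise loop, which then stops exactly on a start byte. */
    while (pos >= 8) {
        uint64_t w;
        unsigned starts;
        memcpy(&w, s + pos - 8, sizeof w);  /* alignment-safe load */
        starts = starts_in_word(w);
        if (starts >= n)
            break;
        pos -= 8;
        n -= starts;
    }
    /* Remainder: one byte at a time. */
    while (n > 0 && pos > 0) {
        pos--;
        if ((s[pos] & 0xC0) != 0x80)
            n--;
    }
    return pos;
}
```

The count does not depend on byte order, since each byte is tested independently within the word.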
