Re: Revisiting trim
From: Karl Williamson
May 30, 2021 21:21
Message ID: email@example.com
On 5/29/21 1:37 AM, demerphq wrote:
> On Fri, 28 May 2021 at 12:02, André Warnier (tomcat/perl) <firstname.lastname@example.org> wrote:
>> $stripped_line =~ s/^\s+//; $stripped_line =~ s/\s+$//; # or only one of those, depends
> This is as correct a way to do it as you can do in perl regex. I'd
> probably replace the $ with \z to be absolutely clear on my intent. I
> had to go double check the behavior of $ here, where \z is
> The point here is that people often write this:
> which causes the regex engine to perform the scan through the string
> in a really inefficient way. Your splitting it into two calls avoids
> the main mistake that people make.
> But this question also illustrates the problem here. The regex engine
> doesn't know how to go backwards. Even for the split form of the regex
> the *second* regex, the one that does the rtrim() functionality, is
> the problem performance wise. The regex engine will do a scan of the
> whole string; every time it finds a space character it will scan
> forward until it finds either a non-space or the end of the string.
> There is some cleverness in the engine to make this case not be
> quadratic, but it's not far off. The run time will be proportional to
> the length of the string and number of space nonspace sequences it
> So the reason to add trimmed() to the language at an optimization
> level is that while it's hard to teach the regex engine to go
> backwards, it's not hard to create a custom DFA or similar logic that
> scans through the string from the right and finds the rightmost
> non-space character in the string. For instance even doing a naïve
> implementation of using the utf8-skip-backwards-one-character logic
> would be O(N) where N is the number of characters at the end of the
> This performance issue with rtrim() I would argue supports your point,
> adding trim() without rtrim() is to a certain extent a missed
> opportunity. Stripping whitespace from the end of the string will
> still be inefficient and difficult to read. E.g., consider: I would call
> myself a regex expert, but every time someone posts this pattern with
> $ in it I have to double check the rules. Making people use an
> inefficient and cryptic regex for a common task seems undesirable.
> The cryptic argument applies for ltrim(), but that at least *is*
> efficient in the regex engine.
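
To make the right-to-left scan described above concrete, here is a minimal C sketch. It is not code from the perl core or from any patch in this thread: `rtrim_len` is a hypothetical name, and it assumes that only ASCII whitespace (space, \t, \n, \v, \f, \r) counts, whereas Perl's \s can match more under Unicode rules. Because ASCII bytes never occur inside a multi-byte UTF-8 sequence, a plain byte-wise backward scan is safe for this restricted case, and the work is proportional to the amount of trailing whitespace rather than to the length of the string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a perl core function): returns the length of
 * the buffer s[0..len) once trailing ASCII whitespace is dropped.
 * Assumption: only space and \t \n \v \f \r (0x09..0x0D) are whitespace. */
static size_t rtrim_len(const char *s, size_t len)
{
    assert(s != NULL || len == 0);

    /* Walk backwards from the end; stop at the first non-space byte. */
    while (len > 0) {
        unsigned char c = (unsigned char)s[len - 1];
        if (c != ' ' && (c < '\t' || c > '\r'))
            break;
        len--;
    }
    return len;
}
```

A caller would then truncate the string to the returned length. Handling the non-ASCII whitespace characters would need the hop-backwards-one-character logic mentioned in the quoted text.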
Maybe you and I should have a chat about what can and should be done to
improve the matching speed of right-anchored patterns.
I suppose it is theoretically possible to create reverse
Perl_re_intuit_start() and S_find_byclass() functions, if one could wrap
one's mind around that, though the libc support is limited. But I could
be wrong about the feasibility and it would be more work than anyone
would care to undertake.
But there are things that could be done. It had never occurred to me
before that the hop_back functions could be called with large numbers.
Backing up in a UTF-8 string could be improved by a factor of 8 by doing
per-word operations. (You load a whole word. One can isolate and count
the continuation bytes in it by some shifting/masking/etc. operations.
Everything that isn't a continuation byte marks a character.)
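
As a rough illustration of that per-word idea, here is a sketch in C. It is not the core's hop_back code: the names `utf8_starts_in_8` and `utf8_hop_back_sketch` are made up for this example, and it assumes well-formed UTF-8, a 64-bit word, and that p starts on a character boundary (or at the end of the buffer). A continuation byte has the bit pattern 10xxxxxx, so the count of character starts in a word is 8 minus the number of such bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the character-start (non-continuation) bytes among the 8 bytes
 * at p. Hypothetical helper; assumes well-formed UTF-8. */
static unsigned utf8_starts_in_8(const unsigned char *p)
{
    const uint64_t ones = 0x0101010101010101ULL;
    uint64_t w;

    memcpy(&w, p, sizeof w);   /* load that is safe for unaligned p */

    /* Low bit of each byte lane is set iff that byte is 10xxxxxx:
     * top bit set (w >> 7) and next bit clear (~w >> 6). */
    uint64_t cont = (w >> 7) & (~w >> 6) & ones;

    /* Multiplying by 'ones' sums the eight lanes into the top byte. */
    return 8u - (unsigned)((cont * ones) >> 56);
}

/* Back up n characters from p, never going before start.
 * Hypothetical sketch of a word-at-a-time hop-back. */
static const unsigned char *
utf8_hop_back_sketch(const unsigned char *start,
                     const unsigned char *p, size_t n)
{
    assert(start <= p);

    /* Eight bytes hold at most eight character starts, so while
     * n >= 8 a whole word can be consumed without overshooting. */
    while (n >= 8 && (size_t)(p - start) >= 8) {
        p -= 8;
        n -= utf8_starts_in_8(p);
    }

    /* Finish byte by byte; each non-continuation byte is one character. */
    while (n > 0 && p > start) {
        p--;
        if ((*p & 0xC0) != 0x80)
            n--;
    }
    return p;
}
```

Only counts are taken from each word, so byte order does not matter here. The word loop touches one word per eight bytes instead of one byte at a time, which is where the factor of 8 would come from, assuming the load and the arithmetic cost about the same as a single byte test.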
Similarly, functions like S_find_next_masked() could have a
corresponding reversed version, though slower on UTF-8 than the forward
version because of the forward bias of UTF-8.