Front page | perl.perl5.porters |
Postings from January 2018
From: Karl Williamson
January 20, 2018 18:53
Message ID: firstname.lastname@example.org
On 01/18/2018 08:52 AM, Abigail wrote:
> On Sun, Dec 31, 2017 at 04:21:04PM -0700, Karl Williamson wrote:
>> On 12/31/2017 4:13 PM, Andy Lester wrote:
>>>> My question is, does this happen often enough in real life to justify the extra code?
>>> I'd bet that the case of the object of the + being a single character is the most common use of +, starting with things like s/ +$//;
>>> Andy Lester => www.petdance.com
>> Yes, but the issue is in the strings it matches, are there a lot of
>> spaces in a row? Only in that case is there a noticeable speed up
> I think multiple spaces in a row is the most common in practise.
> However, I also think most people would not write / +/, but /\s+/
> to match a sequence of spaces.
> Some questions:
> - What is the effect of this proposed patch if /a+/ is used, and it
> matches just one or two characters (or even fails to match?).
> - Can it still partially match a string of 'a'? "aaaaa" =~ /a+(?=..)/
> should match just three a's.
> - Is there any effect on modifiers? /a+?/, /a++/
The patch just adds vectorization to the current operation, so it should
have zero effect, except for performance, on anything.
Effectively what is happening is that we are looking for the end of a
span of a given repeated byte.
There is extra code executed, which is more than balanced by the
performance gain, should the data be such that there are enough bytes to
be spanned in a row. But these instructions are wasted on data that has
too few such bytes. Which is why I asked the question.
You can see the existing code here:
Note that this is now a static function, instead of an inline loop.
There is the risk that the compiler will not inline it, so that there is
added function call overhead. I think that modern compilers will DTRT.
We don't know ahead of time where the span will end, but we do know the
maximum potential span. We shouldn't bother with trying this code
unless the potential gain is worth it. I have set an arbitrary value
for now for what that is.
Otherwise the single loop currently used becomes potentially 3 loops.
The first loop advances to the next word boundary if not already there,
returning from the function if the match fails. So if the span is too
short to reach the next word boundary, the only extra code executed is
the initial test that the potential gain is significant enough.
Otherwise, there is an integer multiply to set up a word-full of copies
of the spanned byte.
Then the second loop does the word-at-a-time parallelism. This
continues until we reach the end of input, or more likely, the span is
Finally is the third loop to handle bytes after the final word boundary.
Even if there are none, there is an extra conditional required to
discover that fact.
The final loop is also used to determine which byte terminated the span
when the 2nd loop drops out after finding such a byte. As I write this,
I realize that there is a way for the 2nd loop to figure out which byte
it is without conditionals. The fastest way is to use xor and ffs(), if
available (which on modern computers becomes a single machine
instruction), but there are other ways, using shift, mask and integer
multiply, similar to what S__variant_byte_number() in inline.h currently
So, there's extra work involved when the span is short, but just a few