Front page | perl.perl5.porters |
Postings from September 2019
Re: RFC what about long regex EXACT nodes
Thread Previous
|
Thread Next
From:
Karl Williamson
Date:
September 29, 2019 17:48
Subject:
Re: RFC what about long regex EXACT nodes
Message ID:
67479ac8-414d-c89d-47ff-0444899ded70@khwilliamson.com
On 9/12/19 6:08 PM, Karl Williamson wrote:
> On 9/11/19 3:25 AM, demerphq wrote:
>>
>>
>> On Wed, 11 Sep 2019, 07:20 Karl Williamson, <public@khwilliamson.com
>> <mailto:public@khwilliamson.com>> wrote:
>>
>> Currently if there is a long string of text that is to be matched
>> exactly (or under /i) that data is chunked into pieces of at most 256
>> bytes. The reason for this limit is that there happen to be 8 bits
>> available.
>>
>> But why not have a new node type for longer strings which wasn't
>> limited
>> to 256 bytes. Is there a reason we haven't done this other than
>> lack of
>> tuits?
>>
>> The advantages of such a node are less overhead when matching, as you
>> can just keep going longer in the matching loop, and your memcmp for
>> exact matches will be a single one rather than multiple.
>>
>> I don't know if the optimizer currently strings such nodes together
>> when
>> computing the min and maximum lengths for strings to be able to
>> match.
>> It may be that it stops at 256. If so this would improve the
>> ability to
>> avoid matching trivially if the criteria weren't met.
>>
>> So is there a reason not to have this?
>>
>>
>> I think this sounds reasonable.
>>
>> Yves
>>
>
> As an example of uses of this, a biologist friend gave me this info:
>
> "For me, the longest string I would ever use would be a whole
> chromosome. In humans that ranges from 50 million -300 million letters
> long (DNA bases). In some other organisms chromosomes can be billions of
> bases long, I think the longest maybe 5 billion, estimated, but so far
> the technology isn't there to accurately sequence and assemble those
> super well.... If you want to put a whole genome into a single string,
> that pushes the max to 130 billion bases, for Paris japonica, if you
> combine all 40 chromosomes into one string...
> Anyway, for me, I would say that 1 billion bases would be a good max
> size, for most organisms, though a few go to 5-6, that I know of."
>
> So we have a target string up to 130 billion bytes (way above rhe U32
> capacity of 4.3 billion), but more likely the maximum is less than a
> billion.
>
> And they chunk the matching due to memory, etc limitations, but they
> could have a pattern of length 100K, but usually just a few thousand.
>
Now done by these two commits
commit 3363c7035ff1df0c3ffeae0cd18bb86cc39d62e4
Author: Karl Williamson <khw@cpan.org>
Date: Thu Sep 26 21:38:46 2019 -0600
regex: Create and handle LEXACT nodes
See the previous commit for info on these.
I am not changing trie code to recognize these at this time.
commit ae06e581c6e9944620eed4980fe89a3749886ed0
Author: Karl Williamson <khw@cpan.org>
Date: Wed Sep 25 10:12:32 2019 -0600
Add regnode LEXACT, for long strings
This commit adds a new regnode for strings that don't fit in a regular
one, and adds a structure for that regnode to use. Actually using
them
is deferred to the next commit.
This new regnode structure is needed because the previous
structure only
allows for an 8 bit length field, 255 max bytes. This commit puts the
length instead in a new field, the same place single-argument regnodes
put their argument. Hence this long string is an extra 32 bits of
overhead, but at no string length is this node ever bigger than the
combination of the smaller nodes it replaces.
I also considered simply combining the original 8 bit length field
(which is now unused) with the first byte of the string field to get a
16 bit length, and have the actual string be offset by 1. But I
rejected that because it would mean the string would usually not be
aligned, slowing down memory accesses.
This new LEXACT regnode can hold up to what 1024 regular EXACT
ones hold,
using 4K fewer overhead bytes to do so. That means it can handle
strings containing 262000 bytes. The comments give ideas for
expanding
that should it become necessary or desirable.
Besides the space advantage, any hardware acceleration in memcmp
can be done in much bigger chunks, and otherwise the memcmp inner loop
(often written in assembly) will run many more times in a row, and our
outer loop that calls it, correspondingly fewer.
Thread Previous
|
Thread Next