develooper Front page | perl.perl5.porters | Postings from September 2019

Re: RFC what about long regex EXACT nodes

Thread Previous
From:
Sawyer X
Date:
September 29, 2019 20:22
Subject:
Re: RFC what about long regex EXACT nodes
Message ID:
e672fb31-23ae-4f99-c67a-4e4f3a79abcf@gmail.com
This is impressive, Karl!


Do you think there's a benchmark we can run to see the benefits this
could yield?


On 9/29/19 8:48 PM, Karl Williamson wrote:
> On 9/12/19 6:08 PM, Karl Williamson wrote:
>> On 9/11/19 3:25 AM, demerphq wrote:
>>>
>>>
>>> On Wed, 11 Sep 2019, 07:20 Karl Williamson, <public@khwilliamson.com
>>> <mailto:public@khwilliamson.com>> wrote:
>>>
>>>     Currently if there is a long string of text that is to be matched
>>>     exactly (or under /i) that data is chunked into pieces of at
>>> most 256
>>>     bytes.  The reason for this limit is that there happen to be 8 bits
>>>     available.
>>>
>>>     But why not have a new node type for longer strings which wasn't
>>>     limited
>>>     to 256 bytes.  Is there a reason we haven't done this other than
>>>     lack of
>>>     tuits?
>>>
>>>     The advantages of such a node are less overhead when matching,
>>> as you
>>>     can just keep going longer in the matching loop, and your memcmp
>>> for
>>>     exact matches will be a single one rather than multiple.
>>>
>>>     I don't know if the optimizer currently strings such nodes together
>>>     when
>>>     computing the min and maximum lengths for strings to be able to
>>> match.
>>>     It may be that it stops at 256.  If so this would improve the
>>>     ability to
>>>     avoid matching trivially if the criteria weren't met.
>>>
>>>     So is there a reason not to have this?
>>>
>>>
>>> I think this sounds reasonable.
>>>
>>> Yves
>>>
>>
>> As an example of uses of this, a biologist friend gave me this info:
>>
>> "For me, the longest string I would ever use would be a whole
>> chromosome. In humans that ranges from 50 million -300 million
>> letters long (DNA bases). In some other organisms chromosomes can be
>> billions of bases long, I think the longest maybe 5 billion,
>> estimated, but so far the technology isn't there to accurately
>> sequence and assemble those super well....  If you want to put a
>> whole genome into a single string, that pushes the max to 130 billion
>> bases, for Paris japonica, if you combine all 40 chromosomes into one
>> string...
>> Anyway, for me, I would say that 1 billion bases would be a good max
>> size, for most organisms, though a few go to 5-6, that I know of."
>>
>> So we have a target string up to 130 billion bytes (way above rhe U32
>> capacity of 4.3 billion), but more likely the maximum is less than a
>> billion.
>>
>> And they chunk the matching due to memory, etc limitations, but they
>> could have a pattern of length 100K, but usually just a few thousand.
>>
>
> Now done by these two commits
>
>  commit 3363c7035ff1df0c3ffeae0cd18bb86cc39d62e4
>  Author: Karl Williamson <khw@cpan.org>
>  Date:   Thu Sep 26 21:38:46 2019 -0600
>
>      regex: Create and handle LEXACT nodes
>
>      See the previous commit for info on these.
>
>      I am not changing trie code to recognize these at this time.
>
>  commit ae06e581c6e9944620eed4980fe89a3749886ed0
>  Author: Karl Williamson <khw@cpan.org>
>  Date:   Wed Sep 25 10:12:32 2019 -0600
>
>      Add regnode LEXACT, for long strings
>
>      This commit adds a new regnode for strings that don't fit in a
> regular
>      one, and adds a structure for that regnode to use.  Actually
> using them
>      is deferred to the next commit.
>
>      This new regnode structure is needed because the previous
> structure only
>      allows for an 8 bit length field, 255 max bytes.  This commit
> puts the
>      length instead in a new field, the same place single-argument
> regnodes
>      put their argument.  Hence this long string is an extra 32 bits of
>      overhead, but at no string length is this node ever bigger than the
>      combination of the smaller nodes it replaces.
>
>      I also considered simply combining the original 8 bit length field
>      (which is now unused) with the first byte of the string field to
> get a
>      16 bit length, and have the actual string be offset by 1.  But I
>      rejected that because it would mean the string would usually not be
>      aligned, slowing down memory accesses.
>
>      This new LEXACT regnode can hold up to what 1024 regular EXACT
> ones hold,
>      using 4K fewer overhead bytes to do so.  That means it can handle
>      strings containing 262000 bytes.  The comments give ideas for
> expanding
>      that should it become necessary or desirable.
>
>      Besides the space advantage, any hardware acceleration in memcmp
>      can be done in much bigger chunks, and otherwise the memcmp inner
> loop
>      (often written in assembly) will run many more times in a row,
> and our
>      outer loop that calls it, correspondingly fewer.

Thread Previous


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About