develooper Front page | perl.perl5.porters | Postings from September 2019

Re: RFC what about long regex EXACT nodes

Thread Previous | Thread Next
Karl Williamson
September 29, 2019 17:48
Re: RFC what about long regex EXACT nodes
Message ID:
On 9/12/19 6:08 PM, Karl Williamson wrote:
> On 9/11/19 3:25 AM, demerphq wrote:
>> On Wed, 11 Sep 2019, 07:20 Karl Williamson, < 
>> <>> wrote:
>>     Currently if there is a long string of text that is to be matched
>>     exactly (or under /i) that data is chunked into pieces of at most 256
>>     bytes.  The reason for this limit is that there happen to be 8 bits
>>     available.
>>     But why not have a new node type for longer strings which wasn't
>>     limited
>>     to 256 bytes.  Is there a reason we haven't done this other than
>>     lack of
>>     tuits?
>>     The advantages of such a node are less overhead when matching, as you
>>     can just keep going longer in the matching loop, and your memcmp for
>>     exact matches will be a single one rather than multiple.
>>     I don't know if the optimizer currently strings such nodes together
>>     when
>>     computing the min and maximum lengths for strings to be able to 
>> match.
>>     It may be that it stops at 256.  If so this would improve the
>>     ability to
>>     avoid matching trivially if the criteria weren't met.
>>     So is there a reason not to have this?
>> I think this sounds reasonable.
>> Yves
> As an example of uses of this, a biologist friend gave me this info:
> "For me, the longest string I would ever use would be a whole 
> chromosome. In humans that ranges from 50 million -300 million letters 
> long (DNA bases). In some other organisms chromosomes can be billions of 
> bases long, I think the longest maybe 5 billion, estimated, but so far 
> the technology isn't there to accurately sequence and assemble those 
> super well....  If you want to put a whole genome into a single string, 
> that pushes the max to 130 billion bases, for Paris japonica, if you 
> combine all 40 chromosomes into one string...
> Anyway, for me, I would say that 1 billion bases would be a good max 
> size, for most organisms, though a few go to 5-6, that I know of."
> So we have a target string up to 130 billion bytes (way above rhe U32 
> capacity of 4.3 billion), but more likely the maximum is less than a 
> billion.
> And they chunk the matching due to memory, etc limitations, but they 
> could have a pattern of length 100K, but usually just a few thousand.

Now done by these two commits

  commit 3363c7035ff1df0c3ffeae0cd18bb86cc39d62e4
  Author: Karl Williamson <>
  Date:   Thu Sep 26 21:38:46 2019 -0600

      regex: Create and handle LEXACT nodes

      See the previous commit for info on these.

      I am not changing trie code to recognize these at this time.

  commit ae06e581c6e9944620eed4980fe89a3749886ed0
  Author: Karl Williamson <>
  Date:   Wed Sep 25 10:12:32 2019 -0600

      Add regnode LEXACT, for long strings

      This commit adds a new regnode for strings that don't fit in a regular
      one, and adds a structure for that regnode to use.  Actually using 
      is deferred to the next commit.

      This new regnode structure is needed because the previous 
structure only
      allows for an 8 bit length field, 255 max bytes.  This commit puts the
      length instead in a new field, the same place single-argument regnodes
      put their argument.  Hence this long string is an extra 32 bits of
      overhead, but at no string length is this node ever bigger than the
      combination of the smaller nodes it replaces.

      I also considered simply combining the original 8 bit length field
      (which is now unused) with the first byte of the string field to get a
      16 bit length, and have the actual string be offset by 1.  But I
      rejected that because it would mean the string would usually not be
      aligned, slowing down memory accesses.

      This new LEXACT regnode can hold up to what 1024 regular EXACT 
ones hold,
      using 4K fewer overhead bytes to do so.  That means it can handle
      strings containing 262000 bytes.  The comments give ideas for 
      that should it become necessary or desirable.

      Besides the space advantage, any hardware acceleration in memcmp
      can be done in much bigger chunks, and otherwise the memcmp inner loop
      (often written in assembly) will run many more times in a row, and our
      outer loop that calls it, correspondingly fewer.

Thread Previous | Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About