perl.perl5.porters | Postings from September 2019

Re: RFC what about long regex EXACT nodes

From: Karl Williamson
Date: September 13, 2019 00:09
Subject: Re: RFC what about long regex EXACT nodes
Message ID: 6d1807ed-ce2c-8349-05e2-d65f26538d88@khwilliamson.com

On 9/11/19 3:25 AM, demerphq wrote:
> 
> 
> On Wed, 11 Sep 2019, 07:20 Karl Williamson, <public@khwilliamson.com 
> <mailto:public@khwilliamson.com>> wrote:
> 
>     Currently if there is a long string of text that is to be matched
>     exactly (or under /i) that data is chunked into pieces of at most 256
>     bytes.  The reason for this limit is that there happen to be 8 bits
>     available.
> 
>     But why not have a new node type for longer strings which wasn't
>     limited to 256 bytes?  Is there a reason we haven't done this other
>     than lack of tuits?
> 
>     The advantages of such a node are less overhead when matching, as you
>     can just keep going longer in the matching loop, and your memcmp for
>     exact matches will be a single one rather than multiple.
> 
>     I don't know if the optimizer currently strings such nodes together
>     when computing the min and maximum lengths for strings to be able to
>     match.  It may be that it stops at 256.  If so this would improve the
>     ability to avoid matching trivially if the criteria weren't met.
> 
>     So is there a reason not to have this?
> 
> 
> I think this sounds reasonable.
> 
> Yves
> 
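
To make the quoted proposal concrete, here is a minimal Python sketch of
the difference between matching a long literal as a series of capped
pieces and matching it in one comparison. It is only an illustration, not
the regex engine's actual C code. The names MAX_NODE_LEN,
split_into_nodes, match_chunked and match_single are hypothetical, and the
cap of 255 is an assumption (the largest value an 8-bit count can hold;
the quoted message rounds this to 256).

```python
# Hypothetical illustration only; none of these names exist in perl's source.
MAX_NODE_LEN = 255  # assumption: largest length an 8-bit count can express


def split_into_nodes(literal: bytes, max_len: int = MAX_NODE_LEN) -> list:
    """Split a long literal into consecutive pieces of at most max_len bytes."""
    return [literal[i:i + max_len] for i in range(0, len(literal), max_len)]


def match_chunked(target: bytes, pos: int, nodes: list) -> bool:
    """Match the pieces one after another: one comparison per piece."""
    for node in nodes:
        if target[pos:pos + len(node)] != node:
            return False
        pos += len(node)
    return True


def match_single(target: bytes, pos: int, literal: bytes) -> bool:
    """Match the whole literal with a single comparison."""
    return target[pos:pos + len(literal)] == literal


literal = b"ACGT" * 1000              # a 4000-byte literal
nodes = split_into_nodes(literal)     # 16 pieces: 15 of 255 bytes, 1 of 175
target = b"TT" + literal + b"GG"

# Both strategies agree; the chunked one just does more comparisons.
print(len(nodes), match_chunked(target, 2, nodes), match_single(target, 2, literal))
```

Under the same assumption, a 100K-byte literal would be split into 393
pieces, which is the per-piece overhead a single long node would avoid.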

As an example of how this could be used, a biologist friend gave me this info:

"For me, the longest string I would ever use would be a whole 
chromosome. In humans that ranges from 50 million to 300 million letters 
long (DNA bases). In some other organisms chromosomes can be billions of 
bases long, I think the longest maybe 5 billion, estimated, but so far 
the technology isn't there to accurately sequence and assemble those 
super well....  If you want to put a whole genome into a single string, 
that pushes the max to 130 billion bases, for Paris japonica, if you 
combine all 40 chromosomes into one string...
Anyway, for me, I would say that 1 billion bases would be a good max 
size, for most organisms, though a few go to 5-6, that I know of."

So we have a target string of up to 130 billion bytes (way above the U32 
capacity of about 4.3 billion), but more likely the maximum is less than 
a billion.
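
The size comparison can be checked with a few lines of Python. This is
just the arithmetic behind the numbers quoted above; the helper names
bits_needed and fits_in_u32 are made up for the illustration.

```python
U32_MAX = 2**32 - 1  # 4,294,967,295, i.e. about 4.3 billion


def bits_needed(n: int) -> int:
    """Smallest number of bits that can represent the non-negative integer n."""
    return max(1, n.bit_length())


def fits_in_u32(n: int) -> bool:
    """True if n can be stored in an unsigned 32-bit length field."""
    return 0 <= n <= U32_MAX


# Figures taken from the biologist's message, one byte per DNA base.
sizes = {
    "good max for most organisms": 1_000_000_000,
    "longest single chromosome mentioned": 5_000_000_000,
    "whole Paris japonica genome": 130_000_000_000,
}

for label, n in sizes.items():
    print(f"{label}: {n:,} bytes, fits in U32: {fits_in_u32(n)}, "
          f"bits needed: {bits_needed(n)}")
```

So a 1-billion-byte string fits in a U32 length, while the 5 billion and
130 billion cases need 33 and 37 bits respectively.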

They chunk the matching because of memory and other limitations. A 
pattern could be as long as 100K, but usually it is just a few thousand.
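
I don't know how their tools actually do the chunking, but one common way
to search a target that is too big to hold in memory is to read it in
pieces and carry over the last len(pattern) - 1 bytes, so that a match
straddling a chunk boundary isn't missed. A minimal Python sketch under
that assumption (iter_chunks and find_all_chunked are hypothetical names,
and plain substring search stands in for the regex engine):

```python
def iter_chunks(data: bytes, size: int):
    """Stand-in for reading a huge target piece by piece."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def find_all_chunked(chunks, pattern: bytes):
    """Yield the absolute offset of every occurrence of pattern in the
    concatenation of chunks, holding only one chunk plus a short tail."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    keep = len(pattern) - 1   # bytes to carry over between chunks
    tail = b""
    base = 0                  # absolute offset of the first byte of the window
    for chunk in chunks:
        window = tail + chunk
        start = 0
        while True:
            i = window.find(pattern, start)
            if i < 0:
                break
            yield base + i
            start = i + 1     # allow overlapping occurrences
        # The tail is shorter than the pattern, so no match is reported twice.
        tail = window[-keep:] if keep else b""
        base += len(window) - len(tail)


data = b"ACGTACGTTTACG"
print(list(find_all_chunked(iter_chunks(data, 4), b"ACG")))  # [0, 4, 10]
```

The memory needed is one chunk plus the pattern length, so a 100K pattern
only adds about 100K to each window.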
