develooper Front page | perl.perl6.users | Postings from February 2019

Re: binary test and position?

Thread Previous | Thread Next
From:
ToddAndMargo via perl6-users
Date:
February 6, 2019 04:47
Subject:
Re: binary test and position?
Message ID:
192aba86-ac96-a7fe-efd8-fb720f087132@zoho.com
On 2/5/19 7:55 AM, Brad Gilbert wrote:
> `index` is an NQP op, which means in this case that it is written in C
> (assuming you are using MoarVM)
> 
> https://github.com/MoarVM/MoarVM/blob/ddde09508310a5f60c63474db8f9682bc922700b/src/strings/ops.c#L557-L656
> 
> The code I gave for finding a Buf inside of another one was quickly
> made in a way to prevent certain bugs from even being possible.
> 
> 
> I rewrote it using knowledge of the internals to be a bit faster:
> 
>      sub blob-index ( Blob:D $buffer, Blob:D $needle, UInt:D $init = 0
> --> Int ) {
>          use nqp;
>          my int $needle-width = $needle.elems;
>          my int $elems = $buffer.elems;
> 
>          if $elems < $needle-width + $init {
>              fail 'needle is larger than the buffer'
>          }
> 
>          my uint $from = $init;
>          my uint $to   = $from + $needle-width - 1;
> 
>          loop ( ; $to < $elems ; ++$from,++$to ) {
>              return $from if $needle eq nqp::slice($buffer,$from,$to)
>              # eq is safe since they are two Blobs/Bufs
>          }
>          return Nil
>      }
> 
> It's faster mainly because it doesn't use an iterator.
> 
> This was quickly written.
> 
> Note that Buf is a subtype of Blob, so it will also work if given a Buf.
> 
> It could be written to be faster, but it still isn't going to be as
> fast as `index`.
> For one `index` uses a subroutine named `knuth_morris_pratt_string_index`.
> 
> ---
> 
> Also Unicode strings are not stored flat in MoarVM
> 
>      'a' x 10
> 
> That is stored something like the following internally.
> 
>      REPEAT( STR('a'), 10 )
> 
> What this means is that if you are looking for a 'b' for example, it
> could check the first repetition for 'b', and skip the rest of the
> REPEAT.
> That would save 9 string comparisons in this example.
> 
> That is not how Bufs are stored in Perl6 at all.
> They are declared with `is repr('VMArray')` which basically means they
> are stored as C arrays.
> 
> On Tue, Feb 5, 2019 at 1:47 AM ToddAndMargo via perl6-users
> <perl6-users@perl.org> wrote:
>>
>> On 2/2/19 9:29 PM, Brad Gilbert wrote:
>>> Subs do not need to have a `return` statement if it is returning the last value.
>>>
>>> You also broke the return value of the subroutine that I wrote by
>>> assigning it to a variable.
>>>
>>> What I wrote would return `Nil` if it failed to find a match, yours
>>> will return an undefined `Int`.
>>> It should return `Nil` because that is what `index` would return.
>>>
>>> Doing bytewise conversion from Buf to Str is pointless. It will break
>>> on Unicode data.
>>> It would also be exactly the same as converting ASCII if it worked.
>>> (It won't work on binary data)
>>>
>>> If you are dealing with something that is mostly Unicode but also has
>>> binary data
>>> decode using 'utf8-c8'.
>>>
>>> If you are dealing with something that is mostly binary, decode using 'latin1',
>>> or just use raw bytes in a buffer.
>>>
>>>       my Buf $buffer = $fh.read(10);
>>>       my Str $string = $buffer.decode('latin1');
>>>
>>>       # the first three bytes were really a Utf8 encoded character
>>>       my Str $char = $string.substr(0,3).encode('latin1').decode('utf8');
>>>       # or
>>>       my Str $char = $buffer.subbuf(0,3).decode('utf8');
>>>
>>> Also note that `encode` doesn't always return a Buf.
>>>
>>>       my Buf $buf = Buf.new( 'hello'.encode('utf8') );
>>>
>>> ---
>>>
>>> The subroutine I wrote was simplified to work for an Array or List, not a Buf.
>>>
>>> It is also weird that you are using CamelCase for variables,
>>> and a mixture of CamelCase and snake-case for the subroutine name.
>>>
>>>
>>> Improving your variant, and changing it so the second parameter is a Buf.
>>>
>>>       sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
>>>           my List $Matcher = $SubBuf.List; # only call .List once
>>>           my Any $Position is default(Nil) = Nil;
>>>           my Int $Elems = $Matcher.elems;
>>>
>>>           $Position = $Buffer.rotor($Elems => 1- $Elems).first(* eqv
>>> $Matcher, :k);
>>>           return $Position;
>>>       }
>>>
>>> `$Position` has to be `Any` (or `Mu`) so that it can store the value `Nil`.
>>> `Nil` sets a variable to its default, so we have to change the default
>>> with `is default(Nil)`.
>>> (The normal default is the same as the container type)
>>> (The `= Nil;` is always pointless in the declaration of a `$scalar` variable.)
>>>
>>> One simplification is to just have the return value as the last thing
>>> in the subroutine without a `return`.
>>> (It may also be slightly faster, but not by much.)
>>>
>>>       sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
>>>           my List $Matcher = $SubBuf.List;
>>>           my Any $Position is default(Nil) = Nil;
>>>           my Int $Elems = $Matcher.elems;
>>>
>>>           $Position = $Buffer.rotor($Elems => 1- $Elems).first(* eqv
>>> $Matcher, :k);
>>>           $Position; # <------------
>>>       }
>>>
>>> Assignment is a rvalue, so we can remove that last line
>>>
>>>       sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
>>>           my List $Matcher = $SubBuf.List;
>>>           my Any $Position is default(Nil) = Nil;
>>>           my Int $Elems = $Matcher.elems;
>>>
>>>           $Position = $Buffer.rotor($Elems => 1- $Elems).first(* eqv
>>> $Matcher, :k);
>>>           # <-----------
>>>       }
>>>
>>> `$Position` is now completely pointless.
>>>
>>>       sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
>>>           my List $Matcher = $SubBuf.List;
>>>           my Int $Elems = $Matcher.elems;
>>>
>>>           $Buffer.rotor($Elems => 1- $Elems).first(* eqv $Matcher, :k);
>>>           # ^
>>>       }
>>>
>>> If you want `return` (even though it isn't doing anything)
>>>
>>>       sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
>>>           my List $Matcher = $SubBuf.List;
>>>           my Int $Elems = $Matcher.elems;
>>>
>>>           return $Buffer.rotor($Elems => 1- $Elems).first(* eqv $Matcher, :k);
>>>           # ^
>>>       }
>>>
>>> You could also declare the type of the return value
>>>
>>>       sub Buf-Index ( Buf $Buffer, Buf $SubBuf --> Int ) { # <----
>>>           my List $Matcher = $SubBuf.List;
>>>           my Int $Elems = $Matcher.elems;
>>>
>>>           return $Buffer.rotor($Elems => 1- $Elems).first(* eqv $Matcher, :k);
>>>       }
>>>
>>> Note that `Nil` can sneak around the return value type check.
>>>
>>> ---
>>>
>>> As an added bonus, here is a subroutine that returns all of the indices.
>>> (Note that the only differences are `grep` rather than `first`, and
>>> the return type)
>>>
>>>       sub Buf-Indices ( Buf $Buffer, Buf $SubBuf --> Seq ) {
>>>           my List $Matcher = $SubBuf.List;
>>>           my Int $Elems = $Matcher.elems;
>>>
>>>           return $Buffer.rotor($Elems => 1- $Elems).grep(* eqv $Matcher, :k);
>>>       }
>>>
>>>       }
>>>
>>> On Sat, Feb 2, 2019 at 10:05 PM ToddAndMargo via perl6-users
>>> <perl6-users@perl.org> wrote:
>>>>
>>>> On 2/2/19 6:09 AM, Brad Gilbert wrote:
>>>>>        sub buf-index ( Buf $buf, +@match ) {
>>>>>            my $elems = @match.elems;
>>>>>            $buf.rotor( $elems => 1 - $elems ).first(* eqv @match.List, :k)
>>>>>        }
>>>>>
>>>>>        my $buf = Buf[uint8].new(0x4D, 0x5A, 0x90, 0x00, 0x03);
>>>>>
>>>>>        say buf-index( $buf, (0x90, 0x00, 0x03)); # 2
>>>>
>>>> What did I do wrong?
>>>>
>>>> First I did a byte wise conversion of
>>>>
>>>>       Buf $BinaryFile   to   Str $StrFile
>>>>
>>>> and
>>>>
>>>>       Buf $VersionInfoBuf  to  Str $VersionInfoStr
>>>>
>>>>
>>>>
>>>> sub Buf-Index ( Buf $Buffer, +@SubBuf ) {
>>>>       # `index` for buffers
>>>>       # $Buffer is the buffer to search through
>>>>       # $ +@SubBuf is the sub buffer pattern to search for in $Buffer
>>>>       # returns the first instance of a match, Nil if no match
>>>>
>>>>       my Int $Position = Nil;
>>>>       my $Elems = @SubBuf.elems;
>>>>
>>>>       $Position = $Buffer.rotor( $Elems => 1 - $Elems ).first( * eqv
>>>> @SubBuf.List, :k );
>>>>       return $Position;
>>>> }
>>>>
>>>>       $i  = index(     $StrFile,    $VersionInfoStr );
>>>>       $bi = Buf-Index( $BinaryFile, $VersionInfoBuf );
>>>>       say "i = <$i>   bi = <$bi>";
>>>>
>>>>
>>>>
>>>>
>>>> $ FileVer.pl6
>>>> i = <11371>   bi = <>
>>>>
>>>>
>>>> 11371 is correct.
>>>>
>>>>
>>>>
>>>> What did I do wrong?
>>>>
>>>> Many thanks,
>>>> -T
>>
>>
>> Got it working.  Thank you!
>>
>> It is a tad slow.  Depending on the file's size, it is
>> 20 to 190 times slower than "index".
>>
>> I have a lot of thinking to do.


Thank you!

Hmmmm.  Buf's are C arrays.  Interesting.

I still can't get away with

     my Buf $x=Buf.new(0x55, 0x66, 0x77, 0x78);
     my Buf $y=$x[ 1..2 ];

I have to
     my Buf $x=Buf.new(0x55, 0x66, 0x77, 0x78);
     my Buf $y=$x.subbuf(1, 2); say $y;

-T


-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Computers are like air conditioners.
They malfunction when you open windows
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About