develooper Front page | perl.perl6.users | Postings from February 2019

Re: binary test and position?

Thread Previous | Thread Next
From:
Brad Gilbert
Date:
February 5, 2019 15:56
Subject:
Re: binary test and position?
Message ID:
CAD2L-T3Ute-0zN=oJvC2fdchMo4oMG7EeufhWFans18_C=-z6Q@mail.gmail.com
`index` is an NQP op, which means in this case that it is written in C
(assuming you are using MoarVM)

https://github.com/MoarVM/MoarVM/blob/ddde09508310a5f60c63474db8f9682bc922700b/src/strings/ops.c#L557-L656

The code I gave for finding a Buf inside of another one was quickly
made in a way to prevent certain bugs from even being possible.


I rewrote it using knowledge of the internals to be a bit faster:

    sub blob-index ( Blob:D $buffer, Blob:D $needle, UInt:D $init = 0
--> Int ) {
        use nqp;
        my int $needle-width = $needle.elems;
        my int $elems = $buffer.elems;

        if $elems < $needle-width + $init {
            fail 'needle is larger than the buffer'
        }

        my uint $from = $init;
        my uint $to   = $from + $needle-width - 1;

        loop ( ; $to < $elems ; ++$from,++$to ) {
            return $from if $needle eq nqp::slice($buffer,$from,$to)
            # eq is safe since they are two Blobs/Bufs
        }
        return Nil
    }

It's faster mainly because it doesn't use an iterator.

This was quickly written.

Note that Buf is a subtype of Blob, so it will also work if given a Buf.

It could be written to be faster, but it still isn't going to be as
fast as `index`.
For one `index` uses a subroutine named `knuth_morris_pratt_string_index`.

---

Also Unicode strings are not stored flat in MoarVM

    'a' x 10

That is stored something like the following internally.

    REPEAT( STR('a'), 10 )

What this means is that if you are looking for a 'b' for example, it
could check the first repetition for 'b', and skip the rest of the
REPEAT.
That would save 9 string comparisons in this example.

That is not how Bufs are stored in Perl6 at all.
They are declared with `is repr('VMArray')` which basically means they
are stored as C arrays.

On Tue, Feb 5, 2019 at 1:47 AM ToddAndMargo via perl6-users
<perl6-users@perl.org> wrote:
>
> On 2/2/19 9:29 PM, Brad Gilbert wrote:
> > Subs do not need to have a `return` statement if it is returning the last value.
> >
> > You also broke the return value of the subroutine that I wrote by
> > assigning it to a variable.
> >
> > What I wrote would return `Nil` if it failed to find a match, yours
> > will return an undefined `Int`.
> > It should return `Nil` because that is what `index` would return.
> >
> > Doing bytewise conversion from Buf to Str is pointless. It will break
> > on Unicode data.
> > It would also be exactly the same as converting ASCII if it worked.
> > (It won't work on binary data)
> >
> > If you are dealing with something that is mostly Unicode but also has
> > binary data
> > decode using 'utf8-c8'.
> >
> > If you are dealing with something that is mostly binary, decode using 'latin1',
> > or just use raw bytes in a buffer.
> >
> >      my Buf $buffer = $fh.read(10);
> >      my Str $string = $buffer.decode('latin1');
> >
> >      # the first three bytes were really a Utf8 encoded character
> >      my Str $char = $string.substr(0,3).encode('latin1').decode('utf8');
> >      # or
> >      my Str $char = $buffer.subbuf(0,3).decode('utf8');
> >
> > Also note that `encode` doesn't always return a Buf.
> >
> >      my Buf $buf = Buf.new( 'hello'.encode('utf8') );
> >
> > ---
> >
> > The subroutine I wrote was simplified to work for an Array or List, not a Buf.
> >
> > It is also weird that you are using CamelCase for variables,
> > and a mixture of CamelCase and snake-case for the subroutine name.
> >
> >
> > Improving your variant, and changing it so the second parameter is a Buf.
> >
> >      sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
> >          my List $Matcher = $SubBuf.List; # only call .List once
> >          my Any $Position is default(Nil) = Nil;
> >          my Int $Elems = $Matcher.elems;
> >
> >          $Position = $Buffer.rotor($Elems => 1- $Elems).first(* eqv
> > $Matcher, :k);
> >          return $Position;
> >      }
> >
> > `$Position` has to be `Any` (or `Mu`) so that it can store the value `Nil`.
> > `Nil` sets a variable to its default, so we have to change the default
> > with `is default(Nil)`.
> > (The normal default is the same as the container type)
> > (The `= Nil;` is always pointless in the declaration of a `$scalar` variable.)
> >
> > One simplification is to just have the return value as the last thing
> > in the subroutine without a `return`.
> > (It may also be slightly faster, but not by much.)
> >
> >      sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
> >          my List $Matcher = $SubBuf.List;
> >          my Any $Position is default(Nil) = Nil;
> >          my Int $Elems = $Matcher.elems;
> >
> >          $Position = $Buffer.rotor($Elems => 1- $Elems).first(* eqv
> > $Matcher, :k);
> >          $Position; # <------------
> >      }
> >
> > Assignment is a rvalue, so we can remove that last line
> >
> >      sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
> >          my List $Matcher = $SubBuf.List;
> >          my Any $Position is default(Nil) = Nil;
> >          my Int $Elems = $Matcher.elems;
> >
> >          $Position = $Buffer.rotor($Elems => 1- $Elems).first(* eqv
> > $Matcher, :k);
> >          # <-----------
> >      }
> >
> > `$Position` is now completely pointless.
> >
> >      sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
> >          my List $Matcher = $SubBuf.List;
> >          my Int $Elems = $Matcher.elems;
> >
> >          $Buffer.rotor($Elems => 1- $Elems).first(* eqv $Matcher, :k);
> >          # ^
> >      }
> >
> > If you want `return` (even though it isn't doing anything)
> >
> >      sub Buf-Index ( Buf $Buffer, Buf $SubBuf ) {
> >          my List $Matcher = $SubBuf.List;
> >          my Int $Elems = $Matcher.elems;
> >
> >          return $Buffer.rotor($Elems => 1- $Elems).first(* eqv $Matcher, :k);
> >          # ^
> >      }
> >
> > You could also declare the type of the return value
> >
> >      sub Buf-Index ( Buf $Buffer, Buf $SubBuf --> Int ) { # <----
> >          my List $Matcher = $SubBuf.List;
> >          my Int $Elems = $Matcher.elems;
> >
> >          return $Buffer.rotor($Elems => 1- $Elems).first(* eqv $Matcher, :k);
> >      }
> >
> > Note that `Nil` can sneak around the return value type check.
> >
> > ---
> >
> > As an added bonus, here is a subroutine that returns all of the indices.
> > (Note that the only differences are `grep` rather than `first`, and
> > the return type)
> >
> >      sub Buf-Indices ( Buf $Buffer, Buf $SubBuf --> Seq ) {
> >          my List $Matcher = $SubBuf.List;
> >          my Int $Elems = $Matcher.elems;
> >
> >          return $Buffer.rotor($Elems => 1- $Elems).grep(* eqv $Matcher, :k);
> >      }
> >
> >      }
> >
> > On Sat, Feb 2, 2019 at 10:05 PM ToddAndMargo via perl6-users
> > <perl6-users@perl.org> wrote:
> >>
> >> On 2/2/19 6:09 AM, Brad Gilbert wrote:
> >>>       sub buf-index ( Buf $buf, +@match ) {
> >>>           my $elems = @match.elems;
> >>>           $buf.rotor( $elems => 1 - $elems ).first(* eqv @match.List, :k)
> >>>       }
> >>>
> >>>       my $buf = Buf[uint8].new(0x4D, 0x5A, 0x90, 0x00, 0x03);
> >>>
> >>>       say buf-index( $buf, (0x90, 0x00, 0x03)); # 2
> >>
> >> What did I do wrong?
> >>
> >> First I did a byte wise conversion of
> >>
> >>      Buf $BinaryFile   to   Str $StrFile
> >>
> >> and
> >>
> >>      Buf $VersionInfoBuf  to  Str $VersionInfoStr
> >>
> >>
> >>
> >> sub Buf-Index ( Buf $Buffer, +@SubBuf ) {
> >>      # `index` for buffers
> >>      # $Buffer is the buffer to search through
> >>      # $ +@SubBuf is the sub buffer pattern to search for in $Buffer
> >>      # returns the first instance of a match, Nil if no match
> >>
> >>      my Int $Position = Nil;
> >>      my $Elems = @SubBuf.elems;
> >>
> >>      $Position = $Buffer.rotor( $Elems => 1 - $Elems ).first( * eqv
> >> @SubBuf.List, :k );
> >>      return $Position;
> >> }
> >>
> >>      $i  = index(     $StrFile,    $VersionInfoStr );
> >>      $bi = Buf-Index( $BinaryFile, $VersionInfoBuf );
> >>      say "i = <$i>   bi = <$bi>";
> >>
> >>
> >>
> >>
> >> $ FileVer.pl6
> >> i = <11371>   bi = <>
> >>
> >>
> >> 11371 is correct.
> >>
> >>
> >>
> >> What did I do wrong?
> >>
> >> Many thanks,
> >> -T
>
>
> Got it working.  Thank you!
>
> It is a tad slow.  Depending on the file's size, it is
> 20 to 190 times slower than "index".
>
> I have a lot of thinking to do.

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About