develooper Front page | perl.perl5.porters | Postings from August 2012

Re: [PATCH] Module::CoreList delta support

From: Aristotle Pagaltzis
Date: August 5, 2012 13:39
Subject: Re: [PATCH] Module::CoreList delta support
Message ID: 20120805203932.GA5008@fernweh.plasmasturm.org

* David Golden <xdaveg@gmail.com> [2012-08-05 19:40]:
> On Sun, Aug 5, 2012 at 3:20 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> > That is a weakness in the scheme I outlined. Lines need to be fixed
> > length for the unpack sleight of hand to work, which means adding
> > a perl
>
> (Without actually reading your code) my gut reaction is "don't do
> unpack sleight of hand" then.  Store lines in sorted module name order
> like this:
>
>     "$name $json\n" # where $json has no newlines in it
>
> Then use Search::Dict to bisect to the right line, split the line into
> two fields, decode the JSON part into a hash and dig into it as
> needed.

Err, the whole point was not to have to do that.

You have not considered what happens for queries on the by-perl-release
axis, in which case under your scheme the only solution is to load and
decode every single line and then do a hash access on it.

With fixed width records, queries on both axes are equally fast. The
cost is storing a bunch of whitespace – in memory, but not on disk,
because gzip crunches that away with abandon.
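A minimal sketch of the fixed-width idea, to show why both axes cost about
the same. The layout (one row per module, one column per release), the
column widths, the toy data, and all sub names are assumptions for
illustration, not the format used in the patch.

```perl
use strict;
use warnings;

# Hypothetical layout: one row per module, one fixed-width column per release.
my @RELEASES = qw(5.008 5.010 5.012);
my ($NAME_W, $VER_W) = (16, 8);
my $ROW_LEN = $NAME_W + $VER_W * @RELEASES + 1;    # +1 for the newline

# Made-up toy data; an empty string means "not in core in that release".
my %TOY = (
    'Carp'       => [ '1.01', '1.08', '1.17' ],
    'List::Util' => [ '1.13', '1.19', '1.22' ],
    'autodie'    => [ '',     '',     '2.06_01' ],
);

# "A" pads with spaces on pack and strips them again on unpack.
my $ROW_FMT = "A$NAME_W" . "A$VER_W" x @RELEASES;
my $TABLE   = join '', map { pack($ROW_FMT, $_, @{ $TOY{$_} }) . "\n" } sort keys %TOY;

# Pull one fixed-width column out of every row with a single unpack call.
sub column {
    my ($offset, $width) = @_;
    my $rest = $ROW_LEN - $offset - $width;
    my $tmpl = '(' . ($offset ? "x$offset" : '') . "A$width"
             . ($rest ? "x$rest" : '') . ')*';
    return unpack $tmpl, $TABLE;
}

# By-release axis: (module => version) for everything present in $release.
sub modules_in_release {
    my ($release) = @_;
    my ($col) = grep { $RELEASES[$_] eq $release } 0 .. $#RELEASES;
    return unless defined $col;
    my @names    = column(0, $NAME_W);
    my @versions = column($NAME_W + $VER_W * $col, $VER_W);
    my @rows     = grep { $versions[$_] ne '' } 0 .. $#names;
    return map { ($names[$_] => $versions[$_]) } @rows;
}

# By-module axis: (release => version) for one module.
sub versions_of_module {
    my ($module) = @_;
    my @names = column(0, $NAME_W);
    my ($row) = grep { $names[$_] eq $module } 0 .. $#names;
    return unless defined $row;
    my $fmt = "x$NAME_W" . "A$VER_W" x @RELEASES;
    my @v   = unpack($fmt, substr($TABLE, $row * $ROW_LEN, $ROW_LEN));
    my @cols = grep { $v[$_] ne '' } 0 .. $#RELEASES;
    return map { ($RELEASES[$_] => $v[$_]) } @cols;
}
```

Either query is a couple of unpack calls over one in-memory string; no
per-line decoding step is involved.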

> The bisection means it will be faster than reading every line and
> loading the whole thing into memory anyway.
>
> Only in the rarest of cases do you need to load the whole file -- most
> uses are still "corelist Foo", which benefits hugely from bisection.

Err. You are proposing that doing repeated disk seeks will be faster
than reading 25KB in a single I/O operation and then inflating that to
500KB in memory using gzip (which decompresses at near memcpy speed),
and then scanning that string.
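Roughly, the slurp-and-scan alternative could look like the sketch below. It
uses the core IO::Uncompress::Gunzip module; the sub names load_table and
find_row, and the assumption that each row starts with the module name
followed by a space, are hypothetical.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Slurp the whole compressed file, then inflate it in memory.
sub load_table {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $gz = do { local $/; <$fh> };
    close $fh;
    my $table;
    gunzip(\$gz => \$table) or die "gunzip failed: $GunzipError";
    return $table;
}

# Scan the inflated string for the row that starts with the module name.
# Assumes the name is followed by at least one space.
sub find_row {
    my ($table, $module) = @_;
    return $table =~ /^(\Q$module\E .*)$/m ? $1 : undef;
}
```

There are no seeks anywhere: the file is read front to back once and
everything after that happens in memory.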

I, uhm… Are you sure you want to have that match?

> And if the JSON data for each module is in some delta format and only
> changes when a module is updated, a line only gets touched when it
> actually changes in a release, so the diff is annoying (a very long
> line) but not horrible (every line changing).

That is the benefit of the JSON-based scheme.

I posit that it loses on every other count.

> Why JSON vs some other ASCII delimited format?  Because it allows
> structured data instead of just fields.  (Plus anyone with JSON::XS
> installed gets a speed boost for free.)

Microöptimising a fundamentally slow algorithm…

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
