develooper Front page | perl.perl5.porters | Postings from August 2011

Re: [perl #95986] [PATCH] Superficial separation of INSTRUCTION from OP.

Gerard Goossen
August 8, 2011 11:50
Re: [perl #95986] [PATCH] Superficial separation of INSTRUCTION from OP.
On Wed, Aug 03, 2011 at 01:45:31PM +0100, Nicholas Clark wrote:
> On Wed, Aug 03, 2011 at 02:11:38PM +0200, Gerard Goossen wrote:
> > On Mon, Aug 01, 2011 at 08:32:30AM -0700, Dave Mitchell via RT wrote:
> > > On Sun, Jul 31, 2011 at 03:58:17AM -0700, Gerard Goossen wrote:
> > > > Preparation for having INSTRUCTION which is different from OP.
> > > > This patch only introduces the INSTRUCTION type which is a synonym for
> > > > OP, changes function and variable definitions to use the type and
> > > > renames variables from *op to *instr.
> > > 
> > > This patch changes function signatures which are part of the public API,
> > > so needs to be treated with caution.
> > > 
> > > To put it in its wider context, can you provide a link to a thread or
> > > website which describes your overall plan in more detail?
> >  
> > The original proposal:
> >
> > The report:
> >
> > 
> > The latest version can be found at git://
> > in the codegen-instruction branch (it has some minor problems from
> > rebasing on the latest blead).
> It's still very hard from that to get any breakdown of where things
> are headed, given that the most recent change on the `codegen` branch
> is still a massive:
>  98 files changed, 8347 insertions(+), 6604 deletions(-)
> Have I missed a branch with a proposed sequence of patches for blead?

My main development branch is now "codegen-instruction" (it is a
development branch, and rather unorganized).
The main change is a lot smaller, but still massive:
  51 files changed, 4080 insertions(+), 2756 deletions(-)
(I guess about a quarter of that is auto-generated.)
I am still working on splitting it into smaller patches and submitting
them to p5p one by one (like this patch about INSTRUCTION).

> > In short:
> > 
> >   at compile time:
> >     The building of the op_next chain is gone.
> >     A final walk is done through the optree in finalize_optree doing
> >         required checking now done by the peephole optimizer. This
> >         step has already been added to blead.
> > 
> >   before the first run time:
> >     The code generation does a walk through the optree generating a
> >         list of instructions; these instructions correspond more or
> >         less with the current op_next chain
> How does that fit with ithreads, where the optree is shared between threads,
> and hence can't be modified at runtime?

The optree isn't modified when generating the instructions.

> >   at run time:
> >     Starting with the first instruction from the CV the function from
> >         the instr_ppaddr field is called. The function is responsible
> >         for returning the next instruction, until a NULL instruction is
> >         returned.
> I'm still extremely uncomfortable that a change this radical to the
> entire *runtime* side of the OP system is sane to make. It breaks a
> chunk of CPAN, for no proven runtime efficiency gain. The timing
> results you posted previously were not conclusive of any real speedup.
> Partly, I think, because you inverted the calling convention from
> OPs wrap new-style multi-arg functions to new-style multi-arg functions
> wrap OPs way too early, before sufficient OPs had been converted to
> avoid the thunking layer.
> Moreover, I don't see good prospects for it enabling future speedups.
> Unladen Swallow has failed to demonstrate that LLVM is the way to go.
> If anything, possibly the opposite. One of those working on it wrote
>     Unfortunately, LLVM in its current state is really designed as a
>     static compiler optimizer and back end. LLVM code generation and
>     optimization is good but expensive. The optimizations are all
>     designed to work on IR generated by static C-like languages. Most
>     of the important optimizations for optimizing Python require
>     high-level knowledge of how the program executed on previous
>     iterations, and LLVM didn't help us do that.
> and
>     LLVM also comes with other constraints. For example, LLVM doesn't
>     really support back-patching, which PyPy uses for fixing up their
>     guard side exits. It's a fairly large dependency with high memory
>     usage, but I would argue that based on the work Steven Noonan did
>     for his GSOC that it could be reduced, especially considering that
>     PyPy's memory usage had been higher.
> I'm not even convinced that I'd say that the thing is production ready,
> given that (as best I can tell) you remove 2 pointers from BASEOP
> (op_next and op_ppaddr), but in turn add a structure for an instruction:
> struct instruction {
>     Perl_ppaddr_t       instr_ppaddr;
>     OP*                 instr_op;
>     INSTR_FLAGS         instr_flags;
>     void*               instr_arg;
> };
> meaning that (as best I can tell), the memory cost *for every
> instruction* is now 2 pointers more (given alignment constraints).
The current changes don't really change anything in the pp_* functions
being executed, so I wouldn't really expect any speedup.

Changing the calling convention of the pp_* functions to multi-arg
functions probably wasn't a very good idea, so I removed it
(information that was supplied through the args is now accessed via
PL_instruction->...). This change didn't have any real effect on
performance.

I expect the main speedup to come from static analysis, although many
typical optimizations like statement reordering or common
subexpression elimination are difficult due to magic and overloading.

Apparently the greatest cost is in subroutine calls.
The first column is blead and the second is with a reference-counted
CODESEQ, but with the start instruction being the normal CvSTART
(change 1d46ffca9d34fb9e963b0943a31d7d2e3e78b735):
call/0arg                100      81
call/1arg                100      87
call/2arg                100      84
call/9arg                100     107
call/empty               100      79
call/fib                 100      90
call/method              100      99
call/wantarray           100     100

Note that this change also wasn't optimized very much; for example,
most of the accessors are still functions instead of macros, but I
don't think that is the main cause of the slowdown. There is an extra
level of indirection after the CV: the CV points to the CODESEQ, which
points to the start instruction, instead of the CV pointing directly
to the start op. The CODESEQ is also reference counted, which has to
be maintained. This allows multiple CODESEQs for the same CV. We could
decide not to do that and integrate the CODESEQ directly into the CV,
which would remove both the extra pointer indirection and the need for
reference counting the CODESEQ.

On the memory side: the number of OP elements is reduced; for example,
'for (@ARGV){}' is only 9 OPs instead of 17 OPs. Also, the
instructions are only created when the CV is actually executed, which,
especially for programs using lots of modules, often never happens.

> I'd love to see a proper AST being used internally to generate (something
> close to) the existing OP structure. But I'm not confident that we've yet
> proven the case that moving from the existing OP structure is worth the
> pain.
> Nicholas Clark
