develooper Front page | perl.perl5.porters | Postings from April 2021

Re: threads and stuff (was Re: Perl's leaky bucket)

Nicholas Clark
April 9, 2021 13:31
Re: threads and stuff (was Re: Perl's leaky bucket)
On Wed, Apr 07, 2021 at 08:49:31PM -0500, B. Estrade wrote:
> FWIW, I can accept that fundamentally this violates any number of basic
> assumptions in the perl runtime. I know it's interpreted, dynamic, etc. I
> understand conceptually that perl data structures necessarily are designed
> to be used in such a runtime that requires "bookkeeping". Worst case for me,
> I come away with a better understanding on why what I am asking actually is
> _impossible_.

> Can perl handle such a "threaded" block internally?
> The most likely answer I can project is:
> * not directly, an embedded "SMP capable interpreter" approach is the only
> clean way to do this.

Sure, Perl doesn't have a GIL (Global Interpreter Lock), but it doesn't need
to - it doesn't expose a concept of running more than one execution context
on the same interpreter, which is what the Python GIL is about -
time-slicing the single interpreter to run more than one execution context.
(Concurrency without parallelism, so just one CPU core.)

What you're describing would need more than one execution context able to
run on the same interpreter. Right now there's a total assumption of 1-to-1
mapping from interpreter to execution context. What differs from Python is
that we *don't* have the assumption of exactly one interpreter in an OS
process, all the C APIs are ready for this, and XS extensions likely should
be too.

ithreads exploits this ability to have more than one interpreter in order to
provide more than one execution context (which is what we care about), which
means that it can use more than one CPU.

On Wed, Apr 07, 2021 at 11:20:49PM -0400, Dan Book wrote:

> I'd defer to others more familiar with ithreads such as Nicholas's earlier
> post, but what you are describing is largely what ithreads does, and its
> problems of being heavyweight to spawn and causing bugs largely stem from
> trying to share memory and have the perl interpreter available in each
> thread. So I'll just say, these are a lot of nice wishes, but getting the
> tradeoffs right in the implementation is a challenge, if it is possible at
> all.

Yes. The *problem* is/remains that everything internally assumes/relies on
"interpreter" === "execution context", meaning that:

1) ithreads has to copy the entire interpreter, rather expensively, to
   create a second execution context
2) there isn't a good way to share data between contexts

To be viable, the threading model Brett is suggesting still needs the
*first* point to be changed - it only works out if "execution context" is
decoupled from "interpreter state", and it becomes possible to have 2+
execution contexts running concurrently on the same interpreter.

ie *we* wouldn't have to solve:

    There should be one -- and preferably only one -- [interpreter]

but the rest is effectively the same problem as removing the GIL.

And that failed, despite considerable efforts.

The two talks by Larry Hastings on this are quite interesting (and fun to
watch, as he is good at explaining things, and it's mostly not Python-specific).

In "Removing Python's GIL: The Gilectomy" (PyCon 2016)

he has a slide listing what you need to add locking to. Most of that maps
one-to-one to the perl internals:

* dicts => hashes (for symbol tables)
* lists => arrays (many other interpreter global structures)
* freelists => SV arenas

He measured an immediate 30% slowdown *just* from reference counts now being
implemented by CPU atomic operations.

A year later, in his talk with the update, he notes that he's moved to
buffered reference counting (and gives a very good explanation of this)

Suffice to say, this approach makes DESTROY untimely. What's amusing is
that in the Q&A section he realises that he'd missed saying something:

The implication of buffered reference counting is that all refcount
decrement to zero happens on the "reference count committing thread".  So,
naïvely, this means that all DESTROY actions happen on that thread, which
now has far too much work. Hence he uses queues to dispatch the destroy
action *back* to the thread that did the last refcount decrement. Meaning
that destruction is even more untimely.

The sad part is that a year later, the update was that he'd given up:

    He is "out of bullets" at least with that approach.

    With his complicated buffered-reference-count approach he was able to
    get his "gilectomized" interpreter to reach performance parity with
    CPython, except that his interpreter was running on around seven cores to
    keep up with CPython on one.

It's interesting that he was then considering that one would need to rewrite
CPython from reference counting to a tracing garbage collector to get
further, because in another session at the same Python Language Summit
someone from Instagram said that they had experimented with exactly that:

    It is part of the C API, though, so reference counting must be
    maintained for C extensions; in the experiment, the Instagram developers
    moved to a tracing garbage collector everywhere else.

The Instagram folks also experimented with changing CPython's data
structures to optimise for the common case, and saw speedups.
(this obviously makes the C code more complex and harder to understand, but
more importantly, this does start to break the C APIs, and hence existing
extensions). Which is interesting, because at the end of his talk the
previous year, Larry Hastings observed that Jython and IronPython can both
linearly scale threads with CPUs. Both underlying VMs are written in C (or
equivalent), hence an "existence proof" - it's clearly possible to scale
linearly with a C implementation - it was just a question of how much the C
API he needed to break to get there.

So I'm wondering (this is more [original arm waving] than [original
research]), whether in effect Jython and IronPython are just taking the
(equivalent) slowdown hit for concurrency as his Gilectomy, and then are
able to win the speed back simply by using better data structures (optimised
to the common runtime use patterns), because they aren't constrained by the
C API.

(Also they have JITs. But maybe more importantly, before that, they can
specialise the bytecode executed at runtime to remove accessor calls, and do
things like realise that some objects are not visible to other threads, or
maybe don't even need to be on the heap. There are a lot of tricks that
python, perl and VMs of that maturity don't and now can't use.)

Also interesting on the Hacker News thread on the 2016 talk was this:

    I work on a new Ruby interpreter, and our solution to this problem has
    been to interpret the C code of extensions. That way we can give it one
    API, but really implement another.

This is "Sulong", a "system that can execute LLVM-based languages on the JVM":

Yes, this is crazy. Ruby needs to call C extensions. To make this efficient
for Ruby running on the JVM, we'll just go compile the C into something we
can (somehow) inline. And it worked. *

However, again, despite all this very smart work, TruffleRuby doesn't seem
to have taken over the world any more than JRuby did.

Nor has PyPy taken over Python.

Despite all these other smart folks working on problems similar to ours (eg
at least 7 on TruffleRuby), I don't see any obvious strategy to steal.

Sideways from this, I started to wonder how you might start to untangle the
"one interpreter" === "one execution context". The obvious start is to try
to avoid reference counting overhead where possible. Most objects *aren't*
visible to more than one thread. (In the degenerate case of one thread,
that's all objects - this feels like a good start.) Hence have the concept
of "local objects" and "shared objects" - it's just one flag bit. You *are*
going to need a branch for every reference count twiddle, but it might be
faster, because mostly it will avoid atomic operations (or buffered
reference counting, or whatever).

Everything starts local - if an object becomes referenced by a shared
object, that object must be promoted to "shared". It's the same sort of idea
as generational GC, and would also imply a write barrier on every
assignment. That's a small CPU cost, but it's a new rule that XS code on
CPAN doesn't know that it needs to conform to...

But, stashes are global. So because of this invariant, subs are global, so
pads are global, so all lexicals are global.

Oh pants!

Suddenly everything is global, without trying.

And actually it's worse - pads aren't just global in this sense - they are
one of these *implicit* assumptions of "one is one" - they are an array of
arrays - the inner array is lexicals (and temporaries) for the subroutine,
the outer array permits one of these for each depth of recursion.

There's no way to have two threads execute the same subroutine
simultaneously, without some major refactoring.

(signatures, @_, return values on the stack, sending return values back from
multiple threads to the master at the end - these are all problems, but
trivial in comparison to things like this)

On Fri, Apr 09, 2021 at 12:53:27AM -0500, B. Estrade wrote:

> For now, I want all of us to think "thread safety" and "SMP" in everything
> we do. Not in a design sense, but in the unconscious sense; so it imbues
> itself in every aspect of any work and thinking done in the name of
> perl/Perl.
> If you ask me right now; in 2024, I'd like to be able to do something as
> simple as this as a single OS process, but utilizing "light weight" OS
> threads (e.g., pthreads):

>   my $num_threads = 8;
>   map_r($num_threads) {
>     local $thread_id = tid();
>     local $bakery_ticket = atomic_var_update; # threads race
>     # threads do stuff
>     # .. and more stuff
>     # ... and maybe more stuff
>     # busy wait
>     do {
>       local $ihazbeenserved = now_serving($bakery_ticket);
>     } while(not $ihazbeenserved);
>     # "safely" print to STDOUT without clobbering (could use regular
>     # printf and risk messages overlapping)
>     printf_r("Yay! I, thread %d, now have my %s", $thread_id, $ihazbeenserved);
>   }

This isn't something that we can solve by working on it for the next few
years - really it's something that could only be solved by working on it 25
years ago, and having a radically different implementation.

Basically, I don't think that the current Perl 5 internals are going to
support CPU level parallelism in the same interpreter, any more than
CPython ever will, or MRI.

You *can* get to CPU level parallelism within "leaf" calls to C - so Yuki
Kimoto's suggestion of SPVM seems interesting - he pasted a link to it
earlier in the thread.
For anything more than that, it's going to take other internals. For the
syntax you're suggesting, the fastest route to seeing it run might be to
look at the project which aimed to write a Perl 5
parser using Rakudo. I'm aware that FROGGS stopped working on it a few
years ago - I don't know how complete it is, but I don't think that there
was any fundamental blocker.

Nicholas Clark

* There's a talk on it. You can watch the whole thing and just s/ruby/perl/g
  I had no idea that the Ruby C "API" was just like the Perl C "API".

  Spoilers start here:
  Of the 2.1 billion lines of code in RubyGems, .5 billion is C.
  A ruby extension in C can be 10x faster than MRI
  Their approach is 15x faster than MRI *without* inlining C, 30x with.
  They actually implement much of the Ruby C ABI in Ruby.
  The Gem thinks that it's doing this:
  existing Gem written in C, compiled => C function in MRI
  when actually it's this:
  existing Gem unchanged, compiled to LLVM bytecode => their C shim => Ruby
  where *all* that is being run inside the JVM.

  The single best slide is this:
