develooper Front page | perl.perl5.porters | Postings from March 2013

NWCLARK TPF grant report #76

Nicholas Clark
March 27, 2013 20:21
NWCLARK TPF grant report #76
Message ID:
[Hours]		[Activity]
2013/02/11	Monday
 2.50		reading/responding to list mail
 4.00		signatures

2013/02/12	Tuesday
 0.25		investigating security tickets
 0.25		process, scalability, mentoring
 0.50		reading/responding to list mail
 6.50		signatures

2013/02/13	Wednesday
 1.00		process, scalability, mentoring
 0.25		reading/responding to list mail
 0.25		signatures
 6.75		signatures (call checker)

2013/02/14	Thursday
 4.50		PERL_HASH
 3.00		PERL_HASH
 0.50		signatures

2013/02/15	Friday
 7.50		PERL_HASH
 1.25		PERL_HASH

2013/02/16	Saturday
 0.25		PERL_HASH

2013/02/17	Sunday
 0.00		PERL_HASH
 0.25		PERL_HASH

Which I calculate is 39.75 hours

I started on Monday with a detailed review of Peter Martini's work towards
an API for subroutine signatures. In particular, I wondered how much of the
existing call checker hooks they could use. In turn, I wondered whether
the call checker hooks were robust against some of the torture tests I'd
suggested for signatures...

    Your signatures code needs some evil test cases. 
    I suspect fun things involving various combination of threads, string
    evals and closures would stress it.
    So starting with simple things like 
    *) calling a function with signatures from a subthread. 
    *) creating a function within a string eval 
    then build up to 
    *) creating a function within a string eval within subthread 
    *) creating functions in a child thread, spawning a grandchild thread, 
       child thread exists, grandchild thread runs (returning results to 
       first thread) 
    the intent being to ensure that data structures are cloned properly, and
    by using an intermediate child thread that goes away, ensure that no
    pointers dangle into the freed up memory of their parent.

For both, the implementation will be similar in how it interacts with
closures and with ithreads. Closures are implemented by run-time cloning of
the compiled subroutine. Spawning an ithread is also a runtime clone.

So I had a good look at the call checker code. There had been bugs related
to closures, but this had been discovered and fixed later on, although not
all possibilities were tested yet, so I added some tests.

Sadly I concluded that the call checker code only offers half the hooks that
signatures need. The call checker runs entirely at compile time, whereas a
key part of the proposed signatures API is to add a hook at runtime, which
is run instead of the code that creates @_ (and all the overhead associated,
not just creation, but all the explicit ops run by the user's code to unpack
@_ back into lexicals). So it's not the solution here.

However, the real fun started this week thanks to Yves persisting in
investigating what he felt was a problem with the hashing implementation
in the stable releases. He sent a short list
of lowercase ASCII strings which when used as keys for a hash would exhaust
memory, without the rehashing mechanism ever kicking in. The rehashing
mechanism was meant to protect against attacks - this attack exploits a
flaw in its implementation, and was announced as CVE-2013-1667.

I'm still not in a position to say everything, as 5 weeks later not all
vendors have shipped updates. However, the fix an its commit message have
been public for 3 weeks now, so it's safe to repeat what they reveal.

To be clear - I'll use the term "hash" or "hashes" for the Perl associative
array, named with % and implemented as a hash table, and use "hash function"
for the many-to-one function that converts a string to a fixed-size numeric
value, a value which is needed to implement a hash table.

Hashes are implemented as a power-of-two sized array of linked lists of hash
entries. The array holds the linked list heads, and the array entries are
often referred to as "buckets". If the hash is working well, the linked
lists are short. To keep the lists short requires that the hash is "split"
periodically as elements are inserted, doubling the size of the array,
reassigning list entries to new positions, and hopefully causing the entries
to be spread across more positions, shortening the lists.

The implementation flaw was that a hash split could be triggered if a linked
list became "too long". "Too long" was deemed to be 15 - much longer than
the 0 to 3 entries of a well behaved hash, and we thought that this would
only happen if a program was being purposefully attacked with crafted
malicious data. This seemed like a good idea at the time (9.5 years ago), as
it meant that an attack would be deflected more quickly (after 15
entries). Previously the split condition was that the number of keys was
greater than the array size - ie that the mean chain length was about to
exceed 1.0. This is reasonable for a well behaved hash, but would mean that
malicious data where all keys in the hash used the same chain would not be
detected until the chain had become as long as the size of the array,
potentially meaning unhappy O(n) behaviour instead of amortised O(1) for
quite a while as the hash grew.

All this went unnoticed for 9.5 years, until Yves tried to test the build by
changing hv.h to use a pathologically bad hashing function - return the same
constant whatever the string. He (and I) would expect that this would be
slow - after all it's invoking an O(n) linked-list walk for every lookup,
and loops over keys will go from O(n) to O(n²), but given enough CPU time we
would expect it to complete. However, it doesn't. It crashes. It was this
unexpected crash that caused him to think that something was wrong, and keep

What he discovered was that triggering a hsplit due to long chain length
allows an attacker to create a carefully chosen set of keys which can cause
the hash to use 2 * (2**32) * sizeof(void *) bytes of RAM. This will exhaust
the address space of a 32 bit process, and potentially chew 32Gb on a 64 bit
system. (However, the copying or even swapping this causes tends to be what
actually makes the machine unhappy.) The flaw means that the size of the
array becomes massively larger than the number of keys.

Eliminating this check, and only inspecting chain length after a normal
hsplit() prevents the attack entirely, and makes such attacks relatively
benign. (ie we've returned to the pre-2003 test of total keys > array size)

It turns out that a miss is as good as a mile - we got part of the memory
exhaustion problem identified back then in 2003, as this comment shows:
        /* Use the old HvKEYS(hv) > HvMAX(hv) condition to limit bucket 
           splits on a rehashed hash, as we're not going to split it again, 
           and if someone is lucky (evil) enough to get all the keys in one 
           list they could exhaust our memory as we repeatedly double the 
           number of buckets on every entry. Linear search feels a less worse 
           thing to do.  */ 

Whilst the problem is real (and the keys he generated easily demonstrated
that "that's a nice 32Gb server - shame if anything were to happen to it"),
I couldn't see how *it* was the cause of the build failure. Once the
rehashing mechanism kicks in, the only split test is (as above) the old (and
new) keys > array test, so with what I shall term a "chocolate teapot hash"
function*, rehashing should trigger for any hash when its 15th keys is
inserted, and after that hash splitting should not be subject to "attack" -
the array will only grow when the number of keys exceeds the array size.

So I tried to replicate his build of a maint release with the "chocolate
teapot hash". It chugged away merrily for more than a day, then failed
because a process ran out of RAM. Which, upon re-running with the debugger
attached was revealed to be a hash containing about 20 keys, but wanting to
have an array with 2**28 entries. Which isn't supposed to happen.

It turns out that there was a second problem with the "long chain" mechanism
for triggering a hash split and potential rehashing. This is also solved by
published patch addressing CVE-2013-1667. Unlike the other attack, which
requires a reasonable understanding of the internals plus several days CPU
time to reproduce a set of attack keys, this one is can easily be reproduced
as a local proof-of-concept from the text description alone, by re-using
code from one of the regression tests. So as several vendors are still
leaving their customers exposed, I'm loathe to describe it at this time,
other than to say that requires hash re-use (so is harder to exploit), and
it is fixed.

Working round this one problem in my maint build with the "chocolate teapot
hash" completed, but the tests failed, with SEGVs. This was because some of
MRO code assumes that all hashes are using the shared string table. (When
the rehashing mechanism kicks in, the hash stops using the shared string
table. This is part of what makes rehashing less efficient.) Once I was
confident that the problem wasn't caused by hash bugs themselves, I figured
that this wasn't worth dealing with. But I see it as more evidence that the
rehashing approach is a bodge, rather than a robust solution.

So I turned back to blead (which has full randomised hashes all the time, no
rehashing, and everything able to use the shared string table), and tried
building it with the "chocolate teapot hash" function. The build eventually
completes. After an even longer time, the tests pass.

Nicholas Clark

* For public reporting purposes it's a "chocolate teapot hash". I used a
  terser term myself in private. Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About