develooper Front page | perl.perl5.porters | Postings from March 2013

NWCLARK TPF grant February report

Thread Next
Nicholas Clark
March 27, 2013 20:41
NWCLARK TPF grant February report
Message ID:
As per my grant conditions, here is a report for the February period.

The first significant thing I worked on in February was a detailed review of
Peter Martini's work towards an API for subroutine signatures. In
particular, I wondered how much of the existing call checker hooks they
could use. In turn, I wondered whether the call checker hooks were robust
against some of the torture tests I'd suggested for signatures...

    Your signatures code needs some evil test cases. 
    I suspect fun things involving various combination of threads, string
    evals and closures would stress it.
    So starting with simple things like 
    *) calling a function with signatures from a subthread. 
    *) creating a function within a string eval 
    then build up to 
    *) creating a function within a string eval within subthread 
    *) creating functions in a child thread, spawning a grandchild thread, 
       child thread exists, grandchild thread runs (returning results to 
       first thread) 
    the intent being to ensure that data structures are cloned properly, and
    by using an intermediate child thread that goes away, ensure that no
    pointers dangle into the freed up memory of their parent.

For both, the implementation will be similar in how it interacts with
closures and with ithreads. Closures are implemented by run-time cloning of
the compiled subroutine. Spawning an ithread is also a runtime clone.

So I had a good look at the call checker code. There had been bugs related
to closures, but this had been discovered and fixed later on, although not
all possibilities were tested yet, so I added some tests.

Sadly I concluded that the call checker code only offers half the hooks that
signatures need. The call checker runs entirely at compile time, whereas a
key part of the proposed signatures API is to add a hook at runtime, which
is run instead of the code that creates @_ (and all the overhead associated,
not just creation, but all the explicit ops run by the user's code to unpack
@_ back into lexicals). So it's not the solution here.

However, the real fun started soon after thanks to Yves persisting in
investigating what he felt was a problem with the hashing implementation in
the stable releases. A few days after I wrote my previous monthly report, he
sent a short list of lowercase ASCII strings
which when used as keys for a hash would exhaust memory, without the
rehashing mechanism ever kicking in. The rehashing mechanism was meant to
protect against attacks - this attack exploits a flaw in its implementation,
and was announced as CVE-2013-1667.

I'm still not in a position to say everything, as 5 weeks later not all
vendors have shipped updates. However, the fix an its commit message have
been public for 3 weeks now, so it's safe to repeat what they reveal.

To be clear - I'll use the term "hash" or "hashes" for the Perl associative
array, named with % and implemented as a hash table, and use "hash function"
for the many-to-one function that converts a string to a fixed-size numeric
value, a value which is needed to implement a hash table.

Hashes are implemented as a power-of-two sized array of linked lists of hash
entries. The array holds the linked list heads, and the array entries are
often referred to as "buckets". If the hash is working well, the linked
lists are short. To keep the lists short requires that the hash is "split"
periodically as elements are inserted, doubling the size of the array,
reassigning list entries to new positions, and hopefully causing the entries
to be spread across more positions, shortening the lists.

The implementation flaw was that a hash split could be triggered if a linked
list became "too long". "Too long" was deemed to be 15 - much longer than
the 0 to 3 entries of a well behaved hash, and we thought that this would
only happen if a program was being purposefully attacked with crafted
malicious data. This seemed like a good idea at the time (9.5 years ago), as
it meant that an attack would be deflected more quickly (after 15
entries). Previously the split condition was that the number of keys was
greater than the array size - ie that the mean chain length was about to
exceed 1.0. This is reasonable for a well behaved hash, but would mean that
malicious data where all keys in the hash used the same chain would not be
detected until the chain had become as long as the size of the array,
potentially meaning unhappy O(n) behaviour instead of amortised O(1) for
quite a while as the hash grew.

All this went unnoticed for 9.5 years, until Yves tried to test the build by
changing hv.h to use a pathologically bad hashing function - return the same
constant whatever the string. He (and I) would expect that this would be
slow - after all it's invoking an O(n) linked-list walk for every lookup,
and loops over keys will go from O(n) to O(n²), but given enough CPU time we
would expect it to complete. However, it doesn't. It crashes. It was this
unexpected crash that caused him to think that something was wrong, and keep

What he discovered was that triggering a hsplit due to long chain length
allows an attacker to create a carefully chosen set of keys which can cause
the hash to use 2 * (2**32) * sizeof(void *) bytes of RAM. This will exhaust
the address space of a 32 bit process, and potentially chew 32Gb on a 64 bit
system. (However, the copying or even swapping this causes tends to be what
actually makes the machine unhappy.) The flaw means that the size of the
array becomes massively larger than the number of keys.

Eliminating this check, and only inspecting chain length after a normal
hsplit() prevents the attack entirely, and makes such attacks relatively
benign. (ie we've returned to the pre-2003 test of total keys > array size)

It turns out that a miss is as good as a mile - we got part of the memory
exhaustion  problem identified back then in 2003, as this comment shows:
        /* Use the old HvKEYS(hv) > HvMAX(hv) condition to limit bucket 
           splits on a rehashed hash, as we're not going to split it again, 
           and if someone is lucky (evil) enough to get all the keys in one 
           list they could exhaust our memory as we repeatedly double the 
           number of buckets on every entry. Linear search feels a less worse 
           thing to do.  */ 

Whilst the problem is real (and the keys he generated easily demonstrated
that "that's a nice 32Gb server - shame if anything were to happen to it"),
I couldn't see how *it* was the cause of the build failure. Once the
rehashing mechanism kicks in, the only split test is (as above) the old (and
new) keys > array test, so with what I shall term a "chocolate teapot hash"
function*, rehashing should trigger for any hash when its 15th keys is
inserted, and after that hash splitting should not be subject to "attack" -
the array will only grow when the number of keys exceeds the array size.

So I tried to replicate his build of a maint release with the "chocolate
teapot hash". It chugged away merrily for more than a day, then failed
because a process ran out of RAM. Which, upon re-running with the debugger
attached was revealed to be a hash containing about 20 keys, but wanting to
have an array with 2**28 entries. Which isn't supposed to happen.

It turns out that there was a second problem with the "long chain" mechanism
for triggering a hash split and potential rehashing. This is also solved by
published patch addressing CVE-2013-1667. Unlike the other attack, which
requires a reasonable understanding of the internals plus several days CPU
time to reproduce a set of attack keys, this one is can easily be reproduced
as a local proof-of-concept from the text description alone, by re-using
code from one of the regression tests. So as several vendors are still
leaving their customers exposed, I'm loathe to describe it at this time,
other than to say that requires hash re-use (so is harder to exploit), and
it is fixed.

Working round this one problem in my maint build with the "chocolate teapot
hash" completed, but the tests failed, with SEGVs. This was because some of
MRO code assumes that all hashes are using the shared string table. (When
the rehashing mechanism kicks in, the hash stops using the shared string
table. This is part of what makes rehashing less efficient.) Once I was
confident that the problem wasn't caused by hash bugs themselves, I figured
that this wasn't worth dealing with. But I see it as more evidence that the
rehashing approach is a bodge, rather than a robust solution.

So I turned back to blead (which has full randomised hashes all the time, no
rehashing, and everything able to use the shared string table), and tried
building it with the "chocolate teapot hash" function. The build eventually
completes. After an even longer time, the tests pass. This is good.

I then reviewed the branch Yves had pushed with his proposed hash changes to
protect against hash key discovery. As-was, it added another member to
XPVHV, the structure used for every hash. As Perl 5 only has one hashes
type, it's used for symbol tables, (most) objects, and, well, hashes, as in
associative arrays storing user data, accessed by keys. Increase XPVHV and
(nearly) every object gets bigger, which isn't ideal if the increase is only
needed for iterating the hash. (Most code doesn't poke into objects in order
to iterate over their attributes.)

So I looked to find a way to move the member into the on-demand allocated
structure which holds iterator state. This revealed some fun - there are
actually two functions for "splitting" a hash. A static function,
S_hsplit(), which doubles the size of the array, only used by hash key
insertion code, and an API function, Perl_hv_ksplit(), which can grow the
hash array to "any" size requested (which is rounded up to the next power of
two). The former makes simplifying assumptions because it knows that it is
always only doubling, whereas the latter can not, as it needs to be able to
multiply the array size by any power of two.

It's not clear why there are two functions. S_hsplit() dates back to 5.000.
Perl_hv_ksplit() was added in September 1996. Archives exist for
Perl5-Porters from 1996, but the patch from back then was actually a
reworking of an earlier patch. Unfortunately *that* patch, and any
subsequent discussion, was sent to the list before beginning of recorded
time (21st August 1995), so it's a bit of a mystery as to why it duplicated

Fortunately, with a bit of massaging, it turned out to be relatively easy to
refactor the two functions so that their implementations converged. This was
helped by S_hsplit() being a static function, so I had complete control over
both its callers, and was able to push some work out to them. (Actually,
most usefully, just to one of them. As that work wasn't needed for the other
call path. Eliminating work is the only sure form of optimisation.) When I
got them close enough, I was able to replace the guts of Perl_hv_ksplit()
with a call to S_split(), and the fork was healed.

Once it smoked clean, I merged this change back to blead, as it was
independent of the other changes. With the change made, it became a lot
easier to fix the original problem of moving the structure member, and Yves
subsequently incorporated a variant of my suggested changes into his branch.

Nothing do to with hashing, Brian Fraser wondered something about a function
in the tokeniser. The answer wasn't obvious...

Each of the core's C source code files start with a quote from Lord of the
Rings. Many are right on the money. toke.c's is:

 *  'It all comes from here, the stench and the peril.'    --Frodo

This is quite apt, as toke.c is 12127 lines of near-impenetrable magic.
Chaim Frenkel's summary is well known:

    Perl's grammar can not be reduced to BNF. The work of parsing perl is
    distributed between yacc, the lexer, smoke and mirrors.

Larry recently offered a more quantitative evaluation:

    of the four or five ways a compiler can cheat, Perl 5 uses about eight
    of them

So the question was concerning the static function S_force_word() in toke.c,
which has a fifth argument `allow_initial_tick`, which seemed to be
redundant. The function is documented, and the arguments are described as

 * Arguments:
 *   char *start : buffer position (must be within PL_linestr)
 *   int token   : PL_next* will be this type of bare word (e.g., METHOD,WORD)
 *   int check_keyword : if true, Perl checks to make sure the word isn't
 *       a keyword (do this if the word is a label, e.g. goto FOO)
 *   int allow_pack : if true, : characters will also be allowed (require,
 *       use, etc. do this)
 *   int allow_initial_tick : used by the "sub" lexer only.

The function goes back a long way, with various changes to it and its
callers as the result of bug fixes, but these days most of those call sites
that had passed TRUE as the fifth argument had been further refactored to
avoid calling it. So it seemed that the fifth argument it wasn't needed - ie
it was safe to assume that it was always FALSE. Note, there's no way
whatsoever from any of the documentation to work out whether this is the
case. Although all the individual authors of this code are believed still to
be alive and responsive to e-mail, previous attempts at asking simpler
questions about code written years or decades ago have always resulted in
polite replies to the effect of "I no longer remember". Which really isn't

Hence, trying to get any better understanding of how pretty much any part of
the parser works requires carrying out an investigation like the one I'm
about to describe.

So, for starters, there's the obvious first step - what happens if I change
the code and do a full clean build and run the tests? Nothing fails. That
hints that it might be redundant, but you never can be sure...

Digging around the previous historical versions where S_force_word() had
been changed didn't real anything. Even the changes where that parameter has
been renamed or code relating to it altered only confirmed that there had
been bugs, but the changes fixed those bugs.

The approach that paid off was observing that until 2012 there were two
other call sites passing TRUE to the function. So I built the version of
blead just before they were refactored, and tried using FALSE instead. With
that change, this code stops parsing:

    $ ./miniperl -e "sub 'foo {warn qq{ok}}; &'foo"

That suggests that the argument in question is something to do with
disambiguating whether a ' is the start of a single quoted string, or a
leading package separator (ie the Perl 4 way of saying ::foo)

With that knowledge, and a bit of trial and error, I was able to figure out
that the code in question is needed to parse this correctly:

    $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
             -e. -e '$~ = "one"; write'

If you change the TRUE to FALSE, this happens:

    $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
             -e. -e '$~ = "one"; write'
    Undefined format "one" called at -e line 5.

(the parsing of 'format ::one =' is unaffected)

So, finally

a) we know what the code was for
b) we know how to write a test case
c) we can refactor the code to eliminate that argument
   (Brian Fraser had spotted that the tokeniser has already done the necessary
     work about a dozen lines earlier, with the result stored in a different

But that was about 12 hours work to figure out 1 argument to 1 function in
toke.c. There are 97 functions in toke.c, most have multiple arguments, and
the mean function length is double that of S_force_word(). The interactions
are staggeringly complex. Roughly 0.1% down, 99.9% to go.

The final significant thing I worked on this month was Unicode code point
lookup by name. Perl 5 can convert from name to code point using the "\N{}"
escape, which is implemented by automatically loading the charnames pragma.
Using /usr/bin/time it's easy to see that loading charnames increases memory
use considerably. Compare:

    $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello world\n"'
    Hello world

            Maximum resident set size (kbytes): 6976

    $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello\N{SPACE}world\n"'
    Hello world

            Maximum resident set size (kbytes): 30672

Trying to get some idea of where that extra 23 meg came from:

    $ /usr/bin/time -v ./perl -Ilib -le ''
            Maximum resident set size (kbytes): 6432

    $ /usr/bin/time -v ./perl -Ilib -le 'use strict; use warnings'
            Maximum resident set size (kbytes): 10016

    $ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"'
            Maximum resident set size (kbytes): 24144

    $ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"; print charnames::vianame("SPACE")'
            Maximum resident set size (kbytes): 30736

Just loading strict and warnings allocates another 4M. (Note, "Hello world"
was .5M, so nothing is free.) Loading charnames allocates another 14M, and
using it allocates a further 6M, presumably as various caches start to get

But all these requests are for things that are already known, just not in a
very convenient format for a fast lookup. The main Unicode Data file is
24430 lines, and doesn't include 4 large ranges of CJK unified ideographs, a
large range of algorithmically named Hangul syllables, and about 500
aliases. Moreover, any Perl hash you build of this (or the subset that you
are interested in), is allocated at runtime, from memory that isn't even
going to be shared between threads, let alone between processes.

Karl and I have wondered whether it would be possible to encode the names as
a trie structure, with lookup code written in C. The lookup data would be
unmodified at runtime, so could be compiled by the C compiler into the
constant data section of the executable (or shared library), which will be
shared at least between threads, and on *nix (at least) between all
processes. So I've made a start at looking at this. By the end of the week
my code had reached the point where I have Perl code to parse all the files,
and generate suitable data structures, along with Perl code written in a C
style which can correctly look up any code point by name, except the CJK
ideographs and Hangul names. Which is pleasing.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in
RT #... is a bug in
CPAN #... is a bug in
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules

[Hours]		[Activity]
 11.75		'foo special case
  0.50		Android
  0.50		Devel::PPPort/my $_
  1.75		MVS
 24.00		PERL_HASH
		PERL_HASH "crap" hash
		PERL_HASH "crap" hash (there is an assumption that MRO stays shared)
		PERL_HASH hashes
		PERL_HASH hashes -- Use the old HvKEYS(hv) > HvMAX(hv) condition
		PERL_HASH hashes. Aha. rehash/off/on/off/on/off
  0.50		PL_interp_size_5_18_0
  7.25		PL_sv_objcount
  1.75		RT #109828
  0.25		RT #113620
  0.25		RT #116362
  0.25		RT #116615
  2.50		RT #116639
  0.25		RT #116789
  0.25		RT #116795
  0.75		RT #116833
  0.25		RT #116839
  0.75		RT #116899
  0.75		RT #116943
  0.50		RT #116989
  0.25		RT #59802
  0.50		SvOK
  9.00		Unicode Names
  0.75		WinCE
  0.25		abolish STRANGE_MALLOC
  1.25		android
  0.50		benchmarking regressions
  5.75		hv_ksplit/hsplit merge
  2.00		inversion lists
  0.25		investigating security tickets
  0.25		lexical $_
  3.25		pahole on the interpreter struct
  0.25		parallel tests
  5.50		process, scalability, mentoring
 39.50		reading/responding to list mail
  3.50		regcomp/setjmp
 18.00		signatures
		signatures (call checker)
  2.50		struct re_save_state
  4.75		yves/hv_h_split
153.00 hours total

Nicholas Clark

* For public reporting purposes it's a "chocolate teapot hash". I used a
  terser term myself in private.

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About