develooper Front page | perl.perl5.porters | Postings from January 2013

NWCLARK TPF grant December report

Nicholas Clark
January 9, 2013 20:43
NWCLARK TPF grant December report
As per my grant conditions, here is a report for the December period.

A frustrating month.

Having completed the investigation into the state of hashing described in
November's report, and becoming comfortable that it wasn't likely to explode
without warning, I turned to dealing with the backlog of everything else.
Given that I've only been able to do about 2 weeks' work in the past 2
months, routine traffic on the list means that a lot of "stuff" has
accumulated which I've not had time to deal with.

The obvious thing to fix first was the failing smoke testing reports. Smoke
testing reports are great as long as they pass. Starting from the known
state of "passing", you make a change, and if everything still passes you
have reasonable confidence that you didn't break anything. Alternatively, if
the tests start to fail, you quickly know that your change has an unexpected
problem, and a clue as to where to start investigating.

Failing smoke tests are frustrating because you no longer have this clean
distinction between "it passes" and "it fails, hence I introduced a
problem". The tests fail, but they were going to fail anyway. So you have to
look carefully to work out whether the failures you see are new (and hence
you broke something), or the ones that were there before (and hence you
*probably* didn't break anything, but you can't be sure).

The right thing to do is to fix the problems that are causing the smoke
tests to fail, before trying to change anything else. In this case the
problem is with the smoker that George Greer runs, which builds perl
using clang's Address Sanitizer, a tool which looks for C programming
errors such as uninitialised and out-of-range memory accesses. It's particularly
important to investigate (and fix) such reports, as the problems found are
potentially nasty security risks. In this case, as the smoker had started
finding problems at some point since September's blead release, it can't be
a problem in a production version, but it does need to be nailed before
v5.18.0 ships, and the sooner the better.

Unfortunately, I couldn't replicate the problem on any machine I had access
to on which I'd successfully built perl with clang. So I tried to replicate
the problem on dromedary, the rather beefy 8 core server which acts as hot
backup to the main git server. What followed was several days of
discovering, stalling, and eventually working round bugs in building LLVM
and clang, or in building perl with clang, in the hope of getting to the
point where I could replicate the problem under a debugger, and hence fix it.

dromedary already had a version of clang installed into /usr/local. However,
I'd never been able to get it to complete the build of perl. The perl binary
would build successfully, but the build would bomb out whilst building XS
extensions with errors such as

    encengine.o: In function `fstat64':
    /usr/include/sys/stat.h:449: multiple definition of `fstat64'
    Encode.o:/usr/include/sys/stat.h:449: first defined here
    encengine.o: In function `fstatat64':
    /usr/include/sys/stat.h:457: multiple definition of `fstatat64'
    Encode.o:/usr/include/sys/stat.h:457: first defined here
    encengine.o: In function `gnu_dev_major':

This time, however, I did manage to get to the bottom of that one. Whilst
it's reported as an LLVM bug, the
underlying problem is actually in the system headers provided by glibc,
which specify that certain functions are to be inlined always. If this
instruction is followed correctly by the compiler, then one copy is inlined
into each object file, and this results in the clash seen above at link time.

So how come it builds with gcc? Well, the version of gcc on the system
doesn't correctly honour the inlining request, and hence doesn't generate
multiple clashing bodies for the functions. In turn, this is why the
glibc headers are buggy - no-one was aware of the problem back when they
were released, because no compiler was good enough to spot the problem.

So, how come it only fails in the XS modules? After all, the perl binary is
made by linking multiple object files together, and surely they have
duplicate function definitions? It turns out that it's a side effect of how
the core's build system automatically adds various warnings and strictures
when building the core C code, which it does not when building extensions.
The flags are only used for the core C code, because as it's code we
control, we've progressively tidied it to address the problems the warnings
highlighted, and hence keep the build output clean. (Like the smokes, silent
is good, because it's the easiest to skim.)

One of these flags is -std=c89, forcing ANSI C89 behaviour, which
./cflags.SH determines is viable when building with clang on Linux. It turns
out that this changes the behaviour of the headers, to avoid inlining. (The
C<inline> keyword came in with C++ and C99.) At which point the problem of
multiple bodies goes away, and the link is clean. So, explicitly add
C<-std=c89> to the standard build flags and the build completes, tests run,
and tests pass.

However, sadly, the locally installed version of clang is too old to be
useful, as it's 3.0, and Address Sanitizer wasn't merged in until 3.1.
Not to worry, let's build 3.1...

Turns out that this also isn't that easy. For starters, LLVM and clang are
big, very big. Several gigabytes for the build tree, and another several
gigabytes for the installed tree. So this necessitates a clean up of my home
directory, just to make room on the partition for the build. And then the
build fails with an assertion deep in LLVM, which Googling suggests is due
to a bug in g++ 4.2 - LLVM puts a lot of stress on the C++ compiler used to
build it. The bug is fixed in later versions of g++, but as dromedary's OS
is old, g++, like the system headers, isn't the most current.

Rather than build my own newer copy of gcc, I decided to build a newer LLVM,
in case that fixed it. Which means getting it from source control. At
which point I discover that there is no subversion client installed on
dromedary - presumably none of us have ever needed it. This suggests something
about the trends in version control systems used by open source projects.
So more yak shaving as I download and build that, which turns out not to be
too hard. Which gets me a current LLVM, which I can install and use to build perl.

But I still can't replicate the problem.

So I checked out the exact same svn revision of clang and LLVM that George's
smoker reported it was using, and built and installed that.

*Still* no joy.

Wondering whether this was something to do with the age of the header files,
I tried on one other machine (a rather newer Ubuntu desktop). Again, the
exact same revision as George, and no joy. Similarly, svn HEAD couldn't
replicate the problem.

After three days of trying I had learned a lot about how (not) to build LLVM
and clang, but had failed to solve the problem I set out to solve. So I'm no
nearer to actually nailing the cause of the "unknown-crash" which George's
instance of Address Sanitizer cryptically reports for a complex expression
deep in the code which compiles regular expressions.

Prodding further, we discovered that Merijn could replicate the report when
he built on his laptop, and that after shipping the build tree to dromedary,
I could run it there and replicate the report. However, as it was built with
optimisation, it wasn't very useful trying to debug this with gdb.
(Actually, it wasn't at all useful trying to use /usr/bin/gdb, as the debug
format had moved on. I had to use the newer gdb I'd built from source the
week before, as part of my earlier clang yak shaving.)

So, he built again with -g, to enable C level debugging. Problem goes away.
Bother. So he built again, with both -O and -g, to keep the optimiser, but
also add debugging symbols. Debugging optimised code is a pain, but it's
better than nothing. However, even *this* hid the problem. At which point,
we gave up again, because remote delivery of "add a printf" style debugging
ties up two busy people, and simply isn't a productive use of time.

I did get somewhat further with another yak shaving exercise. To fix a bug,
DosGlob had been updated, but in the process it gained a declaration after a
statement in the C code, which broke the build on some platforms. I
wondered why we hadn't got an automated test for this, to be able to
rapidly test the change on a branch, and avoid the problem in the future. At
least on FreeBSD the system headers are clean enough to be able to build
with gcc -ansi -pedantic, which I thought was good enough to trap this. It
turns out that I was wrong - what one needs is gcc -ansi -pedantic-errors.
Trying to build blead with this, I discovered that there are only two places
that don't build with this, which feels like it should be small enough to
fix. One, in perlio.c, is a long standing problem which generates one of the
only warnings from the C code. I'd love to fix it, but doing so will need a
Configure test, or possibly even the removal of the code in question as part
of a major PerlIO implementation cleanup.

The other error was in GDBM_File.xs, and looked trivially simple to fix, by
adding a single cast. So I added it, but much to my surprise and frustration,
this cast broke the build on C++.

So, an aside - why are we testing on C++? After all, Perl 5 is written in C,
not C++. So why the pain of also testing on C++? It's because we support
building C++ extensions, which means that Perl's headers must also be valid
as C++. Most ANSI C is also valid as C++, but a couple of constructions
aren't, and occasionally these slip in. If we don't test for it, we don't
notice when it has happened, and if no-one notices before a release, then
it's too late to fix. Specifically, one slipped into one of the stable
releases of 5.8.x that I made. This isn't a mistake I'd like to repeat.
(But the problem was also in the release candidates. So I'm not losing
sleep over this. Release candidates exist for this sort of reason - to
permit people to test for the things that matter to them, and get problems
addressed before they are made permanent in a release.)

So, how to test with C++? The Configure system doesn't have a way to probe
the location of a C++ compiler (and doesn't need to), as it's only
interested in the C compiler. So the easiest way to automate testing this is to
get the testing environment to name a "C" compiler which is actually a C++
compiler. This does mean that we have to ensure that all the C code is also
conformant C++, but that's generally not too onerous, and often the changes
made result in cleaner code.

So, how to make C++ compilers happy with GDBM_File.xs? I investigated not
adding the cast. This took me down into the rat's nest...

GNU GDBM is stable. Very stable. (But not "specialist biologist word for
stable".) The current version is 1.10. 1.9 was released in 2011, the version
before that, 1.8.3 was released in 2002, while 1.7.3 was released in 1995,
which was between the releases of perl 5.000 and 5.001.

The cast is on the fifth argument to gdbm_open(). The fifth argument is an
optional callback function for fatal errors. Only GDBM is this flexible -
the other four *DBM libraries that the core wraps don't have this last argument.

The GDBM_File module provides a Perl wrapper around GNU GDBM. It's been in
the core since the start. It also turns out that from the start GDBM_File
has *attempted* to provide Perl access to this fifth argument. The XS code
in 5.000 looks like this:

    gdbm_TIEHASH(dbtype, name, read_write, mode, fatal_func = (FATALFUNC)croak)

GDBM_File's typemap file defines FATALFUNC like this:

    FATALFUNC               T_OPAQUEPTR

and in turn the global typemap file defines T_OPAQUEPTR like this:

    unsigned long *         T_OPAQUEPTR

which is a pointer to data, which cannot legally be cast to/from a pointer
to a function under ANSI C, hence why gcc in pedantic mode is getting upset.
But thinking further, this isn't very useful - the XS code is expecting to
pull a *C* function pointer out of the fifth parameter, not a reference to a
Perl function. Searching CPAN suggests that no-one currently uses it. But has
it ever worked? So I set out to test it. Inevitably, this proved more
"interesting" than expected.

So, to test it, we need to determine when the fatal_func callback is
actually called. Digging around the GDBM source code suggests that it's only
called for "should never happen" situations, such as low level reads or
writes failing on an open file descriptor. So I wrote a test that ties a
GDBM file to a hash, figures out the numeric file descriptor that the GDBM
library is using to access it, and then closes that file descriptor, forcing
some sort of write error and hence use of the callback.

The callback *was* called. But then it crashed with a SEGV. GDBM_File.xs
defaults the callback to calling Perl_croak(), but gdb showed that execution
was in a different C function that simply isn't reached from anything
Perl_croak() calls. This shouldn't be possible, yet clearly it happened.

It turns out that the problem is really subtle, and one that I didn't know
about. Perl_croak takes a pointer to a string, followed by a variable
number of arguments:

    void Perl_croak(const char* pat, ...)

The callback is prototyped to take a pointer to a function that takes a
string. That's compatible, right? After all, a variable number of
arguments includes the possibility of zero extra arguments, which this is.

The problem is that the machine-level calling convention for variable-argument
functions can differ from that for fixed arguments. Specifically, on x86_64
(on which I was testing), for a variable-argument function, %rax is set to
the number of arguments passed in floating point registers. I didn't know
any of this, and I doubt that most people do. But the upshot is that on
x86_64, if the compiler knows that it's calling a function with fixed
arguments, it doesn't bother "wasting" an instruction setting %rax to zero,
whereas the function prelude for the called function assumes that %rax is
set to something meaningful, and computes a jump based on it. If %rax
contains out of range garbage, that jump can leap into any nearby code.

So, this means that likely no-one has ever even tried to rely on the
callback function, because it's a booby-trap that will most likely just crash. At
which point it seemed simplest to remove the complications of an unused
feature, and drop the ability to *change* the function used for the
callback, simply always using croak() as the callback.

The observant will notice that there's a second bug here - Perl_croak()
takes a *printf-style format string, whereas there is no guarantee that the
string GDBM passes to the callback does not contain % characters. We can fix
both this bug and the problem with the calling convention by using a small
wrapper function:

    static void
    croak_string(const char *message) {
        Perl_croak_nocontext("%s", message);
    }

and now we're done.

Sadly not. Failing smoke tests reveal that the prototype for the callback
function passed as the fifth argument has changed between GDBM 1.8.3 and
1.9.0, from void (*)() to void (*)(const char *). This distinction doesn't
matter to a C compiler - to it, the former is "arguments unspecified", and
the latter is "one argument specified", but to a C++ compiler, the () of the
former means "zero arguments specified". So if you're compiling with a C++
compiler, the type of the static function above conflicts with the older
prototype, and it's a compile-time error without a cast.

But which cast to use? Really, I wanted to simply prototype my function
with "C" linkage - i.e.

    static "C" void
    croak_string(const char *message) {
        Perl_croak_nocontext("%s", message);
    }

Sadly this isn't allowed. extern "C" makes sense. static "C" is obviously
useless. Except that I've just found a use case for it. Bother!

In the end, I opted to go for conditional compilation based on the
C-pre-processor macros defined by GDBM, instead of something more complex
such as probing. There are just 8 tarballs of GDBM released over the past 18
years, so it was perfectly possible to verify that this approach worked on
all of them.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in the perl git repository
RT #... is a bug in perl's RT bug tracker
CPAN #... is a bug reported against a CPAN distribution on rt.cpan.org
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules

[Hours]		[Activity]
  0.25		DosGlob
 10.00		GDBM_File
  0.75		PERL_HASH
  0.50		POSIX::strptime
  0.25		RT #115910
  3.00		RT #115928
  1.00		Storable
  0.25		Unicode data structures
  3.75		cflags, DosGlob, -std=c89
 23.75		clang
		clang dromedary
  1.00		dist/threads-shared/t/stress.t
  0.25		gcc -ansi -pedantic-errors
  2.75		perl.h, X2P
  0.25		process, scalability, mentoring
 32.50		reading/responding to list mail
  0.25		smoke-me/SuperFast
  1.00		used only once warning
 81.50 hours total

Nicholas Clark
