develooper Front page | perl.perl5.porters | Postings from June 2012

NWCLARK TPF grant May report

Nicholas Clark
June 14, 2012 04:31
NWCLARK TPF grant May report
Possibly the most unexpected discovery of May was determining precisely why
Merijn's HP-UX smoker wasn't able to build with certain configuration
options. The output summary grid looked like this, which is most strange:

O = OK  F = Failure(s), extended report at the bottom
X = Failure(s) under TEST but not under harness
? = still running or test results not (yet) available
Build failures during:       - = unknown or N/A
c = Configure, m = make, M = make (after miniperl), t = make test-prep

v5.15.9-270-g5a0c7e9  Configuration (common) none
----------- ---------------------------------------------------------
O O O m - - 
O O O O O O -Duse64bitall
O O O m - - -Duseithreads
O O O O O O -Duseithreads -Duse64bitall
| | | | | +- LC_ALL = univ.utf8 -DDEBUGGING
| | | | +--- PERLIO = perlio -DDEBUGGING
| | | +----- PERLIO = stdio  -DDEBUGGING
| | +------- LC_ALL = univ.utf8
| +--------- PERLIO = perlio
+----------- PERLIO = stdio 

As the key says, 'O' is OK. It's what we want. 'm' is very bad - it means
that it couldn't even build miniperl, let alone build extensions or run any
tests. But what is strange is that ./Configure ... will fail, but the same
options plus -Duse64bitall will work just fine. And this is replicated with
ithreads - default fails badly, but use 64 bit IVs and pointers and it
works. Usually it's the other way round - the default configuration works,
because it is "simplest", and attempting something more complex such as
64 bit support, ithreads, shared perl library, hits a problem.

As it turns out, what's key is that ./Configure ... contains
-DDEBUGGING. The -DDEBUGGING parameter to Configure causes it to add
-DDEBUGGING to the C compiler flags, and to *add* -g to the optimiser
settings (without removing anything else there). So on HP-UX, with HP's
compiler that changes the optimiser setting from '+O2 +Onolimit' to
'+O2 +Onolimit -g'. Which, it seems, the compiler doesn't accept for
building 32 bit object code (the default) but does in 64 bit. Crazy thing.

Except that, astoundingly, it's not even that simple. The original error
message was actually "Can't handle preprocessed file". Turns out that that
detail is important. The build is using ccache to speed things up, so
ccache is invoking the pre-processor only, not the main compiler, to create
a hash key to look up in its cache of objects. However, on a cache miss,
ccache doesn't run the pre-processor again - to save time by avoiding 
repeating work, it compiles the already pre-processed source. And the
distinction between invoking the pre-processor and then compiling, versus
compiling without a separate pre-processing step, turns out to be key:

    $ echo 'int i;' >bonkers.c
    $ cc -c -g +O2 bonkers.c           
    $ cc -E -g +O2 bonkers.c >bonkers.i 
    $ cc -c -g +O2 bonkers.i
    cc: error 1414: Can't handle preprocessed file "bonkers.i" if -g and -O specified.
    $ cat bonkers.i
    # 1 "bonkers.c"
    int i;
    $ cc -c -g +O2 +DD64 bonkers.c     
    $ cc -E -g +O2 +DD64 bonkers.c >bonkers.i
    $ cc -c -g +O2 +DD64 bonkers.i           
    $ cat bonkers.i                          
    # 1 "bonkers.c"
    int i;

No, it's not just a crazy compiler, it's insane! It handles -g +O2 just
fine normally, but in 32 bit mode it refuses to accept pre-processed input,
whereas in 64 bit mode it accepts it.

If HP think that this isn't a bug, I'd love to know what their excuse is.

A close contender for "unexpected cause" came about as a result of James E
Keenan, Brian Fraser and Darin McBride's recent work going through RT looking
for old stalled bugs related to old versions of Perl on obsolete versions of
operating systems, to see whether they are still reproducible on current
versions. If the problem isn't reproducible, it's not always obvious whether
the bug was actually fixed, or merely that the symptom was hidden. This
matters if the symptom was revealing a buffer overflow or similar security
issue, as we'd like to find these before the blackhats do. Hence I've been
investigating some of these to try to get a better idea of whether we're
about to throw away our only easy clue to a still-present bug.

One of these was RT #6002, reported back in 2001 in the old system as ID
20010309.008. In this case, the problem was that glob of a long filename
would fail with a SEGV. Current versions of perl on current AIX don't SEGV,
but did we fix it, did IBM, or is it still lurking? In this case, it turned
out that I could replicate the SEGV by building 5.6.0 on current AIX. At
which point, I have a test case, so start up git bisect, and the answer
should pop out within an hour. Only it doesn't, because it turns out that
git bisect gets stuck in a tarpit of "skip"s because some intermediate
blead version doesn't build. So this means a digression into bisecting the
cause of the build failure, and then patching Porting/ to
be able to build the relevant intermediate blead versions, so that it can
then find the true cause. This might seem like a lot of work that is used
only once, but it tends not to be: each such fix makes it progressively
easier to bisect more and more problems without hitting snags. Until you
have automated bisection you don't realise how powerful a tool it is. It's
a massive time saver.
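The workflow can be sketched as follows (a hypothetical driver script, not the actual reproduction; Porting/ tooling automates much of this). git bisect run treats exit 0 as "good", 125 as "skip this commit", and any other non-zero status as "bad", which is exactly what lets it step over unbuildable intermediate versions:

```shell
# Sketch of a bisect driver (hypothetical reproduction of the glob SEGV).
# git bisect run convention: exit 0 = good, 125 = skip, other non-zero = bad.
cat > run-case.sh <<'EOF'
#!/bin/sh
./Configure -des -Dusedevel >/dev/null 2>&1 || exit 125  # won't build: skip
make miniperl >/dev/null 2>&1               || exit 125
./miniperl -e 'glob("x" x 500)'             # non-zero (or SEGV) means "bad"
EOF
chmod +x run-case.sh
# In a perl checkout: git bisect start <bad> <good>; git bisect run ./run-case.sh
```

The exit 125 lines are what keep the bisection from getting "stuck in a tarpit of skips" by hand: unbuildable commits are skipped automatically and the search continues.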

But, as to the original bug and the cause of its demise. It turned out to be
interesting. And completely not what I expected:

    commit 61d42ce43847d6cea183d4f40e2921e53606f13f
    Author: Jarkko Hietaniemi <>
    Date: Wed Jun 13 02:23:16 2001 +0000
    New AIX dynaloading code from Jens-Uwe Mager.
    Does break binary compatibility.
    p4raw-id: //depot/perl@10554

The SEGV (due to an illegal instruction) goes away once perl switched to using
dlopen() for dynamic linking on AIX. So my hunch that this bug was worth
digging into was right, but not for the reason I'd guessed.

A couple of bugs this month spawned interesting subthreads and digressions.
RT #108286 had one, relating to the observation that code written like this,
with each in the condition of a while loop:

    while ($var = each %hash) { ... }
    while ($_ = each %hash) { ... }

actually has a defined check automatically added, e.g.

    $ perl -MO=Deparse -e 'while ($_ = each %hash) { ... }'
    while (defined($_ = each %hash)) {
        die 'Unimplemented';
    -e syntax OK

whereas code that omits the assignment does not have defined added:

    $ perl -MO=Deparse -e 'while (each %hash) { ... }'
    while (each %hash) {
        die 'Unimplemented';
    -e syntax OK

Contrast with (say) readdir, where defined is added, and, for the bare
form, an assignment to $_:

    $ perl -MO=Deparse -e 'while ($var = readdir D) { ... }'
    while (defined($var = readdir D)) {
        die 'Unimplemented';
    -e syntax OK
    $ perl -MO=Deparse -e 'while (readdir D) { ... }'
    while (defined($_ = readdir D)) {
        die 'Unimplemented';
    -e syntax OK

Note, this is only for readdir in the condition of a while loop - it
doesn't usually default to assigning to $_.

So, is this intended, or is it a bug? And if it's a bug, should it be fixed?

Turns out that the answer is, well, involved.

The trail starts with a ruling from Larry back in 1998:

    As usual, when there are long arguments, there are good arguments for both
    sides (mixed in with the chaff).  In this case, let's make
        while ($x = <whatever>)
    equivalent to
        while (defined($x = <whatever>))
    (But nothing more complicated than an assignment should assume defined().)

Nick Ing-Simmons asks for a clarification:

    Thanks Larry - that is what the patch I posted does.
    But it also does the same for C<readdir>, C<each> and C<glob> - 
    i.e. the same cases that solicit the warning in 5.004. Is extending
    the defined insertion to those cases desirable?
    (glob and readdir seem to make sense, I am less sure about each).

(it's clarified in a later message that Nick I-S hadn't realised that each
in *scalar* context returns the keys, so it's an analogous iterator which
can't return undef for any entry)

In turn, the "RULING" dates back to a thread discussing/complaining about
a warning added in 5.004:

    $ perl5.004 -cwe 'while ($a = <>) {}'
    Value of <HANDLE> construct can be "0"; test with defined() at -e line 1.
    -e syntax OK

The intent of the changes back then appears to be to retain the 5.003 and
earlier behaviour on what gets assigned for each construction, but change
the loop behaviour to terminate on undefined rather than simply falsehood
for the common simple cases:

    while (OP ...)

and

    while ($var = OP ...)

And there I thought it made sense - fixed in 1998 for readline, glob and
readdir, but introducing the inconsistency because each doesn't default
to assigning to $_. Except, it turned out that there was a twist in the
tail. It turns out that while (readdir D) {...} didn't use to implicitly
assign to $_. Both the implicit assignment to $_ and the defined test were added
in *2009* by commit 114c60ecb1f7, without any fanfare, just like any other
bugfix. And the world hasn't ended.

    $ perl5.10.0 -MO=Deparse -e 'while (readdir D) {}'
    while (readdir D) {
    -e syntax OK
    $ perl5.12 -MO=Deparse -e 'while (readdir D) {}'
    while (defined($_ = readdir D)) {
    -e syntax OK

Running a search of CPAN reveals that almost no code uses while (each %hash)
[and why should it? The construction does a lot of work only to throw it
away], and *nothing* should break if it's changed. Hence it makes sense to
treat this as a bug, and fix it. Which has now happened, but I can't take
credit for it - post 5.16.0, Father Chrysostomos has now fixed it in blead.

To conclude this story, the mail archives from 15 years ago are fascinating.
Lots of messages. Lots of design discussions, not always helpful. And some
of the same unanswered questions as today.

The second digression arose from trying to replicate another old bug (ID
20010918.001, now #7698). I'd dug an old machine with FreeBSD 4.6 out from
the cupboard under the stairs in the hope of reproducing the period problem
with a period OS. Sadly I couldn't do that, but out of curiosity I tried to
build blead on it. This is the same 16M machine whose swapping hell prompted
my investigation of enc2xs the better part of a decade ago, resulting in
various optimisations to its build-time memory use, that in turn led to ways
to roughly halve the size of the built shared objects, and a lot of the
material then used in a tutorial I presented at YAPC::Europe and The German
Perl Workshop, "When Perl is not quite fast enough".

Once again, it descended into swap hell, this time on mktables. (And with
swap on all 4 hard disks, it's very effective at letting you know that it's
swapping.) Sadly after 10 hours, and seemingly nearly finished, it ran out
of virtual memory. So I wondered if, like last time, I could get the memory
usage down. After a couple of false starts I found a tweak to Perl_sv_grow
that gave a 2.5% memory reduction on FreeBSD (but none on Linux), but that
wasn't enough. However, the cleanly abstracted internal structure of
mktables makes it easy to add code to count the memory usage of the various
data structures it generates. One of its low-level types is "Range", which
subdivides into "special" and "non-special". There are 368676 of the latter,
and the name for each may need to be normalised into a "standard
form". The code was taking the approach of calculating the standard form at
object creation time. With the current usage patterns of the code, this
turns out to be less than awesome - the standard form is only requested for
22047 of them. By changing the code to calculate only when needed (and cache
the result) I reduced RAM and CPU usage by about 10% on Linux, and 6% on
FreeBSD. Whilst the latter is smaller, it was enough to get the build
through mktables, and on to completion. The refactoring is now merged to
blead, post 5.16.0. Hopefully everyone's build will be a little bit smaller
and a little bit faster as a result.

To complete the story, I should note that make harness failed with about 100
tests still to run, snatching defeat from the jaws of victory. Turns out
that *that* also chews a lot of memory to store test results. make test,
however, did pass (except for one bug in t/op/sprintf.t, patch in RT
#112820). Curiously gcc, even when optimising, isn't the biggest memory hog
of the build. It's beaten by mktables, t/harness and a couple of the Unicode
regression tests. But even then, our build is very frugal. It should
complete just fine with 128M of VM on a 32 bit FreeBSD system, and I'd guess
under 256M on Linux (different malloc, different trade offs).  I think that
this means that blead would probably build and test OK within the hardware
of a typical smartphone (without swapping), if they actually had native
toolchains. Which they don't. Shame :-(

Part of May was spent getting a VMS build environment set up on the HP Open
Source cluster, and using it to test RC1 and then RC2 on VMS.

Long term I'd like to have access to a VMS environment, not to actually do
any porting work to VMS, but to permit refactoring of the build system
without breaking VMS. George Greer's smoker builds the various smoke-me
branches on Win32, so that makes it easy to test changes that would affect
the Win32 build system, but no such smoker exists for VMS. Hence historically
I've managed to do this by sending patches to Craig Berry and asking him
nicely if he'd test them on his system, but this is obviously a slow,
inefficient process that consumes his limited time, preventing him using it
to instead actually improve the VMS port.

As the opportunity to get access turned up just as 5.16.0 was nearing
shipping, I decided to work on getting things set up "right now" to try to
get (more) tests of the release candidates on VMS. We discovered various
shortcomings in the instructions in README.vms, and as a side effect of
debugging a failed build, a small optimisation to avoid needless work when
building DynaLoader. So it's likely that my ignorance will continue to be a
virtue by finding assumptions and pitfalls in the VMS process that the real
experts don't even realise that they are avoiding subconsciously.

We had various scares just before 5.16.0 shipped relating to build or test
issues on Ubuntu, specifically on x86_64. This shouldn't happen - x86_64
GNU/Linux is probably the most tested platform, and Ubuntu is a popular
distribution, so it feels like there simply shouldn't be any more bugs
lurking. However, it seems that they keep breeding.

In this case, it's yet another side effect of Ubuntu going
multi-architecture, with the result that the various libraries perl needs to
link against are now in system dependent locations, instead of /usr/lib.
This isn't a problem (well, it wasn't once we coded to cope with it) - we ask
the system gcc where its libraries are coming from, and use that library
path. The raw output from the command looks like this:

$ /usr/bin/gcc -print-search-dirs 
install: /usr/lib/gcc/x86_64-linux-gnu/4.6/
programs: =/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/bin/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/bin/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/bin/
libraries: =/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../lib/:/lib/x86_64-linux-gnu/4.6/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/4.6/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../:/lib/:/usr/lib/

So the hints file processes that, to get the search path. It runs this
pipeline of commands:

$ /usr/bin/gcc -print-search-dirs | grep libraries | cut -f2- -d= | tr ':' '\n' | grep -v 'gcc' | sed -e 's:/$::'

Except that all of a sudden, we started getting reports of build failures
on Ubuntu. It turned out that no libraries were found, with the first problem
being the lack of the standard maths library, hence miniperl wouldn't link.
Why so? After a bit of digging, it turns out that the reason was that the
system now had a gcc which localised its output, and the reporter was running
under a German locale.

So, here's what the hints file sees under a German locale:

$ export LC_ALL=de_AT
$ /usr/bin/gcc -print-search-dirs | grep libraries | cut -f2- -d= | tr ':' '\n' | grep -v 'gcc' | sed -e 's:/$::'

Oh dear, no libraries. Why so?

$ /usr/bin/gcc -print-search-dirs installiere: /usr/lib/gcc/x86_64-linux-gnu/4.6/
Programme: =/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/bin/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/bin/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/bin/
Bibliotheken: =/usr/lib/gcc/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../x86_64-linux-gnu/4.6/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../lib/:/lib/x86_64-linux-gnu/4.6/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/4.6/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../../x86_64-linux-gnu/lib/:/usr/lib/gcc/x86_64-linux-gnu/4.6/../../../:/lib/:/usr/lib/

Because in the full output, the string we were searching for, "libraries",
isn't there. It's now translated to "Bibliotheken".

Great. Unfortunately, there isn't an alternative machine readable output
format offered by gcc, so this single output format has to make do for humans
and machines, which means that the thing that we're parsing changes.

This is painful, and often *subtle* pain because we don't get any indication
of the problem at the place where it happens. In this case, a failure in
the hints file doesn't become obvious until the end of the link in the build.

The solution is simple - force the locale to "C" when running gcc in a
pipeline. But it's whack-a-mole fixing these. It would be nice if more
tools made the distinction that git does between porcelain (for humans), and
plumbing (for input to other programs).

The second Ubuntu failure report just before 5.16.0 was for t/op/filetest.t
failing. It turned out that the test couldn't cope with a combination of
circumstances - running the test as root, *but* the build tree not being
owned by root, *and* the file permissions being such that other users
couldn't read files in the test tree. This all being because testing that -w
isn't true on a read only file goes wrong if you're root, so there's
special-case code to detect if it's running as root, which temporarily
switches to an arbitrary non-zero UID for that test. Unfortunately it also
had a %Config::Config based skip within that section, and the read of
obscure configuration information triggers a disk read from lib/, which
fails if the build tree's permissions just happened to be restrictive. The
problem had actually been around for quite a while, so Ricardo documented it
as a known issue and shipped it unchanged.

So post 5.16.0, I went to fix t/op/filetest.t. And this turned into quite a
yak shaving exercise, as layer upon layer of historical complexity was
revealed. Originally, t/op/filetest.t was added to test that various file
test operators worked as expected. (Commit 42e55ab11744b52a in Oct 1998.) It
used the file t/TEST and the directory t/op for targets. To test that
read-only files were detected correctly, it would chmod 0555 TEST to set it
read only.

The test would fail if run as root, because root can write to anything. So
logic was added to set the effective user ID to 1 by assigning to $> in an
eval (unconditionally), and restoring $> afterwards. (Commit
846f25a3508eb6a4 in Nov 1998.) Curiously, the restoration was done after the
test for C<-r op>, rather than before it.

Most strangely, a skip was then added for the C<-w op> test based on
$Config{d_seteuid}. The test runs after $> has been restored, so should have
nothing to do with setuid. It was added as part of the VMS-related changes
of commit 3eeba6fb8b434fcb in May 1999. As d_seteuid is not defined on VMS,
this makes the test skip on VMS.

Commit 15fe5983b126b2ad in July 1999 added a skip for the read-only file
test if d_seteuid is undefined. Which is actually the only test where having
a working seteuid() *might* matter (but only if running as root, so that $>
can be used to drop root privileges).

Commit fd1e013efb606b51 in August 1999 moved the restoration of $> earlier,
ahead of the test for C<-r op>, as that test could fail if run as root with
the source tree unpacked with a restrictive umask. (Bug ID 19990727.039)

"Obviously no bugs" vs "no obvious bugs". Code that complex can hide
anything. As it turned out, the code to check $Config{d_seteuid} was
incomplete, as it should also have been checking for $Config{d_setreuid} and
$Config{d_setresuid}, as $> can use any of these. So I refactored the test
to stop trying to consult %Config::Config to see whether root assigning to
$> is going to work - just try it in an eval, and skip if it didn't. Only
restore $> if we know we changed it, and as we only change it from root, we
already know which value to restore it to.

Much simpler, and it avoids having to duplicate the entire logic of which
probed Configure variables affect the operation of $>.

Finally, I spotted that I could get rid of a skip by using the temporary
file the test (now) creates rather than t/TEST for a couple of the tests.
The skip is necessary when building "outside" the source tree using a
symlink forest back to it (./Configure -Dmksymlinks), because in that case
t/TEST is actually a symlink.

So now the test is clearer, simpler, less buggy, and skips less often.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in
RT #... is a bug in
CPAN #... is a bug in
BBC is "bleadperl breaks CPAN" - Andreas König's test reports for CPAN modules
ID YYYYMMDD.### is a bug number in the old bug system. The RT # is given
  afterwards. You can look up the old IDs at

[Hours]		[Activity]
  0.50		?->
  1.25		AIX bisect
  0.75		AIX ccache
  1.00		HP-UX 32 bit -DDEBUGGING failure
  1.50		ID 20000509.001 (#3221)
  0.25		ID 20010218.002 (#5844)
  1.50		ID 20010305.011 (#5971)
  1.25		ID 20010309.008 (#6002)
  0.25		ID 20010903.004 (#7614)
  0.50		ID 20010918.001 (#7698)
  0.50		ID 20011126.145 (#7937)
  0.50		IO::Socket::IP
  3.75		RT #108286
  0.25		RT #112126
  0.25		RT #112732
  0.50		RT #112786
  0.75		RT #112792
  0.75		RT #112820
  1.00		RT #112866
  0.50		RT #112914
  0.75		RT #112946
  0.50		RT #17711
  0.25		RT #18049
  0.25		RT #29437
  0.25		RT #32331
  0.25		RT #47027
  0.50		RT #78224
  0.50		RT #94682
  0.50		Ubuntu link fail with non-English locales
  4.25		VMS setup/RC1
  7.00		VMS setup/RC2
  1.00		clarifying the build system
  1.00		installhtml
  4.50		mktables memory usage
  1.25		process, scalability, mentoring
 46.25		reading/responding to list mail
  1.75		smoke-me branches
  0.25		smoke-me/trim-superfluous-Makefile
  5.25		t/op/filetest.t
  1.00		t/porting/checkcase.t
  0.25		the todo list
  1.00		undefined behaviour from integer overflow
 96.00 hours total

Nicholas Clark
