develooper Front page | perl.perl5.porters | Postings from May 2012

NWCLARK TPF grant April report

Thread Next
From:
Nicholas Clark
Date:
May 15, 2012 03:20
Subject:
NWCLARK TPF grant April report
Message ID:
20120515102002.GP37285@plum.flirble.org
The largest part of of the month was spent on various topics related to
cleaning up the build process. In particular, simplifying the top-level
Makefile, by removing duplication and self-recursion.

On platforms that run Configure (so Linux, Unix, Cygwin and OS/2), the
Makefile is extracted from a shell script Makefile.SH. Most of it is written
out verbatim, but there is some shell logic to generate OS or configuration
specific code.

In turn that Makefile is used by makedepend to generate a second Makefile
containing dependency information (named makefile on most systems, and
GNUMakefile on OS X - both names chosen so that the make utility picks the
newer makefile in preference).

The makefile compiles all the object files and links them into miniperl,
which is pretty much the real perl, only without any form of dynamic
linking. Between 5.005 and 5.6.0 things got a bit more complex, because perl
switched to using File::Glob to implement globbing. [File::Glob uses the BSD
glob code, written by that well-known Perl contributor Guido van Rossum :-),
safely quarantined in its own file to avoid licensing cross-contamination].
As File::Glob needs miniperl to build, and miniperl isn't built yet, to
solve the bootstrapping problem op.c is conditionally compiled as opmini.o,
and that instead calls the old "shell out to csh" globbing code. This need
to link with opmini.o instead of op.o becomes relevant later.

miniperl bootstraps the rest of the build, and is used as much as possible
to avoid writing any subsequent build system 3 (or more) times - shell,
batch files and DCL scripts. In particular, miniperl builds DynaLoader
and perlmain.c, and links both with all those already compiled object files
to make the real perl.

There is a configuration option to build a shared perl library. In this
case, the C code is compiled to be position independent, and linked into a
shared libperl, with the perl binary just a small stub linked against it.
This also becomes relevant later, as libperl.so (or .dylib etc) contains the
code for op.o, but not opmini.o.

Currently in blead, the Makefile has 20 places where it calls $(MAKE) to run
a different target in the top level directory. Of these, this one is the
most troubling:

	$(LDLIBPTH) $(RUN) ./miniperl$(HOST_EXE_EXT) -w -Ilib -MExporter -e '<?>' || $(MAKE) minitest

as it can cause a fork bomb with a parallel make if miniperl builds but
fails to work. Having been bitten by that fork bomb one time too many, I
decided to eliminate it. However, that above line is actually replicated in
4 different places, 3 OS specific and the generic fallback. Hence the first
digression into yak-shaving - reduce that number.

So, we have AIX (and BeOS), NeXT, OS X and "general". 3 of those are very
much alive, so we can't just kill the code...

Firstly, AIX.

On AIX, when building with a shared perl library, one needs the list symbols
that library should export when building it. For 5.005 and earlier, that
list was generated by a shell script, so no bootstrapping problem.

However, the export list had to be manually updated in the shell script, so
with commit 549a6b102c2ac8c4 in Jul 1999, the Win32 solution was generalised
to work on both AIX and Win32. This uses Perl script, makedef.pl to parse
various files and generate the current export list automatically. This
introduces a bootstrapping problem - miniperl is needed to generate
libperl.a, but libperl.a is needed to generate miniperl. This commit solves
the problem by introducing a new target, MINIPERL_NONSHR, which builds a
special staticly-linked miniperl (named miniperl_nonshr) in order to run
makedef.pl, which in turn permits libperl.a and the regular miniperl to be
built, and the build to proceed.

All was well until commits 52bb0670c0d245e3 and 8d55947163edbb9f (Dec 1999)
changed the default for CORE::glob() to use the XS module File::Glob, and
linked miniperl against an opmini.o, built from op.c but with compiler flags
to use the old glob-via-csh code. The change made for AIX was to build
*miniperl_nonshr* with the bootstrapping glob code, but leave the build of
miniperl unchanged. This broke the build on AIX - miniperl would build just
fine, but would fail to build any XS extensions, as the ExtUtils::MakeMaker
code requires working globbing.

The AIX build was fixed with commit 18c4b137c9980e71 (Feb 2000) by changing
Makefile.SH so that AIX used the same rules to build miniperl as NeXT. The
rules for NeXT generated miniperl from an explicit list of object files,
instead of using libperl.a. The result of this change was that on AIX,
miniperl was now identical to miniperl_nonshr. Both correctly use the csh
globbing code, but now neither require the shared libperl.a to work.

This makes miniperl_nonshr redundant. So I eliminated it. This starts to
converge the code for AIX with the other platforms.

The default case:

Curiously, as a side effect of commit 908fcb8bef8cbab8 (Dec 2006) which
moved DynaLoader.o into libperl.so, the default platform build rules for
miniperl were changed to use an explicit list of object files, instead of
C<-lperl>, which had the side effect of building miniperl non-shared. Hence
all platforms now have a non-shared miniperl when building perl with a
shared perl library.

Next, OS X:

Commit cb3fc4263509f28c (May 2003) removed the use of -flat_namespace from
the link flags, but added it specially in Makefile.SH for miniperl, so that
symbols from opmini.o overrode those in libperl.dylib. However, a side
effect of commit 908fcb8bef8cbab8 (Dec 2006) was to change the linker line
to use explicit object files, meaning that op.o was no longer part of
linking, meaning that the override is no longer needed. Hence darwin's link
does not need special-casing.

Lastly, NeXT:

Whilst it's not clear whether anyone is still using NeXT, since the previous
changes have refactored the other 3 cases to use an explicit list of object
files, the only difference remaining between the makefile rule for NeXT and
the rest is that next doesn't use $(CLDFLAGS) when linking miniperl, whereas
all the other do. It's not clear if this difference is significant or
accidental, but lacking any NeXT system to test on, it was simple enough to
preserve the difference but use simpler code to implement it.

The result - all 4 cases are merged. However, I've not yet actually
eliminated that particular troublesome C<|| $(MAKE) ...> rule, as to test it
properly requires C<make minitest> to pass first time, without first
building all the non-XS extensions - work I've made progress on, but not yet
completed.

As AIX now uses regular miniperl to run makedef.pl, the AIX-specific section
of the Makefile now starts to look much more similar to the OS/2 specific
section that runs makedef.pl there.  With some refactoring of makedef.pl to
avoid needing a command-line parameter to specify the platform to build for,
and passing in a -DPERL_DLL on AIX which does nothing, the two converge even
further. At which point it was a small matter of using Makefile macros to
encode the different dependencies and filenames used, at which point the two
can share code, and Makefile.SH gets simpler.

In turn, I also eliminated a lot of redundancy in the various variant
install targets (each of which had been implemented with a call to $(MAKE)
in the same directory), and the pre-requisites for various debugging
targets.

One of these needs to pass a flag to installperl to instruct it to run strip
on the installed binaries. installman doesn't need to strip anything, so
doesn't accept such a flag. As it's the only difference between the
invocations of the two, I decided that the simplest option was actually to
make installman accept a --strip flag and do nothing. This, as ever, isn't
as simple as it seems. installperl's strip flag is currently -s, but
installman uses Getopt::Long so accepts -s as an abbreviation for --silent
Hence the best solution is to have both accept a long-option --strip.  This
means refactoring installperl to use Getopt::Long, which in turn is "fun"
because it accepts both +v and -v as distinct options, which Getopt::Long
can't support. Or, more pragmatically, can't *directly* support, in the
general case. Fortunately installperl isn't the general case, as it takes no
non-option arguments, which permits a reasonably simple solution.

With this, more duplication died. En route, I spotted and eliminated some
dead code in installperl related to a 5.000->5.001 change - specifically
where in @INC autosplit installs files. Small instances of this sort of
cruft accumulate all over the source tree, but generally the code is never
annotated sufficiently as to the purpose to make it obvious that it had a
specific time-limited purpose. And of course, it's maybe only 1% of the
slightly "look twice and wonder why" code that is actually redundant - most
is subtly useful on some platform or configuration corner case that is hard
to test, but likely still needed somewhere.

This work is in the branches smoke-me/Makefile-miniperl-unification,
smoke-me/Makefile-norecurse, smoke-me/make_ext and smoke-me/perlmodlib

Makefile.SH is a lot better than it, although there's still some more to do
that will make it simpler still. As well as fixing minitest, still to do are
some further simplifications of how ./miniperl is invoked to run various
utilities. Most of the Makefile command lines have -Ilib -Idist/Cwd
-Idist/Cwd/lib, dating back to the time when Cwd was detangled from ext/ and
lib/ into a single directory in dist/ with the same layout as the CPAN
distribution. However, since then I refactored miniperl to use the
sitecustomize code to set up the build @INC automatically, meaning that
everything after that first -Ilib is now redundant. So that's still to
clean.


As ever, the age and gnarliness of the codebase, combined with the
complexities resulting from being able to build various different
configurations on many different platforms means that often I spend a lot of
time investigating things, but little code changed as a result.

The most obvious example of this was while investigating an unrelated
problem, happening to use valgrind on OS X, and discovering that it was
reporting an error during interpreter startup [specifically down from
S_init_postdump_symbols()]. Strange and troubling - strange because I
thought we'd nailed all of these, and troubling because interpreter start up
is a fairly common code path. As in, 100% of programs traverse it. So it's
important not mess up and possibly open the door for malice. Except that the
more I dug the more this seemed to be S.E.P.* - either a bug in gcc or in
valgrind. I'm aware of Klortho's advice on this matter:

#11907 Looking for a compiler bug is the strategy of LAST resort. LAST resort.

but it did really seem to be a bug somewhere in the toolchain - malloc
allocating 142 bytes, and then memmove attempting to read 8 bytes from
offset 136, 6 before the end of the 8*n+6 sized block. So I noted the
symptoms as RT #112198 in case anyone else hit them and searched for them,
and then rejected the ticket, thinking I was done.

Of course, I wasn't. Tony Cook recognised the symptoms as being the same as
a Mozilla bug: https://bugzilla.mozilla.org/show_bug.cgi?id=710438#c3 which
leads to a valgrind bug: https://bugs.kde.org/show_bug.cgi?id=285662 which
had been marked as resolved in their svn trunk. I've built valgrind from svn
and verified that their r12423 genuinely fixes it.

While on the subject of Advice from Klortho #11907, I investigated further
the failure of HP-UX to build blead, failing on File::Glob with very similar
symptoms to a mysterious Cygwin failure - lib/File/Glob.pm is reported as
containing syntax errors, because a postfix when is not recognised as being
enabled syntax - the use feature 'switch'; is being ignored.

However, in this case it turns out that the bug isn't the same as Cygwin
(lack of a suitable cast in our code), but instead similar to an earlier bug
on AIX. In that case, a compiler bug in handling the C99 bool type with
paired ! operators caused the control V byte in the name $^V not to be
considered a control character. In this case, a compiler bug with bool and
the && operator** caused features never to be enabled. Again, worked round
with a trivial change in a header file in how we express a construction. It's
a bit frustrating, but it does seem that 12 years isn't really enough time
for the compiler writers to get all the bugs out of these new fangled bool
thingymabobs. Stay tuned next month for further fun with HP's compiler.


I investigated what gcc's -ftrapv flag reveals about the prevenance of
signed integer overflow in the core C code. Unsigned integer overflow in C
is well defined - it wraps. *Signed* integer overflow, however, is undefined
behaviour (not just implementation defined or unspecified) so really isn't
something we should be doing. Not having to care about undefined behaviour
gives C compilers the loopholes they need to optimise conformant code.
Whilst right now compilers usually end up exposing the behaviour of the
underlying CPU (which is nearly always 2s complement these days) the core's
code is making this assumption, but should not.

The -ftrapv flag for gcc changes all operations that might caused signed
overlow to trap undefined behaviour and abort(). This, of course, is *still*
a conformant compiler, because undefined behaviour is, well, undefined. (Be
glad that it didn't format your hard disk. That's fair game too, although
one hopes that the operating system stops that.)

It turns out that there's quite a lot of bad assumptions in the code about
signed integer overflow. These seem to fall into three groups

0) The code used to implement the arithmetic operators under use integer;

1) The assumption that the signed bit pattern for the most negative signed
   value, IV_MIN, is identical to the bit pattern for the unsigned negation
   of that value. ie on a 32 bit system, the IV -2147483648 and the
   UV 2147483648 have the same bit representation

2) A lot of code in the regular expression engine uses the type I32

The regular expression engine code is complex. It only causes aborts under
-ftrapv on a 32 bit system, probably due to type promotion on 64 bit systems
for the problematic expressions, such that the values don't overflow.
However, it's clear that in places the engine is relying on sentinel values
such as -1, and it's not clear whether these are tightly coupled with
pointer arithmetic, so a brute force approach of trying to "fix" the problem
by converting I32 to SSize_t is likely to cause far more bugs than it fixes.

Sadly the use integer code sadly is doing exactly what it's documented to do
- "native integer arithmetic (as provided by your C compiler) is used" And
as your C compiler's integer arithmetic is undefined on signed overflow, so
will your Perl code be. So, we have to accept this is going to raise the ire
of -ftrapv. However, this causes problems when testing, as code to test it
will abort, and it's not possible to make exceptions from -ftrapv on a
function by function basis. So the best solution to this seems to be to
break out the integer ops into their own file, and set the C compiler flags
specially for that file (which we already have the infrastructure for).

The middle set of code turns out to be relatively easy to fix. In part it
can be done by changing conversions between IV_MIN and its negation from
obvious-but-overflowing terse expressions to slightly longer expressions
that sidestep the overflow. Most of the rest can be fixed by bounds checking
- negating values in range, and explicitly using IV_MIN when negating the
value 1-(IV-MIN+1). One surprise was the code used to avoid infinite looping
when the end point of a .. range was IV_MAX, which needs a special case, but
the special case in turn was written assuming signed integer overflow. That
code is fixed in smoke-me/iter-IV_MAX, the rest in smoke-me/ftrapv, and some
related cleanup of pp_pow in smoke-me/pp_pow. All will be merged to blead
after 5.16.0 is released.

To continue this, I investigated using CompCert to compile Perl. 
CompCert is a "verified compiler, a high-assurance compiler for almost all
of the ISO C90 / ANSI C language, generating efficient code for the PowerPC,
ARM and x86 processors." -- http://compcert.inria.fr/

However, its description of "almost all" is accurate - it doesn't include
variable argument lists, which is kind of a show stopper for us. However, as
an experiment I tried out running Configure with it to see what happened.
Result! We generate an invalid config.h file. So that's now RT #112494, with
a fix ready to go in post 5.16.0


I also spent some time investigating precisely how and when we lost support
for using sfio (in place of stdio). It turns out that the build of miniperl
is restored by 7 one-line fixes (now preserved for posterity in the branch
nicholas/misc-tidyup as they're not relevant to getting 5.16.0 released).
More usefully, it revealed a section of Storable.xs which is no longer
needed, so that's 14 lines of code to kill. However, this doesn't mean that
sfio is nearly working and about to be useful again. With this fixed, the
build fails thanks to a strange issue which truncates some Makefiles' output
by ExtUtils::MakeMaker at 8192 bytes. I dug further, because I wanted to be
sure it wasn't the only known symptom of some mysterious core buffering
bug. Based on some historical anomalous CPAN smoker results, I'm suspicious
that we may well have one. It turns out, fortunately, that in this case it's
not our bug. The cause was far more interesting - something I'd never even
considered. In C, a pointer to the element immediately after an array is
valid for pointer arithmetic. (But, obviously, not the value it points
to. So don't dereference it.) The usual use of this is to store a buffer end
pointer - increment the active pointer, and if it's equal to the buffer end,
do something.

However, what's not obvious from that is that the memory address immediately
after the array might *also* be the start of some other array that the
program holds a pointer to. The bug was that sfio's "do something" (ie empty
the write buffer) was deferred until the next call that wrote to the
stream. However, ahead of that check was a check related to sfreserve(),
which permits user code to write directly into the buffer. This is
implemented by handing out a pointer to the internal "next write" pointer,
and the user code signals that it is done by calling sfwrite() with this
(internal) pointer. The special case was intended to express "is the passed
in write-from pointer identical to the internal next write pointer?"
However, that check breaks if it just happens that the internal buffer is
full ("next write" points one beyond the buffer), *and* the user's buffer
just happens to be in the very next byte of memory. Which turns out to be
possible on FreeBSD, where malloc() is able to allocate contiguous blocks of
memory. So, fortunately, this isn't our bug. I've reported it to Phong Vo,
who confirms that it's a bug in sfio.

Even with this fixed locally, a perl built with sfio fails even *more* tests
than one built without PerlIO enabled. Unless someone external with an
ongoing interest in sfio works to fix these issues, the days of the current
sfio code are numbered. Dead non-working code just gets in the way.


A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in http://perl5.git.perl.org/perl.git
RT #... is a bug in https://rt.perl.org/rt3/
CPAN #... is a bug in https://rt.cpan.org/Public/
BBC is "bleadperl breaks CPAN" - Andreas K├Ânig's test reports for CPAN modules
ID YYYYMMDD.### is an bug number in the old bug system. The RT # is given
  afterwards.

[Hours]		[Activity]
  0.25		'd' flags (on valid_utf8_to_uv{chr,uni})
  0.75		AIX bisect
  5.25		AIX make bug
		AIX make bug (ccache)
		AIX make bug (https://bugzilla.samba.org/show_bug.cgi?id=8906)
		AIX make bug (xlc -DDEBUGGING segv)
  0.50		DL shared library (RT #40652)
  6.50		HP-UX build failure (cc bug)
  0.25		HP-UX cc bug
  7.25		ID 20000721.002 (#3561)
  0.25		ID 20010801.037 (#7425)
  1.00		ID 20020513.013 (#9319)
  2.50		LDLIBPTH
  0.25		MSDOS died with 5.000, DJGPP only appeared in 5.004
  2.00		Makefile improvements
		Makefile improvements (VMS)
  7.00		Makefile-norecurse
 11.00		Makefile.SH miniperl rule unification
  0.25		Pod::PerlDoc
  3.50		RT #108286
  0.50		RT #112312
  4.00		RT #112350
  1.25		RT #112370
  0.25		RT #112404
  0.25		RT #112478
  1.25		RT #112494
  2.00		RT #112504
  0.25		RT #112536
  2.25		RT #24250
  0.50		RT #33159
  0.25		RT #36309
  0.50		RT #40652
  1.50		Solaris -xO3 failure
  0.75		Solaris bisect.
		Solaris bisect. (-xO3)
  0.50		Testing on OpenBSD
 12.00		bisect.pl
		bisect.pl (HP-UX and AIX)
		bisect.pl (HP-UX build failure (cc bug))
		bisect.pl (user fixups)
		bisect.pl (valgrind)
  8.50		build process [autodoc, perlmodlib, $(Icwd)]
  0.25		cross compilation
  3.50		embed_lib.pl
  5.50		installman
  6.50		minitest cleaner
  1.50		pp_pow
  0.50		process, scalability, mentoring
 24.50		reading/responding to list mail
  4.00		review davem/re_eval
  7.50		sfio
  2.50		split %SIG panic
  1.50		the todo list
  9.50		undefined behaviour caught by gcc -ftrapv
  3.75		valgrind error on OS X (RT #112198)
======
156.00 hours total

Nicholas Clark

*  http://en.wikipedia.org/wiki/SEP_field
** to be more accurate, an expression involving a variable, ||, && and a
   function returning bool.

Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About