develooper Front page | perl.perl5.porters | Postings from June 2013

NWCLARK TPF grant April report

Thread Next
Nicholas Clark
June 10, 2013 13:22
NWCLARK TPF grant April report
Message ID:
As per my grant conditions, here is a report for the April period.

I started the month looking at the Unicode Names code. At Karl's suggestion
I changed it to parse the UnicodeData.txt file properly. Previously it had
hardcoded various constants, particularly related to the CJK ideographs and
Hangul syllables. The CJK ranges in Unicode have increased in the past, and
so it's possible that they will increase again. Not only is it (more) future
proof, it also made it simpler to detect significant gaps in the allocated
character ranges, which is useful for size optimisations. By handing the
gaps better I reduced the data size by 13K, and by using two sizes of arrays
for the trie structure, saved a further 25K.

The intent of all this is to provide the data needed for the \N{} syntax
directly as C code and static data, to avoid the tokeniser needing to load
charnames if it sees \N{}. Given that the C code in question is generated by
Perl, but to compile the Perl you need the C code, there's a potential
bootstrapping problem here. Not wishing to ship >1M of generated code if
avoidable, I experimented to see whether the \N{} escape syntax is needed by
miniperl. It turns out that if you replace the \N{} handler by abort() in
miniperl, you can still build perl perfectly. Excellent! Also, telling the
Makefile to build distinct toke.o and tokemini.o is a one line change - it's
nice when easy things *are* easy.

Frustratingly the work is not yet ready to merge into blead, as it's not yet
finished enough, and other things keep taking priority.

We had a bit of a digression involving perl on s390. Merijn was given a CD
for OpenSUSE for s390, soon had it running on the emulator "Hercules". What
does one do next? Obviously - try to build Perl. So he build blead (which
worked) and ran the tests (which mostly worked). My initial reaction was:

    Is anyone actually using Perl on it?

    In that, we've not had any bug reports about this before, and for a lot
    of these somewhat esoteric platforms I'm sort of wondering at what point
    do they stop being "fun", and start being "work". Right now, I think
    it's more at the "fun" level, and it might be telling us something
    interesting about portability, as it might be that all these tests
    failing are down to the same problem.

The problems all seemed to be involve conversion of large floating point
values to integers. Which we do quite a lot of, and should work, but
historically we've been skirting the boundary of what's conformant ANSI C,
sometimes on the wrong side. So Merijn and I exchanged code and results as I
tried to remote debug what the cause was. We tried to establish whether it
was even handling large unsigned integers correctly (it was). We tried to
rule out unions and ? : ternaries (which have confused compilers in the
past). Nope. In the end, we ascertained that it was a bug in the supplied
gcc 4.3.4 - it generated bad code for casting from unsigned integers to

At which point Niko Tyni replied that the particular problem was already
diagnosed as a compiler bug, and had been fixed. Debian was building on s390
with gcc 4.6.3, and he believed that gcc 4.4.7 was fixed.

So that all ended up being rather a waste of time, thanks to the
distribution's continued use of an obsolete and buggy compiler. Particularly
frustrating given that a fix exists in a version that is at least a year

A significant development this month was having serious second thoughts
about $* and friends. As mentioned in last month's report, everything smoked
fine, so the change was merged to blead. Only then did the problems emerge.
Specifically Father Chrysostomos demonstrated rather succinctly that the
core's tests weren't comprehensive enough. The tests correctly verified that
using any of @*, &* ** and %* generated the desired deprecation warning. But
the warning was also generated by *{*}, *{"*"} and C<$_ = "*"; *$_>, none of
which "need" to be deprecated. Nothing tested these, so nothing noticed that
they now warned. This was because I'd adapted the code used to warn for $*,
and made it warn for all. However, there's a significant difference. The
"magic" functionality of $* (globally setting multiline matching) was what
was deprecated and then removed. It was nothing to do with parsing the two
characters $* as a punctuation variable, hence the warning needed to be
triggered by any *use* of the scalar variable *, independent of what syntax
was used to assign to it. For this reason, the best place to inject that
warning was the code which creates typeglobs and used to set up the magic
which made $* work. As that code is dealing in typeglobs, it already had
logic to determine whether the request was for the SCALAR slot or one of the
others, so it was simple to extend it to warn for all slots, extending
warnings from $* to all variables named *.

Simple, obvious and wrong.

The error being that the intent was to deprecate the *syntax* @* etc, not
use of the variable itself. Hence right place to insert a deprecation would
be in the parser. Specifically, toke.c. 12151 lines of horror most aptly
summarised as 'It all comes from here, the stench and the peril.'

Strangely for toke.c, it seems that it's actually fairly easy to deprecate
the parsing of @* etc. Tokens are parsed by a routine S_scan_indent() which
is relatively self-contained, and the control flow around it is also fairly
clear. So a warning can be issued by adding another parameter to that routine,
and only setting it true from the 4 places in the parser that deal with things
starting '@', '&', '*', and '%' respectively. This worked.

However, the seconds thoughts went deeper than that. I think that even this
approach is wrong on two further levels.

Firstly the intent was to enable syntax of the form @*foo and %*bar.
Having @*foo and %*bar would seem to imply that one can't also have @* or %*.

What hadn't sunk is is that we have both $# and $#foo (and $#$foo), and there
doesn't seem to be a parsing problem with this. There *is* some special casing
for which punctuation vars $#... will work on, notably only @+, @- and @@:

        if (s[1] == '#' && (isIDFIRST_lazy_if(s+2,UTF) || strchr("{$:+-@", s[2]\
))) {
            PL_tokenbuf[0] = '@';

and, unlike most of Perl 5, recognising $#... is space sensitive:

    $ perl -le '$foo = [1..3]; $# = \*STDERR; print $#{$foo}'
    $# is no longer supported at -e line 1.
    $ perl -le '$foo = "bar"; %# = (bar => "Baz"); $# = \*STDERR; print $# {$foo}'

(in the latter, $# {$foo} is a hash lookup for key $foo of hash %#)

but it generally works without surprising anyone.

As best I can figure out, one could add @*foo, &*foo, **foo, %*foo $*foo
without removing anything or breaking any sane code on CPAN. The only code
which I think would change behaviour is that is either using $* as the
variable for a file handle passed to print (anything else?), or code which
would parse $*+=1 as $*+ = 1 instead of $* += 1, or code which is making
array slices on %*.

So I think that the right thing to do is not to blanket deprecate parsing @*
&* ** %* and $*, but instead change the parser to warn or deprecate on the
specific ambiguous constructions. Which means that the "new" "needed"
constructions need to come first. Or at least some idea of them.

But, I think I'm wrong again, because the specific intent was to have
consistent "slurpy" syntax for subroutine signatures. Consistent with Perl
6, and consistent between the Perl 5 signature and regular Perl 5 code.

And I got this wrong. In that, Perl 6 does have @*foo. But that's a dynamic
scoping lookup. The slurpy syntax is *@foo. (And *%foo, and *$foo)

For which we don't need to worry about the various sigils used with the **
typeglob at all. We need to consider how the parser deals with the 3
typeglobs *@, *% and *$. And based on how $# and $#foo are handled, I think
that everything that is wanted for "new" syntax is currently a syntax error.
Or, if not, all that is currently legal syntax is incredibly obscure corner

So the net result of all of this was better tests, a bit better understanding 
of another 0.1% of the tokeniser, and a bug fix, in that $* and $# now warn for
every location that references them. Previously there were "holes" through
which they could be used but avoid the warning. A lot of motion, but not much

The "second thoughts" fun continued with RT #116989, a bug originally
related to S_croak_memory_wrap(), but which turned out to be the tip a
considerably more pervasive iceberg.

The perl source code uses a lot of macros. Some of these were used to
implement inline functions long before C compilers could. Now that the
semantics of "static inline" functions are settled and widely supported,
we've started to replace the (still-working) macros with inline functions.

However, this threw up an unexpected surprise. You'd think that replacing
well-behaved macros (those that only use their parameters, and not more than
once) with inline functions would have no functional change. But we found
an interesting gotcha...

Unreferenced macros disappear at the end of pre-processing. Unreferenced
inline functions stick around for a while longer. Some compilers only remove
them if the optimiser is enabled. (Certainly some versions of gcc), whereas
others remove them always (for example clang).

The problem was with macros (and now functions) that reference global
variables or other functions. If the dead code isn't removed by link time,
it causes unsatisfied dependencies, and the build fails. The effect is that
the header files have changed such that they can now only be included by
code which links to the perl library.

Why is this a problem?

Probing code which wants to figure out if a function is present, or compiler
flags work, will compile a small test program using the perl header files,
and if it runs assume that the probed-for thing can be used. The assumption
is that "not running" is caused by the probed-for thing's absence, and not
some problem in the probe. So the first problem is that this stopped being
true, and the second problem is that we didn't realise this because none of
the probing code is verifying this assumption by testing the null hypothesis
- that a test program without anything changed does run. So when the header
files change such that including them now makes the test program fail to
link, all the probing code silently stops working - it simply thinks that it
can't find anything, routinely reports this, with results such as
Time::HiRes failing to find platform-specific functions that it uses, and
hence disabling optional functionality. And we didn't realise this.

Probing code such as that in Time::HiRes could likely resolve this by
linking its probe programs with the perl runtime library. But the code in
cflags.SH wants to work out what extra compile flags it can use locally to
*build* the perl library - ie which known potentially useful flags can't be
added to the build command line locally because they cause compilation
failures on this particular system. This is a bootstrapping problem, as we
want to know what we can compile it with, but if the probe now only works
once we can link with the perl codebase, then we can't answer that until
after we've compiled it.

As well as me, Andy Dougherty and Tony Cook had both already worked on this,
so we had a couple of proposed patches, along with a new regression test to
actually test that null hypothesis of probe code. However, it still didn't
quite feel complete. After thinking about the implications for a while, and
experimenting with various approaches I can't see a way to have
everything. ie

1) static inline functions to replace macros

combined with

2) defaulting to use them

3) not screwing up the "traditional" use without linking

I think that the least worst option is to keep (1) and (2), and force (3) to
work by adding a define that disables inclusion of "inline.h". Of course,
that means that you can't *use* anything from it. But as all it contains are
functions *declarations* that are of no use without data structures
initialised from the perl library, it's no loss. Hence we added a macro
PERL_NO_INLINE_FUNCTIONS for that purpose - explicitly disabling the
declaration of inline functions, for probes and similar code which needs it.
Having tested it on Solaris and HP-UX (as well as the more common
platforms), I pushed it to a smoke-me branch expecting it to work
everywhere. Fortunately I left it over the weekend to let as many machines
as possible catch up with it.

This revealed one last silly sting in the tail. Everything was happy except
for *Darwin*'s flavour of g++ 4.2.1, which objected to using "-ofilename",
petulantly demanding a space in between. It's just special. It's
particularly special as the the FreeBSD g++ 4.2.1 is happy with or without a
space. Same compiler, same reported version, but clearly one vendor patched
theirs. Well *at least* one vendor :-(

So another round of testing with a space in, and finally everyone was happy
and it was merged to blead, in time for v5.18.0.

I also looked at TAP::Parser. No-one pounced on my suggestion in the
previous report on where to look for memory savings, so I followed it up
myself. My guess was roughly correct - it uses an array for each test to
store the numbers of the passing subtests. So for the usual case of all the
subtests passing (in the right order, and matching the plan), this is an
array [1 .. $plan]. Fortunately TAP::Parser is very nicely factored, with
all the code using accessor functions instead of poking into the internals,
so it was very easy to alter it to use an alternative storage
representation, and hide this change from *everything* else just by changing
the accessors functions to "unwrap" the optimisation.

While working with the code I tried out a second approach to reducing memory
usage, albeit somewhat "unwarranted chumminess with the [...]
implementation" - forcing the test numbers to be stored as numbers. The test
lines are parsed by a regular expression, hence the captured results are
internally strings. The grammar uses qr/\d+/, so these strings can only be
numbers, hence by adding 0, the value stored is an IV, not a PV, which is

Combined, these dramatically cut memory use (about 90%). The bad news is
that this change didn't make it in time for v5.18.0 - it's really not the
thing you change during the pre-release code freeze. The good new is that
it's already on CPAN, you you can upgrade all your installations, if you so
wish. Including your older installed Perl versions.

A more detailed breakdown summarised from the weekly reports. In these:

16 hex digits refer to commits in
RT #... is a bug in
CPAN #... is a bug in
BBC is "bleadperl breaks CPAN" - Andreas K├Ânig's test reports for CPAN modules

[Hours]		[Activity]
  0.50		CPAN #83167
  0.25		ExtUtils::CBuilder
  1.00		HvFILL
  0.25		RT #113794
  0.75		RT #114502
 17.00		RT #116943/S_scan_indent
 16.50		RT #116989
		RT #116989, toolchain bootstrap, probing.
  0.25		RT #117003
  0.50		RT #117327
  7.50		RT #117501 (Open64 compiler)
  0.25		RT #117543
  1.00		RT #117687
  0.50		RT #117743
  2.75		RT #54044
  8.25		TAP::Parser
		TAP::Parser (CPAN #84939)
 20.00		Unicode Names
		Unicode Names (No \N{} in miniperl)
  0.25		caller with undef package
  0.25		mktables
  0.50		pp_pack.c
  4.00		process, scalability, mentoring
 26.50		reading/responding to list mail
  1.50		s390
111.50 hours total

Nicholas Clark

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About