perl.perl5.porters | Postings from March 2013

NWCLARK TPF grant report #78

Nicholas Clark
March 27, 2013 20:24
[Hours]		[Activity]
2013/02/25	Monday
 0.25		MVS
 1.75		Unicode Names
 3.00		reading/responding to list mail
 2.25		struct re_save_state

2013/02/26	Tuesday
 6.00		'foo special case
 0.25		RT #116943
 1.00		Unicode Names
 0.50		benchmarking regressions
 1.00		hv_ksplit/hsplit merge
 2.50		process, scalability, mentoring
 1.00		reading/responding to list mail
 0.25		struct re_save_state

2013/02/27	Wednesday
 5.75		'foo special case
 0.00		MAD
 0.50		PL_sv_objcount
 1.00		Unicode Names

2013/02/28	Thursday
 0.50		Devel::PPPort/my $_
 6.75		PL_sv_objcount
 0.50		RT #116989
 2.25		Unicode Names

2013/03/01	Friday
 5.25		Unicode Names
 0.50		process, scalability, mentoring
 1.50		reading/responding to list mail

2013/03/02	Saturday
 2.75		File::Spec & bootstrap

2013/03/03	Sunday
 4.00		Unicode Names

Which I calculate is 51.00 hours

I've double checked - I did do 51 hours, and I did do a 12.50 hour day
(starting at 9am, and continuing until 11:30pm, with 3 even-sized breaks),
due to it being a very interesting problem. :-)

Each of the core's C source code files starts with a quote from Lord of the
Rings. Many are right on the money. toke.c's quote is:

 *  'It all comes from here, the stench and the peril.'    --Frodo

This is quite apt, as toke.c is 12127 lines of near-impenetrable magic.
Chaim Frenkel's summary is well known:

    Perl's grammar can not be reduced to BNF. The work of parsing perl is
    distributed between yacc, the lexer, smoke and mirrors.

Larry recently offered a more quantitative evaluation:

    of the four or five ways a compiler can cheat, Perl 5 uses about eight
    of them

So the rather interesting problem concerned the static function
S_force_word() in toke.c, whose fifth argument, `allow_initial_tick`,
seemed to be redundant. The function is documented, and the arguments
are described as follows:

 * Arguments:
 *   char *start : buffer position (must be within PL_linestr)
 *   int token   : PL_next* will be this type of bare word (e.g., METHOD,WORD)
 *   int check_keyword : if true, Perl checks to make sure the word isn't
 *       a keyword (do this if the word is a label, e.g. goto FOO)
 *   int allow_pack : if true, : characters will also be allowed (require,
 *       use, etc. do this)
 *   int allow_initial_tick : used by the "sub" lexer only.

The function goes back a long way, with various changes to it and its
callers as the result of bug fixes, but these days most of the call sites
that had passed TRUE as the fifth argument have been further refactored to
avoid calling it. So it seemed that the fifth argument wasn't needed - ie
it was safe to assume that it was always FALSE. Note, there's no way
whatsoever from any of the documentation to work out whether this is the
case. Although all the individual authors of this code are believed still to
be alive and responsive to e-mail, previous attempts at asking simpler
questions about code written years or decades ago have always resulted in
polite replies to the effect of "I no longer remember". Which really isn't
much help.

Hence, trying to get any better understanding of how pretty much any part of
the parser works requires carrying out an investigation like the one I'm
about to describe.

So, for starters, there's the obvious first step - what happens if I change
the code and do a full clean build and run the tests? Nothing fails. That
hints that it might be redundant, but you never can be sure...

Digging around the previous historical versions where S_force_word() had
been changed didn't reveal anything. Even the changes where that parameter
had been renamed, or code relating to it altered, only confirmed that there
had been bugs, and that the changes had fixed those bugs.

The approach that paid off was observing that until 2012 there were two
other call sites passing TRUE to the function. So I built the version of
blead just before they were refactored, and tried using FALSE instead. With
that change, this code stops parsing:

    $ ./miniperl -e "sub 'foo {warn qq{ok}}; &'foo"

That suggests that the argument in question has something to do with
disambiguating whether a ' is the start of a single quoted string, or a
leading package separator (ie the Perl 4 way of saying ::foo).

With that knowledge, and a bit of trial and error, I was able to figure out
that the code in question is needed to parse this correctly:

    $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
             -e. -e '$~ = "one"; write'

If you change the TRUE to FALSE, this happens:

    $ ./perl -e 'sub one {};' -e "format 'one =" -e 'One!' \
             -e. -e '$~ = "one"; write'
    Undefined format "one" called at -e line 5.

(the parsing of 'format ::one =' is unaffected)

So, finally

a) we know what the code was for
b) we know how to write a test case
c) we can refactor the code to eliminate that argument
   (Brian Fraser had spotted that the tokeniser had already done the necessary
   work about a dozen lines earlier, with the result stored in a different
   variable)

But that was about 12 hours work to figure out 1 argument to 1 function in
toke.c. There are 97 functions in toke.c, most have multiple arguments, and
the mean function length is double that of S_force_word(). The interactions
are staggeringly complex. Roughly 0.1% down, 99.9% to go.

Another area of complex interaction, but not *as* complex, is the build
system. The distribution needs to bootstrap itself up from source files to a
complete working tested perl, without assuming that there is already a perl
installed on the system (else you have a chicken and egg problem), but also
without getting confused if there *is* a perl installed on the system (as it
is probably a different version, or has incompatible build options).

Zefram submitted a patch which provides XS implementations for some of the
key methods in File::Spec. Whilst we agreed that for something as
fundamental as File::Spec, it's too close to v5.18.0 to safely put it in, I
did try to make a branch of blead, to test that it worked in blead.

This turned out to be rather more involved than I thought it would be.

I expected it to be potentially interesting, because File::Spec and Cwd are
used early during the bootstrap of the toolchain, which means that a bunch
of scripts early in the build process need to be able to run them (pure-Perl)
from dist/.

But it proved to be a bit more interesting than just that. For starters, I
hit a failure mode that I wasn't expecting. We have this rule to create
lib/, the setup file which primes miniperl to be able to run
uninstalled and change directory (which you need for the build):

    lib/ $(MINIPERL_EXE)
            $(MINIPERL) >lib/

The problem is that the upgrade caused a compile time error, because Cwd
now requires constant and re. It's roughly akin to this:

    $ git diff
    diff --git a/ b/
    index ec3b36e..e8efec1 100644
    --- a/
    +++ b/
    @@ -1,5 +1,6 @@
     #!./miniperl -w
     use strict;
     if (@ARGV) {
         my $dir = shift;
    $ make lib/
    ./miniperl -Ilib >lib/
    Died at line 3.
    make: *** [lib/] Error 255
    $ echo $?
    $ ls -l lib/
    -rw-r--r-- 1 nick nick 0 Mar  2 16:15 lib/

ie the build fails, but the generated file is not removed. So if you attempt
to continue by running make again, that file is assumed to be good, and
something ugly fails soon after for all the wrong reasons, spewing error
messages that aren't related to the original problem.
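One conventional guard against this class of problem (a sketch of GNU
make's built-in mechanism, not the fix that went into blead) is the
.DELETE_ON_ERROR special target, which tells make to remove any target
whose recipe exits non-zero:

```make
# GNU make only: if a recipe fails, delete its (possibly half-written)
# target, so a later run doesn't mistake the stale file for a good one.
.DELETE_ON_ERROR:
```

Other make implementations generally lack an equivalent, which is part of
why the Win32 and VMS makefiles need separate attention.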

So, having figured out that there is a second bug obscuring the real bug, it
becomes easier to fix the actual causes :-) Although the bad news is that
this means changes to the Win32 and VMS makefiles too. I pushed a tentative
fix to the Makefile bootstrapping logic in
smoke-me/nicholas/build-bootstrap, which George Greer's Win32 smoker seems
fine with. However, I think I can see how to improve it by more use of
buildcustomize (removing -Ilib), but things have been sufficiently busy that
I've not had a chance to look further. In particular, we don't have any VMS
smokers, so I'll have to test things manually on VMS, which makes it all
more time consuming.

The third significant thing I worked on this week was Unicode code point
lookup by name. Perl 5 can convert from name to code point using the "\N{}"
escape, which is implemented by automatically loading the charnames pragma.
Using /usr/bin/time it's easy to see that loading charnames increases memory
use considerably. Compare:

    $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello world\n"'
    Hello world

            Maximum resident set size (kbytes): 6976

    $ /usr/bin/time -v ./perl -Ilib -le 'print "Hello\N{SPACE}world\n"'
    Hello world

            Maximum resident set size (kbytes): 30672

Trying to get some idea of where that extra 23 meg came from:

    $ /usr/bin/time -v ./perl -Ilib -le ''
            Maximum resident set size (kbytes): 6432

    $ /usr/bin/time -v ./perl -Ilib -le 'use strict; use warnings'
            Maximum resident set size (kbytes): 10016

    $ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"'
            Maximum resident set size (kbytes): 24144

    $ /usr/bin/time -v ./perl -Ilib -le 'use charnames ":full"; print charnames::vianame("SPACE")'
            Maximum resident set size (kbytes): 30736

Just loading strict and warnings allocates another 4M. (Note, "Hello world"
was .5M, so nothing is free.) Loading charnames allocates another 14M, and
using it allocates a further 6M, presumably as various caches start to get
filled.

But all these requests are for things that are already known, just not in a
very convenient format for a fast lookup. The main Unicode Data file is
24430 lines, and doesn't include 4 large ranges of CJK unified ideographs, a
large range of algorithmically named Hangul syllables, and about 500
aliases. Moreover, any Perl hash you build of this (or the subset that you
are interested in) is allocated at runtime, from memory that isn't even
going to be shared between threads, let alone between processes.

Karl and I have wondered whether it would be possible to encode the names as
a trie structure, with lookup code written in C. The lookup data would be
unmodified at runtime, so could be compiled by the C compiler into the
constant data section of the executable (or shared library), which will be
shared at least between threads, and on *nix (at least) between all
processes. So I've made a start at looking at this. By the end of the week
my code had reached the point where I had Perl code to parse all the files
and generate suitable data structures, along with Perl code written in a C
style which can correctly look up any code point by name, except the CJK
ideographs and Hangul names. Which is pleasing.

Nicholas Clark
