develooper Front page | perl.vmsperl | Postings from June 2002

Re: [PATCH] pp.c:pp_pow() -- quit when you're done

Thread Previous | Thread Next
From:
Tels
Date:
June 16, 2002 03:55
Subject:
Re: [PATCH] pp.c:pp_pow() -- quit when you're done
Message ID:
perl.vmsperl-10831@nntp.perl.org
-----BEGIN PGP SIGNED MESSAGE-----

Moin,

[warning, long mail]

"Craig A. Berry" <craigberry@mac.com> wrote:

>A for loop in pp_pow is constructed in such a way that it squares the
>base again even after it's iterated through all the bits of the
>exponent.  This could be a problem on any system where floating
>overflow causes an exception; I noticed it with the D_FLOAT format on
>OpenVMS Alpha, where it was causing Perl to blow up when running
>p/op/pow.t.  The patch below fixes it by exiting the loop as soon as
>we know we can.
>
>If someone strongly objects to having this test inside the loop, I
>can ifdef it.
>
>--- pp.c;-0     Fri Jun  7 09:24:27 2002
>+++ pp.c        Sat Jun 15 15:10:13 2002
>@@ -956,6 +956,8 @@
>                             result *= base;
>                             /* Only bother to clear the bit if it is set.
>*/
>                             power &= ~bit;
>+                           /* Avoid squaring base again if we're done. */
>+                           if (power == 0) break;
>                         }
>                     }
>                     SP--;
>[end of patch]

>                    NV result = 1.0;
>                    NV base = baseuok ? baseuv : -(NV)baseuv;
>                    int n = 0;
[snip comment]
>for (; power; base *= base, n++) {
>                        /* Do I look like I trust gcc with long longs here?
>                           Do I hell.  */
>                        UV bit = (UV)1 << (UV)n;
>                        if (power & bit) {
>                            result *= base;
>                            /* Only bother to clear the bit if it is set*/
>                            power &= ~bit;
>                        }
>                    }

Well, well, this can be a more bit optimized. (Not sure if it makes a
difference in praxis):

        for(; power; base *= base, n++) {
          UV bit = (UV)1 << (UV)n;

This constructs 1 << n every step, but having bit << 1 in every step
suffices. Also, we no longer need n. (Maybe this comes down to the same
amount of asm instructions due to compiler cleverness but I doubt it). We
also convert this to a while loop while we are at it:

(Excuse my indention (sp?), but works better in the mailer)

        UV bit = 1;
        while (power != 0)
          {
          if (power & bit)
            {
            result *= base;
            /* Only bother to clear the bit if it is set*/
            power &= ~bit;
            /* Save the last squaring of base, because this makes trouble */
            if (power == 0) { break; }
            }
          base *= base;
          bit <<= (UV)1;
          }

The additional "if (power == 0)" is only executed if the bit was on, which
should be seldom enough ;)

One could use a do { } while ( ... ); loop to skip the first power == 0
test (since that is likely handled above), but that should be only a
theoretical gain, since such a comparisation is quite fast compared to the
rest (unlike with BigInt's) (and we see later what the compiler does for
us, anyway).

It is also a question whether a loop without the if would be faster on some
CPU's with long pipelines:

        UV bit = 1;
        while (power != 0)
          {
          result *= base * (power & bit);
          power &= ~bit;
          if (power == 0) { break; }
          base *= base;
          bit <<= (UV)1;
          }

While this executes some more instructions for the (power & bit == 0) case,
it saves one comparisation.

(btw, this could be re-aranged so that we save one of the (power == 0)
tests by partial unrolling the loop, but see later)

Here are two C testfiles, and the asm output from gcc -O3:

############################################################################

#define UV int

int main (void)
  {
  int power = 13;
  float base = 1.0f;
  float result = 3.0f;

        UV bit = 1;
        while (power != 0)
          {
          result *= base * (power & bit);
          power &= ~bit;
          if (power == 0) { break; }
          base *= base;
          bit <<= (UV)1;
          }

  return 0;
  }

        .file   "while.c"
        .text
        .align 2
        .p2align 4,,15
..globl main
        .type   main,@function
main:
        pushl   %ebp
        movl    $13, %ecx
        movl    %esp, %ebp
        subl    $8, %esp
        andl    $-16, %esp
        movl    $1, %edx
        .p2align 4,,15
..L6:
        movl    %edx, %eax
        notl    %eax
        andl    %eax, %ecx
        je      .L3
        addl    %edx, %edx
        jmp     .L6
..L3:
        leave
        xorl    %eax, %eax
        ret
..Lfe1:
        .size   main,.Lfe1-main
        .ident  "GCC: (GNU) 3.1"

############################################################################

#define UV int

int main (void)
  {
  int power = 13;
  float base = 1.0;
  float result = 3;

        UV bit = 1;
        while (power != 0)
          {
          if (power & bit)
            {
            result *= base;
            /* Only bother to clear the bit if it is set*/
            power &= ~bit;
            /* Save the last squaring of base, because this makes trouble */
            if (power == 0) { break; }
            }
          base *= base;
          bit <<= (UV)1;
          }

  return 0;
  }
        .file   "while_org.c"
        .text
        .align 2
        .p2align 4,,15
..globl main
        .type   main,@function
main:
        pushl   %ebp
        movl    $13, %ecx
        movl    %esp, %ebp
        subl    $8, %esp
        andl    $-16, %esp
        movl    $1, %edx
        .p2align 4,,15
..L7:
        testl   %edx, %ecx
        je      .L5
        movl    %edx, %eax
        notl    %eax
        andl    %eax, %ecx
        je      .L3
..L5:
        addl    %edx, %edx
        testl   %ecx, %ecx
        jne     .L7
..L3:
        leave
        xorl    %eax, %eax
        ret
..Lfe1:
        .size   main,.Lfe1-main
        .ident  "GCC: (GNU) 3.1"

############################################################################

The inner loops are:

..L7:
        testl   %edx, %ecx
        je      .L5
        movl    %edx, %eax
        notl    %eax
        andl    %eax, %ecx
        je      .L3
..L5:
        addl    %edx, %edx
        testl   %ecx, %ecx
        jne     .L7

and for the second variant:

..L6:
        movl    %edx, %eax
        notl    %eax
        andl    %eax, %ecx
        je      .L3
        addl    %edx, %edx
        jmp     .L6
..L3

############################################################################

What do we learn from this? The compiler optimzed away anything. Back to
the drawing board:

Add printf ("%f",result); before the return 0:

First variant:

..L7:
        testl   %edx, %ecx
        je      .L5
        movl    %edx, %eax
        notl    %eax
        andl    %eax, %ecx
        fmul    %st(1), %st
        je      .L9
..L5:
        fxch    %st(1)
        addl    %edx, %edx
        testl   %ecx, %ecx
        fmul    %st(0), %st
        je      .L11
        fxch    %st(1)
        jmp     .L7
        .p2align 4,,7
..L11:
        fstp    %st(0)
..L3:
[snip printf and ret block]
        ret
..L9:
        fstp    %st(1)
        jmp     .L3

Second variant:

..L6:
        movl    %ecx, %eax
        andl    %edx, %eax
        pushl   %eax
        movl    %edx, %eax
        notl    %eax
        fildl   (%esp)
        addl    $4, %esp
        andl    %eax, %ecx
        fmul    %st(2), %st
        fmulp   %st, %st(1)
        je      .L3
        fxch    %st(1)
        fmul    %st(0), %st
        fxch    %st(1)
        addl    %edx, %edx
        jmp     .L6
..L3:

(We also see that the compiler is not (yet) clever enough to see that a
printf ("1549323") would be sufficient for a correct program :-P

The first variant has one conditional (je L5) which is fall-trough for 1
bits and jump for zero bits (there should be more zero bits). It has 14
instructons for the 1 case, and 9 for the zero case.

The second variant has only one conditional, (je L3) which is the loop-end
and only jumps once. It has 16 instructions in both cases. (It could be
faster, due to better pairing, which you can only find out by reading tons
of CPU docs - only to find that it only works for AMD, not for Intel - or
by benchmarking it)

You see also that the compiler optimized away the (power != 0) at the loop
entry and moved it into the loop! (probably because I did power = 13 at
the start).

Oh fun with optimizing compilers - but anyway, if the compiler knows in the
pp.c that power can not be 0 (because tested and branched further up), it
would pull the same trick.

Pipeline-wise the second variant looks better (shorter/cleaner). Anyway,
untested patch for first variant attached. Somebody with too much time
on their hands could benchmark the three versions *wink hint nudge*

Hope this helps,

Te"They call me 'The Optimizer[tm]'"ls


- --
perl -MMath::String -e 'print \
Math::String->from_number("215960156869840440586892398248"),"\n"'

 http://bloodgate.com/perl       My current Perl projects
 PGP key available on http://bloodgate.com/tels.asc or via email.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl.

iQEVAwUBPQxu9XcLPEOTuEwVAQGCVgf+OrEIDEBI2AVbBjsAnrs78B6UDtqKDUkg
upSs8uoKHj8Wd4x641SuTMEC9/8MMYsDpFDm/V2VUq5a3aoKIvVHTNNYHrJC+JRZ
ckhQNaEGO0hHll7s7vEEPuJGjtbxNs2Av4ou1KfO2pmAwnCdayt+k0Sht5+NUEeE
njVMPqdg2Xw0MFaKGd4zBK7PTXCvFKwspUeRPwYZ6b746q2bT71HI2GYj81CZ5Lr
HUMM9nll1r0Sp0a6ke114xfH/prKI10c7VIXT8c0FS68L70xrUSVqZOp9R1+ECsb
rAHt4h4ub1fX6kFhmAyLIAq7W491cXgvaoFFtEFAW0LEnI6/shyIEA==
=cMX5
-----END PGP SIGNATURE-----

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About