develooper Front page | perl.perl5.porters | Postings from September 2014

RFC 2 pointer sized in registers only return struct optimization

September 25, 2014 00:52
I resurrected a patch from a couple of years ago and rebased it. Every OS
and CPU that Perl runs on, except Win64, can return a function's return
value in 2 pointer-sized registers.

Perl very often has a prototype with an outgoing-only STRLEN * len and a
buffer * as the return value. In these cases, always returning the len in
the high register instead of passing a STRLEN * will lead to better code
generation. *nolen() functions will cease to exist on an assembly level.
No more passing NULL to the callee, or a stub that passes NULL for you. If
the caller doesn't want len, it just ignores the high register and
overwrites it whenever it wants. Of course if calculating len is
expensive, and the callee calls more funcs only inside "if(len)", then
this optimization definitely shouldn't be done on that function. If len
comes from SvCUR or something equally lightweight (Perl__to_fold_latin1),
the function is a good candidate. All the regcall ABIs (SysV x64, ARM,
PARISC, MIPS) have at least 4 registers for incoming args. By removing
the &len from the callers, a C stack memory slot is freed in most
callers, unless var len has to survive across another call. Also, if it
means a function call no longer needs to pass any args on the C stack,
then certain optimizations with shadow/reg spill space can happen. A
strlen being saved across a function call is rare, from me grepping the
perl source and machine code. If var len has to be saved across another
call then, under my proposal, the high return register will be copied to
the C stack after the call.
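
A minimal sketch of the difference, not the actual patch. The names
pv_len_t, get_buf_outparam, get_buf_twr and get_buf_nolen are made up for
illustration, and size_t stands in for STRLEN. On an ABI that returns a
2-word struct in two registers, the second form needs no memory slot for
len in the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 2-word return type: pointer in one word, length in the other. */
typedef struct { const char *pv; size_t len; } pv_len_t;

static const char greeting[] = "hello";

/* Current style: the length goes out through a pointer that may be NULL. */
static const char *get_buf_outparam(size_t *lenp)
{
    if (lenp)
        *lenp = sizeof(greeting) - 1;
    return greeting;
}

/* Proposed style: both values come back by value. */
static pv_len_t get_buf_twr(void)
{
    pv_len_t r;
    r.pv = greeting;
    r.len = sizeof(greeting) - 1;
    return r;
}

/* A caller that does not want len simply ignores that member. */
static const char *get_buf_nolen(void)
{
    return get_buf_twr().pv;
}

/* Both styles hand back the same buffer and the same length. */
static void demo_twr(void)
{
    size_t len;
    const char *a = get_buf_outparam(&len);
    pv_len_t b = get_buf_twr();
    assert(a == b.pv && len == b.len);
    assert(strcmp(get_buf_nolen(), "hello") == 0);
}
```

Whether the struct really travels in two registers depends on the ABI and
the compiler, so the generated machine code is worth checking per platform.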

The number of candidates for 2 pointer return types has been growing
over the last couple of years

and many many more (search embed.fnc for STRLEN *, I32 * and U32 *)


What should be the name of this build feature?

_2WR (two word return)
LR "large return"
SR "struct return"
_2AR (two arg return)
_2ARGS (two args)
_2PR (two pointer return)

In some ABIs, since there is no 2nd return register (Win64), returning > 
sizeof(void *) means that what in C on 32 bit OS is

char[8] functionName(OBJECT * object);

becomes machine code identical to the following C prototype

void functionName( char (*return) [8], OBJECT * object);

This is now less efficient, since both ints/words have to be read from
the C stack. Because of this, a CPP framework must be set up so that,
with a define, Perl can be compiled either to use 2 word returns or to do
it the current way.
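
The lowering described above can be sketched in plain C. This is an
illustration under assumptions: pair8_t, function_by_value and
function_hidden_ptr are hypothetical names, OBJECT is a stand-in struct,
and a struct replaces the char[8] pseudo-type because C functions cannot
return arrays.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t lo; uint32_t hi; } pair8_t;  /* 8 bytes, hypothetical */
typedef struct { int id; } OBJECT;                      /* stand-in type */

/* What the programmer writes: an 8-byte aggregate returned by value. */
static pair8_t function_by_value(OBJECT *object)
{
    pair8_t r;
    r.lo = (uint32_t)object->id;
    r.hi = (uint32_t)object->id * 2u;
    return r;
}

/* Roughly what is emitted when the ABI has no second return register:
 * the caller supplies memory for the result as a hidden first argument. */
static void function_hidden_ptr(pair8_t *ret, OBJECT *object)
{
    ret->lo = (uint32_t)object->id;
    ret->hi = (uint32_t)object->id * 2u;
}

/* Both forms produce the same result; only the calling sequence differs. */
static void demo_hidden_ptr(void)
{
    OBJECT o = { 21 };
    pair8_t a = function_by_value(&o);
    pair8_t b;
    function_hidden_ptr(&b, &o);
    assert(a.lo == b.lo && a.hi == b.hi);
}
```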

Also I believe that inlined stubs must exist, since existing C code will
be writing "&foo", like for sv_pos_u2b_flags and utf8n_to_uvchr. If my C
knowledge is correct, the 2 words in the return type can't be split into
2 vars without assigning to a var, so a macro-only solution won't work
without a do{}while(0);, and do{}while(0); can't be used to implement
"#define sv_2pv_flags" since

char * buffer = do{ _2pv_t _2pv = Perl_sv_2pv_flags_s(aTHX_ sv, flags); 
if(lp) *lp = _2pv.len; _2pv.pv;}while(0);

isn't valid C and

char * buffer = Perl_sv_2pv_flags_s(aTHX_ sv, flags).pv;

fails to capture the length and

buffer, { len } = Perl_sv_2pv_flags_s(aTHX_ sv, flags).{.len, .pv};

isn't C (maybe it will be valid C in C22 or C33 (^_-), someone wanna 
help me submit a proposal to ISO jkjk )

Statement expressions (compound statements used as expressions) are GCC
only, so that is useless

So stubs like

__forceinline char *
S_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const lp, const I32 flags)
{
    _2pv_t _2pv = Perl_sv_2pv_flags_s(aTHX_ sv, flags);
    if (lp) /* lp may be NULL, as in the macro attempt above */
        *lp = _2pv.len;
    return _2pv.pv;
}

have to be written and put in some .h (inline.h or some other file?). I
believe this can be generated automatically from embed.fnc if the
high register arg is marked in embed.fnc. We have NN and NULLOK tokens
already, "2WR" or "_2WR" or "TWR" can be the other token. A

#ifdef USE_TWR
# define TWRARG(x)
# define TWRARG_(x)
#else
# define TWRARG(x) x
# define TWRARG_(x) x,
#endif

has to be in the .c file func definition proto to remove the arg if in 
TWR mode.
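
A sketch of how such an argument-removing macro could look in a function
definition. Only TWRARG_ and USE_TWR come from the text above; twr_t,
TWRRET, name_of and name_len are hypothetical, and the explicit #ifdef in
the function body is just the simplest way to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define USE_TWR 1  /* remove this line to build the classic out-parameter form */

typedef struct { const char *pv; size_t len; } twr_t;  /* hypothetical */

#ifdef USE_TWR
#  define TWRARG_(x)            /* the argument and its comma disappear */
#  define TWRRET twr_t
#else
#  define TWRARG_(x) x,
#  define TWRRET const char *
#endif

/* One definition, two possible prototypes:
 *   USE_TWR:  twr_t       name_of(int which)
 *   classic:  const char *name_of(size_t *lenp, int which)
 */
static TWRRET name_of(TWRARG_(size_t *lenp) int which)
{
    const char *pv = which ? "one" : "zero";
    const size_t len = which ? 3 : 4;
#ifdef USE_TWR
    twr_t r;
    r.pv = pv;
    r.len = len;
    return r;
#else
    if (lenp)
        *lenp = len;
    return pv;
#endif
}

/* A caller that only wants the length, written for both build modes. */
static size_t name_len(int which)
{
#ifdef USE_TWR
    return name_of(which).len;
#else
    size_t len;
    (void)name_of(&len, which);
    return len;
#endif
}

static void demo_twrarg(void)
{
    assert(name_len(0) == 4);
    assert(name_len(1) == 3);
}
```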

The return type in the proto has to be

#ifdef USE_TWR
# define TWRRTEXP(x) x

To return/exit a TWR function there are 2 choices that I can think of

A RETURN_TWR(var1, var2)

#ifdef USE_TWR
# ifdef TWR_IS_U64 /* on some ABIs aggregates never get returned by copy */
#  define RETURN_TWRNOK(low, high) STMT_START {TWRTYPE _xtwrtmp; \
    _xtwrtmp.lowmemb = (void *) low; _xtwrtmp.highmemb = (void *) high; \
    return *(uint64_t *)(&_xtwrtmp);} STMT_END
# elif defined(TWR_IS_U128)
#  define RETURN_TWRNOK(low, high) STMT_START {TWRTYPE _xtwrtmp; \
    _xtwrtmp.lowmemb = (void *) low; _xtwrtmp.highmemb = (void *) high; \
    return *(uint128_t *)(&_xtwrtmp);} STMT_END
# elif defined(TWR_IS_F128)
#  define RETURN_TWRNOK(low, high) STMT_START {TWRTYPE _xtwrtmp; \
    _xtwrtmp.lowmemb = (void *) low; _xtwrtmp.highmemb = (void *) high; \
    return *(__float128 *)(&_xtwrtmp);} STMT_END
# else
#  define RETURN_TWRNOK(low, high) STMT_START {TWRTYPE _xtwrtmp; \
    _xtwrtmp.lowmemb = (void *) low; _xtwrtmp.highmemb = (void *) high; \
    return _xtwrtmp;} STMT_END
# endif
# define RETURN_TWRNN(low, high) RETURN_TWRNOK(low, high)
#else
# define RETURN_TWRNOK(low, high) return (high##p ? *high##p = high : 0), low
# define RETURN_TWRNN(low, high) return (*high##p = high), low
#endif
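
A small sketch of the non-TWR fallback pair, assuming the "high#p" in the
macros is meant as token pasting (##), so that a local named len pairs
with an out-pointer parameter named lenp. The functions word_nok and
word_nn are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Non-TWR fallbacks: "high" names a local variable, and the out-pointer
 * parameter is assumed to be called <high>p (len -> lenp). */
#define RETURN_TWRNOK(low, high) return (high##p ? *high##p = high : 0), low
#define RETURN_TWRNN(low, high)  return (*high##p = high), low

/* NULLOK flavour: lenp may be NULL. */
static const char *word_nok(size_t *lenp)
{
    const size_t len = 3;
    RETURN_TWRNOK("abc", len);
}

/* NN flavour: lenp must not be NULL. */
static const char *word_nn(size_t *lenp)
{
    const size_t len = 2;
    RETURN_TWRNN("xy", len);
}

static void demo_return_macros(void)
{
    size_t n = 0;
    assert(strcmp(word_nok(&n), "abc") == 0 && n == 3);
    assert(strcmp(word_nok(NULL), "abc") == 0);
    assert(strcmp(word_nn(&n), "xy") == 0 && n == 2);
}
```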

Another API would be
#define dTWR
#define SETTWRLOW()
#define READTWRLOW()
#define SETTWRHIGH()
#define RETURN_TWR /* no args, the under the hood C auto names are a 
secret */

And with dTWR (choice 2) there are either 2 separate C autos, or the 1
aggregate struct return type. Remember the legal return type might not
be a struct type, but a uint64_t, __float128, __int128, long double
(as long as a 128 bit long double is returned in 2 integer regs by ABI
spec, and not 1 or 2 FP/SSE registers; loading into an FP register and
then to a GPR in the caller is insane/bad), or __m128i, since some ABIs
say only integers or floats get to be split into 2 return registers and
ALL structs, unions, and arrays become a secret 1st arg * supplied by the
caller.
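
A sketch of the integer-typed return for a 32-bit target, where two
32-bit words travel as one uint64_t. It uses shifts instead of the
pointer cast in the RETURN_TWRNOK macros, which is a substitution made
here to sidestep aliasing questions, not what the macros do. The names
twr_pack, twr_low, twr_high and find_span are hypothetical, and which
register receives which half is up to the ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 32-bit words into one 64-bit integer return value. */
static uint64_t twr_pack(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}

/* Split the 64-bit value back into its two words in the caller. */
static uint32_t twr_low(uint64_t r)  { return (uint32_t)(r & 0xFFFFFFFFu); }
static uint32_t twr_high(uint64_t r) { return (uint32_t)(r >> 32); }

/* Hypothetical callee: returns an offset (low) and a length (high). */
static uint64_t find_span(void)
{
    return twr_pack(7u, 12u);
}

static void demo_pack(void)
{
    const uint64_t r = find_span();
    assert(twr_low(r) == 7u);
    assert(twr_high(r) == 12u);
}
```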

The rest of this document is ABI details on CPUs/OSes, so you can stop
reading here if you don't care about machine code:

Win32 on 32 (2 word return not supported on Win64 ABI, no RDX)
EAX EDX registers for returning 64 bit structs or a 64 bit int

iOS ARM 32 and 64 (for ARM32 uint64_t probably required, not a struct, 
see composite type)

ARM 32 from ARM Corp
5.4 Result Return
The manner in which a result is returned from a function is determined 
by the type of that result.
For the base standard:
- A word-sized Fundamental Data Type (e.g., int, float) is returned in r0.
- A double-word sized Fundamental Data Type (e.g., long long, double and 
64-bit containerized vectors) is returned in r0 and r1.
- A 128-bit containerized vector is returned in r0-r3.
- A Composite Type not larger than 4 bytes is returned in r0. The format 
is as if the result had been stored in memory at a word-aligned address 
and then loaded into r0 with an LDR instruction. Any bits in r0 that lie 
outside the bounds of the result have unspecified values.
- A Composite Type larger than 4 bytes, or whose size cannot be 
determined statically by both caller and callee, is stored in memory at 
an address passed as an extra argument when the function was called 
(§5.5, rule A.4). The memory to be used for the result may be modified 
at any point during the function call.

ARM 64 from ARM Corp

5.5 Result Return
The manner in which a result is returned from a function is determined 
by the type of that result:
- If the type, T, of the result of a function is such that
void func(T arg)
would require that arg be passed as a value in a register (or set of 
registers) according to the rules in §5.4
Parameter Passing, then the result is returned in the same registers as 
would be used for such an argument.
- Otherwise, the caller shall reserve a block of memory of sufficient 
size and alignment to hold the result. The
address of the memory block shall be passed as an additional argument to 
the function in x8. The callee may
modify the result memory block at any point during the
B.3 If the argument type is a Composite Type that is larger than 16 
bytes, then the argument is copied to
memory allocated by the caller and the argument is replaced by a pointer 
to the copy.
Table 2, General purpose registers and AAPCS64 usage
The first eight registers, r0-r7, are used to pass argument values into 
a subroutine and to return result values from
a function. They may also be used to hold intermediate values within a 
routine (but, in general, only between
subroutine calls).
C.10 If the argument is a Composite Type and the size in double-words of 
the argument is not more than 8
minus NGRN, then the argument is copied into consecutive general-purpose 
registers, starting at
x[NGRN]. The argument is passed as though it had been loaded into the 
registers from a double-word-aligned
address with an appropriate sequence of LDR instructions loading 
consecutive registers from
memory (the contents of any unused parts of the registers are 
unspecified by this standard). The NGRN
is incremented by the number of registers used. The argument has now 
been allocated.


o GRs 28 and 29 are used for return values up to 128 bits long. These 
are scratch registers.

PA-RISC 32 is GR 28 and GR29 for 64 bit values

When calling functions that return results larger than 64 bits, the 
caller passes a short
pointer (using SR5 - SR7) in GR28 (ret0) which describes the memory 
location for the
function result. The address given should be the address for the 
high-order byte of the

X64 SysV (Linux/OSX 64)
3. If the class is INTEGER, the next available register of the sequence 
%rdx is used.

x86-32 SysV (Linux)


OSX 32


Solaris 64

Functions that return an integer value return it in %o0 or %o0 and %o1. 
For 32-bit code, long long data are returned with the upper 32-bits in 
%o0 and the lower 32-bits in %o1, treating %o0 and %o1 as if they were 
32-bit registers.

Some PDF on MIPS

Function results are returned in $2 (and $3 if needed), or $f0 (and $f2
if needed),
as appropriate for the type. Composite results (struct, union, or array) are
returned in $2/$f0 and $3/$f2 according to the following rules:
– A struct with only one or two floating point fields is returned in $f0
(and $f2 if necessary). This is a generalization of the Fortran COMPLEX case.
– Any other struct or union results of at most 128 bits are returned in 
$2 (first
64 bits) and $3 (remainder, if necessary).