RFC 2 pointer sized in registers only return struct optimization
From:
bulk88
Date:
September 25, 2014 00:52
Subject:
RFC 2 pointer sized in registers only return struct optimization
Message ID:
BLU437-SMTP40E69371B1D8026D83D70DFBE0@phx.gbl
I resurrected a patch from a couple of years ago and rebased it. Every OS
and CPU that Perl runs on, except Win64, allows a function's return value
to be 2 pointer sized registers.
Reasons:
Perl very often has a prototype with an outgoing-only STRLEN * len and a
buffer * as the return value. In these cases always returning the len in
the high register instead of passing STRLEN * will lead to better code
generation. *nolen() functions will cease to exist on an assembly level.
No more passing NULL to the callee, or to a stub that passes NULL for
you. If the caller doesn't want len, it just ignores the high register
and overwrites it whenever it wants. Of course if calculating len is
expensive, and the callee calls more funcs only inside "if(len)", then
this optimization definitely shouldn't be done on that function. If len
comes from SvCUR or something equally lightweight (Perl__to_fold_latin1),
the function is a good candidate. All the regcall ABIs (SysV x64, ARM,
PARISC, MIPS) have at least 4 registers for incoming args. By removing
the &len from the callers, a C stack memory slot is freed in most callers
unless var len has to survive across another call. Also if it makes a
function call not need to pass any args on the C stack, then certain
optimizations with shadow/reg spill space can happen. A strlen being
saved across a function call is rare, judging from me grepping the perl
source and machine code. If var len has to be saved across another call,
under my proposal, then after the call, the high return register will be
copied to the C stack.
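To make the difference concrete, here is a minimal sketch of the two calling styles side by side. Nothing in it is from the patch: pvlen_t, get_pv_outparam, get_pv_pair and get_pv_only are hypothetical names, and strlen() only stands in for a cheap length source like SvCUR. Whether the struct really comes back in two registers depends on the ABI, as covered further down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-word return type: buffer pointer plus its length. */
typedef struct {
    const char *pv;
    size_t      len;
} pvlen_t;

/* Current style: the length leaves through a pointer that may be NULL,
 * so the callee has to test it and the caller needs a stack slot. */
static const char *get_pv_outparam(const char *s, size_t *lenp)
{
    if (lenp)
        *lenp = strlen(s);
    return s;
}

/* Proposed style: both words are returned by value, no NULL check. */
static pvlen_t get_pv_pair(const char *s)
{
    pvlen_t r;
    r.pv  = s;
    r.len = strlen(s);
    return r;
}

/* A caller that does not want the length simply ignores that member,
 * which replaces a separate *nolen() entry point. */
static const char *get_pv_only(const char *s)
{
    return get_pv_pair(s).pv;
}

/* Small self-check showing that both styles give the same answer. */
static void pair_return_demo(void)
{
    size_t      n = 0;
    const char *p = get_pv_outparam("perl", &n);
    pvlen_t     r = get_pv_pair("perl");

    assert(n == r.len);
    assert(strcmp(p, r.pv) == 0);
    assert(strcmp(get_pv_only("perl"), "perl") == 0);
}
```

Comparing the compiler's assembly output for get_pv_outparam and get_pv_pair on a given platform is a quick way to see whether that ABI actually uses a second return register.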
The number of candidates to be 2 pointer return types has been growing
over the last couple of years:
sv_2pv_flags
to_utf8_*
utf8_to_bytes
utf8_to_uvchr
utf8n_to_uvchr
bytes_from_utf8
bytes_to_utf8
sv_pvn_force_flags
sv_pos_*
hv_iterkey
Perl__to_fold_latin1
and many many more (search embed.fnc for STRLEN *, I32 * and U32 *)
Implementation:
What should be the name of this build feature?
_2WR (two word return)
_2WORDS
LR "large return"
RL
SR "struct return"
RS
TWR
_2AR (two arg return)
_2ARGS (two args)
_2PR (two pointer return)
TWOVOIDPTRRET
In some ABIs there is no 2nd return register (Win64), so returning more
than sizeof(void *) bytes means that what in C on a 32 bit OS is
-------------------------------------------------------------
char[8] functionName(OBJECT * object);
-------------------------------------------------------------
becomes machine code identical to the following C prototype
-------------------------------------------------------------
void functionName( char (*return) [8], OBJECT * object);
-------------------------------------------------------------
This is now less efficient, since both ints/words have to be read from
the C stack. Because of this, a CPP framework must be set up so that,
with a define, Perl can be compiled to use 2 word returns or to do it the
current way. Also I believe that inlined stubs must exist, since existing
C code will be writing "&foo", like for sv_pos_u2b_flags and
utf8n_to_uvchr. If my C knowledge is correct, the 2 words in the return
type can't be split into 2 vars without assigning to a var, so a
macro-only solution won't work without a do{}while(0);, and
do{}while(0); can't be used to implement "#define sv_2pv_flags" since
char * buffer = do{ _2pv_t _2pv = Perl_sv_2pv_flags_s(aTHX_ sv, flags);
if(lp) *lp = _2pv.len; _2pv.pv;}while(0);
isn't valid C and
char * buffer = Perl_sv_2pv_flags_s(aTHX_ sv, flags).pv;
fails to capture the length and
buffer, { len } = Perl_sv_2pv_flags_s(aTHX_ sv, flags).{.len, .pv};
isn't C (maybe it will be valid C in C22 or C33 (^_-), someone wanna
help me submit a proposal to ISO jkjk )
Statement expressions
https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html are GCC only so
that is useless
So stubs like
__forceinline char *
S_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const lp, const I32 flags)
{
_2pv_t _2pv = Perl_sv_2pv_flags_s(aTHX_ sv, flags);
if(lp)
*lp = _2pv.len;
return _2pv.pv;
}
have to be written and put in some .h (inline.h or some other file?). I
believe this can be done automatically by regen.pl from embed.fnc if the
high register arg is marked in embed.fnc. We have NN and NULLOK tokens
already, "2WR" or "_2WR" or "TWR" can be the other token. A
#ifdef USE_TWR
# define TWRARG(x)
# define TWRARG_(x)
#else
# define TWRARG(x) x
# define TWRARG_(x) x,
#endif
has to be in the .c file func definition proto to remove the arg if in
TWR mode.
The return type in the proto has to be
#ifdef USE_TWR
# define TWRRTEXP(x) TWRTYPE
#else
# define TWRRTEXP(x) x
#endif
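As a rough illustration of how those macros would combine in one function definition, here is a self-contained sketch. The function demo_2pv, the struct twr_t and its member names are made up for the example; only USE_TWR, TWRARG_, TWRRTEXP and TWRTYPE come from the proposal above, and the #ifdef in the body is a stand-in for whatever return macro ends up being used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define USE_TWR 1   /* remove this line to build the out-parameter form */

/* Hypothetical TWRTYPE: low word is the buffer, high word the length. */
typedef struct { char *pv; size_t len; } twr_t;
#define TWRTYPE twr_t

#ifdef USE_TWR
#  define TWRARG_(x)            /* the argument vanishes from the proto */
#  define TWRRTEXP(x) TWRTYPE   /* and the return type becomes 2 words  */
#else
#  define TWRARG_(x) x,
#  define TWRRTEXP(x) x
#endif

/* One definition serves both builds. */
static TWRRTEXP(char *)
demo_2pv(char *buf, TWRARG_(size_t *lenp) int flags)
{
    size_t len = strlen(buf);
    (void)flags;                /* unused in this sketch */
#ifdef USE_TWR
    {
        twr_t r;
        r.pv  = buf;
        r.len = len;
        return r;
    }
#else
    if (lenp)
        *lenp = len;
    return buf;
#endif
}

/* Self-check for the USE_TWR build. */
static void twr_macro_demo(void)
{
    char  buf[] = "hello";
    twr_t r     = demo_2pv(buf, 0);

    assert(r.pv == buf);
    assert(r.len == 5);
}
```

With USE_TWR defined the prototype preprocesses to "twr_t demo_2pv(char *buf, int flags)"; without it, to "char *demo_2pv(char *buf, size_t *lenp, int flags)".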
To return/exit a TWR function there are 2 choices that I can think of.
The first is a RETURN_TWR(var1, var2) style macro:
#ifdef USE_TWR
#  if defined(TWR_IS_U64) /* on some ABIs aggregates never get returned by copy */
#    define RETURN_TWRNOK(low, high) STMT_START { TWRTYPE _xtwrtmp; \
        _xtwrtmp.lowmemb = (void *) (low); \
        _xtwrtmp.highmemb = (void *) (high); \
        return *(uint64_t *)(&_xtwrtmp); } STMT_END
#  elif defined(TWR_IS_U128)
#    define RETURN_TWRNOK(low, high) STMT_START { TWRTYPE _xtwrtmp; \
        _xtwrtmp.lowmemb = (void *) (low); \
        _xtwrtmp.highmemb = (void *) (high); \
        return *(uint128_t *)(&_xtwrtmp); } STMT_END
#  elif defined(TWR_IS_F128)
#    define RETURN_TWRNOK(low, high) STMT_START { TWRTYPE _xtwrtmp; \
        _xtwrtmp.lowmemb = (void *) (low); \
        _xtwrtmp.highmemb = (void *) (high); \
        return *(__float128 *)(&_xtwrtmp); } STMT_END
#  else
#    define RETURN_TWRNOK(low, high) STMT_START { TWRTYPE _xtwrtmp; \
        _xtwrtmp.lowmemb = (void *) (low); \
        _xtwrtmp.highmemb = (void *) (high); \
        return _xtwrtmp; } STMT_END
#  endif
#  define RETURN_TWRNN(low, high) RETURN_TWRNOK(low, high)
#else
/* high##p pastes a "p" onto the var name, e.g. len becomes lenp */
#  define RETURN_TWRNOK(low, high) return (high##p ? *high##p = high : 0), low
#  define RETURN_TWRNN(low, high) return (*high##p = high), low
#endif
Another API would be
#define dTWR
#define SETTWRLOW()
#define READTWRLOW()
#define SETTWRHIGH()
#define READTWRHIGH()
#define RETURN_TWR /* no args, the under the hood C auto names are a
secret */
And with dTWR (choice 2) there are either 2 separate C autos, or the 1
aggregate struct return type. Remember the legal return type might not
be a struct type, but a uint64_t, __float128, __int128, long double
(as long as a 128 bit long double is returned in 2 integer regs by the
ABI spec, and not 1 or 2 FP/SSE registers; loading into a FP register and
then to a GPR in the caller is insane/bad), or __m128i, since some ABIs
say only integers or floats get to be split into 2 return registers and
ALL structs, unions, and arrays become a secret 1st arg * supplied by
the caller.
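For the integer-only ABIs, one way to carry the two words is to pack them into a double-width integer with shifts instead of casting a struct pointer. This is only a sketch of that alternative, not what the macros above do: twr_u64, twr_pack, twr_low and twr_high are hypothetical names, the words are assumed to be 32 bits each (on a real 32 bit build a pointer would go through uintptr_t first), and which half lands in which register is up to the ABI.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical return type for ABIs that only split integers across
 * two return registers. */
typedef uint64_t twr_u64;

/* Pack two 32 bit words into one 64 bit integer. */
static twr_u64 twr_pack(uint32_t low, uint32_t high)
{
    return ((twr_u64)high << 32) | (twr_u64)low;
}

/* Caller side: pull the two words back out. */
static uint32_t twr_low(twr_u64 v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}

static uint32_t twr_high(twr_u64 v)
{
    return (uint32_t)(v >> 32);
}

/* Self-check: the round trip preserves both words. */
static void twr_pack_demo(void)
{
    twr_u64 v = twr_pack(0x1234u, 7u);

    assert(twr_low(v) == 0x1234u);
    assert(twr_high(v) == 7u);
}
```

Shifts avoid the aliasing questions raised by reading a struct through a uint64_t pointer, at the cost of relying on the compiler to recognize the pack/unpack pattern as plain register moves.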
The rest of this document is ABI details on CPUs/OSes, so you can stop
reading here if you don't care about machine code:
Win32 on 32 (2 word return not supported on Win64 ABI, no RDX)
EAX EDX registers for returning 64 bit structs or a 64 bit int
iOS ARM 32 and 64 (for ARM32 uint64_t probably required, not a struct,
see composite type)
ARM 32 from ARM Corp
5.4 Result Return
The manner in which a result is returned from a function is determined
by the type of that result.
For the base standard:
...........
- A word-sized Fundamental Data Type (e.g., int, float) is returned in r0.
- A double-word sized Fundamental Data Type (e.g., long long, double and
64-bit containerized vectors) is returned in r0 and r1.
- A 128-bit containerized vector is returned in r0-r3.
- A Composite Type not larger than 4 bytes is returned in r0. The format
is as if the result had been stored in memory at a word-aligned address
and then loaded into r0 with an LDR instruction. Any bits in r0 that lie
outside the bounds of the result have unspecified values.
- A Composite Type larger than 4 bytes, or whose size cannot be
determined statically by both caller and callee, is stored in memory at
an address passed as an extra argument when the function was called
(§5.5, rule A.4). The memory to be used for the result may be modified
at any point during the function call.
ARM 64 from ARM Corp
5.5 Result Return
The manner in which a result is returned from a function is determined
by the type of that result:
- If the type, T, of the result of a function is such that
void func(T arg)
would require that arg be passed as a value in a register (or set of
registers) according to the rules in §5.4
Parameter Passing, then the result is returned in the same registers as
would be used for such an argument.
- Otherwise, the caller shall reserve a block of memory of sufficient
size and alignment to hold the result. The
address of the memory block shall be passed as an additional argument to
the function in x8. The callee may
modify the result memory block at any point during the
...........................
B.3 If the argument type is a Composite Type that is larger than 16
bytes, then the argument is copied to
memory allocated by the caller and the argument is replaced by a pointer
to the copy.
..........................
Table 2, General purpose registers and AAPCS64 usage
The first eight registers, r0-r7, are used to pass argument values into
a subroutine and to return result values from
a function. They may also be used to hold intermediate values within a
routine (but, in general, only between
subroutine calls).
............................
C.10 If the argument is a Composite Type and the size in double-words of
the argument is not more than 8
minus NGRN, then the argument is copied into consecutive general-purpose
registers, starting at
x[NGRN]. The argument is passed as though it had been loaded into the
registers from a double-word-aligned
address with an appropriate sequence of LDR instructions loading
consecutive registers from
memory (the contents of any unused parts of the registers are
unspecified by this standard). The NGRN
is incremented by the number of registers used. The argument has now
been allocated.
.............................
HPUX PA RISC
PA-RISC 64
o GRs 28 and 29 are used for return values up to 128 bits long. These
are scratch registers.
PA-RISC 32 is GR 28 and GR29 for 64 bit values
When calling functions that return results larger than 64 bits, the
caller passes a short
pointer (using SR5 - SR7) in GR28 (ret0) which describes the memory
location for the
function result. The address given should be the address for the
high-order byte of the
result.
X64 SysV (Linux/OSX 64)
3. If the class is INTEGER, the next available register of the sequence
%rax,
%rdx is used.
x86-32 SysV (Linux)
EAX EDX http://wiki.osdev.org/Calling_Conventions
OSX 32
EAX EDX
https://developer.apple.com/library/mac/documentation/developertools/conceptual/lowlevelabi/130-IA-32_Function_Calling_Conventions/IA32.html
Solaris 64
Functions that return an integer value return it in %o0 or %o0 and %o1.
For 32-bit code, long long data are returned with the upper 32-bits in
%o0 and the lower 32-bits in %o1, treating %o0 and %o1 as if they were
32-bit registers.
Some PDF on MIPS
Function results are returned in $2 (and $3 if needed), or $f0 (and $f2
if needed),
as appropriate for the type. Composite results (struct, union, or array) are
returned in $2/$f0 and $3/$f2 according to the following rules:
A struct with only one or two floating point fields is returned in $f0
(and
$f2 if necessary). This is a generalization of the Fortran COMPLEX case.
Any other struct or union results of at most 128 bits are returned in
$2 (first
64 bits) and $3 (remainder, if necessary).