develooper Front page | perl.perl5.porters | Postings from November 2012

[perl #115746] [PATCH] clean up inline string comparisons in gv_fetchpvn_flags

Thread Previous | Thread Next
From:
bulk 88 via RT
Date:
November 16, 2012 12:33
Subject:
[perl #115746] [PATCH] clean up inline string comparisons in gv_fetchpvn_flags
Message ID:
rt-3.6.HEAD-17500-1353098023-1670.115746-15-0@perl.org
On Thu Nov 15 16:34:10 2012, public@khwilliamson.com wrote:
> On 11/15/2012 05:06 PM, Tony Cook wrote:
> >
> > This code makes non-portable assumptions about alignment, or at least
> > about accessing unaligned memory as an I32.
> >
> > Tony
> >
> 
> This thread may have some bearing:
> http://markmail.org/message/aspjihku5iaszw5q
> 

The description of U32_ALIGNMENT_REQUIRED sounds strange to me. It
implys that 32 bit casting and bitshift are required to manipulate
strings but that can't be the case. In Digest::MD5 there is actual use
of U32_ALIGNMENT_REQUIRED, but Digest::MD5's Makefile.PL defines it
using its own logic and with test compiles so it ignores config.h. So
globally, the U32_ALIGNMENT_REQUIRED define's description is either
wrong (probably), or Digest::MD5's repurposing of it is wrong.

So is a new def required for what I want to do or not or is
U32_ALIGNMENT_REQUIRED the correct def to use for what I want to do?
Win32's config has U32_ALIGNMENT_REQUIRED defined, so I would need to
change that.

Another question is, what is faster? I did some benchmarking, and the
numbers are all over the place +/- 30% based on the sample data. Some
things that flip which is faster, if element "IOP" is 3 or 4 characters.
If there are elements after "IOP", O1 and O2, and 20 mil or 10 mil
iterations the results flip. Each one flips the result individually,
combined the results I think are random. One benchmark run untimed was
done to in theory preload the cache. I dont think there was a difference
between a run with the prerun and no prerun loop, or it was below the
per process run time noise. On O1 new func is 0x9D, old is 0x145. On O2
old is 0x148, new is 0xB8. Each test at 10 mil takes 1 to 2.5 secs. I
think the benchmark is flawed somehow (not random (branch predictor
optimizes away), sample data (arrays and strings) are in RO memory, not
enough never match strings/not enough matching strings, no runloop/cache
invalidation between runs) but I would need to play around with it more
to figure out exactly what is going on.

The reason I bring up the issue here is,
Perl_keyword/Devel::Tokenizer::C does the same thing that the old code
here does. Perl_keyword is the 3rd largest function in the interp at
~15-16KB 32bit x86. It could use the same modification that I did here
but on a larger scale.

______________________________________________________
void
StrEqTree()
PREINIT:
    int i;
    int iarr;
#define W(x) x
    static const char * arr [] =
    
    {W("ARGVOUT"), W("ZOO"), W("ENV"), W("ARGV"), W("NO"), W("STDOUT"),
                                W("_"), W("SIG"), W("INC"), W("*"),
W("FOO"), W("STDIN"),
                                W("STDERR"), W("ARGV"), W("INC"),
W("STDOUT"), W("STDZER"),
    //                            W("IOP"), W("ZIP"), W("ENV"),
W("CAT"), W("ZOOF"), W("STDIN"),
    //W("ARGVOUT"), W("ENV"), W("ARGV"), W("STDOUT"),
    //                            W("_"), W("SIG"), W("INC"), W("*")
                                
                                };
#define W(x) sizeof(x)-1
    static const char arrlen [] =
    {W("ARGVOUT"), W("ZOO"), W("ENV"), W("ARGV"), W("NO"), W("STDOUT"),
                                W("_"), W("SIG"), W("INC"), W("*"),
W("FOO"), W("STDIN"),
                                W("STDERR"), W("ARGV"), W("INC"),
W("STDOUT"), W("STDZER"),
    //                            W("IOP"), W("ZIP"), W("ENV"),
W("CAT"), W("ZOOF"), W("STDIN"),
    //W("ARGVOUT"), W("ENV"), W("ARGV"), W("STDOUT"),
    //                            W("_"), W("SIG"), W("INC"), W("*")
                                
                                };
    LARGE_INTEGER my_beg;
    LARGE_INTEGER my_end;
    LARGE_INTEGER my_freq;
    int tc;
CODE:
    QueryPerformanceFrequency(&my_freq);

    tc=0;
    for(i=0; i < 20000000; i++){
        for(iarr=0; iarr < sizeof(arrlen); iarr++){
            tc += streqold(arr[iarr], arrlen[iarr]);
        }
    }
    
    QueryPerformanceCounter(&my_beg);
    tc=0;
    for(i=0; i < 20000000; i++){
        for(iarr=0; iarr < sizeof(arrlen); iarr++){
             tc += streqnew(arr[iarr], arrlen[iarr]);
        }
    }
    QueryPerformanceCounter(&my_end);
    printf("new time=%f, opt=" OPTIMIZE " true=%i\n",
((double)(my_end.QuadPart-my_beg.QuadPart))/(double)my_freq.QuadPart, tc);
    
    QueryPerformanceCounter(&my_beg);
    tc=0;
    for(i=0; i < 10000000; i++){
        for(iarr=0; iarr < sizeof(arrlen); iarr++){
            tc += streqold(arr[iarr], arrlen[iarr]);
        }
    }
    QueryPerformanceCounter(&my_end);
    printf("old time=%f, opt=" OPTIMIZE " true=%i\n",
((double)(my_end.QuadPart-my_beg.QuadPart))/(double)my_freq.QuadPart, tc);
___________________________________________________________

bool streqold(const char * name, STRLEN len){
#undef NVSIZE
#undef I32SIZE
#define NVSIZE 9
#define I32SIZE 5
	    bool global = FALSE;

	    switch (len) {
	    case 1:
		if (*name == '_')
		    global = TRUE;
		break;
	    case 3: {
#if I32SIZE == 4
		    /* assume we have a term null even though len is 3 */
		    const char INCs [4] = "INC\0"; 
		    const char ENVs [4] = "ENV\0";
		    const char SIGs [4] = "SIG\0";
		    if (*(I32 *)name == *(I32 *)INCs
			|| *(I32 *)name == *(I32 *)ENVs
			|| *(I32 *)name == *(I32 *)SIGs)
#else
		    if ((name[0] == 'I' && name[1] == 'N' && name[2] == 'C')
			|| (name[0] == 'E' && name[1] == 'N' && name[2] == 'V')
			|| (name[0] == 'S' && name[1] == 'I' && name[2] == 'G'))
#endif
			 global = TRUE;
		}
		break;
	    case 4: {
#if I32SIZE == 4
		    const char ARGVs [4] = "ARGV";
		    if (*(I32 *)name == *(I32 *)ARGVs)
#else
		    if (name[0] == 'A' && name[1] == 'R' && name[2] == 'G'
			&& name[3] == 'V')
#endif
			global = TRUE;
		}
		break;
	    case 5: {
#if I32SIZE == 4
		    const char STDIs [4] = "STDI";
		    if (*(I32 *)name == *(I32 *)STDIs && name[4] == 'N')
#else
		    if (name[0] == 'S' && name[1] == 'T' && name[2] == 'D'
			&& name[3] == 'I' && name[4] == 'N')
#endif
			global = TRUE;
		}
		break;
	    case 6: { /* s2=string [part] 2 */
#if I32SIZE == 4 && I16SIZE == 2
		    const char STDOUTs1 [4] = "STDO";
		    const char STDOUTs2 [2] = "UT";
		    const char STDERRs1 [4] = "STDE";
		    const char STDERRs2 [2] = "RR";
		    if ((*(I32 *)name == *(I32 *)STDOUTs1 && *(I16 *)&name[4] == *(I16
*)STDOUTs2)
			|| (*(I32 *)name == *(I32 *)STDERRs1 && *(I16 *)&name[4] == *(I16
*)STDERRs2))
#else
		    if ((name[0] == 'S' && name[1] == 'T' && name[2] == 'D')
			&&((name[3] == 'O' && name[4] == 'U' && name[5] == 'T')
			   ||(name[3] == 'E' && name[4] == 'R' && name[5] == 'R')))
#endif
			global = TRUE;
		}
		break;
	    case 7: {
/* use only native machine i64s, not compiler emulated ones */
#if defined(HAS_QUAD) && I64SIZE == 8 && PTRSIZE == 8
		    const char ARGVOUTs [8] = "ARGVOUT\0";
		    if (*(I64 *)name == *(I64 *)ARGVOUTs)
#elif NVSIZE == 8
		    /* least CPU and instruction set (X87, SSE2, etc) dependent way */
		    if (*(NV *)name ==  *(NV *)"ARGVOUT")
#elif I32SIZE == 4
		    /* assume we have a term null even though len is 7 */
		    const char ARGVOUTs1 [4] = "ARGV";
		    const char ARGVOUTs2 [4] = "OUT\0";
		    if (*(I32 *)name == *(I32 *)ARGVOUTs1 && *(I32 *)&name[4] == *(I32
*)ARGVOUTs2)
#else
		    if (name[0] == 'A' && name[1] == 'R' && name[2] == 'G'
			&& name[3] == 'V' && name[4] == 'O' && name[5] == 'U'
			&& name[6] == 'T')
#endif
			global = TRUE;
		}
		break;
	    }
        return global;
#undef NVSIZE
#undef I32SIZE
#define NVSIZE 8
#define I32SIZE 4
}

bool streqnew(const char * name, STRLEN len){
	    bool global = FALSE;

	    switch (len) {
	    case 1:
		if (*name == '_')
		    global = TRUE;
		break;
	    case 3: {
#if I32SIZE == 4
		    /* assume we have a term null even though len is 3 */
		    const char INCs [4] = "INC\0"; 
		    const char ENVs [4] = "ENV\0";
		    const char SIGs [4] = "SIG\0";
		    if (*(I32 *)name == *(I32 *)INCs
			|| *(I32 *)name == *(I32 *)ENVs
			|| *(I32 *)name == *(I32 *)SIGs)
#else
		    if ((name[0] == 'I' && name[1] == 'N' && name[2] == 'C')
			|| (name[0] == 'E' && name[1] == 'N' && name[2] == 'V')
			|| (name[0] == 'S' && name[1] == 'I' && name[2] == 'G'))
#endif
			 global = TRUE;
		}
		break;
	    case 4: {
#if I32SIZE == 4
		    const char ARGVs [4] = "ARGV";
		    if (*(I32 *)name == *(I32 *)ARGVs)
#else
		    if (name[0] == 'A' && name[1] == 'R' && name[2] == 'G'
			&& name[3] == 'V')
#endif
			global = TRUE;
		}
		break;
	    case 5: {
#if I32SIZE == 4
		    const char STDIs [4] = "STDI";
		    if (*(I32 *)name == *(I32 *)STDIs && name[4] == 'N')
#else
		    if (name[0] == 'S' && name[1] == 'T' && name[2] == 'D'
			&& name[3] == 'I' && name[4] == 'N')
#endif
			global = TRUE;
		}
		break;
	    case 6: { /* s2=string [part] 2 */
#if I32SIZE == 4 && I16SIZE == 2
		    const char STDOUTs1 [4] = "STDO";
		    const char STDOUTs2 [2] = "UT";
		    const char STDERRs1 [4] = "STDE";
		    const char STDERRs2 [2] = "RR";
		    if ((*(I32 *)name == *(I32 *)STDOUTs1 && *(I16 *)&name[4] == *(I16
*)STDOUTs2)
			|| (*(I32 *)name == *(I32 *)STDERRs1 && *(I16 *)&name[4] == *(I16
*)STDERRs2))
#else
		    if ((name[0] == 'S' && name[1] == 'T' && name[2] == 'D')
			&&((name[3] == 'O' && name[4] == 'U' && name[5] == 'T')
			   ||(name[3] == 'E' && name[4] == 'R' && name[5] == 'R')))
#endif
			global = TRUE;
		}
		break;
	    case 7: {
/* use only native machine i64s, not compiler emulated ones */
#if defined(HAS_QUAD) && I64SIZE == 8 && PTRSIZE == 8
		    const char ARGVOUTs [8] = "ARGVOUT\0";
		    if (*(I64 *)name == *(I64 *)ARGVOUTs)
//#elif NVSIZE == 8
//		    /* least CPU and instruction set (X87, SSE2, etc) dependent way */
//		    if (*(NV *)name ==  *(NV *)"ARGVOUT")
#elif I32SIZE == 4
		    /* assume we have a term null even though len is 7 */
		    const char ARGVOUTs1 [4] = "ARGV";
		    const char ARGVOUTs2 [4] = "OUT\0";
		    if (*(I32 *)name == *(I32 *)ARGVOUTs1 && *(I32 *)&name[4] == *(I32
*)ARGVOUTs2)
#else
		    if (name[0] == 'A' && name[1] == 'R' && name[2] == 'G'
			&& name[3] == 'V' && name[4] == 'O' && name[5] == 'U'
			&& name[6] == 'T')
#endif
			global = TRUE;
		}
		break;
	    }
        return global;
}
____________________________________________________________

I won't post times yet since they are so unstable from the different
combination, that there is no clear winner, and no pattern to what is
faster, and the code is probably flawed as representing the "real world".

---
via perlbug:  queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=115746

Thread Previous | Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About