
Re: RFC: Add new string comparison macros in handy.h

From: Karl Williamson
Date: October 26, 2017 02:27
Subject: Re: RFC: Add new string comparison macros in handy.h
Message ID: 3d67bcd9-4108-e4af-8182-157a79118bf9@khwilliamson.com

On 06/01/2017 02:53 PM, demerphq wrote:
> On 11 May 2017 at 17:22, Karl Williamson <public@khwilliamson.com> wrote:
>> I would like to add the macros given below to handy.h.  The situations they
>> handle occur reasonably frequently in the core, and these can save
>> developers from thinking they have to manually count the characters in a
>> string.
>>
>> I am not confident at all about the names, and would like to see if people
>> have better ones.
> 
> I think creating a new set of macros with clearer names is a good
> idea, but  how easy is it for us to deprecate the old ones?
> 
> I wanted to give a summary of the history at stake here:
> 
> We have had the following macros since the history of perl:
> 
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  478)
> #define strNE(s1,s2) (strcmp(s1,s
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  479)
> #define strEQ(s1,s2) (!strcmp(s1,
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  480)
> #define strLT(s1,s2) (strcmp(s1,s
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  481)
> #define strLE(s1,s2) (strcmp(s1,s
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  482)
> #define strGT(s1,s2) (strcmp(s1,s
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  483)
> #define strGE(s1,s2) (strcmp(s1,s
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  485)
> #define strnNE(s1,s2,l) (strncmp(
> ^8d063cd (Larry Wall               1987-12-18 00:00:00 +0000  486)
> #define strnEQ(s1,s2,l) (!strncmp
> 
> We have had these since 1996:
> 
> 36477c24 (Perl 5 Porters           1996-12-06 18:56:00 +1200  497) #
> define memNE(s1,s2,l) (memcmp(
> 36477c24 (Perl 5 Porters           1996-12-06 18:56:00 +1200  498) #
> define memEQ(s1,s2,l)
> 
> We have had these since 2007:
> 
> 568a785a (Nicholas Clark           2007-03-23 16:55:13 +0000  505)
> #define memEQs(s1, l, s2) \
> 777fa2cb (Yves Orton               2016-10-19 10:32:29 +0200  506)
>      (((sizeof(s2)-1) == (l))
> 568a785a (Nicholas Clark           2007-03-23 16:55:13 +0000  507)
> #define memNEs(s1, l, s2) !memEQs
> 
> You added these in September 2016:
> 
> 062b6850 (Karl Williamson          2016-09-10 08:54:36 -0600  515)
> #define memLT(s1,s2,l) (memcmp(s1
> 062b6850 (Karl Williamson          2016-09-10 08:54:36 -0600  516)
> #define memLE(s1,s2,l) (memcmp(s1
> 062b6850 (Karl Williamson          2016-09-10 08:54:36 -0600  517)
> #define memGT(s1,s2,l) (memcmp(s1
> 062b6850 (Karl Williamson          2016-09-10 08:54:36 -0600  518)
> #define memGE(s1,s2,l) (memcmp(s1
> 
> I added these in October 2016 (in a post I just sent I realized they
> were misnamed and should have been called strnNEs(); note the missing
> 'n' to comply with strnNE()).
> 
> 62946e08 (Yves Orton               2016-10-19 10:30:44 +0200  492)
> #define strNEs(s1,s2) (strncmp(s1
> 62946e08 (Yves Orton               2016-10-19 10:30:44 +0200  493)
> #define strEQs(s1,s2) (!strncmp(s
> 
> and these:
> 
> 777fa2cb (Yves Orton               2016-10-19 10:32:29 +0200  511)
> #define _memEQs(s1, s2) \
> 777fa2cb (Yves Orton               2016-10-19 10:32:29 +0200  512)
>      (memEQ((s1), ("" s2 ""),
> 777fa2cb (Yves Orton               2016-10-19 10:32:29 +0200  513)
> #define _memNEs(s1, s2) (memNE((s
> 
> 
>> I also would like to document memEQs, memLE, memLT, memGE, and memGT. And
>> move all similar macros to a new section, "String comparison functions",
>> from the current "Miscellaneous".
>>
>>      strSTARTS_WITHs
>>              Test if the "NUL"-terminated string "s1" begins with the
>>              substring given by the string literal "s2", returning
>>              non-zero if so (including if the two are identical); zero
>>              otherwise.
>>
>>                      bool    strSTARTS_WITHs(char* s1, char* s2)
> 
> So this is equivalent to the current strEQs().
> 
> To comply with existing convention strEQs() should be renamed strnEQs().
> 
> I think adding a long form equivalent is ok, but I think the old
> naming convention (assuming the name is corrected to include the 'n')
> makes sense too.
> 
>>      memSTARTS_WITHs
>>              Test if the string buffer "s1" with length "l1" begins with the
>>              substring given by the string literal "s2", returning non-zero
>>              if so (including if the two are identical); zero otherwise. The
>>              comparison does not include the final "NUL" of "s2". "s1" does
>>              not have to be "NUL"-terminated,
> 
> So the difference with the 'str' version is that str() considers a
> null byte to be end of string, and mem() does not. Is there any case
> where using memcmp() instead of str[n]cmp() is wrong for this type of
> macro? If not maybe we should just have one (using memcmp).
> 
> 
>>                      bool    memSTARTS_WITHs(char* s1, STRLEN l1, char* s2)
>>
>>      memENDS_WITHs
>>              Test if the string buffer "s1" with length "l1" ends with the
>>              substring given by the string literal "s2", returning non-zero
>>              if so (including if the two are identical); zero otherwise. The
>>              comparison does not include the final "NUL" of "s2". "s1" does
>>              not have to be "NUL"-terminated,
>>
>>                      bool    memENDS_WITHs(char* s1, STRLEN l1, char* s2)
> 
> Do we actually have/use this? Beyond the comments above about "mem" vs
> "str" I dont have any problem with this.
> 
>>
>>      memFOO_STARTING_WITHs
>>              Test if the string buffer "s1" with length "l1" begins with the
>>              substring given by the string literal "s2", and that "s1" is
>>              longer than "s2", returning non-zero if so; zero otherwise. In
>>              other words, "s2" begins "s1" but is not all of "s1". The
>>              comparison does not include the final "NUL" of "s2". "s1" does
>>              not have to be "NUL"-terminated,
>>
>>                      bool    memFOO_STARTING_WITHs(char* s1, STRLEN l1,
>>                                                    char* s2)
>>
>>      memFOO_ENDING_WITHs
>>              Test if the string buffer "s1" with length "l1" ends with the
>>              substring given by the string literal "s2", and that "s1" is
>>              longer than "s2", returning non-zero if so; zero otherwise. In
>>              other words, "s2" ends "s1" but is not all of "s1". The
>>              comparison does not include the final "NUL" of "s2". "s1" does
>>              not have to be "NUL"-terminated,
>>
>>                      bool    memFOO_ENDING_WITHs(char* s1, STRLEN l1,
>>                                                  char* s2)
> 
> So we need something better than FOO.
> 
> Personally i would prefer to see a convention more like:
> 
> (mem|str)IS_(PREFIX|SUFFIX|EQ|NE|LT|GT|GE|LE)[ls]*
> 
> With the appropriate mix of arguments specified by the suffix.
> 
> That would mean all of the macros of the form strIS() and memIS() come
> from the new convention, and everything else is historical.
> 
> So i could imagine a macro
> 
> as well as
> 
> strIS_EQ(s1,s2)
> strIS_EQs(s1,s2)
> strIS_EQls(s1,l1,s2)
> strIS_EQl(s1,l1,s2)
> strIS_EQll(s1,l1,s2,l2)
> 
> and possibly a few other permutations.
> 
> I like the idea of standardizing this stuff with conventions that are
> well described and predictable so if we have to add a new variant it is
> well defined what it should be called.
> 
> cheers,
> Yves
> 

I have finally looked at Yves' proposal, and I'm not convinced we 
currently have an inconsistent interface that needs to be fixed, and so 
I think we should stay with what we have as a template.  That interface 
closely follows the C library ones, which is a good thing since those 
using it are programming in C.

That interface I believe is

(mem|strn?)(EQ|NE|LT|GT|GE|LE)s?

The optional 'n' in 'str' calls follows the C convention of meaning
"There is a third parameter at the end, giving the maximum number of
bytes to use in the comparison".

The trailing 's' means the 2nd parameter is a C literal double-quoted
string.  Its length is known at compile time, and is not explicitly
specified.

The mem functions all require an explicit length parameter.  If the same 
length applies to both of the buffer parameters, it is the third 
parameter.  If it applies to just the first buffer parameter, it 
immediately follows that one.  (All the cases so far where there might 
be a different length for the second parameter use the trailing 's' form.)
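
To make the two parameter orders concrete, here is a minimal sketch.
The demo_ names are hypothetical stand-ins written for this message, not
the handy.h definitions; the sizeof-minus-one and "" s2 "" idioms are
assumed from the fragments quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one length 'l' applies to both buffers, so it comes last. */
#define demo_memEQ(s1, s2, l) (memcmp((s1), (s2), (l)) == 0)

/* Hypothetical: 'l' describes only s1, so it follows s1 directly.
 * s2 must be a string literal; its length is sizeof(s2) - 1, computed
 * at compile time.  The "" s2 "" concatenation makes the macro fail to
 * compile if s2 is not a literal. */
#define demo_memEQs(s1, l, s2) \
    ((sizeof("" s2 "") - 1) == (size_t)(l) \
     && memcmp((s1), ("" s2 ""), sizeof("" s2 "") - 1) == 0)

static void demo_mem_checks(void)
{
    const char buf[] = { 'p', 'e', 'r', 'l' };   /* not NUL-terminated */

    assert(demo_memEQ(buf, "perl", sizeof buf));    /* shared length   */
    assert(demo_memEQs(buf, sizeof buf, "perl"));   /* length of s1    */
    assert(!demo_memEQs(buf, sizeof buf, "per"));   /* lengths differ  */
}
```

Nobody has to count the characters in "perl" by hand in the 's' form;
the compiler does it.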

That's the existing convention.  I don't see any inconsistencies in
the existing macros, except in the ones Yves added: strNEs and strEQs.
And those macros are protected from non-core usage in 5.26 by
restricting them to PERL_CORE.

If we added a mem macro where the 2nd buffer needed a different explicit 
length parameter to be supplied, then I'm fine with using 'l' as a 
suffix in the macro name to mean that, and documenting that in handy.h 
as our plan.

The strEQs that is confined to core is really checking whether the
second string is an initial substring of the first.  From the name, I
would have expected it to mean instead: are the two strings the same,
given that the second is a compile-time constant?
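
The difference between the two readings can be sketched as follows.
Both macros are hypothetical and written only for illustration; the
first is an assumption about the behavior described here (a strncmp
bounded by the literal's length), not a copy of the core definition.

```c
#include <assert.h>
#include <string.h>

/* ASSUMPTION: the initial-substring behavior, modeled as a strncmp
 * limited to the length of the literal s2. */
#define demo_strEQs_prefix(s1, s2) \
    (strncmp((s1), ("" s2 ""), sizeof("" s2 "") - 1) == 0)

/* What the name could be read as: the whole strings are identical. */
#define demo_strEQs_whole(s1, s2) (strcmp((s1), ("" s2 "")) == 0)

static void demo_streqs_checks(void)
{
    /* "foo" begins "foobar", but the two are not the same string. */
    assert(demo_strEQs_prefix("foobar", "foo"));
    assert(!demo_strEQs_whole("foobar", "foo"));

    /* Identical strings satisfy both readings. */
    assert(demo_strEQs_prefix("foo", "foo"));
    assert(demo_strEQs_whole("foo", "foo"));

    /* A string shorter than the literal satisfies neither. */
    assert(!demo_strEQs_prefix("fo", "foo"));
    assert(!demo_strEQs_whole("fo", "foo"));
}
```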

My original point is that it would be clearer to have a name that 
indicates the initial substring check.  I'm not happy with what I 
proposed, and would be OK with using strPREFIX and strPREFIXs, 'PREFIX' 
being a term that Yves proposed.

But there is an occasional need for checking that the initial substring 
is not the complete main string.  In mathematical terms, that it is a 
"proper" substring.  We could say strPROPER_PREFIX, but that's getting a 
bit long.  We could say strPPREFIX, but I worry that the doubled 'P'
wouldn't stand out enough as different from a single 'P'.  I'm now
leaning toward strBEGIN and strPBEGIN, or just strBEG and strPBEG.

Similarly there are cases where we are looking for the final substring, 
both proper and not.  I now think strEND and strPEND are ok for this.

strINITIAL_SUBSTRING is accurate but I think too long.

Whatever name we choose to signify the concepts 'initial' and 'final', 
it could and should also be applied to the mem functions if the need arose.
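
As a rough sketch of what the mem forms of these four concepts might
look like (the demo_ names and bodies are hypothetical, chosen only to
illustrate 'initial', 'final' and 'proper'; nothing here is a settled
name or an existing definition):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of a string literal without its trailing NUL. */
#define DEMO_LEN(s) (sizeof("" s "") - 1)

/* Note: these evaluate s1 and l more than once. */

/* s2 begins s1 (s1 may equal s2). */
#define demo_memBEGs(s1, l, s2) \
    ((size_t)(l) >= DEMO_LEN(s2) \
     && memcmp((s1), ("" s2 ""), DEMO_LEN(s2)) == 0)

/* s2 begins s1 and s1 is strictly longer: a proper initial substring. */
#define demo_memPBEGs(s1, l, s2) \
    ((size_t)(l) > DEMO_LEN(s2) \
     && memcmp((s1), ("" s2 ""), DEMO_LEN(s2)) == 0)

/* s2 ends s1 (s1 may equal s2). */
#define demo_memENDs(s1, l, s2) \
    ((size_t)(l) >= DEMO_LEN(s2) \
     && memcmp((s1) + (l) - DEMO_LEN(s2), ("" s2 ""), DEMO_LEN(s2)) == 0)

/* s2 ends s1 and s1 is strictly longer: a proper final substring. */
#define demo_memPENDs(s1, l, s2) \
    ((size_t)(l) > DEMO_LEN(s2) \
     && memcmp((s1) + (l) - DEMO_LEN(s2), ("" s2 ""), DEMO_LEN(s2)) == 0)

static void demo_affix_checks(void)
{
    const char *name = "Foo::Bar";
    const size_t len = strlen(name);              /* 8 */

    assert(demo_memBEGs(name, len, "Foo::"));
    assert(demo_memPBEGs(name, len, "Foo::"));
    assert(demo_memENDs(name, len, "::Bar"));
    assert(demo_memPENDs(name, len, "::Bar"));

    /* When s1 is exactly s2, only the non-proper forms match. */
    assert(demo_memBEGs("Foo::", 5, "Foo::"));
    assert(!demo_memPBEGs("Foo::", 5, "Foo::"));
    assert(demo_memENDs("::Bar", 5, "::Bar"));
    assert(!demo_memPENDs("::Bar", 5, "::Bar"));

    /* A buffer shorter than the literal never matches. */
    assert(!demo_memENDs("ar", 2, "::Bar"));
}
```

The length test comes first in each macro so that memcmp is never asked
to read before the start or past the end of a short buffer.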

I suspect that a bunch of strnEQ calls are really looking for an initial 
substring, and changing such to say so means the code reader doesn't 
have to do any counting to see what the real effect is.
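
A small illustration of the counting problem, assuming a hypothetical
demo_strBEGs macro (not an existing one) and an example string chosen
for this message:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical prefix test against a literal; the length is taken
 * from the literal itself rather than typed in by hand. */
#define demo_strBEGs(s1, s2) \
    (strncmp((s1), ("" s2 ""), sizeof("" s2 "") - 1) == 0)

static void demo_count_checks(void)
{
    const char *name = "CORE::GLOBAL:x";   /* only one ':' before the x */

    /* Hand-counted length: the correct count for "CORE::GLOBAL::" is 14. */
    assert(strncmp(name, "CORE::GLOBAL::", 14) != 0);

    /* Off by one (13): the comparison stops before the second ':' and
     * wrongly reports a match. */
    assert(strncmp(name, "CORE::GLOBAL::", 13) == 0);

    /* With the length derived from the literal there is nothing to miscount. */
    assert(!demo_strBEGs(name, "CORE::GLOBAL::"));
    assert(demo_strBEGs("CORE::GLOBAL::die", "CORE::GLOBAL::"));
}
```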

To summarize, I think that there are no inconsistencies that aren't 
already the same as C library calls, in the macros usable outside of 
core, so I don't believe we need to come up with a consistent set. 
Doing so might actually confuse C programmers.  I do want new macros 
that test if the second parameter is an initial or final substring of 
the first parameter.  Such additions should be consistent, across str 
and mem forms, and could be documented in handy.h.

One version of what I'm thinking is

(mem|strn?)(EQ|NE|LT|GT|GE|LE|P?(BEG|END))s?
