develooper Front page | perl.perl5.porters | Postings from March 2021

Re: Let's talk about trim() so more

From: B. Estrade
Date: March 31, 2021 15:43
Subject: Re: Let's talk about trim() so more
Message ID: 12703788-b80a-c734-8226-ec89d16c3562@cpanel.net

Here is some additional research that might help bolster someone's
case. Frankly, I have no dog in this fight. Would I use "trim"?
Probably. I use chomp pretty much only when iterating over lines in
<$fh>, but I am not above that kind of thing; lines in files can have
leading garbage I don't want, too.
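
To make the chomp-versus-trim distinction concrete, here is a minimal
sketch. The in-memory filehandle and its contents are made up for
illustration; a real program would open an actual file.

```perl
use strict;
use warnings;

# Hypothetical input: an in-memory filehandle standing in for a real file.
my $data = "  alpha  \n\tbeta\n gamma \t\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my (@chomped, @trimmed);
while (my $line = <$fh>) {
    chomp $line;                            # removes only the trailing newline
    push @chomped, $line;

    (my $clean = $line) =~ s/^\s+|\s+$//g;  # strips whitespace at both ends
    push @trimmed, $clean;
}
close $fh;

print "chomp: [$_]\n" for @chomped;
print "trim:  [$_]\n" for @trimmed;
```

chomp leaves the leading and trailing spaces and tabs in place, which
is exactly the "leading garbage" case.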

In any case, there are 101 PHP string functions listed at [1]; they are
provided in Appendix A for additional perspective. The top of [1] also
states,

 >For even more powerful string handling and manipulating functions take
 >a look at the Perl compatible regular expression functions. For working
 >with multibyte character encodings, take a look at the Multibyte String
 >functions [2].

There are 59 additional "multibyte" string functions (see Appendix B).

And STILL more: PHP conveniently provides 11 functions [3] that
interface to PCRE for "even more powerful string handling ... something
something Perl something something..." [1]. (See Appendix C for the full
listing.) Yes, I know PCRE is not actually Perl's implementation.

I am no math wizard, but that's around 170 in total (101 + 59 + 11 =
171). I certainly don't expect anyone to take the time to convert these
to regular expressions or Perl idioms [certainly not *all* are just
regexes]; but maybe there are a few more of these that might provide
similar conveniences and can be justified by how common the less
friendly forms are on CPAN? I mean, out of 170 or so there have got to
be an order of magnitude fewer that Perl could also use? And please,
whatever you do, don't point me to PHP::Strings on CPAN. I already know
about it, and it's failing all of its tests.
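
As a rough illustration of what "converting to Perl idioms" could look
like for a handful of the Appendix A entries, here is a minimal sketch.
The sub names simply mirror the PHP functions and are hypothetical; they
are not Perl builtins, and the behavior only approximates PHP's (for
instance, PHP's ltrim/rtrim strip a fixed default character list, while
\s here follows Perl's own definition of whitespace).

```perl
use strict;
use warnings;

# Hypothetical helpers named after PHP functions; built only from the
# core builtins index, rindex, substr, and s///.
sub str_contains    { my ($s, $sub) = @_; return index($s, $sub) >= 0 }
sub str_starts_with { my ($s, $pre) = @_; return rindex($s, $pre, 0) == 0 }
sub str_ends_with {
    my ($s, $suf) = @_;
    my $off = length($s) - length($suf);   # where the suffix would begin
    return $off >= 0 && substr($s, $off) eq $suf;
}
sub ltrim { my ($s) = @_; $s =~ s/^\s+//; return $s }
sub rtrim { my ($s) = @_; $s =~ s/\s+$//; return $s }

# str_repeat needs no helper at all: it is the x operator.
my $rule = '-' x 20;

print "contains:    ", (str_contains('perl5-porters', 'port') ? 'yes' : 'no'), "\n";
print "starts_with: ", (str_starts_with('trim()', 'trim')     ? 'yes' : 'no'), "\n";
print "ends_with:   ", (str_ends_with('rtrim', 'trim')        ? 'yes' : 'no'), "\n";
print "ltrim:       [", ltrim("  x  "), "]\n";
print "rtrim:       [", rtrim("  x  "), "]\n";
print "$rule\n";
```

Most of these are one-liners over existing builtins, which is part of
why the same few lines keep getting rewritten inline.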

References

1. https://www.php.net/manual/en/ref.strings.php
2. https://www.php.net/manual/en/ref.mbstring.php
3. https://www.php.net/manual/en/ref.pcre.php

Appendix A.

addcslashes — Quote string with slashes in a C style
addslashes — Quote string with slashes
bin2hex — Convert binary data into hexadecimal representation
chop — Alias of rtrim
chr — Generate a single-byte string from a number
chunk_split — Split a string into smaller chunks
convert_cyr_string — Convert from one Cyrillic character set to another
convert_uudecode — Decode a uuencoded string
convert_uuencode — Uuencode a string
count_chars — Return information about characters used in a string
crc32 — Calculates the crc32 polynomial of a string
crypt — One-way string hashing
echo — Output one or more strings
explode — Split a string by a string
fprintf — Write a formatted string to a stream
get_html_translation_table — Returns the translation table used by 
htmlspecialchars and htmlentities
hebrev — Convert logical Hebrew text to visual text
hebrevc — Convert logical Hebrew text to visual text with newline conversion
hex2bin — Decodes a hexadecimally encoded binary string
html_entity_decode — Convert HTML entities to their corresponding characters
htmlentities — Convert all applicable characters to HTML entities
htmlspecialchars_decode — Convert special HTML entities back to characters
htmlspecialchars — Convert special characters to HTML entities
implode — Join array elements with a string
join — Alias of implode
lcfirst — Make a string's first character lowercase
levenshtein — Calculate Levenshtein distance between two strings
localeconv — Get numeric formatting information
ltrim — Strip whitespace (or other characters) from the beginning of a 
string
md5_file — Calculates the md5 hash of a given file
md5 — Calculate the md5 hash of a string
metaphone — Calculate the metaphone key of a string
money_format — Formats a number as a currency string
nl_langinfo — Query language and locale information
nl2br — Inserts HTML line breaks before all newlines in a string
number_format — Format a number with grouped thousands
ord — Convert the first byte of a string to a value between 0 and 255
parse_str — Parses the string into variables
print — Output a string
printf — Output a formatted string
quoted_printable_decode — Convert a quoted-printable string to an 8 bit 
string
quoted_printable_encode — Convert a 8 bit string to a quoted-printable 
string
quotemeta — Quote meta characters
rtrim — Strip whitespace (or other characters) from the end of a string
setlocale — Set locale information
sha1_file — Calculate the sha1 hash of a file
sha1 — Calculate the sha1 hash of a string
similar_text — Calculate the similarity between two strings
soundex — Calculate the soundex key of a string
sprintf — Return a formatted string
sscanf — Parses input from a string according to a format
str_contains — Determine if a string contains a given substring
str_ends_with — Checks if a string ends with a given substring
str_getcsv — Parse a CSV string into an array
str_ireplace — Case-insensitive version of str_replace
str_pad — Pad a string to a certain length with another string
str_repeat — Repeat a string
str_replace — Replace all occurrences of the search string with the 
replacement string
str_rot13 — Perform the rot13 transform on a string
str_shuffle — Randomly shuffles a string
str_split — Convert a string to an array
str_starts_with — Checks if a string starts with a given substring
str_word_count — Return information about words used in a string
strcasecmp — Binary safe case-insensitive string comparison
strchr — Alias of strstr
strcmp — Binary safe string comparison
strcoll — Locale based string comparison
strcspn — Find length of initial segment not matching mask
strip_tags — Strip HTML and PHP tags from a string
stripcslashes — Un-quote string quoted with addcslashes
stripos — Find the position of the first occurrence of a 
case-insensitive substring in a string
stripslashes — Un-quotes a quoted string
stristr — Case-insensitive strstr
strlen — Get string length
strnatcasecmp — Case insensitive string comparisons using a "natural 
order" algorithm
strnatcmp — String comparisons using a "natural order" algorithm
strncasecmp — Binary safe case-insensitive string comparison of the 
first n characters
strncmp — Binary safe string comparison of the first n characters
strpbrk — Search a string for any of a set of characters
strpos — Find the position of the first occurrence of a substring in a 
string
strrchr — Find the last occurrence of a character in a string
strrev — Reverse a string
strripos — Find the position of the last occurrence of a 
case-insensitive substring in a string
strrpos — Find the position of the last occurrence of a substring in a 
string
strspn — Finds the length of the initial segment of a string consisting 
entirely of characters contained within a given mask
strstr — Find the first occurrence of a string
strtok — Tokenize string
strtolower — Make a string lowercase
strtoupper — Make a string uppercase
strtr — Translate characters or replace substrings
substr_compare — Binary safe comparison of two strings from an offset, 
up to length characters
substr_count — Count the number of substring occurrences
substr_replace — Replace text within a portion of a string
substr — Return part of a string
trim — Strip whitespace (or other characters) from the beginning and end 
of a string
ucfirst — Make a string's first character uppercase
ucwords — Uppercase the first character of each word in a string
vfprintf — Write a formatted string to a stream
vprintf — Output a formatted string
vsprintf — Return a formatted string
wordwrap — Wraps a string to a given number of characters

Appendix B.

mb_check_encoding — Check if strings are valid for the specified encoding
mb_chr — Get a specific character
mb_convert_case — Perform case folding on a string
mb_convert_encoding — Convert character encoding
mb_convert_kana — Convert "kana" one from another ("zen-kaku", 
"han-kaku" and more)
mb_convert_variables — Convert character code in variable(s)
mb_decode_mimeheader — Decode string in MIME header field
mb_decode_numericentity — Decode HTML numeric string reference to character
mb_detect_encoding — Detect character encoding
mb_detect_order — Set/Get character encoding detection order
mb_encode_mimeheader — Encode string for MIME header
mb_encode_numericentity — Encode character to HTML numeric string reference
mb_encoding_aliases — Get aliases of a known encoding type
mb_ereg_match — Regular expression match for multibyte string
mb_ereg_replace_callback — Perform a regular expression search and 
replace with multibyte support using a callback
mb_ereg_replace — Replace regular expression with multibyte support
mb_ereg_search_getpos — Returns start point for next regular expression 
match
mb_ereg_search_getregs — Retrieve the result from the last multibyte 
regular expression match
mb_ereg_search_init — Setup string and regular expression for a 
multibyte regular expression match
mb_ereg_search_pos — Returns position and length of a matched part of 
the multibyte regular expression for a predefined multibyte string
mb_ereg_search_regs — Returns the matched part of a multibyte regular 
expression
mb_ereg_search_setpos — Set start point of next regular expression match
mb_ereg_search — Multibyte regular expression match for predefined 
multibyte string
mb_ereg — Regular expression match with multibyte support
mb_eregi_replace — Replace regular expression with multibyte support 
ignoring case
mb_eregi — Regular expression match ignoring case with multibyte support
mb_get_info — Get internal settings of mbstring
mb_http_input — Detect HTTP input character encoding
mb_http_output — Set/Get HTTP output character encoding
mb_internal_encoding — Set/Get internal character encoding
mb_language — Set/Get current language
mb_list_encodings — Returns an array of all supported encodings
mb_ord — Get code point of character
mb_output_handler — Callback function converts character encoding in 
output buffer
mb_parse_str — Parse GET/POST/COOKIE data and set global variable
mb_preferred_mime_name — Get MIME charset string
mb_regex_encoding — Set/Get character encoding for multibyte regex
mb_regex_set_options — Set/Get the default options for mbregex functions
mb_scrub — Description
mb_send_mail — Send encoded mail
mb_split — Split multibyte string using regular expression
mb_str_split — Given a multibyte string, return an array of its characters
mb_strcut — Get part of string
mb_strimwidth — Get truncated string with specified width
mb_stripos — Finds position of first occurrence of a string within 
another, case insensitive
mb_stristr — Finds first occurrence of a string within another, case 
insensitive
mb_strlen — Get string length
mb_strpos — Find position of first occurrence of string in a string
mb_strrchr — Finds the last occurrence of a character in a string within 
another
mb_strrichr — Finds the last occurrence of a character in a string 
within another, case insensitive
mb_strripos — Finds position of last occurrence of a string within 
another, case insensitive
mb_strrpos — Find position of last occurrence of a string in a string
mb_strstr — Finds first occurrence of a string within another
mb_strtolower — Make a string lowercase
mb_strtoupper — Make a string uppercase
mb_strwidth — Return width of string
mb_substitute_character — Set/Get substitution character
mb_substr_count — Count the number of substring occurrences
mb_substr — Get part of string

Appendix C.

preg_filter — Perform a regular expression search and replace
preg_grep — Return array entries that match the pattern
preg_last_error_msg — Returns the error message of the last PCRE regex 
execution
preg_last_error — Returns the error code of the last PCRE regex execution
preg_match_all — Perform a global regular expression match
preg_match — Perform a regular expression match
preg_quote — Quote regular expression characters
preg_replace_callback_array — Perform a regular expression search and 
replace using callbacks
preg_replace_callback — Perform a regular expression search and replace 
using a callback
preg_replace — Perform a regular expression search and replace
preg_split — Split string by a regular expression


On 3/31/21 10:04 AM, Scott Baker wrote:
> Excellent research Mr. Bullock!
> 
> You bring up a very good point Grinnz came across 
> <https://www.reddit.com/r/perl/comments/hf3jlx/announcing_perl_7/fvwp1zt/?utm_source=reddit&utm_medium=web2x&context=3> 
> when we initially started implementing trim(). One of the main reasons 
> we want it done in core is because it's implemented so many times in 
> other places, and *often* implemented *incorrectly*. Putting it in core 
> we can implement it correctly and stop developers from having to 
> reinvent the wheel.
> 
> In your research did you find out if people implementing trim() as a sub 
> do it in-place or as a return value? That seems to be hotly debated 
> right now.
> 
> - Scott
> 
> On 3/31/21 6:01 AM, Ben Bullock wrote:
>> On Wed, 31 Mar 2021 at 08:59, <neilb@neilb.org> wrote:
>>
>>> Trimming is something that is frequently wanted, but though you
>>> think it a no-brainer, people don't always get it right. I found
>>> about 7500 distributions with an "inline trim". Here are some of the
>>> ones I found:
>>>
>>>      s/(^\s+)|(\s+$)//;
>>>      s/(^\s+)|\n//gm;
>>>      s/(^\s+|\s+$)//g;
>>>      s/(^\s+)|(\s+$)//g;
>>>      s/(^\s+|\s+$)//os;
>>>      s/(^\s+|\s+$)//gs;
>>>      s/^\s*//; s/\s+$//;
>>>
>>> Not all of those work.
>> There are any number of "gotcha" failures using regex trim on CPAN and
>> even within Perl core modules. Further, at least two core module
>> authors have duplicated "trim".
>>
>> A search for "trim string" on metacpan.org finds
>> https://metacpan.org/pod/POOF:
>>
>>                  # trim leading and trailing white spaces
>>                  $val =~ s/^\s*|\s*$//;
>>
>> This substitution will return true even if it matches nothing due to
>> the asterisk, and it will fail to remove trailing whitespace if there
>> is also leading whitespace due to the lack of a /g flag.
>>
>>      $ perl -e 'my $g="   x   ";$g=~s/^\s+|\s+$//;print "!$g!\n";'
>>      !x   !
>>
>> We can find many more examples of the "omitted /g" error on CPAN:
>>
>>      https://grep.metacpan.org/search?q=%5CQs%2F%5E%5Cs%2B%7C%5Cs%2B%24%2F%2F%5CE%5B%5Eg%5D*%24&qd=&qft=
>>
>> Using * instead of + after \s causes the substitution to always return
>> a true value even if nothing changed. This is also fairly common:
>>
>>      https://grep.metacpan.org/search?q=%5CQs%2F%5E%5Cs*%7C%5Cs*%24%2F%2F&qd=&qft=
>>
>> It says "80 distributions". I looked through all of them but I didn't
>> find anywhere where the return value of the substitution was being
>> used, perhaps because that bug would have been caught quickly, except
>> for here:
>>
>>      https://grep.metacpan.org/search?qci=&q=%5CQs%2F%5E%5Cs*%7C%5Cs*%24%2F%2F&qft=&qd=CohortExplorer
>>
>> where the programmer seems actually to be using the fact that it
>> always returns a true value.
>>
>> Furthermore, there are several examples in Perl core modules.
>>
>> Mistaken use of the /s flag (make . match \n) to mean /m (make ^ and $
>> match new lines) is seen in such modules as Pod::Simple, CPAN::Module,
>> Net::SMTP, Pod::Checker, Locale::Maketext, and I18N::LangTags.
>>
>> Mistaken use of s/^\s*// for trimming is seen in core modules like
>> Win32, bigint.pm, and CPAN::Complete.
>>
>> I also found one example of s/^\s+|\s+$// (omitted /g flag means it
>> fails to remove the end space from " this ") in the core modules, in
>> ExtUtils::CBuilder::Platform::Windows:
>>
>>      map {$a=$_;$a=~s/\t/ /g;$a=~s/^\s+|\s+$//;$a}
>>
>> Individual core modules which implement their own "trim" function
>> include ExtUtils::ParseXS (trim_whitespace) and TAP::Parser (_trim).
>>
> 
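
For anyone who wants to reproduce the failure modes described in the
quoted messages above, here is a minimal sketch covering the omitted /g
flag, * versus +, and /s versus /m. It also shows the two calling styles
Scott asks about. The sub names trimmed and trim_in_place are
hypothetical and are not taken from any proposal or core module.

```perl
use strict;
use warnings;

# 1. Omitted /g: the alternation matches once, so only the leading
#    whitespace is removed.
my $no_g = "   x   ";
$no_g =~ s/^\s+|\s+$//;

# 2. With /g, the trailing alternative also gets its turn.
my $with_g = "   x   ";
$with_g =~ s/^\s+|\s+$//g;

# 3. * versus +: with nothing to trim, \s* still matches the empty
#    string, so the substitution reports success anyway.
my $clean_a = "x";
my $star_ok = ($clean_a =~ s/^\s*|\s*$//)  ? 1 : 0;
my $clean_b = "x";
my $plus_ok = ($clean_b =~ s/^\s+|\s+$//g) ? 1 : 0;

# 4. /s versus /m: /s only changes what "." matches, so ^ still anchors
#    at the start of the whole string; /m anchors it at every line.
my $with_s = "  a\n  b\n";
$with_s =~ s/^\s+//gs;
my $with_m = "  a\n  b\n";
$with_m =~ s/^\s+//gm;

# 5. Return-value style versus in-place style (hypothetical names).
sub trimmed {                 # returns a trimmed copy
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}
sub trim_in_place {           # edits the caller's variable via @_ aliasing
    $_[0] =~ s/^\s+|\s+$//g;
    return;
}

my $orig   = "  padded  ";
my $copy   = trimmed($orig);  # $orig is left untouched
my $target = "  padded  ";
trim_in_place($target);       # $target itself is changed

for my $pair (
    [ 'no /g'     => $no_g    ], [ 'with /g'  => $with_g ],
    [ 'star true' => $star_ok ], [ 'plus true' => $plus_ok ],
    [ '/gs'       => $with_s  ], [ '/gm'      => $with_m ],
    [ 'orig'      => $orig    ], [ 'copy'     => $copy   ],
    [ 'in place'  => $target  ],
) {
    (my $shown = $pair->[1]) =~ s/\n/\\n/g;   # make newlines visible
    printf "%-10s [%s]\n", $pair->[0], $shown;
}
```

Note that the in-place style dies with a "Modification of a read-only
value" error if it is handed a string literal, which is one practical
argument in the return-value camp's favor.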
