develooper Front page | perl.perl5.porters | Postings from December 2017

Re: RFC: optional parameter to Unicode::UCD::num()

Karl Williamson
December 27, 2017 21:38
On 05/23/2017 09:11 PM, Karl Williamson wrote:
> On 05/22/2017 05:37 PM, Karl Williamson wrote:
>> On 05/22/2017 03:30 AM, Dave Mitchell wrote:
>>> On Fri, May 19, 2017 at 12:11:31PM -0600, Karl Williamson wrote:
>>>> I propose to add an optional parameter to calling the num() function.
>>>> Recall that num() is used to get the numeric value of the input string
>>>> parameter, or undef if none.  Only if the entire string is a valid
>>>> number is something other than undef returned.  If you call it with a
>>>> single character that means 1/2, then 0.5 is returned, but for strings
>>>> of more than a single character, the string must consist entirely of
>>>> decimal digits used in a positional notation, all from the same
>>>> script, for something other than undef to be returned.
>>>> Thus, it can be used to defeat spoofing, where something that appears
>>>> to be a digit, but is really from a different script, makes the number
>>>> appear to be a different value than it actually is.  For example,
>>>> someone could say, I'll pay you $৪০, using the Bengali characters for
>>>> 4 and 0.  num() detects this and returns 40.
>>>> Some applications would prefer num() to work more like atoi(), so that
>>>> if the first segment of the input string is all digits, they would
>>>> want that substring's numeric value.  This optional parameter would
>>>> help them.  It would be a reference to a scalar, and num() would set it
>>>> to how many characters in a row at the beginning of the string form a
>>>> valid number.  0 would be returned if the string begins with a
>>>> character that has no numeric value.  If num() returns a numeric value,
>>>> this parameter would be set to the length of the input string.  It
>>>> would be the case that if you call num($string, \$len), then
>>>> substr($string, 0, $len) would be numeric for $len > 0.
>>> +1.
>>> The only thing I didn't quite understand from your description was
>>> "0 would be returned" - is that the return value of Unicode::UCD::num()
>>> or the value it sets $$len to? Or both?
>> It was carelessly worded.  $$len would be set to 0, as your example
>> just below correctly shows.
>>> So IIUC:
>>>      num("ABC")       returns undef
>>>      num("ABC",\$l)   returns 0, sets $l to 0
>>>      num("12ABC")     returns undef
>>>      num("12ABC",\$l) returns 12, sets $l to 2
>>>      num("12")        returns 12
>>>      num("12",\$l)    returns 12, sets $l to 2
>>> An alternative API might be to give num() an optional boolean second
>>> arg; if true, it enables partial matches. And in that case, num()
>>> returns a second value, which is the matched length.
>>> Which feels a little more Perlish.
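
For illustration, the partial-match behaviour discussed above could be sketched in pure Perl roughly as follows. This is not the Unicode::UCD implementation: leading_num() is a hypothetical helper that returns the value and the matched length as a list, in the style of the alternative API. It assumes that decimal digits (General_Category Nd) are encoded in contiguous runs of ten, so "same digit set" is approximated by comparing each digit's implied zero code point, which is stricter than a same-script check. It also assumes (undef, 0) when the string does not start with a digit, a case the thread leaves open.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# Hypothetical helper, not part of Unicode::UCD.
# Returns (value, length) for the longest leading run of decimal digits
# that all belong to the same run of ten digits, or (undef, 0) if none.
sub leading_num {
    my ($string) = @_;
    my ($value, $len, $zero) = (0, 0, undef);

    for my $ch (split //, $string) {
        # Stop at the first character that is not a decimal digit.
        last unless $ch =~ /\A\p{Nd}\z/;

        my $digit     = num($ch);          # 0..9 for a decimal digit
        my $this_zero = ord($ch) - $digit; # code point of this set's zero

        # Assumption: digits of one set share the same zero code point.
        $zero //= $this_zero;
        last if $this_zero != $zero;

        $value = $value * 10 + $digit;
        $len++;
    }

    return $len ? ($value, $len) : (undef, 0);
}

my ($value, $len) = leading_num("12ABC");
print "value=$value length=$len\n";
```

Under these assumptions, a string that mixes ASCII and Bengali digits stops at the first digit from the other set, so the caller can see where the valid prefix ends.
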
>> I'd be open to that; what do others think?
> I was leaning towards this latter API until I started to think about 
> implementing it.  The problem is that the caller will want to know where 
> the first non-included character is in the string.  One might think that 
> the log10 of the result could be used (but who wants to calculate that), 
> and it isn't true anyway.  For example, the single character 兆 would 
> return 1000000000000.  Surprisingly there is a 2nd character with that
> value, 𖭡, though I don't have a font for it on my system.
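
A small sketch of why the magnitude of the result cannot stand in for the matched length: it compares the character count of the input with the digit count implied by the returned value, using the one-argument num() that already exists. The variable names are illustrative only, and the exact value returned for 兆 depends on the Unicode version shipped with the perl in use, so no particular output is claimed here.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

my $string = "\x{5146}";      # the single character 兆 mentioned above
my $value  = num($string);    # numeric value Unicode assigns to it
die "no numeric value" unless defined $value;

# What a log10-style estimate would give: the number of decimal digits
# needed to write the value out.
my $digits_from_value = length(sprintf "%.0f", $value);

# What the caller actually needs: how many characters were consumed.
my $chars_consumed = length $string;

print "characters consumed:  $chars_consumed\n";
print "digits in the result: $digits_from_value\n";
```

The two numbers disagree for any single character whose value is 10 or more, which is the case the message describes.
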

The original proposed implementation is now in blead as 
