Re: RFC: optional parameter to Unicode::UCD::num()
From: Karl Williamson
Date: December 27, 2017 21:38
Subject: Re: RFC: optional parameter to Unicode::UCD::num()
Message ID: 60fef0b4-cc5f-f438-c752-9ee509eff952@khwilliamson.com

On 05/23/2017 09:11 PM, Karl Williamson wrote:
> On 05/22/2017 05:37 PM, Karl Williamson wrote:
>> On 05/22/2017 03:30 AM, Dave Mitchell wrote:
>>> On Fri, May 19, 2017 at 12:11:31PM -0600, Karl Williamson wrote:
>>>> I propose to add an optional parameter to calling the num() function.
>>>>
>>>> Recall that num() is used to get the numeric value of the input string
>>>> parameter, or undef if none. Only if the entire string is a valid number
>>>> is something other than undef returned. If you call it with a single
>>>> character that means 1/2, then 0.5 is returned, but for strings of more
>>>> than a single character, the string must consist entirely of decimal
>>>> digits used in a positional notation, all from the same script, for
>>>> something other than undef to be returned.
>>>>
>>>> Thus, it can be used to defeat spoofing, where something that appears to
>>>> be a digit, but is really from a different script, makes the number
>>>> appear to be a different value than it actually is. For example, someone
>>>> could say, "I'll pay you $৪০", using the Bengali characters for 4 and 0.
>>>> num() detects this and returns 40.
>>>>
>>>> Some applications would prefer num() to work more like atoi(), so that
>>>> if the first segment of the input string is all digits, they would want
>>>> that substring's numeric value. This optional parameter would help them.
>>>> It would be a reference to a scalar, and num would set it to how many
>>>> characters in a row at the beginning of the string form a valid number.
>>>> 0 would be returned if the string begins with a character that has no
>>>> numeric value. If num returns a numeric value, this parameter would be
>>>> set to the length of the input string. It would be the case that if you
>>>> call num($string, \$len), then substr($string, 0, $len) would be numeric
>>>> for $len > 0.
>>>
>>> +1.
>>>
>>> The only thing I didn't quite understand from your description was
>>> "0 would be returned" - is that the return value of Unicode::UCD::num()
>>> or the value it sets $$len to? Or both?
>>
>> It was carelessly worded. $$len would be set to 0, as your example
>> just below correctly shows.
>>>
>>> So IIUC:
>>>
>>>
>>> num("ABC") returns undef
>>> num("ABC",\$l) returns 0, sets $l to 0
>>>
>>> num("12ABC") returns undef
>>> num("12ABC",\$l) returns 12, sets $l to 2
>>>
>>> num("12") returns 12
>>> num("12",\$l) returns 12, sets $l to 2
>>>
>>> An alternative API might be to give num() an optional boolean second arg;
>>> if true, it enables partial matches. And in that case, num() returns a
>>> second value, which is the matched length.
>>>
>>> Which feels a little more Perlish.
>>>
>>
>> I'd be open to that; what do others think?
>>
>
>
> I was leaning towards this latter API until I started to think about
> implementing it. The problem is that the caller will want to know where
> the first non-included character is in the string. One might think that
> the log10 of the result could be used (but who wants to calculate that),
> and it isn't true anyway. For example, the single character 兆 would
> return 1000000000000. Surprisingly, there is a 2nd character with that
> value, 𖭡, though I don't have a font for it on my system.
>
The original proposed implementation is now in blead as
56da55346bed3bc3537f276968054d464175b71e
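
For readers who want the atoi()-like behaviour discussed above, here is a minimal sketch of how a caller might use the new parameter. It assumes a Unicode::UCD recent enough to include this change. The helper name leading_num() is hypothetical and not part of the module. Because the thread leaves open what num() itself returns for a partially numeric string, the sketch relies only on the length and calls num() again on the valid prefix.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# leading_num() is a hypothetical helper, not part of Unicode::UCD.
# It returns (value, length) for the longest numeric prefix of $string,
# or (undef, 0) if the string does not start with a numeric character.
sub leading_num {
    my ($string) = @_;

    my $len;                                   # set by num() via the reference
    my $value = num($string, \$len);

    return ($value, $len) if defined $value;   # entire string is numeric
    return (undef, 0)     unless $len;         # no numeric prefix at all

    # Re-run num() on just the valid prefix, so the result does not depend
    # on what num() returns for a partially numeric string.
    return (num(substr($string, 0, $len)), $len);
}

for my $input ("12", "12ABC", "ABC") {
    my ($value, $len) = leading_num($input);
    printf "%-6s value=%s length=%d\n",
        $input, defined $value ? $value : 'undef', $len;
}
```

Returning the length alongside the value lets the caller find the first non-included character directly with substr($string, $len), which is the problem the log10 idea could not solve.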