Re: [perl #131685] Rename utf8::is_utf8() (and other functions)

Sawyer X
July 19, 2017 16:31
On 07/19/2017 08:58 AM, Tony Cook wrote:
> On Tue, Jul 18, 2017 at 10:53:53AM +1000, Tony Cook wrote:
>> On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
>>> [Top-posted]
>>> I have mixed thoughts about this.
>>> I'm sympathetic to both considerations: Having properly-named functions
>>> to reduce confusion for future developers (we hope to have some, right?)
>>> while not introducing additional cognitive load for existing developers.
>>> A few ways to make such a situation easier:
>>> * Document utf8::is_utf8() to prevent this confusion: This is by far the
>>> first thing that should be done. I have double checked the wording for
>>> utf8::is_utf8() from my blead (978b185):
>>>         (Since Perl 5.8.1) Test whether $string is marked internally as
>>>         encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
>>> This is confusing, to say the least. "Marked internally" are the words
>>> core hackers are looking for and recognize, but "UTF-8" is what non-core
>>> hackers (those without the cognitive bias in core terms) see and
>>> understand. If we head over to Encode::is_utf8() we see:
>>>     [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
>>>     If /CHECK/ is true, also checks whether /STRING/ contains
>>>     well-formed UTF-8. Returns true if successful, false otherwise.
>>>     As of Perl 5.8.1, utf8 also has the
>>>     |utf8::is_utf8| function.
>>> I like this wording better for several reasons: It is under the title
>>> "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
>>> that it checks for well-formed UTF-8 only if that flag is true. There
>>> are improvements to be made here too. We can note what the flag means
>>> (subtle, complicated, bike-shed-able) or at the very least add a nice
>>> "this isn't the flag you're looking for" warning. We can also suggest
>>> when to use and when not to use the function (otherwise it's left to the
>>> reader, who can easily get it wrong, which is why we're here).
>> utf8::is_utf8() doesn't accept the second parameter and does no
>> validity checks (we have utf8::valid() for that), despite the note in
>>> If the documentation on both were better, then we could possibly have left
>>> this as unfortunate naming errors we're carrying with us (along with
>>> "wantarray" for noting whether the context is scalar, list, or void).
>> ...
>>> Overall, I'm still undecided. Maybe we could start with improving the
>>> existing documentation?
>> Perhaps something like:
>> =item * C<$flag = utf8::is_utf8($string)>
>> (Since Perl 5.8.1) Test whether I<$string> is marked internally as
>> encoded in UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.
>> Typically only necessary for debugging.
>> If you need to force Unicode semantics for code that needs to be
>> compatible with perls older than 5.12, call C<utf8::upgrade($string)>
>> unconditionally.
>> Using this flag to decide whether a string should be treated as
>> already encoded bytes or characters is wrong; this should be decided
>> as part of the interface of your function.
>> If you're accepting bytes:
>>   utf8::downgrade($string); # throws an exception if code point over 0xFF
>>   utf8::downgrade($string, 1) # our own error handling
>>     or die "\$string must be representable as bytes";
>> or if you're accepting characters and need encoded bytes:
>>   utf8::encode($string); # unconditionally
>> The only exception is if you're dealing with filenames, since perl
>> uses the internal representation of the string for system calls.
>> <<
>> Are there any other cases someone might be tempted to call
>> utf8::is_utf8()?
> Thinking about it further, I'm pretty sure this doesn't all belong
> here.
> L<perlunifaq/What is "the UTF8 flag"?> provides a good description of
> the flag is_utf8() returns, and the whole of perlunifaq covers some of
> the things the above tries to cover.
> perlunicook largely works at a higher level than the functions in
> utf8::* work at.

+1 on the suggested text.
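
To make the advice in the suggested text concrete, here is a minimal sketch of "decide bytes or characters as part of the interface". The helper names takes_bytes() and chars_to_utf8_bytes() are hypothetical and only for illustration; the utf8::* calls are the ones named in the quoted text. Neither helper consults utf8::is_utf8().

```perl
use strict;
use warnings;

# Hypothetical helper: its interface says "this takes bytes".
sub takes_bytes {
    my ($string) = @_;    # work on a copy, not the caller's scalar
    # With a true second argument, utf8::downgrade() returns false
    # instead of dying when a code point is over 0xFF.
    utf8::downgrade($string, 1)
        or die "\$string must be representable as bytes\n";
    return $string;
}

# Hypothetical helper: its interface says "this takes characters"
# and it needs UTF-8 encoded bytes.
sub chars_to_utf8_bytes {
    my ($string) = @_;
    utf8::encode($string);    # unconditionally, in place on our copy
    return $string;
}

my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);     # same characters, different internal storage

my $bytes   = takes_bytes($upgraded);            # succeeds: all code points <= 0xFF
my $encoded = chars_to_utf8_bytes("caf\xE9");    # e-acute becomes two bytes
my $ok      = eval { takes_bytes("\x{263A}"); 1 };    # dies: code point over 0xFF

printf "bytes: %d, encoded: %d, wide accepted: %s\n",
    length($bytes), length($encoded), $ok ? "yes" : "no";
```

The point of the sketch is that the caller's internal representation does not matter: an upgraded string that fits in bytes is accepted by takes_bytes(), and a string with a wide character is rejected.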

I think this addition is useful, even if it is also covered in other
documents. We could also link to those documents for further learning.
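
As a possible illustration for such documentation (a sketch only, with made-up variable names), the following shows why the flag says nothing about the string's content: two strings holding the same characters compare equal while utf8::is_utf8() reports different values for them.

```perl
use strict;
use warnings;

my $native = "caf\xE9";       # literal without "use utf8": not flagged
my $wide_storage = $native;
utf8::upgrade($wide_storage); # same characters, stored internally as UTF-8

my $same_content = $native eq $wide_storage;
my $same_length  = length($native) == length($wide_storage);
my $flag_native  = utf8::is_utf8($native);
my $flag_wide    = utf8::is_utf8($wide_storage);

printf "same content: %s\n", $same_content ? "yes" : "no";
printf "same length:  %s\n", $same_length  ? "yes" : "no";
printf "is_utf8:      native=%s upgraded=%s\n",
    $flag_native ? 1 : 0, $flag_wide ? 1 : 0;
```

Code that branches on the flag would treat these two equal strings differently, which is the mistake the suggested text warns about.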

