
Re: [perl #131685] Rename utf8::is_utf8() (and other functions)

From: Tony Cook
Date: July 18, 2017 00:54
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Message ID: 20170718005352.GA11348@mars.tony.develop-help.com

On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> [Top-posted]
> 
> I have mixed thoughts about this.
> 
> I'm sympathetic to both considerations: Having properly-named functions
> to reduce confusion for future developers (we hope to have some, right?)
> but not introduce additional cognitive load for existing developers.
> 
> A few ways to make such a situation easier:
> 
> * Document utf8::is_utf8() to prevent this confusion: This is by far the
> first thing that should be done. I have double checked the wording for
> utf8::is_utf8() from my blead (978b185):
> 
>         (Since Perl 5.8.1) Test whether $string is marked internally as
>         encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
> 
> This is confusing, to say the least. "Marked internally" are the words
> core hackers are looking for and recognize, but "UTF-8" is what non-core
> hackers (those without the cognitive bias in core terms) see and
> understand. If we head over to Encode::is_utf8() we see:
> 
>     [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
>     If /CHECK/ is true, also checks whether /STRING/ contains
>     well-formed UTF-8. Returns true if successful, false otherwise.
> 
>     As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
>     |utf8::is_utf8| function.
> 
> I like this wording better for several reasons: It is under the title
> "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
> that it checks for well-formed UTF-8 only if that flag is true. There
> are improvements to be made here too. We can note what the flag means
> (subtle, complicated, bike-shed-able) or at the very least add a nice
> "this isn't the flag you're looking for" warning. We can also suggest
> when to use and when not to use the function (otherwise it's left to the
> reader, who can easily get it wrong, which is why we're here).

utf8::is_utf8() doesn't accept the second parameter and does no
validity checks (we have utf8::valid() for that), despite the note in
utf8.pm.
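
A minimal sketch of that difference, assuming nothing beyond the core
utf8:: functions (the variable names are only for illustration).
utf8::is_utf8() reports the internal flag and nothing else, so two
strings that compare equal can give different answers:

```perl
use strict;
use warnings;

# "\xE9" is one character (e-acute) that fits in a single byte,
# so perl normally stores this literal without the UTF8 flag.
my $native = "caf\xE9";

# Upgrading a copy changes only the internal storage.
my $upgraded = $native;
utf8::upgrade($upgraded);

# Same characters, same length, still equal as strings...
print "equal\n" if $native eq $upgraded;
print "length: ", length($upgraded), "\n";

# ...but the flag differs, which is all utf8::is_utf8() looks at.
print "native flag:   ", (utf8::is_utf8($native)   ? 1 : 0), "\n";
print "upgraded flag: ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";

# utf8::valid() is the separate internal-consistency check.
print "valid: ", (utf8::valid($upgraded) ? 1 : 0), "\n";
```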

> If the documentation for both were better, we could possibly have left
> this as unfortunate naming errors we're carrying with us (along with
> "wantarray" for noting whether the context is scalar, list, or void).
...
> Overall, I'm still undecided. Maybe we could start with improving the
> existing documentation?

Perhaps something like:

>>

=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as
encoded in UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.
Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be
compatible with perls older than 5.12, call C<utf8::upgrade($string)>
unconditionally.

Using this flag to decide whether a string should be treated as
already encoded bytes or as characters is wrong; that should be decided
as part of the interface of your function.

If you're accepting bytes:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes:

  utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl
uses the internal representation of the string for system calls.

<<
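
To see how the proposed text would play out in practice, here is a
rough sketch. The helper names as_bytes() and as_utf8_bytes() are
hypothetical, made up for this example; only the utf8:: calls are real.
The point is that each function fixes its contract up front and never
consults the flag:

```perl
use strict;
use warnings;

# Hypothetical helper: the interface says "bytes in".
# Returns a downgraded copy, or dies with our own message.
sub as_bytes {
    my ($string) = @_;        # work on a copy, not the caller's string
    utf8::downgrade($string, 1)
        or die "\$string must be representable as bytes\n";
    return $string;
}

# Hypothetical helper: "characters in, UTF-8 encoded bytes out".
sub as_utf8_bytes {
    my ($string) = @_;
    utf8::encode($string);    # unconditional, whatever the flag says
    return $string;
}

my $chars   = "caf\xE9";      # 4 characters, stored unflagged
my $flagged = $chars;
utf8::upgrade($flagged);      # same characters, different storage

# Both internal forms encode to the same 5 bytes.
print length(as_utf8_bytes($chars)),   "\n";
print length(as_utf8_bytes($flagged)), "\n";

# Both are acceptable as bytes: no code point exceeds 0xFF.
print length(as_bytes($flagged)), "\n";

# A code point above 0xFF cannot be represented as bytes.
my $ok = eval { as_bytes("\x{263A}"); 1 };
print "rejected: $@" unless $ok;

# Forcing Unicode semantics: upgrade unconditionally, don't test first.
my $word = "\xE9";
utf8::upgrade($word);
print "matches \\w\n" if $word =~ /\w/;
```

Neither helper branches on utf8::is_utf8(), so the caller gets the same
result whichever internal form the string happened to arrive in.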

Are there any other cases someone might be tempted to call
utf8::is_utf8()?

Tony
