develooper Front page | perl.perl5.porters | Postings from July 2017

Re: [perl #131685] Rename utf8::is_utf8() (and other functions)

Tony Cook
July 18, 2017 00:54
On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> [Top-posted]
> I have mixed thoughts about this.
> I'm sympathetic to both considerations: Having properly-named functions
> to reduce confusion for future developers (we hope to have some, right?)
> but not introduce additional cognitive load for existing developers.
> A few ways to make such a situation easier:
> * Document utf8::is_utf8() to prevent this confusion: This is by far the
> first thing that should be done. I have double checked the wording for
> utf8::is_utf8() from my blead (978b185):
>         (Since Perl 5.8.1) Test whether $string is marked internally as
>         encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
> This is confusing, to say the least. "Marked internally" are the words
> core hackers are looking for and recognize, but "UTF-8" is what non-core
> hackers (those without the cognitive bias in core terms) see and
> understand. If we head over to Encode::is_utf8() we see:
>     [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
>     If /CHECK/ is true, also checks whether /STRING/ contains
>     well-formed UTF-8. Returns true if successful, false otherwise.
>     As of Perl 5.8.1, utf8 also has the
>     utf8::is_utf8 function.
> I like this wording better for several reasons: It is under the title
> "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
> that it checks for well-formed UTF-8 only if that flag is true. There
> are improvements to be made here too. We can note what the flag means
> (subtle, complicated, bike-shed-able) or at the very least add a nice
> "this isn't the flag you're looking for" warning. We can also suggest
> when to use and when not to use the function (otherwise it's left to the
> reader, who can easily get it wrong, which is why we're here).

utf8::is_utf8() doesn't accept the second parameter and does no
validity checks (we have utf8::valid() for that), despite the note in

> If the documentation on both was better, then we could have possibly left
> this as unfortunate naming errors we're carrying with us (along with
> "wantarray" for noting whether the context is scalar, list, or void).
> Overall, I'm still undecided. Maybe we could start with improving the
> existing documentation?

Perhaps something like:


=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as
encoded in UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.
Typically only necessary for debugging.

If you need to force Unicode semantics for code that needs to be
compatible with perls older than 5.12, call C<utf8::upgrade($string)>.

Using this flag to decide whether a string should be treated as
already encoded bytes or as characters is wrong; that should be decided
as part of the interface of your function.

If you're accepting bytes:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes:

  utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl
uses the internal representation of the string for system calls.
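
As a rough illustration of the advice above (a sketch only, not part of
the proposed POD): the helper names byte_length() and utf8_octets() are
hypothetical, invented for this example; only the utf8:: functions are
real. Each helper fixes in its interface whether it takes bytes or
characters, so the flag never needs to be consulted:

```perl
use strict;
use warnings;

# Hypothetical helper whose interface says "I take bytes".
sub byte_length {
    my ($bytes) = @_;                 # work on a copy of the argument
    utf8::downgrade($bytes, 1)        # false if any code point is over 0xFF
        or die "byte_length() needs a byte string\n";
    return length $bytes;
}

# Hypothetical helper whose interface says "I take characters".
sub utf8_octets {
    my ($chars) = @_;                 # work on a copy of the argument
    utf8::encode($chars);             # unconditionally, flag on or off
    return $chars;
}

my $native   = "caf\xE9";             # four characters, flag off
my $upgraded = $native;
utf8::upgrade($upgraded);             # same four characters, flag on

print "same string\n" if $native eq $upgraded;
printf "flags: %d %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
printf "byte lengths: %d %d\n",
    byte_length($native), byte_length($upgraded);
print "same encoding\n"
    if utf8_octets($native) eq utf8_octets($upgraded);
```

Both copies hold the same four characters, so they compare equal and
give the same results from both helpers; only utf8::is_utf8() tells
them apart, which is why it is mostly a debugging aid.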


Are there any other cases someone might be tempted to call

