Re: [perl #131685] Rename utf8::is_utf8() (and other functions)

From: H.Merijn Brand
Date: July 18, 2017 07:04
Subject: Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Message ID: 20170718090406.0b5ac0c0@pc09.procura.nl

On Tue, 18 Jul 2017 10:53:53 +1000, Tony Cook <tony@develop-help.com>
wrote:

> On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> > [Top-posted]
> > 
> > I have mixed thoughts about this.
> > 
> > I'm sympathetic to both considerations: Having properly-named functions
> > to reduce confusion for future developers (we hope to have some, right?)
> > but not introduce additional cognitive load for existing developers.
> > 
> > A few ways to make such a situation easier:
> > 
> > * Document utf8::is_utf8() to prevent this confusion: This is by far the
> > first thing that should be done. I have double checked the wording for
> > utf8::is_utf8() from my blead (978b185):
> > 
> >         (Since Perl 5.8.1) Test whether $string is marked internally as
> >         encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
> > 
> > This is confusing, to say the least. "Marked internally" are the words
> > core hackers are looking for and recognize, but "UTF-8" is what non-core
> > hackers (those without the cognitive bias in core terms) see and
> > understand. If we head over to Encode::is_utf8() we see:
> > 
> >     [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
> >     If /CHECK/ is true, also checks whether /STRING/ contains
> >     well-formed UTF-8. Returns true if successful, false otherwise.
> > 
> >     As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
> >     |utf8::is_utf8| function.
> > 
> > I like this wording better for several reasons: It is under the title
> > "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
> > that it checks for well-formed UTF-8 only if that flag is true. There
> > are improvements to be made here too. We can note what the flag means
> > (subtle, complicated, bike-shed-able) or at the very least add a nice
> > "this isn't the flag you're looking for" warning. We can also suggest
> > when to use and when not to use the function (otherwise it's left to the
> > reader, who can easily get it wrong, which is why we're here).  
> 
> utf8::is_utf8() doesn't accept the second parameter and does no
> validity checks (we have utf8::valid() for that), despite the note in
> utf8.pm.
> 
> > If the documentation for both were better, then we could possibly have
> > left these as unfortunate naming errors we're carrying with us (along with
> > "wantarray" for noting whether the context is scalar, list, or void).  
> ...
> > Overall, I'm still undecided. Maybe we could start with improving the
> > existing documentation?  
> 
> Perhaps something like:
> 
> >>  
> 
> =item * C<$flag = utf8::is_utf8($string)>
> 
> (Since Perl 5.8.1) Test whether I<$string> is marked internally as
> encoded in UTF-8.  Functionally the same as C<Encode::is_utf8($string)>.
> Typically only necessary for debugging.
> 
> If you need to force Unicode semantics for code that needs to be
> compatible with perls older than 5.12, call C<utf8::upgrade($string)>
> unconditionally.
> 
> Using this flag to decide whether a string should be treated as
> already encoded bytes or as characters is wrong; that should be decided
> as part of the interface of your function.
> 
> If you're accepting bytes:
> 
>   utf8::downgrade($string); # throws an exception if code point over 0xFF
> 
>   utf8::downgrade($string, 1) # our own error handling
>     or die "\$string must be representable as bytes";
> 
> or if you're accepting characters and need encoded bytes:
> 
>   utf8::encode($string); # unconditionally
> 
> The only exception is if you're dealing with filenames, since perl
> uses the internal representation of the string for system calls.
> 
> <<
> 
> Are there any other cases someone might be tempted to call
> utf8::is_utf8()?
> 
> Tony

I like this. What I miss here is a small example of how to reliably
prevent double encoding/decoding, as I think that is what this
function is most often (erroneously) used for.
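
A minimal sketch of what such an example might look like, assuming the
function's interface fixes whether it takes bytes or characters, so that
each string is decoded once on input and encoded once on output. The
helper names bytes_to_chars() and chars_to_bytes() are hypothetical and
only for illustration; Encode's strict 'UTF-8' with FB_CROAK is one
possible choice, not the only one.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helpers; the names are illustrative, not from any module.

# Input boundary: the caller promises UTF-8 encoded bytes. Decode once.
sub bytes_to_chars {
    my ($bytes) = @_;    # copy, since decode() with a CHECK may modify its source
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# Output boundary: the caller promises characters. Encode once.
sub chars_to_bytes {
    my ($chars) = @_;    # copy, for the same reason
    return encode('UTF-8', $chars, Encode::FB_CROAK);
}

my $bytes = "caf\xC3\xA9";             # 5 bytes: UTF-8 encoding of "caf" + U+00E9
my $chars = bytes_to_chars($bytes);    # 4 characters
my $out   = chars_to_bytes($chars);    # the same 5 bytes again

# Decoding a second time is rejected for this input, because a lone
# \xE9 is not well-formed UTF-8.
my $second_decode_ok = eval { bytes_to_chars($chars); 1 } ? 1 : 0;

# The flag itself says nothing about bytes vs. characters: these two
# strings are equal, yet only one has the flag set.
my $plain    = "caf\xE9";
my $upgraded = $plain;
utf8::upgrade($upgraded);
my $same_string = ($plain eq $upgraded) ? 1 : 0;
my $flags_agree = (utf8::is_utf8($plain) ? 1 : 0) == (utf8::is_utf8($upgraded) ? 1 : 0);
```

Strict decoding only catches the cases where the already decoded string
is not valid UTF-8; plain ASCII, or characters that happen to form valid
UTF-8 sequences, pass silently. So the real guarantee still has to come
from the interface, not from a run-time check.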

-- 
H.Merijn Brand  http://tux.nl   Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.27   porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/        http://www.test-smoke.org/
http://qa.perl.org   http://www.goldmark.org/jeff/stupid-disclaimers/
