On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> [Top-posted]
>
> I have mixed thoughts about this.
>
> I'm sympathetic to both considerations: Having properly-named functions
> to reduce confusion for future developers (we hope to have some, right?)
> but not introduce additional cognitive load for existing developers.
>
> A few ways to make such a situation easier:
>
> * Document utf8::is_utf8() to prevent this confusion: This is by far the
> first thing that should be done. I have double checked the wording for
> utf8::is_utf8() from my blead (978b185):
>
>     (Since Perl 5.8.1) Test whether $string is marked internally as
>     encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
>
> This is confusing, to say the least. "Marked internally" is the words
> core hackers are looking for and recognize, but "UTF-8" is what non-core
> hackers (those without the cognitive bias in core terms) see and
> understand. If we head over to Encode::is_utf8() we see:
>
>     [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
>     If /CHECK/ is true, also checks whether /STRING/ contains
>     well-formed UTF-8. Returns true if successful, false otherwise.
>
>     As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
>     |utf8::is_utf8| function.
>
> I like this wording better for several reasons: It is under the title
> "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
> that it checks for well-formed UTF-8 only if that flag is true. There
> are improvements to be made here too. We can note what the flag means
> (subtle, complicated, bike-shed-able) or at the very least add a nice
> "this isn't the flag you're looking for" warning. We can also suggest
> when to use and when not to use the function (otherwise it's left to the
> reader, who can easily get it wrong, which is why we're here).

utf8::is_utf8() doesn't accept the second parameter and does no validity
checks (we have utf8::valid() for that), despite the note in utf8.pm.
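
To make the "marked internally" point concrete, here is a minimal sketch (the variable names are only for illustration, and it assumes a perl recent enough to have the utf8:: functions built in). It stores the same one-character string in both internal representations and shows that only the flag differs, not the string:

```perl
use strict;
use warnings;

# The same one-character string (U+00E9) in two internal representations.
my $bytes_repr = "\xE9";
my $utf8_repr  = "\xE9";
utf8::downgrade($bytes_repr);   # stored internally as a single byte, flag off
utf8::upgrade($utf8_repr);      # stored internally as UTF-8, flag on

# At the Perl level the two strings are indistinguishable...
my $same = ($bytes_repr eq $utf8_repr)
        && length($bytes_repr) == length($utf8_repr);

# ...and only the internal flag differs.
my $flag_off = utf8::is_utf8($bytes_repr);
my $flag_on  = utf8::is_utf8($utf8_repr);

# utf8::valid() is the separate consistency check; both strings pass it.
my $both_valid = utf8::valid($bytes_repr) && utf8::valid($utf8_repr);

print "same string:  ", ($same       ? "yes" : "no"), "\n";
print "flag (bytes): ", ($flag_off   ? "on"  : "off"), "\n";
print "flag (utf8):  ", ($flag_on    ? "on"  : "off"), "\n";
print "both valid:   ", ($both_valid ? "yes" : "no"), "\n";
```

So the flag says how perl happens to store the string, not whether the contents are "UTF-8" in the sense a non-core reader would expect.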

> If the document on both was better, then we could have possibly left
> this as unfortunate naming errors we're carrying with us (along with
> "wantarray" for noting whether the context is scalar, list, or void).
...
> Overall, I'm still undecided. Maybe we could start with improving the
> existing documentation?

Perhaps something like:

>>
=item * C<$flag = utf8::is_utf8($string)>

(Since Perl 5.8.1) Test whether I<$string> is marked internally as encoded
in UTF-8. Functionally the same as C<Encode::is_utf8($string)>.

Typically only necessary for debugging. If you need to force Unicode
semantics for code that needs to be compatible with perls older than
5.12, call C<utf8::upgrade($string)> unconditionally.

Using this flag to decide whether a string should be treated as already
encoded bytes or characters is wrong; this should be decided as part of
the interface of your function.

If you're accepting bytes:

  utf8::downgrade($string); # throws an exception if code point over 0xFF

  utf8::downgrade($string, 1) # our own error handling
    or die "\$string must be representable as bytes";

or if you're accepting characters and need encoded bytes:

  utf8::encode($string); # unconditionally

The only exception is if you're dealing with filenames, since perl uses
the internal representation of the string for system calls.
<<

Are there any other cases someone might be tempted to call utf8::is_utf8()?

Tony