On Tue, Jul 18, 2017 at 10:53:53AM +1000, Tony Cook wrote:
> On Mon, Jul 17, 2017 at 10:46:59AM +0200, Sawyer X wrote:
> > [Top-posted]
> >
> > I have mixed thoughts about this.
> >
> > I'm sympathetic to both considerations: Having properly-named functions
> > to reduce confusion for future developers (we hope to have some, right?)
> > but not introduce additional cognitive load for existing developers.
> >
> > A few ways to make such a situation easier:
> >
> > * Document utf8::is_utf8() to prevent this confusion: This is by far the
> > first thing that should be done. I have double checked the wording for
> > utf8::is_utf8() from my blead (978b185):
> >
> >     (Since Perl 5.8.1) Test whether $string is marked internally as
> >     encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
> >
> > This is confusing, to say the least. "Marked internally" is the words
> > core hackers are looking for and recognize, but "UTF-8" is what non-core
> > hackers (those without the cognitive bias in core terms) see and
> > understand. If we head over to Encode::is_utf8() we see:
> >
> >     [INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
> >     If /CHECK/ is true, also checks whether /STRING/ contains
> >     well-formed UTF-8. Returns true if successful, false otherwise.
> >
> >     As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
> >     |utf8::is_utf8| function.
> >
> > I like this wording better for several reasons: It is under the title
> > "Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
> > that it checks for well-formed UTF-8 only if that flag is true. There
> > are improvements to be made here too. We can note what the flag means
> > (subtle, complicated, bike-shed-able) or at the very least add a nice
> > "this isn't the flag you're looking for" warning. We can also suggest
> > when to use and when not to use the function (otherwise it's left to the
> > reader, who can easily get it wrong, which is why we're here).
>
> utf8::is_utf8() doesn't accept the second parameter and does no
> validity checks (we have utf8::valid() for that), despite the note in
> utf8.pm.
>
> > If the document on both was better, then we could have possibly left
> > this as unfortunate naming errors we're carrying with us (along with
> > "wantarray" for noting whether the context is scalar, list, or void).
> ...
> > Overall, I'm still undecided. Maybe we could start with improving the
> > existing documentation?
>
> Perhaps something like:
>
> >>
>
> =item * C<$flag = utf8::is_utf8($string)>
>
> (Since Perl 5.8.1) Test whether I<$string> is marked internally as
> encoded in UTF-8. Functionally the same as C<Encode::is_utf8($string)>.
> Typically only necessary for debugging.
>
> If you need to force Unicode semantics for code that needs to be
> compatible with perls older than 5.12, call C<utf8::upgrade($string)>
> unconditionally.
>
> Using this flag to decide whether a string should be treated as
> already encoded bytes or characters is wrong, this should be decided
> as part of the interface of your function.
>
> If you're accepting bytes:
>
>   utf8::downgrade($string); # throws an exception if code point over 0xFF
>
>   utf8::downgrade($string, 1) # our own error handling
>     or die "\$string must be representable as bytes"
>
> or if you're accepting characters and need encoded bytes:
>
>   utf8::encode($string); # unconditionally
>
> The only exception is if you're dealing with filenames, since perl
> uses the internal representation of the string for system calls.
>
> <<
>
> Are there any other cases someone might be tempted to call
> utf8::is_utf8()?

Thinking about it further, I'm pretty sure this doesn't all belong here.

L<perlunifaq/What is "the UTF8 flag"?> provides a good description of
the flag is_utf8() returns, and the whole of perlunifaq covers some of
the things the above tries to cover.

perlunicook largely works at a higher level than the functions in
utf8::* work at.
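
To make concrete what the flag does and does not report, here is a minimal sketch. The variable names are mine and purely illustrative, and it assumes an ASCII-based platform; it is not taken from perlunifaq:

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00E9.
my $native   = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # changes the internal representation only

# The strings still compare equal and have the same length in characters ...
my $same_value  = ($native eq $upgraded) ? 1 : 0;
my $same_length = (length($native) == length($upgraded)) ? 1 : 0;

# ... but the flag differs, so it cannot mean "this is text, not bytes".
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

print "same value: $same_value, same length: $same_length, ",
      "flags: $flag_native vs $flag_upgraded\n";
```

Both scalars hold the same string as far as Perl code can tell; only the internal storage differs, which is all is_utf8() looks at.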

One thing from the above that doesn't seem to be discussed well[1] is
what I tried to cover briefly in:

> Using this flag to decide whether a string should be treated as
> already encoded bytes or characters is wrong, this should be decided
> as part of the interface of your function.

which could perhaps use some expansion in perlunicode.

I'm not sure where the cheat sheet that follows it belongs, though
perlunifaq covers some of it (though using Encode instead of utf8::*).

Tony

[1] perlunifaq briefly mentions some of the issues under "What about
binary data, like images?" and goes into more detail in "What if I
don't decode?"
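
As a rough sketch of "decided as part of the interface", assuming two hypothetical helpers (bytes_to_hex and text_to_utf8_bytes are names made up for this example, not existing functions), each function documents what it accepts and never consults utf8::is_utf8():

```perl
use strict;
use warnings;

# Hypothetical helper: the interface says the argument is a byte string.
sub bytes_to_hex {
    my ($bytes) = @_;                 # work on a copy of the caller's value
    utf8::downgrade($bytes, 1)        # our own error handling
        or die "argument must be representable as bytes\n";
    return unpack "H*", $bytes;
}

# Hypothetical helper: the interface says the argument is a character string.
sub text_to_utf8_bytes {
    my ($text) = @_;                  # work on a copy of the caller's value
    utf8::encode($text);              # unconditionally, no is_utf8() check
    return $text;
}

my $plain    = "\xE9";
my $upgraded = "\xE9";
utf8::upgrade($upgraded);

# Same answers whichever internal representation the caller happens to have.
print bytes_to_hex($plain), " ", bytes_to_hex($upgraded), "\n";
print bytes_to_hex(text_to_utf8_bytes($plain)), " ",
      bytes_to_hex(text_to_utf8_bytes($upgraded)), "\n";
```

The point of the design is that the caller's internal representation never changes the result; only a code point over 0xFF passed to the bytes interface is rejected.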