Front page | perl.perl5.porters |
Postings from July 2017
Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
From: Sawyer X
July 17, 2017 08:47
Re: [perl #131685] Rename utf8::is_utf8() (and other functions)
Message ID: firstname.lastname@example.org
I have mixed thoughts about this.
I'm sympathetic to both considerations: Having properly-named functions
to reduce confusion for future developers (we hope to have some, right?)
but not introduce additional cognitive load for existing developers.
A few ways to make such a situation easier:
* Document utf8::is_utf8() to prevent this confusion: This is by far the
first thing that should be done. I have double checked the wording for
utf8::is_utf8() from my blead (978b185):
(Since Perl 5.8.1) Test whether $string is marked internally as
encoded in UTF-8. Functionally the same as "Encode::is_utf8()".
This is confusing, to say the least. "Marked internally" is the words
core hackers are looking for and recognize, but "UTF-8" is what non-core
hackers (those without the cognitive bias in core terms) see and
understand. If we head over to Encode::is_utf8() we see:
[INTERNAL] Tests whether the UTF8 flag is turned on in the /STRING/.
If /CHECK/ is true, also checks whether /STRING/ contains
well-formed UTF-8. Returns true if successful, false otherwise.
As of Perl 5.8.1, utf8 <https://metacpan.org/pod/utf8> also has the
I like this wording better for several reasons: It is under the title
"Messing with Perl's Internals"; it notes the "UTF8" flag, and it adds
that it checks for well-formed UTF-8 only if that flag is true. There
are improvements to be made here too. We can note what the flag means
(subtle, complicated, bike-shed-able) or at the very least add a nice
"this isn't the flag you're looking for" warning. We can also suggest
when to use and when not to use the function (otherwise it's left to the
reader, who can easily get it wrong, which is why we're here).
If the document on both was better, then we could have possibly left
this as unfortunate naming errors we're carrying with us (along with
"wantarray" for noting whether the context is scalar, list, or void).
* Provide different functions and document all functions in all other
If we decide to have better named functions, we will have additional
cognitive load for both experienced core developers and new developers.
For core developers, it is a muscle memory to undo and two different
sets of code to deal with - those with the old name and those with the
new name. For new developers, it will be simple at first, until you come
in contact with the old name. It is likely this will also happen early,
so you need to learn two names anyway. However, their muscle memory will
be geared towards using a more descriptive name.
I mix those when it comes to English.pm. I use $_, $@, $!, $#, $/, $^X,
$0 and a few more, but I use English.pm for $<, $>, $(, $), $", and a
few more. The reasoning is simple: $_, $@, $!, and $# are so common it
will be built into every muscle memory. On the other hand, for many
developers, if they see $<, they will need to look it up in perlvar
anyway. However, $UID or $REAL_USER_ID is readable right away and no
need to look it up.
One additional point about English is that, unlike what we're suggesting
here, the punctuation variable names are the right name, they're just
not descriptive. is_utf8() is not about descriptive, but misleading. It
is a misnomer. It makes it an undesired pitfall.
I see value in adding proper names, but then we would need to take care
of at least making all possible names available in the documentation of
all other names. If you're reading utf8.pm, you need to find
"is_upgraded" in "is_utf8" and "is_utf8" in "is_upgraded". This makes
it easy to quickly find what they mean and differentiate when we see
* Move all known usages in core to new functions
Another way to improve this new cognitive load is by reducing it in the
codebase. Removing as many instances of the old name will reduce the
mixture of names, thus helping us move towards the new name. This is a
much more intrusive change but has a high potential of helping seasoned
developers to deal with the new name.
* Automated policies for improving CPAN code quality
This is beyond the scope of core, but I think it's worthwhile taking
into account the perspective of the community. Realizing the misused
"is_utf8" brings with it a question of whether and how we could reduce
this problem's scope outside the core, and this could have been done
with a kwalitee check (CPANTS) that checked for "is_utf8" and
recommends reviewing its use. This is far more complicated since there
is a legitimate (but narrow) use for it, and you might get false
positives. I believe only a human could find the situations in which
Still, it is worthwhile keeping in mind.
Overall, I'm still undecided. Maybe we could start with improving the
 Using "is_upgraded" as an example different name.
On 07/13/2017 06:53 AM, Karl Williamson wrote:
> On 07/12/2017 12:36 AM, H.Merijn Brand wrote:
>> On Tue, 11 Jul 2017 10:41:37 -0700, "Father Chrysostomos via RT"
>> <email@example.com> wrote:
>>> On Tue, 11 Jul 2017 00:55:51 -0700, davem wrote:
>>>> On Mon, Jul 10, 2017 at 12:45:48PM -0400, Sawyer X wrote:
>>>>> Does anyone have any comments on this? Tony, Dave, Zefram? *Karl*? :)
>>>> My opinion on this sort of proposal (and it's an opinion which has
>>>> stronger over time (*)) is rarely/never to add a new alias name to an
>>>> existing function.
>>>> Alias names just increase the cognitive load. If the old names were
>>>> confusing, having more names will just increase the confusion.
>>>> Before, you would have to remember that a particular function foo() is
>>>> badly named and doesn't do what you might expect it to do, based
>>>> solely on
>>>> the name.
>>>> Afterwards, you have to remember that that are two functions foo() and
>>>> bar(), one is deprecated (which one?), one is badly named (which
>>>> but they both do the same thing (Or do they? Sigh. Let's check the
>>>> documentation one more time).
>>>> Life is now harder.
>>>> (*) My opinion firmed over AvFILL(). It was a weird name, but I was
>>>> used to
>>>> it. Now I can never remember what the new alias is called (just looked
>>>> it up - av_top_index()). In hindsight, I would have voted against
>>> I agree with everything you have said. I brought up the same
>>> objection when this proposal was first put forward, but I thought I
>>> had lost the debate. Well, at least there are two of us now. :-)
>> Count me in: three. I like the way Dave has written down my feelings :)
> I guess we have a fundamental disagreement about language design and
> the direction Perl should go, which makes me sad.
> The point of adding synonyms for deceptively-named functions and
> macros is to make life easier overall. Forbidding new better-named
> synonyms for problematically named things forces everyone who comes
> along to deal with the gotchas and cognitive load that those people
> already here have had to deal with. By creating better-named things,
> those people can largely avoid these problems. This allows them to
> work more efficiently, avoiding traps, and with less cursing Perl.
> Unless Perl is close to death, the number of people who are going to
> come along before it does die dwarfs the number who are already
> expert. Some people are knowledgeable in parts of Perl, but not
> all. They also gain if gotchas get removed before they have to deal
> with them.
> Specifically about av_top_index, I don't believe that it is so poorly
> named that you have to keep consulting the documentation as to what it
> It came about not because of AvFILL, but because of the
> already-existing synonym, the evilly named "av_len". This name
> implies it gives a length, but in fact it is one-off from that.
> av_top_index, though cumbersome, accurately indicates what it returns.
> Using av_len is a bug waiting to happen. It is a foreseeable problem.
> I believe that it would be unethical to not create a non-deceptive
> alternative. It's kind of like a safety recall.
> Writing code using deceptively named things or with poor API's is
> slower and more error prone. Every time you use one, you have to get
> out of your mental pipeline and recall that this is a gotcha, and have
> to figure out how it is a gotcha and how you have to compensate. You
> are effectively flushing your mental instruction cache. In the case
> of av_len, you have to remember which way is the off-by-one problem here.
> Code reviews also are affected. It is just too easy to read the thing
> and forget that it doesn't do what you would want.
> In researching the issue back when av_top_index was created, I found
> published modules that used av_len, as its name implies, as a length.
> Others undoubtedly had caught the problem earlier, say through their
> unit testing.
> But all this could be avoided by the code using a non-deceptive name.
> Hopefully, the coder won't even be aware that there exist deceptive
> ones for hysterical reasons.
> It is foreseeable that av_len is going to cause problems. It would be
> irresponsible of us to not create a non-deceptive synonym when it is
> so easy to do.
> No one was really happy with "av_top_index" as a name. So AvFILL was
> retained in the core. All occurrences of av_len were removed. If we
> could have come up with a short, pithy synonym, we would have replaced
> AvFILL as well, and then people looking at the core would have seen
> that and gotten used to it, and over time the memory of the less
> well-named versions would have faded.
> Writing good APIs is hard. I have flattered myself at times into
> thinking I'm good at it. Maybe I am actually good, but if so, I'm
> still not good enough. And few, if any, are. If we have a poor API
> in some area, we should not tie our hands and say tough to all those
> people who come along later, and give them more reason to use some
> other language