
Re: "perl: utf8.c:1997: Perl_swash_fetch: Assertion `klen <=sizeof(PL_last_swash_key)' failed." [5.12.1]

November 28, 2010 02:34
On 27 November 2010 23:53, Reverend Chip <> wrote:
> On 11/27/2010 4:06 AM, Nicholas Clark wrote:
>> On Fri, Nov 26, 2010 at 01:03:20PM -0800, Reverend Chip wrote:
>>> You seriously equate Encode::_utf8_on() with, say, playing around with
>>> optrees using B?  You seriously equate a bad pointer in an SV to a
>>> misplaced byte in a utf8 string?
>> Yes. Totally.
> There are some similarities, but since the ':utf8' layer just slaps the
> utf8 bit on whatever comes in, the situations are not identical.

To me *that* is the bug that you should be reviewing.
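
For anyone following along, the gap between the flag and the data can be shown without any I/O. This is only a sketch: it uses Encode::_utf8_on() as a stand-in for what the ':utf8' layer is described as doing (an assumption for illustration, not the layer's actual code path), and the sample string and variable names are made up.

```perl
use strict;
use warnings;
use Encode ();

# Latin-1 bytes: 0xE9 followed by a space is not well-formed UTF-8.
my $bytes = "caf\xE9 au lait";

# Set the flag without looking at the data (stand-in for ':utf8').
my $flagged = $bytes;
Encode::_utf8_on($flagged);

my $is_flagged = utf8::is_utf8($flagged) ? 1 : 0;   # state of the flag
my $is_valid   = utf8::valid($flagged)   ? 1 : 0;   # state of the data

# A validating decode is asked to croak on the same bytes
# rather than flag them. LEAVE_SRC keeps $bytes untouched.
my $decoded = eval {
    Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
my $decode_failed = defined $decoded ? 0 : 1;

print "flagged=$is_flagged valid=$is_valid decode_failed=$decode_failed\n";
```

The point is only that the flag and the well-formedness of the buffer are separate facts, and nothing downstream rechecks the second one.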

>  It's
> obvious to me that since a regex can die of an assertion due to bad
> input data, then we might at least want to clue the user in about which
> regex is dying so he can guard it.

If a better error message was possible *without* having to validate
the utf8 string then I would say you are right.

However if it means we have to validate the string every time we do a
utf8 operation then I would say you are wrong.

If we stop trusting the utf8 bit, then we will have to validate utf8
data really quite a lot. Of course, we would soon then add a bit
saying that the string was validated, otherwise perl would basically
grind to a halt verifying and reverifying its utf8 data all the time -
all it would do is spin its wheels validating utf8. So adding a bit
saying "this string is valid utf8" would be inevitable. But once we
added this new bit "utf8_valid", we would then have *two* bits saying
the string is utf8! And we would no longer have an escape hatch for
code that wants to work by contract and avoid the overhead of
validating well formed input. Much worse, once we added that bit, we
would then inevitably have someone requesting, or writing, a utility
function to flip *that* bit. And then we would be right back where we
started, with people filing bugs about being able to seg fault perl by
flipping on the bits inappropriately...

So then I guess we would add a third bit, "this string really really
is valid"... Oh wait.

Which to me makes it pretty obvious that we aren't going to start not
trusting the utf8 bit - that way lies madness.

So if we can have a better error message without implying that every
utf8 operation has to be guarded by utf8 validation logic then I doubt
there is any debate. But that doesn't seem to be what you were

> Since no one is chiming in to agree
> with me, I guess I'll just stop.  I'm quite disappointed by the apparent
> lack of concern for the basic usability issues.  I'm left with no option
> but to think of it as evolution in action.

I think this is unfair. We care about usability issues, however we
don't agree on the characterization of the *source* of the bug.

Let's say cosmic rays, or hardware failure, cause one of your DIMMs to
store a 1 bit instead of a 0 somewhere in an internal data structure
that Perl operates on. Would you say that it is Perl's job to be robust
to such forms of failure? I think most of us would agree that if
something is to be robust to that form of failure it isn't *perl*.
Likewise with the case of turning on the utf8 bit on a string that
does not contain valid utf8 data.

However I would argue that the :utf8 layer not validating input before
marking the string as utf8 is probably a bug, and if you complained
about such a bug you would find me agreeing with you. But for me it would be
purely in regard to the fact that despite our long tradition of
allowing people to blow their foot off, making it that easy probably
isn't a good idea. In other words, the name of the flag isn't ideal.

>> I'd really prefer that it didn't exist at all.

Sure, in an ideal world we shouldn't have it. On the other hand,
what are we to do?  For performance reasons we really *want* an escape
hatch so that in cases where there is no doubt of the validity one can
avoid checks. At work, for various reasons, this is a common occurrence.
We know that data returned from a certain external system is utf8
encoded, and we *really* don't want to have to validate it when we
expose it in the perl code as a variable.

So it seems to me that if anything the problem here is that :utf8 was
a poor name. If it had been named
:valid_utf8_will_segv_if_you_are_wrong, and had Encode::_utf8_on been
called Encode::_utf8_on_do_not_file_bugs_if_you_use_this_on_bad_data.

However, at this point what are we to do? I think probably we should
make :utf8 validate, but I'm less sure about explicit use of
Encode::_utf8_on. We could for instance add a second parameter
"no_validate", which would have the effect that code that uses
Encode::_utf8_on($string) would now start validating the string first,
and that anybody using the no_validate flag would have read the docs
DATA WHICH IS NOT WELL FORMED UTF8", however I personally would be
a little irritated, as I know how much code would have to be touched at
$work to upgrade. OTOH, it's not that hard a fix; we have done similar
things many times without too much effort.
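
To make the "second parameter" idea concrete, here is a rough sketch of how it could behave, written as a wrapper rather than a change to Encode itself. The name utf8_on_checked() and the no_validate argument are hypothetical; nothing like this exists in Encode today. It validates by default and only skips the check when the caller explicitly asks.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode ();

# Hypothetical wrapper, NOT part of Encode: flags $_[0] in place.
# A true second argument plays the role of the proposed "no_validate".
sub utf8_on_checked {
    my $no_validate = $_[1];

    unless ($no_validate) {
        # Check a flagged copy so the caller's string is untouched on failure.
        my $probe = $_[0];
        Encode::_utf8_on($probe);
        croak "string is not well-formed UTF-8" unless utf8::valid($probe);
    }

    Encode::_utf8_on($_[0]);
    return 1;
}

my $good = "caf\xC3\xA9";          # well-formed UTF-8 bytes
utf8_on_checked($good);

my $bad = "caf\xE9 au lait";       # Latin-1 bytes, not UTF-8
my $refused = eval { utf8_on_checked($bad); 1 } ? 0 : 1;

my $trusted = "caf\xE9 au lait";   # same bytes, caller takes the blame
utf8_on_checked($trusted, 1);
```

The copy costs an allocation per call, which is exactly the overhead the escape hatch exists to avoid, so code working by contract would pass the flag.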

> I actually use it (properly) in conjunction with utf8::valid to detect
> and repair double encoding, so I'm very happy it's available.  If it
> weren't, I'd have to write it.

Do you mean Encode::_utf8_off()?  I can understand how
Encode::_utf8_off() might be useful in recursively decoding
multi-encoded utf8, as I've done the same thing myself (without using
utf8::valid tho), but I've never needed to use Encode::_utf8_on() for
that purpose. But Encode::_utf8_off() is a completely safe operation,
and will never cause anything to seg fault, unlike its inverse.
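
For illustration, a minimal sketch of that kind of recursive repair. Note that it swaps in utf8::decode(), which checks well-formedness before setting the flag, rather than the Encode::_utf8_off()/_utf8_on() pair discussed above; the helper names are made up. It is a heuristic: a string that legitimately contains text which happens to look like UTF-8 bytes would be altered too.

```perl
use strict;
use warnings;

# Hypothetical helper: undo one layer of accidental UTF-8 encoding.
sub undo_one_layer {
    my ($text) = @_;
    my $copy = $text;
    # utf8::decode() returns false if the characters cannot be read as
    # well-formed UTF-8 octets; in that case hand back the original.
    return utf8::decode($copy) ? $copy : $text;
}

# Repeat until another pass no longer changes anything.
sub undo_all_layers {
    my ($text) = @_;
    while (1) {
        my $next = undo_one_layer($text);
        return $text if $next eq $text;
        $text = $next;
    }
}

# U+00E9 encoded as UTF-8 twice over.
my $double = "\xC3\x83\xC2\xA9";
my $fixed  = undo_all_layers($double);
printf "U+%04X\n", ord $fixed;
```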


perl -Mre=debug -e "/just|another|perl|hacker/"
