Re: Behavior of bitwise ops on unencountered wide characters
From: Karl Williamson
Date: December 20, 2017 02:16
Subject: Re: Behavior of bitwise ops on unencountered wide characters
Message ID: bd8d2ec0-a357-4574-1bd6-c0758fa2a459@khwilliamson.com
On 07/12/2017 11:02 AM, Karl Williamson wrote:
> On 07/12/2017 04:50 PM, Sawyer X wrote:
>>
>>
>> On 07/11/2017 01:09 PM, Karl Williamson wrote:
>>> On 07/10/2017 11:12 PM, Father Chrysostomos wrote:
>>>> Karl Williamson wrote:
>>>>> I don't yet have a fully formulated opinion on this, but one
>>>>> question I
>>>>> would have is "How is this different from division by 0" that people
>>>>> seem to deal ok with.
>>>>
>>>> Fatal division by zero is ancient. Fatalizing bitwise operations on
>>>> utf8 breaks stuff.
>>>>
>>>> As I suggested in another thread (I seem to have been ignored), it
>>>> would be *much* kinder to users to make it a warning. (Wide character
>>>> in blah blah blah.) That way users who care can fatalize it, or
>>>> suppress it. You have the best of all three worlds.
>>>>
>>>
>>> I believe I've referred to your suggestion in some thread. It is the
>>> minimum we should do. And others believe it should be deprecated.
>>
>> There is a specific cost here Graham noted. This method is currently
>> used to determine if a variable is a number without loading "B", which
>> isn't cheap. While it is a simple argument of "users shouldn't care,"
>> serializations (like JSON) need to be able to map them to their right
>> type. It would be nice if there was a way to do this without B.
>>
>
> It would be good to have some alternative that requires only a cheaply
> loaded, or internal module, something named like "Internals" that
> provides a clear access path for the things we have determined warrant
> it, such as Graham's use case. He had to explain to me how it worked,
> and he had to explain to Yves as well. That demonstrates it is
> non-obvious. When the tools aren't available, people will do clever,
> but non-maintainable things to get what they need. But it is best to
> furnish the tools when it becomes known that they would be useful.
>
I'm not sure which is the best message on this thread to reply to, so I
chose this one.
I have worked up a patch that deprecates the use of wide characters in
bit operations, though I'm not sure why we can't just warn forever,
as sprout suggested. My first question is
1) Does this really have to be deprecated? Can a simple warning suffice?
The patch causes failures in ExtUtils-MakeMaker/t/unicode.t because
JSON::PP in blead still uses the trick mentioned on this thread.
I'm not sure what to do about that.
On IRC, Paul Evans pointed out that he does something similar, and that
this code avoids the problem:
https://metacpan.org/source/PEVANS/Tangence-0.24/lib/Tangence/Type/Primitive.pm#L488
He doesn't try the binary op unless the string is all ASCII. His
version could be sped up slightly by using AND, as Graham does, instead
of XOR, because AND short-circuits. But JSON::PP could adopt something
like this and have things work as well as they currently do, without
having to load B. Graham has ideas for alternative implementations, and
has agreed to look into this.
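A minimal sketch of the guarded check described above, assuming the usual form of the trick: with the default (non-"bitwise" feature) operators, AND of two strings yields a string as long as the shorter operand, while AND involving a number yields a number. The helper name looks_numeric_scalar is hypothetical, and this is not the code from Tangence or JSON::PP.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $value is stored as a number, not a string.
sub looks_numeric_scalar {
    my ($value) = @_;
    return 0 unless defined $value && !ref $value;

    # Guard: a string containing non-ASCII never reaches the bitwise op.
    # A number stringifies to ASCII only, so real numbers pass this test.
    return 0 if "$value" =~ /[^\x00-\x7F]/;

    no warnings 'numeric';    # "" is not numeric; silence that warning
    my $empty = "";

    # string & string -> "" (length of the shorter operand, so no work done)
    # string & number -> numeric 0, whose string form has length 1
    return length( $empty & $value ) ? 1 : 0;
}

print looks_numeric_scalar(42)   ? "number\n" : "string\n";
print looks_numeric_scalar("42") ? "number\n" : "string\n";
```

A serializer would likely need further checks on top of this (for example for inf and nan, or for strings that have since been used in numeric context); the sketch only shows how the ASCII guard keeps wide characters away from the bitwise op.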
I'm working on this now, as we are getting close to our first freeze
date for this release. I see that the deadline has been extended from
today (European and further East time) to next month. But we need to
have something soon, in any event.
My second question is
2) Should I put in whatever warning we add now, creating a temporary
customization for JSON::PP, or wait until a new version is integrated
into blead?
Recently, I added vectorization of detecting UTF-8 invariant strings,
speeding things up for long strings by huge amounts: on 64-bit machines,
for example, it needs only 1/8 the conditionals. On ASCII platforms,
invariant strings are the same thing as what's matched by [[:ascii:]]+,
so it occurs to me that patterns that contain this could be optimized to
use this vectorization. My final question is
3) Is there enough usage of quantified [[:ascii:]] in the wild to
justify doing this optimization? (I was surprised to see that only 132
CPAN modules contain a plain :ascii:, and that grep would also catch
negated uses.)
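To illustrate where the "1/8 the conditionals" figure comes from, here is a rough sketch of a word-at-a-time ASCII test. It is written in Perl only for illustration and is not the core implementation; it assumes a perl built with 64-bit integers (needed for the Q unpack format) and a byte string as input. The name is_ascii_wordwise is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every byte of $bytes is below 0x80.
sub is_ascii_wordwise {
    my ($bytes) = @_;    # assumed to be a byte string, not wide characters
    no warnings 'portable';    # the 64-bit hex literal below is intentional
    my $high_bits = 0x8080808080808080;    # high bit of each of 8 bytes

    my $tail_len = length($bytes) % 8;
    my $head_len = length($bytes) - $tail_len;

    # One test per 8-byte word instead of one test per byte.
    for my $word ( unpack( "Q*", substr( $bytes, 0, $head_len ) ) ) {
        return 0 if $word & $high_bits;
    }

    # Fewer than 8 bytes remain; test them individually.
    for my $byte ( unpack( "C*", substr( $bytes, $head_len ) ) ) {
        return 0 if $byte & 0x80;
    }
    return 1;
}

print is_ascii_wordwise("plain ASCII text") ? "ascii\n" : "not ascii\n";
```

Run as Perl code this will not be fast; the point is only that each 8-byte word costs a single mask-and-test.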