
Re: RFC: what to do about bitwise string operators (related to [perl#63574])

Karl Williamson
November 17, 2015 02:58
Re: RFC: what to do about bitwise string operators (related to [perl#63574])
On 06/12/2015 01:15 PM, Ricardo Signes wrote:
> * Karl Williamson <> [2015-05-02T14:12:16]
>> When | & ^ ~ are executed on strings, or when the new |. &. ^. ~. operators
>> are run, the internal representation of those strings is relied on (and
>> hence exposed).  This means different behaviors will often result on EBCDIC
>> vs ASCII platforms.
>> More importantly, whether a string is in UTF-8 or not may affect the result.
>> There is no such problem if the string is comprised solely of ASCII
>> characters (on ASCII machines or ASCII-equivalent characters plus the C1
>> controls on EBCDIC machines), which is why people may not have been bitten
>> much by this in the past.
> Karl and I discussed this at YAPC.
> My current thinking:
> A string's codepoints should be treated as octets and operated upon bitwise.
> There should be no "Unicode bug."  "😊" & "😟" should raise an exception.
> Probably:
>    String with code points over 0xFF may not be used as bit strings on %s side
>    of %s operator
>    ("left", "&.")
> On the other hand, ("😊" | "😟") should return "😐".
> (That's a joke.  Please don't.)

It turns out that I was mostly wrong about how things currently work. 
The operations are done on the underlying code points.  For those of you 
just joining this discussion, the original message is

and so the current operation is item C) from that message.

I still don't think that is useful in general.  Code points are not 
generally assigned so that the relationships between them are 
meaningful under bit operations.

The exceptions I can think of are in some cased scripts, such as Latin 
(ASCII) and Cyrillic, where the upper- and lowercase characters are 
typically 2**X code points apart (usually 2**5), so that one can 
construct a mask and test with a single comparison whether something is 
an upper- or lowercase FOO.  The macro isALPHA_FOLD_EQ in handy.h does 
this, but the result is undefined if it is passed something that isn't 
an ASCII alphabetic.  In many cased scripts this would mostly work, but 
not entirely.  In Greek, for example, it fails for sigma, as there are 
two lowercase sigmas and just one uppercase one.
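For ASCII, the trick can be sketched as follows (fold_eq is an 
illustrative stand-in, not the actual isALPHA_FOLD_EQ macro from 
handy.h):

```c
/* Upper- and lowercase ASCII letters differ only in bit 0x20 (2**5),
 * e.g. 'A' is 0x41 and 'a' is 0x61.  Masking that bit off lets a
 * single comparison match either case.  As with isALPHA_FOLD_EQ, the
 * result is undefined unless both arguments are ASCII alphabetics. */
static int fold_eq(unsigned char c1, unsigned char c2)
{
    return (c1 & ~0x20) == (c2 & ~0x20);
}
```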

Also, all decimal digits in a script occupy a contiguous block of 10 
code points.  (Chinese, for example, doesn't meet this requirement, and 
so the Han digits are not considered decimal digits.)  The zero need 
not be a code point whose value ends in 0.  The operations on these are 
typically addition and subtraction.  If you know, for example, that you 
have a Lao digit, you can find its numeric value from how many code 
points above LAO DIGIT ZERO it is.  So bitwise AND, OR, and XOR are not 
meaningful here.
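Since the ten digits are contiguous, extracting the numeric value is a 
subtraction rather than a bit operation; a minimal sketch for Lao 
(lao_digit_value is an illustrative name, not a Perl API):

```c
/* LAO DIGIT ZERO is U+0ED0, and the digits ZERO..NINE occupy the ten
 * contiguous code points U+0ED0..U+0ED9, so the numeric value is the
 * distance from the zero.  Valid only for code points in that range. */
static int lao_digit_value(unsigned long cp)
{
    return (int)(cp - 0x0ED0UL);
}
```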

There may be other relationships that I'm unaware of, so I can see just 
leaving this working the way it is and not deprecating it.  But I am 
willing to go either way.  I've already written the small amount of code 
necessary to do the deprecation on binary operations.

However, I can't think of a reason to want the complement of a code 
point.  The way it works currently is that if all characters in the 
string being complemented are below 256, the complement is byte-based 
even if the string is in UTF-8.  If any are above 255, the complement 
is taken at the full UV word size.  This differs between 64-bit and 
32-bit platforms, so code doing this leaks the word size and is not 
portable.  We have also agreed to deprecate code points above IV_MAX. 
This means that if we continue to implement this, we should be 
complementing at 63-bit and 31-bit word sizes instead, which gives 
different results than it does currently.
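The word-size leak can be illustrated directly (these helper names are 
hypothetical, standing in for what ~ effectively does internally at 
each UV width):

```c
#include <stdint.h>

/* Complementing the same code point at two different word sizes gives
 * two different answers, which is what makes the current behavior of
 * ~ on strings with wide characters unportable between 32-bit and
 * 64-bit builds. */
static uint32_t complement_uv32(uint32_t cp) { return ~cp; }
static uint64_t complement_uv64(uint64_t cp) { return ~cp; }
```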

I think we should definitely deprecate complementing above-Latin1 code 
points.
