perl.perl5.porters

[perl #90646] perlop doesn't document bitwise &, |, and ^ on Unicode strings

Karl Williamson via RT
March 22, 2014 18:43
On Sun Sep 08 11:32:38 2013, wrote:
> On Sat, Sep 7, 2013 at 10:40 PM, Father Chrysostomos via RT <
>> wrote:
> > I think what Tom is getting at is that it is not documented what
> > ~"\x{100}" will do.  In fact, I couldn’t even tell you what it does do.
> >  I suspect it is buggy and inconsistent.
> >
> Based on testing with 5.16.3, it's functionally equivalent to the following:
> if (/[^\x00-\xFF]/) {
>     # e.g. chr(0x100) -> chr(0xFFFFFEFF) when ivsize==4
>     return pack 'C*', map { ~$_ } unpack 'C*', $_;
> } else {
>     # e.g. chr(0x10) -> chr(0xEF)
>     return pack 'C*', map { (~$_) & 0xFF } unpack 'C*', $_;
> }
> ~$U8s always works. ~$UVs doesn't always work. I'd call that a buggy design.
> (Note that it doesn't suffer from The Unicode Bug. All decisions are based
> on the content of the string, not its internal storage format.)

I took a stab at this, and a proposed wording patch is attached.  I looked at the code, and the pre-existing text for ~ appears to be correct.  I'm saying that code should avoid using the other three operators (&, |, and ^) on anything but numbers and bitstrings, though I don't go into any details as to why.  The results will vary if the string contains one of the 128 UTF-8 variant characters, depending on whether the string is encoded in UTF-8 or not.  Further, I don't think we should offer a guarantee that the internal encoding is never going to change to be something other than what we have now.  So, code should not rely on the UTF-8 or non-UTF-8 representation of strings.
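
To make the "numbers and bitstrings" advice concrete, here is a minimal sketch, not taken from the attached patch. It assumes the classic operator semantics (no `use feature 'bitwise'`) and operands that are plain byte strings, i.e. every code point is <= 0xFF. The helper name assert_bitstring is hypothetical and only illustrates one way code could refuse anything else.

```perl
use strict;
use warnings;

# Byte strings (every code point <= 0xFF) used as bitstrings.
my $x = "\x0F\xF0";
my $y = "\x3C\x3C";

my $and = $x & $y;    # bytes 0x0C 0x30
my $or  = $x | $y;    # bytes 0x3F 0xFC
my $xor = $x ^ $y;    # bytes 0x33 0xCC
my $not = ~"\x10";    # byte  0xEF

# Hypothetical guard (an assumption, not part of the patch): refuse
# strings holding a code point above 0xFF before they reach a string
# bitwise operator.
sub assert_bitstring {
    my ($s) = @_;
    die "not a bitstring: contains a code point above 0xFF\n"
        if $s =~ /[^\x00-\xFF]/;
    return $s;
}

my $left   = assert_bitstring($x);
my $right  = assert_bitstring($y);
my $masked = $left & $right;

print unpack('H*', $masked), "\n";    # prints 0c30
```

The guard looks only at the content of the string, never at whether it happens to be stored as UTF-8 internally, which is the property the wording above asks callers not to depend on.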
Karl Williamson

via perlbug:  queue: perl5 status: open