develooper Front page | perl.perl5.porters | Postings from June 2012

Re: [perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]

From: Jesse Luehrs
Date: June 7, 2012 17:06
Subject: Re: [perl #113088] Data::Dumper::Useqq('utf8') broken [PATCH]
Message ID: 20120608000612.GM5599@tozt.net

On Fri, Jun 08, 2012 at 01:54:14AM +0200, demerphq wrote:
> On 7 June 2012 19:33, Jesse Luehrs <doy@tozt.net> wrote:
> > On Thu, Jun 07, 2012 at 07:16:15PM +0200, demerphq wrote:
> >> On 7 June 2012 17:33, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> >> > * Jim Avera <perlbug-followup@perl.org> [2012-05-26 03:10]:
> >> >> However it seems wrong to test for #chars != #bytes, because binary
> >> >> data _should_ be passed as byte strings, that is, with Perl's internal
> >> >> utf8 flag off.
> >> >
> >> > Disagree.
> >> >
> >> > The UTF8 flag is completely irrelevant to a string’s semantics.
> >>
> >> Please stop saying this. It is the same flawed logic that means I can't
> >> send a bitvector in JSON reliably, which is a problem we DO NOT WANT
> >> in Perl.
> >>
> >> It is simply not true. If a string contains binary data then it is
> >> binary, and treating it as utf8 in any form is completely and utterly
> >> wrong.
> >
> > But it is true. I don't really see how what you said contradicts what
> > Aristotle said. If a binary string happens to contain only bytes at
> > most 0x7f, then whether the UTF8 flag is on or off is irrelevant - perl
> > will treat them the same, and application code should treat them the
> > same as well. You're conflating the way that perl stores the string data
> > internally (which is what the UTF8 flag represents) with what the data
> > actually represents (which is a string of characters, which could be
> > interpreted as a byte string if all of the characters are less than or
> > equal to 0xff). A string containing binary data could easily have the
> > UTF8 flag on without changing its meaning, because the UTF8 flag has no
> > relevance to the semantic meaning of the data.
> 
> Some strings contain binary data, such as structs intended to be passed
> into C code, or the result of pack or vec. If you treat them as utf8, you
> either have broken utf8 (such as via vec() iirc), or you have broken
> binary data.

Sure - in cases where the UTF8 flag actually changes the interpretation
of the underlying data, it's clearly relevant. I'm just saying that
using the UTF8 flag to determine whether something is binary data is
flawed, because there are some classes of binary data where the UTF8
flag does not affect the interpretation (in particular, where the binary
data happens to contain bytes that are all at most 0x7f), and so using
the UTF8 flag to determine whether data is "binary" can give incorrect
results.
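
To make that concrete, here is a minimal sketch of how a flag-based check can misclassify data. The helper name looks_binary_by_flag is hypothetical and exists only for illustration; it is not taken from Data::Dumper or from the patch under discussion.

```perl
use strict;
use warnings;

# Hypothetical heuristic: "binary" means "the UTF8 flag is off".
sub looks_binary_by_flag {
    my ($string) = @_;
    return utf8::is_utf8($string) ? 0 : 1;
}

# Binary data whose bytes are all at most 0x7f.
my $plain = pack("C*", 0x00, 0x41, 0x7f);

# Same data, but stored internally as UTF-8. Only the storage changes.
my $upgraded = $plain;
utf8::upgrade($upgraded);

my $same_value  = ($plain eq $upgraded)                 ? 1 : 0;
my $same_length = (length($plain) == length($upgraded)) ? 1 : 0;

# The two strings compare equal, yet the heuristic gives different answers.
my $plain_is_binary    = looks_binary_by_flag($plain);
my $upgraded_is_binary = looks_binary_by_flag($upgraded);

print "equal: $same_value, same length: $same_length\n";
print "plain flagged binary: $plain_is_binary, ",
      "upgraded flagged binary: $upgraded_is_binary\n";
```

The two strings are equal by eq and by length, so any check that answers differently for them is reporting on storage, not on content.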

As for vec(), it seems like it should probably forcibly downgrade its
first argument. I can't really see how using vec() on a string with
codepoints greater than 0xff makes any sense at all. pack() is more
complicated because of the U template character, but we already had this
discussion in some depth a few months ago, and we probably don't need to
reopen it now.
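
Until vec() does something like that itself, calling code can downgrade explicitly. The following is only a sketch of the idea, not a proposed patch: vec_checked is a hypothetical wrapper name, and it works on a copy so the caller's string is left untouched.

```perl
use strict;
use warnings;

# Hypothetical wrapper: read a field with vec() only after confirming
# that the string can be represented as plain bytes.
sub vec_checked {
    my ($string, $offset, $bits) = @_;
    my $copy = $string;

    # With a true second argument, utf8::downgrade returns false
    # instead of dying when a code point above 0xff is present.
    utf8::downgrade($copy, 1)
        or die "vec_checked: string has code points above 0xff\n";

    return vec($copy, $offset, $bits);
}

# A byte string that happens to be stored with the UTF8 flag on.
my $bytes = "\x{e9}\x{01}";
utf8::upgrade($bytes);

my $first  = vec_checked($bytes, 0, 8);    # first 8-bit element
my $second = vec_checked($bytes, 1, 8);    # second 8-bit element

# A string with a code point above 0xff cannot be treated as bytes.
my $wide_ok = eval { vec_checked("\x{263A}", 0, 8); 1 } ? 1 : 0;

printf "first: 0x%02x, second: 0x%02x, wide accepted: %d\n",
    $first, $second, $wide_ok;
```

Rejecting the wide string outright is a design choice in this sketch; the point is only that the decision is made on the string's contents and not on its flag.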

-doy
