develooper Front page | perl.perl5.porters | Postings from January 2014

Re: Marking a scalar as an unupgradable binary blob.

Aristotle Pagaltzis
January 28, 2014 08:19
Hi demerphq,

* demerphq [2014-01-28 05:45]:
> [Retitled this to start a new thread.]
> On 28 January 2014 03:55, Zefram wrote:
> > demerphq wrote:
> >>FWIW Figuring out a way to mark a string as not being upgradable,
> >
> > Sounds like a bad idea.
> Given that corruption of binary data by utf8 upgrading has been
> a repeated and expensive error at $work I disagree. Rather strongly.
> > What exactly do you mean by "a string" here? Is this a status that
> > would be preserved across copying such as scalar assignment?
> I haven't decided, but probably not.
> > If so, adding any new hidden flag is a bad idea, but one that
> > prevents using what is otherwise always a legitimate representation
> > of a string is especially bad.
> Depends on your point of view. If you are one of the "there is no
> binary data" mob

I am. :-)

> then I suppose you might consider it bad.

I don’t. :-)  The simple point of the mobsters, or at least of this
mobster, is that Perl currently only has a single string type, and the
UTF8 flag is semantically meaningless. No religion, just facts.
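
A minimal sketch of what "semantically meaningless" means here, using
only the core utf8:: functions. The variable names are just for
illustration:

```perl
use strict;
use warnings;

my $plain    = "caf\xE9";   # four characters, stored one byte each
my $upgraded = $plain;
utf8::upgrade($upgraded);   # same four characters, stored as UTF-8 internally

# The internal representation differs ...
my $flags_differ = utf8::is_utf8($upgraded) && !utf8::is_utf8($plain);

# ... but string operations see the same value either way.
my $same_string = $plain eq $upgraded;
my $same_length = length($plain) == length($upgraded);

printf "flags differ: %s, equal: %s, same length: %s\n",
    map { $_ ? "yes" : "no" } $flags_differ, $same_string, $same_length;
```

The flag records how the characters are stored, not whether they are
text or bytes.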

> On the other hand if you are not religious about this stuff then you
> know that Perl has to deal with binary data, and that it should NOT be
> upgraded, even if concatenated with unicode strings, and seen from
> that point of view it is a very good idea.

I do think Perl lacks a distinction for the two types of strings. And in
the absence of string types, the UTF8 flag is an attractive nuisance,
because people do want and need string types, and so they try to hang
the distinction on the meaningless UTF8 flag. More or less all of us
have, at one point or another. So I’m all for adding string types.
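
The failure that makes people reach for such a distinction can be
sketched roughly like this. It assumes the mixed string is later written
out as UTF-8, which is modelled here with utf8::encode; the values are
arbitrary:

```perl
use strict;
use warnings;

my $binary  = pack "N", 0xDEADBEEF;   # four bytes: DE AD BE EF
my $unicode = "\x{100}";              # one character above 0xFF

# Perl concatenates them without complaint; the result is a
# five-character string, one of whose characters does not fit in a byte.
my $mixed = $binary . $unicode;

# Encoding it for output (roughly what a :encoding(UTF-8) layer does)
# turns every original byte >= 0x80 into two bytes.
my $on_the_wire = $mixed;
utf8::encode($on_the_wire);

printf "characters: %d, bytes written: %d\n",
    length($mixed), length($on_the_wire);
printf "blob intact: %s\n",
    substr($on_the_wire, 0, 4) eq $binary ? "yes" : "no";
```

Nothing in the language objected at the point where bytes and characters
were mixed, and that silence is the gap string types would fill.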

> So I consider Perl dying (or perhaps warning, I'm flexible in that
> regard) when someone tries to concatenate a *designated* binary blob
> with unicode data exactly the right thing to do. Which is what
> I intend to make happen if I can.
>
>     my $string = pack "N*", @nums;
>     mark_as_binary($string);
>     $string .= $unicode;                  # boom
>
>     my $binary = encode_sereal($struct);  # does mark_as_binary() internally
>     my $unicode = "\x{100}";
>     $unicode .= $binary;                  # boom
>     $binary .= $unicode;                  # boom
>     sprintf "%s %s", $binary, $unicode;   # boom
>
> My current thoughts are that this can be facilitated by attaching
> magic to the scalar and some minor patches to the appropriate parts of
> the internals.

I like the proposal but think it is incomplete, and I am uncomfortable
with its sole reliance on the UTF8 flag.

I’d like to see this expanded to include marking a string as a character
string. Concatenating any marked string to any unmarked string should
then warn, and importantly, concatenating incompatibly marked strings
must croak *even if neither has UTF8=1*.

That way we get string types, and the meaningless flag becomes distantly
secondary in prominence in code that works with typed strings, as well
it should always have been.
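
To make the intended semantics concrete, here is a rough pure-Perl
emulation using overload. It is only a sketch: the TypedString package
and its constructors are hypothetical names invented for this example,
and a real implementation would presumably live in the core (magic plus
internals patches, as proposed above) rather than in wrapper objects.

```perl
use strict;
use warnings;

package TypedString;   # hypothetical, for illustration only
use Carp qw(carp croak);
use Scalar::Util qw(blessed);
use overload
    '.'  => \&_concat,
    '""' => sub { $_[0]{value} };

sub binary    { my ($class, $v) = @_; bless { kind => 'binary',    value => $v }, $class }
sub character { my ($class, $v) = @_; bless { kind => 'character', value => $v }, $class }

sub _concat {
    my ($self, $other, $swapped) = @_;
    my $other_value;
    if (blessed($other) && $other->isa(__PACKAGE__)) {
        # Incompatible marks are fatal, whatever the UTF8 flag says.
        croak "cannot concatenate $self->{kind} and $other->{kind} strings"
            if $self->{kind} ne $other->{kind};
        $other_value = $other->{value};
    }
    else {
        # Marked with unmarked: only warn.
        carp "concatenating $self->{kind} string with an unmarked string";
        $other_value = $other;
    }
    my $value = $swapped ? $other_value . $self->{value}
                         : $self->{value} . $other_value;
    return bless { kind => $self->{kind}, value => $value }, ref $self;
}

package main;

my $blob = TypedString->binary(pack "N", 1);
my $text = TypedString->character("abc");   # neither string has UTF8=1

my $ok = eval { my $mixed = $blob . $text; 1 };
print "incompatible concat croaked: $@" unless $ok;
```

Both operands here are plain byte-range strings internally, so the croak
comes purely from the marks and not from the UTF8 flag.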

Someday in the future we could make decoding automatically mark strings
as characters, and encoding automatically mark them as binary – but not
now, as far too many things would break. But if the mechanism exists at
all then and friends can start allowing opting into it so that
things can be ported over gradually, and it would give people a tool to
clean up their data flows. No more wondering where those $#@!* mojibake
are happening: big fat crashes will tell you *exactly* where you goofed
(as long as you ask for it).

Aristotle Pagaltzis
