Re: warding against bytes.pm
From: Ben Morrow
Date: February 25, 2010 11:49
Subject: Re: warding against bytes.pm
Message ID: 20100225194907.GA34873@osiris.mauzo.dyndns.org
Quoth ikegami@adaelis.com (Eric Brine):
> On Wed, Feb 24, 2010 at 11:00 PM, Ben Morrow <ben@morrow.me.uk> wrote:
> > Quoth pagaltzis@gmx.de (Aristotle Pagaltzis):
> > > * Jesse Vincent <jesse@fsck.com> [2010-02-25 01:50]:
> > >
> > > > Or do we actually have a better way to do _everything_ bytes
> > > > does?
> > >
> > > Well, no. We do have correct ways for every misguided use of
> > > bytes.pm and I believe we have better ways for all its correct
> > > uses, but I don’t think we have anything to offer to people with
> > > insane uses. :-)
> >
> > OK, so how do I ensure that certain strings will never be upgraded
>
> bytes doesn't affect "certain strings", and it doesn't prevent upgrading.
I know. I'm not arguing that bytes.pm isn't broken, I'm arguing that it
needs to be fixed rather than thrown away.

FWIW, if you are writing a short script to process binary data (a not
uncommon occurrence), bytes *will* prevent anything from getting
upgraded automatically. It doesn't protect you against modules upgrading
things behind your back, but that's one of the things that's broken.
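To make the "upgraded behind your back" case concrete, here is a minimal sketch. It assumes some other code has already called utf8::upgrade on the string, and the variable names are only illustrative. It shows that the same string reports different lengths under character and byte semantics:

```perl
use strict;
use warnings;

my $s = "caf\x{E9}";     # 4 characters, all below 256
utf8::upgrade($s);       # same characters, now stored internally as UTF-8

# Character semantics: the string still has 4 characters.
my $chars = length $s;

# Byte semantics: bytes looks at the internal storage, where
# U+00E9 occupies two bytes (C3 A9), so the count is 5.
my $bytes = do { use bytes; length $s };

print "chars=$chars bytes=$bytes\n";
```

The string never changed from the script's point of view, yet code under bytes sees a different length.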
> It
> just downgrades or encodes everything automatically, and you can do that
> (with more predictable results) using utf8::downgrade or utf8::encode.
>
> use strict;
> use warnings;
> use Test::More tests => 2;
> use bytes;
> my $x = "abcd\x{E9}fghij";
> utf8::upgrade($x);
> is($x, "abcd\x{E9}fghij", "prevents upgrade 1"); # pass
> $x .= chr(0x2660);
> chop($x) for 1..3;
> is($x, "abcd\x{E9}fghij", "prevents upgrade 2"); # fail
>
> You lose the automatic aspect by using utf8::encode or utf8::decode, though.
> I don't know of anything that does it automatically and predictably.
Yes. That's much too fussy to be useful.
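For comparison, a small sketch of the two manual calls mentioned in the quoted text. utf8::downgrade changes only the internal storage and fails if a character is above 255. utf8::encode replaces the characters with their UTF-8 bytes. The variable names are illustrative only:

```perl
use strict;
use warnings;

# All characters fit in 8 bits, but the string has been upgraded.
my $fits = "abcd\x{E9}";
utf8::upgrade($fits);

# A true second argument makes downgrade return false instead of dying.
my $ok_fits = utf8::downgrade($fits, 1);
my $flag    = utf8::is_utf8($fits) ? 1 : 0;

# A character above 255 cannot be stored as a single byte.
my $wide    = "abc" . chr(0x2660);
my $ok_wide = utf8::downgrade($wide, 1);

# utf8::encode rewrites the string in place as its UTF-8 bytes.
my $encoded = $wide;
utf8::encode($encoded);
my $encoded_len = length $encoded;   # 3 ASCII bytes + 3 bytes for U+2660

print "fits: ", ($ok_fits ? "ok" : "failed"), " flag=$flag\n";
print "wide: ", ($ok_wide ? "ok" : "failed"), " encoded length=$encoded_len\n";
```

Both have to be called on each string by hand, which is the fussiness in question.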
I'm increasingly thinking that what's needed here is a 'blob' magic type
that stops given strings from ever being upgraded. I'm not certain
that's right though, since I think it falls foul of 'polymorphic data
types but not polymorphic operators', which I believe is the reason
bytes was implemented as a lexically-scoped pragma in the first place.
> How do I make /\d\s\w/ match ASCII-only?
>
>
> use bytes only does that if the input string doesn't contain greater than
> 8-bit values. utf8::downgrade does the same.
...but IIUC Karl and Yves are (for good reason) trying to change things
so that isn't the case, and ISO8859-1 letters match as well as ASCII. Of
course, if we get an /a regex flag, and some sort of re::default_flags
pragma, then bytes could use that.
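An /a modifier and a lexical default-flags pragma, spelled use re '/a', were added later, in Perl 5.14. The sketch below assumes 5.14 or newer and shows only how that flag behaves. It does not show anything bytes.pm itself does:

```perl
use 5.014;       # /a and "use re '/flags'" need 5.14 or later
use warnings;

my $arabic_one = chr(0x0661);   # ARABIC-INDIC DIGIT ONE, a Unicode digit
my $e_acute    = "\x{E9}";
utf8::upgrade($e_acute);        # make sure \w uses Unicode rules

my @results = (
    ($arabic_one =~ /\d/  ? 1 : 0),   # matches: Unicode digit
    ($arabic_one =~ /\d/a ? 1 : 0),   # no match: /a limits \d to 0-9
    ($e_acute    =~ /\w/  ? 1 : 0),   # matches: Latin-1 letter
    ($e_acute    =~ /\w/a ? 1 : 0),   # no match: /a limits \w to ASCII
);

{
    # Lexically scoped default: every regex in this block gets /a.
    use re '/a';
    push @results, ($arabic_one =~ /\d/ ? 1 : 0);   # no match
}

print "@results\n";
```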
> > How, in general terms, do I say 'This is 8-bit data so
> > keep your grubby Unicode hands off it'?
> >
>
> use bytes is lexically scoped, so use bytes will be completely ineffectual
> at keeping grubby hands away. It just pretends everything is 8-bit data,
> corrupting it if it isn't.
So it needs fixing, perhaps by throwing an exception if any bytes-ified
operator gets passed a string with codepoints >255. At the very least it
should be able to transparently convert SvUTF8 strings that would in
fact fit into bytes.
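A rough sketch of that behaviour in plain Perl. bytes_or_die is a hypothetical helper, not part of bytes.pm. It converts strings that fit and throws for anything with code points above 255:

```perl
use strict;
use warnings;

# Hypothetical helper: return a byte-string copy of $str, or die if any
# character is above 255. The caller's string is left untouched.
sub bytes_or_die {
    my ($str) = @_;
    utf8::downgrade($str, 1)
        or die "string contains code points above 255\n";
    return $str;
}

# An upgraded string whose characters all fit in bytes is converted.
my $upgraded = "abcd\x{E9}";
utf8::upgrade($upgraded);
my $safe     = bytes_or_die($upgraded);
my $safe_len = do { use bytes; length $safe };   # one byte per character: 5

# A string with a wide character is rejected instead of being corrupted.
my $err = eval { bytes_or_die("abc" . chr(0x2660)); 1 } ? "" : $@;

print "safe bytes: $safe_len\n";
print "error: $err";
```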
Ben