Re: should "use byte" be "use bytes"?
From: Larry Wall
February 11, 2000 19:05
Message ID: 200002120305.TAA20678@kiev.wall.org
Gisle Aas writes:
: Larry Wall <email@example.com> writes:
: > There is, however, a difference between code that is explicitly in the
: > scope of a "use bytes" and code that is implicitly binary. And that
: > difference lies in how utf8 data from other modules is treated. The
: > default assumption is that any data that has been marked as utf8 should
: > be treated as utf8, so an implicitly binary script will try to do
: > the right thing, such as promoting binary/latin-1 strings to utf8
: > if you concatenate it with utf8, for instance.
: > But in the scope of "use bytes", no such promotion happens, because
: > "use bytes" basically says, "I don't care if the string is marked as
: > utf8 or not, just treat it as a bucket of bits." So if you concatenate
: > a latin-1 string with a utf8 string, you'll get nonsense. But that's
: > your problem, just as it is in old Perl. The "use bytes" declaration
: > indicates you're willing to accept that responsibility.
: I think this is the wrong thing to do. My thinking is that 'use byte'
: should never expose the internal UTF8 encoding. It should be possible
: to switch to UCS4 encoding internally without any user visible change.
: I suggest that if you operate on a UTF8-flagged string in 'use byte'
: scope, then the string should be downgraded to a binary/latin-1 string
: if it doesn't contain any chars with ord() > 255, and we should croak
: if chars that can't be represented in 8 bits exist. We should never
: end up with UTF8-flagged strings that contain illegal UTF8-sequences.
: There should never be any user visible difference between two strings
: containing the same logical chars, just because one has the UTF8-flag
: set and the other has not. This makes perl free to internally convert
: between UTF8 and binary representation for strings when it feels like
: it. I think this freedom is needed.
: What I am saying is that functions like length(), substr() should not
: change behaviour at all when in 'use byte' scope, but they might croak
: if they try to operate on chars with ord() > 255. For string
: concatenation we will also downgrade UTF8 stuff and croak if it is not
While I think your viewpoint is valid, I also think it's incomplete. :-)
A lot of current Perl code already happily operates on UTF-8 without
knowing it, just as there's a lot of Perl code that operates on Russian
or Chinese without knowing it. Those are the semantics we're preserving
with the current definition of "use bytes".
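
For illustration only, here is a minimal sketch of the difference between implicit promotion and the byte view, as it behaves in later Perls. It assumes a Perl where string literals with \x{...} escapes above 255 are stored as utf8; the variable names are hypothetical and nothing here is taken from the thread itself.

```perl
use strict;
use warnings;

my $latin1 = "\xE9";        # one Latin-1 character, stored as a single byte
my $wide   = "\x{263A}";    # above 255, so it has to be stored as utf8

# Outside "use bytes" the Latin-1 string is promoted during concatenation.
my $joined = $latin1 . $wide;
my $chars  = length($joined);       # counts characters

my $bytes;
{
    use bytes;                      # lexically scoped: bucket-of-bits view
    $bytes = length($joined);       # counts the bytes of the utf8 storage
}

print "chars=$chars bytes=$bytes\n";    # prints "chars=2 bytes=5"
```

The promoted Latin-1 character takes two bytes and the wide character takes three, which is where the five comes from.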
But I can see the use of your semantics too, so maybe we need to
separate them into
    use bytes "lax";     # no translation of utf8 forced
    use bytes "strict";  # translation forced
Actually, I think it's rather something-centric to assume people would
want their bytes forced to Latin-1. I'm wondering if what's really
going on here is a multiway split:
    use bytes;               # no translation of utf8 forced
    use bytes "ISO-8859-1";  # force translation to Latin-1
    use bytes "KOI-8R";      # force translation to Russian
    use bytes "ASCII";       # force translation to American
    use bytes "Big5";        # force translation to Big5
    use bytes "JIS";         # force translation to JIS
But I don't know if I like this at all. If someone is smart enough to
put a "use bytes" into their program, they ought to be smart enough to
deal with utf8 where it's necessary with an explicit conversion. This
has the great advantage that the errors will happen in a spot where
they can expect an exception. With a lazy downgrade you could get a
fatal error almost anywhere in your code, on data that was good the
first 999 times you tested the program.
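
A rough sketch of what an explicit conversion at a known spot might look like. It uses utf8::upgrade and utf8::downgrade, which exist in later Perls but are an assumption relative to this thread; the helper to_bytes_or_die is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: downgrade a copy of the string at one known spot,
# and raise the exception right there if that is impossible.
sub to_bytes_or_die {
    my ($str) = @_;
    my $copy = $str;
    # A true second argument means "return false instead of dying".
    utf8::downgrade($copy, 1)
        or die "string has characters above 255\n";
    return $copy;
}

my $narrow = "caf\xE9";
utf8::upgrade($narrow);     # same characters, now in utf8 storage

my $ok     = eval { to_bytes_or_die($narrow) };
my $err_ok = $@;
my $bad    = eval { to_bytes_or_die("caf\x{263A}") };
my $err    = $@;

print defined $ok  ? "narrow: converted\n" : "narrow: $err_ok";
print defined $bad ? "wide: converted\n"   : "wide: $err";
```

The caller decides where the failure can surface, so the exception arrives at the conversion and not at some later length() or substr().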
Going the other way, I don't really see a big problem. A "use bytes"
chunk of code can't produce illegal utf8, in the sense that anything
it produced wouldn't be marked as utf8, but would be considered funny
Latin-1 or whatever. So at worst you'd get doubly-encoded utf8, which
makes gobbledygook, but doesn't violate the constraint that strings
marked as utf8 should be legal utf8.
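
The doubly-encoded case can be sketched as follows. This is an illustration using utf8::encode and utf8::decode from later Perls, which is an assumption relative to this thread, and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $char = "\xE9";          # one character, e-acute

my $once = $char;
utf8::encode($once);        # now the two bytes C3 A9

my $twice = $once;
utf8::encode($twice);       # each byte read as Latin-1 and encoded again

my $hex = join ' ', map { sprintf '%02X', ord } split //, $twice;
print length($once), " ", length($twice), " $hex\n";   # prints "2 4 C3 83 C2 A9"

# Decoding once succeeds, because the gobbledygook is still legal UTF-8,
# but it yields the two characters C3 A9 and not the original e-acute.
my $round = $twice;
my $valid = utf8::decode($round);
print "valid=", ($valid ? 1 : 0), " chars=", length($round), "\n";
```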
On the other hand, with your approach, at least Latin-1 would get
re-encoded in utf8 the way it came out, even if everything else blew
you out of the water.
And even with lazy downgrade, you can force the downgrade to happen
at a particular spot if you want, so it's not all that unpredictable.
So it seems as though we're back to letting the user choose the lesser
evil. I suppose a choice of poisons is better than no choice at all.