Re: should "use byte" be "use bytes"?
From:
Gisle Aas
Date:
February 12, 2000 14:17
Subject:
Re: should "use byte" be "use bytes"?
Message ID:
m3og9m3w2y.fsf@eik.g.aas.no
Larry Wall <larry@wall.org> writes:
> Gisle Aas writes:
> : Larry Wall <larry@wall.org> writes:
> :
> : > There is, however, a difference between code that is explicitly in the
> : > scope of a "use bytes" and code that is implicitly binary. And that
> : > difference lies in how utf8 data from other modules is treated. The
> : > default assumption is that any data that has been marked as utf8 should
> : > be treated as utf8, so an implicitly binary script will try to do
> : > the right thing, such as promoting binary/latin-1 strings to utf8
> : > if you concatenate them with utf8, for instance.
> : >
> : > But in the scope of "use bytes", no such promotion happens, because
> : > "use bytes" basically says, "I don't care if the string is marked as
> : > utf8 or not, just treat it as a bucket of bits." So if you concatenate
> : > a latin-1 string with a utf8 string, you'll get nonsense. But that's
> : > your problem, just as it is in old Perl. The "use bytes" declaration
> : > indicates you're willing to accept that responsibility.
> :
> : I think this is the wrong thing to do. My thinking is that 'use byte'
> : should never expose the internal UTF8 encoding. It should be possible
> : to switch to UCS4 encoding internally without any user visible change.
> :
> : I suggest that if you operate on a UTF8-flagged string in 'use byte'
> : scope, then the string should be downgraded to a binary/latin-1 string
> : if it doesn't contain any chars with ord() > 255, and we should croak
> : if chars that can't be represented in 8 bits exist. We should never
> : end up with UTF8-flagged strings that contain illegal UTF8-sequences.
> :
> : There should never be any user visible difference between two strings
> : containing the same logical chars, just because one has the UTF8-flag
> : set and the other has not. This makes perl free to internally convert
> : between UTF8 and binary representation for strings when it feels like
> : it. I think this freedom is needed.
> :
> : What I am saying is that functions like length() and substr() should not
> : change behaviour at all when in 'use byte' scope, but they might croak
> : if they try to operate on chars with ord() > 255. For string
> : concatenation we will also downgrade UTF8 stuff and croak if it is not
> : possible.
>
> While I think your viewpoint is valid, I also think it's incomplete. :-)
>
> A lot of current Perl code already happily operates on UTF-8 without
> knowing it, just as there's a lot of perl code that operates on Russian
> or Chinese without knowing it. Those are the semantics we're preserving
> with the current definition of "use bytes".
Wouldn't this code nearly always work just as well outside "use bytes"
too?
> But I can see the use of your semantics too, so maybe we need to
> separate them into
>
> use bytes "lax"; # no translation of utf8 forced
> use bytes "strict"; # translation forced
>
> Actually, I think it's rather something-centric to assume people would
> want their bytes forced to Latin-1.
I don't think of it this way. Strings are just sequences of
integers. Normally the range of these integers is 0 .. 2^32, and
'use bytes' is a way to say that you want the range to be 0 .. 255, as
with older perls. I want any string of integers within these bounds
to work the same under 'use bytes' regardless of its history. I don't
care about character sets.
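To make this concrete, here is a small sketch of what I mean (just an
illustration of what the proposal would guarantee; how any particular
build behaves today may differ):

    # Two strings with the same logical characters, built by different
    # routes: chr() gives a plain 8-bit string, pack "U" stores the same
    # value via the UTF-8 path and comes back UTF8-flagged.
    my $plain = chr(0xE9);          # e-acute as a single 8-bit char
    my $utf8  = pack("U", 0xE9);    # the same character, UTF8-flagged

    # Under my proposal these must behave identically, with or without
    # 'use bytes' in scope, because only the logical integers matter.
    print length($plain) == length($utf8) ? "same length\n" : "DIFFER\n";
    print $plain eq $utf8                 ? "equal\n"       : "DIFFER\n";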
I guess my main problem is that I don't really see what problem 'use
bytes' solves, and I am afraid it will be over-used.
The following behaviour is kind of related, I think:
$ perl -w -le 'print unpack("N", v0.0.0.127);';
127
$ perl -w -le 'print unpack("N", v0.0.0.128);';
194
$ perl -w -le 'print unpack("N", v0.0.0.300);';
196
I consider this to be a bug. Agree?
Should things be allowed to work differently if I wrapped the examples
with 'use bytes'?
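What I believe is going on (a sketch of the internals of the perl in
question; the exact flags could differ on other builds):

    # The v-string is built via the UTF-8 path, so chr(128) is stored as
    # the two bytes 0xC2 0x80 and chr(300) as 0xC4 0xAC.  unpack("N")
    # then reads the first four *internal* bytes, 0x00 0x00 0x00 0xC2
    # (= 194) and 0x00 0x00 0x00 0xC4 (= 196), instead of the logical
    # characters 128 and 300.
    use Devel::Peek;
    my $v = v0.0.0.128;
    Dump($v);                      # here: UTF8 flag set, PV "\0\0\0\302\200"
    print unpack("N", $v), "\n";   # prints 194 where I would expect 128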
> But I don't know if I like this at all. If someone is smart enough to
> put a "use bytes" into their program, they ought to be smart enough to
> deal with utf8 where it's necessary with an explicit conversion. This
> has the great advantage that the errors will happen in a spot where
> they can expect an exception. With a lazy downgrade you could get a
> fatal error almost anywhere in your code, on data that was good the
> first 999 times you tested the program.
Yes. That is a problem with my downgrading proposal. Other ways to
downgrade chars > 255 would be to just truncate them to 8 bits, expand
them to SGML entities, or simply skip them. If we make perl invoke a
callback routine in this case, then everybody might be happy; a rough
sketch of the idea follows below. The default could even be to insert
the UTF8-encoded stuff :-)
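Something along these lines (every name here is made up, nothing like
this exists yet):

    # Hypothetical downgrade helper: walk the string and hand every
    # character that does not fit in 8 bits to a user-supplied handler,
    # which can croak, truncate, emit an SGML entity, drop the char, or
    # even put the UTF-8 encoding back in.
    use Carp;

    sub downgrade_with_handler {
        my ($string, $handler) = @_;
        my $out = "";
        for my $ch (split //, $string) {
            $out .= ord($ch) > 255 ? $handler->($ch) : $ch;
        }
        return $out;
    }

    # A couple of possible handlers:
    my $strict   = sub { croak sprintf "wide character \\x{%x} in byte context", ord $_[0] };
    my $entities = sub { sprintf "&#%d;", ord $_[0] };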
> Going the other way, I don't really see a big problem. A "use bytes"
> chunk of code can't produce illegal utf8, in the sense that anything
> it produced wouldn't be marked as utf8, but would be considered funny
> Latin-1 or whatever. So at worst you'd get doubly-encoded utf8, which
> makes gobbledygook, but doesn't violate the constraint that strings
> marked as utf8 should be legal utf8.
At least with the current perl this happens:
$ perl -w -MDevel::Peek -le '$a = v1.255; { use byte; chop($a) }; Dump($a); chop($a)';
SV = PVNV(0x8187648) at 0x816a688
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
IV = 0
NV = 1.255
PV = 0x8164a98 "\1\303"\0
CUR = 2
LEN = 4
Malformed UTF-8 character at -e line 1.
The byte-level chop removed only the last byte of the two-byte UTF-8
encoding of chr(255), leaving the lone start byte "\303" behind while
the UTF8 flag stayed set, which is why the second chop() warns. This
one can be fixed with this patch:
Index: doop.c
===================================================================
RCS file: /local/perl/build/CVSROOT/perl5.6tobe/doop.c,v
retrieving revision 1.1.1.2
diff -u -p -u -p -r1.1.1.2 doop.c
--- doop.c 2000/02/09 22:25:09 1.1.1.2
+++ doop.c 2000/02/12 20:21:39
@@ -956,6 +956,7 @@ Perl_do_chop(pTHX_ register SV *astr, re
     sv_setpvn(astr, s, 1);
     *s = '\0';
     SvCUR_set(sv, len);
+    SvUTF8_off(sv);
     SvNIOK_off(sv);
 }
 else
How many other cases like this are there?
Regards,
Gisle