
Re: should "use byte" be "use bytes"?

Gisle Aas
February 12, 2000 14:17
Larry Wall <> writes:

> Gisle Aas writes:
> : Larry Wall <> writes:
> : 
> : > There is, however, a difference between code that is explicitly in the
> : > scope of a "use bytes" and code that is implicitly binary.  And that
> : > difference lies in how utf8 data from other modules is treated.  The
> : > default assumption is that any data that has been marked as utf8 should
> : > be treated as utf8, so an implicitly binary script will try to do
> : > the right thing, such as promoting binary/latin-1 strings to utf8
> : > if you concatenate it with utf8, for instance.
> : > 
> : > But in the scope of "use bytes", no such promotion happens, because
> : > "use bytes" basically says, "I don't care if the string is marked as
> : > utf8 or not, just treat it as a bucket of bits."  So if you concatenate
> : > a latin-1 string with a utf8 string, you'll get nonsense.  But that's
> : > your problem, just as it is in old Perl.  The "use bytes" declaration
> : > indicates you're willing to accept that responsibility.
> : 
> : I think this is the wrong thing to do.  My thinking is that 'use byte'
> : should never expose the internal UTF8 encoding.  It should be possible
> : to switch to UCS4 encoding internally without any user visible change.
> : 
> : I suggest that if you operate on a UTF8-flagged string in 'use byte'
> : scope, then the string should be downgraded to a binary/latin-1 string
> : if it doesn't contain any chars with ord() > 255, and we should croak
> : if chars that can't be represented in 8 bits exist.  We should never
> : end up with UTF8-flagged strings that contain illegal UTF8-sequences.
> : 
> : There should never be any user visible difference between two strings
> : containing the same logical chars, just because one has the UTF8-flag
> : set and the other has not.  This makes perl free to internally convert
> : between UTF8 and binary representation for strings when it feels like
> : it.  I think this freedom is needed.
> : 
> : What I am saying is that functions like length(), substr() should not
> : change behaviour at all when in 'use byte' scope, but they might croak
> : if they try to operate on chars with ord() > 255.  For string
> : concatenation we will also downgrade UTF8 stuff and croak if it is not
> : possible.
> 
> While I think your viewpoint is valid, I also think it's incomplete.  :-)
> A lot of current Perl code already happily operates on UTF-8 without
> knowing it, just as there's a lot of perl code that operates on Russian
> or Chinese without knowing it.  That is the semantics we're preserving
> with the current definition of "use bytes".

Wouldn't this code nearly always work just as well outside "use bytes"?

> But I can see the use of your semantics too, so maybe we need to
> separate them into
>     use bytes "lax";		# no translation of utf8 forced
>     use bytes "strict";		# translation forced
> Actually, I think it's rather something-centric to assume people would
> want their bytes forced to Latin-1.

I don't think of it this way.  Strings are just sequences of
integers.  Normally the range of these integers is 0 .. 2^32, and
'use bytes' is a way to tell that you want the range to be 0 .. 255 as
with older perls.  I want any string of integers within these bounds
to work the same under 'use bytes' regardless of its history.  I don't
care about character sets.
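
To make the "regardless of its history" point concrete, here is a
minimal sketch.  It assumes a released Perl (5.8 or later) where the
pragma is spelled "use bytes" and the utf8::upgrade() function is
available; neither is taken from the development snapshot discussed
in this thread, so treat it as an illustration only.  The variable
names are made up for the example.

```perl
use strict;
use warnings;

# Two strings holding the same single character, ord 233.
my $plain    = "\xE9";      # stored as one byte, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);   # same character, stored as two bytes, UTF8 flag on

# Outside "use bytes" the two are indistinguishable at the Perl level.
my $same_string = ($plain eq $upgraded) ? 1 : 0;
my @char_len    = (length($plain), length($upgraded));

# Inside "use bytes" length() counts bytes of the internal storage,
# so the history of the string becomes visible.
my @byte_len;
{
    use bytes;
    @byte_len = (length($plain), length($upgraded));
}

print "eq: $same_string\n";
print "length, characters: @char_len\n";
print "length, use bytes:  @byte_len\n";
```

The two strings compare equal and have the same character length,
yet length() disagrees between them inside the "use bytes" block.
That is exactly the kind of user-visible difference I would like to
avoid.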

I guess my main problem is that I don't really see what problem 'use
bytes' solves, and I am afraid it will be over-used.

The following behaviour is kind of related I think:

  $ perl -w -le 'print unpack("N", v0.0.0.127);';
  $ perl -w -le 'print unpack("N", v0.0.0.128);';
  $ perl -w -le 'print unpack("N", v0.0.0.300);';

I consider this to be a bug.  Agree?

Should things be allowed to work differently if I wrapped the examples
with 'use bytes'?
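
The property I am after can be written down as a small check.  This
is a sketch that assumes Perl 5.10 or later, where unpack() works on
characters rather than on the internal encoding by default; it says
nothing about what the development perl above prints.  It uses
utf8::upgrade() to build a second copy of the same four characters
with the UTF8 flag set.

```perl
use strict;
use warnings;

my $plain    = pack("N", 128);   # the four characters 0, 0, 0, 128
my $upgraded = $plain;
utf8::upgrade($upgraded);        # same four characters, UTF8 flag on

# The same logical string should unpack to the same number,
# however it happens to be stored internally.
my $n_plain    = unpack("N", $plain);
my $n_upgraded = unpack("N", $upgraded);
my $n_vstring  = unpack("N", v0.0.0.128);

print "$n_plain $n_upgraded $n_vstring\n";
```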

> But I don't know if I like this at all.  If someone is smart enough to
> put a "use bytes" into their program, they ought to be smart enough to
> deal with utf8 where it's necessary with an explicit conversion.  This
> has the great advantage that the errors will happen in a spot where
> they can expect an exception.  With a lazy downgrade you could get a
> fatal error almost anywhere in your code, on data that was good the
> first 999 times you tested the program.

Yes.  That is a problem with my downgrading proposal.  Other ways to
downgrade chars > 255 would be to just truncate them to 8bit, expand
them to SGML entities or simply skip them.  If we make perl invoke a
callback routine in this case, then everybody might be happy.  The
default could even be to insert UTF8 encoded stuff :-)
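
A rough sketch of those alternatives, written against interfaces
that exist in later released Perls rather than anything available in
the perl discussed here: utf8::downgrade() for the croaking variant,
and the Encode module's FB_HTMLCREF fallback and code-reference
CHECK argument for the entity and callback variants.  The variable
names and the "<U+XXXX>" marker format are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "a" . chr(300) . "b";   # chr(300) does not fit in 8 bits

# 1. Croak: utf8::downgrade() dies when a character is above 255.
my $copy    = $text;
my $croaked = eval { utf8::downgrade($copy); 1 } ? 0 : 1;

# 2. Truncate every character to its low 8 bits.
my $truncated = join "", map { chr(ord($_) & 0xFF) } split //, $text;

# 3. Expand to SGML-style numeric character references.
my $src_entities = $text;          # encode() may modify its source argument
my $entities = encode("iso-8859-1", $src_entities, Encode::FB_HTMLCREF());

# 4. Simply skip anything above 255.
(my $skipped = $text) =~ s/[^\x00-\xFF]//g;

# 5. Callback: Encode passes the ordinal of each unrepresentable
#    character to the code reference and inserts what it returns.
my $src_marked = $text;
my $marked = encode("iso-8859-1", $src_marked,
                    sub { sprintf "<U+%04X>", shift });

print "croaked:   $croaked\n";
print "truncated: $truncated\n";
print "entities:  $entities\n";
print "skipped:   $skipped\n";
print "callback:  $marked\n";
```

With a callback as the general mechanism, the other policies become
one-line callbacks, which is why I think it might keep everybody
happy.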

> Going the other way, I don't really see a big problem.  A "use bytes"
> chunk of code can't produce illegal utf8, in the sense that anything
> it produced wouldn't be marked as utf8, but would be considered funny
> Latin-1 or whatever.  So at worst you'd get doubly-encoded utf8, which
> makes gobbledygook, but doesn't violate the constraint that strings
> marked as utf8 should be legal utf8.

At least with current perl this happens.

$ perl -w -MDevel::Peek -le '$a = v1.255; { use byte; chop($a) }; Dump($a); chop($a)';
SV = PVNV(0x8187648) at 0x816a688
  REFCNT = 1
  IV = 0
  NV = 1.255
  PV = 0x8164a98 "\1\303"\0
  CUR = 2
  LEN = 4
Malformed UTF-8 character at -e line 1.

This one can be fixed with this patch.

Index: doop.c
RCS file: /local/perl/build/CVSROOT/perl5.6tobe/doop.c,v
retrieving revision
diff -u -p -u -p -r1.1.1.2 doop.c
--- doop.c	2000/02/09 22:25:09
+++ doop.c	2000/02/12 20:21:39
@@ -956,6 +956,7 @@ Perl_do_chop(pTHX_ register SV *astr, re
 	sv_setpvn(astr, s, 1);
 	*s = '\0';
 	SvCUR_set(sv, len);
+	SvUTF8_off(sv);

How many other cases like this are there?

Gisle