
Re: should "use byte" be "use bytes"?

Larry Wall
February 10, 2000 12:45
Andy Dougherty writes:
: Do you mean to say that it's impossible (not unlikely, but impossible) for
: me to currently have a literal UTF-8 string constant in a program
: (possibly automatically generated by another program) designed to deal
: with arbitrary 8-bit binary data?  I guess I could answer that for myself
: if I knew precisely what was meant by a 'literal UTF-8 string constant'.

Depends on what you mean by "currently".

Certainly in current maintenance versions of Perl, you can embed binary
string constants in your script that might or might not resemble utf8.
Old Perl doesn't care.

With 5.6, I think the best parsing approach is this:

     1) Perl will assume your script is written in 7-bit ASCII until
	one of the following happens.

     2) You give it a command line switch or environment variable
	indicating the script is to be interpreted one way or another.

     3) Perl runs into a high bit in your script.  At that point it
	takes a look at what it has in its buffer.  If it looks like
	utf8, mark the script filehandle as utf8 and continue.  If not,
	mark the script filehandle as binary (equivalent to latin-1)
	and continue.

     4) Perl runs into a "use bytes;" declaration.  Mark the script
	filehandle as binary and continue.

     5) Perl runs into a charset declaration indicating the
	literal strings are to be interpreted in some other character
	set, such as JIS.  Mark the script as binary and continue.
	(But literals are marked to autotranslate to Unicode if
	conversion to utf8 is necessary.)
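
Step 3 above is essentially a content sniff. As a rough sketch only, written in Python for brevity and not taken from Perl's tokenizer, the check might look like the following. The function name `classify_script` and its return labels are hypothetical, and a real implementation would also have to cope with a buffer that ends in the middle of a multibyte character.

```python
def classify_script(buf: bytes) -> str:
    """Guess how a script buffer should be read (hypothetical helper)."""
    # Step 1: nothing but 7-bit ASCII seen so far, so no decision is needed.
    if all(b < 0x80 for b in buf):
        return "ascii"
    # Step 3: a high bit was seen. If the whole buffer is well-formed
    # utf8, treat the script as utf8; strict decoding rejects anything else.
    try:
        buf.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        # Otherwise fall back to binary, i.e. one byte per character (latin-1).
        return "binary"


ascii_kind = classify_script(b'print "hi";')
utf8_kind = classify_script('print "\u00e9";'.encode("utf-8"))
latin_kind = classify_script('print "\u00e9";'.encode("latin-1"))
```

Steps 2, 4 and 5 are explicit declarations, so in this picture they would simply override whatever the guess returns.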

Any string coming from a "binary" filehandle will always be represented
internally in 8-bit mode rather than in utf8, so it will not accidentally
turn into utf8.  (If there are no other filehandles open in utf8 mode,
the semantics should be exactly like Perl's current semantics.)  If you
have a script that has been declared to be in binary mode, and you embed
utf8 in it, you would have to explicitly convert your strings if you
want them treated as utf8.

There is, however, a difference between code that is explicitly in the
scope of a "use bytes" and code that is implicitly binary.  And that
difference lies in how utf8 data from other modules is treated.  The
default assumption is that any data that has been marked as utf8 should
be treated as utf8, so an implicitly binary script will try to do
the right thing, such as promoting a binary/latin-1 string to utf8
if you concatenate it with a utf8 string.
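
To make the promotion concrete, here is a small model in Python. It is an illustration under stated assumptions, not Perl's internal representation: `TaggedStr`, `as_utf8` and `concat` are hypothetical names, and a string is modelled as raw bytes plus a "marked as utf8" flag.

```python
from typing import NamedTuple


class TaggedStr(NamedTuple):
    """Hypothetical model of a string: raw bytes plus a utf8 flag."""
    data: bytes
    is_utf8: bool


def as_utf8(s: TaggedStr) -> bytes:
    """Return the utf8 encoding of s, upgrading latin-1 bytes if needed."""
    if s.is_utf8:
        return s.data
    # Each latin-1 byte maps to the code point with the same number.
    return s.data.decode("latin-1").encode("utf-8")


def concat(a: TaggedStr, b: TaggedStr) -> TaggedStr:
    """Concatenate, promoting to utf8 when either operand is marked utf8."""
    if not a.is_utf8 and not b.is_utf8:
        return TaggedStr(a.data + b.data, False)  # both 8-bit: stays 8-bit
    return TaggedStr(as_utf8(a) + as_utf8(b), True)


latin = TaggedStr("caf\u00e9".encode("latin-1"), False)  # 4 bytes, unmarked
wide = TaggedStr("\u263a".encode("utf-8"), True)         # 3 bytes, marked utf8
joined = concat(latin, wide)
```

Here `joined` is marked utf8 and decodes cleanly, while concatenating two unmarked strings leaves the bytes untouched, matching the claim that purely binary code keeps its old semantics.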

But in the scope of "use bytes", no such promotion happens, because
"use bytes" basically says, "I don't care if the string is marked as
utf8 or not, just treat it as a bucket of bits."  So if you concatenate
a latin-1 string with a utf8 string, you'll get nonsense.  But that's
your problem, just as it is in old Perl.  The "use bytes" declaration
indicates you're willing to accept that responsibility.
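
The "nonsense" is easy to reproduce with plain byte strings. The sketch below is in Python and only mimics the described behaviour; `bytes_concat` is a hypothetical stand-in for concatenation inside the scope of "use bytes", and `is_valid_utf8` is a helper added for the demonstration.

```python
def is_valid_utf8(buf: bytes) -> bool:
    """Report whether buf is well-formed utf8."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def bytes_concat(a: bytes, b: bytes) -> bytes:
    """Hypothetical 'bucket of bits' concatenation: no promotion at all."""
    return a + b


latin = "caf\u00e9".encode("latin-1")  # b"caf\xe9"
wide = "\u263a".encode("utf-8")        # b"\xe2\x98\xba"
mixed = bytes_concat(latin, wide)      # latin-1 byte followed by utf8 bytes
```

The result is neither valid utf8 nor the intended text when read as latin-1, so whoever asked for raw bytes is responsible for sorting it out.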

If you only want to mark the script filehandle as binary/latin-1, and
don't want the other effects of "use bytes", and you don't want to let
it default to binary under step 3, then it's probably better to specify a
charset of latin-1 (or iso-8859-1, or 8-bit, or whatever) instead of
relying on "use bytes", which disables Perl's utf8 smarts.

Requiring these declarations for certain idiosyncratic scripts is not
the path of least pain over the short haul, but over the long haul I
think it's the best way to get to a less painful world from where we are.
I don't think every script should have to declare "use utf8", even if
this approach breaks backward compatibility in certain cases.  There is
no completely transparent way to optimize for the common case here, but
we'll do our best.

I think we should seriously consider calling this Perl 6.  (And Topaz
would then of course be a candidate for Perl 7, a nice number.)

Larry