Andy Dougherty writes:

: Do you mean to say that it's impossible (not unlikely, but impossible) for
: me to currently have a literal UTF-8 string constant in a program
: (possibly automatically generated by another program) designed to deal
: with arbitrary 8-bit binary data?  I guess I could answer that for myself
: if I knew precisely what was meant by a 'literal UTF-8 string constant'.

Depends on what you mean by "currently". Certainly in current maintenance versions of Perl, you can embed binary string constants in your script that might or might not resemble utf8. Old Perl doesn't care.

With 5.6, I think the best parsing approach is this:

1) Perl will assume your script is written in 7-bit ASCII until one of the following happens.

2) You give it a command line switch or environment variable indicating that the script is to be interpreted one way or another.

3) Perl runs into a high bit in your script. At that point it takes a look at what it has in its buffer. If it looks like utf8, mark the script filehandle as utf8 and continue. If not, mark the script filehandle as binary (equivalent to latin-1) and continue.

4) Perl runs into a "use bytes;" declaration. Mark the script filehandle as binary and continue.

5) Perl runs into a charset declaration indicating that the literal strings are to be interpreted in some other character set, such as JIS. Mark the script as binary and continue. (But literals are marked to autotranslate to Unicode if conversion to utf8 is necessary.)

Any string coming from a "binary" filehandle will always be represented internally in 8-bit mode rather than in utf8, so it will not accidentally turn into utf8. (If there are no other filehandles open in utf8 mode, the semantics should be exactly like Perl's current semantics.) If you have a script that has been declared to be in binary mode, and you embed utf8 in it, you would have to explicitly convert your strings if you want them treated as utf8.
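The "looks like utf8" test in step 3 can be sketched roughly as follows. This is only an illustration, not Perl's actual tokenizer code: the name looks_like_utf8 is made up, and the sketch skips refinements a real check would want, such as rejecting overlong encodings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch of a "does this buffer look like utf8?" heuristic:
# every high-bit byte must begin or continue a well-formed
# multi-byte sequence.
sub looks_like_utf8 {
    my ($buf) = @_;
    my @bytes = unpack 'C*', $buf;
    my $i = 0;
    while ($i < @bytes) {
        my $b = $bytes[$i];
        if ($b < 0x80) { $i++; next }    # plain ASCII byte
        my $len = $b >= 0xF0 ? 4         # the lead byte encodes the
                : $b >= 0xE0 ? 3         # length of the sequence
                : $b >= 0xC0 ? 2
                :              0;        # stray continuation byte
        return 0 if $len == 0 || $i + $len > @bytes;
        for my $j (1 .. $len - 1) {      # continuation bytes: 10xxxxxx
            return 0 unless ($bytes[$i + $j] & 0xC0) == 0x80;
        }
        $i += $len;
    }
    return 1;
}

print looks_like_utf8("caf\xC3\xA9") ? "utf8\n" : "binary\n";   # utf8
print looks_like_utf8("caf\xE9")     ? "utf8\n" : "binary\n";   # binary
```

The second string is e-acute as a single latin-1 byte; since 0xE9 announces a three-byte sequence that never materializes, the buffer would fall through to the binary/latin-1 case.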
There is, however, a difference between code that is explicitly in the scope of a "use bytes" and code that is implicitly binary. And that difference lies in how utf8 data from other modules is treated. The default assumption is that any data that has been marked as utf8 should be treated as utf8, so an implicitly binary script will try to do the right thing, such as promoting binary/latin-1 strings to utf8 when you concatenate them with utf8 strings, for instance. But in the scope of "use bytes", no such promotion happens, because "use bytes" basically says, "I don't care if the string is marked as utf8 or not, just treat it as a bucket of bits." So if you concatenate a latin-1 string with a utf8 string, you'll get nonsense. But that's your problem, just as it is in old Perl. The "use bytes" declaration indicates that you're willing to accept that responsibility.

If you only want to mark the script filehandle as binary/latin-1, and don't want the other effects of "use bytes", and you don't want to let it default to binary under 3), then it's probably better to specify a charset of latin-1 (or iso-8859-1, or 8-bit, or whatever) instead of relying on "use bytes", which disables Perl's utf8 smarts.

Requiring these declarations for certain idiosyncratic scripts is not the path of least pain over the short haul, but over the long haul I think it's the best way to get to a less painful world from where we are. I don't think every script should have to declare "use utf8", even if this approach breaks backward compatibility in certain cases. There is no completely transparent way to optimize for the common case here, but we'll do our best.

I think we should seriously consider calling this Perl 6. (And Topaz would then of course be a candidate for Perl 7, a nice number.)

Larry
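[A small demonstration of the promotion-versus-bucket-of-bits distinction discussed above, as it behaves in Perl with utf8 support. The string values are arbitrary examples; the key point is that "use bytes" exposes the raw internal byte representation instead of the character view.]

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";      # 4 bytes; e-acute as a single latin-1 byte
my $smiley = "\x{263A}";     # 1 character, stored internally as utf8

# Implicit promotion: concatenation upgrades the latin-1 string to
# utf8, so the result is a well-formed 5-character string.
my $s = $latin1 . $smiley;
print length($s), "\n";      # 5 characters

{
    use bytes;               # bucket-of-bits view of the same scalar
    print length($s), "\n";  # 8 bytes: "caf" + 2-byte e-acute + 3-byte smiley
}
```

Inside the "use bytes" block, length() counts the bytes of the internal utf8 representation, which is exactly the "treat it as a bucket of bits" behavior: meaningful only if you know what encoding you are holding.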