develooper Front page | perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

Glenn Linderman
February 27, 2008 03:59
On approximately 2/27/2008 2:41 AM, came the following characters from 
the keyboard of demerphq:
> On 27/02/2008, Glenn Linderman <> wrote:
>> On approximately 2/27/2008 1:13 AM, came the following characters from
>>  the keyboard of demerphq:
>>> On 27/02/2008, Glenn Linderman <> wrote:

> This is much easier to reply to. Thanks :-)
>>  * Deprecate "use encoding".
> I'm all for this. But I'm not so sure that it will really help. As far
> as I can tell it is mostly deprecated already, meaning that only those
> with a really good reason to use it will use it. And for those users
> deprecating it isn't going to help much. This leaves aside the whole
> nasty debate of backwards compatibility :-)

It does seem there are fewer references to "use encoding" in 5.10 docs 
than in 5.8 docs.

>>  * Deprecate non-ASCII characters in Perl 5.12 source code unless a
>>  source encoding is specified.
> When you say ASCII you mean 7-bit codepoints only? I can't see that
> flying; latin-1 is the expected encoding of files if not otherwise
> indicated.

I did, in fact, mean ASCII as in 7-bit.  It is the only thing that is 
nearly universal (it would be fully universal, except for EBCDIC).

The docs I could find only say "binary source" (implicitly ASCII+, since 
you need ASCII characters to parse Perl, but I suppose EBCDIC would 
qualify too).

So how hard is it to specify latin-1 encoding?  Well, there is no 
feature to do that today, except "use encoding".  But if there were, 
then latin-1 could be supported trivially (read the source, 
utf8::upgrade it).

So making someone who uses latin-1 say so doesn't sound too onerous. 
And if they use something else, it is time to find out, eh?
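The utf8::upgrade route really is that small.  A minimal sketch (the 
sample string is mine, standing in for bytes read from a latin-1 source 
file):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A string holding Latin-1 bytes: "caf\xE9" is "cafe" with
# e-acute stored as the single byte 0xE9.
my $src = "caf\xE9";

# utf8::upgrade reinterprets the bytes as Latin-1 code points and
# switches the internal representation to UTF-8; the characters
# (and their ord values) are unchanged.
utf8::upgrade($src);

printf "length: %d, last char: U+%04X\n", length($src), ord(substr($src, -1));
# length: 4, last char: U+00E9
```

Because Latin-1 code points map one-to-one onto the first 256 Unicode 
code points, no table lookup is needed; that is why this encoding in 
particular would be trivial to support.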

>> Make UTF-8, rather than ASCII, the
>>  default source encoding for Perl 5.14.
> Well, it actually happens that if you put BOM markers on your file,
> like all well-behaved windows apps do when producing unicode, then
> Perl will automatically assume the source code is unicode (in fact
> perl can handle UTF-16 source code files as well as UTF8 ones). So all
> *nix programmers need to do is start using BOM markers.
> But... and here's a bit of a rant. As far as I can tell a combination
> of stupid decisions has made unicode much less useful on *nix than it
> is on windows. First they never modified their apis to address
> unicode, instead they latched on to the kludge that is utf8 and never
> changed anything internal, leaving it all up to the user. Second
> because of the piping tradition in *nix and the number of apps that
> would have to be changed to deal with them, *nix programs don't produce
> BOM markers, so you can't identify a utf8 file without using
> heuristics, or using the environment settings.
> So environment settings determine how a file or file name is
> interpreted. Which is frankly insane. Windows at least got this right,
> although they basically doubled their API to do it.

BOM is interesting, but it is a heuristic.  A pretty safe one, except 
for binary files, and Perl source files are not binary.  But BOM is not 
universal, per your rant.  Further, it isn't even universal on Windows: 
the programs I'm aware of that don't create it were not written by 
Microsoft, but are reasonably well-respected programs, such as emacs 
and OpenOffice (yep, Unix ports).  Microsoft lost the standards 
committee war for requiring Unicode files to contain a BOM, but they 
didn't invent a different solution, either.  A file system attribute 
could have been one.

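For what it's worth, the BOM check itself is cheap; here is a sketch 
(the sub name detect_bom is mine, and undef is the ambiguous no-BOM 
case the rant is about):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Peek at the first bytes of a file and report a BOM if one is present.
# Returns undef when there is no BOM, which is the common case for
# files produced on *nix.
sub detect_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    read $fh, my $head, 3;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return undef;
}

# This script itself has no BOM, so detection falls through to undef.
print detect_bom(__FILE__) // 'no BOM', "\n";   # no BOM
```

The undef branch is exactly where the locale/heuristic guessing has to 
take over, which is the problem being complained about.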
> So to bring this back to your point, how do we tell that a file is in
> utf8? By the locale settings? By bom markers? What about a utf8 file
> created on *nix but loaded on Windows? It won't have BOM markers, so it
> won't be identified as utf8 but rather as binary (unless we introduce
> more heuristics) etc...
> I can just see such a decision leading to a world of pain.

How do you tell?  By default.  But if it starts out with "use 
something-else;" or a BOM marker, then it might not be, and that would 
be OK.

>>  * Implement a pragma to apply Unicode semantics to all character
>>  operations (uc, \U, regex character classes, //i, et alia) regardless of
>>  the internal representation of the string (utf8).  [Could even deprecate
>>  source that doesn't use the pragma in 5.12, and could then make this the
>>  default in 5.14 also.  That'd be pretty aggressive though.]
> This is tough. It could be done (with a lot of work). But the
> implications I suspect are a lot deeper than you realize. Imagine
> people's surprise when uc(chr(0xDF)) ends up being "SS".

Today they are already surprised when it doesn't.  I'm sorry to hear it 
is so much work: since the Unicode semantics are already implemented 
for utf8 strings, I rather thought a lot of the implementation could be 
borrowed, or even shared code, just check the pragma setting and call 
one or the other.
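The surprise cuts both ways today: on a stock perl (no locale in 
effect), the same code point cases differently depending only on the 
string's internal representation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $byte = "\xDF";        # LATIN SMALL LETTER SHARP S, byte string
my $char = "\xDF";
utf8::upgrade($char);     # same code point, UTF-8 internal representation

# Byte-string semantics: no uppercase mapping, the character is unchanged.
# Unicode semantics: sharp s uppercases to the two-character string "SS".
print uc($byte) eq "\xDF" ? "byte: unchanged\n" : "byte: changed\n";
print "char: ", uc($char), "\n";   # char: SS
```

That representation-dependence is precisely what a semantics pragma 
would paper over, by always taking the Unicode branch.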

>>  * Implement a pragma to specify a source charset/encoding.  Maybe this
>>  pragma should imply the one above!  It would translate all \x codes via
>>  the source encoding, disallowing \x{} codes and \N codes, except inside
>>  a new syntax qu (like qq) but is interpreted as UTF-8 -- all \x \x{} and
>>  \N codes would be interpreted _after_ the string is converted from the
>>  source encoding to utf8.  [This is probably the hardest part of the
>>  proposal.]
> I'd have to think about this more. It's been discussed in the past that
> the various parts of encoding need to be split out into different
> components so it's not all or nothing. But the deeper implications are
> unclear to me.

Some further points: seems like a Latin-1 pragma would be 
straightforward, given the current behavior of utf8::upgrade :)  The 
ability to deal with ASCII, Latin-1, and Unicode would cover a lot of 
the territory, and all of those could permit full \xXX, \x{}, and \N 
usage.  ASCII would be supported by default, of course, as it is a 
proper subset of both Unicode and UTF-8.

Perhaps the other encodings should simply outlaw \x{} and \N... if the 
systems they run on don't support Unicode, then the characters won't be 
useful to them anyway, and if they do, they should use Unicode.
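For reference, under the Unicode-capable encodings \x{}, \N, and ord 
already agree on code points, which is the behavior the proposed pragma 
would keep:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ':full';   # enables \N{...} lookup by character name

# Two spellings of the same character, U+00E9:
my $hex  = "\x{E9}";
my $name = "\N{LATIN SMALL LETTER E WITH ACUTE}";

print $hex eq $name ? "same character\n" : "different\n";   # same character
printf "ord: 0x%X\n", ord($hex);                            # ord: 0xE9
```

It is only the legacy non-Unicode source encodings where \x{} and \N 
would have no sensible meaning, hence the suggestion to outlaw them 
there.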

>>  * Under these pragmas, chr/ord would always deal in decoded numbers for
>>  characters (utf8 characters).  Code written for "use encoding" that used
>>   chr for source encoding constants (and even variables?) would have to
>>  change... that is one of the things that is broken in "use encoding" ...
>>  chr/ord are not inverses.
> I guess this is ok. I'd have to think about it more.
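The chr/ord asymmetry can be seen without the pragma itself, by doing 
by hand what "use encoding" did under the hood.  A sketch using Encode 
directly, with cp1252 as an illustrative source encoding:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Under "use encoding 'cp1252'", chr(0x80) produced the character that
# byte 0x80 means in the source encoding (EURO SIGN, U+20AC), while
# ord() always reported the Unicode code point.
my $ch = decode('cp1252', chr(0x80));   # what chr(0x80) effectively returned

printf "chr(0x80) is U+%04X\n", ord($ch);   # chr(0x80) is U+20AC
# So ord(chr(0x80)) was 0x20AC, not 0x80: chr and ord were not inverses.
```

Making chr/ord always speak Unicode code points, as proposed, restores 
the inverse relationship at the cost of breaking code that relied on 
the old source-encoding behavior.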

Glenn --
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
