
Re: use encoding 'utf8' bug for Latin-1 range

From: Glenn Linderman
Date: February 26, 2008 15:16
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 47C49DBF.9060703@NevCal.com

On approximately 2/26/2008 12:33 PM, came the following characters from 
the keyboard of Juerd Waalboer:
> demerphq skribis 2008-02-26 19:11 (+0100):
>> On 21/02/2008, Juerd Waalboer <juerd@convolution.nl> wrote:
>>>  If this backwards incompatibility is ruled unimportant, the general
>>>  assumption would be: ${^ENCODING} acts on literal source code only, and
>>>  the fix would be to make numeric character values always unicode
>>>  codepoints. Is this correct?
>> I'm wondering if there isn't another option, actually. We could make the
>> rules for handling \x{} escapes under encoding be context sensitive.
>> If such an escape is in code such that it would form an illegal utf8
>> sequence then it is treated as a codepoint and not an octet. If it
>> would form a valid utf8 sequence then it is treated as an octet.


Sounds confusing to document, and hard to get people to use correctly. 
Sounds like it would break as much buggy code as it "fixes".  If a 
sequence of characters forms a valid utf8 sequence, then it certainly 
wouldn't fit in 8 bits (the definition of octet).
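
To make the octet-versus-sequence point concrete, here is a minimal sketch in plain Perl (no encoding pragma in effect). It reads a run of \x escapes as raw octets and asks the core utf8::decode function whether they happen to be well-formed UTF-8. The helper name decode_if_valid_utf8 is hypothetical, chosen for illustration; it is not part of any proposal in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded character string if $octets
# is well-formed UTF-8, or undef if it is not.
sub decode_if_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;                      # utf8::decode works in place
    return utf8::decode($copy) ? $copy : undef;
}

my $pair = decode_if_valid_utf8("\xC3\xA9"); # two octets: UTF-8 for U+00E9
my $lone = decode_if_valid_utf8("\xE9");     # a single octet >= 0x80

printf "pair: %d character, U+%04X\n", length($pair), ord($pair);
print  "lone: ", (defined $lone ? "valid UTF-8" : "not valid UTF-8"), "\n";
```

Under a context-sensitive rule, these two literals would be interpreted under different rules even though both are written with the same \x notation, which is the part that looks hard to document.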

The following two paragraphs may get flak from people who have 
historically been forced to use non-ASCII non-Unicode source and data 
files, but in this post-Unicode era, they make perfect sense to simplify 
the combinatorial explosion of multiple character encodings.

Perhaps all uses in source code of characters outside of the ASCII range 
should produce warnings in 5.12, unless there is a pragma to specify 
the locale/encoding.  That would be a good step toward finding the programs 
that do that.  Their solution is either to add one pragma to specify the 
locale/encoding, if it is a supported locale/encoding, or to make a more 
extensive change that converts all characters to a supported 
locale/encoding (with UTF-8 recommended).  Note that it would be possible 
to create a 5.12 very quickly if it were simply 5.10 + this deprecation 
warning.  However, the only present solution would be to force everyone 
to convert their source code to be utf8 encoded, as "use encoding" seems 
to have problems.  Maybe that isn't onerous, as UTF-8 has been around 
quite a while by now.  But maybe a replacement for "use encoding" should 
be implemented simultaneously.

(N.B. At the risk of invalidating the current EBCDIC port, I observe 
that EBCDIC is just a source and data encoding.  Implementing a special 
version of Perl on EBCDIC seems like a waste of programmer 
productivity... just default on EBCDIC platforms to "use 
encoding(EBCDIC);", decode the source (and data) from EBCDIC to UTF-8, 
and charge onward with UTF-8 internally.)


With the above in mind, it sounds simple to specify that \x defines a 
character in the character set of the source code... this is probably 
what most programmers meant when they coded \x... they were using their 
native encoding by necessity.  \N{U+} is available to specify Unicode 
codepoints for people that cannot/will not use ASCII or UTF-8 source 
encoding.

Modern usage of \x to specify Unicode characters is probably erroneous, 
as \N{U+} should have been used.  But that is easily fixed: add "use utf8;" 
to the source and such \x references will suddenly become Unicode again!
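
For comparison, here is a small sketch of how the two notations behave in stock Perl today, assuming an ASCII platform with neither "use encoding" nor "use utf8" in effect. It shows current behaviour only, not the source-encoding-relative \x proposed above; the variable names are illustrative.

```perl
use strict;
use warnings;

# Assumptions: ASCII platform, no "use encoding" and no "use utf8" in effect.
my $hex_small   = "\x{E9}";       # numeric escape, value below 256
my $named_small = "\N{U+E9}";     # explicit Unicode code point
my $hex_wide    = "\x{263A}";     # numeric escape, value above 255
my $named_wide  = "\N{U+263A}";

# String comparison works on characters, not on the internal byte layout.
my $small_match = ($hex_small eq $named_small) ? 1 : 0;
my $wide_match  = ($hex_wide  eq $named_wide)  ? 1 : 0;

printf "small: ord=0x%X match=%d\n", ord($hex_small), $small_match;
printf "wide:  ord=0x%X match=%d\n", ord($hex_wide),  $wide_match;
```

With no encoding pragma the two spellings name the same character; they would only diverge once \x is tied to a non-Unicode source encoding.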

The above deprecation cycle would then mean that in 5.14, UTF-8, a 
superset of ASCII, can be made the default encoding for source code, as 
5.12 would have encouraged non-ASCII, non-UTF8 users to use the new 
source encoding pragma (the replacement for use encoding).



> It is not, in my opinion, a good solution for the "we should support
> scripts written in any encoding" problem. That problem, if it exists,
> should be addressed with a new mechanism instead of by adding even more
> complexity to an existing kludge.
> 
>> Looking into this I noticed that under normal circumstances \N{U+C2}
>> does not return a utf8 string, which I find quite odd.
> 
> Perl text strings are Unicode strings, that may be latin1 or utf8
> encoded internally. Semantics before and after utf8::upgrade must not be
> different. They are, and that should be considered a bug.


Perl text strings are one-byte or multi-byte encoded number sequences. 
When all the characters are numerically less than 256, they may be 
one-byte encoded sequences; when any of the characters are numerically 
greater than 255, they must be multi-byte encoded sequences.
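
The two storage forms can be observed from Perl code. The following is a minimal sketch using the core utf8::is_utf8 and utf8::upgrade functions plus bytes::length; the variable names are illustrative.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling byte semantics

my $narrow = "\xE9";        # one character, value below 256
my $wide   = "\x{263A}";    # one character, value above 255

# Keep an upgraded copy: same character, multi-byte internal storage.
my $upgraded = $narrow;
utf8::upgrade($upgraded);

for my $item ([narrow => $narrow], [upgraded => $upgraded], [wide => $wide]) {
    my ($label, $str) = @$item;
    printf "%-8s chars=%d internal_bytes=%d multi_byte=%s\n",
        $label, length($str), bytes::length($str),
        utf8::is_utf8($str) ? "yes" : "no";
}
```

All three strings have a character length of 1; only the internal byte count and the internal flag differ, and $narrow still compares equal to $upgraded.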

The semantics are either ASCII (for one-byte encoded sequences) or UTF-8 
(for multi-byte encoded sequences).  Expecting latin1 semantics for 
one-byte encoded sequences is a bug.
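
The difference in semantics is easy to reproduce. This is a minimal sketch, assuming an ASCII platform and a script that enables neither "use locale" nor any feature bundle (so no "use v5.12" or later); the variable names are illustrative.

```perl
use strict;
use warnings;
# Deliberately no "use locale", no feature bundle, no "use utf8".

my $native   = "\xE9";      # stored one byte per character
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, multi-byte storage

my $same_value    = ($native eq $upgraded) ? 1 : 0;
my $uc_native     = uc $native;      # ASCII rules: left unchanged
my $uc_upgraded   = uc $upgraded;    # Unicode rules: becomes U+00C9
my $word_native   = ($native   =~ /\w/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /\w/) ? 1 : 0;

printf "equal=%d uc: 0x%X vs 0x%X  \\w: %d vs %d\n",
    $same_value, ord($uc_native), ord($uc_upgraded),
    $word_native, $word_upgraded;
```

The two strings compare equal, yet uc and \w treat them differently. Whether that counts as a bug or as documented ASCII semantics is exactly the disagreement in this thread.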


> Having Perl use latin1 when possible is a very much desired performance
> optimization.
> 
> lc, uc, lcfirst, ucfirst, //i, and character classes should be fixed to
> be independent of the internal encoding.


This would be a useful extension: could "use utf8 semantics;" be 
implemented to affect this stuff?  It would be a superset of 
"use utf8;", and would therefore encourage the use of utf8 source code 
encoding, since there would be no corresponding "use encoding 
'younameit' semantics;".  I don't believe there is anything (but 
programmer effort) that would stop it from being implemented in the 
5.12 timeframe.  The above deprecation cycle could run concurrently, 
since this feature requires a source change to enable.
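
For readers on later Perls: "use utf8 semantics;" is a hypothetical spelling, but the lexically scoped "unicode_strings" feature (available from perl 5.12) covers similar ground for case mapping. The sketch below assumes such a Perl and an ASCII platform. Unlike the superset suggested above, the feature does not imply "use utf8;".

```perl
use strict;
use warnings;

my $native = "\xE9";            # stored one byte per character

my $uc_default = uc $native;    # outside the feature's scope: unchanged

my $uc_unicode;
{
    use feature 'unicode_strings';   # lexically scoped to this block
    $uc_unicode = uc $native;        # Unicode rules despite one-byte storage
}

printf "default: 0x%X  unicode_strings: 0x%X\n",
    ord($uc_default), ord($uc_unicode);
```

Because the feature is lexical, it needs a source change to enable, in line with the opt-in approach described above.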

However, if those other locales really wanted to be implemented, they 
could be, in terms of "use utf8 semantics", by reading the source as 
"younameit" encoding but immediately converting it to "utf8" internally.

Of course the source code for \x{} consists of "ASCII" characters, so 
it doesn't get upgraded; it can still be a binary number in the encoding 
of the source file, just as before.  That encoding just has to be 
remembered during the interpretation of \x escapes.


>> I would expect any string with an \N{} escape in it to be utf8. I
>> should probably file a bug about it.
> 
> UTF8 is not Unicode. ord("\N{U+C2}") == 0xC2, exactly the unicode
> codepoint that was requested.


I see no reason to call this a bug either.  The code point is less than 
256.  On the other hand, without "use utf8 semantics;" ASCII semantics 
will be applied to it, as a one-byte encoded sequence, unless it is 
upgraded.  However, in light of the \N{name} sequence auto-upgrading, 
perhaps it is a bug after all.

Neither \N{U+} nor \N{name} would have to be auto-upgraded in the 
presence of "use utf8 semantics;".
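
The distinction between a code point and its UTF-8 encoding can be sketched with the standard Encode module. This is an illustration only, assuming an ASCII platform; the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char   = "\N{U+C2}";              # one character, code point 0xC2
my $octets = encode('UTF-8', $char);  # explicit encoding step: two octets

printf "character: length=%d ord=0x%X\n", length($char), ord($char);
printf "octets:    length=%d bytes=%s\n", length($octets),
    join(' ', map { sprintf '0x%02X', ord } split //, $octets);
```

The character string holds exactly the code point that was requested. The UTF-8 byte sequence only appears after an explicit encode, whatever the internal storage happens to be.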


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
