develooper Front page | perl.perl5.porters | Postings from February 2008

Re: use encoding 'utf8' bug for Latin-1 range

Glenn Linderman
February 27, 2008 01:50
On approximately 2/27/2008 1:13 AM, came the following characters from 
the keyboard of demerphq:
> On 27/02/2008, Glenn Linderman <> wrote:
>> On approximately 2/26/2008 6:15 PM, came the following characters from
>> the keyboard of Juerd Waalboer:
>>> Glenn Linderman skribis 2008-02-26 15:16 (-0800):

>>  >> just default on EBCDIC platforms to "use encoding(EBCDIC);", decode
>>  >> the source (and data) from EBCDIC to UTF-8, and charge onward with
>>  >> UTF-8 internally.)
>>  >
>>  > I was told that it's not that simple, but I forgot why.
> Because EBCDIC doesn't use UTF-8. It uses UTF-EBCDIC, which isn't
> strictly part of Unicode, but that's IBM for you.

Per it seems that UTF-EBCDIC is 
mostly unused.  While it possibly could be used by some programs, and 
possibly would need to be supported as an encoding by Dan the Encode 
maintainer, it would seem that the general technique of mapping EBCDIC 
to Unicode/UTF-8 could be applied on input/output/source code, and the 
rest of the program could deal with Unicode/UTF-8, only.
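
As a rough sketch of that mapping, the Encode module can already decode EBCDIC bytes at the input boundary and encode them again on output. The choice of cp1047 as the platform code page and the sample bytes are assumptions for illustration only; this runs on an ASCII platform and says nothing about how perl itself would have to be built on EBCDIC.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the platform code page is cp1047. Other EBCDIC code pages
# known to Encode (cp37, posix-bc, ...) would be handled the same way.
my $ebcdic_in = "\xC8\x85\x93\x93\x96";    # the bytes for "Hello" in cp1047

# Input boundary: EBCDIC bytes become a Perl character string.
my $text = decode('cp1047', $ebcdic_in);

# The rest of the program deals only with Unicode characters.
my $shout = uc $text;

# Output boundary: the character string becomes EBCDIC bytes again.
my $ebcdic_out = encode('cp1047', $shout);

print "$text -> $shout\n";
```

The same decode/encode pair could be pushed down into an I/O layer, so that the body of the program never sees EBCDIC at all.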

Sounds like z/OS supports UTF-16 APIs, so support for that OS could be 
done in much the same way as Windows support should be done, using the 
Windows *W (Wide) APIs (but I think the current Windows stuff uses the 
*A (ANSI) APIs).

> The rest of your and Juerd's mails are just too long for me to review
> in detail, sorry.

Shucks, I was hoping for your input.  I'll summarize my suggestions, 
leaving the justifications to your imagination, or my previous emails.

* Deprecate "use encoding".

* Deprecate non-ASCII characters in Perl 5.12 source code unless a 
source encoding is specified.  Make UTF-8, rather than ASCII, the 
default source encoding for Perl 5.14.

* Implement a pragma to apply Unicode semantics to all character 
operations (uc, \U, regex character classes, //i, et alia) regardless of 
the internal representation of the string (utf8).  [Could even deprecate 
source that doesn't use the pragma in 5.12, and could then make this the 
default in 5.14 also.  That'd be pretty aggressive though.]

* Implement a pragma to specify a source charset/encoding.  Maybe this 
pragma should imply the one above!  It would translate all \x codes via
the source encoding and disallow \x{} codes and \N codes, except inside 
a new quoting syntax, qu (like qq, but interpreted as UTF-8): there, all 
\x, \x{} and \N codes would be interpreted _after_ the string is 
converted from the source encoding to utf8.  [This is probably the hardest part of the

* Under these pragmas, chr/ord would always deal in decoded numbers for 
characters (utf8 characters).  Code written for "use encoding" that used 
chr for source encoding constants (and even variables?) would have to 
change.  That is one of the things that is broken in "use encoding": 
chr/ord are not inverses.
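
Two of the points above can be illustrated with a small sketch. The first part shows why a "Unicode semantics" pragma is wanted: with no such pragma in effect, two strings that compare equal can typically give different results from uc, depending only on their internal representation. The second part shows the chr/ord behaviour being asked for, with decoding done explicitly at the boundary. The choice of ISO-8859-15 as a stand-in source encoding and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1. Same character, two internal representations.
my $native   = "\xE9";       # LATIN SMALL LETTER E WITH ACUTE, stored as one byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

print "strings equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";
printf "uc(native)   = U+%04X\n", ord uc $native;
printf "uc(upgraded) = U+%04X\n", ord uc $upgraded;

# 2. chr/ord on decoded characters are inverses of each other.
my $source_byte = "\xA4";                                # EURO SIGN in ISO-8859-15
my $char        = decode('iso-8859-15', $source_byte);   # decode at the boundary
my $code_point  = ord $char;                             # the code point, not the byte value
my $round_trip  = chr $code_point;

printf "code point = U+%04X\n", $code_point;
print "chr(ord(x)) eq x: ", ($round_trip eq $char ? "yes" : "no"), "\n";
```

Encoding $round_trip back to ISO-8859-15 gives the original byte again, so nothing is lost by working in code points internally.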

Glenn
--
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
