Re: use encoding 'utf8' bug for Latin-1 range

From: demerphq
Date: February 27, 2008 02:41
Subject: Re: use encoding 'utf8' bug for Latin-1 range
Message ID: 9b18b3110802270241t59566055l8b2fa9f2aee97775@mail.gmail.com

On 27/02/2008, Glenn Linderman <perl@nevcal.com> wrote:
> On approximately 2/27/2008 1:13 AM, came the following characters from
>  the keyboard of demerphq:
>
> > On 27/02/2008, Glenn Linderman <perl@nevcal.com> wrote:
>  >> On approximately 2/26/2008 6:15 PM, came the following characters from
>  >>
>  >> the keyboard of Juerd Waalboer:
>  >>
>  >>> Glenn Linderman skribis 2008-02-26 15:16 (-0800):
>
>
> >>  >> just default on EBCDIC platforms to "use encoding(EBCDIC);", decode
>  >>  >> the source (and data) from EBCDIC to UTF-8, and charge onward with
>  >>  >> UTF-8 internally.)
>  >>  >
>  >>  > I was told that it's not that simple, but I forgot why.
>  >
>  > Because EBCDIC doesn't use UTF-8. It uses UTF-EBCDIC, which isn't
>  > strictly part of Unicode, but that's IBM for you.
>
>
>
> Per http://en.wikipedia.org/wiki/UTF-EBCDIC it seems that UTF-EBCDIC is
>  mostly unused.  While it possibly could be used by some programs, and
>  possibly would need to be supported as an encoding by Dan the Encode
>  maintainer, it would seem that the general technique of mapping EBCDIC
>  to Unicode/UTF-8 could be applied on input/output/source code, and the
>  rest of the program could deal with Unicode/UTF-8, only.
>
>  Sounds like z/OS supports UTF-16 APIs, so support for that OS could be
>  done very similar to the way Windows support should be done, using
>  Windows *W (Wide) APIs (but I think the current Windows stuff uses the
>  *A (ANSI) APIs).
>
>
>
>  > The rest of your and Juerd's mails are just too long for me to review
>  > in detail, sorry.
>
>
>
> Shucks, I was hoping for your input.  I'll summarize my suggestions,
>  leaving the justifications to your imagination, or my previous emails.

This is much easier to reply to. Thanks :-)

>  * Deprecate "use encoding".

I'm all for this, but I'm not so sure that it will really help. As far
as I can tell it is mostly deprecated already, meaning that only those
with a really good reason to use it will use it. And for those users
deprecating it isn't going to help much. This leaves aside the whole
nasty debate over backwards compatibility :-)

>  * Deprecate non-ASCII characters in Perl 5.12 source code unless a
>  source encoding is specified.

When you say ASCII, do you mean 7-bit codepoints only? I can't see that
flying; Latin-1 is the expected encoding of files if not otherwise
indicated.

>  Make UTF-8, rather than ASCII, the
>  default source encoding for Perl 5.14.

Well, it actually happens that if you put BOM markers on your file,
like all well-behaved Windows apps do when producing Unicode, then
Perl will automatically assume the source code is Unicode (in fact
perl can handle UTF-16 source code files as well as UTF-8 ones). So all
*nix programmers need to do is start using BOM markers.
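
To make the BOM idea concrete, here is a minimal sketch of sniffing
the leading bytes of a file. The helper name sniff_bom is hypothetical,
and this is not a claim about how perl itself detects BOMs internally;
it only checks the three byte sequences mentioned above (UTF-8 and the
two UTF-16 byte orders).

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from the leading bytes of a
# byte string. Returns undef when no BOM is present.
# Note: UTF-32LE also starts with FF FE, so this sketch would report
# such a file as UTF-16LE; UTF-32 is deliberately left out.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}
```

A file without a BOM simply yields undef here, which is exactly the
case the rest of this mail worries about.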

But... and here's a bit of a rant. As far as I can tell, a combination
of stupid decisions has made Unicode much less useful on *nix than it
is on Windows. First, they never modified their APIs to address
Unicode; instead they latched on to the kludge that is utf8 and never
changed anything internal, leaving it all up to the user. Second,
because of the piping tradition in *nix and the number of apps that
would have to be changed to deal with them, *nix programs don't produce
BOM markers, so you can't identify a utf8 file without using
heuristics or the environment settings.

So environment settings determine how a file or file name is
interpreted. Which is frankly insane. Windows at least got this right,
although they basically doubled their API to do it.

So to bring this back to your point, how do we tell that a file is in
utf8? By the locale settings? By BOM markers? What about a utf8 file
created on *nix but loaded on Windows? It won't have BOM markers, so it
won't be identified as utf8 but rather as binary (unless we introduce
more heuristics), etc...
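
For illustration, one such heuristic is "does the byte string decode
cleanly as strict UTF-8?". The sketch below uses the Encode module for
that; the helper name looks_like_utf8 is hypothetical. It shows the
weakness too: pure ASCII passes the check, so the test cannot tell
ASCII, UTF-8 and "Latin-1 that happens to use only ASCII" apart.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# decode() from modifying the caller's buffer.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}
```

A Latin-1 file containing a lone 0xE9 byte fails the check, while the
same text saved as UTF-8 (0xC3 0xA9) passes.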

I can just see such a decision leading to a world of pain.

>
>  * Implement a pragma to apply Unicode semantics to all character
>  operations (uc, \U, regex character classes, //i, et alia) regardless of
>  the internal representation of the string (utf8).  [Could even deprecate
>  source that doesn't use the pragma in 5.12, and could then make this the
>  default in 5.14 also.  That'd be pretty aggressive though.]

This is tough. It could be done (with a lot of work), but I suspect the
implications are a lot deeper than you realize. Imagine
people's surprise when uc(chr(0xDF)) ends up being "SS".
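
A small sketch of the behaviour in question, assuming a perl with no
"use locale" and without the unicode_strings feature (which was added
later, in 5.12) in effect. The two strings compare equal, yet uc()
treats them differently depending on the internal representation; the
variable names are only for illustration.

```perl
use strict;
use warnings;

my $native   = chr(0xDF);     # U+00DF stored as a single byte
my $upgraded = chr(0xDF);
utf8::upgrade($upgraded);     # same character, UTF-8 internal form

# The strings are equal as far as eq is concerned...
my $same = $native eq $upgraded ? 'yes' : 'no';

# ...but uc() applies byte semantics to one and Unicode semantics
# to the other.
my $uc_nat = uc $native;
my $uc_upg = uc $upgraded;

printf "same string: %s\n", $same;
printf "uc(native) has length %d, uc(upgraded) has length %d\n",
    length $uc_nat, length $uc_upg;
```

Under the proposed pragma both calls would give the two-character
result, which is the surprise for code that assumes uc() never changes
the length of a string.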

>
>  * Implement a pragma to specify a source charset/encoding.  Maybe this
>  pragma should imply the one above!  It would translate all \x codes via
>  the source encoding, disallowing \x{} codes and \N codes, except inside
>  a new syntax qu (like qq) but is interpreted as UTF-8 -- all \x \x{} and
>  \N codes would be interpreted _after_ the string is converted from the
>  source encoding to utf8.  [This is probably the hardest part of the
>  proposal.]

I'd have to think about this more. It's been discussed in the past that
the various parts of encoding need to be split out into different
components so it's not all or nothing. But the deeper implications are
unclear to me.

>
>  * Under these pragmas, chr/ord would always deal in decoded numbers for
>  characters (utf8 characters).  Code written for "use encoding" that used
>   chr for source encoding constants (and even variables?) would have to
>  change... that is one of the things that is broken in "use encoding" ...
>  chr/ord are not inverses.

I guess this is OK. I'd have to think about it more.

yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
