perl.perl5.porters

Re: use encoding 'utf8' bug for Latin-1 range

February 27, 2008 01:13
On 27/02/2008, Glenn Linderman <> wrote:
> On approximately 2/26/2008 6:15 PM, came the following characters from
> the keyboard of Juerd Waalboer:
> > Glenn Linderman skribis 2008-02-26 15:16 (-0800):
>  >> Perhaps all uses in source code of characters outside of the ASCII range
>  >> should produce warnings in 5.12, unless there is a pragma to specify
>  >> what locale/encoding.
>  >
>  > Sounds useful, but I personally don't think that just assuming "use
>  > utf8;" by default would be a problem if that would interpret invalid
>  > UTF-8 as latin1. Really, actual latin1 data that happens to also be
>  > valid UTF-8 is immensely rare in my experience. (Counter examples,
>  > anyone?) To further reduce the risk, the fallback could be done per line
>  > or per file, instead of per invalid sequence itself.
>  >
>  > (e.g. utf8::decode($_) for @source_lines;)
>  >
>  > In any case, I think that in 5.12, non-ASCII byte data should either
>  > warn (as you suggest) or be interpreted as utf8 with latin1 fallback
>  > (dmq's suggestion, but applied elsewhere), maybe also with a warning.
> We're in pretty close agreement on this point.  The unfortunate part is
>  that people with different locales may use character values of 128..255
>  without telling Perl.  When I speak of character values 128..255 I refer
>  not only to \x sequences but also the literal characters in the source file.
>  >> But maybe a replacement for "use encoding" should be implemented
>  >> simultaneously.
>  >
>  > I do not object to this, but I do question whether it's worth the tuits.
>  > Only the actual implementers can judge that.
> We're in total agreement on this.  I think the only practical way
>  forward is Unicode; UTF-8 being one encoding of Unicode.  A bit more
>  support for other Unicode encodings would be nice, but hard to put into
>  one bit, I guess.  So the program(mer) has to keep track of that part.
>  >> Implementing a special version of Perl on EBCDIC seems like a waste of
>  >> programmer productivity...
>  >
>  > Agreed, but again: those who implement things get to decide. It does,
>  > however, sometimes keep me from contributing! I'm glad that perlunitut
>  > and perlunifaq were accepted even though they pay no attention to EBCDIC
>  > at all. (It did delay my work, before I decided to simply ignore the
>  > entire EBCDIC world. I have not received even a single complaint about
>  > that.)
>  >
>  >> just default on EBCDIC platforms to "use encoding(EBCDIC);", decode
>  >> the source (and data) from EBCDIC to UTF-8, and charge onward with
>  >> UTF-8 internally.)
>  >
>  > I was told that it's not that simple, but I forgot why.

Because EBCDIC doesn't use UTF-8. It uses UTF-EBCDIC, which isn't
strictly part of Unicode, but that's IBM for you.

The rest of your and Juerd's mails are just too long for me to review
in detail, sorry.
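
For reference, the per-line fallback Juerd describes in the quoted text
(utf8::decode on each source line, with anything that is not valid UTF-8
left to be read as Latin-1) could be sketched roughly as below. This is
only an illustration of the idea, not how perl reads source files; the
helper name decode_lines_with_latin1_fallback is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl): try to decode each line as
# UTF-8; a line that is not well-formed UTF-8 keeps its bytes, which
# Perl then treats as Latin-1 code points.
sub decode_lines_with_latin1_fallback {
    my @lines = @_;
    for my $line (@lines) {
        # utf8::decode works in place. It returns false and leaves the
        # string unchanged when the bytes are not well-formed UTF-8.
        utf8::decode($line);
    }
    return @lines;
}

my @decoded = decode_lines_with_latin1_fallback(
    "caf\xC3\xA9\n",    # UTF-8 bytes for "cafe" with e-acute
    "caf\xE9\n",        # Latin-1 bytes for the same word, invalid as UTF-8
);

# Both lines should now hold the same five characters.
printf "%d chars, 4th is U+%04X\n", length($_), ord(substr($_, 3, 1))
    for @decoded;
```

The fallback here is per line, as suggested, so one stray Latin-1 byte
only affects the line it sits on and not every other line in the file.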

perl -Mre=debug -e "/just|another|perl|hacker/"
