
Re: unicode question

From: Tom Christiansen
Date: April 26, 2012 11:27
Subject: Re: unicode question
Message ID: 10036.1335464819@chthon

Eric Brine <ikegami@adaelis.com> wrote
   on Thu, 26 Apr 2012 13:13:23 EDT: 
   
> When Perl sees, say, C3 A9 coming in from STDIN, it has no way of
> knowing whether that means "é" (UTF-8), 43459 (little-endian 16-bit
> unsigned integer) or something else. As such, absent instruction such
> as C<< use open ':std', ':locale'; >>, it will return those bytes as is.

> This is not a bug. This cannot be changed. 

No argument at all.  We once tried going down that route.  It didn't 
work.  At all.  From the perlrun manpage:

    "-C" on its own (not followed by any number or option list), or the
    empty string "" for the "PERL_UNICODE" environment variable, has the
    same effect as "-CSDL".  In other words, the standard I/O handles and
    the default "open()" layer are UTF-8-fied but only if the locale
    environment variables indicate a UTF-8 locale.  This behaviour follows
    the implicit (and problematic) UTF-8 behaviour of Perl 5.8.0.

It was a Bad Thing.
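
For concreteness, here is a small sketch of the ambiguity Eric describes
above, written in Java since that is where the thread goes next.  The class
and method names are made up for illustration.  It only shows that the same
two bytes, C3 A9, give different values depending on what the reader
assumes about them.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the same two bytes read three different ways.
class ByteInterpretations {
    static final byte[] INPUT = { (byte) 0xC3, (byte) 0xA9 };

    // Decode as UTF-8: a single character, U+00E9.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode as ISO-8859-1: two characters, U+00C3 and U+00A9.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Read the first two bytes as an unsigned little-endian 16-bit integer.
    static int asUnsignedLittleEndian16(byte[] bytes) {
        short raw = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getShort();
        return raw & 0xFFFF; // mask to undo sign extension
    }

    public static void main(String[] args) {
        // Print code points rather than characters, so the output does not
        // itself depend on the platform's output encoding.
        System.out.printf("UTF-8:   U+%04X%n", asUtf8(INPUT).codePointAt(0));
        System.out.printf("Latin-1: U+%04X U+%04X%n",
                (int) asLatin1(INPUT).charAt(0), (int) asLatin1(INPUT).charAt(1));
        System.out.println("LE u16:  " + asUnsignedLittleEndian16(INPUT));
    }
}
```

Nothing in the bytes themselves says which reading is the right one, which
is why the caller has to say so.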

> Other languages do the same
> thing, because there is no choice. Take Java for example. Java came
> out after Unicode was out and embraced it. Yet, you must still specify
> the stream must be decoded.
> 
>    InputStream       stream      = System.in;
>    InputStreamReader byte_reader = new InputStreamReader(stream);
>    BufferedReader    char_reader = new BufferedReader(byte_reader);
>    char_reader.readLine();

This is one of the most common bugs in Java code: people don't set
the character encoding, so they get a "platform default" that I promise
you don't ever want: it's always 8 bit.  Furthermore, the
default Java encoder/decoder code suppresses errors.  You get bogus
input and bogus output *all the time*.  Here's my standard example
of getting all one's ducks in a row in Java:

     Process
     slave_process = Runtime.getRuntime().exec("perl -CS script args");

     OutputStream
     __bytes_into_his_stdin  = slave_process.getOutputStream();

     OutputStreamWriter
       chars_into_his_stdin  = new OutputStreamWriter(
                                 __bytes_into_his_stdin,
             /* DO NOT OMIT! */  Charset.forName("UTF-8").newEncoder()
                             );

     InputStream
     __bytes_from_his_stdout = slave_process.getInputStream();

     InputStreamReader
       chars_from_his_stdout = new InputStreamReader(
                                 __bytes_from_his_stdout,
             /* DO NOT OMIT! */  Charset.forName("UTF-8").newDecoder()
                             );

     InputStream
     __bytes_from_his_stderr = slave_process.getErrorStream();

     InputStreamReader
       chars_from_his_stderr = new InputStreamReader(
                                 __bytes_from_his_stderr,
             /* DO NOT OMIT! */  Charset.forName("UTF-8").newDecoder()
                             );
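
Why the explicit newEncoder()/newDecoder() calls matter can be shown with a
minimal sketch.  The class name, helper names, and sample bytes below are
hypothetical.  As far as I know, an InputStreamReader built from a Charset
replaces malformed input with U+FFFD, while one built from a fresh
CharsetDecoder keeps that decoder's default action of reporting the error,
so the read throws.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

// Hypothetical demo class contrasting lenient and strict UTF-8 decoding.
class StrictDecodeDemo {
    // "caf", then a lone 0xE9 (Latin-1 e-acute, malformed as UTF-8), then "!".
    static final byte[] BAD_UTF8 = { 'c', 'a', 'f', (byte) 0xE9, '!' };

    // Drain a Reader into a String, one char at a time.
    static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            out.append((char) c);
        }
        return out.toString();
    }

    // Reader built from a Charset: malformed bytes are silently replaced.
    static String lenient(byte[] bytes) {
        try {
            return readAll(new InputStreamReader(
                    new ByteArrayInputStream(bytes), Charset.forName("UTF-8")));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected on this path
        }
    }

    // Reader built from newDecoder(): malformed bytes raise an exception.
    static boolean strictRejects(byte[] bytes) {
        try {
            readAll(new InputStreamReader(
                    new ByteArrayInputStream(bytes),
                    Charset.forName("UTF-8").newDecoder()));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Print code points so the output does not depend on the console encoding.
        for (char ch : lenient(BAD_UTF8).toCharArray()) {
            System.out.printf("U+%04X ", (int) ch);
        }
        System.out.println();
        System.out.println("strict reader rejects input: " + strictRejects(BAD_UTF8));
    }
}
```

The lenient reader hands back a replacement character with no complaint,
which is exactly the bogus input described above.  The strict one at least
tells you something went wrong.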


Blech.

--tom
