Re: unicode question
From: Tom Christiansen
Date: April 26, 2012 11:27
Subject: Re: unicode question
Message ID: 10036.1335464819@chthon

Eric Brine <ikegami@adaelis.com> wrote
on Thu, 26 Apr 2012 13:13:23 EDT:
> When Perl sees, say, C3 A9 coming in from STDIN, it has no way of
> knowing whether that means "é" (UTF-8), 43459 (little-endian 16-bit
> unsigned integer) or something else. As such, absent instruction such
> as C<< use open ':std', ':locale'; >>, it will return those bytes as is.
> This is not a bug. This cannot be changed.
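
To make the ambiguity concrete, here is a minimal Java sketch that reads the same two bytes, C3 A9, three different ways. The class and method names (`ByteInterpretations`, `asUtf8`, `asLittleEndianUint16`, `asLatin1`) are hypothetical and only for illustration; `StandardCharsets` assumes Java 7 or later. Nothing in the bytes themselves says which reading is intended.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of any library.
final class ByteInterpretations {

    // Interpret the bytes as UTF-8 encoded text.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Interpret the first two bytes as a little-endian unsigned 16-bit integer.
    static int asLittleEndianUint16(byte[] bytes) {
        short raw = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getShort();
        return Short.toUnsignedInt(raw);
    }

    // Interpret the bytes as ISO-8859-1 text (one character per byte).
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] input = { (byte) 0xC3, (byte) 0xA9 };

        // Print code points in hex so the output does not itself depend
        // on the encoding of the terminal.
        String utf8 = asUtf8(input);
        System.out.printf("UTF-8:   %d char(s), first is U+%04X%n",
                utf8.length(), utf8.codePointAt(0));

        String latin1 = asLatin1(input);
        System.out.printf("Latin-1: %d char(s), first is U+%04X%n",
                latin1.length(), latin1.codePointAt(0));

        System.out.printf("uint16 (little-endian): %d%n",
                asLittleEndianUint16(input));
    }
}
```

The UTF-8 reading yields the single character U+00E9, the Latin-1 reading yields two characters (U+00C3 and U+00A9), and the integer reading yields 43459. All three are valid interpretations of the same input.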
No argument at all. We once tried going down that route. It didn't
work. At all. From the perlrun manpage:
    "-C" on its own (not followed by any number or option list), or the
    empty string "" for the "PERL_UNICODE" environment variable, has the
    same effect as "-CSDL". In other words, the standard I/O handles and
    the default "open()" layer are UTF-8-fied but only if the locale
    environment variables indicate a UTF-8 locale. This behaviour follows
    the implicit (and problematic) UTF-8 behaviour of Perl 5.8.0.

It was a Bad Thing.
> Other languages do the same
> thing, because there is no choice. Take Java for example. Java came
> out after Unicode was out and embraced it. Yet, you must still specify
> the stream must be decoded.
>
> InputStream stream = System.in;
> InputStreamReader byte_reader = new InputStreamReader(stream);
> BufferedReader char_reader = new BufferedReader(byte_reader);
> char_reader.readLine();
This is one of the most common bugs in Java code: people don't set
the character encoding, so they get a "platform default" that I promise
you don't ever want: it's always 8 bit. Furthermore, the
default Java encoder/decoder code suppresses errors. You get bogus
input and bogus output *all the time*. Here's my standard example
of getting all one's ducks in order in Java:

    Process
        slave_process = Runtime.getRuntime().exec("perl -CS script args");

    OutputStream
        __bytes_into_his_stdin = slave_process.getOutputStream();
    OutputStreamWriter
        chars_into_his_stdin = new OutputStreamWriter(
            __bytes_into_his_stdin,
            /* DO NOT OMIT! */ Charset.forName("UTF-8").newEncoder()
        );

    InputStream
        __bytes_from_his_stdout = slave_process.getInputStream();
    InputStreamReader
        chars_from_his_stdout = new InputStreamReader(
            __bytes_from_his_stdout,
            /* DO NOT OMIT! */ Charset.forName("UTF-8").newDecoder()
        );

    InputStream
        __bytes_from_his_stderr = slave_process.getErrorStream();
    InputStreamReader
        chars_from_his_stderr = new InputStreamReader(
            __bytes_from_his_stderr,
            /* DO NOT OMIT! */ Charset.forName("UTF-8").newDecoder()
        );

Blech.
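
Why pass an explicit encoder or decoder object rather than just a charset? A small sketch of the difference follows, assuming Java 7 or later for `StandardCharsets`. The class and method names (`StrictDecodingDemo`, `decodeLeniently`, `decodeStrictly`, `readAll`) are hypothetical. As far as I know, the `InputStreamReader` constructor that takes a `Charset` quietly substitutes U+FFFD for malformed input, while a fresh decoder from `newDecoder()` keeps its default action of reporting the error, so a bad byte surfaces as a `CharacterCodingException`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of any library.
final class StrictDecodingDemo {

    // Drain a Reader into a String, one char at a time.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Charset overload: malformed bytes are replaced, no error is raised.
    static String decodeLeniently(byte[] bytes) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            return readAll(r);
        }
    }

    // CharsetDecoder overload: a fresh decoder reports malformed bytes,
    // so reading throws a CharacterCodingException (an IOException).
    static String decodeStrictly(byte[] bytes) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes),
                StandardCharsets.UTF_8.newDecoder())) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 is "é" in Latin-1 but is not a valid UTF-8 sequence here.
        byte[] bad = { 'a', (byte) 0xE9, 'b' };

        String lenient = decodeLeniently(bad);
        System.out.println("lenient result contains U+FFFD: "
                + (lenient.indexOf('\uFFFD') >= 0));

        try {
            decodeStrictly(bad);
            System.out.println("strict: no error");
        } catch (CharacterCodingException e) {
            System.out.println("strict: rejected malformed input");
        }
    }
}
```

The strict form is noisier, but it turns silently corrupted text into an exception raised at the point where the bad bytes are read.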
--tom