develooper Front page | perl.perl5.porters | Postings from October 2003

Re: [perl #24077] 5.8.1 Unicode: CR added to \n (in Windows) is 1-byte despite UCS-2LE

Nick Ing-Simmons
October 4, 2003 10:52
Phill Wolf <> writes:
># New Ticket Created by  Phill Wolf 
># Please include the string:  [perl #24077]
># in the subject line of all future correspondence about this issue. 
># <URL: >
>Writing a "Unicode" (little-endian) text file in Windows, Perl
>corrupts the byte stream by writing 1-byte carriage-returns rather
>than 2-byte.

Oh heck - yes it will. The :crlf layer is a byte-oriented thing.
However, the bug serves as an example of how stackable layers
can fix it :-)
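
To illustrate what "byte-oriented" means here, the following is a minimal sketch that mimics the two orderings with Encode and plain substitutions rather than real PerlIO layers. The helper name hex_bytes and the variable names are made up for the illustration; the substitution only stands in for what a CRLF step does when it runs after the encoding instead of before it.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# CRLF translation applied to the already-encoded bytes:
# every 0x0A byte gets a single 0x0D byte in front of it.
my $encoded_first = encode('UCS-2LE', "a\n");
(my $broken = $encoded_first) =~ s/\x0A/\x0D\x0A/g;

# CRLF translation applied to characters, then encoded:
# the CR becomes a full 16-bit code unit like everything else.
(my $text = "a\n") =~ s/\n/\r\n/g;
my $fixed = encode('UCS-2LE', $text);

print hex_bytes($broken), "\n";   # 61 00 0D 0A 00
print hex_bytes($fixed),  "\n";   # 61 00 0D 00 0A 00
```

The first line of output has the same "0D 0A 00" pattern as the dump in the report; the second is the well-formed sequence.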

Do you need 0x000A converted to 0x000D, 0x000A, or would just the
0x000A do?

If just 0x000A will do, then the simplest thing is to use a non-translating
buffer layer, overriding the default :crlf with :perlio, thus:

open(FH, ">:perlio:encoding(UCS-2LE)", "wellformed.txt");

If you need 16-bit but CRLF then this seems to work for me:

open(FH, ">:encoding(UCS-2LE):crlf:utf8", "wellformed.txt");

What we do there is put the size-expanding encoding at the bottom of the
stack, then put the :crlf converter on top (so the 0xD now gets expanded by
the encoding), and then turn the UTF-8 flag back on (this last step is a bit
messy) so that the BOM goes through.

I was doing my testing on Linux - you just might need to prefix a :raw
to all of those on Win32.
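
One way to check what actually lands on disk is to write through a layer string and read the file back in binary mode. The sketch below does that for the :raw-prefixed variant of the second open() above. The helper name dump_with_layers and the use of a temporary file are assumptions for the illustration, not part of the original report.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layer string and
# return the raw bytes found in the file, formatted as hex.
sub dump_with_layers {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open(my $out, ">$layers", $path) or die "open $path: $!";
    print {$out} $text;
    close($out) or die "close $path: $!";

    # Read back with no translation at all so every byte is visible.
    open(my $in, '<:raw', $path) or die "reopen $path: $!";
    local $/;
    my $bytes = <$in>;
    close $in;

    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# BOM, then two short lines, as in the report.
my $hex = dump_with_layers(':raw:encoding(UCS-2LE):crlf:utf8', "\x{FEFF}a\nb\n");
print "$hex\n";
```

If the layers stack as described, every 0D in the dump should be followed by 00 rather than directly by 0A. Passing a different layer string to the same helper makes it easy to compare the variants side by side.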

> require v5.8.1;
> use charnames ('BYTE ORDER MARK');
> open(FH, ">:encoding(UCS-2LE)", "malformed.txt");
> print FH "\N{BYTE ORDER MARK}";
> print FH "a\n";
> print FH "b\n";
> close(FH);
>Debug shows the following bytes in the file:
> FE FF 61 00 0D 0A 00 62-00 0D 0A 00   ..a....b....
>Note how 0D isn't getting a trailing 00 byte.
