[perl #80058] [Bug Report] Bad \n convert, using UTF-16 on Win32

From: MzM
Date: December 1, 2010 23:24
Subject: [perl #80058] [Bug Report] Bad \n convert, using UTF-16 on Win32
Message ID: rt-3.6.HEAD-13564-1291207107-618.80058-75-0@perl.org

# New Ticket Created by  MzM 
# Please include the string:  [perl #80058]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=80058 >


This is a bug report for perl from mezmerik@gmail.com,
generated with the help of perlbug 1.39 running under perl 5.12.2.


-----------------------------------------------------------------
[Please describe your issue here]

Hello,

I'm using ActivePerl 5.12.2 on Windows 7.

Perl for Win32 has a feature to convert a single "LF" (without a
preceding "CR") to "CRLF", but my perl seems to identify "LF"
incorrectly in UTF-16 data. In ANSI and UTF-8 files, LF is the single
byte "0A"; in UTF-16, LF should be "000A" (Big Endian) or "0A00"
(Little Endian), but my perl seems to regard any single "0A" byte as LF
too! Thus it does the wrong thing, which is adding a "0D" byte before
each "0A" (my perl also regards a single "0D" byte as CR; the correct
CR in UTF-16 should be "000D" in Big Endian or "0D00" in Little Endian).

Here's the test program:


open FH_IN, "<:encoding(utf16be)", "src.txt" or die;
open FH_OUT, ">:encoding(utf16be)", "output.txt" or die;

while (<FH_IN>) {
   print FH_OUT $_;
}


I think "src.txt" and "output.txt" should be identical, but they are not.

1) If "src.txt" contains only two CRLFs, its bytes are "FE FF 00 0D 00
0A 00 0D 00 0A"; "output.txt" becomes "FE FF 00 0D 00 0D 0A 00 0D
00 0D 0A". Each "0A" gets an unnecessary and wrong preceding "0D".

2) If "src.txt" contains only the single Chinese character "上", whose
Unicode code point is U+4E0A and whose UTF-16BE bytes are "4E 0A", then
with the BOM the whole file is "FE FF 4E 0A"; "output.txt" becomes
"FE FF 4E 0D 0A".
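
Both results are consistent with the LF-to-CRLF conversion being applied to the raw bytes after the UTF-16 encoding step rather than to the characters before it. The following is only a minimal sketch of that idea, not Perl's actual PerlIO code: naive_crlf and hex_dump are hypothetical helpers, and the byte-level rule (insert "0D" before any "0A" not already preceded by "0D") is an assumption taken from the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: models a byte-level translation that inserts 0D
# before every 0A byte that is not already preceded by 0D. This is an
# assumption about the observed behavior, not Perl's real implementation.
sub naive_crlf {
    my ($bytes) = @_;
    (my $out = $bytes) =~ s/(?<!\x0D)\x0A/\x0D\x0A/g;
    return $out;
}

# Hypothetical helper: format a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Case 1: BOM followed by two UTF-16BE CRLF pairs.
my $two_crlf = pack 'H*', 'FEFF000D000A000D000A';

# Case 2: BOM followed by U+4E0A in UTF-16BE.
my $shang = pack 'H*', 'FEFF4E0A';

print hex_dump(naive_crlf($two_crlf)), "\n";  # FE FF 00 0D 00 0D 0A 00 0D 00 0D 0A
print hex_dump(naive_crlf($shang)), "\n";     # FE FF 4E 0D 0A
```

The sketch reproduces both corrupted outputs byte for byte, which is why I suspect the conversion is looking at encoded bytes rather than decoded characters.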


Now modify the program code:

while (<FH_IN>) {
   chomp;
   print FH_OUT $_;
}

1) "src.txt", which contains only two CRLFs ("FE FF 00 0D 00 0A 00 0D
00 0A"), becomes "FE FF 00 0D 00 0D". So chomp only gets rid of the LF
(00 0A); it should erase all 4 bytes "00 0D 00 0A".
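
For comparison, chomp removes only the current value of $/, which is a single "\n" by default, so any "\r" that was not translated away on input stays in the string. The sketch below shows that difference on a decoded string and one way to strip the whole line ending explicitly. The sample string is made up for illustration, and the substitution is only a possible workaround, not a fix for the conversion itself.

```perl
use strict;
use warnings;

# Assumed stand-in for one line as read by the test program: the CR
# is still present because the UTF-16 CRLF was not translated on input.
my $line = "text\r\n";

# chomp removes only the trailing "\n" (the default $/).
my $chomped = $line;
chomp $chomped;

# Explicit alternative: remove an optional CR together with the LF.
(my $stripped = $line) =~ s/\r?\n\z//;

printf "after chomp: %d chars, after s///: %d chars\n",
    length($chomped), length($stripped);
```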

That's what I found when working with UTF-16 files. I appreciate your
efforts to improve Unicode support. Many thanks!

 Joey


[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
   category=core
   severity=low
---
Site configuration information for perl 5.12.2:

Configured by SYSTEM at Mon Sep  6 23:12:49 2010.

Summary of my perl5 (revision 5 version 12 subversion 2) configuration:

 Platform:
   osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
   uname=''
   config_args='undef'
   hint=recommended, useposix=true, d_sigaction=undef
   useithreads=define, usemultiplicity=define
   useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
   use64bitint=undef, use64bitall=undef, uselongdouble=undef
   usemymalloc=n, bincompat5005=undef
 Compiler:
   cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO
-D_USE_32BIT_TIME_T -DPERL_MSVCRT_READFIX',
   optimize='-MD -Zi -DNDEBUG -O1',
   cppflags='-DWIN32'
   ccversion='12.00.8804', gccversion='', gccosandvers=''
   intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
   d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
   ivtype='long', ivsize=4, nvtype='double', nvsize=8,
Off_t='__int64', lseeksize=8
   alignbytes=8, prototype=define
 Linker and Libraries:
   ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf
-libpath:"C:\Perl\lib\CORE"  -machine:x86'
   libpth=\lib
   libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib
odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
   perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib
winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib
oleaut32.lib  netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib
version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
   libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl512.lib
   gnulibc_version=''
 Dynamic Linking:
   dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
   cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug
-opt:ref,icf  -libpath:"C:\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
   ACTIVEPERL_LOCAL_PATCHES_ENTRY
   1fd8fa4 Add Wolfram Humann to AUTHORS
   f120055 make string-append on win32 100 times faster
   a2a8d15 Define _USE_32BIT_TIME_T for VC6 and VC7
   007cfe1 Don't pretend to support really old VC++ compilers
   6d8f7c9 Get rid of obsolete PerlCRT.dll support
   d956618 Make Term::ReadLine::findConsole fall back to STDIN if
/dev/tty can't be opened
   321e50c Escape patch strings before embedding them in patchlevel.h

---
@INC for perl 5.12.2:
   C:/Perl/site/lib
   C:/Perl/lib
   .

---
Environment for perl 5.12.2:
   HOME (unset)
   LANG (unset)
   LANGUAGE (unset)
   LD_LIBRARY_PATH (unset)
   LOGDIR (unset)
   PATH=C:\Program Files\ActiveState Komodo IDE
6\;C:\Perl\site\bin;C:\Perl\bin;C:\Program Files\NVIDIA
Corporation\PhysX\Common;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\
   PERL_BADLANG (unset)
   SHELL (unset)


