[perl #80058] [Bug Report] Bad \n convert, using UTF-16 on Win32
From: MzM
Date: December 1, 2010 23:24
Subject: [perl #80058] [Bug Report] Bad \n convert, using UTF-16 on Win32
Message ID: rt-3.6.HEAD-13564-1291207107-618.80058-75-0@perl.org
# New Ticket Created by MzM
# Please include the string: [perl #80058]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=80058 >
This is a bug report for perl from mezmerik@gmail.com,
generated with the help of perlbug 1.39 running under perl 5.12.2.
-----------------------------------------------------------------
[Please describe your issue here]
Hello,
I'm using ActivePerl 5.12.2 on Windows 7.
Perl for Win32 converts a single "LF" (one without a preceding "CR")
to "CRLF" on output, but my perl seems to identify "LF" incorrectly in
UTF-16 data. In ANSI and UTF-8 files, LF is the byte "0A"; in UTF-16,
LF should be "000A" (Big Endian) or "0A00" (Little Endian), but my perl
seems to treat any single "0A" byte as LF too! As a result, it does the
wrong thing and adds a "0D" before each "0A" (my perl also regards a
single "0D" byte as CR; the correct CR in UTF-16 is "000D" in Big
Endian or "0D00" in Little Endian).
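As a quick check of the byte values quoted above, here is a minimal sketch using the core Encode module. The helper name hex_bytes is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

# Print how LF (U+000A) and CR (U+000D) are encoded in each encoding.
for my $enc (qw(UTF-8 UTF-16BE UTF-16LE)) {
    printf "%-8s LF=%s  CR=%s\n", $enc,
        hex_bytes(encode($enc, "\x0A")),
        hex_bytes(encode($enc, "\x0D"));
}
```

In UTF-16 both characters occupy two bytes, so a translation that looks at single bytes cannot tell a real LF from a "0A" that is half of another character.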
Here's the test program:
open FH_IN, "<:encoding(utf16be)", "src.txt" or die;
open FH_OUT, ">:encoding(utf16be)", "output.txt" or die;
while (<FH_IN>) {
    print FH_OUT $_;
}
I think "src.txt" and "output.txt" should be identical, but they are not.
1) If "src.txt" contains only two CRLFs, its bytes are "FE FF 00 0D 00
0A 00 0D 00 0A"; "output.txt" becomes "FE FF 00 0D 00 0D 0A 00 0D
00 0D 0A": each "0A" gets an unnecessary and wrong preceding "0D".
2) If "src.txt" contains only the Chinese character "上", whose Unicode
code point and UTF-16BE encoding are both "4E 0A", the whole file with
BOM is "FE FF 4E 0A"; "output.txt" becomes "FE FF 4E 0D 0A".
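Both outputs are consistent with an LF-to-CRLF translation applied to the bytes after encoding. The sketch below simulates that with a plain substitution; it is an assumption about the mechanism based on the observed bytes, not a trace of what perl does internally, and the names hex_bytes and naive_crlf are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

# Assumed mechanism: every 0A byte gets a 0D in front of it,
# regardless of which character the byte belongs to.
sub naive_crlf {
    my $bytes = shift;
    $bytes =~ s/\x0A/\x0D\x0A/g;
    return $bytes;
}

my $bom   = "\xFE\xFF";
my $case1 = $bom . encode('UTF-16BE', "\x0D\x0A\x0D\x0A");  # two CRLFs
my $case2 = $bom . encode('UTF-16BE', "\x{4E0A}");          # the character U+4E0A

print hex_bytes(naive_crlf($case1)), "\n";  # FE FF 00 0D 00 0D 0A 00 0D 00 0D 0A
print hex_bytes(naive_crlf($case2)), "\n";  # FE FF 4E 0D 0A
```

The simulated results match the bytes reported for "output.txt" in both cases.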
If I modify the loop to chomp each line:
while (<FH_IN>) {
    chomp;
    print FH_OUT $_;
}
1) With the two-CRLF "src.txt" ("FE FF 00 0D 00 0A 00 0D 00 0A"),
"output.txt" becomes "FE FF 00 0D 00 0D". So chomp only gets rid of
the LF (00 0A); it should erase all 4 bytes "00 0D 00 0A".
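For comparison, here is a sketch of the result expected above, produced by decoding the bytes explicitly with Encode and stripping line endings at the character level. It uses an in-memory copy of the "src.txt" bytes rather than a file, and it is only an illustration of the expected behaviour, not a fix for the I/O layers.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes of the two-CRLF "src.txt" from the report, BOM included.
my $src = "\xFE\xFF\x00\x0D\x00\x0A\x00\x0D\x00\x0A";

# Decode first; 'UTF-16' reads the BOM to pick the byte order and drops it.
my $text = decode('UTF-16', $src);

# Split after each LF character, then strip CRLF (or bare LF) from each line.
my @lines = split /(?<=\x0A)/, $text;
s/\x0D?\x0A\z// for @lines;

# Re-encode what is left and put the BOM back by hand.
my $out = "\xFE\xFF" . encode('UTF-16BE', join '', @lines);

printf "%d lines, output bytes: %s\n", scalar @lines,
    join ' ', map { sprintf '%02X', ord } split //, $out;
```

Here both lines end up empty, so only the BOM remains in the output.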
That's what I found when working with UTF-16 files. I would appreciate
your efforts to improve Unicode support. Many thanks!
Joey
[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=core
severity=low
---
Site configuration information for perl 5.12.2:
Configured by SYSTEM at Mon Sep 6 23:12:49 2010.
Summary of my perl5 (revision 5 version 12 subversion 2) configuration:
Platform:
osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
uname=''
config_args='undef'
hint=recommended, useposix=true, d_sigaction=undef
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO
-D_USE_32BIT_TIME_T -DPERL_MSVCRT_READFIX',
optimize='-MD -Zi -DNDEBUG -O1',
cppflags='-DWIN32'
ccversion='12.00.8804', gccversion='', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
ivtype='long', ivsize=4, nvtype='double', nvsize=8,
Off_t='__int64', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf
-libpath:"C:\Perl\lib\CORE" -machine:x86'
libpth=\lib
libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib
odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib
winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib
oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib
version.lib odbc32.lib odbccp32.lib comctl32.lib msvcrt.lib
libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl512.lib
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug
-opt:ref,icf -libpath:"C:\Perl\lib\CORE" -machine:x86'
Locally applied patches:
ACTIVEPERL_LOCAL_PATCHES_ENTRY
1fd8fa4 Add Wolfram Humann to AUTHORS
f120055 make string-append on win32 100 times faster
a2a8d15 Define _USE_32BIT_TIME_T for VC6 and VC7
007cfe1 Don't pretend to support really old VC++ compilers
6d8f7c9 Get rid of obsolete PerlCRT.dll support
d956618 Make Term::ReadLine::findConsole fall back to STDIN if
/dev/tty can't be opened
321e50c Escape patch strings before embedding them in patchlevel.h
---
@INC for perl 5.12.2:
C:/Perl/site/lib
C:/Perl/lib
.
---
Environment for perl 5.12.2:
HOME (unset)
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=C:\Program Files\ActiveState Komodo IDE
6\;C:\Perl\site\bin;C:\Perl\bin;C:\Program Files\NVIDIA
Corporation\PhysX\Common;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\
PERL_BADLANG (unset)
SHELL (unset)