
[perl #22261] Unrecognised BOM when reading a file larger than 1k with encoding(UTF-16)

Jeremy Devenport
May 21, 2003 08:21
# New Ticket Created by  Jeremy Devenport 
# Please include the string:  [perl #22261]
# in the subject line of all future correspondence about this issue. 
# <URL: >

This is a bug report for perl from,
generated with the help of perlbug 1.34 running under perl v5.8.0.

[Please enter your report here]

The following code fails with perl 5.8.0:

# This will succeed until input.txt is >1k
open IN, "<:raw:encoding(utf16)", "input.txt";
while (<IN>) {
    # do nothing
}
close IN;

UTF-16:Unregognised BOM 4f00 at line 3, <IN> line 27.

The typo (Unregognised) is fixed in 5.8.x, but the error itself still occurs.

This bug makes it tricky to work with UTF-16 files (the predominant flavor
of Unicode on Windows).

Changing the :encoding from UTF-16 to UTF-16LE will make the error go away,
but then the BOM will actually show up in the text.
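
A minimal sketch of that workaround, assuming the file is known to be
little-endian. The helper name read_utf16le_lines is hypothetical and is
only there to illustrate stripping the BOM (U+FEFF) by hand:

```perl
use strict;
use warnings;

# Hypothetical helper: read a little-endian UTF-16 file line by line,
# dropping the BOM that :encoding(UTF-16LE) leaves in the text.
sub read_utf16le_lines {
    my ($path) = @_;
    open(my $in, '<:raw:encoding(UTF-16LE)', $path)
        or die "Cannot open $path: $!";
    my @lines;
    while (my $line = <$in>) {
        # Only the very first line can start with the BOM.
        $line =~ s/^\x{FEFF}// if $. == 1;
        push @lines, $line;
    }
    close($in) or die "Cannot close $path: $!";
    return @lines;
}
```

This gives up the automatic byte-order detection that UTF-16 is supposed
to provide, so it is a stopgap rather than a fix.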

It looks like none of the current tests catch this because none of them
store more than one buffer's worth of data in their test files. The test
below demonstrates the bug on my system (I am not sure it is written
correctly for big-endian or EBCDIC systems). Note that it only fails if
$count is set to 512 or greater, which makes the file larger than 1k.

#!./perl -w

BEGIN {
    if ($ENV{'PERL_CORE'}){
        chdir 't';
        unshift @INC, '../lib';
    }
    unless (find PerlIO::Layer 'perlio') {
        print "1..0 # Skip: not perlio\n";
        exit 0;
    }
}

print "1..4\n";

my $utf16 = "utf16$$";
my $utf8  = "utf8$$";
my $count = 512;

# write a BOM and then $count UTF-16 'A' characters
if (open(UTF, ">$utf16")) {
    binmode(UTF, ":bytes");
    print UTF "\xff\xfe" . ("\x41\x00" x $count);
    close UTF or die "Could not close: $!";
}

# read the file back through :encoding(UTF-16) and copy it out as UTF-8
{
    use Encode;
    open(my $i,'<:encoding(UTF-16)',$utf16);
    print "ok 1\n";
    open(my $o,'>:utf8',$utf8);
    print "ok 2\n";
    print $o readline($i);
    print "ok 3\n";
    close($o) or die "Could not close: $!";
}

# the UTF-8 copy should hold exactly $count 'A' characters and no BOM
if (open(UTF, "<$utf8")) {
    binmode(UTF, ":bytes");
    print "not " unless <UTF> eq 'A' x $count;
    print "ok 4\n";
    close UTF;
}

END {
    unlink($utf16, $utf8);
}
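
The 512 threshold follows from the layout of the file written by the test
above: a 2-byte BOM plus 2 bytes per character. A small sketch of the
arithmetic (the helper name utf16_test_file_size is hypothetical, and the
1024-byte limit is taken from the "larger than 1k" observation, not from
the perl source):

```perl
use strict;
use warnings;

# Size in bytes of the test file: a 2-byte BOM plus 2 bytes per 'A'.
sub utf16_test_file_size {
    my ($count) = @_;
    return 2 + 2 * $count;
}

for my $count (511, 512) {
    my $size = utf16_test_file_size($count);
    printf "count=%d size=%d %s\n", $count, $size,
        $size > 1024 ? "(larger than 1k)" : "(fits in 1k)";
}
# prints:
# count=511 size=1024 (fits in 1k)
# count=512 size=1026 (larger than 1k)
```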

[Please do not change anything below this line]
Site configuration information for perl v5.8.0:

Configured by jeremyd at Mon May 19 22:33:21 PDT 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
    osname=openbsd, osvers=3.2, archname=OpenBSD.i386-openbsd
    uname='openbsd badger.internal 3.2 badger#1 i386 '
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef 
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
    cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.3 20010125 (prerelease)', 
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', 
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-lgdbm -lm -lc -lutil
    perllibs=-lm -lc -lutil
    libc=/usr/lib/, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=define, ccdlflags=' '
    cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC  -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.8.0:

Environment for perl v5.8.0:
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PERL_BADLANG (unset)

