perl.perl5.porters
Re: Encode and emitting the little endian form of UTF-16 (not UTF-16LE)
From:
demerphq
Date:
May 23, 2007 10:20
Subject:
Re: Encode and emitting the little endian form of UTF-16 (not UTF-16LE)
Message ID:
9b18b3110705231020r197cad00g353fa54dcca2249f@mail.gmail.com
On 5/23/07, Tels <nospam-abuse@bloodgate.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Moin,
>
> On Wednesday 23 May 2007 15:53:14 demerphq wrote:
> > Hi Dan,
> >
> > I was wondering if there is some way to get Encode to emit the little
> > endian version of UTF-16 (with BOM) as a typical Win32 on Intel app
> > would do. It seems to me that currently
> >
> > my $octets= encode('UTF-16',$string);
> >
> > will only emit the big-endian form of it.
>
> As far as I gleaned from working with UTF, this is right. (or in other
> words, UTF-16BE is just an alias for UTF-16), but I could be wrong.
No, that's not correct. UTF-16 files can be either big endian or little
endian and are expected to start with a Byte Order Mark, codepoint
U+FEFF, which is used to determine their endianness. UTF-16LE and
UTF-16BE are encodings with a fixed endianness and do not start with a BOM.
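To make the difference concrete, here is a minimal sketch comparing the octets Encode produces for 'UTF-16' with those for 'UTF-16LE' plus a hand-added BOM. The sample string and the hex_dump helper are made up for illustration; only encode and decode come from Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "Hi";

# 'UTF-16' on encode: a BOM followed by big-endian code units.
my $be_bom = encode('UTF-16', $string);

# 'UTF-16LE' emits no BOM, so U+FEFF is prepended by hand.
my $le_bom = encode('UTF-16LE', "\x{FEFF}" . $string);

# Hypothetical helper: render octets as space-separated hex pairs.
sub hex_dump { return join ' ', unpack('(H2)*', shift) }

print hex_dump($be_bom), "\n";   # fe ff 00 48 00 69
print hex_dump($le_bom), "\n";   # ff fe 48 00 69 00

# Decoding as 'UTF-16' uses the BOM to pick the byte order and strips it.
my $round_trip = decode('UTF-16', $le_bom);
```

Both octet strings decode back to the same text via 'UTF-16'; only the byte order on disk differs.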
> > Of course well-behaved apps shouldn't care, but some do. Also, I know I
> > can hand-emit the BOM myself like so:
> >
> > my $octets= encode('UTF-16LE',chr(0xFEFF).$string);
> >
> > but this struck me as a bit convoluted and makes it a bit tricky to do
> > with IO layers. If there isn't a way to do it currently, maybe the name
> > 'UTF-16:le' or something similar could be used for this?
>
> I am not sure I understand your question, since you showed it is possible to
> get UTF-16LE, so what exactly do you want more? :)
>
> Shouldn't then:
>
> binmode ($FILE, ':encoding(UTF-16LE)') or die("$!");
>
> just work?
Yes, it works, but it doesn't ensure the file starts with a BOM. That
is easy enough to do by hand, but as I said above it is a touch
annoying. I can imagine scenarios where it's not clear whose
responsibility it is to add the BOM. I was actually trying to write a
UTF-8 to UTF-16 converter (long story), but the files it produces differ
from those written by most of the Win32 tools I used for comparison, as
they emit the little-endian variant instead.
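For reference, the hand-rolled workaround with an IO layer might look like the sketch below: open the handle with a UTF-16LE encoding layer and print U+FEFF as the first character. The temporary file and variable names are assumptions for illustration, and ':raw' is pushed first on the assumption that no CRLF translation is wanted.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $string = "Hi";

# Temporary file used purely for illustration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close: $!";

# The layer encodes every character as UTF-16LE but writes no BOM itself.
open my $out, '>:raw:encoding(UTF-16LE)', $path or die "open: $!";
print {$out} "\x{FEFF}";   # hand-added BOM, encoded by the layer as FF FE
print {$out} $string;
close $out or die "close: $!";

# Read the raw octets back to see what actually landed on disk.
open my $in, '<:raw', $path or die "open: $!";
my $octets = do { local $/; <$in> };
close $in or die "close: $!";

print join(' ', unpack('(H2)*', $octets)), "\n";   # ff fe 48 00 69 00
```

The catch is the one described above: whoever opens the handle has to remember to print the BOM exactly once, before anything else is written.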
Also, it struck me as weird that UTF-16 in Perl is always big endian,
even on a little-endian architecture. Obviously it's easier to test
this way.
IMO it would be cool to have a way to control it in code without
hand-adding the BOM.
I'll do a patch if there isn't already a way to do it; I just wanted to
be sure before I look into it. Since Dan knows the code, what could
take me a while would probably be the work of a few minutes for
him, so I figured I'd see what he had to say first.
cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"