Re: Encode and emitting the little endian form of UTF-16 (not UTF-16LE)
From:
Tels
Date:
May 23, 2007 10:36
Subject:
Re: Encode and emitting the little endian form of UTF-16 (not UTF-16LE)
Message ID:
200705231944.22536@bloodgate.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Moin,
On Wednesday 23 May 2007 17:20:15 demerphq wrote:
> On 5/23/07, Tels <nospam-abuse@bloodgate.com> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Moin,
> >
> > On Wednesday 23 May 2007 15:53:14 demerphq wrote:
> > > Hi Dan,
> > >
> > > I was wondering if there is some way to get Encode to emit the little
> > > endian version of UTF-16 (with BOM) as a typical Win32 on Intel app
> > > would do. It seems to me that currently
> > >
> > > my $octets= encode('UTF-16',$string);
> > >
> > > will only emit the big-endian form of it.
> >
> > As far as I gleaned from working with UTF, this is right. (or in other
> > words, UTF-16BE is just an alias for UTF-16), but I could be wrong.
>
> No, that's not correct. UTF-16 files can be either big endian or little
> endian and must start with a Byte Order Mark, codepoint U+FEFF, which
> is used to determine what their endianness is.
As far as I read the wiki entry, they "should", but not "must", start with
a BOM. Of course, the BOM makes things much easier.
Quote:
"If the BOM is missing, barring any indication of byte order from
higher-level protocols, big endian is to be used or assumed."
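To make the rule in that quote concrete, here is a minimal sketch of a decoder that looks for a BOM and falls back to big endian when there is none. The helper name decode_utf16_guess is made up for this example and is not part of Encode; as far as I know, Encode's own decode('UTF-16', ...) dies when no BOM is present, which is why the fallback is done by hand here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): decode UTF-16 octets,
# honouring a leading BOM and assuming big endian when none is found.
sub decode_utf16_guess {
    my ($octets) = @_;
    my $bom = substr($octets, 0, 2);

    if ($bom eq "\xFF\xFE") {            # little-endian BOM
        my $body = substr($octets, 2);
        return decode('UTF-16LE', $body);
    }
    if ($bom eq "\xFE\xFF") {            # big-endian BOM
        my $body = substr($octets, 2);
        return decode('UTF-16BE', $body);
    }
    # No BOM: fall back to big endian, as the quoted text suggests.
    return decode('UTF-16BE', $octets);
}

print decode_utf16_guess("\xFF\xFE\x41\x00"), "\n";   # LE with BOM
print decode_utf16_guess("\x00\x41"), "\n";           # no BOM, assumed BE
```

Whether a higher-level protocol should override the fallback is left out of this sketch.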
> UTF-16LE and UTF-16BE
> are encodings with a specific endianness and do not start with a BOM.
Erm, see above.
And that still doesn't answer how you know which endianness to emit when the
conversion only specifies "UTF-16".
When you say "UTF-16", Encode can either:
* always omit the BOM and emit BE
* emit a BOM and let BE or LE be determined by random chance, by the
architecture, or always be BE
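For reference, a small sketch that dumps what Encode produces for each of the names discussed in this thread. The hex_octets helper is invented for the example. The byte values in the comments follow the behaviour described in the Encode::Unicode documentation (plain 'UTF-16' encodes as big endian with a BOM prepended); treat them as an assumption about your installed Encode version rather than a guarantee.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render octets as space-separated uppercase hex.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { uc } unpack('(H2)*', $octets);
}

my $string = "A";

my $generic = encode('UTF-16',   $string);              # BOM + big endian
my $be      = encode('UTF-16BE', $string);              # no BOM
my $le      = encode('UTF-16LE', $string);              # no BOM
my $le_bom  = encode('UTF-16LE', "\x{FEFF}" . $string); # BOM added by hand

printf "%-16s %s\n", 'UTF-16',         hex_octets($generic);  # FE FF 00 41
printf "%-16s %s\n", 'UTF-16BE',       hex_octets($be);       # 00 41
printf "%-16s %s\n", 'UTF-16LE',       hex_octets($le);       # 41 00
printf "%-16s %s\n", 'UTF-16LE + BOM', hex_octets($le_bom);   # FF FE 41 00
```

So of the two options listed above, the second one with "always BE" matches what this sketch shows.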
> > > Of course well-behaved apps shouldn't care, but some do. Also, I know I
> > > can hand-emit the BOM myself like so:
> > >
> > > my $octets= encode('UTF-16LE',chr(0xFEFF).$string);
> > >
> > > but this struck me as a bit convoluted and makes it a bit tricky to do
> > > with IO layers. If there isn't a way to do it currently, maybe the name
> > > 'UTF-16:le' or something similar could be used for this?
> >
> > I am not sure I understand your question, since you showed it is
> > possible to get UTF-16LE, so what exactly do you want more? :)
> >
> > Shouldn't then:
> >
> > binmode ($FILE, ':encoding(UTF-16LE)') or die("$!");
> >
> > just work?
>
> Yes, it works, but it doesn't ensure the file starts with a BOM. Which
> is easily enough done by hand, but as I said above is a touch
> annoying. I can imagine scenarios where it's not clear whose
> responsibility it is to add the BOM. I actually was trying to write a
> UTF-8 to UTF-16 converter (long story), but the files are different
> from those produced by most Win32 tools I used for comparison, as they
> emit the little-endian variant instead.
>
> Also it struck me as weird that UTF-16 in Perl is always big endian
> even on a little-endian architecture. Obviously it's easier to test
> this way.
> IMO it would be cool to have a way to control it in code without hand
> adding the BOM.
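For the IO-layer case, a minimal sketch of the by-hand workaround, assuming it is acceptable to print U+FEFF as the first character so that the :encoding(UTF-16LE) layer turns it into the FF FE BOM. The helper name open_utf16le_with_bom is made up for this example, and the in-memory filehandle is only there so the result can be inspected without touching disk.

```perl
use strict;
use warnings;

# Hypothetical helper: open $target for writing as UTF-16LE and write a
# BOM before anything else. $target may be a path or a scalar reference
# (in-memory file).
sub open_utf16le_with_bom {
    my ($target) = @_;
    # :raw first, so that no CRLF translation interferes with the
    # 16-bit code units.
    open(my $fh, '>:raw:encoding(UTF-16LE)', $target)
        or die "Cannot open for writing: $!";
    print {$fh} "\x{FEFF}";    # the layer encodes this as FF FE
    return $fh;
}

my $buffer = '';
my $fh = open_utf16le_with_bom(\$buffer);
print {$fh} "Hi";
close($fh) or die "Cannot close: $!";

# Show the resulting octets as hex.
print join(' ', map { uc } unpack('(H2)*', $buffer)), "\n";
```

This keeps the responsibility for the BOM in one place (whoever opens the file), which is the part that is unclear when the BOM is added ad hoc.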
I guess that adding the BOM when you request UTF-16BE and UTF-16LE would be
a start, but the wiki contradicts itself there:
"However rather than using a BOM prepended to the data, the byte order used
is implicit in the name of the encoding scheme (LE for little-endian, BE
for big-endian). Since a BOM is specifically not to be prepended in these
schemes, if an encoded ZWNBSP character is found at the beginning of any
data encoded by these schemes is not to be considered to be a BOM, but
instead is considered part of the text itself. In practice most software
will ignore these "accidental" BOMs."
Hohum.
All the best,
Tels
- --
Signed on Wed May 23 19:32:39 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.
"The rovers Spirit and Opportunity were proposed, authorized, announced,
designed, launched and successfully landed upon Mars within the
timeframe of Duke Nukem Forever's development."
-- Unknown
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
iQEVAwUBRlSZlncLPEOTuEwVAQJhQAf/QNDn3ga8xYrbn60lSiz0Q2Xy02W+42Id
q8BXMrQDMnGeaam2JBCW9toN4b/FHRtIkwPsqkGp9ABo+HU1xSiCYucX2ueHSr4U
d1e2J/Jj7eA1wT+wbBVs0neZzP65LOoqZoysUrIVWyvLvaJ3zddseM/yl1s4qHLc
/R4yHtW+sG4nq6d5GrSfEuNd6s4kFTRXAViUeBHdhIrPd/gpBnl+3HKdWWOCWlD/
IiDC7YfZtvMWDOS+hjG+T571FZmRBdTBHsOsBaJAnXrAvOYWtp22qGMG2ISqSngF
ABPOAVHlcwJfC8HDDqsfrWW8RRfouWPS/a7S558w6D+1eOdb7p8avQ==
=G818
-----END PGP SIGNATURE-----