perl.perl5.porters | Postings from March 2011

Perl generates illegal UTF-8 sequences

From: Tom Christiansen
Date: March 7, 2011 09:55
Subject: Perl generates illegal UTF-8 sequences
Message ID: 28249.1299520545@chthon

If you have a file like this:

    printf "I have %d whatevers\n",  1 + $_ for qw[élite Ævar μῦθος mío];

and run it with "perl -W -CS", Perl emits warnings that contain the following illegal UTF-8 sequences:

    Argument "élite" isn't numeric in addition (+) at /tmp/uterr line 5.
    Argument "�M-^Fvar" isn't numeric in addition (+) at /tmp/uterr line 5.
    Argument "μῦθο�M-^B" isn't numeric in addition (+) at /tmp/uterr line 5.
    Argument "mío" isn't numeric in addition (+) at /tmp/uterr line 5.

Perl should never emit malformed UTF-8 on a stream that -CS has marked as UTF-8.  This is a bug.
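
To check that a warning line really is malformed, and not just misrendered by the terminal, you can run its raw bytes through a strict decoder. Here is a minimal sketch using Encode. The helper name is_valid_utf8 is hypothetical, and the "broken" byte string is only an assumed reconstruction of the second warning above (a 0xC3 lead byte followed by the literal ASCII text "M-^Fvar"), not a captured dump.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if $octets decodes cleanly as strict
# UTF-8, 0 otherwise. LEAVE_SRC keeps decode() from modifying its input.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# ASSUMPTION: reconstruction of the broken argument, with lead byte 0xC3
# followed by ASCII text where a continuation byte should be.
my $broken = "\xC3M-^Fvar";

# The same word as well-formed UTF-8 octets (U+00C6 is 0xC3 0x86).
my $intact = "\xC3\x86var";

printf "%-8s %s\n", 'broken:', is_valid_utf8($broken) ? 'valid' : 'malformed';
printf "%-8s %s\n", 'intact:', is_valid_utf8($intact) ? 'valid' : 'malformed';
```

A lead byte such as 0xC3 must be followed by a continuation byte in the range 0x80 to 0xBF, so any output that puts an ASCII character there cannot be valid UTF-8.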

On the other hand, even if you "properly" declare your source encoding
as utf8, you still get output that is arguably even *less* helpful:

    use utf8;
    printf "I have %d whatevers\n",  1 + $_ for qw[élite Ævar μῦθος mío];

    Argument "\x{e9}\x{6c}..." isn't numeric in addition (+) at /tmp/uterr line 5.
    Argument "\x{c6}\x{76}..." isn't numeric in addition (+) at /tmp/uterr line 5.
    Argument "\x{3bc}\x{1fe6}..." isn't numeric in addition (+) at /tmp/uterr line 5.
    Argument "\x{6d}\x{ed}..." isn't numeric in addition (+) at /tmp/uterr line 5.

What's that about, eh?  I would argue that all of those are wrong, and that
any of these would be preferable:

     * Proper UTF-8 per -CS:

	Argument "él..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "Æv..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "μῦ..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "mí..." isn't numeric in addition (+) at /tmp/uterr line 5.

     * Escape only the trans-ASCII:

	Argument "\x{E9}l..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "\x{C6}v..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "\x{3BC}\x{1FE6}..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "m\x{ED}..." isn't numeric in addition (+) at /tmp/uterr line 5.

     * Escape only the trans-ASCII, but use proper names:

	Argument "\N{LATIN SMALL LETTER E WITH ACUTE}l..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "\N{LATIN CAPITAL LETTER AE}v..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER UPSILON WITH PERISPOMENI}..." isn't numeric in addition (+) at /tmp/uterr line 5.
	Argument "m\N{LATIN SMALL LETTER I WITH ACUTE}..." isn't numeric in addition (+) at /tmp/uterr line 5.
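
The three alternatives above can be sketched in plain Perl. This is only an illustration of the proposed output formats, not how the core builds its warnings. The helper names (head_chars, head_hex, head_named, named_escape) are hypothetical, and each expects a string that has already been decoded to characters. The test words are written with \x{} escapes so the sketch does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: keep the first $limit characters (not bytes).
sub head_chars {
    my ($str, $limit) = @_;
    return substr($str, 0, $limit) . '...';
}

# Hypothetical helper: as above, but escape only non-ASCII as \x{HEX}.
sub head_hex {
    my ($str, $limit) = @_;
    my $head = substr($str, 0, $limit);
    $head =~ s/([^\x00-\x7F])/sprintf '\\x{%X}', ord $1/ge;
    return $head . '...';
}

# Hypothetical helper: \N{NAME} for a code point, falling back to
# \N{U+XXXX} when the code point has no name.
sub named_escape {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    return defined $name ? '\\N{' . $name . '}' : sprintf('\\N{U+%04X}', $cp);
}

# Hypothetical helper: as head_hex, but with Unicode character names.
sub head_named {
    my ($str, $limit) = @_;
    my $head = substr($str, 0, $limit);
    $head =~ s/([^\x00-\x7F])/named_escape(ord $1)/ge;
    return $head . '...';
}

# The four words from the example, spelled with escapes.
my @words = (
    "\x{E9}lite",
    "\x{C6}var",
    "\x{3BC}\x{1FE6}\x{3B8}\x{3BF}\x{3C2}",
    "m\x{ED}o",
);

binmode STDOUT, ':encoding(UTF-8)';   # needed only for head_chars output
for my $word (@words) {
    print head_chars($word, 2), "\n";
    print head_hex($word, 2), "\n";
    print head_named($word, 2), "\n";
}
```

The point in all three cases is to truncate on character boundaries before escaping, so that the "..." can never land in the middle of a multibyte sequence.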

Thoughts?

--tom
