perl.unicode http://www.nntp.perl.org/group/perl.unicode/ ... Copyright 1998-2008 perl.org Sun, 27 Jul 2008 09:17:04 +0000 ask@perl.org Re: /\w/ match with 'use locale' misses letters in utf8 locale by Peter Volkov On Fri, 11/07/2008 at 09:00 +0200, Juerd Waalboer wrote:<br/>&gt; Peter Volkov skribis 2008-07-11 10:10 (+0400):<br/>&gt; &gt; The problem is that in Linux (Gentoo and Debian I&#39;ve tried) /\w/ does<br/>&gt; &gt; not match Russian letter while I use locale and LC_COLLATE is set to<br/>&gt; &gt; ru_RU.UTF-8.<br/>&gt; <br/>&gt; \w should match Cyrillic letters even without &quot;use locale&quot;. You might be<br/>&gt; running into an annoying bug which makes \w lose its unicode support<br/>&gt; depending on the *internal* state of a value.<br/><br/>This behavior is reproducible with cp1251 encoding too. So...<br/><br/>&gt; Despite the above there&#39;s a slightly more important issue here. You&#39;re<br/>&gt; opening a text file but you don&#39;t specify the character encoding.<br/><br/>seems to be the answer I was looking for. But this makes me wonder why<br/>use locale exists then? I thought that it should take &quot;default&quot; or not<br/>specified encoding from environment... And it is really questionable why in<br/>FreeBSD everything works.<br/><br/>In any case thank you Juerd for the very fast answer.<br/><br/>-- <br/>Peter.<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/07/msg3160.html Fri, 11 Jul 2008 02:16:23 +0000 Re: /\w/ match with 'use locale' misses letters in utf8 locale by Juerd Waalboer Peter Volkov skribis 2008-07-11 10:10 (+0400):<br/>&gt; The problem is that in Linux (Gentoo and Debian I&#39;ve tried) /\w/ does<br/>&gt; not match Russian letter while I use locale and LC_COLLATE is set to<br/>&gt; ru_RU.UTF-8.<br/><br/>\w should match Cyrillic letters even without &quot;use locale&quot;. 
You might be<br/>running into an annoying bug which makes \w lose its unicode support<br/>depending on the *internal* state of a value. To work around this bug,<br/>read Unicode::Semantics on CPAN and use it or utf8::upgrade.<br/><br/>&gt; Linux $ perl -e &#39;use locale; open(IN, &quot;&lt; test-file&quot;); while(&lt;IN&gt;) { print if /\w/; }&#39;<br/>&gt; string with spaces (not only with [:alnum:])<br/>&gt; English;<br/>&gt; hello_&#x43F;&#x440;&#x438;&#x432;&#x435;&#x442;<br/><br/>Despite the above there&#39;s a slightly more important issue here. You&#39;re<br/>opening a text file but you don&#39;t specify the character encoding.<br/>Likewise, you need to specify the encoding for output.<br/><br/>Assuming utf8 for both:<br/><br/> perl -le&#39;<br/> binmode STDOUT, &quot;:encoding(utf8)&quot;;<br/> open my $in, &quot;&lt; :encoding(utf8)&quot;, &quot;test-file&quot;;<br/> while (&lt;$in&gt;) {<br/> print &quot;match: [$1]&quot; if /(\w+)/;<br/> }<br/> &#39;<br/><br/>Which on my system prints:<br/><br/> match: [&#x441;&#x43B;&#x43E;&#x432;&#x43E;]<br/> match: [&#x441;&#x442;&#x440;&#x43E;&#x43A;&#x430;]<br/> match: [string]<br/> match: [English]<br/> match: [hello_&#x43F;&#x440;&#x438;&#x432;&#x435;&#x442;]<br/><br/>I&#39;m not sufficiently familiar with &quot;use encoding&quot; to say anything about<br/>it, but you shouldn&#39;t need it just for this.<br/><br/>&gt; Do I understand correctly that we should always supply encoding of<br/>&gt; streams?<br/><br/>Yes.<br/><br/>&gt; If yes, why in FreeBSD this works without supplying any encoding and is<br/>&gt; it possible (good idea) to do the same in Linux?<br/><br/>I have no idea.<br/>-- <br/>Met vriendelijke groet, Kind regards, Korajn salutojn,<br/><br/> Juerd Waalboer: Perl hacker &lt;#####@juerd.nl&gt; &lt;http://juerd.nl/sig&gt;<br/> Convolution: ICT solutions and consultancy &lt;sales@convolution.nl&gt;<br/>1;<br/> http://www.nntp.perl.org/group/perl.unicode/2008/07/msg3159.html Fri, 11 Jul 2008 00:00:29 
+0000 /\w/ match with 'use locale' misses letters in utf8 locale by Peter Volkov Hello. Should /\w/ work with &#39;use locale&#39; and correct environment set?<br/><br/>The problem is that in Linux (Gentoo and Debian I&#39;ve tried) /\w/ does<br/>not match Russian letter while I use locale and LC_COLLATE is set to<br/>ru_RU.UTF-8. The most strange thing is that in FreeBSD this works. Look:<br/><br/>+++++++++++++++++++++++++ FreeBSD ++++++++++++++++++++++++++++++++<br/>FreeBSD $ cat test-file<br/>&#x441;&#x43B;&#x43E;&#x432;&#x43E;<br/>&#x441;&#x442;&#x440;&#x43E;&#x43A;&#x430; &#x441; &#x43F;&#x440;&#x43E;&#x431;&#x435;&#x43B;&#x430;&#x43C;&#x438;<br/>string with spaces (not only with [:alnum:])<br/>English;<br/>hello_&#x43F;&#x440;&#x438;&#x432;&#x435;&#x442;<br/><br/>FreeBSD $ perl -e &#39;open(IN, &quot;&lt; test-file&quot;); while(&lt;IN&gt;) { print if /\w/; }&#39;<br/>string with spaces (not only with [:alnum:])<br/>English;<br/>hello_&#x43F;&#x440;&#x438;&#x432;&#x435;&#x442;<br/>FreeBSD $ perl -e &#39;use locale; open(IN, &quot;&lt; test-file&quot;); while(&lt;IN&gt;) { print if /\w/; }&#39;<br/>&#x441;&#x43B;&#x43E;&#x432;&#x43E;<br/>&#x441;&#x442;&#x440;&#x43E;&#x43A;&#x430; &#x441; &#x43F;&#x440;&#x43E;&#x431;&#x435;&#x43B;&#x430;&#x43C;&#x438;<br/>string with spaces (not only with [:alnum:])<br/>English;<br/>hello_&#x43F;&#x440;&#x438;&#x432;&#x435;&#x442;<br/>FreeBSD $<br/>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br/><br/>++++++++++++++++++++++++++++ Linux +++++++++++++++++++++++++++++++<br/>Linux $ perl -e &#39;use locale; open(IN, &quot;&lt; test-file&quot;); while(&lt;IN&gt;) { print if /\w/; }&#39;<br/>string with spaces (not only with [:alnum:])<br/>English;<br/>hello_&#x43F;&#x440;&#x438;&#x432;&#x435;&#x442;<br/>Linux $<br/>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++<br/><br/>locale -a shows that ru_RU.utf8 locale exists on both systems and I&#39;ve<br/>tried to set LANG and LC_ALL to this value 
with no result. Do I<br/>understand correctly that we should always supply encoding of streams?<br/>If yes, why in FreeBSD this works without supplying any encoding and is<br/>it possible (good idea) to do the same in Linux?<br/><br/>Thank you for your time.<br/>-- <br/>Peter.<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/07/msg3158.html Thu, 10 Jul 2008 23:10:53 +0000 CSets 2.1 released by Mark Leisher http://www.math.nmsu.edu/~mleisher/Software/csets<br/><br/>For those new to CSets:<br/><br/> &quot;The CSets collection is a set of mapping tables between various <br/>character sets and Unicode, and is intended to provide mappings not <br/>included in most character set conversion tools available today.&quot;<br/><br/>It&#39;s been a couple years since the last release. This release features <br/>the addition of three Guarani mappings (available individually for a <br/>couple years already), and two new Serbian mappings.<br/><br/>Future updates will be posted on freshmeat.net for those who like to <br/>track updates through subscriptions there, and I will always notify <br/>these lists of updates as well.<br/><br/>As always, corrections, new mapping tables, information about mappings, <br/>and even pointers to things like fonts or texts with odd encodings are <br/>gladly accepted.<br/>-- <br/>Mark Leisher<br/> http://www.nntp.perl.org/group/perl.unicode/2008/05/msg3157.html Fri, 30 May 2008 15:20:40 +0000 Re: TAP YAML Diagnostics by Nicholas Clark On Sun, Apr 06, 2008 at 08:41:11AM -0700, Ovid wrote:<br/><br/>&gt; Currently you can shove anything in there you want, but you must use<br/>&gt; upper-case keys for your personal use and all lower-case keys are<br/>&gt; reserved (and it&#39;s a parse error to use an unknown lower-case key). 
<br/>&gt; Are there any strange Unicode issues where we might get confused about<br/>&gt; what is upper and lower case?)<br/><br/>I believe that there are code points which would be considered word<br/>characters but do not have distinct upper and lower case forms (or by<br/>implication title case either), but I hope that the good folks of<br/>perl-unicode will correct me if I&#39;m wrong.<br/><br/>Hence I&#39;m not sure what the most efficient way of determining if<br/>something is all lower case is. If I&#39;m right, one can&#39;t just test<br/><br/> if ($string eq lc $string)<br/><br/>because these code points would mess you up, and I *assume* that they<br/>are not those which you want to consider reserved. I guess that one<br/>needs to loop over all characters in the string, and verify that if<br/>$char eq lc $char then also $char ne uc $char. (But one could first<br/>short circuit the common pass case with the test above)<br/><br/>Nicholas Clark<br/> http://www.nntp.perl.org/group/perl.unicode/2008/04/msg3156.html Sun, 06 Apr 2008 09:33:30 +0000 Re: how to request for a new module by Darren Duncan Bayanzul,<br/><br/>Go read http://cpan.org/modules/04pause.html and it should tell you or <br/>introduce you to everything you need to know. Read the whole thing (it <br/>isn&#39;t very long).<br/><br/>-- Darren Duncan<br/><br/>bayanzul lodoysamba wrote:<br/>&gt; Dear all,<br/>&gt; <br/>&gt; I have a question about submitting a new module to CPAN. How can one submit a module to CPAN?<br/>&gt; What is the procedure for it? 
Is there any qualification for submitting a module?<br/>&gt; <br/>&gt; Currently we are interested in developing a new module for converting Unicode strings in traditional mongolian script into different formats.<br/>&gt; It will include functions that perform convertions between: <br/>&gt; Basic Character Set &lt;-&gt; Presentation Set &amp; Ligatures<br/>&gt; Basic Character Set &lt;-&gt; Transliterate in Latin characters<br/>&gt; <br/>&gt; I guess, it is the best if the package would be under Unicode::, or Encode::. <br/>&gt; <br/>&gt; Please give us your suggestions. <br/>&gt; Waiting for your reply<br/>&gt; <br/>&gt; <br/>&gt; Regards,<br/>&gt; Bayanzul.L<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3155.html Thu, 27 Mar 2008 23:30:38 +0000 how to request for a new module by bayanzul lodoysamba Dear all,<br/><br/>I have a question about submitting a new module to CPAN. How can one submit a module to CPAN?<br/>What is the procedure for it? Is there any qualification for submitting a module?<br/><br/>Currently we are interested in developing a new module for converting Unicode strings in traditional mongolian script into different formats.<br/>It will include functions that perform convertions between: <br/> Basic Character Set &lt;-&gt; Presentation Set &amp; Ligatures<br/> Basic Character Set &lt;-&gt; Transliterate in Latin characters<br/><br/>I guess, it is the best if the package would be under Unicode::, or Encode::. <br/><br/>Please give us your suggestions. <br/>Waiting for your reply<br/><br/><br/>Regards,<br/>Bayanzul.L<br/><br/>
 http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3154.html Thu, 27 Mar 2008 04:03:41 +0000 Re: Pack and Unpack are Broken for > 0x7FFF_FFFF (in 5.10.0) by Chris Hall On Sun, 16 Mar 2008 I wrote<br/>&gt;I thought &#39;C&#39; worked on Octets ? Which is what 5.8.8 appears to be<br/>&gt;doing, but not 5.10.0.<br/><br/>I apologise... I should have read the perldelta.<br/><br/>I now understand that:<br/><br/> * in v5.8.8 one would say unpack(&#39;C*&#39;, ...) and get the underlying<br/> octets, if the string was a &#39;wide&#39; (UTF8) string.<br/><br/> in v5.10.0 this is no longer possible, C in a &#39;wide&#39; string<br/> returns the wide character value.<br/><br/> * in v5.10.0, in unpack &#39;C&#39; and &#39;W&#39; are the same as each other.<br/> (At least, I cannot tell the difference.)<br/><br/> I cannot imagine why.<br/><br/> * in v5.10.0, in pack &#39;C&#39; and &#39;W&#39; are the same as each other, except<br/> that &#39;C&#39; masks the value down to 0..255.<br/><br/>Exchanging the meaning of &#39;U0&#39; and &#39;C0&#39; between 5.8.X and 5.10.0 was a<br/>stroke of genius.<br/><br/>The pack v5.10.0 documentation says:<br/><br/> &quot;Pack and unpack can operate in two modes, character mode (C0 mode)<br/> where the packed string is processed per character and UTF-8 mode<br/> (U0 mode) where the packed string is processed in its UTF-8-encoded<br/> Unicode form on a byte by byte basis. Character mode is the default<br/> unless the format string starts with an U . You can switch mode at<br/> any moment with an explicit C0 or U0 in the format. A mode is in<br/> effect until the next mode switch or until the end of the ()-group<br/> in which it was entered.&quot;<br/><br/>Where UTF-8 mode appears to mean the exact opposite of UTF8 in<br/>connection with the state of a string value. 
The given meaning for &#39;C&#39;<br/>is &quot;An unsigned char (octet) value&quot; doesn&#39;t help clarify things :-(<br/><br/>Anywho, v5.10.0 will:<br/><br/> pack(&#39;C*&#39;, 192,176,128,21) -&gt; &quot;\xC0\xB0\x80\x15&quot; (byte:4)<br/> pack(&#39;W*&#39;, 192,176,128,21) -&gt; &quot;\xC0\xB0\x80\x15&quot; (byte:4)<br/> pack(&#39;U*&#39;, 192,176,128,21) -&gt; &quot;\xC0\xB0\x80\x15&quot; (wide:4)<br/><br/> i.e. &#39;C&#39; and &#39;W&#39; produce &#39;byte&#39; form strings, but &#39;U&#39; produces<br/> a &#39;wide&#39; character string.<br/><br/> You might have thought that &#39;W&#39; would generate a (&#39;wide&#39;) character<br/> string... (but then it would be identical to &#39;U&#39; !)<br/><br/> pack(&#39;C0C*&#39;, 192,176,128,21) -&gt; &quot;\xC0\xB0\x80\x15&quot; (byte:4)<br/> pack(&#39;C0W*&#39;, 192,176,128,21) -&gt; &quot;\xC0\xB0\x80\x15&quot; (byte:4)<br/> pack(&#39;C0U*&#39;, 192,176,128,21) -&gt; &quot;\xC3\x80\xC2\xB0\xC2\x80\x15&quot;<br/> (byte:7)<br/><br/> so C0 prefix hasn&#39;t changed the result, except for C0U where we<br/> have been given a byte form string, containing the UTF-8.<br/><br/> pack(&#39;U0C*&#39;, 192,176,128,21) -&gt; &quot;\x00\x00\x15&quot; (wide:3)<br/> pack(&#39;U0W*&#39;, 192,176,128,21) -&gt; &quot;\x00\x00\x15&quot; (wide:3)<br/> pack(&#39;U0U*&#39;, 192,176,128,21) -&gt; &quot;\xC0\xB0\x80\x15&quot; (wide:4)<br/><br/> which does create a (&#39;wide&#39;) character string.<br/><br/> This makes no difference to &#39;U*&#39;, but for &#39;C*&#39; and &#39;W*&#39;, the &#39;byte&#39;<br/> string produced has been decoded as UTF-8, replacing bad<br/> sequences by \x00 !<br/><br/> I suppose this makes sense.<br/><br/> Though one might have imagined this would mean that the result of<br/> C* &amp; W* would be &#39;utf8::upgrade&#39;d to &#39;wide&#39; characters.<br/><br/> pack(&#39;C*&#39;, 257,359,477,91) -&gt; &quot;\x01\x67\xDD\x5B&quot; (byte:4)<br/> pack(&#39;W*&#39;, 257,359,477,91) -&gt; 
&quot;\x{101}\x{167}\x{1DD}\x5B&quot; (wide:4)<br/> pack(&#39;U*&#39;, 257,359,477,91) -&gt; &quot;\x{101}\x{167}\x{1DD}\x5B&quot; (wide:4)<br/><br/> pack(&#39;C0C*&#39;, 257,359,477,91) -&gt; &quot;\x01\x67\xDD\x5B&quot; (byte:4)<br/> pack(&#39;C0W*&#39;, 257,359,477,91) -&gt; &quot;\x{101}\x{167}\x{1DD}\x5B&quot; (wide:4)<br/> pack(&#39;C0U*&#39;, 257,359,477,91) -&gt; &quot;\xC4\x81\xC5\xA7\xC7\x9D\x5B&quot;<br/> (byte:7)<br/><br/> So W* and U* produce the same &#39;wide&#39; character string.<br/><br/> But &#39;C0W*&#39; and &#39;C0U*&#39; are quite different.<br/><br/> I have no idea why.<br/><br/>Almost equally amusing:<br/><br/> $p = pack(&#39;W&#39;, 248) -&gt; &quot;\xF8&quot; (byte:1)<br/><br/> unpack(&#39;C*&#39;) -&gt; (0xF8)<br/> unpack(&#39;C0C*&#39;) -&gt; (0xF8)<br/> unpack(&#39;U0C*&#39;) -&gt; (0xC3, 0xB8)<br/><br/>So, &#39;U0C&#39; seems to utf8::upgrade($p) before unpacking the raw bytes.<br/><br/>Whereas:<br/><br/> $p = pack(&#39;W&#39;, 257) -&gt; &quot;\x{101}&quot; (wide:1)<br/><br/> unpack(&#39;C*&#39;) -&gt; (0x101)<br/> unpack(&#39;C0C*&#39;) -&gt; (0x101)<br/> unpack(&#39;U0C*&#39;) -&gt; (0xC4, 0x81)<br/><br/>Now, one can argue that poking around inside how characters are encoded<br/>in wide strings is a Bad Thing. But &#39;U0...&#39; is exposing just that.<br/><br/>However, there is no straightforward way to unpack a string into raw<br/>bytes without testing first for whether the string is &#39;wide&#39; or not:<br/><br/> @b = unpack(utf8::is_utf8($s) ? 
&#39;U0C*&#39; : &#39;C*&#39;, $s) ;<br/><br/>Chris<br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3153.html Sun, 16 Mar 2008 11:54:24 +0000 Re: Pack and Unpack are Broken for > 0x7FFF_FFFF (in 5.10.0) by Chris Hall On Sun, 16 Mar 2008 I wrote<br/>....<br/>&gt;Consider:<br/>&gt;<br/>&gt; use warnings ;<br/>&gt;<br/>&gt; sub sp {<br/>&gt; my ($v) = @_ ;<br/>&gt;<br/>&gt; my $p = pack(&#39;U&#39;, $v) ;<br/>&gt; my @t = unpack(&#39;C*&#39;, $p) ;<br/>&gt;<br/>&gt; printf &#39;\x%04X_%04X: &#39;, ($v &gt;&gt; 16), $v &amp; 0xFFFF ;<br/>&gt; print map sprintf(&#39;\x%02X&#39;, $_), @t ;<br/>&gt; print &quot;\n&quot; ;<br/>&gt; } ;<br/>...<br/> &gt; sp(0x7FFF_FFFF) ;<br/><br/>...<br/>&gt;v5.8.8 result:<br/>&gt;<br/>&gt; \x7FFF_FFFD: \xFD\xBF\xBF\xBF\xBF\xBD<br/>...<br/>&gt;v5.10.0 result:<br/>&gt;<br/>&gt; \x7FFF_FFFD: \x7FFFFFFD<br/><br/>I didn&#39;t see this in all the clutter of warnings.<br/><br/>I thought &#39;C&#39; worked on Octets ? Which is what 5.8.8 appears to be <br/>doing, but not 5.10.0.<br/><br/>Chris<br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3152.html Sun, 16 Mar 2008 08:51:17 +0000 Pack and Unpack are Broken for > 0x7FFF_FFFF (in 5.10.0) by Chris Hall <br/>More confusion about the valid range of characters in Perl.<br/><br/>Both v5.8.8 and v5.10.0 Perl will pack(&#39;U&#39;, $v) for values of $v which<br/>are &gt; 0x7FFF_FFFF. 
The result is the (non-standard) Perl utf8 encoding<br/>for such characters.<br/><br/>v5.8.8 Perl will unpack a string containing the non-standard encoding.<br/><br/>v5.10.0 Perl will not.<br/><br/>Consider:<br/><br/> use warnings ;<br/><br/> sub sp {<br/> my ($v) = @_ ;<br/><br/> my $p = pack(&#39;U&#39;, $v) ;<br/> my @t = unpack(&#39;C*&#39;, $p) ;<br/><br/> printf &#39;\x%04X_%04X: &#39;, ($v &gt;&gt; 16), $v &amp; 0xFFFF ;<br/> print map sprintf(&#39;\x%02X&#39;, $_), @t ;<br/> print &quot;\n&quot; ;<br/> } ;<br/><br/> sp(0x7FFF_FFFD) ;<br/> sp(0x8000_0000) ;<br/> sp(0xFFFF_FFFD) ;<br/><br/>v5.8.8 result:<br/><br/> \x7FFF_FFFD: \xFD\xBF\xBF\xBF\xBF\xBD<br/> \x8000_0000: \xFE\x82\x80\x80\x80\x80\x80<br/> \xFFFF_FFFD: \xFE\x83\xBF\xBF\xBF\xBF\xBD<br/><br/>v5.10.0 result:<br/><br/> \x7FFF_FFFD: \x7FFFFFFD<br/> Malformed UTF-8 character (byte 0xfe) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0x82, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0x80, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0x80, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0x80, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0x80, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0x80, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> \x8000_0000: \x00\x00\x00\x00\x00\x00\x00<br/> Malformed UTF-8 character (byte 0xfe) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0x83, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 
0xbf, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0xbf, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0xbf, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0xbf, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> Malformed UTF-8 character (unexpected continuation byte 0xbd, with no<br/> preceding start byte) in unpack at tpbug.pl line 7.<br/> \xFFFF_FFFD: \x00\x00\x00\x00\x00\x00\x00<br/><br/>And, FWIW, in 64-bit v5.8.8, pack(&#39;U&#39;, $v) appears to mask the $v value<br/>to unsigned 32-bits before attempting to pack !<br/><br/>-- <br/>Chris Hall highwayman.com +44 7970 277 383<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3151.html Sun, 16 Mar 2008 08:25:04 +0000 Re: UTF-8 (strict) appears borken by Chris Hall <br/>I have prepared a bug report, as below.<br/><br/>I don&#39;t want to waste everybody&#39;s time if this is thought to be a <br/>feature...<br/><br/>...so if anyone thinks this is not a bug, please shout (soon).<br/><br/>Thanks,<br/><br/>Chris<br/><br/>-----------------------------------------------------------------<br/>[Please enter your report here]<br/><br/><br/>Encode::encode(&#39;UTF-8&#39;, $foo) and Encode::decode(&#39;UTF-8&#39;, $bar) detect <br/>the<br/>Unicode &#39;non-character&#39; U+FFFF and treat it as an error.<br/><br/>There are 65 other Unicode non-characters:<br/><br/> U+FFFE<br/> U+01FFFE, U+02FFFE, U+03FFFE, ... U+10FFFE<br/> U+01FFFF, U+02FFFF, U+03FFFF, ... U+10FFFF<br/> U+FDD0..U+FDEF<br/><br/>which one would expect to be treated the same as U+FFFF.<br/><br/>They aren&#39;t. 
They are accepted as normal characters.<br/><br/>This appears to be a bug.<br/><br/><br/>[Please do not change anything below this line]<br/>-----------------------------------------------------------------<br/><br/><br/><br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3150.html Sat, 15 Mar 2008 04:57:36 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Chris Hall On Wed, 12 Mar 2008 Juerd Waalboer wrote<br/>&gt;Chris Hall skribis 2008-03-12 20:49 (+0000):<br/>&gt;&gt; a. are you saying that characters in Perl are Unicode ?<br/><br/>&gt;Yes. They are called Unicode, at least. This has my preference for<br/>&gt;explanation and documentation.<br/><br/>&gt;&gt; b. or are you agreeing that characters in Perl take values<br/>&gt;&gt; 0..0x7FFF_FFFF (or beyond), which are generally interpreted as<br/>&gt;&gt; UCS, where required and possible ?<br/><br/>&gt;This too. This is the more technically accurate explanation, and has my<br/>&gt;preference for implementation.<br/><br/>&#39;This too&#39; ? Goodness, superimposition ! Perl and quantum mechanics ? <br/>Suddenly it all becomes clear. Or at least as clear as the uncertainty <br/>principle will allow !-)<br/><br/>FWIW, I have tried some of the HTTP, HTML and XML modules. 
The warnings <br/>that pop out every now and then about Unicode or UTF-8 or whatever are <br/>less than useful and more than irritating !<br/><br/>&gt;&gt; If (a) then characters with ordinals beyond 0x10_FFFF should throw<br/>&gt;&gt; warnings (at least) since they clearly are not Unicode !<br/><br/>&gt;Perl just has a somewhat broad definition of &quot;unicode&quot;, that is not<br/>&gt;the same as the official unicode character set.<br/><br/>BTW, in &quot;2.14 Conforming to the Unicode Standard&quot; I found this gem:<br/><br/> Unacceptable Behavior<br/><br/> It is unacceptable for a conforming implementation:<br/><br/> - To use unassigned codes.<br/><br/> &bull; U+2073 is unassigned and not usable for &lsquo;3&rsquo; (superscript 3) or<br/> any other character.<br/><br/>This appears to say that unassigned codes should not be transmitted out, <br/>same like non-characters ! Which looks like hard work. (On the other <br/>hand, applications are supposed to cope with future defined code <br/>points...)<br/><br/>Should &#39;UTF-8&#39; be strict about unassigned codes as well ? What should <br/>chr() and &quot;\x{...}&quot; etc. do ?<br/><br/>This reinforces my view that chr(n) is (a) wrong to whinge about <br/>surrogates and non-characters, and (b) wrong to return a character for n <br/>outside 0x..7FFF_FFFF. IMO:<br/><br/> - chr() shouldn&#39;t worry about strict UCS ...<br/><br/> - ... and doesn&#39;t, in an case, do a complete job<br/> [it does spot all non-characters and surrogates, but ignores<br/> unassigned codes.]<br/><br/> - ... however, non-characters are perfectly legal UCS, at least for<br/> internal use. One can argue for jumping all over these when<br/> outputting (strict) UTF-8 for external exchange.<br/><br/> - ... and 0x11_FFFE is not defined by UCS to be a non-character,<br/> it&#39;s not defined in UCS at all, any more than any other character<br/> code &gt; U+10_FFFF !<br/><br/> - chr(n) doesn&#39;t whinge about characters &gt; U+10_FFFF ! 
(Except for<br/> the non-characters it has invented !)<br/><br/> - the answer to chr(-1) is &#39;not a character at all&#39; -- it isn&#39;t &#39;the<br/> character that stands in place of some unknown character&#39;<br/><br/> - the utility of characters &gt; 0x7FFF_FFFF is not worth (a) the kludge<br/> required to extend utf8, or (b) the interoperability issues.<br/><br/> Even encode/decode &#39;utf8&#39; take a dim view of chars &gt; 0x7FFF_FFFF.<br/><br/> I note that utf8::valid() rejects characters &gt; 0x7FFF_FFFF !<br/><br/> - chr(n) accepts characters &gt; 0x7FFF_FFFF, even though the result<br/> is not valid per utf8::valid() !!<br/><br/> - chr(n) warns about p + 0xFFFE and p + 0xFFFF for every value of &#39;p&#39;,<br/> even those which are beyond the Unicode range !<br/><br/>&gt;It has its own utf8, it can have its own unicode too :)<br/><br/>And there was I thinking that things were already sufficiently confused <br/>:-}<br/><br/>The &#39;utf8&#39; decode does the Right Thing -- it decodes well-formed UTF-8 <br/>up to 0x7FFF_FFFF and handles errors and incomplete sequences and <br/>doesn&#39;t concern itself with the minutiae of UCS (surrogates, <br/>non-characters and unassigned codes).<br/><br/>This is nicely consistent with utf8::valid().<br/><br/>[The only thing I would argue about is the separate treatment of each <br/>byte of an invalid sequence -- I&#39;d be tempted to treat 0x00..0x7F and <br/>0xC0..0xFF as terminators of an invalid sequence and 0x80..0xBF as <br/>members of an invalid sequence.]<br/><br/>If &#39;unicode&#39; were to follow that model, then chr() and friends could <br/>stop throwing (spurious) warnings around the place.<br/><br/>Sadly, &#39;utf8&#39; encode doesn&#39;t care, and outputs whatever is in the <br/>string -- including redundant sequences, invalid sequences, incomplete <br/>sequences and Perl&#39;s extended sequences for &gt; 0x7FFF_FFFF. That is, it <br/>will happily output something that utf8::valid would reject. 
Note that <br/>this &quot;encoding&quot; is outputting something that &#39;utf8&#39; decode won&#39;t accept.<br/><br/>If you really want what &#39;utf8&#39; encode currently does you can force <br/>characters to octets (wax off) and output. The reverse is to input the <br/>octets and force to characters (wax on).<br/><br/>Summary of Observations<br/>-----------------------<br/><br/> * chr(n) and friends are broken:<br/><br/> - they winge about things that are none of their business, which is<br/> not consistent with the notion of (lax) &#39;unicode&#39;.<br/><br/> - the wingeing about not-(strict)-Unicode is, moreover, incomplete<br/> (unassigned codes and codes beyond the UCS range are allowed !)<br/><br/> - non-characters are perfectly legal -- just not suitable for<br/> external exchange.<br/><br/> - projecting non-characters beyond the UCS range is plain odd.<br/><br/> - they create invalid (per utf8::valid()) strings<br/><br/> - invalid &#39;n&#39; should return an &#39;invalid&#39; (i.e. undef) response<br/><br/> * &#39;utf8&#39; encode is broken:<br/><br/> - it should not output stuff that is not at least utf8:valid()<br/><br/> - it should be symmetrical with &#39;utf8&#39; decode<br/><br/> * characters &gt; 0x7FFF_FFFF are not utf8::valid. I think that&#39;s a<br/> good call -- but Perl is not consistent, and will happily produce<br/> invalid strings...<br/><br/> * &#39;UTF-8&#39; is broken:<br/><br/> - it doesn&#39;t know about all the defined non-characters.<br/><br/> - there should be an option to allow non-characters for internal<br/> exchange of otherwise strict UTF-8.<br/><br/> - BTW: the Unicode reference code for UTF8 to UTF32 does not trouble<br/> itself about non-characters. Nor does UTF32 to UTF8.<br/><br/>Chris<br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3149.html Thu, 13 Mar 2008 10:16:06 +0000 UTF-8 (strict) appears borken by Chris Hall <br/>1. 
&#39;Ill-formed&#39; UTF-8<br/>=====================<br/><br/>The Unicode Standard specifies that any UTF-8 sequence that does not<br/>correspond to this table is &#39;ill-formed&#39;:<br/><br/> Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |<br/> -------------------+----------+----------+----------+----------+<br/> U+0000..U+007F | 00..7F | -- | -- | -- |<br/> U+0080..U+07FF | C2..DF | 80..BF | -- | -- |<br/> U+0800..U+0FFF | E0 | A0..BF | 80..BF | -- |<br/> U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | -- |<br/> U+D000..U+D7FF | ED | 80..9F | 80..BF | -- |<br/> U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | -- |<br/> U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |<br/> U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |<br/> U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |<br/><br/>Note in particular that:<br/><br/> - anything beyond U+10FFFF is ill-formed.<br/><br/> - anything U+D800..U+DFFF is ill-formed.<br/><br/> - only one encoding for each Code Point is well-formed.<br/><br/>We&#39;d expect UTF-8 decode to spot ill-formed sequences. Though some<br/>special handling of incomplete sequences at the end of a buffer would be<br/>handy.<br/><br/>We&#39;d expect UTF-8 encode to only generate well-formed sequences.<br/><br/>2. Extended Sequences<br/>=====================<br/><br/>Unicode and ISO/IEC 10646:2003 define meanings for UTF-8 compatible<br/>sequences up to 6 bytes, which allows for characters up to 0x7FFF_FFFF.<br/><br/>The Unicode reference code for reading UTF-8 recognises these extended<br/>sequences as being single entities (though ill-formed).<br/><br/>Perl has its own further 7 and 13 byte forms, allowing for characters up<br/>to 0xF_FFFF_FFFF and 2^72-1, respectively. These are beyond UTF-8.<br/><br/>3. Non-Characters<br/>=================<br/><br/>The only other cause for concern are non-characters. 
These are:<br/><br/> * U+FFFE and U+FFFF and the last two code points in every other<br/> Unicode plane.<br/><br/> Unicode code space is divided into 17 &#39;planes&#39; of 65,536 characters,<br/> each. So characters U+01_FFFE, U+01_FFFF, U+02_FFFE, U+02_FFFF, ...<br/> U+10_FFFE and U+10_FFFF are all non-characters.<br/><br/> * U+FDD0..U+FDEF<br/><br/>Now, Unicode 5.0.0 says:<br/><br/> &quot;Applications are free to use any of these noncharacter code points<br/> internally but should never attempt to exchange them. If a<br/> noncharacter is received in open interchange, an application is not<br/> required to interpret it in any way. It is good practice, however,<br/> to recognize it as a noncharacter and to take appropriate action,<br/> such as removing it from the text.&quot;<br/><br/> &quot;Noncharacter code points are reserved for internal use, such as for<br/> sentinel values. They should never be interchanged. They do, however,<br/> have well-formed representations in Unicode encoding forms and<br/> survive conversions between encoding forms. This allows sentinel<br/> values to be preserved internally across Unicode encoding forms, even<br/> though they are not designed to be used in open interchange.&quot;<br/><br/>So... this is not so clear-cut. For &quot;open interchange&quot; UTF-8 should<br/>disallow the non-characters. However, for local storage of Unicode<br/>stuff, non-characters should be allowed.<br/><br/>4. 
What &#39;UTF-8&#39; Does<br/>====================<br/><br/>Ill-formed sequences -- fine (mostly):<br/><br/> * UTF-8 decode treats these as errors, and will stop or use fallback<br/> decoding as required.<br/><br/> The default fallback is:<br/><br/> - errors for sequence &lt;= 0x7FFF_FFFF -- replaced by U+FFFD<br/><br/> *** information is being lost, here :-(<br/><br/> - anything else: each byte which is not recognised as being part<br/> of a complete 2..6 byte sequence is replaced by U+FFFD<br/><br/> *** so one cannot distinguish ill-formed sequences from<br/> out of range characters.<br/><br/> The PERLQQ, HTMLCREF and XMLCREF fallbacks are:<br/><br/> - errors for sequence &lt;= 0x7FFF_FFFF -- replaced by the<br/> respective escape sequence for the character value.<br/><br/> This ought to work if the data is HTML or XML, where new escape<br/> sequences fit right in if HTMLCREF or XMLCREF is used.<br/><br/> *** PERLQQ, however, may fail if &#39;\&#39; appears in the input and<br/> the sender has not escaped it !<br/><br/> Perhaps PERLQQ should escape &#39;\&#39; that appear in the input ?<br/><br/> *** In all cases, however, all that&#39;s been achieved is that<br/> non-UTF-8 characters have been transliterated. It&#39;s still<br/> a puzzle what may be done with these characters !<br/><br/> - anything else: each byte which is not recognised as being part<br/> of a complete up to 6 byte sequence is replaced by the<br/> respective escape sequence for the byte value.<br/><br/> *** this is impossible to distinguish from escaped values which<br/> could exist in the input !<br/><br/> * UTF-8 encode will not generate ill-formed sequences and treats out<br/> of range character values as errors. Errors will stop encoding or<br/> cause the fallback encoding to be used.<br/><br/> The default fallback is:<br/><br/> - errored characters &lt;= 0x7FFF_FFFF -- replaced by U+FFFD<br/><br/> *** Not much one can do here. 
It&#39;s not clear that U+FFFD is a<br/> good thing to output -- one could argue for discarding<br/> this rubbish, instead ?<br/><br/> - 0x8000_0000 and greater -- replaced by seven or thirteen U+FFFD,<br/> depending on the length of the Perl internal form !!!<br/><br/> *** This is also more than a bit odd !!<br/><br/> The PERLQQ, HTMLCREF and XMLCREF fallbacks are:<br/><br/> - errored characters &lt;= 0x7FFF_FFFF -- replaced by the<br/> respective escape sequence for the character value.<br/><br/> This ought to work if the data is HTML or XML, where new escape<br/> sequences fit right in if HTMLCREF or XMLCREF is used.<br/><br/> *** PERLQQ, however, may fail if &#39;\&#39; appears in the output and<br/> the sender has not escaped it !<br/><br/> Perhaps PERLQQ should escape &#39;\&#39; that appear in the output ?<br/><br/> *** In all cases, however, all that&#39;s been achieved is that<br/> non-UTF-8 characters have been transliterated. It&#39;s still<br/> a puzzle what may be done with these characters !<br/><br/> - 0x8000_0000 and greater -- replaced by the seven or thirteen<br/> bytes that comprise the Perl internal form, each as its<br/> respective escape sequence !!!<br/><br/> *** This is also more than a bit odd !!<br/><br/>Incomplete sequences -- fine, but not documented !<br/><br/> * UTF-8 decode generally treats these as ill-formed, as above.<br/><br/> However, the STOP_AT_PARTIAL CHECK option will cause decode to stop,<br/> without error (so without invoking the fallback).<br/><br/>Non-Character Values -- inconsistent and arguable !!<br/><br/> As noted above, one can argue for two approaches here, depending on<br/> whether the data being en/decoded is internal or external.<br/><br/> For internal data, non-characters are valid and should be preserved.<br/><br/> For external data, non-characters should not be sent or received. 
One<br/> can debate whether they should be dropped or replaced or escaped.<br/><br/> UTF-8 encode/decode recognise only U+FFFF as a non-character, and<br/> treat it as an error.<br/><br/> *** This looks like a bug. If non-character values are to be treated<br/> as errors, I suggest all non-character values should be so<br/> treated.<br/><br/> *** This caters only for external data exchange.<br/><br/> The error handling is as for ill-formed sequences, see above.<br/><br/>5. Conclusion &#39;UTF-8&#39; is broken<br/>===============================<br/><br/> * the non-character handling is incomplete.<br/><br/> * it can be argued that there should be an option to accept/allow non-<br/> character values.<br/><br/> * the various fallback options are all less than satisfactory in their<br/> own way.<br/><br/> One can see why the ref:Sub CHECK argument was invented.<br/><br/> HOWEVER: it would be handy if there was a second parameter passed to<br/> the CHECK subroutine, telling it *why* the given sequence cannot<br/> be encoded/decoded, in particular:<br/><br/> -- out of range character value<br/><br/> -- ill-formed sequence (and could pass in everything up to the next<br/> not invalid byte ?)<br/><br/> -- non-character<br/><br/> -- incomplete sequence<br/><br/> for otherwise the subroutine has to do all the work to figure this<br/> out for itself !<br/><br/>------------------------------------------------------------------------<br/><br/>It is clear that what data is valid, and how to deal with invalid data,<br/>is really up to the application. 
Trying to be helpful in Encode/Decode<br/>is apparently tricky.<br/><br/>It is also clear that a lot of heavy duty character/byte bashing would<br/>be better if it could be provided in XS land.<br/><br/>However, thinking about some simple but general mechanism for this is<br/>making my head hurt.<br/><br/>[I&#39;m going to go away now, and lie down.]<br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3148.html Wed, 12 Mar 2008 17:39:51 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Juerd Waalboer Chris Hall skribis 2008-03-12 20:49 (+0000):<br/>&gt; a. are you saying that characters in Perl are Unicode ?<br/><br/>Yes. They are called Unicode, at least. This has my preference for<br/>explanation and documentation.<br/><br/>&gt; b. or are you agreeing that characters in Perl take values<br/>&gt; 0..0x7FFF_FFFF (or beyond), which are generally interpreted as<br/>&gt; UCS, where required and possible ?<br/><br/>This too. This is the more technically accurate explanation, and has my<br/>preference for implementation.<br/><br/>&gt; If (a) then characters with ordinals beyond 0x10_FFFF should throw <br/>&gt; warnings (at least) since they clearly are not Unicode !<br/><br/>Perl just has a somewhat broad definition of &quot;unicode&quot;, that is not<br/>the same as the official unicode character set.<br/><br/>It has its own utf8, it can have its own unicode too :)<br/>-- <br/>Met vriendelijke groet, Kind regards, Korajn salutojn,<br/><br/> Juerd Waalboer: Perl hacker &lt;#####@juerd.nl&gt; &lt;http://juerd.nl/sig&gt;<br/> Convolution: ICT solutions and consultancy &lt;sales@convolution.nl&gt;<br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3147.html Wed, 12 Mar 2008 13:58:17 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Chris Hall On Wed, 12 Mar 2008 Juerd Waalboer wrote<br/>&gt;Chris Hall skribis 2008-03-12 13:20 (+0000):<br/>....<br/>&gt;&gt; String literals are represented by UCS code points. 
Which<br/>&gt;&gt; reinforces the feeling that characters in Perl are Unicode.<br/><br/>&gt;Yes!<br/><br/>OK. For the avoidance of doubt:<br/><br/> a. are you saying that characters in Perl are Unicode ?<br/><br/> b. or are you agreeing that characters in Perl take values<br/> 0..0x7FFF_FFFF (or beyond), which are generally interpreted as<br/> UCS, where required and possible ?<br/><br/>If (a) then characters with ordinals beyond 0x10_FFFF should throw <br/>warnings (at least) since they clearly are not Unicode !<br/><br/>....[in the context of U+D800..U+DFFF]<br/>&gt;&gt; &quot;Isolated surrogate code units have no interpretation on<br/>&gt;&gt; their own.&quot;<br/>&gt;&gt; (...)<br/>&gt;&gt; Clearly these are illegal in UTF-8.<br/><br/>&gt;They have no interpretation, but this also doesn&#39;t say it&#39;s illegal.<br/><br/>The Unicode Standard defines the set of &#39;Unicode scalar values&#39; which <br/>consists of U+0000..U+D7FF and U+E000..U+10_FFFF. All Unicode <br/>encodings, including UTF-8, encode only the &#39;Unicode scalar values&#39;.<br/><br/>The code points U+D800..U+DFFF exist, but do &quot;not contain any character <br/>assignments&quot;. Given that no Unicode encoding exists that allows these <br/>code points, it&#39;s unclear how one would ever end up with one of these <br/>things on its hands !<br/><br/>....[in the context of U+FFFE, U+FFFF etc.]<br/>&gt;&gt; &quot;Applications are free to use any of these noncharacter code<br/>&gt;&gt; points internally but should never attempt to exchange<br/>&gt;&gt; them.<br/><br/>&gt;I think it&#39;s not Perl&#39;s job to prevent exchange. Simply because the<br/>&gt;exchange could be internal, but between processes of the same program.<br/><br/>Well UTF-8 is jumping all over U+FFFF (at least). 
The warnings thrown <br/>by chr() and &quot;\x{h...h}&quot; suggest that Perl feels that exchanging these <br/>values ain&#39;t kosher.<br/><br/>&gt;&gt; I&#39;m puzzled as to why &#39;UTF-8&#39; (strict) doesn&#39;t treat U+FFFE (and<br/>&gt;&gt; friends) in the same way as U+FFFF (and friends).<br/><br/>&gt;My gut says it&#39;s out of ignorance of the &quot;rules&quot;, and certainly not an<br/>&gt;intentional deviation.<br/><br/>Well... I&#39;m running some more tests on UTF-8 to see what it thinks is <br/>illegal.<br/><br/>.....................................<br/>&gt;&gt; &gt;The result is Unicode.<br/>&gt;&gt; IMHO the result of chr(n) should just be a character.<br/><br/>&gt;We call that a unicode character in Perl. It is true that Perl allows<br/>&gt;ordinal values outside the currently existing range, but it is still<br/>&gt;called unicode by Perl&#39;s documentation.<br/><br/>OK. This is the hair which I am splitting.<br/><br/>IMHO the things in strings and the things that chr() and ord() return or <br/>process should be plain characters (ordinal U_INT) -- so that these are <br/>general purpose. Only when it&#39;s necessary to attach meaning to the <br/>characters in a string, should Perl treat them as Unicode code points -- <br/>I accept that this is most of the time (but not *all* the time).<br/><br/>&gt;&gt; FWIW I note that printf &quot;%vX&quot; is suggested as a means to render IPv6<br/>&gt;&gt; addresses. This implies the use of a string containing eight characters<br/>&gt;&gt; 0..0xFFFF as the packed form of IPv6. Building one of those using<br/>&gt;&gt; chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !<br/><br/>&gt;Interesting point.<br/><br/>What&#39;s more, the Unicode standard suggests various *internal* uses for <br/>U+FFFE and U+FFFF (and friends), including, but not limited to, <br/>terminators and separators. 
This will also generate spurious warnings <br/>from chr() or &quot;\x{...}&quot; !<br/><br/>Chris<br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3146.html Wed, 12 Mar 2008 13:50:45 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Juerd Waalboer Chris Hall skribis 2008-03-12 13:20 (+0000):<br/>&gt; &gt;&gt; OK. In the meantime IMHO chr(n) should be handling utf8 and has no<br/>&gt; &gt;&gt; business worrying about things which UTF-8 or UCS think aren&#39;t<br/>&gt; &gt;&gt; characters.<br/>&gt; &gt;It should do Unicode, not any specific byte encoding, like UTF-?8.<br/>&gt; IMHO chr(n) should do characters, which may be interpreted as per<br/>&gt; Unicode, but may not.<br/>&gt; When I said utf8 I was following the (sloppy) convention that utf8 means<br/>&gt; how Perl handles characters in strings...<br/><br/>I&#39;m working hard to break this convention. I&#39;ve changed a lot of Perl<br/>documentation, and the result was released with Perl 5.10.<br/><br/>If in any place in Perl&#39;s official documentation, it still reads UTF-8<br/>or UTF8 for *characters in text strings*, it&#39;s wrong. Let me know and I<br/>will fix it :)<br/><br/>&gt; b. in a Perl string, characters are held in a UTF-8 like form.<br/><br/>I&#39;d say *inside* a Perl string. This is the C implementation, but a Perl<br/>programmer should not have to know the specific *internal* encoding of a<br/>Perl string.<br/><br/>Likewise, in Perl you don&#39;t have to know whether your number is<br/>internally encoded as a long integer or a double.<br/><br/>&gt; Where UTF-8 (upper case, with hyphen) means the RFC 3629 &amp;<br/>&gt; Unicode Consortium defined byte-wise encoding.<br/><br/>That&#39;s the theory, but it&#39;s so often not entirely following spec.<br/><br/>&gt; This form is referred to as utf8 (lower case, no hyphen).<br/><br/>Yes, but note that encoding names in Perl are case insensitive. 
I tend<br/>to call it UTF8 sometimes.<br/><br/>&gt; There is really no need to discuss this, except in the context of<br/>&gt; messing around in guts of Perl.<br/><br/>Exactly.<br/><br/>&gt; String literals are represented by UCS code points. Which<br/>&gt; reinforces the feeling that characters in Perl are Unicode.<br/><br/>Yes!<br/><br/>&gt; &#39;C&#39; uses &#39;wide&#39; to refer to characters that may have values<br/>&gt; &gt; 255. IMHO it&#39;s a shame that Perl did not follow this.<br/><br/>It does in some places, most notably warnings about &quot;wide characters&quot;.<br/><br/>&gt; d. when exchanging character data with other systems one needs to<br/>&gt; deal with character set and encoding issues.<br/><br/>Not just other systems. All I/O is done in bytes, even with yourself,<br/>for example if you forked.<br/><br/>&gt; &quot;Isolated surrogate code units have no interpretation on<br/>&gt; their own.&quot;<br/>&gt; (...)<br/>&gt; Clearly these are illegal in UTF-8.<br/><br/>They have no interpretation, but this also doesn&#39;t say it&#39;s illegal.<br/><br/>Compare it with the undefined behavior of multiple ++ in a single<br/>expression. There&#39;s no specification of what should happen, but it&#39;s not<br/>illegal to do it.<br/><br/>&gt; &quot;Applications are free to use any of these noncharacter code<br/>&gt; points internally but should never attempt to exchange<br/>&gt; them.<br/><br/>I think it&#39;s not Perl&#39;s job to prevent exchange. 
Simply because the<br/>exchange could be internal, but between processes of the same program.<br/><br/>&gt; I&#39;m puzzled as to why &#39;UTF-8&#39; (strict) doesn&#39;t treat U+FFFE (and<br/>&gt; friends) in the same way as U+FFFF (and friends).<br/><br/>My gut says it&#39;s out of ignorance of the &quot;rules&quot;, and certainly not an<br/>intentional deviation.<br/><br/>&gt; &gt;The result is Unicode.<br/>&gt; IMHO the result of chr(n) should just be a character.<br/><br/>We call that a unicode character in Perl. It is true that Perl allows<br/>ordinal values outside the currently existing range, but it is still<br/>called unicode by Perl&#39;s documentation.<br/><br/>&gt; OK, sure. I was using utf8 to mean any character value you like, and<br/>&gt; UTF-8 to imply a value which is recognised in UCS -- rather than the<br/>&gt; encoding.<br/><br/>Please use utf8 only for naming the byte encoding that allows any<br/>character value you like, not for the ordinal values themselves.<br/><br/>&gt; FWIW I note that printf &quot;%vX&quot; is suggested as a means to render IPv6<br/>&gt; addresses. This implies the use of a string containing eight characters<br/>&gt; 0..0xFFFF as the packed form of IPv6. Building one of those using<br/>&gt; chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !<br/><br/>Interesting point.<br/>-- <br/>Met vriendelijke groet, Kind regards, Korajn salutojn,<br/><br/> Juerd Waalboer: Perl hacker &lt;#####@juerd.nl&gt; &lt;http://juerd.nl/sig&gt;<br/> Convolution: ICT solutions and consultancy &lt;sales@convolution.nl&gt;<br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3145.html Wed, 12 Mar 2008 09:53:38 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Chris Hall On Tue, 11 Mar 2008 Juerd Waalboer wrote<br/>&gt;Chris Hall skribis 2008-03-11 21:09 (+0000):<br/>&gt;&gt; OK. 
In the meantime IMHO chr(n) should be handling utf8 and has no<br/>&gt;&gt; business worrying about things which UTF-8 or UCS think aren&#39;t<br/>&gt;&gt; characters.<br/><br/>&gt;It should do Unicode, not any specific byte encoding, like UTF-?8.<br/><br/>IMHO chr(n) should do characters, which may be interpreted as per<br/>Unicode, but may not.<br/><br/>When I said utf8 I was following the (sloppy) convention that utf8 means<br/>how Perl handles characters in strings...<br/><br/>...the naming is a cause of confusion. For the avoidance of doubt, this<br/>is what I understand the position to be:<br/><br/> a. characters in Perl have integer values in 0..0x7FFF_FFFF (or more).<br/><br/> It appears that what is actually going on is that the limit is<br/> the local unsigned Perl integer. One can debate the marginal<br/> utility of that vs the scope for confusion.<br/><br/> b. in a Perl string, characters are held in a UTF-8 like form.<br/><br/> Where UTF-8 (upper case, with hyphen) means the RFC 3629 &amp;<br/> Unicode Consortium defined byte-wise encoding.<br/><br/> Current UTF-8 defines encoding for values 0..0xD7FF and<br/> 0xE000..0x10_FFFF, which is exactly the current UCS range (less<br/> the &#39;surrogates&#39;).<br/><br/> Note that this limits UTF-8 to 4 byte sequences, explicitly<br/> excluding:<br/><br/> * sequences that have shorter equivalents (&#39;redundant&#39;)<br/> * 0xD800..0xDFFF -- the &#39;surrogates&#39;<br/> * 0x11_0000..0x1F_FFFF -- beyond UCS range<br/><br/> Older versions of the standard allowed for values 0..0x7FFF_FFFF,<br/> but also excluded the &#39;redundant&#39; sequences and (I believe) the<br/> &#39;surrogates&#39;.<br/><br/> The encoding used by Perl stretches the range to 2^72-1. This<br/> is incompatible with even the older versions of UTF-8.<br/><br/> This form is referred to as utf8 (lower case, no hyphen).<br/><br/> There is really no need to discuss this, except in the context of<br/> messing around in guts of Perl.<br/><br/> c. 
when Perl wishes to assign some meaning to a character value<br/> it interprets it as a Unicode Code Point, if it can.<br/><br/> There are huge areas of the Unicode space that have no current<br/> meaning. There are areas which may have local meaning (&quot;Private<br/> Use&quot;). In addition Perl allows character values that are beyond<br/> current Unicode space.<br/><br/> In the abstract, characters in Perl are not Unicode (UCS). But<br/> most of the time one treats them as if they were.<br/><br/> String literals are represented by UCS code points. Which<br/> reinforces the feeling that characters in Perl are Unicode.<br/><br/> &#39;C&#39; uses &#39;wide&#39; to refer to characters that may have values<br/> &gt; 255. IMHO it&#39;s a shame that Perl did not follow this.<br/><br/> d. when exchanging character data with other systems one needs to<br/> deal with character set and encoding issues.<br/><br/> The &#39;UTF-8&#39; encoding (character set) covers the UCS character set<br/> (values 0..0x10_FFFF, currently) and the (current) standard UTF-8<br/> encoding. &#39;UTF-8&#39; also worries about some &#39;suspect&#39; (my term) UCS<br/> values, see below.<br/><br/> The &#39;utf8&#39; encoding (character set) is a super set of current UTF-8<br/> (values 0..0x7FFF_FFFF) -- corresponding to earlier UTF-8. &#39;utf8&#39;<br/> does not concern itself about any &#39;suspect&#39; UCS values.<br/><br/> [Actually, that&#39;s not entirely true. &#39;utf8&#39; encode happily deals<br/> with characters all the way up to 2^64-1 (and perhaps, beyond),<br/> using Perl&#39;s extended encoding. However, &#39;utf8&#39; decode treats<br/> anything &gt; 0x7FFF_FFFF as invalid.]<br/><br/> e. The &#39;suspect&#39; UCS values.<br/><br/> These are:<br/><br/> * U+D800..U+DBFF and U+DC00..U+DFFF (High- and Low-surrogate,<br/> respectively). 
Where these are used they should appear in<br/> pairs, High followed by Low.<br/><br/> Unicode 5.0.0 says:<br/><br/> &quot;Surrogate pairs are used only in UTF-16.&quot;<br/><br/> &quot;Isolated surrogate code units have no interpretation on<br/> their own.&quot;<br/><br/> &quot;Surrogate code points cannot be conformantly interchanged<br/> using Unicode encoding forms.&quot;<br/><br/> &quot;Unicode scalar value: Any Unicode code point except high-<br/> surrogate and low-surrogate code points.&quot;<br/><br/> All the Unicode encodings are defined in terms of Unicode<br/> scalar value. There is by definition no way to exchange<br/> these characters, and no meaning is attached to them.<br/><br/> Clearly these are illegal in UTF-8.<br/><br/> * U+FFFE and U+FFFF and the last two code points in every<br/> other Unicode plane are noncharacters.<br/><br/> [Unicode code space is divided into 17 &#39;planes&#39; of 65,536<br/> characters, each. So characters U+01_FFFE, U+01_FFFF,<br/> U+02_FFFE, U+02_FFFF, ... U+10_FFFE and U+10_FFFF are all<br/> noncharacters.]<br/><br/> The range U+FDD0..U+FDEF are also noncharacters.<br/><br/> Unicode 5.0.0 says:<br/><br/> &quot;Applications are free to use any of these noncharacter code<br/> points internally but should never attempt to exchange<br/> them. If a noncharacter is received in open interchange, an<br/> application is not required to interpret it in any way. It<br/> is good practice, however, to recognize it as a<br/> noncharacter and to take appropriate action, such as<br/> removing it from the text.&quot;<br/><br/> &quot;Noncharacter code points are reserved for internal use,<br/> such as for sentinel values. They should never be<br/> interchanged. They do, however, have well-formed<br/> representations in Unicode encoding forms and survive<br/> conversions between encoding forms. 
This allows sentinel<br/> values to be preserved internally across Unicode encoding<br/> forms, even though they are not designed to be used in open<br/> interchange.&quot;<br/><br/> So, assuming UTF-8 is used for &quot;open interchange&quot;, these are<br/> also invalid.<br/><br/> * U+FFFD -- the Replacement Character<br/><br/> Unicode 5.0.0 says:<br/><br/> &quot;U+FFFD replacement character is the general substitute<br/> character in the Unicode Standard. It can be substituted<br/> for any &#39;unknown&#39; character in another encoding that cannot<br/> be mapped in terms of known Unicode characters.&quot;<br/><br/> This is generally legal.<br/><br/> However on the topic of &quot;Reserved and Private-Use Character<br/> Codes&quot; the standard also counsels:<br/><br/> &quot;An implementation should not blindly delete such<br/> characters, nor should it unintentionally transform them<br/> into something else.&quot;<br/><br/>Any corrections required would be appreciated, and may also inform any<br/>&quot;lurkers&quot;.<br/><br/>&gt;Internally, a byte encoding is needed. As a programmer I don&#39;t want to<br/>&gt;be bothered with such implementation details.<br/><br/>&gt;&gt; Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode<br/>&gt;&gt; (UTF-8) are happy with. 
Unicode defines 0xFFFE and 0xFFFF as<br/>&gt;&gt; non-characters, not just 0xFFFF (which Encode::en/decode do deem<br/>&gt;&gt; invalid).<br/><br/>&gt;Personally, I think Perl should accept these characters without warning,<br/>&gt;except the strict UTF-8 encoding is requested (which differs from the<br/>&gt;non-strict UTF8 encoding).<br/><br/>I agree -- chr(n) and &#39;utf8&#39; (lax) should happily process anything<br/>0..0x7FFF_FFFF as characters -- which may or may not be UCS.<br/><br/>I&#39;m puzzled as to why &#39;UTF-8&#39; (strict) doesn&#39;t treat U+FFFE (and<br/>friends) in the same way as U+FFFF (and friends).<br/><br/>&gt;&gt; &gt;&gt;In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it&#39;s<br/>&gt;&gt; &gt;&gt;neither.<br/>&gt;&gt; &gt;It&#39;s supposed to be neither on the outside. Internally, it&#39;s utf8.<br/>&gt;&gt; One can turn off the warnings and then chr(n) will happily take any +ve<br/>&gt;&gt; integer and give you the equivalent character -- so the result is utf8,<br/><br/>&gt;The result is Unicode.<br/><br/>IMHO the result of chr(n) should just be a character.<br/><br/>&gt; The difference between Unicode and UTF8 is not<br/>&gt;always clear, but in this case is: the character is Unicode, a single<br/>&gt;codepoint, the internal implementation is UTF8.<br/>&gt;<br/>&gt;Unicode: U+20AC (one character: &euro;)<br/>&gt;UTF-8: E2 82 AC (three bytes)<br/>&gt;<br/>&gt;I am under the impression that you know the difference and made an<br/>&gt;honest mistake. My detailed expansion is also for lurkers and archives.<br/><br/>OK, sure. I was using utf8 to mean any character value you like, and<br/>UTF-8 to imply a value which is recognised in UCS -- rather than the<br/>encoding.<br/><br/>&gt;&gt; [replacement character]<br/>&gt;&gt; So we&#39;ll have to differ on this :-)<br/><br/>&gt;Yes, although my opinion on this is not strong. undef or replacement<br/>&gt;character - both are good options. 
One argument in favor of the<br/>&gt;replacement character would be backwards compatibility.<br/><br/>Well, having concluded that the result of chr(n) should be just a<br/>character -- to be interpreted one way or another, later -- returning<br/>&quot;\x{FFFD}&quot; for chr(-1) looks perverse !<br/><br/>FWIW I note that printf &quot;%vX&quot; is suggested as a means to render IPv6<br/>addresses. This implies the use of a string containing eight characters<br/>0..0xFFFF as the packed form of IPv6. Building one of those using<br/>chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !<br/><br/>Chris<br/>-- <br/>Chris Hall highwayman.com +44 7970 277 383<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3144.html Wed, 12 Mar 2008 06:22:40 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Juerd Waalboer Chris Hall skribis 2008-03-11 21:09 (+0000):<br/>&gt; OK. In the meantime IMHO chr(n) should be handling utf8 and has no <br/>&gt; business worrying about things which UTF-8 or UCS think aren&#39;t <br/>&gt; characters.<br/><br/>It should do Unicode, not any specific byte encoding, like UTF-?8.<br/><br/>Internally, a byte encoding is needed. As a programmer I don&#39;t want to<br/>be bothered with such implementation details.<br/><br/>&gt; Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode<br/>&gt; (UTF-8) are happy with. Unicode defines 0xFFFE and 0xFFFF as <br/>&gt; non-characters, not just 0xFFFF (which Encode::en/decode do deem <br/>&gt; invalid).<br/><br/>Personally, I think Perl should accept these characters without warning,<br/>except the strict UTF-8 encoding is requested (which differs from the<br/>non-strict UTF8 encoding).<br/><br/>&gt; &gt;&gt;In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it&#39;s<br/>&gt; &gt;&gt;neither.<br/>&gt; &gt;It&#39;s supposed to be neither on the outside. 
Internally, it&#39;s utf8.<br/>&gt; One can turn off the warnings and then chr(n) will happily take any +ve <br/>&gt; integer and give you the equivalent character -- so the result is utf8, <br/><br/>The result is Unicode. The difference between Unicode and UTF8 is not<br/>always clear, but in this case is: the character is Unicode, a single<br/>codepoint, the internal implementation is UTF8.<br/><br/>Unicode: U+20AC (one character: &euro;)<br/>UTF-8: E2 82 AC (three bytes)<br/><br/>I am under the impression that you know the difference and made an<br/>honest mistake. My detailed expansion is also for lurkers and archives.<br/><br/>&gt; [replacement character]<br/>&gt; So we&#39;ll have to differ on this :-)<br/><br/>Yes, although my opinion on this is not strong. undef or replacement<br/>character - both are good options. One argument in favor of the<br/>replacement character would be backwards compatibility.<br/>-- <br/>Met vriendelijke groet, Kind regards, Korajn salutojn,<br/><br/> Juerd Waalboer: Perl hacker &lt;#####@juerd.nl&gt; &lt;http://juerd.nl/sig&gt;<br/> Convolution: ICT solutions and consultancy &lt;sales@convolution.nl&gt;<br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3143.html Tue, 11 Mar 2008 14:26:14 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Chris Hall On Tue, 11 Mar 2008 you wrote<br/>&gt;Chris Hall skribis 2008-03-11 18:48 (+0000):<br/>&gt;&gt; I&#39;m comfortable with the notion that perl characters are unsigned<br/>&gt;&gt; integers that overlap UCS, and happen to be held internally as a<br/>&gt;&gt; superset of UTF-8.<br/>&gt;&gt; I wonder if perl is completely comfortable.<br/><br/>&gt;It isn&#39;t. 
There are some very unfortunate &quot;features&quot;.<br/><br/>&gt;&gt; chr(n) throws various runtime warnings where &#39;n&#39; isn&#39;t kosher UCS, and<br/>&gt;&gt; &quot;\x{h...h}&quot; throws the same ones at compile time.<br/>&gt;&gt; (...)I&#39;m not sure I see the point of picking on a few values to warn<br/>&gt;&gt; about.<br/><br/>&gt;I don&#39;t see the point, but Perl&#39;s warnings are arbitrary in several<br/>&gt;ways. Abigail has a lightning talk about the &quot;interpreted as function&quot;<br/>&gt;warning, that illustrates this.<br/><br/>OK. In the meantime IMHO chr(n) should be handling utf8 and has no <br/>business worrying about things which UTF-8 or UCS think aren&#39;t <br/>characters.<br/><br/>Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode<br/>(UTF-8) are happy with. Unicode defines 0xFFFE and 0xFFFF as <br/>non-characters, not just 0xFFFF (which Encode::en/decode do deem <br/>invalid).<br/><br/>&gt;&gt; In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it&#39;s<br/>&gt;&gt; neither.<br/><br/>&gt;It&#39;s supposed to be neither on the outside. Internally, it&#39;s utf8.<br/><br/>One can turn off the warnings and then chr(n) will happily take any +ve <br/>integer and give you the equivalent character -- so the result is utf8, <br/>but the warnings are some (very) small subset of checking for UTF-8 :-(<br/><br/>I wonder what happens for n &gt;= 2^64. The encoding runs out at 2^72 !<br/><br/>&gt;&gt; If chr(-1) doesn&#39;t exist, then undef looks like a reasonable<br/>&gt;&gt; return value -- returning &quot;\x{FFFD}&quot; makes chr(-1)<br/>&gt;&gt; indistinguishable from chr(0xFFFD) -- where the first is<br/>&gt;&gt; nonsense and the second is entirely proper.<br/><br/>&gt;0xFFFD is the Unicode equivalent of undef. 
I think it makes sense in<br/>&gt;this case.<br/><br/>Well...<br/><br/>Unicode says: &quot;REPLACEMENT CHARACTER: used to represent an incoming <br/>character whose value is unknown or unrepresentable in Unicode&quot;.<br/><br/>...so it has plenty to do without being used to represent a value which <br/>is completely beyond the range for characters, and for which perl has a <br/>perfectly good convention already.<br/><br/>...besides, if I want to see if chr(n) has worked I have to check that <br/>(a) the result is not &quot;\x{FFFD}&quot; and (b) that n is not 0xFFFD.<br/><br/>So we&#39;ll have to differ on this :-)<br/><br/>Chris<br/>-- <br/>Chris Hall highwayman.com +44 7970 277 383<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3142.html Tue, 11 Mar 2008 14:11:05 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Juerd Waalboer Chris Hall skribis 2008-03-11 18:48 (+0000):<br/>&gt; I&#39;m comfortable with the notion that perl characters are unsigned<br/>&gt; integers that overlap UCS, and happen to be held internally as a<br/>&gt; superset of UTF-8.<br/>&gt; I wonder if perl is completely comfortable.<br/><br/>It isn&#39;t. There are some very unfortunate &quot;features&quot;.<br/><br/>&gt; chr(n) throws various runtime warnings where &#39;n&#39; isn&#39;t kosher UCS, and<br/>&gt; &quot;\x{h...h}&quot; throws the same ones at compile time.<br/>&gt; (...)I&#39;m not sure I see the point of picking on a few values to warn<br/>&gt; about.<br/><br/>I don&#39;t see the point, but Perl&#39;s warnings are arbitrary in several<br/>ways. Abigail has a lightning talk about the &quot;interpreted as function&quot;<br/>warning, that illustrates this.<br/><br/>&gt; In any case, is chr(n) supposed to be utf8 or UTF-8 ? AFAIKS, it&#39;s<br/>&gt; neither.<br/><br/>It&#39;s supposed to be neither on the outside. 
Internally, it&#39;s utf8.<br/><br/>&gt; If chr(-1) doesn&#39;t exist, then undef looks like a reasonable<br/>&gt; return value -- returning &quot;\x{FFFD}&quot; makes chr(-1)<br/>&gt; indistinguishable from chr(0xFFFD) -- where the first is<br/>&gt; nonsense and the second is entirely proper.<br/><br/>0xFFFD is the Unicode equivalent of undef. I think it makes sense in<br/>this case.<br/><br/>&gt; &gt;Could you please report this bug with perlbug?<br/>&gt; Done.<br/><br/>Cheers.<br/>-- <br/>Met vriendelijke groet, Kind regards, Korajn salutojn,<br/><br/> Juerd Waalboer: Perl hacker &lt;#####@juerd.nl&gt; &lt;http://juerd.nl/sig&gt;<br/> Convolution: ICT solutions and consultancy &lt;sales@convolution.nl&gt;<br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3141.html Tue, 11 Mar 2008 12:28:19 +0000 Decode, byte codes ASCII & ISO-8859 and HTMLCREF or XMLCREF by Chris Hall <br/>Having tried:<br/><br/> $o = Encode::decode(&#39;ascii&#39;, &quot;abc \x80 \xFF&quot;, FB_HTMLCREF)<br/> $o = Encode::decode(&#39;ISO-8859-7&#39;, &quot;abc \xFF&quot;, FB_HTMLCREF)<br/><br/> $o = Encode::decode(&#39;ascii&#39;, &quot;abc \x80 \xFF&quot;, FB_XMLCREF)<br/> $o = Encode::decode(&#39;ISO-8859-7&#39;, &quot;abc \xFF&quot;, FB_XMLCREF)<br/><br/>(0xFF is not a valid character value in ISO-8859-7.)<br/><br/>I find that they produce neither &amp;#128; and &amp;#255; nor &amp;#x80; and &amp;#xFF;, <br/>but exactly the same as:<br/><br/> $o = Encode::decode(&#39;ascii&#39;, &quot;abc \x80 \xFF&quot;, FB_PERLQQ)<br/> $o = Encode::decode(&#39;ISO-8859-7&#39;, &quot;abc \xFF&quot;, FB_PERLQQ)<br/><br/>namely \x80 and \xFF !<br/><br/>Almost as if there were a bug. 
Or have I missed the right piece of the <br/>documentation ?<br/><br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3140.html Tue, 11 Mar 2008 12:09:34 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Chris Hall On Tue, 11 Mar 2008 you wrote<br/>&gt;Chris Hall skribis 2008-03-11 13:30 (+0000):<br/>&gt;&gt; I suggest utf8::valid() is broken.<br/>&gt;&gt; my $s = chr($c) ;<br/>&gt;&gt; my $v = utf8::valid($s) ? 1 : 0 ;<br/><br/>&gt;Agreed. utf8::valid(chr $foo) should ALWAYS return true. (Please note<br/>&gt;that utf8::valid tests the internal consistency of a string - on the<br/>&gt;outside, it has little to do with UTF8.)<br/><br/>I&#39;m comfortable with the notion that perl characters are unsigned<br/>integers that overlap UCS, and happen to be held internally as a<br/>superset of UTF-8.<br/><br/>I wonder if perl is completely comfortable.<br/><br/>chr(n) throws various runtime warnings where &#39;n&#39; isn&#39;t kosher UCS, and<br/>&quot;\x{h...h}&quot; throws the same ones at compile time.<br/><br/>Now there&#39;s HUGE areas of UCS code space that are essentially<br/>meaningless. There are VAST areas of perl character space that are way<br/>beyond UCS. I&#39;m not sure I see the point of picking on a few values to<br/>warn about.<br/><br/>In any case, is chr(n) supposed to be utf8 or UTF-8 ? 
AFAIKS, it&#39;s<br/>neither.<br/><br/>I have tried the following on 5.10.0 and 5.8.8, and where these differ I<br/>have noted it:<br/><br/> chr(-1)<br/><br/> 5.10.0: No warning, returned &quot;\x{FFFD}&quot;<br/> 5.8.8: Warning &#39;Unicode character 0xffffffffffffffff is illegal&#39;,<br/> returned &quot;\x{FFFF_FFFF_FFFF_FFFF}&quot;<br/><br/> Neither of these seems very sensible.<br/><br/> If chr(-1) doesn&#39;t exist, then undef looks like a reasonable<br/> return value -- returning &quot;\x{FFFD}&quot; makes chr(-1)<br/> indistinguishable from chr(0xFFFD) -- where the first is<br/> nonsense and the second is entirely proper.<br/><br/> chr(0xD800) Warns &#39;UTF-16 surrogate 0xd800&#39;, returns &quot;\x{D800}&quot;<br/><br/> chr(0xFFFD) No warning, returns &quot;\x{FFFD}&quot;<br/><br/> chr(0xFFFE) Warns &#39;Unicode character 0xfffe is illegal&#39;,<br/> returns &quot;\x{FFFE}&quot;,<br/><br/> NB: both Encode::encode(&#39;UTF-8&#39;, &quot;\x{FFFE}&quot;)<br/> and Encode::decode(&#39;UTF-8&#39;, &quot;\xEF\xBF\xBE&quot;)<br/><br/> are perfectly happy ! 
This appears inconsistent ?<br/><br/> All the UCS planes appear to be treated like this.<br/><br/> chr(0xFFFF) Warns &#39;Unicode character 0xffff is illegal&#39;,<br/> returns &quot;\x{FFFF}&quot;,<br/><br/> NB: both Encode::encode(&#39;UTF-8&#39;, &quot;\x{FFFF}&quot;)<br/> and Encode::decode(&#39;UTF-8&#39;, &quot;\xEF\xBF\xBF&quot;)<br/><br/> consider this to be illegal, and replace it by &quot;\x{FFFD}&quot;<br/><br/> All the UCS planes appear to be treated like this.<br/><br/> chr(0x11_0000) No warning, returns &quot;\x{11_0000}&quot;<br/><br/> This is now outside the UCS range, so I suppose we don&#39;t care<br/> that this is no more useful than chr(0xFFFE) ?<br/><br/> Modern (RFC 3629 &amp; Unicode Consortium) UTF-8 is defined to<br/> exclude sequences that exceed the (current) UCS maximum of<br/> U+10_FFFF.<br/><br/> chr(0x14_0000) No warning, returns &quot;\x{14_0000}&quot;<br/><br/> Modern UTF-8 (RFC 3629 &amp; Unicode Consortium) is defined to<br/> exclude any sequence containing any byte 0xC0, 0xC1,<br/> and 0xF5-0xFF. This is the first character that contains a<br/> byte 0xF5-0xFF !<br/><br/> chr(0xzzzz_FFFE) Warns &#39;Unicode character 0xzzzzfffe is illegal&#39;,<br/> returns &quot;\x{zzzz_FFFE}&quot;<br/> chr(0xzzzz_FFFF) Warns &#39;Unicode character 0xzzzzffff is illegal&#39;,<br/> returns &quot;\x{zzzz_FFFF}&quot;<br/><br/> For all values of zzzz from 0x0011 onwards.<br/><br/> Now, it&#39;s known that 0xFFFE and 0xFFFF are non-characters in all<br/> UCS planes... but we&#39;re beyond UCS here ?<br/><br/> [I confess this baffled me at first, because 0x7FFF_FFFF<br/> generates a warning, but 0x8000_0000 doesn&#39;t.... 
But that&#39;s<br/> another story.]<br/><br/> chr(0x0020_0000) No warning, returns &quot;\x{0020_0000}&quot;<br/><br/> This is the first character with an encoding &gt; 4 bytes.<br/><br/> Modern UTF-8 (RFC 3629 &amp; Unicode Consortium) stops at 4 bytes.<br/><br/> chr(0x8000_0000) No warning, returns &quot;\x{8000_0000}&quot;<br/><br/> This is the first character with an encoding &gt; 6 bytes.<br/><br/> Actually, not even &#39;old-style&#39; UTF-8 supported anything longer<br/> than the 6 byte form. (Because bytes 0xFE and 0xFF were defined<br/> not to appear in a UTF-8 sequence -- to guarantee no confusion<br/> with UTF-16.)<br/><br/>Compile time warnings for &quot;\x{h...h}&quot; appear to complain or not complain<br/>about the same things.<br/><br/>&gt;Could you please report this bug with perlbug?<br/><br/>Done.<br/><br/>Chris<br/>-- <br/>Chris Hall highwayman.com +44 7970 277 383<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3139.html Tue, 11 Mar 2008 11:49:31 +0000 Re: utf8::valid and \x14_000 - \x1F_0000 by Juerd Waalboer Chris Hall skribis 2008-03-11 13:30 (+0000):<br/>&gt; I suggest utf8::valid() is broken.<br/>&gt; my $s = chr($c) ;<br/>&gt; my $v = utf8::valid($s) ? 1 : 0 ;<br/><br/>Agreed. utf8::valid(chr $foo) should ALWAYS return true. 
(Please note<br/>that utf8::valid tests the internal consistency of a string - on the<br/>outside, it has little to do with UTF8.)<br/><br/>Could you please report this bug with perlbug?<br/>-- <br/>Met vriendelijke groet, Kind regards, Korajn salutojn,<br/><br/> Juerd Waalboer: Perl hacker &lt;#####@juerd.nl&gt; &lt;http://juerd.nl/sig&gt;<br/> Convolution: ICT solutions and consultancy &lt;sales@convolution.nl&gt;<br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3138.html Tue, 11 Mar 2008 06:52:41 +0000 utf8::valid and \x14_000 - \x1F_0000 by Chris Hall <br/>It appears that utf8::valid() and Encode::encode(&#39;utf8&#39;, ...)<br/>do not agree for characters 0x14_0000 - 0x1F_0000.<br/><br/>I suggest utf8::valid() is broken.<br/><br/>The following:<br/><br/> use strict ;<br/><br/> use Encode qw(FB_QUIET LEAVE_SRC) ;<br/><br/> printf &quot;Perl v%vd &amp; Encode %s\n&quot;, $^V, $Encode::VERSION ;<br/><br/> my $c = 0xFFFF ;<br/> while ($c &lt; 0x8000_0000) {<br/> my $s = chr($c) ;<br/><br/> my $v = utf8::valid($s) ? 1 : 0 ;<br/> my $o = Encode::encode(&#39;utf8&#39;, $s, FB_QUIET() | LEAVE_SRC()) ;<br/><br/> my $r = $o ? 
1 : 0 ;<br/><br/> if ($v != $r) {<br/> printf &quot;0x%04X_%04X: utf8::valid=%d but Encode::encode=%d &quot;,<br/> ($c &gt;&gt; 16), $c &amp; 0xFFFF, $v, $r ;<br/> Encode::_utf8_off($s) ;<br/> print map { sprintf &#39;\x%02X&#39;, ord($_) } split(//, $s) ;<br/> print &quot;\n&quot; ;<br/> } ;<br/><br/> if ($c &amp; 0xFFFF) { $c += 1 ; } else { $c += 0xFFFF ; } ;<br/> } ;<br/><br/>Produces:<br/><br/> Perl v5.8.8 &amp; Encode 2.23<br/> 0x0014_0000: utf8::valid=0 but Encode::encode=1 \xF5\x80\x80\x80<br/> 0x0014_FFFF: utf8::valid=0 but Encode::encode=1 \xF5\x8F\xBF\xBF<br/> 0x0015_0000: utf8::valid=0 but Encode::encode=1 \xF5\x90\x80\x80<br/> 0x0015_FFFF: utf8::valid=0 but Encode::encode=1 \xF5\x9F\xBF\xBF<br/> 0x0016_0000: utf8::valid=0 but Encode::encode=1 \xF5\xA0\x80\x80<br/> 0x0016_FFFF: utf8::valid=0 but Encode::encode=1 \xF5\xAF\xBF\xBF<br/> 0x0017_0000: utf8::valid=0 but Encode::encode=1 \xF5\xB0\x80\x80<br/> 0x0017_FFFF: utf8::valid=0 but Encode::encode=1 \xF5\xBF\xBF\xBF<br/> 0x0018_0000: utf8::valid=0 but Encode::encode=1 \xF6\x80\x80\x80<br/> 0x0018_FFFF: utf8::valid=0 but Encode::encode=1 \xF6\x8F\xBF\xBF<br/> 0x0019_0000: utf8::valid=0 but Encode::encode=1 \xF6\x90\x80\x80<br/> 0x0019_FFFF: utf8::valid=0 but Encode::encode=1 \xF6\x9F\xBF\xBF<br/> 0x001A_0000: utf8::valid=0 but Encode::encode=1 \xF6\xA0\x80\x80<br/> 0x001A_FFFF: utf8::valid=0 but Encode::encode=1 \xF6\xAF\xBF\xBF<br/> 0x001B_0000: utf8::valid=0 but Encode::encode=1 \xF6\xB0\x80\x80<br/> 0x001B_FFFF: utf8::valid=0 but Encode::encode=1 \xF6\xBF\xBF\xBF<br/> 0x001C_0000: utf8::valid=0 but Encode::encode=1 \xF7\x80\x80\x80<br/> 0x001C_FFFF: utf8::valid=0 but Encode::encode=1 \xF7\x8F\xBF\xBF<br/> 0x001D_0000: utf8::valid=0 but Encode::encode=1 \xF7\x90\x80\x80<br/> 0x001D_FFFF: utf8::valid=0 but Encode::encode=1 \xF7\x9F\xBF\xBF<br/> 0x001E_0000: utf8::valid=0 but Encode::encode=1 \xF7\xA0\x80\x80<br/> 0x001E_FFFF: utf8::valid=0 but Encode::encode=1 \xF7\xAF\xBF\xBF<br/> 0x001F_0000: 
utf8::valid=0 but Encode::encode=1 \xF7\xB0\x80\x80<br/> 0x001F_FFFF: utf8::valid=0 but Encode::encode=1 \xF7\xBF\xBF\xBF<br/><br/>And the same for: Perl v5.10.0 &amp; Encode 2.23<br/>-- <br/>Chris Hall highwayman.com<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2008/03/msg3137.html Tue, 11 Mar 2008 06:32:25 +0000 Re: Problem processing UTF-8 strings from email by Juerd Waalboer Tatsuhiko Miyagawa skribis 2008-01-12 16:42 (-0800):<br/>&gt; use Encode;<br/>&gt; use Encode::MIME::Header;<br/>&gt; decode(&quot;MIME-Header&quot;, $bytes);<br/><br/>Note that you don&#39;t have to &quot;use Encode::MIME::Header;&quot;. Doing this or<br/>leaving it out is a matter of personal preference, of course.<br/>-- <br/>Met vriendelijke groet, Kind regards, Korajn salutojn,<br/><br/> Juerd Waalboer: Perl hacker &lt;#####@juerd.nl&gt; &lt;http://juerd.nl/sig&gt;<br/> Convolution: ICT solutions and consultancy &lt;sales@convolution.nl&gt;<br/> http://www.nntp.perl.org/group/perl.unicode/2008/01/msg3136.html Sun, 13 Jan 2008 05:23:40 +0000 Re: Problem processing UTF-8 strings from email by Neil Gunton Tatsuhiko Miyagawa wrote:<br/>&gt; use Encode;<br/>&gt; use Encode::MIME::Header;<br/>&gt; decode(&quot;MIME-Header&quot;, $bytes);<br/>&gt; <br/>&gt; to get the Unicode strings for these MIME encoded characters.<br/>&gt; <br/>&gt; If you want to turn them into HTML entities, you can say:<br/>&gt; <br/>&gt; encode(&quot;ascii&quot;, decode(&quot;MIME-Header&quot;, $bytes), Encode::FB_HTMLCREF);<br/><br/>Wonderful, yes, that worked! Thanks VERY much. 
I&#39;m slapping my forehead <br/>for failing to realize that this was MIME encoding-related.<br/><br/>:-)<br/><br/>/Neil<br/> http://www.nntp.perl.org/group/perl.unicode/2008/01/msg3135.html Sat, 12 Jan 2008 16:50:15 +0000 Re: Problem processing UTF-8 strings from email by Tatsuhiko Miyagawa On 1/12/08, Neil Gunton &lt;neil@nilspace.com&gt; wrote:<br/>&gt;<br/>&gt; I am somewhat experienced with Perl in general, but absolutely no<br/>&gt; experience dealing with UTF-8. I have a community journals website which<br/>&gt; allows updates from users via email. I&#39;m having trouble with emails that<br/>&gt; contain Chinese characters encoded (I think) as UTF-8. The strings look<br/>&gt; like this:<br/>&gt;<br/>&gt; =?UTF-8?B?5qGQ5LmhLCBUb25neGlhbmc6IEJlaW5nIGEgJ2hhbg==?= =?UTF-8?B?dHUn?=<br/>&gt;<br/>&gt; When I read this text from a file, using my perl script, and then save<br/>&gt; it into MySQL, it comes out on the website looking literally like the<br/>&gt; above. I can&#39;t seem to get perl to &quot;do&quot; anything with it in terms of<br/>&gt; conversions to a format that looks like chinese characters when<br/>&gt; displayed on the Web.<br/><br/> use Encode;<br/> use Encode::MIME::Header;<br/> decode(&quot;MIME-Header&quot;, $bytes);<br/><br/>to get the Unicode strings for these MIME encoded characters.<br/><br/>&gt; Does anybody have any clues as to how to convert strings like this into<br/>&gt; something more usable - e.g. 
HTML character entities?<br/><br/>If you want to turn them into HTML entities, you can say:<br/><br/> encode(&quot;ascii&quot;, decode(&quot;MIME-Header&quot;, $bytes), Encode::FB_HTMLCREF);<br/><br/>HTH<br/><br/>-- <br/>Tatsuhiko Miyagawa<br/> http://www.nntp.perl.org/group/perl.unicode/2008/01/msg3134.html Sat, 12 Jan 2008 16:42:12 +0000 Problem processing UTF-8 strings from email by Neil Gunton (Apologies if you see a duplicate - I think I may have originally sent <br/>this to the wrong list)<br/><br/>Hi all,<br/><br/>I am somewhat experienced with Perl in general, but absolutely no <br/>experience dealing with UTF-8. I have a community journals website which <br/>allows updates from users via email. I&#39;m having trouble with emails that <br/>contain Chinese characters encoded (I think) as UTF-8. The strings look <br/>like this:<br/><br/>=?UTF-8?B?5qGQ5LmhLCBUb25neGlhbmc6IEJlaW5nIGEgJ2hhbg==?= =?UTF-8?B?dHUn?=<br/><br/>When I read this text from a file, using my perl script, and then save <br/>it into MySQL, it comes out on the website looking literally like the <br/>above. I can&#39;t seem to get perl to &quot;do&quot; anything with it in terms of <br/>conversions to a format that looks like chinese characters when <br/>displayed on the Web.<br/><br/>Does anybody have any clues as to how to convert strings like this into <br/>something more usable - e.g. HTML character entities?<br/><br/>I&#39;m using stock perl 5.8.8 from Debian Etch.<br/><br/>Thanks!<br/><br/>/Neil<br/> http://www.nntp.perl.org/group/perl.unicode/2008/01/msg3133.html Sat, 12 Jan 2008 16:28:05 +0000 Re: Fix UTF Encoding issue by mkoegler On Tue, Dec 04, 2007 at 10:12:50AM +0200, Ismail D&ouml;nmez wrote:<br/>&gt; I think just a better method is to use (not tested):<br/>&gt; <br/>&gt; if( is_utf8($str) ) <br/>&gt; {<br/>&gt; return decode_utf8($str);<br/>&gt; }<br/>&gt; else {<br/>&gt; return decode($str);<br/>&gt; }<br/><br/>I already tried this function. 
It does not test whether a string is<br/>really UTF-8. It seems to be intended to check whether perl stores<br/>the string internally in a multi-byte encoding.<br/><br/>mfg Martin K&ouml;gler.<br/> http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3132.html Tue, 04 Dec 2007 05:21:19 +0000 Re: Fix UTF Encoding issue by Ismail Dönmez Tuesday 04 December 2007 10:47:39 Ismail D&ouml;nmez yazm&#x131;&#x15F;t&#x131;:<br/>&gt; Tuesday 04 December 2007 10:44:12 Martin Koegler yazm&#x131;&#x15F;t&#x131;:<br/>&gt; &gt; On Tue, Dec 04, 2007 at 10:33:39AM +0200, Ismail D&ouml;nmez wrote:<br/>&gt; &gt; &gt; Following to_utf8 function works for me :<br/>&gt; &gt;<br/>&gt; &gt; For me too (Debian sarge+etch).<br/>&gt;<br/>&gt; Thanks for testing.<br/><br/>Use Perl built-in utf8 function for UTF-8 decoding.<br/><br/>Signed-off-by: &#x130;smail D&ouml;nmez &lt;ismail@pardus.org.tr&gt;<br/><br/>diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl<br/>index ff5daa7..db255c1 100755<br/>--- a/gitweb/gitweb.perl<br/>+++ b/gitweb/gitweb.perl<br/>@@ -695,10 +695,9 @@ sub validate_refname {<br/> # in utf-8 thanks to &quot;binmode STDOUT, &#39;:utf8&#39;&quot; at beginning<br/> sub to_utf8 {<br/> my $str = shift;<br/>- my $res;<br/>- eval { $res = decode_utf8($str, Encode::FB_CROAK); };<br/>- if (defined $res) {<br/>- return $res;<br/>+ if (utf8::valid($str)) {<br/>+ utf8::decode($str);<br/>+ return $str;<br/> } else {<br/> return decode($fallback_encoding, $str, Encode::FB_DEFAULT);<br/> }<br/><br/><br/><br/>-- <br/>Never learn by your mistakes, if you do you may never dare to try again.<br/> http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3131.html Tue, 04 Dec 2007 05:21:18 +0000 Re: Fix UTF Encoding issue by mkoegler On Tue, Dec 04, 2007 at 10:33:39AM +0200, Ismail D&ouml;nmez wrote:<br/>&gt; Following to_utf8 function works for me :<br/><br/>For me too (Debian sarge+etch).<br/><br/>&gt; sub to_utf8 {<br/>&gt; 
&nbsp; &nbsp; my $str = shift;<br/>&gt; <br/>&gt; &nbsp; &nbsp; if(utf8::valid($str))<br/>&gt; &nbsp; &nbsp; {<br/>&gt; &nbsp; &nbsp; &nbsp; &nbsp; utf8::decode($str);<br/>&gt; &nbsp; &nbsp; }<br/>&gt;<br/>&gt; &nbsp; &nbsp; return $str;<br/><br/>In the original thread, there was some discussion that some people<br/>might want a different fallback encoding. So maybe you should <br/>keep the second call to decode for the fallback encoding.<br/><br/>&gt; }<br/><br/>mfg Martin K&ouml;gler<br/> http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3130.html Tue, 04 Dec 2007 04:47:59 +0000 Re: Fix UTF Encoding issue by Wincent Colaiuta El 4/12/2007, a las 9:55, Ismail D&ouml;nmez escribi&oacute;:<br/><br/>&gt; Tuesday 04 December 2007 10:47:39 Ismail D&ouml;nmez yazm&#x131;&#x15F;t&#x131;:<br/>&gt;&gt; Tuesday 04 December 2007 10:44:12 Martin Koegler yazm&#x131;&#x15F;t&#x131;:<br/>&gt;&gt;&gt; On Tue, Dec 04, 2007 at 10:33:39AM +0200, Ismail D&ouml;nmez wrote:<br/>&gt;&gt;&gt;&gt; Following to_utf8 function works for me :<br/>&gt;&gt;&gt;<br/>&gt;&gt;&gt; For me too (Debian sarge+etch).<br/>&gt;&gt;<br/>&gt;&gt; Thanks for testing.<br/>&gt;<br/>&gt; Use Perl built-in utf8 function for UTF-8 decoding.<br/>&gt;<br/>&gt; Signed-off-by: &#x130;smail D&ouml;nmez &lt;ismail@pardus.org.tr&gt;<br/>&gt;<br/>&gt; diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl<br/>&gt; index ff5daa7..db255c1 100755<br/>&gt; --- a/gitweb/gitweb.perl<br/>&gt; +++ b/gitweb/gitweb.perl<br/>&gt; @@ -695,10 +695,9 @@ sub validate_refname {<br/>&gt; # in utf-8 thanks to &quot;binmode STDOUT, &#39;:utf8&#39;&quot; at beginning<br/>&gt; sub to_utf8 {<br/>&gt; my $str = shift;<br/>&gt; - my $res;<br/>&gt; - eval { $res = decode_utf8($str, Encode::FB_CROAK); };<br/>&gt; - if (defined $res) {<br/>&gt; - return $res;<br/>&gt; + if (utf8::valid($str)) {<br/>&gt; + utf8::decode($str);<br/>&gt; + return $str;<br/><br/>This is good as it fixes another problem which some may have 
<br/>encountered. On at least one distro that I use (Red Hat Enterprise <br/>Linux 3) the Encode module is very old (it&#39;s 1.83; the latest release <br/>is 2.23), and so gitweb won&#39;t even run, dying during compilation with <br/>this:<br/><br/> Too many arguments for Encode::decode_utf8 at gitweb.cgi line 686, <br/>near &quot;Encode::FB_CROAK)&quot;<br/><br/>Of course, the workaround is to install a newer version of the module, <br/>but this patch eliminates that dependency which is IMO a good thing.<br/><br/>Cheers,<br/>Wincent<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3129.html Tue, 04 Dec 2007 04:47:58 +0000 Re: Fix UTF Encoding issue by mkoegler On Tue, Dec 04, 2007 at 09:55:04AM +0200, Ismail D&ouml;nmez wrote:<br/>&gt; Tuesday 04 December 2007 Tarihinde 09:50:28 yazm????t??:<br/>&gt; &gt; The bug affects old versions of perl (Debian sarge = oldstable).<br/>&gt; &gt; As it works on the newer Debian etch, do you really think, that it is<br/>&gt; &gt; a good idea to report issue?<br/>&gt; <br/>&gt; Same problem here with v5.8.8 which is latest stable perl5 release.<br/><br/>I have put together a small perl script, which tests the various ways<br/>of decoding, which have been posted on the list. The first test is<br/>wrong by design. 
A working decoding method should result in<br/>&quot;#&ouml;&auml;&uuml;#&auml;&ouml;&uuml;&quot;.<br/><br/>Debian sarge:<br/>#&ouml;&auml;&uuml;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;<br/>##&auml;&ouml;&uuml;<br/>##&auml;&ouml;&uuml;<br/>##&auml;&ouml;&uuml;<br/><br/>Debian etch, OpenSuSE 10.2, Fedora 7:<br/>#&ouml;&auml;&uuml;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;<br/>#&ouml;&auml;&uuml;#&auml;&ouml;&uuml;<br/>#&ouml;&auml;&uuml;#&auml;&ouml;&uuml;<br/>#&ouml;&auml;&uuml;#&auml;&ouml;&uuml;<br/><br/>mfg Martin K&ouml;gler<br/><br/>#!/usr/bin/perl<br/>use Encode;<br/><br/>sub t {<br/>my $str = shift;<br/>my ($res);<br/>eval { return ($res = decode_utf8($str, Encode::FB_CROAK)); };<br/>return decode(&quot;latin1&quot;, $str, Encode::FB_DEFAULT);<br/>}<br/>sub t1 {<br/>my $str = shift;<br/>my ($res);<br/>eval { ($res = decode_utf8($str, Encode::FB_CROAK)); };<br/>if ($@) {<br/>return decode(&quot;latin1&quot;, $str, Encode::FB_DEFAULT); }<br/>else<br/>{ return $res; }<br/>}<br/><br/>sub t2 {<br/>my $str = shift;<br/>my ($res);<br/><br/>eval { $res = decode_utf8($str, Encode::FB_CROAK); };<br/> if (defined $res) {<br/> return $res;<br/>} else {<br/> return decode(&quot;latin1&quot;, $str, Encode::FB_DEFAULT);<br/>}<br/>}<br/><br/>sub t3 {<br/> my $str = shift;<br/> my $res;<br/> eval { $res = decode_utf8 ($str, 1); };<br/> return $res || decode(&#39;latin1&#39;, $str);<br/>}<br/><br/>print t(&quot;#&ouml;&auml;&uuml;&quot;);<br/>print t(&quot;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;&quot;);<br/>print &quot;\n&quot;;<br/>print t1(&quot;#&ouml;&auml;&uuml;&quot;);<br/>print t1(&quot;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;&quot;);<br/>print &quot;\n&quot;;<br/>print t2(&quot;#&ouml;&auml;&uuml;&quot;);<br/>print t2(&quot;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;&quot;);<br/>print &quot;\n&quot;;<br/>print t3(&quot;#&ouml;&auml;&uuml;&quot;);<br/>print t3(&quot;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;&quot;);<br/>print 
&quot;\n&quot;;<br/><br/><br/><br/> http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3128.html Tue, 04 Dec 2007 04:47:53 +0000 Re: Fix UTF Encoding issue by Ismail Dönmez Tuesday 04 December 2007 10:44:12 Martin Koegler yazm&#x131;&#x15F;t&#x131;:<br/>&gt; On Tue, Dec 04, 2007 at 10:33:39AM +0200, Ismail D&ouml;nmez wrote:<br/>&gt; &gt; Following to_utf8 function works for me :<br/>&gt;<br/>&gt; For me too (Debian sarge+etch).<br/><br/>Thanks for testing.<br/><br/>&gt; &gt; sub to_utf8 {<br/>&gt; &gt; &nbsp; &nbsp; my $str = shift;<br/>&gt; &gt;<br/>&gt; &gt; &nbsp; &nbsp; if(utf8::valid($str))<br/>&gt; &gt; &nbsp; &nbsp; {<br/>&gt; &gt; &nbsp; &nbsp; &nbsp; &nbsp; utf8::decode($str);<br/>&gt; &gt; &nbsp; &nbsp; }<br/>&gt; &gt;<br/>&gt; &gt; &nbsp; &nbsp; return $str;<br/>&gt;<br/>&gt; In the original thread, there was some discussion that some people<br/>&gt; might want a different fallback encoding. 
So maybe you should<br/>&gt; keep the second call to decode for the fallback encoding.<br/><br/>Probably, I just wanted to fix this damn UTF-8 bug surfacing over and over =)<br/><br/>Regards,<br/>ismail<br/><br/>-- <br/>Never learn by your mistakes, if you do you may never dare to try again.<br/> http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3127.html Tue, 04 Dec 2007 04:18:07 +0000 Re: Fix UTF Encoding issue by Ismail Dönmez Tuesday 04 December 2007 10:28:59 Ismail D&ouml;nmez yazm&#x131;&#x15F;t&#x131;:<br/>&gt; Tuesday 04 December 2007 10:16:34 Martin Koegler yazm&#x131;&#x15F;t&#x131;:<br/>&gt; [...]<br/>&gt;<br/>&gt; &gt; print t(&quot;#&ouml;&auml;&uuml;&quot;);<br/>&gt; &gt; print t(&quot;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;&quot;);<br/>&gt; &gt; print &quot;\n&quot;;<br/>&gt;<br/>&gt; How about this one, doesn&#39;t even use Encode, uses just built-in utf8<br/>&gt; function :<br/>&gt;<br/>&gt; [~]&gt; cat test.pl<br/>&gt; binmode STDOUT, &#39;:utf8&#39;;<br/>&gt;<br/>&gt; my $str = &quot;#&ouml;&auml;&uuml;&quot;;<br/>&gt;<br/>&gt; if (utf8::valid($str))<br/>&gt; {<br/>&gt; utf8::decode($str);<br/>&gt; }<br/>&gt;<br/>&gt; print $str.&quot;\n&quot;;<br/>&gt;<br/>&gt; [~]&gt; perl test.pl<br/>&gt; #&ouml;&auml;&uuml;<br/><br/>Following to_utf8 function works for me :<br/><br/>sub to_utf8 {<br/>&nbsp; &nbsp; my $str = shift;<br/><br/>&nbsp; &nbsp; if(utf8::valid($str))<br/>&nbsp; &nbsp; {<br/>&nbsp; &nbsp; &nbsp; &nbsp; utf8::decode($str);<br/>&nbsp; &nbsp; }<br/><br/>&nbsp; &nbsp; return $str;<br/>}<br/><br/>Regards,<br/>ismail<br/><br/>-- <br/>Never learn by your mistakes, if you do you may never dare to try again.<br/> 
http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3126.html Tue, 04 Dec 2007 04:17:52 +0000 Re: Fix UTF Encoding issue by Ismail Dönmez Tuesday 04 December 2007 10:16:34 Martin Koegler yazm&#x131;&#x15F;t&#x131;:<br/>[...]<br/>&gt; print t(&quot;#&ouml;&auml;&uuml;&quot;);<br/>&gt; print t(&quot;#&Atilde;&euro;&Atilde;&para;&Atilde;&OElig;&quot;);<br/>&gt; print &quot;\n&quot;;<br/><br/>How about this one, doesn&#39;t even use Encode, uses just built-in utf8 <br/>function :<br/><br/>[~]&gt; cat test.pl<br/>binmode STDOUT, &#39;:utf8&#39;;<br/><br/>my $str = &quot;#&ouml;&auml;&uuml;&quot;;<br/><br/>if (utf8::valid($str))<br/>{<br/> utf8::decode($str);<br/>}<br/><br/>print $str.&quot;\n&quot;;<br/><br/>[~]&gt; perl test.pl<br/>#&ouml;&auml;&uuml;<br/><br/>Regards,<br/>ismail<br/><br/>-- <br/>Never learn by your mistakes, if you do you may never dare to try again.<br/> http://www.nntp.perl.org/group/perl.unicode/2007/12/msg3125.html Tue, 04 Dec 2007 04:17:50 +0000 Re: Fix UTF Encoding issue by Ismail Dönmez Tuesday 04 December 2007 10:04:07 Martin Koegler yazm&#x131;&#x15F;t&#x131;:<br/>&gt; On Tue, Dec 04, 2007 at 08:16:24AM +1030, Benjamin Close wrote:<br/>&gt; &gt; Jakub Narebski wrote:<br/>&gt; &gt; &gt;On Mon, 3 Dec 2007, Martin Koegler wrote:<br/>&gt; &gt; &gt;&gt;On Mon, Dec 03, 2007 at 04:06:48AM -0800, Jakub Narebski wrote:<br/>&gt; &gt; &gt;&gt;&gt;Ismail D&ouml;nmez &lt;ismail@pardus.org.tr&gt; writes:<br/>&gt; &gt; &gt;&gt;&gt;&gt;Monday 03 December 2007 Tarihinde 12:14:43 yazm&#x131;&#x15F;t&#x131;:<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;Benjamin Close &lt;Benjamin.Close@clearchain.com&gt; writes:<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;- eval { $res = decode_utf8($str, Encode::FB_CROAK); };<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;- if (defined $res) {<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;- return 
$res;<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;- } else {<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;- return decode($fallback_encoding, $str,<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;Encode::FB_DEFAULT);<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;- }<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;+ eval { return ($res = decode_utf8($str, Encode::FB_CROAK));<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;};<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt;+ return decode($fallback_encoding, $str, Encode::FB_DEFAULT);<br/>&gt; &gt; &gt;&gt;&gt;&gt;&gt;&gt; }<br/>&gt; &gt; &gt;&gt;<br/>&gt; &gt; &gt;&gt;This version is broken on Debian sarge and etch