perl.unicode http://www.nntp.perl.org/group/perl.unicode/ ... Copyright 1998-2013 perl.org Thu, 23 May 2013 18:56:48 +0000 ask@perl.org Re: Need info on ascii converstion. by Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 &gt; Error that I am receiving is below :<br/>&gt; <br/>&gt; $VAR1 = bless( {<br/>&gt; &#39;why&#39; =&gt; &#39;Expected 8 or 0 byte long (19)&#39;<br/>&gt; }, &#39;Cassandra::InvalidRequestException&#39; );<br/>You haven&#39;t shown any code and input data to produce that error. Provide<br/>a minimal, stand-alone example that someone else can run to reproduce<br/>the problem on his own computer. &lt;http://sscce.org/&gt;<br/>&lt;http://www.chiark.greenend.org.uk/~sgtatham/bugs.html#showmehow&gt;<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2013/03/msg3352.html Tue, 19 Mar 2013 21:45:50 +0000 RE: Need info on ascii converstion. by Dhoke, Swati **CTR** Hi Team ,<br/><br/>Data I am trying to pass is getting converted into ascii , how can I avoid that in my script .<br/><br/>I have tried below module :<br/><br/>use utf8;<br/>use feature &#39;unicode_strings&#39;;<br/><br/>I still get ascii data even on using above modules<br/><br/>Error that I am receiving is below :<br/><br/>$VAR1 = bless( {<br/> &#39;why&#39; =&gt; &#39;Expected 8 or 0 byte long (19)&#39;<br/> }, &#39;Cassandra::InvalidRequestException&#39; );<br/><br/>REFERENCE : http://perldoc.perl.org/Encode.html#Handling-Malformed-Data.<br/><br/>Thanks ,<br/>Swati<br/><br/><br/><br/> http://www.nntp.perl.org/group/perl.unicode/2013/03/msg3351.html Thu, 14 Mar 2013 21:21:16 +0000 Re: Word boundaries by Zbigniew Łukasiak On Mon, Mar 26, 2012 at 12:57 PM, Lars D&#x26A;&#x1D07;&#x1D04;&#x1D0B;&#x1D0F;&#x1D21; &#x8FEA;&#x62C9;&#x65AF; &lt;daxim@cpan.org&gt; wrote:<br/>&gt; Let the regex engine help you advance the character counter.<br/>&gt;<br/>&gt; &nbsp; &nbsp;$ cat langs<br/>&gt; &nbsp; 
&nbsp;&Epsilon;&lambda;&lambda;&eta;&nu;&iota;&kappa;&#x3AC;English&#xD55C;&#xAD6D;&#xC5B4;&#x65E5;&#x672C;&#x8A9E;&#x420;&#x443;&#x441;&#x441;&#x43A;&#x438;&#x439;&#xE44;&#xE17;&#xE22;<br/>&gt;<br/>&gt; ----<br/>&gt;<br/>&gt; &nbsp; &nbsp;$ cat langs.pl<br/>&gt; &nbsp; &nbsp;use 5.010;<br/>&gt; &nbsp; &nbsp;use strictures;<br/>&gt; &nbsp; &nbsp;use Unicode::UCD qw(charinfo);<br/>&gt;<br/>&gt; &nbsp; &nbsp;sub script {<br/>&gt; &nbsp; &nbsp; &nbsp; &nbsp;return charinfo(ord substr($_[0], 0, 1))-&gt;{script}<br/>&gt; &nbsp; &nbsp;};<br/>&gt;<br/>&gt; &nbsp; &nbsp;# necessary because pos() magic is tracked on the scalar.<br/>&gt; &nbsp; &nbsp;my $copy = $_;<br/>&gt; &nbsp; &nbsp;while (/(\X)/g) {<br/>&gt; &nbsp; &nbsp; &nbsp; &nbsp;my $script = script $1;<br/>&gt; &nbsp; &nbsp; &nbsp; &nbsp;my ($part) = $copy =~ /(\p{$script}+)/;<br/>&gt; &nbsp; &nbsp; &nbsp; &nbsp;say $part;<br/>&gt; &nbsp; &nbsp; &nbsp; &nbsp;pos($_) = pos($_) + length($part);<br/>&gt; &nbsp; &nbsp;}<br/><br/>Thanks a lot!<br/><br/>Here is the first version of my tokenizer based on this idea:<br/><br/><br/>use Lingua::ZH::MMSEG;<br/><br/>sub tokenize {<br/> my $text = shift;<br/> my @tokens;<br/> while ( $text =~ /(\X)/g ) {<br/> my $part = $1;<br/> my $script = charinfo( ord $1)-&gt;{script};<br/> $text=~ /(\p{$script}*)/g;<br/> next if $script eq &#39;Common&#39;;<br/> $part .= $1;<br/> if( $script eq &#39;Han&#39; ){<br/> push @tokens, mmseg( $part );<br/> }<br/> else{<br/> push @tokens, $part;<br/> }<br/> }<br/> return @tokens;<br/>}<br/><br/>And the surprise - this works even without further splitting because<br/>space and other dots all get the &#39;Common&#39; script and are not matched<br/>by \p{Latin}.<br/><br/>-- <br/>Zbigniew Lukasiak<br/>http://brudnopis.blogspot.com/<br/>http://perlalchemy.blogspot.com/<br/> http://www.nntp.perl.org/group/perl.unicode/2012/03/msg3350.html Tue, 27 Mar 2012 05:21:59 +0000 Re: Word boundaries by Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 &gt; How can I check what script a 
character belongs to?<br/><br/> $ perl -Mutf8 -MUnicode::UCD=charinfo -E&#39;say charinfo(ord<br/>&quot;&#x4E3A;&quot;)-&gt;{script}&#39;<br/> Han<br/><br/>Sanity checks:<br/><br/> $ perl -Mutf8 -E&#39;say &quot;&#x4E3A;&quot; =~ /\p{Han}/&#39;<br/> 1<br/><br/> $ uniprops -a1 &#x4E3A; | ack Script<br/> Script=Han<br/> Script=Hani<br/><br/>&gt; check if it is the same as the<br/>&gt; previous one - i.e. back to C mode of programming.<br/><br/>Let the regex engine help you advance the character counter.<br/><br/> $ cat langs<br/> &Epsilon;&lambda;&lambda;&eta;&nu;&iota;&kappa;&#x3AC;English&#xD55C;&#xAD6D;&#xC5B4;&#x65E5;&#x672C;&#x8A9E;&#x420;&#x443;&#x441;&#x441;&#x43A;&#x438;&#x439;&#xE44;&#xE17;&#xE22;<br/><br/>----<br/><br/> $ cat langs.pl<br/> use 5.010;<br/> use strictures;<br/> use Unicode::UCD qw(charinfo);<br/><br/> sub script {<br/> return charinfo(ord substr($_[0], 0, 1))-&gt;{script}<br/> };<br/><br/> # necessary because pos() magic is tracked on the scalar.<br/> my $copy = $_; <br/> while (/(\X)/g) {<br/> my $script = script $1;<br/> my ($part) = $copy =~ /(\p{$script}+)/;<br/> say $part;<br/> pos($_) = pos($_) + length($part);<br/> }<br/><br/>----<br/><br/> $ perl -C -ln langs.pl &lt; langs<br/> &Epsilon;&lambda;&lambda;&eta;&nu;&iota;&kappa;&#x3AC;<br/> English<br/> &#xD55C;&#xAD6D;&#xC5B4;<br/> &#x420;&#x443;&#x441;&#x441;&#x43A;&#x438;&#x439;<br/> &#xE44;&#xE17;&#xE22;<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2012/03/msg3349.html Mon, 26 Mar 2012 03:58:11 +0000 Word boundaries by Zbigniew Łukasiak For our spam classifier I need to split the text into words.<br/>Unfortunately the &#39;\b&#39; regex does not yet work for languages with no<br/>spaces (apparently it is covered in the level 3 of unicode support<br/>http://unicode.org/reports/tr18/#Tailored_Word_Boundaries) - so I need<br/>some custom solution. 
This did not seem very difficult - just split<br/>the text into blocks of same unicode script and then use &#39;\b&#39; for most<br/>of the scripts and appropriate libraries for the rest (at least for<br/>Chinese there are some tokenizers on CPAN) - but:<br/><br/>1. How can I split the text into blocks of same scripts? (Wouldn&#39;t a<br/>script-boundary regex property be useful?). OK I can always loop over<br/>the characters, check their script and check if it is the same as the<br/>previous one - i.e. back to C mode of programming. But then there is<br/>still the question of:<br/><br/>2. How can I check what script a character belongs to? Do I need to<br/>cut and paste all the script ranges from unicode.org into a huge<br/>if-else branch in my program or is there a simpler way?<br/><br/>Thanks in advance,<br/>Zbigniew<br/> http://www.nntp.perl.org/group/perl.unicode/2012/03/msg3348.html Mon, 26 Mar 2012 02:03:20 +0000 Re: Please help me with this perl question by Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 Yes, it is possible. This is an example:<br/><br/> perl -C -E&#39;say&quot;\x{e8f4}\x{e8f5}&quot;&#39;<br/><br/>Screenshot: &lt;http://i.imgur.com/wivY9.png&gt;<br/><br/>The magic happens not at the Perl level, but at the rendering step.<br/><br/>I first picked two unassigned codepoints. Unicode provides a private<br/>use area for exactly this kind of purpose.<br/><br/>I also took care not to trample over a [registered script]<br/>(http://enwp.org/ConScript_Unicode_Registry). 
Then I created a<br/>font with two glyphs, `U+E8F4 ROTATED LATIN CAPITAL LETTER D` and<br/>`U+E8F5 LATIN CAPITAL LETTER A WITHOUT COUNTER`, and installed it.<br/>Last, I restarted my terminal application so that the new font gets<br/>picked up, and executed the Perl one-liner from above.<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2012/02/msg3347.html Tue, 14 Feb 2012 02:01:25 +0000 Re: Unicode on Windows Console by Michael Ludwig Lars D&#x26A;&#x1D07;&#x1D04;&#x1D0B;&#x1D0F;&#x1D21; &#x8FEA;&#x62C9;&#x65AF; schrieb am 12.01.2012 um 09:54 (+0100):<br/>&gt; Run `chcp 65001`, see &lt;http://enwp.org/chcp_(command)&gt;. I have not<br/>&gt; tested this.<br/><br/>Okay, that&#39;s just the regular chcp command.<br/><br/>&gt; Setting the encoding to Windows-1252 and then expecting to be able to<br/>&gt; talk variations of UTF to it seems wrong.<br/><br/>Might seem wrong at first glance, I agree; but hey, this is Windows, and<br/>it magically works, sort of bypassing the chcp setting! You just need<br/>the C API call:<br/><br/> _setmode(_fileno(stdout), _O_WTEXT)<br/><br/>Or some equivalent of that, which is what Win32::Unicode seems to be<br/>doing.<br/>-- <br/>Michael Ludwig<br/> http://www.nntp.perl.org/group/perl.unicode/2012/01/msg3346.html Thu, 12 Jan 2012 11:51:13 +0000 Re: Unicode on Windows Console by Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 Run `chcp 65001`, see &lt;http://enwp.org/chcp_(command)&gt;. I have not<br/>tested this.<br/><br/>Setting the encoding to Windows-1252 and then expecting to be able to<br/>talk variations of UTF to it seems wrong.<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2012/01/msg3345.html Thu, 12 Jan 2012 00:54:14 +0000 Re: question about perlunicode "Unicode Character Properties" by Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 Install &lt;http://p3rl.org/unichars&gt;.<br/><br/>$ unichars -au &#39;\p{Han}&#39; | wc -l<br/>75960<br/><br/>$ unichars -au &#39;\p{Han}&#39; | perl -lne&#39;print unless $. 
% 1000&#39;<br/> &#x368F; U+0368F CJK UNIFIED IDEOGRAPH-368F<br/> &#x3A77; U+03A77 CJK UNIFIED IDEOGRAPH-3A77<br/> &#x3E5F; U+03E5F CJK UNIFIED IDEOGRAPH-3E5F<br/> &#x4247; U+04247 CJK UNIFIED IDEOGRAPH-4247<br/> &#x462F; U+0462F CJK UNIFIED IDEOGRAPH-462F<br/> &#x4A17; U+04A17 CJK UNIFIED IDEOGRAPH-4A17<br/> &#x4E49; U+04E49 CJK UNIFIED IDEOGRAPH-4E49<br/> &#x5231; U+05231 CJK UNIFIED IDEOGRAPH-5231<br/> &#x5619; U+05619 CJK UNIFIED IDEOGRAPH-5619<br/> &#x5A01; U+05A01 CJK UNIFIED IDEOGRAPH-5A01<br/> &#x5DE9; U+05DE9 CJK UNIFIED IDEOGRAPH-5DE9<br/> &#x61D1; U+061D1 CJK UNIFIED IDEOGRAPH-61D1<br/> &#x65B9; U+065B9 CJK UNIFIED IDEOGRAPH-65B9<br/> &#x69A1; U+069A1 CJK UNIFIED IDEOGRAPH-69A1<br/> &#x6D89; U+06D89 CJK UNIFIED IDEOGRAPH-6D89<br/> &#x7171; U+07171 CJK UNIFIED IDEOGRAPH-7171<br/> &#x7559; U+07559 CJK UNIFIED IDEOGRAPH-7559<br/> &#x7941; U+07941 CJK UNIFIED IDEOGRAPH-7941<br/> &#x7D29; U+07D29 CJK UNIFIED IDEOGRAPH-7D29<br/> &#x8111; U+08111 CJK UNIFIED IDEOGRAPH-8111<br/> &#x84F9; U+084F9 CJK UNIFIED IDEOGRAPH-84F9<br/> &#x88E1; U+088E1 CJK UNIFIED IDEOGRAPH-88E1<br/> &#x8CC9; U+08CC9 CJK UNIFIED IDEOGRAPH-8CC9<br/> &#x90B1; U+090B1 CJK UNIFIED IDEOGRAPH-90B1<br/> &#x9499; U+09499 CJK UNIFIED IDEOGRAPH-9499<br/> &#x9881; U+09881 CJK UNIFIED IDEOGRAPH-9881<br/> &#x9C69; U+09C69 CJK UNIFIED IDEOGRAPH-9C69<br/> &#xF985; U+0F985 CJK COMPATIBILITY IDEOGRAPH-F985<br/> &#x20297; U+20297 CJK UNIFIED IDEOGRAPH-20297<br/> &#x2067F; U+2067F CJK UNIFIED IDEOGRAPH-2067F<br/> &#x20A67; U+20A67 CJK UNIFIED IDEOGRAPH-20A67<br/> &#x20E4F; U+20E4F CJK UNIFIED IDEOGRAPH-20E4F<br/> &#x21237; U+21237 CJK UNIFIED IDEOGRAPH-21237<br/> &#x2161F; U+2161F CJK UNIFIED IDEOGRAPH-2161F<br/> &#x21A07; U+21A07 CJK UNIFIED IDEOGRAPH-21A07<br/> &#x21DEF; U+21DEF CJK UNIFIED IDEOGRAPH-21DEF<br/> &#x221D7; U+221D7 CJK UNIFIED IDEOGRAPH-221D7<br/> &#x225BF; U+225BF CJK UNIFIED IDEOGRAPH-225BF<br/> &#x229A7; U+229A7 CJK UNIFIED IDEOGRAPH-229A7<br/> &#x22D8F; U+22D8F CJK UNIFIED 
IDEOGRAPH-22D8F<br/> &#x23177; U+23177 CJK UNIFIED IDEOGRAPH-23177<br/> &#x2355F; U+2355F CJK UNIFIED IDEOGRAPH-2355F<br/> &#x23947; U+23947 CJK UNIFIED IDEOGRAPH-23947<br/> &#x23D2F; U+23D2F CJK UNIFIED IDEOGRAPH-23D2F<br/> &#x24117; U+24117 CJK UNIFIED IDEOGRAPH-24117<br/> &#x244FF; U+244FF CJK UNIFIED IDEOGRAPH-244FF<br/> &#x248E7; U+248E7 CJK UNIFIED IDEOGRAPH-248E7<br/> &#x24CCF; U+24CCF CJK UNIFIED IDEOGRAPH-24CCF<br/> &#x250B7; U+250B7 CJK UNIFIED IDEOGRAPH-250B7<br/> &#x2549F; U+2549F CJK UNIFIED IDEOGRAPH-2549F<br/> &#x25887; U+25887 CJK UNIFIED IDEOGRAPH-25887<br/> &#x25C6F; U+25C6F CJK UNIFIED IDEOGRAPH-25C6F<br/> &#x26057; U+26057 CJK UNIFIED IDEOGRAPH-26057<br/> &#x2643F; U+2643F CJK UNIFIED IDEOGRAPH-2643F<br/> &#x26827; U+26827 CJK UNIFIED IDEOGRAPH-26827<br/> &#x26C0F; U+26C0F CJK UNIFIED IDEOGRAPH-26C0F<br/> &#x26FF7; U+26FF7 CJK UNIFIED IDEOGRAPH-26FF7<br/> &#x273DF; U+273DF CJK UNIFIED IDEOGRAPH-273DF<br/> &#x277C7; U+277C7 CJK UNIFIED IDEOGRAPH-277C7<br/> &#x27BAF; U+27BAF CJK UNIFIED IDEOGRAPH-27BAF<br/> &#x27F97; U+27F97 CJK UNIFIED IDEOGRAPH-27F97<br/> &#x2837F; U+2837F CJK UNIFIED IDEOGRAPH-2837F<br/> &#x28767; U+28767 CJK UNIFIED IDEOGRAPH-28767<br/> &#x28B4F; U+28B4F CJK UNIFIED IDEOGRAPH-28B4F<br/> &#x28F37; U+28F37 CJK UNIFIED IDEOGRAPH-28F37<br/> &#x2931F; U+2931F CJK UNIFIED IDEOGRAPH-2931F<br/> &#x29707; U+29707 CJK UNIFIED IDEOGRAPH-29707<br/> &#x29AEF; U+29AEF CJK UNIFIED IDEOGRAPH-29AEF<br/> &#x29ED7; U+29ED7 CJK UNIFIED IDEOGRAPH-29ED7<br/> &#x2A2BF; U+2A2BF CJK UNIFIED IDEOGRAPH-2A2BF<br/> &#x2A6A7; U+2A6A7 CJK UNIFIED IDEOGRAPH-2A6A7<br/> &#x2AAB8; U+2AAB8 CJK UNIFIED IDEOGRAPH-2AAB8<br/> &#x2AEA0; U+2AEA0 CJK UNIFIED IDEOGRAPH-2AEA0<br/> &#x2B288; U+2B288 CJK UNIFIED IDEOGRAPH-2B288<br/> &#x2B670; U+2B670 CJK UNIFIED IDEOGRAPH-2B670<br/><br/><br/>To get a good coverage for display, install the following font families:<br/><br/>&#x6587;&#x6CC9;&#x9A7F;&#x6B63;&#x9ED1; &lt;http://wenq.org/?ZenHei&gt;<br/>Han Nom 
&lt;http://vietunicode.sf.net/fonts/fonts_hannom.html&gt;<br/>Code200x &lt;http://web.archive.org/web/2010/http://code2000.net/&gt;<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2012/01/msg3344.html Thu, 12 Jan 2012 00:47:42 +0000 question about perlunicode "Unicode Character Properties" by silent in perldoc perlunicode : Unicode Character Properties : Scripts<br/><br/>I see a Han, which can be used as $string =~/\p{Han}/;<br/><br/>My question is: how can I find out what exactly &quot;Han&quot; is?<br/>I know \p{Han} can match a Chinese word,<br/>and I also tested it to match each word in perl-src/ext/Encode/t/gb2312.utf,<br/><br/>but I do not know the exact range of this \p{Han}.<br/><br/>Thanks!<br/> http://www.nntp.perl.org/group/perl.unicode/2012/01/msg3343.html Wed, 11 Jan 2012 23:10:34 +0000 Re: Unicode on Windows Console by Michael Ludwig Michael Ludwig schrieb am 07.01.2012 um 18:30 (+0100):<br/>&gt; There&#39;s a WinAPI function that sets stdout to Unicode so you can<br/>&gt; read Cyrillic and Greek characters in the cmd.exe console window:<br/><br/>&gt; Can I get the same feature in Perl?<br/><br/>Yes: https://metacpan.org/module/Win32::Unicode<br/><br/>Printing twelve hearts:<br/><br/>perl -MWin32::Unicode -lwe &quot;printW qq( \x{2665}) x 12&quot;<br/><br/> &hearts; &hearts; &hearts; &hearts; &hearts; &hearts; &hearts; &hearts; &hearts; &hearts; &hearts; &hearts;<br/><br/>-- <br/>Michael Ludwig<br/> http://www.nntp.perl.org/group/perl.unicode/2012/01/msg3342.html Wed, 11 Jan 2012 14:55:44 +0000 Unicode on Windows Console by Michael Ludwig There&#39;s a WinAPI function that sets stdout to Unicode so you can<br/>read Cyrillic and Greek characters in the cmd.exe console window:<br/><br/> \,,,/<br/> (o o)<br/>------oOOo-(_)-oOOo------<br/>// http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx - _setmode<br/>// crt_setmodeunicode.c<br/>// This program uses _setmode to change<br/>// stdout to Unicode. 
Cyrillic and Ideographic<br/>// characters will appear on the console (if<br/>// your console font supports those character sets).<br/><br/>#include &lt;fcntl.h&gt;<br/>#include &lt;io.h&gt;<br/>#include &lt;stdio.h&gt;<br/><br/>int main(void) {<br/> _setmode(_fileno(stdout), _O_U16TEXT);<br/> wprintf(L&quot;\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n&quot;);<br/> return 0;<br/>}<br/>-------------------------<br/><br/>Compile it and it will print &quot;&#x43A;&#x43E;&#x448;&#x43A;&#x430; &#x65E5;&#x672C;&#x56FD;&quot;, so a Russian name and three<br/>fancy ideograms. (The ideograms aren&#39;t supported by my font, but that<br/>is an unrelated problem; and I can&#39;t read them anyway; still they look<br/>nice.) My console codepage is just 1252, Western European.<br/><br/>Can I get the same feature in Perl? I left the codepage at 1252 (can be<br/>changed using CHCP) and tried the following:<br/><br/>* binmode STDOUT, &#39;:encoding(UTF-16LE)&#39;<br/>* binmode STDOUT, &#39;:encoding(UTF-16BE)&#39;<br/>* binmode STDOUT, &#39;:encoding(UTF-8)&#39;<br/><br/>None of these produced the desired effect. Any ideas?<br/><br/>I know I could use a Linux UTF-8 terminal or Cygwin/MinTTY, which is<br/>what I&#39;m using to write this mail, by the way; but the question is<br/>specific to the Windows console in cmd.exe and how to make Perl use<br/>its features.<br/>-- <br/>Michael Ludwig<br/> http://www.nntp.perl.org/group/perl.unicode/2012/01/msg3341.html Sat, 07 Jan 2012 09:31:01 +0000 Re: Please help me with this perl question by Dean Hoover I don&#39;t see how you can do that... you are dependent on the capabilities of<br/>the terminal. If it&#39;s an old school text terminal, you can&#39;t just<br/>draw arbitrary stuff like a rotated D. 
That is pretty fancy and even in<br/>graphics mode has nothing to do with perl, or do I misunderstand your<br/>question?<br/><br/>On Fri, Dec 30, 2011 at 2:12 PM, Michael Ludwig &lt;milu71@gmx.de&gt; wrote:<br/><br/>&gt; FORREST COPLEY schrieb am 29.12.2011 um 10:48 (-0700):<br/>&gt; &gt; Is it possible to write a perl script to print a completely custom<br/>&gt; &gt; character on a console text terminal?<br/>&gt; &gt; Say a D rotated 90 degrees or something.<br/>&gt; &gt; or an A with the innards filled in.<br/>&gt;<br/>&gt; Here&#39;s how to print alpha to omega:<br/>&gt;<br/>&gt; perl -C2 -lwe &#39;print join q(, ), map chr, 0x391 .. 0x3a9&#39;<br/>&gt;<br/>&gt; &Alpha;, &Beta;, &Gamma;, &Delta;, &Epsilon;, &Zeta;, &Eta;, &Theta;, &Iota;, &Kappa;, &Lambda;, &Mu;, &Nu;, &Xi;, &Omicron;, &Pi;, &Rho;, &#x3A2;, &Sigma;, &Tau;, &Upsilon;, &Phi;, &Chi;, &Psi;, &Omega;<br/>&gt;<br/>&gt; If you know the Unicode codepoints you can certainly print the characters.<br/>&gt; For Perl, it&#39;s essentially numbers.<br/>&gt;<br/>&gt; Whether your particular character comes out nicely or not depends on<br/>&gt; whether the font you&#39;re using has the glyph in question.<br/>&gt;<br/>&gt; Next time you might want to spend two seconds or even ten thinking<br/>&gt; about a suitable subject line &hellip;<br/>&gt;<br/>&gt; --<br/>&gt; Michael Ludwig<br/>&gt;<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/12/msg3340.html Fri, 30 Dec 2011 17:39:34 +0000 Re: Please help me with this perl question by Karl Williamson On 12/29/2011 10:48 AM, FORREST COPLEY wrote:<br/>&gt; Is it possible to write a perl script to print a completely custom<br/>&gt; character on a console text terminal?<br/>&gt; Say a D rotated 90 degrees or something.<br/>&gt; or an A with the innards filled in.<br/>&gt; --<br/>&gt;<br/>&gt;<br/><br/>http://search.cpan.org/~bdfoy/Unicode-Tussle-1.03/lib/Unicode/Tussle.pm<br/><br/>contains the Perl script:<br/>leo - u&#x28D;op&#x259;p&#x1D09;sdn 
s&#x183;u&#x1D09;&#x265;&#x287; &#x259;&#x287;&#x1D09;&#x279;&#x28D; o&#x287; &#x279;&#x259;&#x287;l&#x1D09;&#x25F;<br/><br/>That&#39;s the closest thing I know about<br/> http://www.nntp.perl.org/group/perl.unicode/2011/12/msg3339.html Fri, 30 Dec 2011 11:47:21 +0000 Re: Please help me with this perl question by Michael Ludwig FORREST COPLEY schrieb am 29.12.2011 um 10:48 (-0700):<br/>&gt; Is it possible to write a perl script to print a completely custom<br/>&gt; character on a console text terminal?<br/>&gt; Say a D rotated 90 degrees or something.<br/>&gt; or an A with the innards filled in.<br/><br/>Here&#39;s how to print alpha to omega:<br/><br/>perl -C2 -lwe &#39;print join q(, ), map chr, 0x391 .. 0x3a9&#39;<br/><br/>&Alpha;, &Beta;, &Gamma;, &Delta;, &Epsilon;, &Zeta;, &Eta;, &Theta;, &Iota;, &Kappa;, &Lambda;, &Mu;, &Nu;, &Xi;, &Omicron;, &Pi;, &Rho;, &#x3A2;, &Sigma;, &Tau;, &Upsilon;, &Phi;, &Chi;, &Psi;, &Omega;<br/><br/>If you know the Unicode codepoints you can certainly print the characters.<br/>For Perl, it&#39;s essentially numbers.<br/><br/>Whether your particular character comes out nicely or not depends on<br/>whether the font you&#39;re using has the glyph in question.<br/><br/>Next time you might want to spend two seconds or even ten thinking<br/>about a suitable subject line &hellip;<br/><br/>-- <br/>Michael Ludwig<br/> http://www.nntp.perl.org/group/perl.unicode/2011/12/msg3338.html Fri, 30 Dec 2011 11:12:44 +0000 Please help me with this perl question by FORREST COPLEY Is it possible to write a perl script to print a completely custom <br/>character on a console text terminal?<br/>Say a D rotated 90 degrees or something.<br/>or an A with the innards filled in.<br/>-- <br/><br/>Forrest Copley | Senior Technical Support Engineer, Solaris OS Support , <br/>Global Systems Support<br/>Email: forrest.copley@oracle.com &lt;mailto:rob.hulme@oracle.com&gt;<br/>Phone: 303.272.6716<br/>OracleGlobal Customer Services<br/><br/>Log, update, and monitor 
your Service Request online using <br/>https://support.oracle.com &lt;https://support.oracle.com/&gt;<br/><br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/12/msg3337.html Fri, 30 Dec 2011 10:55:25 +0000 New API available to access Unicode DB, and RFC on changes to it. by Karl Williamson Repeat of last message, but now the attachment should be correct.<br/><br/>Perl 5.15.5, now available, has additions to Unicode::UCD in it to allow <br/>unfettered programmatic access to the Unicode character data base. The <br/>API is quite similar to what was sent out for comment on this list <br/>several months ago; several changes were required as a result of lessons <br/>learned during implementation. This email has an attachment that is an <br/>html file giving (with a yellow background) the additions since 5.14 to <br/>the pod.<br/><br/>As a result of this API, it is deprecated to read the files in <br/>lib/unicore directly. These may change, and the API will be stable as <br/>of 5.16. In the meantime, I&#39;d be happy to have people use this, and <br/>give me feedback on any problems with the API or bugs in the code.<br/><br/>And, I do wish to change the API already for certain of the outputs in <br/>prop_invmap() in order to make them more compact. For example, take the <br/>uc() property. 
What it currently returns is this (taken from the <br/>attached pod):<br/><br/> @$uppers_ranges_ref @$uppers_maps_ref Note<br/> 0 &quot;&lt;code point&gt;&quot;<br/> 97 65 &#39;a&#39; maps to &#39;A&#39;<br/> 98 66 &#39;b&#39; =&gt; &#39;B&#39;<br/> 99 67 &#39;c&#39; =&gt; &#39;C&#39;<br/> ...<br/> 120 88 &#39;x&#39; =&gt; &#39;X&#39;<br/> 121 89 &#39;y&#39; =&gt; &#39;Y&#39;<br/> 122 90 &#39;z&#39; =&gt; &#39;Z&#39;<br/> 123 &quot;&lt;code point&gt;&quot;<br/> 181 924 MICRO SIGN =&gt; Greek Cap MU<br/> 182 &quot;&lt;code point&gt;&quot;<br/> ...<br/> 0x0149 [ 0x02BC 0x004E ]<br/> 0x014A &quot;&lt;code point&gt;&quot;<br/> 0x014B 0x014A<br/> ...<br/><br/><br/>That could be more compactly represented as:<br/> @$uppers_ranges_ref @$uppers_maps_ref Note<br/> 0 0<br/> 97 -32 &#39;a&#39;-&#39;z&#39; maps to &#39;A&#39;-&#39;Z&#39;<br/> 123 0<br/> 181 743 MICRO SIGN =&gt; Greek Cap MU<br/> 182 0<br/> ...<br/> 0x0149 [ 0x02BC 0x004E ]<br/> 0x014A 0<br/> 0x014B -1<br/> ...<br/><br/>where the map is to be added to the code point to get the final result. <br/> Thus only one entry is needed to represent all 26 ASCII lower case <br/>character mappings, instead of 26 entries. This makes such tables <br/>significantly smaller. The Perl core currently does a linear search <br/>through them looking for mappings. Using the more compact versions <br/>would speed that up significantly. The percentage gain is 30-40%, and <br/>with the mapping for decimal digits the result is a full order of <br/>magnitude smaller, making the search much much faster.<br/><br/>Returning the delta only makes sense on a few tables, ones whose <br/>map is code points, or the decimal digits.<br/><br/>As you can see in the example for 0x0149, I wouldn&#39;t propose to make <br/>deltas of the lists, even though that is inconsistent. 
They generally <br/>require special handling.<br/><br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/11/msg3336.html Mon, 21 Nov 2011 13:19:46 +0000 New API available to access Unicode DB, and RFC on changes to it. by Karl Williamson Perl 5.15.5, now available, has additions to Unicode::UCD in it to allow <br/>unfettered programmatic access to the Unicode character data base. The <br/>API is quite similar to what was sent out for comment on this list <br/>several months ago; several changes were required as a result of lessons <br/>learned during implementation. This email has an attachment that is an <br/>html file giving (with a yellow background) the additions since 5.14 to <br/>the pod.<br/><br/>As a result of this API, it is deprecated to read the files in <br/>lib/unicore directly. These may change, and the API will be stable as <br/>of 5.16. In the meantime, I&#39;d be happy to have people use this, and <br/>give me feedback on any problems with the API or bugs in the code.<br/><br/>And, I do wish to change the API already for certain of the outputs in <br/>prop_invmap() in order to make them more compact. For example, take the <br/>uc() property. 
What it currently returns is this (taken from the <br/>attached pod):<br/><br/> @$uppers_ranges_ref @$uppers_maps_ref Note<br/> 0 &quot;&lt;code point&gt;&quot;<br/> 97 65 &#39;a&#39; maps to &#39;A&#39;<br/> 98 66 &#39;b&#39; =&gt; &#39;B&#39;<br/> 99 67 &#39;c&#39; =&gt; &#39;C&#39;<br/> ...<br/> 120 88 &#39;x&#39; =&gt; &#39;X&#39;<br/> 121 89 &#39;y&#39; =&gt; &#39;Y&#39;<br/> 122 90 &#39;z&#39; =&gt; &#39;Z&#39;<br/> 123 &quot;&lt;code point&gt;&quot;<br/> 181 924 MICRO SIGN =&gt; Greek Cap MU<br/> 182 &quot;&lt;code point&gt;&quot;<br/> ...<br/> 0x0149 [ 0x02BC 0x004E ]<br/> 0x014A &quot;&lt;code point&gt;&quot;<br/> 0x014B 0x014A<br/> ...<br/><br/><br/>That could be more compactly represented as:<br/> @$uppers_ranges_ref @$uppers_maps_ref Note<br/> 0 0<br/> 97 -32 &#39;a&#39;-&#39;z&#39; maps to &#39;A&#39;-&#39;Z&#39;<br/> 123 0<br/> 181 743 MICRO SIGN =&gt; Greek Cap MU<br/> 182 0<br/> ...<br/> 0x0149 [ 0x02BC 0x004E ]<br/> 0x014A 0<br/> 0x014B -1<br/> ...<br/><br/>where the map is to be added to the code point to get the final result. <br/> Thus only one entry is needed to represent all 26 ASCII lower case <br/>character mappings, instead of 26 entries. This makes such tables <br/>significantly smaller. The Perl core currently does a linear search <br/>through them looking for mappings. Using the more compact versions <br/>would speed that up significantly. The percentage gain is 30-40%, and <br/>with the mapping for decimal digits the result is a full order of <br/>magnitude smaller, making the search much much faster.<br/><br/>Returning the delta only makes sense on a few tables, ones whose <br/>map is code points, or the decimal digits.<br/><br/>As you can see in the example for 0x0149, I wouldn&#39;t propose to make <br/>deltas of the lists, even though that is inconsistent. 
They generally <br/>require special handling.<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/11/msg3335.html Mon, 21 Nov 2011 12:42:40 +0000 Re: RFC: API to access Unicode db files by Karl Williamson Here&#39;s a new version of the API for comment, with the addition of 2 <br/>extra functions:<br/><br/><br/><br/> prop_invlist()<br/> &quot;prop_invlist&quot; returns an inversion list (described below)<br/> that defines all the code points for the Unicode property<br/> given by the input parameter string:<br/><br/> use Unicode::UCD &#39;prop_invlist&#39;;<br/> say join &quot;, &quot;, prop_invlist(&quot;Any&quot;);<br/><br/> 0, 1114112<br/><br/> An empty list is returned if the given property is unknown;<br/> the number of elements in the list is returned if called in<br/> scalar context.<br/><br/> perluniprops gives the list of properties that this function<br/> accepts, as well as all the possible forms for them (loose<br/> matching rules are used on the parameter). Note that many<br/> properties can be specified in a compound form, such as<br/><br/> say join &quot;, &quot;, prop_invlist(&quot;Script=Shavian&quot;);<br/> 66640, 66688<br/><br/> say join &quot;, &quot;, prop_invlist(&quot;ASCII_Hex_Digit=No&quot;);<br/> 0, 48, 58, 65, 71, 97, 103<br/><br/> say join &quot;, &quot;, prop_invlist(&quot;ASCII_Hex_Digit=Yes&quot;);<br/> 48, 58, 65, 71, 97, 103<br/><br/> Inversion lists are a compact way of specifying Unicode<br/> properties. The 0th item in the list is the lowest code<br/> point that has the property-value. The next item is the<br/> lowest code point after that one that does NOT have the<br/> property-value. And the next item after that is the lowest<br/> code point after that one that has the property-value, and so<br/> on. 
Put another way, each element in the list gives the<br/> beginning of a range that has the property-value (for even<br/> numbered elements), or doesn&#39;t have the property-value (for<br/> odd numbered elements).<br/><br/> In the final example above, the first ASCII Hex digit is code<br/> point 48, the character &quot;0&quot;, and all code points from it<br/> through 57 (a &quot;9&quot;) are ASCII hex digits. Code points 58<br/> through 64 aren&#39;t, but 65 (an &quot;A&quot;) through 70 (an &quot;F&quot;) are,<br/> as are 97 (&quot;a&quot;) through 102 (&quot;f&quot;). 103 starts a range of<br/> code points that aren&#39;t ASCII hex digits. That range extends<br/> to infinity, which on your computer can be found in the<br/> variable $Unicode::UCD::MAX_CP. (This variable is as close<br/> to infinity as Perl can get on your platform, and may be too<br/> high for some operations to work; you may wish to use a<br/> smaller number for your purposes.)<br/><br/> The name for this data structure stems from the fact that<br/> each element in the list toggles (or inverts) whether the<br/> corresponding range is or isn&#39;t on the list.<br/><br/> It is a simple matter to expand out an inversion list to a<br/> full list of all code points that have the property-value:<br/><br/> my @invlist = prop_invlist(&quot;My Property&quot;);<br/> die &quot;empty&quot; unless @invlist;<br/> my @full_list;<br/> for (my $i = 0; $i &lt; @invlist; $i += 2) {<br/> my $upper = ($i + 1) &lt; @invlist<br/> ? $invlist[$i+1] - 1 # In range<br/> : $Unicode::UCD::MAX_CP; # To infinity. You may want<br/> # to stop much much earlier;<br/> # going this high may expose<br/> # perl bugs with very large<br/> # numbers.<br/> for my $j ($invlist[$i] .. 
$upper) {<br/> push @full_list, $j;<br/> }<br/> }<br/><br/> prop_aliases()<br/> use Unicode::UCD &#39;prop_aliases&#39;;<br/><br/> my $full_name = prop_aliases(&quot;White Space&quot;);<br/> my @all_names = prop_aliases(&quot;White Space&quot;);<br/> my $short_name = $all_names[0];<br/> print join(&quot;, &quot;, @all_names), &quot;\n&quot;;<br/><br/> XXX<br/><br/> Most Unicode properties have several synonymous names.<br/> Typically, there is at least a short name, convenient to<br/> type, and a long name that more fully describes the property,<br/> and hence is more easily understood.<br/><br/> If you know one name for a property, you can use<br/> &quot;prop_aliases&quot; to find either the long name (when called in<br/> scalar context), or a list of all of the names, somewhat<br/> ordered so that the short name is in the 0th element, the<br/> long name in the next element, and any other synonyms in the<br/> remaining elements, in no particular order.<br/><br/> The long name is returned in a form nicely capitalized,<br/> suitable for printing.<br/><br/> White space, hyphens, and underscores are ignored in the<br/> input parameter name.<br/><br/> If the name is unknown, &quot;undef&quot; is returned.<br/><br/> prop_value_aliases()<br/> use Unicode::UCD &#39;prop_value_aliases&#39;;<br/><br/> my $full_name = prop_value_aliases(&quot;Gc&quot;, &quot;Punct&quot;);<br/> my @all_names = prop_value_aliases(&quot;Gc&quot;, &quot;Punct&quot;);<br/> my $short_name = $all_names[0];<br/> print &quot;The aliases are: &quot;, join(&quot;, &quot;, @all_names), &quot;\n&quot;;<br/> print &quot;The fullname is $full_name\n&quot;;<br/><br/> The aliases are: P, Punctuation, Punct<br/> The fullname is Punctuation<br/><br/> Some Unicode properties have a restricted set of legal<br/> values. 
For example, all binary properties are restricted to<br/> just &quot;true&quot; or &quot;false&quot;; and there are only a few dozen<br/> possible General Categories.<br/><br/> For such properties, there are usually several synonyms for<br/> each possible value. For example, in binary properties,<br/> truth can be represented by any of the strings, &quot;Y&quot;, &quot;Yes&quot;,<br/> &quot;T&quot;, or &quot;True&quot;; and the General Category &quot;Punctuation&quot; by<br/> that string, or &quot;Punct&quot;, or simply &quot;P&quot;.<br/><br/> Like property names, there is typically at least a short name<br/> for each such property-value, and a long name. If you know<br/> any name of the property-value, you can use<br/> &quot;prop_value_aliases&quot;() to get the long name (when called in<br/> scalar context), or a list of all the names, with the short<br/> name in the 0th element, the long name in the next element,<br/> and any other synonyms in the remaining elements, in no<br/> particular order, except that any all-numeric synonyms will<br/> be last.<br/><br/> The long name is returned in a form nicely capitalized,<br/> suitable for printing.<br/><br/> White space, hyphens, and underscores are ignored in the<br/> input parameters.<br/><br/> If either name is unknown, &quot;undef&quot; is returned.<br/><br/> If called with a property that doesn&#39;t have synonyms for its<br/> values, it returns the input value, possibly normalized with<br/> capitalization and underscores.<br/><br/> For the block property, new-style block names are returned<br/> (see &quot;Old-style versus new-style block names&quot;).<br/><br/> prop_invmap()<br/> &quot;prop_invmap&quot; is used to get the complete mapping definition<br/> for a property, in the form of an inversion map. An<br/> inversion map consists of two parallel arrays. 
One is an<br/> ordered list of code points that mark range beginnings, and<br/> the other gives the value (or mapping) that all code points<br/> in the corresponding range have.<br/><br/> &quot;prop_invmap&quot; is called with the name of the desired<br/> property. The name is loosely matched, meaning that<br/> differences in case, white-space, hyphens, and underscores<br/> are not meaningful. Many Unicode properties have more than<br/> one name (or alias). &quot;prop_invmap&quot; understands all of these.<br/> &quot;undef&quot; is returned if the property name is unknown.<br/><br/> It is a fatal error to call this function except in list<br/> context.<br/><br/> In addition to the two arrays that form the inversion<br/> map, &quot;prop_invmap&quot; returns two other values: one is a scalar<br/> that gives some details as to the format of the entries of<br/> the map array; the other is used for specialized purposes,<br/> described at the end of this section.<br/><br/> This means that &quot;prop_invmap&quot; returns a 4 element list. 
For<br/> example,<br/><br/> my ($blocks_ranges_ref, $blocks_maps_ref, $format, $default)<br/> = prop_invmap(&quot;Block&quot;);<br/><br/> In this call, the two arrays will be populated as shown below<br/> (for Unicode 6.0):<br/><br/> Index @blocks_ranges @blocks_maps<br/> 0 0x0000 Basic Latin<br/> 1 0x0080 Latin-1 Supplement<br/> 2 0x0100 Latin Extended-A<br/> 3 0x0180 Latin Extended-B<br/> 4 0x0250 IPA Extensions<br/> 5 0x02B0 Spacing Modifier Letters<br/> 6 0x0300 Combining Diacritical Marks<br/> 7 0x0370 Greek and Coptic<br/> 8 0x0400 Cyrillic<br/> ...<br/> 233 0x2B820 No_Block<br/> 234 0x2F800 CJK Compatibility Ideographs Supplement<br/> 235 0x2FA20 No_Block<br/> 236 0xE0000 Tags<br/> 237 0xE0080 No_Block<br/> 238 0xE0100 Variation Selectors Supplement<br/> 239 0xE01F0 No_Block<br/> 240 0xF0000 Supplementary Private Use Area-A<br/> 241 0x100000 Supplementary Private Use Area-B<br/> 242 0x110000 No_Block<br/><br/> The first line (with Index 0) means that the value for code<br/> point 0 is &quot;Basic Latin&quot;. The entry &quot;0x0080&quot; in the<br/> @blocks_ranges column in the second line means that the value<br/> from the first line, &quot;Basic Latin&quot;, extends to all code<br/> points in the range up to but not including 0x0080, that is,<br/> to 127. In other words, the code points from 0 to 127 are<br/> all in the &quot;Basic Latin&quot; block. Similarly, all code points<br/> in the range from 0x0080 up to (but not including) 0x0100 are<br/> in the block named &quot;Latin-1 Supplement&quot;, etc. (Notice that<br/> the return is the old-style block names; see &quot;Old-style<br/> versus new-style block names&quot;).<br/><br/> The final line (with Index 242) means that all<br/> code points above the legal Unicode maximum code point have<br/> the value &quot;No_Block&quot;, which is the term Unicode uses for a<br/> non-existing block.<br/><br/> The arrays completely specify the mappings for all possible<br/> code points. 
The final element in an inversion map returned<br/> by this function will always be for the range that consists<br/> of all the code points that aren&#39;t legal Unicode, but that<br/> are expressible on the platform. (That is, it starts with<br/> code point 0x110000, the first code point above the legal<br/> Unicode maximum, and extends to infinity.) The value for that<br/> range will be the same that any normal unassigned code point<br/> has for the specified property. (Certain unassigned code<br/> points are not &quot;normal&quot;; for example the non-character code<br/> points, or those in blocks that are to be written right-to-<br/> left. The range value will not necessarily be the same as<br/> those code points have.) It could be argued that, instead of<br/> treating these as unassigned Unicode code points, the value<br/> for this range should be &quot;undef&quot;. You can make that decision<br/> and change the returned array accordingly.<br/><br/> The maps are almost always simple scalars that should be<br/> interpreted as-is. These values are those given in the<br/> Unicode data files, which may be inconsistent as to<br/> capitalization and which synonym for a property-value is<br/> given. The results may be normalized by using the<br/> &quot;prop_value_aliases()&quot; function.<br/><br/> There are exceptions to the simple scalar maps. Some<br/> properties have some elements in their map list that are<br/> themselves lists of scalars; and some special strings are<br/> returned that are not to be interpreted as-is. Element [2]<br/> (placed into $format in the example above) of the returned 4<br/> element list tells you if the map has any of these special<br/> elements, as follows:<br/><br/> &quot;s&quot; means all the elements of the map array are simple<br/> scalars. 
Almost all properties are like this, like the<br/> &quot;block&quot; example above.<br/><br/> &quot;sl&quot;<br/> means that some of the map array elements have the form<br/> given by &quot;s&quot;, and the rest are lists of scalars. For<br/> example, here is a portion of the output of calling<br/> &quot;prop_invmap&quot;() with the &quot;Script Extensions&quot; property:<br/><br/> @scripts_ranges @scripts_maps<br/> ...<br/> 0x0953 Deva<br/> 0x0964 [ Beng Deva Guru Orya ]<br/> 0x0966 Deva<br/> 0x0970 Common<br/><br/> Here, the code points 0x964 and 0x965 are used in the<br/> Bengali, Devanagari, Gurmukhi, and Oriya scripts.<br/><br/> &quot;r&quot; means that all the elements of the map array are either<br/> rational numbers or the string &quot;NaN&quot;, meaning &quot;Not a<br/> Number&quot;. A rational number is either an integer, or two<br/> integers separated by a solidus (&quot;/&quot;). The second<br/> integer represents the denominator of the division<br/> implied by the solidus, and is guaranteed not to be 0.<br/> If you want to convert them to scalar numbers, you can<br/> use something like this:<br/><br/> my ($invlist_ref, $invmap_ref, $format)<br/> = prop_invmap($property);<br/> if ($format &amp;&amp; $format eq &quot;r&quot;) {<br/> map { $_ = eval $_ } @$invmap_ref;<br/> }<br/><br/> Here are some entries from the output of the property &quot;Nv&quot;,<br/> which has format &quot;r&quot;.<br/><br/> @numerics_ranges @numerics_maps Note<br/> 0x00 &quot;NaN&quot;<br/> 0x30 0 DIGIT 0<br/> 0x31 1<br/> 0x32 2<br/> ...<br/> 0x37 7<br/> 0x38 8<br/> 0x39 9 DIGIT 9<br/> 0x3A &quot;NaN&quot;<br/> 0xB2 2 SUPERSCRIPT 2<br/> 0xB3 3 SUPERSCRIPT 3<br/> 0xB4 &quot;NaN&quot;<br/> 0xB9 1 SUPERSCRIPT 1<br/> 0xBA &quot;NaN&quot;<br/> 0xBC 1/4 VULGAR FRACTION 1/4<br/> 0xBD 1/2 VULGAR FRACTION 1/2<br/> 0xBE 3/4 VULGAR FRACTION 3/4<br/> 0xBF &quot;NaN&quot;<br/> 0x660 0 ARABIC-INDIC DIGIT ZERO<br/><br/> &quot;c&quot; is like &quot;s&quot; in that all the map array elements 
are<br/> scalars, but some of them are the special string<br/> &quot;&lt;code point&gt;&quot;, meaning that the map of each code point<br/> in the corresponding range in the inversion list is the<br/> code point itself. For example, in:<br/><br/> my ($uppers_ranges_ref, $uppers_maps_ref, $format)<br/> = prop_invmap(&quot;Simple_Uppercase_Mapping&quot;);<br/><br/> the returned arrays look like this:<br/><br/> @$uppers_ranges_ref @$uppers_maps_ref Note<br/> 0 &quot;&lt;code point&gt;&quot;<br/> 97 65 &#39;a&#39; maps to &#39;A&#39;<br/> 98 66 &#39;b&#39; =&gt; &#39;B&#39;<br/> 99 67 &#39;c&#39; =&gt; &#39;C&#39;<br/> ...<br/> 120 88 &#39;x&#39; =&gt; &#39;X&#39;<br/> 121 89 &#39;y&#39; =&gt; &#39;Y&#39;<br/> 122 90 &#39;z&#39; =&gt; &#39;Z&#39;<br/> 123 &quot;&lt;code point&gt;&quot;<br/> 181 924 MICRO SIGN =&gt;<br/> Greek Cap MU<br/> 182 &quot;&lt;code point&gt;&quot;<br/> ...<br/><br/> The first line means that the uppercase of code point 0<br/> is 0; the uppercase of code point 1 is 1; ... of code<br/> point 96 is 96. Without the &quot;&lt;code point&gt;&quot; notation,<br/> every code point would have to have an entry. This would<br/> mean that the arrays would each have more than a million<br/> entries to list just the legal Unicode code points!<br/><br/> &quot;cl&quot;<br/> means that some of the map array elements have the form<br/> given by &quot;c&quot;, and the rest are ordered lists of code<br/> points. 
For example, in:<br/><br/> my ($uppers_ranges_ref, $uppers_maps_ref, $format)<br/> = prop_invmap(&quot;Uppercase_Mapping&quot;);<br/><br/> the returned arrays look like this:<br/><br/> @$uppers_ranges_ref @$uppers_maps_ref Note<br/> 0 &quot;&lt;code point&gt;&quot;<br/> 97 65<br/> ...<br/> 122 90<br/> 123 &quot;&lt;code point&gt;&quot;<br/> 181 924<br/> 182 &quot;&lt;code point&gt;&quot;<br/> ...<br/> 0x0149 [ 0x02BC 0x004E ]<br/><br/> This is the full Uppercase_Mapping property (as opposed<br/> to the Simple_Uppercase_Mapping given in the example for<br/> &quot;c&quot;). The only difference between the two in the ranges<br/> shown is that the code point at 0x0149 (LATIN SMALL<br/> LETTER N PRECEDED BY APOSTROPHE) maps to a string of two<br/> characters, 0x02BC (MODIFIER LETTER APOSTROPHE) followed<br/> by 0x004E (LATIN CAPITAL LETTER N).<br/><br/> &quot;n&quot; means the Name property. All the elements of the map<br/> array are simple scalars, but some of them contain<br/> special strings that require more work to get the actual<br/> name.<br/><br/> Entries such as:<br/><br/> CJK UNIFIED IDEOGRAPH-&lt;code point&gt;<br/><br/> mean that the name for the code point is &quot;CJK UNIFIED<br/> IDEOGRAPH-&quot; with the code point (expressed in<br/> hexadecimal) appended to it (similarly for &quot;CJK<br/> COMPATIBILITY IDEOGRAPH-&lt;code point&gt;&quot;).<br/><br/> Also, entries like<br/><br/> &lt;hangul syllable&gt;<br/><br/> mean that the name is algorithmically calculated. This<br/> is easily done by the function charnames::viacode().<br/><br/> Note that for control characters (&quot;Gc=cc&quot;), Unicode&#39;s<br/> data files have the string &quot;&lt;control&gt;&quot;, but the real name<br/> of each of these characters is the empty string. This<br/> function returns the real name.<br/><br/> &quot;d&quot; means the Decomposition_Mapping property. 
Like &quot;n&quot;, this<br/> property uses<br/><br/> &lt;hangul syllable&gt;<br/><br/> for those code points whose decomposition is<br/> algorithmically calculated. These can be generated via<br/> the function Unicode::Normalize::NFD().<br/><br/> Otherwise, this property is like &quot;cl&quot; properties.<br/><br/> Note that the mapping is the one that is specified in the<br/> Unicode data files, and to get the final decomposition,<br/> it may need to be applied recursively.<br/><br/> A binary search can be used to quickly find a code point in<br/> the inversion list, and hence its corresponding mapping.<br/><br/> The final element ([3], assigned to $default in the &quot;block&quot;<br/> example) in the list returned by this function may be useful<br/> for applications that wish to convert the returned inversion<br/> map data structure into some other, such as a hash. It gives<br/> the mapping that most code points map to under the property.<br/> If you establish the convention that any code point not<br/> explicitly listed in your data structure maps to this value,<br/> you can potentially make your data structure much smaller.<br/> As you construct your data structure from the one returned by<br/> this function, simply ignore those ranges that map to this<br/> value, generally called the &quot;default&quot; value.<br/><br/> One internal Perl property is accessible by this function.<br/> &quot;Perl_Decimal_Digit&quot; returns an inversion map in which all<br/> the Unicode decimal digits map to their numeric values, and<br/> everything else to the empty string, like so:<br/><br/> @digits @values<br/> 0x0000 &quot;&quot;<br/> 0x0030 0<br/> 0x0031 1<br/> 0x0032 2<br/> 0x0033 3<br/> 0x0034 4<br/> 0x0035 5<br/> 0x0036 6<br/> 0x0037 7<br/> 0x0038 8<br/> 0x0039 9<br/> 0x003A &quot;&quot;<br/> 0x0660 0<br/> 0x0661 1<br/> ...<br/><br/> Old-style versus new-style block names<br/> Unicode publishes the names of blocks in two different<br/> styles, though the two are 
equivalent under Unicode&#39;s loose<br/> matching rules.<br/><br/> The original style uses blanks and hyphens in the block names<br/> (except for &quot;No_Block&quot;), like so:<br/><br/> Miscellaneous Mathematical Symbols-B<br/><br/> The newer style replaces these with underscores, like this:<br/><br/> Miscellaneous_Mathematical_Symbols_B<br/><br/> This newer style is consistent with the values of other<br/> Unicode properties. To preserve backward compatibility, all<br/> the functions in Unicode::UCD that return block names (except<br/> one) return the old-style ones. That one function,<br/> &quot;prop_value_aliases&quot;() can be used to convert from old-style<br/> to new-style:<br/><br/> my $new_style = prop_value_aliases(&quot;block&quot;, $old_style);<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/08/msg3334.html Wed, 17 Aug 2011 13:42:17 +0000 Re: RFC: API to access Unicode db files by Zefram Karl Williamson wrote:<br/>&gt; prop_invmap(&quot;Numeric_Value&quot;, \@numerics_ranges, \@numerics_maps);<br/><br/>Looks fine to me, except for the convention for returning the arrays.<br/>It would be neater to return a list of items rather than modify arrays<br/>in place:<br/><br/> ($status, $ranges, $maps) = prop_invmap(&quot;Numeric_Value&quot;);<br/><br/>-zefram<br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3333.html Sun, 24 Jul 2011 20:26:21 +0000 RFC: API to access Unicode db files by Karl Williamson Some applications are finding it necessary to read in the Unicode files <br/>that mktables generates. For example, grepping through CPAN indicates <br/>that Text::Unicode::Equivalents reads Decomposition.pl. This, and most <br/>of the other generated files are marked for internal use only, because <br/>we wish to reserve the right to change them around, etc. But <br/>applications currently have no feasible alternative. 
Prior to 5.14, we <br/>delivered the full Unicode db files that the Unicode consortium <br/>publishes, and whose format is guaranteed not to change. But we dropped <br/>those files in 5.14 to save disk space.<br/><br/>I&#39;m proposing a new function Unicode::UCD::prop_invmap() to return the <br/>contents of those files in a Unicode-centric way, so that applications <br/>can use it and we can deprecate non-core use of our generated files.<br/><br/>The function returns an inversion map, which is a data structure more <br/>used in the Unicode world than the Perl world. It consists of two <br/>parallel arrays. I suppose a more Perl-centric data structure would be <br/>an array of hashes, but the inversion map seems simpler to me to manipulate.<br/><br/>(This function would be in addition to the previously rfc&#39;d function <br/>Unicode::UCD::prop_invlist() which would return a list of all code <br/>points that match a property-value.)<br/><br/>=pod<br/><br/>=head2 prop_invmap<br/><br/>C&lt;prop_invmap&gt; is used to get the complete mapping definition for the input<br/>property, in the form of an inversion map. An inversion map consists of two<br/>parallel arrays. One is an ordered list of code points that mark range<br/>beginnings, and the other gives the value that all code points in the<br/>corresponding range have. C&lt;prop_invmap&gt; is called with the name of the<br/>desired property, and references to the two arrays, which it fills. 
For<br/>example,<br/><br/> prop_invmap(&quot;Numeric_Value&quot;, \@numerics_ranges, \@numerics_maps);<br/><br/>will populate the arrays as shown below:<br/><br/> @numerics_ranges @numerics_maps Note<br/> 0x00 &quot;NaN&quot; NaN stands for &quot;Not a Number&quot;<br/> 0x30 0 DIGIT 0<br/> 0x31 1<br/> 0x32 2<br/> ...<br/> 0x37 7<br/> 0x38 8<br/> 0x39 9 DIGIT 9<br/> 0x3A &quot;NaN&quot;<br/> 0xB2 2 SUPERSCRIPT 2<br/> 0xB3 3 SUPERSCRIPT 3<br/> 0xB4 &quot;NaN&quot;<br/> 0xB9 1 SUPERSCRIPT 1<br/> 0xBA &quot;NaN&quot;<br/> 0xBC 0.25 VULGAR FRACTION 1/4<br/> 0xBD 0.5 VULGAR FRACTION 1/2<br/> 0xBE 0.75 VULGAR FRACTION 3/4<br/> 0xBF &quot;NaN&quot;<br/> 0x660 0 ARABIC-INDIC DIGIT ZERO<br/> ... ...<br/> 0x110000 undef<br/><br/>The second line means that the value for the code point 0x30 (which is <br/>&quot;DIGIT<br/>0&quot;) is 0. The first line means that all code points in the range from <br/>0x00 to<br/>0x2F (which is 0x30 (from the second line) - 1) have the value &quot;NaN&quot;.<br/>The final line means that all code points above the legal<br/>Unicode maximum code point have the value C&lt;undef&gt; (not the string <br/>&quot;u-n-d-e-f&quot;).<br/><br/>The arrays completely specify the mappings for all possible code points.<br/><br/>The special string S&lt;C&lt;&quot;E&lt;lt&gt;code pointE&lt;gt&gt;&quot;&gt;&gt; is used to specify that<br/>the value of a code point is itself. 
For example, the beginnings of the<br/>arrays for<br/><br/> prop_invmap(&quot;Uppercase_Mapping&quot;, \@uppers_ranges, \@uppers_maps);<br/><br/>look like this:<br/><br/> @uppers_ranges @uppers_maps Note<br/> 0 &quot;&lt;code point&gt;&quot;<br/> 97 65 &#39;a&#39; maps to &#39;A&#39;<br/> 98 66 &#39;b&#39; =&gt; &#39;B&#39;<br/> 99 67 &#39;c&#39; =&gt; &#39;C&#39;<br/> ...<br/> 120 88 &#39;x&#39; =&gt; &#39;X&#39;<br/> 121 89 &#39;y&#39; =&gt; &#39;Y&#39;<br/> 122 90 &#39;z&#39; =&gt; &#39;Z&#39;<br/> 123 &quot;&lt;code point&gt;&quot;<br/> 181 924 MICRO SIGN =&gt; Greek Cap MU<br/> 182 &quot;&lt;code point&gt;&quot;<br/> 223 [ 83 83 ] SHARP S =&gt; &#39;SS&#39;<br/> 224 192<br/><br/>The first line means that the uppercase of code point 0 is 0, of 1 is 1, ...<br/>of 96 is 96. Without the C&lt;&quot;E&lt;lt&gt;code_pointE&lt;gt&gt;&quot;&gt; notation, every code <br/>point<br/>would have to have an entry. This would mean that the arrays would each <br/>have<br/>more than a million entries to list just the legal Unicode code points!<br/><br/>In some properties some code points map to a sequence of multiple code <br/>points.<br/>For those, the corresponding entries in the map array are not scalars, but<br/>references to anonymous arrays containing the ordered list of code points<br/>mapped to, as shown in the example above for 223.<br/><br/>The &quot;Name&quot; property map includes entries such as<br/><br/> CJK UNIFIED IDEOGRAPH-&lt;code point&gt;<br/><br/>This means that the name for the code point is &quot;CJK UNIFIED IDEOGRAPH-&quot;<br/>with the code point (expressed in hexadecimal) appended to it. Also, the<br/>notation &quot;E&lt;lt&gt;hangul syllableE&lt;gt&gt;&quot; occurs in this property, meaning <br/>that the<br/>name is algorithmically calculated. 
These names can be generated via the<br/>function C&lt;charnames::viacode&gt;().<br/><br/>The &quot;Decomposition_Mapping&quot; property also uses &quot;E&lt;lt&gt;hangul <br/>syllableE&lt;gt&gt;&quot; for<br/>those code points whose decomposition is algorithmically calculated. These<br/>can be generated via the function C&lt;Unicode::Normalize::NFD&gt;(). This <br/>property<br/>contains many occurrences of code points whose mappings are ordered lists of<br/>other code points.<br/><br/>The return value is<br/>C&lt;undef&gt; if the property is unknown;<br/>C&lt;s&gt; if all the elements of the map array are simple scalars;<br/>C&lt;n&gt; for the Name property, which has the complications described above;<br/>C&lt;d&gt; for the Decomposition_Mapping property (complications already <br/>described);<br/>otherwise C&lt;c&gt; if some of map array elements are S&lt;C&lt;&quot;E&lt;lt&gt;code <br/>pointE&lt;gt&gt;&quot;&gt;&gt;;<br/>and C&lt;l&gt; if additionally some are lists of code points.<br/><br/>A binary search can be used to quickly find a code point in the inversion<br/>list, and hence its corresponding mapping.<br/><br/>=cut<br/><br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3332.html Thu, 21 Jul 2011 08:04:27 +0000 Re: Encode question by Karl Williamson On 07/07/2011 01:17 AM, Dave Saunders wrote:<br/>&gt; Dear Encode Developers,<br/>&gt;<br/>&gt; I am migrating a perl application from Solaris 2.10 to Linux Fedora Core<br/>&gt; 14 (2.6.35.13-92.fc14.x86_64), which is running perl 5.12.3. 
The app<br/>&gt; uses SDBM and I&#39;m encountering a problem which looks related to the<br/>&gt; Encode module (which is at 2.39).<br/>&gt;<br/>&gt; The error message is:<br/>&gt;<br/>&gt;<br/>&gt; panic: sv_setpvn called with negative strlen at p.pl line 21.<br/>&gt;<br/>&gt;<br/>&gt; and this URL led me to Encode:<br/>&gt;<br/>&gt;<br/>&gt; https://rt.cpan.org/Public/Bug/Display.html?id=65541<br/>&gt;<br/>&gt;<br/>&gt; I ran the simple test at the bottom of the web page:<br/>&gt;<br/>&gt;<br/>&gt; binmode STDOUT, &#39;:encoding(cp1250)&#39;;<br/>&gt; print( ( &quot;a&quot; x 1023 ) . &quot;\x{0378}&quot; );<br/>&gt;<br/>&gt;<br/>&gt; and got the same output:<br/>&gt;<br/>&gt;<br/>&gt; &quot;\x{0340}&quot; does not map to cp1250 at a.pl line 2.<br/>&gt; panic: sv_setpvn called with negative strlen at a.pl line 2.<br/>&gt;<br/>&gt;<br/>&gt; I looked at CPAN:<br/>&gt; http://search.cpan.org/~dankogai/Encode-2.43/<br/>&gt;<br/>&gt;<br/>&gt; and read the Changes file, but don&#39;t see Bug 65541 discussed.<br/>&gt;<br/>&gt;<br/>&gt; Has this problem been solved, or is it being worked on?<br/>&gt; Is there a work-around for SDBM files (other than not using SDBM)?<br/>&gt;<br/>&gt;<br/>&gt; Thanks very much for your help,<br/>&gt; David Saunders<br/>&gt; ITS/UVa<br/>&gt;<br/><br/>It happens in today&#39;s latest development version of Perl. It looks like <br/>an Encode bug, and you&#39;d have to contact the maintainer, Dan, directly <br/>to find out its status.<br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3331.html Thu, 07 Jul 2011 10:19:05 +0000 Encode question by Dave Saunders Dear Encode Developers,<br/><br/> I am migrating a perl application from Solaris 2.10 to Linux Fedora Core <br/>14 (2.6.35.13-92.fc14.x86_64), which is running perl 5.12.3. 
The app uses <br/>SDBM and I&#39;m encountering a problem which looks related to the<br/>Encode module (which is at 2.39).<br/><br/>The error message is:<br/><br/><br/>panic: sv_setpvn called with negative strlen at p.pl line 21.<br/><br/><br/>and this URL led me to Encode:<br/><br/><br/>https://rt.cpan.org/Public/Bug/Display.html?id=65541<br/><br/><br/>I ran the simple test at the bottom of the web page:<br/><br/><br/>binmode STDOUT, &#39;:encoding(cp1250)&#39;;<br/>print( ( &quot;a&quot; x 1023 ) . &quot;\x{0378}&quot; );<br/><br/><br/>and got the same output:<br/><br/><br/>&quot;\x{0340}&quot; does not map to cp1250 at a.pl line 2.<br/>panic: sv_setpvn called with negative strlen at a.pl line 2.<br/><br/><br/>I looked at CPAN:<br/>http://search.cpan.org/~dankogai/Encode-2.43/<br/><br/><br/>and read the Changes file, but don&#39;t see Bug 65541 discussed.<br/><br/><br/>Has this problem been solved, or is it being worked on?<br/>Is there a work-around for SDBM files (other than not using SDBM)?<br/><br/><br/>Thanks very much for your help,<br/>David Saunders<br/>ITS/UVa<br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3330.html Thu, 07 Jul 2011 00:31:52 +0000 Re: Need: list of Unicode characters that have canonical decompositions. by Karl Williamson On 07/01/2011 11:49 AM, Karl Williamson wrote:<br/>&gt; On 07/01/2011 10:40 AM, BobH wrote:<br/>&gt;&gt; Karl Williamson wrote:<br/>&gt;&gt;&gt;&gt;<br/>&gt;&gt;&gt;<br/>&gt;&gt;&gt; I&#39;m trying to think of a good name. Best so far is<br/>&gt;&gt;&gt; UCD::get_prop_invlist()<br/>&gt;&gt;<br/>&gt;&gt;<br/>&gt;&gt; Hm, &quot;get&quot; normally isn&#39;t needed.<br/>&gt;&gt;<br/>&gt;&gt; How about something simpler such as UCD::charlist()<br/>&gt;&gt;<br/>&gt;&gt; Bob<br/>&gt;&gt;<br/>&gt;<br/>&gt; I think not having prop in the name is potentially misleading, and it<br/>&gt; actually isn&#39;t a list of the chars. 
It&#39;s an inversion list that is<br/>&gt; readily convertible into such a list.<br/>&gt;<br/><br/>I&#39;ve mostly written and tested it. But here is my proposed API to see <br/>how people like it (or not); (I&#39;m still open to a better name, but I do <br/>think that the name needs to have the requirements I mentioned above):<br/><br/><br/>=pod<br/><br/>=head2 prop_invlist<br/><br/>C&lt;prop_invlist&gt; returns an inversion list (see below) that defines all the<br/>code points for the Unicode property given by the input parameter string:<br/><br/> say join &quot;, &quot;, prop_invlist(&quot;Any&quot;);<br/> 0, 1114112<br/><br/>An empty list is returned if the given property is unknown.<br/><br/>L&lt;perluniprops|perluniprops/Properties accessible through \p{} and \P{}&gt; <br/>gives<br/>the list of properties that this function accepts, as well as all the <br/>possible<br/>forms for them. Note that many properties can be specified in a compound<br/>form, such as<br/><br/> say join &quot;, &quot;, prop_invlist(&quot;Script=Shavian&quot;);<br/> 66640, 66688<br/><br/> say join &quot;, &quot;, prop_invlist(&quot;ASCII_Hex_Digit=No&quot;);<br/> 0, 48, 58, 65, 71, 97, 103<br/><br/> say join &quot;, &quot;, prop_invlist(&quot;ASCII_Hex_Digit=Yes&quot;);<br/> 48, 58, 65, 71, 97, 103<br/><br/>Inversion lists are a compact way of specifying Unicode properties. The 0th<br/>item in the list is the lowest code point that has the property-value. The<br/>next item is the lowest code point after that one that does NOT have the<br/>property-value. And the next item after that is the lowest code point after<br/>that one that has the property-value, and so on. 
Put another way, each<br/>element in the list gives the beginning of a range that has the <br/>property-value<br/>(for even numbered elements), or doesn&#39;t have the property-value (for odd<br/>numbered elements).<br/><br/>In the final example above, the first ASCII Hex digit is code point 48, the<br/>character &quot;0&quot;, and all code points from it through 57 (a &quot;9&quot;) are ASCII hex<br/>digits. Code points 58 through 64 aren&#39;t, but 65 (an &quot;A&quot;) through 70 <br/>(an &quot;F&quot;)<br/>are, as are 97 (&quot;a&quot;) through 102 (&quot;f&quot;). 103 starts a range of code points<br/>that aren&#39;t ASCII hex digits. That range extends to infinity, which on your<br/>computer can be found in the variable C&lt;$Unicode::UCD::MAX_CP&gt;.<br/><br/>It is a simple matter to expand out an inversion list to a full list of all<br/>code points that have the property-value:<br/><br/> my @invlist = prop_invlist(&quot;My Property&quot;);<br/> die &quot;empty&quot; unless @invlist;<br/> my @full_list;<br/> for (my $i = 0; $i &lt; @invlist; $i += 2) {<br/> my $upper = ($i + 1) &lt; @invlist<br/> ? $invlist[$i+1] - 1 # In range<br/> : $Unicode::UCD::MAX_CP; # To infinity. You may want<br/> # to stop earlier<br/> for my $j ($invlist[$i] .. $upper) {<br/> print $upper, &quot;: &quot;, $j, &quot;\n&quot;;<br/> push @full_list, $j;<br/> }<br/> }<br/><br/>=cut<br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3329.html Wed, 06 Jul 2011 13:43:18 +0000 Re: Need: list of Unicode characters that have canonical decompositions. by Karl Williamson On 07/01/2011 10:40 AM, BobH wrote:<br/>&gt; Karl Williamson wrote:<br/>&gt;&gt;&gt;<br/>&gt;&gt;<br/>&gt;&gt; I&#39;m trying to think of a good name. 
Best so far is<br/>&gt;&gt; UCD::get_prop_invlist()<br/>&gt;<br/>&gt;<br/>&gt; Hm, &quot;get&quot; normally isn&#39;t needed.<br/>&gt;<br/>&gt; How about something simpler such as UCD::charlist()<br/>&gt;<br/>&gt; Bob<br/>&gt;<br/><br/>I think not having prop in the name is potentially misleading, and it <br/>actually isn&#39;t a list of the chars. It&#39;s an inversion list that is <br/>readily convertible into such a list.<br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3328.html Fri, 01 Jul 2011 10:50:24 +0000 Re: Need: list of Unicode characters that have canonical decompositions. by BobH Karl Williamson wrote:<br/>&gt;&gt;<br/>&gt;<br/>&gt; I&#39;m trying to think of a good name. Best so far is<br/>&gt; UCD::get_prop_invlist()<br/><br/><br/>Hm, &quot;get&quot; normally isn&#39;t needed.<br/><br/>How about something simpler such as UCD::charlist()<br/><br/>Bob<br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3327.html Fri, 01 Jul 2011 09:40:34 +0000 Re: Need: list of Unicode characters that have canonical decompositions. by Karl Williamson On 06/29/2011 09:06 AM, BobH wrote:<br/>&gt; Karl Williamson wrote:<br/>&gt;<br/>&gt;&gt; If I did this, I would be tempted to have it return an inversion<br/>&gt;&gt; list, instead of an array of every code point that matches the<br/>&gt;&gt; property. ...<br/>&gt;&gt;<br/>&gt;&gt; My question to you is would that be acceptable to you, do you think?<br/>&gt;&gt; I hate to return an enormous array by default when the application<br/>&gt;&gt; doesn&#39;t really need it.<br/>&gt;<br/>&gt; Yes, that kind of representation would be sufficient and reasonably<br/>&gt; compact.<br/>&gt;<br/>&gt; Thanks.<br/>&gt;<br/>&gt; Bob<br/>&gt;<br/><br/>I&#39;m trying to think of a good name. Best so far is UCD::get_prop_invlist()<br/><br/>Any ideas?<br/> http://www.nntp.perl.org/group/perl.unicode/2011/07/msg3326.html Fri, 01 Jul 2011 08:38:22 +0000 Re: Need: list of Unicode characters that have canonical decompositions. 
by BobH Karl Williamson wrote:<br/><br/>&gt; If I did this, I would be tempted to have it return an inversion<br/>&gt; list, instead of an array of every code point that matches the<br/>&gt; property. ...<br/>&gt;<br/>&gt; My question to you is would that be acceptable to you, do you think?<br/>&gt; I hate to return an enormous array by default when the application<br/>&gt; doesn&#39;t really need it.<br/><br/>Yes, that kind of representation would be sufficient and reasonably compact.<br/><br/>Thanks.<br/><br/>Bob<br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3325.html Wed, 29 Jun 2011 08:06:21 +0000 Re: Need: list of Unicode characters that have canonical decompositions. by Karl Williamson On 06/27/2011 08:04 PM, BobH wrote:<br/>&gt; Karl Williamson wrote:<br/>&gt;<br/>&gt; &gt; I&#39;m presuming you need this not for a one-time only thing, but to be<br/>&gt; &gt; able to run this program over and over.<br/>&gt;<br/>&gt; Yes -- this is for a module that will be usable in a number of<br/>&gt; situations. See<br/>&gt; http://search.cpan.org/~bhallissy/Text-Unicode-Equivalents-0.05/.<br/>&gt;<br/>&gt; The current implementation cheats by accessing unicore/Decomposition.pl<br/>&gt; exactly the same way Unicode::UCD does.<br/>&gt;<br/>&gt; &gt; You can always download UnicodeData.txt from the Unicode web site.<br/>&gt;<br/>&gt; Yes I can -- and certainly have done for my personal use. But including<br/>&gt; that file (or some derivative) in a general purpose module would mean<br/>&gt; that it wouldn&#39;t necessarily have the same Unicode version as the Perl<br/>&gt; installation into which my module might be installed. 
And besides, the<br/>&gt; information I need is already in the Perl core -- though supposedly not<br/>&gt; usable.<br/>&gt;<br/>&gt; &gt; In a regular expression,<br/>&gt; &gt; \p{Dt= can} (Decomposition_Type=Canonical) will match all characters<br/>&gt; &gt; that you want.<br/>&gt;<br/>&gt; Yes, I understand that I can test a character to see if it has a<br/>&gt; particular decomposition, but I&#39;m not sure I understand how to use a<br/>&gt; regex to generate a complete list of characters with decompositions.<br/>&gt;<br/>&gt; &gt; I&#39;m thinking that 5.16 will have the stringification<br/>&gt; &gt; of that regex include the list you want, but not in 5.14, and<br/>&gt; &gt; stringification is not necessarily fixed either.<br/>&gt; &gt;<br/>&gt; &gt; I could easily write a new function for UCD that returns a list of<br/>&gt; &gt; all code points that have a given property.<br/>&gt;<br/>&gt; That is an interesting offer, and I think this should be given serious<br/>&gt; consideration. I&#39;m sure my little module isn&#39;t the only one that, as we<br/>&gt; go into the future, would benefit from such a function.<br/>&gt;<br/>&gt; Thanks for your reply, Karl.<br/>&gt;<br/>&gt; Bob<br/>&gt;<br/><br/>If I did this, I would be tempted to have it return an inversion list, <br/>instead of an array of every code point that matches the property. Such <br/>an array could be potentially length 1,114,112. The largest possible <br/>inversion list is potentially half that, but the largest one that <br/>matches a Unicode property is around length 700, and yours would be <br/>somewhat over 200 entries. That is why inversion lists are often used <br/>for Unicode because they compactly represent the Unicode properties.<br/><br/>An inversion list is an array. 
An example is:<br/>5, 101, 116, 120, ...<br/><br/>This represents 5..100, 116..119 ...<br/><br/>The 0th element gives the first code point that is in the property; the <br/>next element gives the first code point after that one that&#39;s not in the <br/>property, and so forth. Each succeeding element marks the beginning of <br/>a range that is/isn&#39;t in the property, inverting the is/isnt each time.<br/><br/>It is a simple matter to convert an inversion list into a true array or <br/>hash of every code point that matches.<br/><br/>My question to you is would that be acceptable to you, do you think? I <br/>hate to return an enormous array by default when the application doesn&#39;t <br/>really need it.<br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3324.html Tue, 28 Jun 2011 10:31:51 +0000 Re: Need: list of Unicode characters that have canonical decompositions. by BobH Karl Williamson wrote:<br/><br/> &gt; I&#39;m presuming you need this not for a one-time only thing, but to be<br/> &gt; able to run this program over and over.<br/><br/>Yes -- this is for a module that will be usable in a number of <br/>situations. See <br/>http://search.cpan.org/~bhallissy/Text-Unicode-Equivalents-0.05/.<br/><br/>The current implementation cheats by accessing unicore/Decomposition.pl <br/>exactly the same way Unicode::UCD does.<br/><br/> &gt; You can always download UnicodeData.txt from the Unicode web site.<br/><br/>Yes I can -- and certainly have done for my personal use. But including <br/>that file (or some derivative) in a general purpose module would mean <br/>that it wouldn&#39;t necessarily have the same Unicode version as the Perl <br/>installation into which my module might be installed. 
And besides, the <br/>information I need is already in the Perl core -- though supposedly not <br/>usable.<br/><br/> &gt; In a regular expression,<br/> &gt; \p{Dt= can} (Decomposition_Type=Canonical) will match all characters<br/> &gt; that you want.<br/><br/>Yes, I understand that I can test a character to see if it has a <br/>particular decomposition, but I&#39;m not sure I understand how to use a <br/>regex to generate a complete list of characters with decompositions.<br/><br/> &gt; I&#39;m thinking that 5.16 will have the stringification<br/> &gt; of that regex include the list you want, but not in 5.14, and<br/> &gt; stringification is not necessarily fixed either.<br/> &gt;<br/> &gt; I could easily write a new function for UCD that returns a list of<br/> &gt; all code points that have a given property.<br/><br/>That is an interesting offer, and I think this should be given serious <br/>consideration. I&#39;m sure my little module isn&#39;t the only one that, as we <br/>go into the future, would benefit from such a function.<br/><br/>Thanks for your reply, Karl.<br/><br/>Bob<br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3323.html Mon, 27 Jun 2011 19:04:19 +0000 Re: Need: list of Unicode characters that have canonical decompositions. by Karl Williamson On 06/27/2011 08:26 AM, BobH wrote:<br/>&gt; A project I&#39;m working on needs to build a list of all Unicode characters<br/>&gt; that have canonical decompositions. The most efficient ways I can think<br/>&gt; of to get such a list are from unicore/Decomposition.pl or by scanning<br/>&gt; unicore/UnicodeData.txt. However:<br/>&gt;<br/>&gt; Re unicore/Decomposition.pl, the header of this says:<br/>&gt;<br/>&gt;&gt; # !!!!!!! INTERNAL PERL USE ONLY !!!!!!!<br/>&gt;&gt; # This file is for internal use by the Perl program only. 
The format<br/>&gt;&gt; and even<br/>&gt;&gt; # the name or existence of this file are subject to change without<br/>&gt;&gt; notice.<br/>&gt;&gt; # Don&#39;t use it directly.<br/>&gt;<br/>&gt; Re unicore/UnicodeData.txt, I&#39;ve recently posted a version of my module<br/>&gt; that uses unicore/UnicodeData.txt to CPAN, and from Perl 5.14 testers<br/>&gt; I&#39;ve received only failure notices which indicate that the file cannot<br/>&gt; be found :-(<br/>&gt;<br/>&gt; Unicode::UCD can tell me if a specific character has a decomposition,<br/>&gt; but can&#39;t give me a list of characters that have decompositions.<br/>&gt;<br/>&gt; Any suggestions would be appreciated.<br/>&gt;<br/>&gt; Bob<br/>&gt;<br/><br/>I&#39;m presuming you need this not for a one-time only thing, but to be <br/>able to run this program over and over. You can always download <br/>UnicodeData.txt from the Unicode web site. In a regular expression, <br/>\p{Dt= can} (Decomposition_Type=Canonical) will match all characters <br/>that you want. I&#39;m thinking that 5.16 will have the stringification of <br/>that regex include the list you want, but not in 5.14, and <br/>stringification is not necessarily fixed either.<br/><br/>I could easily write a new function for UCD that returns a list of all <br/>code points that have a given property.<br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3322.html Mon, 27 Jun 2011 13:02:09 +0000 Re: Need: list of Unicode characters that have canonical decompositions. 
by BobH BobH wrote:<br/><br/>&gt; Re unicore/UnicodeData.txt, I&#39;ve recently posted a version of my module<br/>&gt; that uses unicore/UnicodeData.txt to CPAN, and from Perl 5.14 testers<br/>&gt; I&#39;ve received only failure notices which indicate that the file cannot<br/>&gt; be found :-(<br/>&gt;<br/><br/>Just installed ActivePerl 5.14 and, indeed, this file no longer exists <br/>-- guess that forces me to use unicore/Decomposition.pl in spite of its <br/>included warning.<br/><br/>Bob<br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3321.html Mon, 27 Jun 2011 08:10:47 +0000 Need: list of Unicode characters that have canonical decompositions. by BobH A project I&#39;m working on needs to build a list of all Unicode characters <br/>that have canonical decompositions. The most efficient ways I can think <br/>of to get such a list are from unicore/Decomposition.pl or by scanning <br/>unicore/UnicodeData.txt. However:<br/><br/>Re unicore/Decomposition.pl, the header of this says:<br/><br/>&gt; # !!!!!!! INTERNAL PERL USE ONLY !!!!!!!<br/>&gt; # This file is for internal use by the Perl program only. 
The format and even<br/>&gt; # the name or existence of this file are subject to change without notice.<br/>&gt; # Don&#39;t use it directly.<br/><br/>Re unicore/UnicodeData.txt, I&#39;ve recently posted a version of my module <br/>that uses unicore/UnicodeData.txt to CPAN, and from Perl 5.14 testers <br/>I&#39;ve received only failure notices which indicate that the file cannot <br/>be found :-(<br/><br/>Unicode::UCD can tell me if a specific character has a decomposition, <br/>but can&#39;t give me a list of characters that have decompositions.<br/><br/>Any suggestions would be appreciated.<br/><br/>Bob<br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3320.html Mon, 27 Jun 2011 07:27:06 +0000 Enumerating all canonically equivalent strings by BobH Does there exist a standard module or function that, given a Combining <br/>Character Sequence (or, more generally, an arbitrary Unicode text <br/>string), will generate a list of all canonically equivalent strings?<br/><br/>For example, if given the character U+1EAD, I&#39;d like to get back a list <br/>of all these canonically equivalent sequences:<br/><br/>0061 0302 0323<br/>0061 0323 0302<br/>00E2 0323<br/>1EA1 0302<br/>1EAD<br/><br/>(I don&#39;t particularly care whether the interface is in terms of arrays <br/>of USVs or utf strings.)<br/><br/>Some years ago I created such a module for my own use (I called it <br/>Unicode::MakeEquivalents), and am now wondering whether there exists a <br/>standard solution to this problem (so I can abandon my own stuff), or <br/>whether I should pursue adding this functionality to CPAN somewhere.<br/><br/>Suggestions?<br/><br/>Bob<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3319.html Mon, 20 Jun 2011 16:22:12 +0000 Given 5.014 features, is encoding::warnings still needed? 
by Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 In &lt;http://stackoverflow.com/q/6281049#comment-7334585&gt;, tchrist asks:<br/>| Is `encoding::warnings` actually still needed given the `/dual` modifiers<br/>| and the `unicode_strings` feature?<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/06/msg3318.html Wed, 08 Jun 2011 12:57:46 +0000 Re: Unicode::Collate string replacements and case sensitivity by SADAHIRO Tomoyuki <br/>On Thu, 28 Apr 2011 10:06:58 -0700 (PDT)<br/>Frank M&uuml;ller &lt;pottwal1@freenet.de&gt; wrote:<br/><br/>&gt; dear all,<br/>&gt; I&#39;m trying to do some string replacements with Unicode::Collate which<br/>&gt; usually work very well, but these replacements seem to be case<br/>&gt; insensitive by default - how can I change this? look at this simple<br/>&gt; example:<br/>&gt; <br/>&gt; my $myCollator = Unicode::Collate-&gt;new( normalization =&gt; undef, level<br/>&gt; =&gt; 1 );<br/>&gt; my $str = &quot;Camel camel donkey zebra came\x{301}l CAMEL horse<br/>&gt; cAmEL...&quot;;<br/>&gt; $myCollator-&gt;gsubst($str, &quot;camel&quot;, sub { &quot;#$_[0]#&quot; });<br/>&gt; <br/>&gt; which makes the following replacements:<br/>&gt; <br/>&gt; #Camel# #camel# donkey zebra #cam&eacute;l# #CAMEL# horse #cAmEL#...<br/>&gt; <br/>&gt; what I would love to see is the following result:<br/>&gt; <br/>&gt; Camel #camel# donkey zebra #cam&eacute;l# CAMEL horse cAmEL...<br/>&gt; <br/>&gt; As there doesn&#39;t seem to be gsubst for case sensitive and gisubst for<br/>&gt; case insensitive string replacements, what would a solution look like?<br/>&gt; <br/>&gt; Thanks a lot for any suggestions,<br/>&gt; Frank<br/><br/>As (level =&gt; 1) is not default, (level =&gt; 3) is also allowed for case<br/>sensitive matching. 
But UCA thinks accent difference (level 2) is <br/>more important than case difference (level 3), so cam&eacute;l won&#39;t<br/>match camel when (level =&gt; 3).<br/><br/>level 1: camel matches cam&eacute;l and Camel.<br/>level 2: camel matches Camel but not cam&eacute;l.<br/>level 3: camel matches neither Camel nor cam&eacute;l.<br/>--Even at level 3, it isn&#39;t so strict:<br/> camel matches &quot;c-a-m-e-l&quot;, &quot;ca mel&quot;, etc.<br/> since punctuation difference is level 4.<br/><br/>To make camel match cam&eacute;l but not Camel, another workaround is<br/>needed. In the next release, a new parameter (ignore_level2) will allow it.<br/>(However the behavior of ignore_level2 is quite different from<br/> so-called caseLevel in UCA etc.)<br/><br/>Regards,<br/>SADAHIRO Tomoyuki<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/05/msg3317.html Thu, 05 May 2011 06:07:02 +0000 Unicode::Collate string replacements and case sensitivity by Frank Müller dear all,<br/>I&#39;m trying to do some string replacements with Unicode::Collate which<br/>usually work very well, but these replacements seem to be case<br/>insensitive by default - how can I change this? 
look at this simple<br/>example:<br/><br/>my $myCollator = Unicode::Collate-&gt;new( normalization =&gt; undef, level<br/>=&gt; 1 );<br/>my $str = &quot;Camel camel donkey zebra came\x{301}l CAMEL horse<br/>cAmEL...&quot;;<br/>$myCollator-&gt;gsubst($str, &quot;camel&quot;, sub { &quot;#$_[0]#&quot; });<br/><br/>which makes the following replacements:<br/><br/>#Camel# #camel# donkey zebra #cam&eacute;l# #CAMEL# horse #cAmEL#...<br/><br/>what I would love to see is the following result:<br/><br/>Camel #camel# donkey zebra #cam&eacute;l# CAMEL horse cAmEL...<br/><br/>As there doesn&#39;t seem to be gsubst for case sensitive and gisubst for<br/>case insensitive string replacements, what would a solution look like?<br/><br/>Thanks a lot for any suggestions,<br/>Frank<br/><br/> http://www.nntp.perl.org/group/perl.unicode/2011/04/msg3316.html Fri, 29 Apr 2011 02:21:42 +0000 Re: encoding(UTF16-LE) on Windows by Erland Sommarskog Michael Ludwig (milu71@gmx.de) writes:<br/>&gt;&gt; For instance, I use Windows exclusively, so Unicode in file names is<br/>&gt;&gt; no problem.<br/>&gt; <br/>&gt; Did a quick test:<br/>&gt; <br/>&gt; (v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)<br/>&gt; <br/>&gt; * a&acirc;&euro;&brvbar;b.txt<br/>&gt; * not correct<br/>&gt; * doesn&#39;t have anything with &quot;uni&quot; or &quot;utf&quot; in &quot;perl -V&quot;<br/> <br/>OK, so the implementation would have to know that on this platform <br/>filenames are in UTF-16, on this it is UTF-8 and so on.<br/><br/>Not that it is a terribly big deal. 
In the program where I want to <br/>support Unicode names, I&#39;ve already written a module around Win32API::File,<br/>which permits opening a file in Windows and then associating it with <br/>a file handle.<br/><br/><br/>-- <br/>Erland Sommarskog, Stockholm, esquel@sommarskog.se<br/> http://www.nntp.perl.org/group/perl.unicode/2011/02/msg3315.html Wed, 02 Feb 2011 01:06:21 +0000 Re: encoding(UTF16-LE) on Windows by Michael Ludwig Erland Sommarskog wrote on 31.01.2011 at 23:42 (+0100):<br/>&gt; Michael Ludwig (milu71@gmx.de) writes:<br/>&gt; &gt; Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):<br/>&gt; &gt; <br/>&gt; &gt;&gt; Yes, there certainly seems to be some more stuff to do in the<br/>&gt; &gt;&gt; Unicode support in Perl. For instance, support for Unicode<br/>&gt; &gt;&gt; filenames in open or opendir.<br/>&gt; &gt; <br/>&gt; &gt; I think there is no portable answer here, as it depends on the<br/>&gt; &gt; filesystem&#39;s support for Unicode.<br/>&gt; <br/>&gt; Did I say it has to be portable? :-)<br/><br/>No &hellip; but Perl did. 
:-)<br/><br/>&gt; For instance, I use Windows exclusively, so Unicode in file names is<br/>&gt; no problem.<br/><br/>Did a quick test:<br/><br/> \,,,/<br/> (o o)<br/>------oOOo-(_)-oOOo------<br/>use strict;<br/>use warnings;<br/>use utf8;<br/>my $fn = &#39;a&hellip;b.txt&#39;; # with a Unicode character<br/>open my $fh, &#39;&gt;:encoding(UTF-8)&#39;, $fn or die &quot;open $fn: $!&quot;;<br/>print $fh &quot;$fn\n&quot;;<br/>close $fh;<br/>-------------------------<br/><br/>v5.10.1 (*) built for i686-cygwin-thread-multi-64int<br/><br/>* a&hellip;b.txt<br/>* correct (in Explorer, cmd.exe, MinTTY)<br/>* has: CYG17 utf8-paths (which might be responsible)<br/><br/>(v5.12.1) built for MSWin32-x86-multi-thread (so ActiveState)<br/><br/>* a&acirc;&euro;&brvbar;b.txt<br/>* not correct<br/>* doesn&#39;t have anything with &quot;uni&quot; or &quot;utf&quot; in &quot;perl -V&quot;<br/><br/>-- <br/>Michael Ludwig<br/> http://www.nntp.perl.org/group/perl.unicode/2011/01/msg3314.html Mon, 31 Jan 2011 17:32:59 +0000 Re: encoding(UTF16-LE) on Windows by Erland Sommarskog Michael Ludwig (milu71@gmx.de) writes:<br/>&gt; Erland Sommarskog wrote on 29.01.2011 at 14:02 (+0100):<br/>&gt; <br/>&gt;&gt; Yes, there certainly seems to be some more stuff to do in the Unicode<br/>&gt;&gt; support in Perl. For instance, support for Unicode filenames in open<br/>&gt;&gt; or opendir.<br/>&gt; <br/>&gt; I think there is no portable answer here, as it depends on the<br/>&gt; filesystem&#39;s support for Unicode.<br/> <br/>Did I say it has to be portable? :-)<br/><br/>Obviously, Unicode cannot happen on systems which do not support Unicode.<br/><br/>For instance, I use Windows exclusively, so Unicode in file names is no <br/>problem. On the other hand, it&#39;s a dead case for system() and backticks <br/>as far as I can make out. 
(That is, I have not been able to run Unicode <br/>BAT files.)<br/><br/><br/><br/>-- <br/>Erland Sommarskog, Stockholm, esquel@sommarskog.se<br/> http://www.nntp.perl.org/group/perl.unicode/2011/01/msg3313.html Mon, 31 Jan 2011 16:22:29 +0000
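Illustration for the "Need: list of Unicode characters that have canonical decompositions" thread above; it is not part of any archived message. Karl Williamson describes an inversion list as an array in which the 0th element is the first code point in the property, the next element is the first code point after that which is not in it, and so on, with 5, 101, 116, 120 standing for 5..100 and 116..119. A minimal sketch of that decoding in plain Perl follows. The helper names invlist_to_ranges and in_invlist are hypothetical, chosen only for this example, and the treatment of an odd-length list (the last range runs up to a caller-supplied maximum, defaulting to 0x10FFFF) is an assumption.

```perl
use strict;
use warnings;

# Expand an inversion list into [first, last] pairs of inclusive ranges.
# Even-indexed elements start a range that IS in the set; the element
# after each one is the first code point that is NOT in the set.
# (Hypothetical helper, written for illustration only.)
sub invlist_to_ranges {
    my ($invlist, $max) = @_;
    $max = 0x10FFFF unless defined $max;    # assumed upper bound
    my @ranges;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        # If there is no closing boundary, the range runs to $max.
        my $last = $i + 1 < @$invlist ? $invlist->[$i + 1] - 1 : $max;
        push @ranges, [ $invlist->[$i], $last ];
    }
    return @ranges;
}

# Membership test: count the boundaries that are <= the code point.
# An odd count means the code point lies inside the set.
# (Linear scan for clarity; a binary search would also work.)
sub in_invlist {
    my ($invlist, $cp) = @_;
    my $count = grep { $_ <= $cp } @$invlist;
    return $count % 2;
}

# Karl's example from the thread.
my @example = (5, 101, 116, 120);
print join(', ', map { "$_->[0]..$_->[1]" } invlist_to_ranges(\@example)), "\n";
# prints: 5..100, 116..119
```

Working from the boundaries rather than from an expanded array keeps the data small, which is the motivation given in the thread. The function under discussion appears to have shipped as prop_invlist in the Unicode::UCD bundled with Perl 5.16; the sketch above does not depend on it.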