
Re: "ICU - International Components for Unicode"

From: Matthew Stuckwisch
Date: September 30, 2020 04:19
Subject: Re: "ICU - International Components for Unicode"
Message ID: B282E183-8A99-41BF-AA8D-6788AB53B8C0@softastur.org

In #raku it was mentioned that it would be nice to have a $*UNICODE variable of sorts that reports back the supported Unicode version, but I'm not sure how feasible that would be from an implementation POV.
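
For illustration only, here is one way such a value could be derived: pull the version out of the header comment of a UCD-derived source file, much like the grep quoted further down. This is a minimal sketch, not how Rakudo or MoarVM actually does anything; the function name is hypothetical, the header format is assumed from that grep output, and Python is used only to keep the example self-contained.

```python
import re
from typing import Optional, Tuple

# Pattern for the versioned URL seen in the quoted grep output, e.g.
# http://www.unicode.org/versions/Unicode12.1.0/
# (Assumption: the header always carries a URL of this shape.)
VERSION_URL = re.compile(r"unicode\.org/versions/Unicode(\d+(?:\.\d+)*)")


def unicode_version_from_header(text: str) -> Optional[Tuple[int, ...]]:
    """Return the first Unicode version found in `text` as a tuple of ints.

    Returns None when no versioned unicode.org URL is present.
    (Hypothetical helper, named for this example only.)
    """
    match = VERSION_URL.search(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# Sample line copied from the grep output quoted later in this message.
header = "# The UAXes can be accessed at http://www.unicode.org/versions/Unicode12.1.0/"
print(unicode_version_from_header(header))
```

For the sample line this prints (12, 1, 0). A tuple of integers is used because it compares correctly against other versions, which a plain string would not ("12.1.0" sorts before "9.0.0" as a string).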

I'm also late to the discussion, so pardon me for jumping back a bit.  Basically, ICU is something that lets you quickly add robust Unicode support.  But it's also a Swiss Army knife: overkill for what Raku generally needs (at whichever level it's implemented), and limiting in some ways because you become beholden to its structures, which, as Samantha pointed out, don't fit MoarVM's approach.  Rolling your own has a lot of advantages.

Beyond the UCD and the UCA (sorting), everything else really should go into module land, since it's all heavily based on an ever-changing and growing CLDR; even then, good arguments can be made for putting sorting in module space too.  For reasons like performance, code clarity, and data size, companies have rolled their own ICU-like libraries (Google's Closure for JS, TwitterCLDR in Ruby, etc.) running on the same CLDR data.  In Raku (shameless self-plug), a lot is already available in the Intl namespace.  There are actually some very cool things that can be done by mixing CLDR and Raku, like creating new character-class-like tokens or even extending built-ins; they just don't have any business being near core, just... core-like :-)

Matéu


PS: For understanding some of Samantha's incredible work, her talks at the Amsterdam convention are really great, and Perl Weekly has an archive of her grant write-ups:
  Articles: https://perlweekly.com/a/samantha-mcvey.html
  High End Unicode in Perl 6: https://www.youtube.com/watch?v=Oj_lgf7A2LM
  Unicode Internals of Perl 6: https://www.youtube.com/watch?v=9Vv7nUUDdeA
  

> On Sep 29, 2020, at 3:14 PM, William Michels via perl6-users <perl6-users@perl.org> wrote:
> 
> Thank you, Samantha!
> 
> An outstanding question is one posed by Joseph Brenner--that
> is--knowing which version of the Unicode standard is supported by
> Raku. I grepped through two files, one called "unicode.c" and the
> other called "unicode_db.c". They're both located in rakudo at:
> /rakudo/rakudo-2020.06/nqp/MoarVM/src/strings/ .
> 
> Below are the first 4 lines of my grep results. As you can see
> (above/below), rakudo-2020.06 supports Unicode12.1.0:
> 
> ~$ raku -ne '.say if .grep(/unicode/)'
> ~/rakudo/rakudo-2020.06/nqp/MoarVM/src/strings/unicode_db.c
> # For terms of use, see http://www.unicode.org/terms_of_use.html
> # The UAXes can be accessed at http://www.unicode.org/versions/Unicode12.1.0/
> From http://unicode.org/copyright.html#Exhibit1 on 2017-11-28:
> Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
> <TRUNCATED>
> 
> It would be really interesting to follow your Unicode work, Samantha.
> The ideas you propose are interesting and everyone hopes for speed
> improvements. Is there any place Raku-uns can go to read
> updates--maybe a grant report, blog, or Github issue? Or maybe right
> here, on the Perl6-Users mailing list? Thanks in advance.
> 
> Best, Bill.
> 
> W. Michels, Ph.D.
> 
> 
> 
> On Sun, Sep 27, 2020 at 4:03 AM Samantha McVey <samantham@posteo.net> wrote:
>> 
>> So MoarVM uses its own database of the UCD. One nice thing is this can
>> probably be faster than calling to the ICU to look up information of each
>> codepoint in a long string. Secondly it implements its own text data
>> structures, so the nice features of the UCD to do that would be difficult to
>> use.
>> 
>> In my opinion, it could make sense to use ICU for things like localized
>> collation (sorting). It also could make sense to use ICU for unicode
>> properties lookup for properties that don't have to do with grapheme
>> segmentation or casing. This would be a lot of work but if something like this
>> were implemented it would probably happen in the context of a larger
>> rethinking of how we use unicode. Though everything is complicated by the fact
>> that we support lots of complicated regular expressions on different unicode
>> properties. I guess first I'd start by benchmarking the speed of ICU and
>> comparing to the current implementation.
>> 
>> 
