Re: "ICU - International Components for Unicode"

From: William Michels via perl6-users
Date: September 29, 2020 19:14
Subject: Re: "ICU - International Components for Unicode"
Message ID: CAA99HCwkf0Gc3feA3X8dh62XG+Bpd2aXdEmPzns9Dg5xejMMfQ@mail.gmail.com

Thank you, Samantha!

One outstanding question, posed by Joseph Brenner, is which version
of the Unicode standard Raku supports. I grepped through two files,
"unicode.c" and "unicode_db.c". They're both located in rakudo at:
/rakudo/rakudo-2020.06/nqp/MoarVM/src/strings/ .

Below are the first 4 lines of my grep results for "unicode_db.c". As
the second line shows, rakudo-2020.06 supports Unicode 12.1.0:

~$ raku -ne '.say if .grep(/unicode/)' \
     ~/rakudo/rakudo-2020.06/nqp/MoarVM/src/strings/unicode_db.c
# For terms of use, see http://www.unicode.org/terms_of_use.html
# The UAXes can be accessed at http://www.unicode.org/versions/Unicode12.1.0/
From http://unicode.org/copyright.html#Exhibit1 on 2017-11-28:
Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
<TRUNCATED>

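For anyone repeating this, a narrower search can pull out just the
version number. This is only a sketch: unicode_version is a
hypothetical helper name, not something shipped with rakudo or MoarVM,
and it assumes the version appears in the file as a "UnicodeX.Y.Z"
string like the one in the header above. The demo runs on a scratch
file containing that header line, so it works without a rakudo
checkout:

```shell
# Hypothetical helper (not part of rakudo or MoarVM): print the first
# "Unicode<major>.<minor>..." version number that appears in a file.
unicode_version() {
    grep -oE 'Unicode[0-9]+(\.[0-9]+)*' "$1" | head -n 1 | sed 's/^Unicode//'
}

# Demo on a scratch file holding the header line quoted above.
sample=$(mktemp)
printf '%s\n' '# The UAXes can be accessed at http://www.unicode.org/versions/Unicode12.1.0/' > "$sample"
unicode_version "$sample"    # prints 12.1.0
rm -f "$sample"
```

On a real checkout, pass the path to unicode_db.c instead of the
scratch file.
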
It would be really interesting to follow your Unicode work, Samantha.
The ideas you propose are promising, and everyone hopes for speed
improvements. Is there any place Raku-uns can go to read
updates--maybe a grant report, blog, or GitHub issue? Or maybe right
here, on the Perl6-Users mailing list? Thanks in advance.

Best, Bill.

W. Michels, Ph.D.



On Sun, Sep 27, 2020 at 4:03 AM Samantha McVey <samantham@posteo.net> wrote:
>
> So MoarVM uses its own database of the UCD. One nice thing is this can
> probably be faster than calling to the ICU to look up information of each
> codepoint in a long string. Secondly it implements its own text data
> structures, so the nice features of the UCD to do that would be difficult to
> use.
>
> In my opinion, it could make sense to use ICU for things like localized
> collation (sorting). It also could make sense to use ICU for unicode
> properties lookup for properties that don't have to do with grapheme
> segmentation or casing. This would be a lot of work but if something like this
> were implemented it would probably happen in the context of a larger
> rethinking of how we use unicode. Though everything is complicated by that we
> support lots of complicated regular expressions on different unicode
> properties. I guess first I'd start by benchmarking the speed of ICU and
> comparing to the current implementation.
>
>
