perl.perl5.porters | Postings from October 2011

On dual-living Unicode::GCString in v5.16

From:
Tom Christiansen
Date:
October 31, 2011 14:00
Subject:
On dual-living Unicode::GCString in v5.16
Message ID:
28178.1320094769@chthon
======================================================================
 SUMMARY: the Unicode::GCString module should be dual-lived for v5.16
======================================================================

Only one part of Perl can cope with user-visible characters, meaning
graphemes.  Everything else can cope only with programmer-visible
characters, meaning code points.  Regarding this issue, Larry writes in
the upcoming 4th edition of Programming Perl:

    The only thing in the Perl core that knows about graphemes is C<\X> in
    a pattern.  Built-in functions like C<substr>, C<length>, C<index>,
    C<rindex>, and C<pos> access strings at the granularity of the
    codepoint, not of the grapheme.  So C<\X> is your hammer, and all of
    Unicode starts to look like nails.  A lot of nails.

And I illustrate that issue in the paragraph immediately following that one:

    Imagine reversing “crème brûlée” codepoint by codepoint.  Assuming
    normalization to NFD, you’d end up with “éel̂urb em̀erc” when you
    really want “eélûrb emèrc”.  Instead, use C<\X> to extract a list of
    graphemes, then reverse that.
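
The quoted advice can be checked with core modules alone.  What follows is a
minimal sketch rather than text from the book; the variable names are
illustrative, and it assumes a Perl that ships Unicode::Normalize.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD NFC);

binmode(STDOUT, ":encoding(UTF-8)");

my $nfd = NFD("crème brûlée");

# Code point reversal: each combining mark now precedes its old base
# letter, so it attaches to whichever letter ends up in front of it.
my $by_codepoint = reverse $nfd;

# Grapheme reversal: \X keeps every base letter together with its marks.
my $by_grapheme = join "", reverse $nfd =~ /\X/g;

print NFC($by_codepoint), "\n";
print NFC($by_grapheme),  "\n";   # prints "eélûrb emèrc"
```

The first line printed has the accents on the wrong letters; the second is
the reversal a reader would expect.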

But we don't have to try so hard, because there is a CPAN module that addresses
this "everything looks like a nail" issue.  It's called Unicode::GCString.
Almost all programs that do serious Unicode processing would benefit from having
Unicode::GCString available.

I came to the conclusion that Unicode::GCString should be in the core after I
found that in the new book we had referenced that CPAN module far more often
than any other CPAN module: fifteen separate times, to be precise. That's
because nailing grapheme processing in core Perl ranges from tricky to virtually
impossible without it.

I realize that we are trying to pare down the core, but in this instance I
believe a cogent and convincing argument can be made that inclusion of
Unicode::GCString in that core enhances it.  I think we should include
Unicode::GCString with the core v5.16 release.  Here's why.

First, here are some examples that show how much easier, and in some cases
possible at all, Unicode::GCString makes things.  For these examples, we
have a normal string, $str, and a grapheme cluster string, $gcs.

    $str = Unicode::Normalize::NFD("crème brûlée");
    $gcs = Unicode::GCString->new($str);

You may assume all input is in NFD and all output in NFC, but with
Unicode::GCString, it makes no difference, as it gives the same right answers
no matter the normalization or lack thereof.  I have intentionally chosen simple
examples, as I feel these things should be simply done.  If you are not
convinced by the simple answers, consider that graphemes can also be
like this:

    % perl -CS -Mutf8 -E 'say for "cherry coke", "crème brûlée"' | perl gcsdemo
    I saw that the mirror read, “.esaelp ,e̲k̲o̲c̲ y̲r̲r̲E̲h̲c̲ eht ekil d’I”, and wondered what it could mean.
    I saw that the mirror read, “.esaelp ,e̲É̲l̲û̲r̲b̲ e̲m̲È̲r̲c̲ eht ekil d’I”, and wondered what it could mean.

as in the demo included at the end.  Those never reduce to single code points no matter
the normalization, nor do many others, some of which I use in the printf example.

================================================================================

> Examples of why we need Unicode::GCString for all serious Unicode processing

=== Determine length in graphemes

    Wrong answer:       length($str)
    Hard  answer:       for ($count = 0; $str =~ /\X/g; $count++) {}
    Easy  answer:       $gcs->length

=== Extract first five graphemes

    Wrong answer:       substr($str, 0, 5)
    Hard  answer:       $str =~ /\A(\X{5})/ && $1
    Easy  answer:       $gcs->substr(0, 5)

=== Replace last six graphemes (i.e., "brûlée") with "fraîche"

    Wrong answer:       substr($str, -6, 6, "fraîche")
    Hard  answer:       $str =~ s/\X{6}\z/fraîche/
    Easy  answer:       $gcs->substr(-6, 6, "fraîche")

=== Insert " bien" after the first five graphemes

    Wrong answer:       substr($str, 5, 0, " bien")
    Hard  answer:       $str =~ s/\A\X{5}\K/ bien/
    Easy  answer:       $gcs->substr(5, 0, " bien")

=== Print string reversed by grapheme

    Wrong answer:       print scalar reverse $str
    Hard  answer:       print reverse $str =~ /\X/g
    Easy  answer:       print reverse @$gcs

=== In-place reverse by grapheme

    Wrong answer:       $str = reverse $str
    Hard  answer:       $str = join( q() => reverse $str =~ /\X/g )
    Easy  answer:       $str = join( q() => reverse @$gcs )
    Easy? answer:       $gcs = reduce { $a . $b } reverse @$gcs
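
The "hard" answers above run as they stand with nothing but core modules.
The following is a minimal sketch under that assumption; it works on copies
($replaced and $inserted are names chosen here for illustration) so that each
operation starts from the same NFD string.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD NFC);

binmode(STDOUT, ":encoding(UTF-8)");

my $str = NFD("crème brûlée");

# Length: 15 code points in NFD, but only 12 graphemes.
my $codepoints = length $str;
my $graphemes  = () = $str =~ /\X/g;

# First five graphemes; substr($str, 0, 5) would stop one letter short.
my ($first5) = $str =~ /\A(\X{5})/;

# Replace the last six graphemes, working on a copy.
(my $replaced = $str) =~ s/\X{6}\z/fraîche/;

# Insert after the first five graphemes, working on a copy (\K needs v5.10).
(my $inserted = $str) =~ s/\A\X{5}\K/ bien/;

printf "%d code points, %d graphemes\n", $codepoints, $graphemes;
print NFC($_), "\n" for $first5, $replaced, $inserted;
```

Every one of these needs a regex where the built-in function would have been
the natural choice, which is the gap the "easy" answers close.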

That was the reasonably easy stuff.   If you're already convinced that
almost any program doing serious Unicode processing would benefit from
having Unicode::GCString available, you can stop reading now.

If you are not yet convinced, we can move on to formatting, but read
at your own peril.

==========================================================================

Perl can't do any of these three:

    String format:      printf("%-25s|", $str)
    Binary format:      pack("A25", $str)
    Picture format:     format
                        @<<<<<<<<<<<<<<<<<<<<<<<<
                        $str
                        .

Let alone

    Picture format:     format
                        ^<<<<<<<<<<<<<<<<<<<<<<<<
                        $str
                        ~~  ^<<<<<<<<<<<<<<<<<<<<
                        $str
                        .

I'll only demo the first of those, the printf equivalent, because that is the
one that comes up most often.

=== Print a column that's 25 wide

    Wrong answer:       printf "%-25s|", $str
    Right answer:       TOO HARD TO DO RIGHT
    GCS answer:         printf "%s%*s|\n", $gcs, (25 - $gcs->columns), "";

That last one may seem too complicated, and it is.  It's easier to write
a function that pads to the specified number of columns.  That leads to
code that looks like this:

    printf("%s £%.2f\n", pad($item, 25), $price);

    sub pad {
        my($s, $width) = @_;
        my $gs = Unicode::GCString->new($s);
        return $gs . (" " x ($width - $gs->columns));
    }

Yes, that is still too complicated.  However, it's a lot better than the wrong
answer.  To show you just how bad Perl's builtin formatting functions really
are, consider what happens with the wrong answer versus the right one using both
the string we had been using, and another with different characteristics.

First the wrong way:

    RULER: 12345678901234567890123456789012345678901234567890
                                    ˅ pipe should have been there
    WRONG: crème brûlée          |
    WRONG: Πηληϊάδεω Ἀχιλῆος    |
    WRONG: mojibake is 「文字化け」       |
    WRONG: Hebrew is ‪אָלֶף־בֵּית עִבְרִי‬|
                                    ˄ pipe should have been there
    RULER: 12345678901234567890123456789012345678901234567890

Now the right way:

    RULER: 12345678901234567890123456789012345678901234567890
                                    ˅ pipe should have been there
    RIGHT: crème brûlée             |
    RIGHT: Πηληϊάδεω Ἀχιλῆος        |
    RIGHT: mojibake is 「文字化け」 |
    RIGHT: Hebrew is ‪אָלֶף־בֵּית עִבְרִי‬   |
                                    ˄ pipe should have been there
    RULER: 12345678901234567890123456789012345678901234567890

It is honestly *too hard* to figure out how to do this right using core Perl alone.

Consider that you have to account for code points that occupy anywhere between 0
and 2 columns, not even accounting for tabs.  It is not even as simple as just
looking for Control or Format or Marks for no columns (some marks are spacing
and hence do occupy a column), or EA=Full or EA=Wide code points for two
columns, because EA=Ambiguous code points take on the width of the context in
which they are used.  And you don't want to know the effects of the last one.

That's because Perl's formatting functions (for string, binary, and picture
formats) are still in the "printable ASCII" mindset.
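
To make the column rules described above concrete, here is a rough core-only
sketch.  Both naive_columns and naive_pad are hypothetical helpers invented
for illustration; they are not part of Unicode::GCString and are not a
substitute for its columns method.  The sketch assumes a Perl recent enough
to know the East_Asian_Width property.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: estimate print columns one code point at a time.
# An approximation only, not an implementation of UAX #11.
sub naive_columns {
    my ($s) = @_;
    my $cols = 0;
    for my $ch (split //, $s) {
        # Nonspacing and enclosing marks, format and control characters
        # are counted as zero columns.
        next if $ch =~ /[\p{Mn}\p{Me}\p{Cf}\p{Cc}]/;
        # East Asian Wide or Fullwidth counts as two; everything else as one.
        $cols += $ch =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/ ? 2 : 1;
    }
    return $cols;
}

# Hypothetical helper: left-justify $s in a field of $width columns.
sub naive_pad {
    my ($s, $width) = @_;
    my $gap = $width - naive_columns($s);
    $gap = 0 if $gap < 0;
    return $s . (" " x $gap);
}

binmode(STDOUT, ":encoding(UTF-8)");
print naive_pad($_, 25), "|\n" for "crème brûlée", "mojibake is 「文字化け」";
```

Even this much gets the two sample strings to line up, but it treats every
EA=Ambiguous code point as narrow and counts a tab as zero columns, so it
fails on exactly the cases that make the problem hard.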

Unicode::GCString is actually subsidiary to Unicode::LineBreak, and its API
includes mechanisms for handling *real* line breaking, something infinitely more
sophisticated than $: can possibly be.

I think both should be dual-lived for 5.16, because handling graphemes is just
too hard in Perl without them.  Indeed, it "should" be easier than they make
things.  For example, the three formatting functions *should* really understand
print columns not code point count.  Dual-living these will allow people who
need to do so to craft their own robust solutions.  It is no more difficult
than the way we currently dual-life Unicode::Normalize and Unicode::Collate,
plus also now Unicode::Collate::Locale.  All are of the highest quality, and
implement key aspects of the Unicode Standard without which you cannot do any
serious work in Unicode.

Because Unicode is one of Perl's core competence areas, things essential
for correct processing of Unicode text should be included in the core.  Yes,
we may not be able to fix printf/pack/format, but we can give people the means
to do so themselves.  That's why I believe that to these three:

    Module                      Implements
    ------------------------    -------------------------------------------------
    Unicode::Normalize          UAX #15: Unicode Normalization Forms
    Unicode::Collate            UTS #10: Unicode Collation Algorithm
    Unicode::Collate::Locale    UTS #35: Unicode Locale Data Markup Language (LDML)  [etc]

we should for v5.16 add these two:

    Module                      Implements
    ------------------------    -------------------------------------------------
    Unicode::GCString           UAX #29: Unicode Text Segmentation, Revision 15-17
    Unicode::LineBreak          UAX #11: East Asian Width, Revision 17-21
                                UAX #14: Unicode Line Breaking Algorithm, Revision 22-26

==========================================================================

Included below is another GCS demo.  The point is to take a string as a
dessert selection, then underline and reverse it, while capitalizing
any *interior* e's in the underlined dessert:

    % perl -CS -Mutf8 -E 'say for "cherry coke", "crème brûlée"' | perl gcsdemo
    I saw that the mirror read, “.esaelp ,e̲k̲o̲c̲ y̲r̲r̲E̲h̲c̲ eht ekil d’I”, and wondered what it could mean.
    I saw that the mirror read, “.esaelp ,e̲É̲l̲û̲r̲b̲ e̲m̲È̲r̲c̲ eht ekil d’I”, and wondered what it could mean.

Since I'm using Unicode::GCString objects, the reversal part is just:

    say q(I saw that the mirror read, “),
        reverse(@$gcs),
        q(”, and wondered what it could mean.);
--tom

#!/usr/bin/env perl
#
# gcsdemo, w/IMPLICIT canonical processing

use v5.14;
use utf8;
use warnings;
use warnings FATAL => "utf8";
use charnames qw(:full);
use open qw(:utf8 :std);
use re "/x";

############################################

my $DEBUG = 0;

launch_canons();
run_demo();

exit 0;

############################################

sub run_demo {

    use Unicode::GCString;

    while (my $dessert = <>) {   # this is *not* what you think it is
        chomp $dessert;

        # underline dessert selection
        $dessert = ul($dessert);

        # cap only internal e’s to demo graphemic awareness
        $dessert =~ s/ \B (?=e) (\X) \B /\u$1/g;

        my $str = "I’d like the $dessert, please.";
        my $gcs = Unicode::GCString->new($str);

        # nor in fact, is this
        say q(I saw that the mirror read, “),
            reverse(@$gcs),
            q(”, and wondered what it could mean.);
    }
}

sub ul {
    return $_[0] =~ s{
        (?=\p{graph})  # printables only
        \X             # include whole grapheme
        \K             # but keep it
    }{\N{COMBINING LOW LINE}}gr;
}

# background two canonical agents: NFD and \R for input, NFC for output
sub launch_canons {

    use Unicode::Normalize;

    $| = 1;  # best keep this on!

    # register on call only, but register for all three of us (important)
    eval q{
        END {
            say STDERR "$$ closing STDOUT" if $DEBUG;
            close STDOUT; #  || die "$$: cannot close STDOUT: $! $?"
        }
        1;
    } // die "$$ ENOCLUE: $@";

    # need to clone this one first
    unless (my $postfilter = open(STDOUT, "|-") // die "$$: cannot fork: $!") {
        print STDOUT NFC($_) while <STDIN>;
        exit !close STDIN;
    }

    unless (my $prefilter = open(STDIN, "-|") // die "$$: cannot fork: $!") {
        # this is the real @ARGV
        $/ = "\n";              # hm, no need to localize in our parallel universe
        while (<ARGV>) {        # this is the real @ARGV
            $_ = NFD($_);       # we are so über-private already ’tis to laugh
            s/\R/\n/g;          # canonical linebreaks
            print STDOUT for split /(?=\n)/;
        }
        exit;
    } else {
        @ARGV = ();             # and now this is the fake @ARGV
    }

}
