develooper Front page | perl.perl5.porters | Postings from December 2001

Unicode SCRIPT and BLOCK names

From: Jeffrey Friedl
Date: December 23, 2001 14:33
Subject: Unicode SCRIPT and BLOCK names
Message ID: 200112232233.fBNMXGE66841@ventrue.corp.yahoo.com

I just got bit by a 5.6->5.8 change in what \p{InTibetan} means.

In 5.6, \p{InTibetan} refers to a Unicode _block_.
In bleedperl, \p{InTibetan} now refers to the Unicode _script_.

Unicode scripts are superior to blocks... unless you were expecting block
semantics, as legacy code does. It turns out that
  \N{TIBETAN MARK GTER TSHEG}
is not part of the script, so I got bit.
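
The block/script difference can be checked directly. This is only a rough
sketch, and it assumes a much later perl than the ones discussed here (one
that accepts the explicit \p{Block=...} and \p{Script=...} forms). The helper
name tibetan_membership is made up for illustration. U+0F48 is an unassigned
code point inside the Tibetan block, so it shows the block-but-not-script case
regardless of Unicode version; what is reported for GTER TSHEG depends on the
Unicode data bundled with the perl that runs it.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} on older perls

# Hypothetical helper: report block and script membership for one character.
sub tibetan_membership {
    my ($char) = @_;
    return {
        block  => ($char =~ /\p{Block=Tibetan}/  ? 1 : 0),
        script => ($char =~ /\p{Script=Tibetan}/ ? 1 : 0),
    };
}

my @samples = (
    [ 'TIBETAN LETTER KA',      0x0F40 ],
    [ 'TIBETAN MARK GTER TSHEG', ord("\N{TIBETAN MARK GTER TSHEG}") ],
    [ '(unassigned)',           0x0F48 ],
);

for my $s (@samples) {
    my ($label, $cp) = @$s;
    my $m = tibetan_membership(chr $cp);
    printf "U+%04X %-24s block=%d script=%d\n",
        $cp, $label, $m->{block}, $m->{script};
}
```

Code that needs the old 5.6 behaviour can ask for the block explicitly this
way instead of relying on what the In prefix happens to mean.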

Programming Perl says:

    "Note that these 'In' properties are only testing to see if the
     character is in the block of characters allocated for that script."

Since scripts are closer to general categories (which use 'Is') than to
blocks, it might be appropriate to keep \p{In...} as the block, while
adding \p{Is...} to refer to the script:

                          current         
                 perl5.6  bleedperl       *proposed*
                 -------  -------------   -------------
     InTibetan   block    script||block   block
     IsTibetan   error    script||block   script
     Tibetan     error    script||block   script||block

     Where
       "script||block"
     means:
       "script if available, block otherwise"
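
To make the columns concrete, here is a toy model of the lookup rules in the
table. It is not how perl resolves property names internally. The name tables,
the function names, and the choice to return 'error' for an unknown name are
all assumptions made for illustration, and the bleedperl rule is simply my
reading of the table above.

```perl
use strict;
use warnings;

# Illustrative name tables only (assumption); the real ones come from
# the Unicode data files.
my %script = map { $_ => 1 } qw(Tibetan);
my %block  = map { $_ => 1 } qw(Tibetan Arrows);

# "script||block": script if available, block otherwise.
sub script_or_block {
    my ($name) = @_;
    return 'script' if $script{$name};
    return 'block'  if $block{$name};
    return 'error';
}

# bleedperl column: the In/Is prefix makes no difference.
sub resolve_bleedperl {
    my ($prop) = @_;
    (my $name = $prop) =~ s/^(?:In|Is)//;
    return script_or_block($name);
}

# proposed column: In means block, Is means script, a bare name falls back.
sub resolve_proposed {
    my ($prop) = @_;
    if ($prop =~ /^In(.+)$/) {
        return $block{$1} ? 'block' : 'error';
    }
    if ($prop =~ /^Is(.+)$/) {
        return $script{$1} ? 'script' : 'error';
    }
    return script_or_block($prop);
}

for my $prop (qw(InTibetan IsTibetan Tibetan InArrows Arrows)) {
    printf "%-10s bleedperl=%-6s proposed=%s\n",
        $prop, resolve_bleedperl($prop), resolve_proposed($prop);
}
```

Under the proposed rule the prefix alone tells you what you get, so a name
like InArrows (a block with no script of the same name) and InTibetan behave
the same way.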

This preserves existing semantics, keeps the Is/In distinction consistent,
yet allows easy access to the superior script concept.

Another benefit is that it allows us to get rid of the {TibetanBlock}
names, which are sometimes there and sometimes not. (They are there only
when there's a script with the same name, so you still have to go to the
man page every time to see which to use.)

   Jeffrey
