perl.unicode | Postings from March 2012

Word boundaries

From: Zbigniew Łukasiak
Date: March 26, 2012 02:03
Subject: Word boundaries
Message ID: CAGL_UUtHdD+_uKHkvqw=TFeAm-tp3Sx3QGZYQkeBO+GysRFB3A@mail.gmail.com

For our spam classifier I need to split text into words.
Unfortunately the '\b' regex assertion does not yet work for languages
written without spaces between words (apparently this is covered in
Level 3 of Unicode regex support,
http://unicode.org/reports/tr18/#Tailored_Word_Boundaries), so I need
a custom solution.  This did not seem very difficult: split the text
into runs of the same Unicode script, then use '\b' for most scripts
and appropriate libraries for the rest (at least for Chinese there are
some tokenizers on CPAN). But:

1. How can I split the text into blocks of the same script?  (Wouldn't
a script-boundary regex assertion be useful?)  OK, I can always loop
over the characters, check each one's script, and compare it with the
previous one's, i.e. back to the C mode of programming.  But then
there is still the question of:

2. How can I check which script a character belongs to?  Do I need to
cut and paste all the script ranges from unicode.org into a huge
if-else chain in my program, or is there a simpler way?
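
For illustration, both questions can be sketched in a few lines of
Perl.  This is a minimal sketch, not a definitive answer: it assumes
the core Unicode::UCD module and its charscript() function, the helper
names script_of and script_runs are hypothetical, and attaching
characters of the Common and Inherited scripts (spaces, punctuation,
digits, combining marks) to the run in progress is a design assumption
rather than anything the Unicode standard requires here.

```perl
use strict;
use warnings;
use utf8;                           # this source contains UTF-8 literals
use Unicode::UCD qw(charscript);    # core module: code point -> script name

# Return the script name of one character, e.g. "Latin" or "Han".
# Older versions may return undef for unassigned code points.
sub script_of {
    my ($char) = @_;
    return charscript(ord $char) // 'Unknown';
}

# Split a string into a list of [script, substring] runs.
sub script_runs {
    my ($text) = @_;
    my @runs;
    for my $char (split //, $text) {
        my $script  = script_of($char);
        my $neutral = $script eq 'Common' || $script eq 'Inherited';
        if (!@runs) {
            # First character starts the first run.
            push @runs, [ $neutral ? 'Common' : $script, $char ];
        }
        elsif ($neutral || $runs[-1][0] eq $script) {
            # Same script, or a neutral character: extend the current run.
            $runs[-1][1] .= $char;
        }
        elsif ($runs[-1][0] eq 'Common') {
            # Only leading neutral characters so far: adopt the real script.
            $runs[-1] = [ $script, $runs[-1][1] . $char ];
        }
        else {
            # Script changed: start a new run.
            push @runs, [ $script, $char ];
        }
    }
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';
printf "%-10s [%s]\n", @$_ for script_runs("Hello 世界 мир");
```

Each run can then be handed to '\b'-based splitting or to a
script-specific tokenizer.  For a single known script, the regex
property form \p{Script=Han} avoids the lookup altogether.  Note that
Japanese text mixes Han, Hiragana and Katakana, so it would come out
as several runs with this approach.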

Thanks in advance,
Zbigniew
