develooper Front page | perl.unicode | Postings from March 2012

Word boundaries

March 26, 2012 02:03
For our spam classifier I need to split text into words.
Unfortunately the '\b' regex assertion does not yet work for languages written
without spaces (apparently that is covered in level 3 of Unicode support), so I
need some custom solution.  This did not seem very difficult: just split
the text into blocks of the same Unicode script, then use '\b' for most
of the scripts and appropriate libraries for the rest (at least for
Chinese there are some tokenizers on CPAN).  But:

1. How can I split the text into blocks of the same script?  (Wouldn't a
script-boundary regex property be useful?)  OK, I can always loop over
the characters, check each one's script and compare it with the
previous one's, i.e. back to the C mode of programming.  But then there is
still the question of:

2. How can I check what script a character belongs to?  Do I need to
cut and paste all the script ranges into a huge
if-else branch in my program, or is there a simpler way?

Thanks in advance,
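
One possible way to approach both questions, as a minimal sketch rather than a definitive answer: the core module Unicode::UCD exports charscript(), which returns the script name for a code point, so no hand-copied range table should be needed. The helper name script_runs is made up for illustration, and attaching Common and Inherited characters (spaces, punctuation, combining marks) to the neighbouring run is an assumption about what a classifier would want, not something the question specifies.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical helper: split a decoded (character) string into runs of
# characters that share one Unicode script.  Returns a list of
# [ script name, substring ] pairs.
sub script_runs {
    my ($text) = @_;
    my @runs;
    for my $ch (split //, $text) {
        # charscript() may return undef for unassigned code points
        # on older perls, so fall back to 'Unknown'.
        my $script  = charscript(ord $ch) // 'Unknown';
        my $neutral = $script eq 'Common' || $script eq 'Inherited';

        if (@runs && ($neutral || $runs[-1][0] eq $script)) {
            # Same script, or a neutral character: extend the current run.
            $runs[-1][1] .= $ch;
        }
        elsif (@runs && ($runs[-1][0] eq 'Common' || $runs[-1][0] eq 'Inherited')) {
            # The run so far held only neutral characters: let it adopt
            # the first real script that follows.
            $runs[-1][0]  = $script;
            $runs[-1][1] .= $ch;
        }
        else {
            # Script changed: start a new run.
            push @runs, [ $script, $ch ];
        }
    }
    return @runs;
}
```

Each run could then be dispatched on its script name: '\b'-style splitting for most scripts, and a dedicated tokenizer for runs tagged 'Han'. For a single test against one known script, the regex property form (for example /\p{Script=Han}/) avoids the function call. charscript() is not fast, so caching its result per character may be worthwhile for large inputs.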
