
RE: Stripping out Unicode combining characters (diacritics)

From: Doran, Michael D
Date: May 5, 2008 19:12
Subject: RE: Stripping out Unicode combining characters (diacritics)
Message ID: 9A9F358293FF2641A70B4A0AC7D082F33104D3@MAILFS2.uta.edu

Hi Mike,

I appreciate the quick reply.  I am familiar with the Unicode::Normalize module (and will also be using that), but I left it out of this question because it's not relevant to the problem I'm currently trying to solve.  The text I'm trying to strip diacritics out of does not have precomposed accented characters.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# doran@uta.edu
# http://rocky.uta.edu/doran/



-----Original Message-----
From: Mike Rylander [mailto:mrylander@gmail.com]
Sent: Mon 5/5/2008 8:52 PM
To: Doran, Michael D
Cc: perl-i18n@perl.org; Perl4lib
Subject: Re: Stripping out Unicode combining characters (diacritics)
 
On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D <doran@uta.edu> wrote:
[snip]
>
>  I'm pulling my hair out on this... so any help would be appreciated.  If there's any other info I can provide, let me know.
>

You'll want to transform the text to NFD form (canonical decomposition:
base characters followed by combining marks) instead of NFC (precomposed
characters) using Unicode::Normalize:

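 use Unicode::Normalize;

 # Decompose into base characters followed by combining marks
 my $text = NFD($original);
 # Remove every combining mark (Unicode general category M)
 $text =~ s/\pM+//g;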
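
For anyone who wants to try this in isolation, here is a minimal, self-contained sketch of the same idea. The helper name strip_marks and the sample strings are illustrative assumptions, not part of the original question. It assumes the input is already a decoded Perl character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (name chosen for illustration only).
# Assumes $text is a decoded character string, not raw UTF-8 bytes.
sub strip_marks {
    my ($text) = @_;

    # Canonical decomposition: each precomposed character becomes
    # a base character followed by its combining marks.
    my $decomposed = NFD($text);

    # Delete all characters in Unicode general category M (Mark).
    $decomposed =~ s/\p{M}+//g;

    return $decomposed;
}

# Precomposed input: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
my $precomposed = "caf\x{E9}";

# Already-decomposed input: "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $combining = "cafe\x{301}";

# Both forms reduce to the same unaccented string.
print "match\n" if strip_marks($precomposed) eq strip_marks($combining);
```

Running NFD first is harmless when the text is already decomposed, so the same helper should cover both precomposed and combining-mark input.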

Hope that helps.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: miker@esilibrary.com
 | web: http://www.esilibrary.com

