
Re: Stripping out Unicode combining characters (diacritics)

From: Leif Andersson
Date: May 6, 2008 05:57
Subject: Re: Stripping out Unicode combining characters (diacritics)
Message ID: A21D63D98F93B848A371F387C2D0C26403C8FA1E@mail.intranet.sub.su.se

I've been doing it the way Mike R suggested for quite a while.
But some characters do not map nicely into this scheme, because they have no decomposition into a base character plus combining marks.

So you may want to handle characters such as the German eszett (ß) and the oe ligature (Œ, œ) manually.

s/\x{00df}/ss/g;
s/\x{0152}/Oe/g;
s/\x{0153}/oe/g;
...to be continued...
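
A minimal sketch of how the manual substitutions above might be combined with the NFD approach in one routine. The name strip_diacritics and the %fallback table are hypothetical; only the first three mappings come from this message, and the Æ/æ/Ø/ø entries are assumed additions for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Characters that NFD leaves untouched because they have no canonical
# decomposition. The first three are from the message above; the rest
# are assumed mappings, adjust to taste.
my %fallback = (
    "\x{00DF}" => 'ss',   # LATIN SMALL LETTER SHARP S
    "\x{0152}" => 'Oe',   # LATIN CAPITAL LIGATURE OE
    "\x{0153}" => 'oe',   # LATIN SMALL LIGATURE OE
    "\x{00C6}" => 'Ae',   # LATIN CAPITAL LETTER AE (assumed)
    "\x{00E6}" => 'ae',   # LATIN SMALL LETTER AE (assumed)
    "\x{00D8}" => 'O',    # LATIN CAPITAL LETTER O WITH STROKE (assumed)
    "\x{00F8}" => 'o',    # LATIN SMALL LETTER O WITH STROKE (assumed)
);

# Build one character class from the table keys.
my $class       = join '', keys %fallback;
my $fallback_re = qr/([$class])/;

sub strip_diacritics {
    my ($text) = @_;
    $text = NFD($text);                        # base characters + combining marks
    $text =~ s/\pM+//g;                        # drop the combining marks
    $text =~ s/$fallback_re/$fallback{$1}/g;   # handle the leftovers by hand
    return $text;
}

print strip_diacritics("Stra\x{00DF}e, c\x{0153}ur, na\x{00EF}ve"), "\n";
# prints "Strasse, coeur, naive"
```

Keeping the exceptions in a single table means a new special case is one more line in %fallback rather than another s/// statement.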

Leif
======================================
Leif Andersson, Systems Librarian
Stockholm University Library
SE-106 91 Stockholm
SWEDEN
Phone : +46 8 162769
Mobile: +46 70 6904281

-----Original Message-----
From: Doran, Michael D [mailto:doran@uta.edu]
Sent: 6 May 2008 04:13
To: Mike Rylander
Cc: perl-i18n@perl.org; Perl4lib
Subject: RE: Stripping out Unicode combining characters (diacritics)

Hi Mike,

I appreciate the quick reply.  I am familiar with the Unicode::Normalize module (and will also be using that), but I left it out of this question because it's not relevant to the problem I'm currently trying to solve.  The text I'm trying to strip diacritics out of does not have precomposed accented characters.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# doran@uta.edu
# http://rocky.uta.edu/doran/



-----Original Message-----
From: Mike Rylander [mailto:mrylander@gmail.com]
Sent: Mon 5/5/2008 8:52 PM
To: Doran, Michael D
Cc: perl-i18n@perl.org; Perl4lib
Subject: Re: Stripping out Unicode combining characters (diacritics)
 
On Mon, May 5, 2008 at 8:26 PM, Doran, Michael D <doran@uta.edu> wrote:
[snip]
>
>  I'm pulling my hair out on this... so any help would be appreciated.  If there's any other info I can provide, let me know.
>

You'll want to transform the text to NFD form (nominally, base
characters plus combining marks) instead of NFC (precomposed
characters) using Unicode::Normalize:

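 use Unicode::Normalize;

 my $text = NFD($original);   # decompose into base characters + combining marks
 $text =~ s/\pM+//g;          # remove every mark; /o does nothing here, so it is dropped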
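
A small sketch of what the two lines above do, assuming two hypothetical sample strings: one with a precomposed é and one that is already decomposed (as in Michael's data). The helper name strip_marks is made up for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Decompose, then delete everything with the Unicode "Mark" property.
sub strip_marks {
    my ($input) = @_;
    my $nfd = NFD($input);
    $nfd =~ s/\pM+//g;
    return $nfd;
}

my $precomposed = "caf\x{00E9}";    # é as a single code point (U+00E9)
my $decomposed  = "cafe\x{0301}";   # e followed by COMBINING ACUTE ACCENT

for my $input ($precomposed, $decomposed) {
    printf "%d code points, %d after NFD, stripped: %s\n",
        length($input), length(NFD($input)), strip_marks($input);
}
# prints:
# 4 code points, 5 after NFD, stripped: cafe
# 5 code points, 5 after NFD, stripped: cafe
```

For input that is already fully decomposed, NFD leaves the string as it is, so the substitution alone does the work; running NFD first just makes the routine safe for mixed input.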

Hope that helps.

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: miker@esilibrary.com
 | web: http://www.esilibrary.com

