Stripping out Unicode combining characters (diacritics)

Doran, Michael D
May 5, 2008 17:27
I'm trying to strip out combining diacritics from some form input using this code:

    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <form action="test.cgi" accept-charset="UTF-8" method="get">
      <input type="text" name="text" value="" size="10">
      <input type="submit" value="submit">
    </form>

use CGI;
$query = CGI->new;
$search_term = $query->param('text');
$sans_diacritics  = $search_term;
$sans_diacritics  =~ s/\p{M}*//g;    # intended to remove all combining marks
#$sans_diacritics  =~ s/o//g;        # test from [2]: strip the base "o" instead
print qq(Content-type: text/plain; charset=utf-8

$sans_diacritics
);

In the form, I'm inputting the string "Bartók", where the accented character is a base character (Latin small letter "o") followed by a combining acute accent (U+0301).  However, when I print $sans_diacritics to the web page, my input comes back unchanged: the combining diacritic is still there.  I know that my input is not a precomposed accented character, because when I strip out the base "o" instead, the combining accent either stands alone or jumps to another character [2].

"\p{M}" is the Unicode property class for 'marks', for example accent marks [1].  I've tried these variations (and many others), and none of them seems to do what I want:

       $sans_diacritics =~ s#[\p{Mark}]*##g;
       $sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
       $sans_diacritics =~ tr#[\p{M}]##;
       $sans_diacritics =~ s/\p{M}*//g;
       $sans_diacritics =~ s#[\p{M}]##g;
       $sans_diacritics =~ s#\x{0301}##g;
       $sans_diacritics =~ s#\x{006F}\x{0301}##g;
       $sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;
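
For comparison, here is a minimal sketch of \p{M} removing a combining accent once the string has been decoded from UTF-8 octets into Perl characters. It assumes the form value arrives as raw UTF-8 octets, which has not been confirmed for this setup. strip_marks is a hypothetical helper and is not part of the script above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original script): takes UTF-8 octets,
# removes every Unicode mark character, and returns UTF-8 octets.
sub strip_marks {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # octets -> Perl character string
    $chars =~ s/\p{M}+//g;                  # drop all combining marks
    return encode('UTF-8', $chars);         # characters -> octets for output
}

# "o" followed by U+0301 (combining acute accent), written as UTF-8 octets.
my $input = "Barto\xCC\x81k";

# Same substitution applied to the undecoded octets, for contrast.
my $raw = $input;
$raw =~ s/\p{M}+//g;

print strip_marks($input), "\n";
print $raw eq $input ? "raw octets unchanged\n" : "raw octets changed\n";
```

In this sketch the decoded string loses its accent, while the undecoded copy is left alone, because the individual bytes 0xCC and 0x81 are not mark characters themselves. Whether that is what happens in the CGI script depends on how the parameter value is delivered.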

I'm pulling my hair out on this... so any help would be appreciated.  If there's any other info I can provide, let me know.

My Perl version is 5.8.8 and the script is running on a server running Solaris 9.

-- Michael

[1] per and other documentation

[2] using $sans_diacritics  =~ s/o//g;

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
