
Re: Stripping out Unicode combining characters (diacritics) -

From: Brad Baxter
Date: May 8, 2008 04:15
Subject: Re: Stripping out Unicode combining characters (diacritics) -
Message ID: f65b37ea0805071220p4fcbd016j85e98e9db5d51196@mail.gmail.com

Just to throw this out there: you may be interested in Text::Unidecode
(http://search.cpan.org/~sburke/Text-Unidecode-0.04/) if your ultimate
goal is to represent a Unicode character with its closest ASCII
(or perhaps I should say, "romanized") equivalent.
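
For reference, a minimal sketch of what that might look like, assuming
Text::Unidecode is installed from CPAN (it is not a core module).  The
sample string is only an illustration; unidecode() is the function the
module exports by default.

```perl
use strict;
use warnings;
use Text::Unidecode;   # CPAN module, not core; exports unidecode()

# Illustrative input: the name with a precomposed o-acute (U+00F3).
my $name = "Bart\x{00F3}k";

# unidecode() returns a plain-ASCII approximation of its argument.
my $ascii = unidecode($name);

print "$ascii\n";
```

Note that this is a transliteration, not just mark removal: characters
with no diacritics at all (Cyrillic, Greek, and so on) are romanized too,
which may or may not be what a search-matching script wants.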

-- Brad

On Wed, May 7, 2008 at 9:51 AM, Doran, Michael D <doran@uta.edu> wrote:

> I received a number of helpful suggestions and solutions.  The approach I
> decided to adopt in my larger script is to 'decode' all the incoming form
> input as UTF-8 as well as the input from the database that I'll be matching
> the form input against.  This seems to allow the '\p{M}' syntax to work as
> expected in a Perl regexp.  In my test.cgi script for form input it would
> look like this:
>
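> #!/usr/local/bin/perl
> use strict;
> use CGI;
> use Encode;
> # Emit UTF-8 so the decoded (character) strings print without
> # "Wide character" warnings.
> binmode(STDOUT, ':encoding(UTF-8)');
> my $query = CGI->new;
> # Decode the raw form bytes into Perl characters so \p{M} can match.
> my $search_term = decode('UTF-8', $query->param('text'));
> my $sans_diacritics = $search_term;
> # Remove every combining mark (Unicode general category M).
> $sans_diacritics =~ s/\p{M}+//g;
> print qq(Content-type: text/plain; charset=utf-8
>
> search_term     is $search_term
> sans_diacritics is $sans_diacritics
> );
> exit(0);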
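
Why the decode step matters can be seen outside of CGI.  A minimal
sketch, assuming the form delivers the UTF-8 bytes for "o" followed by
U+0301 (the byte string below is hand-built for illustration and is not
taken from the original script): on raw bytes, \p{M} sees two unrelated
code points (0xCC and 0x81) and removes nothing; after decode(), the same
regexp sees the single character U+0301 and removes it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hand-built sample: the UTF-8 bytes for "Barto" + U+0301 + "k",
# standing in for what a form submission would deliver (an assumption).
my $bytes = "Barto\xCC\x81k";

# Without decoding, Perl treats 0xCC and 0x81 as two separate
# characters, neither of which is a mark, so nothing is removed.
(my $from_bytes = $bytes) =~ s/\p{M}+//g;

# After decoding, those two bytes become the single character U+0301,
# which \p{M} matches.
my $chars = decode('UTF-8', $bytes);
(my $from_chars = $chars) =~ s/\p{M}+//g;

printf "bytes: %d -> %d characters\n", length($bytes), length($from_bytes);
printf "chars: %d -> %d characters\n", length($chars), length($from_chars);
```

The same reasoning would apply to the database values being compared
against: they need to be decoded too, or the two sides will not match.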
>
> I'm slowly figuring out how to work with Unicode in my web scripts, but
> still have a lot to learn.  Thanks for all the help. :-)
>
> -- Michael
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # doran@uta.edu
> # http://rocky.uta.edu/doran/
>
>
> > -----Original Message-----
> > From: Doran, Michael D [mailto:doran@uta.edu]
> > Sent: Monday, May 05, 2008 7:27 PM
> > To: perl-i18n@perl.org
> > Cc: Perl4lib
> > Subject: Stripping out Unicode combining characters (diacritics)
> >
> > I'm trying to strip out combining diacritics from some form
> > input using this code:
> >
> > <html>
> > <head>
> >     <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
> > </head>
> > <body>
> >   <form action="test.cgi" accept-charset="UTF-8" method="get">
> >     <input type="text" name="text" value="" size="10">
> >     <input type="submit" value="submit">
> >   </form>
> > </body>
> > </html>
> >
> > #!/usr/local/bin/perl
> > use CGI;
> > $query = CGI::new();
> > $search_term = $query->param('text');
> > $sans_diacritics  = $search_term;
> > $sans_diacritics  =~ s/\p{M}*//g;
> > #$sans_diacritics  =~ s/o//g;
> > print qq(Content-type: text/plain; charset=utf-8
> >
> > $sans_diacritics
> > );
> > exit(0);
> >
> >
> > In the form, I'm inputting the string "Bartók" with the
> > accented character being a base character (small Latin letter
> > "o") followed by a combining acute accent.  However, when I
> > print (to the web) $sans_diacritics, I get my input with no
> > change -- the combining diacritic is still there.  I know
> > that my input is not a precomposed accented character,
> > because I can strip out the base "o" and the combining accent
> > either stands alone or jumps to another character [2].
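
If the input might instead arrive precomposed (a single U+00F3), \p{M}
alone would leave it untouched, since there is no separate mark to match.
A hedged sketch, not part of the original script: Unicode::Normalize,
which ships with Perl, can decompose the string first with NFD, after
which the mark can be stripped the same way.  The helper name strip_marks
and the sample strings are hypothetical, chosen only for this
illustration, and the inputs are assumed to be already-decoded character
strings.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop all combining marks.
sub strip_marks {
    my ($text) = @_;
    # NFD splits precomposed letters into base letter + combining marks;
    # already-decomposed input passes through unchanged.
    (my $stripped = NFD($text)) =~ s/\p{M}+//g;
    return $stripped;
}

# Two illustrative spellings of the same name.
my $precomposed = "Bart\x{00F3}k";    # single code point U+00F3
my $combining   = "Barto\x{0301}k";   # "o" followed by U+0301

print strip_marks($precomposed), "\n";
print strip_marks($combining), "\n";
```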
> >
> > The "\p{M}" is a Unicode class name for the character class
> > of Unicode 'marks', for example accent marks [1].  I've tried
> > these variations (and many others) and none seem to be doing
> > what I want:
> >
> >        $sans_diacritics =~ s#[\p{Mark}]*##g;
> >        $sans_diacritics =~ tr#[\p{InCombiningDiacriticalMarks}]##;
> >        $sans_diacritics =~ tr#[\p{M}]##;
> >        $sans_diacritics =~ s/\p{M}*//g;
> >        $sans_diacritics =~ s#[\p{M}]##g;
> >        $sans_diacritics =~ s#\x{0301}##g;
> >        $sans_diacritics =~ s#\x{006F}\x{0301}##g;
> >        $sans_diacritics =~ s#[\x{0300}-\x{036F}]*##g;
> >
> > I'm pulling my hair out on this... so any help would be
> > appreciated.  If there's any other info I can provide, let me know.
> >
> > My Perl version is 5.8.8 and the script is running on a
> > server running Solaris 9.
> >
> > -- Michael
> >
> > [1] per http://perldoc.perl.org/perlretut.html and other documentation
> >
> > [2] using $sans_diacritics  =~ s/o//g;
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # doran@uta.edu
> > # http://rocky.uta.edu/doran/
> >
>
