Workaround to a unicode bug needed

From: Pierre Nugues
Date: September 6, 2010 02:09
Subject: Workaround to a unicode bug needed
Message ID: 0699C954-E99E-47C3-84E1-E9BCEABAA799@cs.lth.se

Dear All,

I wrote a simple tokenizer for texts containing Latin-9 characters. It does not behave as expected with the Swedish text below, and I would like to find a workaround.

More precisely, Perl does not properly remove the Swedish quotation mark » (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB) from the text. See the first character of the first line of the sample text.

When I run the program with Perl 5.8.8 on a Mac running Snow Leopard, on the text encoded in UTF-8, Perl outputs a malformed UTF-8 sequence for this character: <BB>
I could solve the problem by removing the û character (LATIN SMALL LETTER U WITH CIRCUMFLEX, U+00FB) from the tr// list.
Do you know of a better, cleaner way to work around this bug?

Thank you for your help
Pierre
--

### The Perl Program
### An elementary tokenizer. Save it in UTF-8
___BEGIN

while ($line = <>) { 
   $text .= $line;
}
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
  # The dash character must be quoted
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
$text =~ s/\n+/\n/g;
print $text;

___END

### The text to reproduce the bug. Save it in UTF-8

___BEGIN
»Tjuvgömmare!» säga skatorna och se ut som samvetet självt. »Vi äro 
polisbetjänter, vi. Hit med tjuvgodset!» 
»Å, tyst, era rackare! Jag är gårdsfogden.» 
»Just den rätta!» håna de. 
___END
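
One possible explanation, stated here as an assumption rather than a confirmed diagnosis: without `use utf8`, Perl treats the tr// list as bytes. In UTF-8, û is the bytes C3 BB and » is the bytes C2 BB, so BB counts as an allowed byte while C2 is replaced, which would leave a lone <BB> in the output. The sketch below keeps the same three steps but runs them on decoded characters. The names `tokenize` and `tokenize_utf8` are hypothetical, and the sample input is the start of the text above written out as raw bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # literals in this file are read as UTF-8 characters
use Encode qw(decode encode);   # Encode ships with Perl 5.8 and later

# Tokenize a string of decoded characters (not raw bytes).
sub tokenize {
    my ($text) = @_;
    # Replace every run of characters outside the list with one newline.
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
    # Put each punctuation mark on a line of its own.
    $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
    # Collapse the runs of newlines this creates.
    $text =~ s/\n+/\n/g;
    return $text;
}

# Hypothetical wrapper: UTF-8 bytes in, UTF-8 bytes out.
sub tokenize_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', tokenize(decode('UTF-8', $bytes)));
}

# Start of the sample text as the raw bytes a UTF-8 file contains:
# C2 BB is », C3 B6 is ö, C3 A4 is ä.
my $bytes = "\xC2\xBBTjuvg\xC3\xB6mmare!\xC2\xBB s\xC3\xA4ga";
print tokenize_utf8($bytes), "\n";
```

To process real input, the bytes read with `<>` would be passed to `tokenize_utf8` in place of the hard-coded sample. Decoding explicitly with Encode avoids relying on default I/O layers, whose behaviour may differ between Perl versions.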
--
Pierre Nugues, Lunds Tekniska Högskola, Institutionen för datavetenskap, Box 118, S-221 00 Lund, Sweden.
Tel. (0046) 46 222 96 40, http://www.cs.lth.se/~pierre
Visitors: Lunds Tekniska Högskola, E-huset, rum 4134A, Ole Römers väg 3, S-223 63 Lund.
My book: http://www.cs.lth.se/home/Pierre_Nugues/ilppp/


