Workaround to a unicode bug needed

Pierre Nugues
September 6, 2010 02:09
Dear All,

I wrote a simple tokenizer for texts containing Latin9 characters. It does not behave as expected with the Swedish text below and I would like to find a workaround.

More precisely, Perl does not properly remove the Swedish quotes, » (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, U+00BB), from the text. See the first character of the first line of the text below.

When I run the program with Perl 5.8.8 on a Mac running Snow Leopard, on the text encoded in UTF-8, Perl outputs an invalid UTF-8 byte for this character: <BB>
I could solve the problem by removing the û character (LATIN SMALL LETTER U WITH CIRCUMFLEX, U+00FB) from the tr// list.
Do you know of a better, cleaner way to work around this bug?

Thank you for your help

### The Perl Program
### An elementary tokenizer. Save it in UTF-8

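# Read the whole input into $text
while ($line = <>) {
    $text .= $line;
}
# Replace every run of characters outside the list with a single newline
$text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
# The dash character must be quoted
# Put each punctuation sign on a line of its own
$text =~ s/([,.?!:;()'\-])/\n$1\n/g;
# Collapse consecutive newlines
$text =~ s/\n+/\n/g;
print $text;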
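
### A possible explanation and workaround (a sketch, not a confirmed fix)

One plausible reading of the symptom: without the utf8 pragma, Perl treats the characters of the tr// list as bytes. In UTF-8, û (U+00FB) is the byte pair C3 BB and » (U+00BB) is the byte pair C2 BB, so the byte BB ends up in the list of characters to keep and survives on its own. The sketch below assumes that this is the cause. It declares the source as UTF-8 with `use utf8` and encodes the output explicitly. The `tokenize` subroutine and the `$sample` variable are names introduced here for illustration only.

```perl
use strict;
use warnings;
use utf8;    # the literals in this file are UTF-8 characters, not bytes

# Encode characters as UTF-8 on output
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: same three steps as the original program
sub tokenize {
    my ($text) = @_;
    # Replace every run of characters outside the list with a single newline
    $text =~ tr/a-zåàâäæçéèêëîïôöœßùûüÿA-ZÅÀÂÄÆÇÉÈÊËÎÏÔÖŒÙÛÜŸ'()\-,.?!:;/\n/cs;
    # Put each punctuation sign on a line of its own
    $text =~ s/([,.?!:;()'\-])/\n$1\n/g;
    # Collapse consecutive newlines
    $text =~ s/\n+/\n/g;
    return $text;
}

# One line of the test text, given here as a character string
my $sample = "»Just den rätta!» håna de.\n";
print tokenize($sample);
```

For input read with <>, each line would also have to be decoded from UTF-8 into characters before tokenizing, for instance with decode('UTF-8', $line) from the core Encode module.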


### The text to reproduce the bug. Save it in UTF-8

»Tjuvgömmare!» säga skatorna och se ut som samvetet självt. »Vi äro 
polisbetjänter, vi. Hit med tjuvgodset!» 
»Å, tyst, era rackare! Jag är gårdsfogden.» 
»Just den rätta!» håna de. 

Pierre Nugues, Lunds Tekniska Högskola, Institutionen för datavetenskap, Box 118, S-221 00 Lund, Sweden.
Tel. (0046) 46 222 96 40,
Visitors: Lunds Tekniska Högskola, E-huset, rum 4134A, Ole Römers väg 3, S-223 63 Lund.

