perl.beginners | Postings from January 2012

Some UTF-8-related questions

From: Hamann, T.D.
Date: January 11, 2012 02:59
Subject: Some UTF-8-related questions
Message ID: 2AE6512DA75354478DD9408686ABA09A0406A627@VSTPST01.nhncml.org

Hi,

Thanks for the answers to my last question. I have since dug a bit further into the UTF-8-related error message I got and, after some reading, have a few questions regarding UTF-8 handling in Perl:

(Please bear in mind that I am not an IT guy)

1a) My use statements are the following:

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';

Now, if I understand it correctly, there are two ways of encoding UTF-8 in Perl: a liberal one (utf8) and a strict one (UTF-8). For my purpose, I need correctly encoded UTF-8 files. However, I cannot be sure that the files I start with are properly encoded in UTF-8.
So is it possible to open a file using the liberal interpretation and write to a new file using the strict interpretation? Are there any issues with this, such as characters that might not be re-encoded properly?
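
To make question 1a concrete, here is a minimal sketch of what I have in mind. It assumes that the layer names :encoding(utf8) and :encoding(UTF-8) select the liberal and the strict variant respectively; the helper name copy_liberal_to_strict and the file names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: read $in_name with the liberal layer and
# write the same text to $out_name with the strict layer.
sub copy_liberal_to_strict {
    my ($in_name, $out_name) = @_;
    open my $in,  '<:encoding(utf8)',  $in_name  or die "Cannot read $in_name: $!";
    open my $out, '>:encoding(UTF-8)', $out_name or die "Cannot write $out_name: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $out_name: $!";
    return;
}

# Illustration only: build a small sample file in a scratch directory.
my $dir = tempdir(CLEANUP => 1);
my ($in_name, $out_name) = ("$dir/in.txt", "$dir/out.txt");

open my $raw, '>:raw', $in_name or die "Cannot write $in_name: $!";
print {$raw} "caf\xC3\xA9\n";    # "café" as raw UTF-8 bytes
close $raw;

copy_liberal_to_strict($in_name, $out_name);
```

What I do not know is what happens in such a copy when the input contains something the strict encoding rejects.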

1b) How can I check whether a file is properly encoded UTF-8?
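
From the Encode documentation, I would guess that a check could look roughly like the sketch below: decode the raw bytes with the strict UTF-8 encoding and let FB_CROAK raise an error on malformed input. The helper names is_valid_utf8 and file_is_valid_utf8 are my own invention, and I am not sure this is the recommended way.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if $bytes is strictly valid UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy, because decode() with a check value may modify its argument
    # With FB_CROAK, decode() dies on malformed input; eval catches that.
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: slurp a file as raw bytes and check it.
sub file_is_valid_utf8 {
    my ($name) = @_;
    open my $fh, '<:raw', $name or die "Cannot read $name: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    return is_valid_utf8(defined $bytes ? $bytes : '');
}
```

Calling file_is_valid_utf8 with a file name would then return 1 for a well-formed file and 0 otherwise.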


2a) As I understand it, Windows has a somewhat limited ability to display certain UTF-8 characters, although some fonts can display more of them. The characters do exist in the file even if Windows can't display them and shows a square instead. Is this correct? If not, does that affect Perl's ability to handle Unicode?

2b) Do scripts themselves have to be encoded in UTF-8 to be able to process UTF-8 files? If not, when should scripts be encoded in UTF-8 and when not? Most of my scripts add text to UTF-8 encoded text files. I've noticed that this sometimes seems to change the encoding or produce error messages when, for example, accented characters are involved. Am I right in assuming that only scripts that remove text or extract certain parts do not need to be encoded in UTF-8?

2c) Not really a perl question: Does anyone know of a monospaced font for Windows that handles most UTF-8 characters gracefully? I would like one for use in Notepad++ to make it easier to write scripts containing special characters not normally displayable in Windows.


3) As far as I can tell, Windows programs typically write UTF-8 with a BOM, whereas Unix and Unix-like systems typically write UTF-8 without one. A particular script of mine prepends a piece of text to UTF-8 encoded text files created with MS Word on Windows (saved as .txt with UTF-8 encoding). Unfortunately, this appears to break the encoding, which changes from "UTF-8 with BOM" to "UTF-8 without BOM", probably because the text is inserted *before* the BOM at the start of the file. How do I prevent this? How can my script recognize the BOM at the start of the file?
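
To illustrate what I am after in question 3, here is a minimal sketch that works on raw bytes. It assumes the BOM is the three-byte sequence EF BB BF at the very start of the content; the helper name prepend_keeping_bom is made up, and both arguments are assumed to be UTF-8 encoded byte strings already.

```perl
use strict;
use warnings;

# UTF-8 encoding of U+FEFF (the byte order mark), as raw bytes.
my $BOM = "\xEF\xBB\xBF";

# Hypothetical helper: prepend $text_bytes to $content_bytes while
# keeping a leading BOM, if present, at the very start of the result.
sub prepend_keeping_bom {
    my ($text_bytes, $content_bytes) = @_;
    if (substr($content_bytes, 0, 3) eq $BOM) {
        # Re-insert the BOM in front of the new text.
        return $BOM . $text_bytes . substr($content_bytes, 3);
    }
    # No BOM found: plain concatenation.
    return $text_bytes . $content_bytes;
}
```

The result would then have to be written back through a raw filehandle, so that no encoding layer touches the bytes a second time.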

Thanks for reading.

Regards,
Thomas









