
RFC: Making -B and -T work better on 8-bit encodings

From: Karl Williamson
Date: May 2, 2014 20:29
Subject: RFC: Making -B and -T work better on 8-bit encodings
Message ID: 53640000.7020804@khwilliamson.com

The -B and -T file test operators don't really work well for non-ASCII 
text files, except perhaps under 'use locale'.

I'm proposing to change them in several ways early in 5.21, so that we 
can check that the changes don't adversely affect existing code before 
we decide what to do for 5.22.

Attached is a preliminary patch along those lines, though, based on some 
experimentation, my thinking has evolved somewhat since I wrote it.

The way it currently works is that it goes through a buffer (probably 
512 bytes) and says the file is binary if the ratio of probable text to 
non-text characters is less than 3.  In addition, a single NUL in the 
input is enough to make the file be classified as binary.
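
To make that rule concrete, here is a rough sketch in Python.  It models 
only the description above, not the actual code in the perl core.  The 
name looks_binary_current and the exact set of "probable text" bytes 
(printable ASCII plus TAB, LF, CR, FF and ESC) are assumptions made for 
illustration; locale and UTF-8 handling are left out entirely.

```python
# Simplified model of the heuristic as described in the text above.
# Assumption: "probable text" means printable ASCII plus TAB, LF, CR,
# FF and ESC; every other byte, including all bytes >= 0x80, is non-text.
TEXT_BYTES = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D, 0x0C, 0x1B}


def looks_binary_current(block: bytes) -> bool:
    """Return True if the block would be called binary under the described rule."""
    if 0 in block:
        return True  # a single NUL is enough
    text = sum(1 for b in block if b in TEXT_BYTES)
    nontext = len(block) - text
    # Binary when text / nontext < 3, written without a division.
    return text < 3 * nontext


# Ordinary Latin1 prose is misclassified: 3 of these 8 bytes are >= 0x80.
print(looks_binary_current("café été".encode("latin-1")))
```

Under these assumptions even a short French phrase encoded in Latin1 
comes out as binary, which is the problem this RFC is about.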

It seems to me that every true binary file will contain plenty of NULs, 
so it seems overkill to look at anything else.

But assuming I'm wrong, what I'm proposing is to count the upper-Latin1 
range printables as text, and to correspondingly lower the proportion of 
non-text characters that is tolerated before a file is called binary.

The current UTF-8 handling is haphazard, and I think suboptimal.  The 
patch changes things to see if the entire block is ASCII.  If not, it 
then looks to see if it is one long UTF-8 string.  The odds of something 
that passes that test for 512 bytes not being UTF-8 are vanishingly small.
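
The two-step check described here can be sketched in Python as follows. 
This is an illustration, not the patch: the function name 
classify_block_encoding is made up, and Python's strict UTF-8 decoder 
stands in for whatever validity test the patch uses.  An incremental 
decoder is used on the assumption that a 512-byte block may end in the 
middle of a multi-byte character, which should not count against it.

```python
import codecs


def classify_block_encoding(block: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for one block of file data."""
    if block.isascii():
        return "ascii"  # every byte is below 0x80
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates a multi-byte sequence cut off at the block end.
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return "other"  # not one long well-formed UTF-8 string
    return "utf-8"


print(classify_block_encoding("naïve café".encode("utf-8")))
print(classify_block_encoding("naïve café".encode("latin-1")))
```

Only blocks that come back as 'other' would need the byte-by-byte 
classification.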

It seems to me that if the entire block is ASCII, no further work need 
be done: it has got to be a text file.  I can't imagine a 
binary file not having the upper bit of some byte set within the first 
512 bytes.  But the patch currently will do the byte-by-byte 
classification even so.

I also think that Vertical Tab and Form Feed are so infrequent that they 
should be counted as non-text (currently VT is non-text, but FF is text).
Also, a byte with all 8 bits on is very common in binary, but extremely 
uncommon in text.  It represents a y with diaeresis in Latin1, which I 
believe occurs in modern French in only a couple of place names, and is 
not used in the other languages that Latin1 is designed for.  So I would 
change the patch to classify \xFF as a control.  ESC currently is 
considered text, and my patch retains that, as it is relatively commonly 
used in rich text files.
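
As a sketch of the per-byte classification this paragraph argues for, 
consider the Python below.  The function name proposed_is_text is 
hypothetical, and so are two details: that the upper-Latin1 printables 
are the bytes 0xA0 through 0xFE, and that TAB, LF, CR and ESC are the 
only control characters counted as text.

```python
def proposed_is_text(byte: int) -> bool:
    """Classify one byte (0..255) as text under the proposed rules."""
    if 0x20 <= byte <= 0x7E:
        return True  # printable ASCII
    if byte in (0x09, 0x0A, 0x0D, 0x1B):
        return True  # TAB, LF, CR, and ESC (kept as text)
    if 0xA0 <= byte <= 0xFE:
        return True  # assumed upper-Latin1 printable range, without 0xFF
    return False  # NUL, VT, FF, other controls, 0x7F..0x9F, and 0xFF


# 0xFF is y with diaeresis in Latin1, but is treated as a control here.
print(bytes([0xFF]).decode("latin-1"), proposed_is_text(0xFF))
```

With that range, exactly 95 of the 128 bytes above 0x7F count as text.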

It seems to me that lowering the threshold so that more than about 
15-20% non-text causes the file to be classified as binary, while 
expanding the set of text characters with the 95 upper Latin1 printable 
characters (except for \xFF), will give good results, better than the 
existing ones.
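
To put numbers on the thresholds: a text to non-text ratio below 3 is 
the same as more than 25% of the block being non-text, since 
text < 3 * nontext and text + nontext = length together give 
nontext > length / 4.  The Python sketch below compares that with the 
15% and 20% figures for a 512-byte block.  The function names are made 
up for illustration, and the 25% figure follows from the description 
earlier in this message rather than from the core source.

```python
def is_binary_by_fraction(nontext: int, length: int, max_percent: int) -> bool:
    """Binary when more than max_percent of the bytes are non-text."""
    return nontext * 100 > max_percent * length  # integer math, no rounding


def smallest_binary_count(length: int, max_percent: int) -> int:
    """Fewest non-text bytes that make a block of this length binary."""
    return (max_percent * length) // 100 + 1


for percent in (25, 20, 15):
    print(percent, smallest_binary_count(512, percent))
```

For a 512-byte block this works out to 129 non-text bytes at 25%, 103 at 
20%, and 77 at 15%.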




