The -B and -T file test operators don't really work well for non-ASCII text files, except perhaps under 'use locale'. I'm proposing to change them in several ways early in 5.21, so that we can make sure the changes don't adversely affect existing code before we decide for 5.22. Attached is a preliminary patch along those lines, though my thinking has evolved somewhat since I wrote it, based on some experimentation.

The way it currently works is that it goes through a buffer (probably 512 bytes) and says the file is binary if the ratio of probable text to non-text characters is less than 3. But a single NUL in the input is enough to make the file be classified as binary. It seems to me that every true binary file will contain plenty of NULs, so it seems overkill to look at anything else. But assuming I'm wrong, what I'm proposing is to count the upper-Latin1 range printables as text, and to correspondingly cut down the allowed proportion of non-text.

The current UTF-8 handling is haphazard, and I think suboptimal. The patch changes things to first see if the entire block is ASCII. If not, it then looks to see if the block is one long UTF-8 string. The odds of something that passes that test for 512 bytes not being UTF-8 are vanishingly small.

It seems to me that if the entire block is ASCII, no further work need be done: it has got to be a text file. I can't imagine a binary file not having the upper bit of some byte set within the first 512 bytes. But the patch currently does the byte-by-byte classification even so.

I also think that Vertical Tab and Form Feed are so infrequent that they should be counted as non-text (currently VT is non-text, but FF is text).

Also, a byte with all 8 bits on is very common in binary, but extremely uncommon in text. It represents a y with diaeresis in Latin1, which I believe occurs in modern French in only a couple of place names, and is not used in the other languages that Latin1 is designed for. So I would change the patch to classify \xFF as a control.
ESC currently is considered text, and my patch retains that, as it is relatively commonly used in rich text files. It seems to me that lowering the threshold so that more than about 15-20% non-text characters causes the file to be classified as binary, while expanding the set of text characters by the 95 upper Latin1 printable characters (that is, all of them except \xFF), will give good results, better than the existing behavior.
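
To make the proposal concrete, here is a rough sketch of the heuristic in Python. It is not the patch (which is C in the Perl core) and is only meant to illustrate the order of the checks. The names looks_like_text, is_text_byte, BLOCK_SIZE and MAX_NON_TEXT are made up for illustration. The 0.15 cutoff is just one value from the "about 15-20%" range, the set of C0 controls counted as text other than ESC is an assumption, and so is the treatment of an empty block and of a multi-byte character cut off at the end of the block.

```python
import codecs

# Hypothetical sketch of the proposed -T/-B heuristic; not the actual patch.
BLOCK_SIZE = 512       # the post says the buffer is probably 512 bytes
MAX_NON_TEXT = 0.15    # assumed cutoff; the post suggests roughly 15-20%

# C0 controls still counted as text: BS, TAB, LF, CR, ESC (assumed set).
# VT and FF are deliberately left out, per the proposal.
TEXT_CONTROLS = frozenset(b"\b\t\n\r\x1b")


def is_text_byte(byte: int) -> bool:
    """Classify a single byte as probable text under the proposed rules."""
    if byte in TEXT_CONTROLS:
        return True
    if 0x20 <= byte <= 0x7E:    # ASCII printables
        return True
    if 0xA0 <= byte <= 0xFE:    # upper Latin1 printables, excluding \xFF
        return True
    return False                # other C0 controls, DEL, C1 controls, \xFF


def is_utf8_block(block: bytes) -> bool:
    """True if the block is valid UTF-8, tolerating a character that is
    cut off at the very end of the block (an assumption of this sketch)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(block, False)   # final=False: allow a truncated tail
    except UnicodeDecodeError:
        return False
    return True


def looks_like_text(block: bytes) -> bool:
    """Apply the checks in the order described in the post."""
    block = block[:BLOCK_SIZE]
    if not block:
        return True                    # assumption: empty input counts as text
    if b"\x00" in block:
        return False                   # a single NUL means binary
    if not block.isascii() and is_utf8_block(block):
        return True                    # one long UTF-8 string
    # ASCII blocks and non-UTF-8 blocks both get byte-by-byte classification.
    non_text = sum(1 for b in block if not is_text_byte(b))
    return non_text / len(block) <= MAX_NON_TEXT
```

With these rules, a Latin1 block such as b"na\xefve caf\xe9\n" fails the UTF-8 test but is still classified as text, because every byte in it is a text byte, while b"\xff\xfe\x01\x02abcd" is classified as binary (3 of its 8 bytes are non-text). Skipping the byte-by-byte pass for all-ASCII blocks, as suggested above, would amount to returning true right after the NUL check whenever block.isascii() holds.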