On 05/10/2014 03:49 PM, bulk88 wrote:
> Karl Williamson wrote:
>> It seems to me that lowering the ratio, so that more than about
>> 15-20% non-text characters causes the file to be classified as
>> binary, while expanding the text characters by the 95 upper-Latin-1
>> printable characters (except for \xFF), will give good results,
>> better than the existing behavior.
>
> Why do we use percent cutoffs in the first place? Either a byte is
> printable/glyphable or it's not. perlfunc does document the
> percentage behavior, so I would guess any change to the algorithm
> would break backcompat for the few people willing to use such an
> unreliable algo. I would suggest leaving it alone as a
> backcompat/legacy/obsolete feature, or deprecating and removing
> -T/-B and telling people to use CPAN/something smarter for their
> specific purpose.
>
> Being purely printable doesn't mean a string or data is risk-free,
> but a fixed set of rules is better than a percentage "guess".
>
> http://www.blackhatlibrary.net/Shellcode/Null-free
> http://www.blackhatlibrary.net/Ascii_shellcode

From what he has said privately, I think RJBS pretty much agrees with
this, and that any breakage would likely show up in the field rather
than in smoking CPAN, since it would depend on real-world data that is
not terribly likely to appear in test files. This is a pity, as I have
20-year-old code that would benefit from the change.

I do intend to fix the current broken UTF-8 handling and to add some
documentation about it.

The 'file' command, back when I was familiar with it, worked by
examining a particular location in the text segment of a file to see
what its 'magic number' was; a registry of these numbers was kept
somewhere, and the number would indicate the type of file (modern-day
PDF being an example). Anything that didn't look like a magic number
would start a guessing process, such as looking for troff commands.
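
In case it helps to make that approach concrete, here is a minimal,
hypothetical Perl sketch of magic-number sniffing in the spirit of
file(1). The signature table and the guess_type() helper are my own
illustration, not anything taken from file(1) or from core:

    use strict;
    use warnings;

    # Tiny illustrative signature table; file(1)'s real database is far larger.
    my @magic = (
        [ "%PDF-"                    => 'PDF document'   ],
        [ "\x89PNG\x0d\x0a\x1a\x0a"  => 'PNG image'      ],
        [ "GIF8"                     => 'GIF image'      ],
        [ "PK\x03\x04"               => 'ZIP archive'    ],
        [ "\x7fELF"                  => 'ELF executable' ],
    );

    sub guess_type {
        my ($path) = @_;
        open my $fh, '<:raw', $path or return 'unreadable';
        read $fh, my $head, 16;       # these signatures live at the very start
        close $fh;
        $head //= '';
        for my $entry (@magic) {
            my ($sig, $name) = @$entry;
            return $name if substr($head, 0, length $sig) eq $sig;
        }
        return 'no magic match; a real tool would fall back to guessing';
    }

    print guess_type($ARGV[0]), "\n";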
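
And for comparison with the percent-cutoff discussion at the top of the
thread, a rough sketch of what such a ratio heuristic looks like in
plain Perl. This is NOT the pp_fttext code behind -T/-B; the 512-byte
block, the whitespace set, the \xA0-\xFE "upper-Latin-1 printable"
range, and the 20% cutoff are assumptions chosen to mirror the
proposal, purely for illustration:

    use strict;
    use warnings;

    sub looks_like_text {
        my ($path, $cutoff) = @_;
        $cutoff //= 0.20;                      # fraction of "odd" bytes tolerated

        open my $fh, '<:raw', $path or return;
        my $len = read $fh, my $block, 512;    # examine only the first block
        close $fh;
        return 1 unless $len;                  # empty file: call it text

        return 0 if index($block, "\0") >= 0;  # any NUL byte: call it binary

        # Bytes outside tab/LF/CR/FF, printable ASCII, and \xA0-\xFE (the 95
        # upper-Latin-1 printables minus \xFF) count as "odd".
        my $odd = () = $block =~ /[^\t\n\r\f\x20-\x7E\xA0-\xFE]/g;
        return ($odd / length $block) <= $cutoff;
    }

    printf "%s: %s\n", $ARGV[0],
        looks_like_text($ARGV[0]) ? 'text-ish' : 'binary-ish';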