develooper Front page | perl.perl5.porters | Postings from November 2008

PATCH: bug in Text::Tabs etc

Thread Next
Tom Christiansen
November 24, 2008 12:23
PATCH: bug in Text::Tabs etc
Message ID:
I tried to reply to 

    Date: Mon, 24 Nov 2008 19:30:51 +0100
    From: demerphq <>
    To: "karl williamson" <>
    Subject: Re: Matching multi-character folds
    Cc: "Perl5 Porters" <>,
	    "Juerd Waalboer" <>,
	    "Rafael Garcia-Suarez" <>
    In-Reply-To: <>
    MIME-Version: 1.0
    Content-Type: text/plain; charset=UTF-8
    Content-Transfer-Encoding: base64
    Content-Disposition: inline
    References: <>

But I found that I had a lot of trouble doing so, and just couldn't get
it to work right.  No, it wasn't the nasty inline base64.  It was, in 
fact, in a sense, the matter of folded characters in regexes.  In particular,
I couldn't get Text::Autoformat to behave correctly, and when I tried to
fix that, I found that Text::Tabs had its own troubles.

Why?  Because they're folding, and there's folding.

Consider the following code and its two lines of DATA:

use Text::Tabs;
binmode(DATA, ":utf8") || die;
while (<DATA>)  {
    print expand($_);
	This résumé is:	padded well.
	This résumé is:	not padded quite so well.

In case you can't tell the difference, those two lines of DATA read:

\tThis r\x{e9}sum\x{e9} is:\tpadded well.
\tThis re\x{301}sume\x{301} is:\tnot padded quite so well.

As you see, one is the NFC version and the other the NFD version
of one's CV.   That is, one has combined characters in it.  This
throws everything off in code that doesn't know about such.

However, the code in Text::Tabs makes the blind assumption that 
visible characters are one per, and that's not correct.  It 
produces erroneous output:

        This résumé is:   padded well.
        This résumé is: not padded quite so well.

whereas with my tiny patch below it now reads:

        This résumé is:   padded well.
        This résumé is:   not padded quite so well.

which I *think* to be correct.

My tiny patch follows; it has *not* been exhaustively tested,
and needs to be!!!

I first noticed this because Text::Autoformat was counting characters
wrong.  But reading the code, I saw that before I could fix that, I
realized I had to fix Text::Tabs.

I believe this to be necessary but insufficient to fix those modules
that rely upon Text::Tabs.  For example, Text::Wrap uses it, but it
also has several bugs in it whereby it uses [^\n] to mean any character,
but it needs to be using (?:(?:![^\n])\X) or some such.

Similary, dear Damian's much beloved Text::Autoformat makes similar
errors, such as these:

	my $pagewidth = 
		2*length($prespace) + length($line->{text});
	push @length, [length $prespace,$pagewidth];

Alas, that will not tell you the length you are looking for, at least in my
2 example DATA lines, it won't.  He also makes assumptions like

    my $quotechunk = qq{(?:$quotechar(?![a-z])|(?:[a-z]\\w*)?>+)};
    my $quoter = qq{(?:(?i)(?:$quotechunk(?:[ \\t]*$quotechunk)*))};

which won't work right on the CV lines above either, at least, not
stripped of "This ".

I'm submitting this patch to Text::Wrap to try to get people thinking about
this pervasive problem.  It is as troubling, AND WORSE, as when I had to
fix all the File::Find and File::Basename and such modules that were using
/./ but needed /./s or /[^\n/, and similarly were using /$/ when they
needed /\z/.  There are probably now too many broken modules out there
making these subtly wrong assumptions for me to fix them all by myself this
time around.

Question: Might this even suggest we may need a different version of 
length() and pos()?  I don't know.


--- /home/tchrist/bleadperl/lib/Text/    Fri Sep 19 15:04:13 2008
+++ /home/tchrist/lingua/Text/   Mon Nov 24 12:55:30 2008
@@ -24,7 +24,7 @@
                 for (split(/^/m, $_, -1)) {
                         my $offs = 0;
-                                $pad = $tabstop - (pos() + $offs) % $tabstop;
+                                $pad = $tabstop - (cpos() + $offs) % $tabstop;
                                 $offs += $pad - 1;
                                 " " x $pad;
@@ -36,6 +36,10 @@
         return $l[0];
+sub cpos {
+    return scalar ( () = (substr($_, 0, pos()) =~ /\X/g) );
 sub unexpand
         my (@l) = @_;
@@ -49,7 +53,7 @@
                 @lines = split("\n", $x, -1);
                 for $line (@lines) {
                         $line = expand($line);
-                        @e = split(/(.{$tabstop})/,$line,-1);
+                        @e = split(/(\X{$tabstop})/,$line,-1);
                         $lastbit = pop(@e);
                         $lastbit = '' 
                                 unless defined $lastbit;

Thread Next Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at | Group listing | About