develooper Front page | perl.perl5.porters | Postings from September 2006

[perl #40417] File::Find has issues with symlinks

Thread Next
From:
ammon @ rhythm . com
Date:
September 27, 2006 01:14
Subject:
[perl #40417] File::Find has issues with symlinks
Message ID:
rt-3.5.HEAD-31257-1159328317-1472.40417-75-0@perl.org
# New Ticket Created by  ammon@rhythm.com 
# Please include the string:  [perl #40417]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=40417 >



This is a bug report for perl from ammon@rhythm.com,
generated with the help of perlbug 1.35 running under perl v5.8.6.


-----------------------------------------------------------------
[Please enter your report here]

There are several issues with File::Find, and how it deals with 
symlinks. At least one is a bug (IMHO), while others may be bugs,
but may also be considered unexpected non-DWIMish features. The
installed version of File::Find is 1.07, but none of these things
has been fixed in later versions of the module. Following is a patch
(against version 1.10 from CPAN) which fixes the following issues:

*   It reports non-existing files, when using follow or follow_fast.

    lid3213:> perl -MFile::Find=find -e'find({wanted => sub { print "$_\n" } }, "/nosuchfile")'
    lid3213:> perl -MFile::Find=find -e'find({wanted => sub { print "$_\n" }, follow => 1 }, "/nosuchfile")'
    nosuchfile
    lid3213:> perl -MFile::Find=find -e'find({wanted => sub { print "$_\n" }, follow_fast => 1 }, "/nosuchfile")'
    nosuchfile

As you can see above, this behaviour is only exhibited when following
symlinks, so I'm sure this is a bug.



*   $File::Find::fullname does not have '..' reduced correctly when 
    links are encountered with '..' at the end. Admittedly, this is
    a bit of an edge-case, but it bit me while trying to figure out
    the some File::Find behaviour that I didn't understand.
   

    #!/usr/local/bin/perl5

    use warnings;
    use strict;

    use File::Find;

    my $root = "/tmp/perlbug1";
    -d $root or mkdir $root or die $!;
    -d "$root/dir" or mkdir "$root/dir" or die $!;

    symlink "..", "$root/dir/link_to_parent" or die $!;
    symlink "$root/dir/../", "$root/dir/link2_to_parent" or die $!;

    File::Find::find({ wanted      => sub { print "$File::Find::fullname\n"},
                       follow_fast => 1,
                       follow_skip => 2,
                       no_chdir    => 1 }, $root);
    __END__


    lid3213:> perl bug.pl
    /tmp/perlbug1
    /tmp/perlbug1/dir
    /tmp/perlbug1/dir/..
    /tmp/perlbug1/dir/../dir
    /tmp/perlbug1/dir/..
    /tmp/perlbug1/dir/../dir

Since the directory path is correctly reduced for other cases of
'..' in the path, I'm sure this is a bug, as well.



*   $File::Find::fullname is undef in some cases, particularly with 
    dangling links. This may be intended, but I think it's an
    accidental consequence of overloading function return values to
    indicate both a status and resolved file path. $File::Find::fullname
    should always have a value, regardless of whether the link can
    resolve to an existing file.

    (This doesn't exist in version 1.07, due to a bug which prevents
    this bug from appearing.)


    #!/usr/local/bin/perl5

    use warnings;
    use strict;

    use File::Find;

    my $root = "/tmp/perlbug2;
    -d $root or mkdir $root or die $!;

    symlink "dangling", "$root/dangler" or die $!;

    File::Find::find(
        { wanted            => sub { print "$File::Find::fullname\n"},
          dangling_symlinks => sub { print "$File::Find::fullname\n"},
          follow_fast       => 1,
          follow_skip       => 2,
          no_chdir          => 1 }, $root);

    __END__

    lid3213:> perl bug.pl
    File::Find::VERSION => 1.11
    /tmp/perlbug2
    /tmp/perlbug2
    Use of uninitialized value in concatenation (.) or string at bug.pl line 14.



*   Doubled directory separators ("//") are not properly reduced.

    #!/usr/local/bin/perl5

    use warnings;
    use strict;

    use File::Find;

    my $root = "/tmp/perlbug3";
    -d $root or mkdir $root or die $!;

    -d "$root/dir" or mkdir "$root/dir" or die $!;
    open F, ">$root/dir/file" and close F or die $!;
    symlink "dir/", "$root/link_to_dir" or die $!;

    File::Find::find(
        { wanted      => sub { print "$File::Find::fullname\n"},
          follow_fast => 1,
          follow_skip => 2,
          no_chdir    => 1 }, $root);

    __END__

    lid3213:> perl bug.pl
    /tmp/perlbug3
    /tmp/perlbug3/dir
    /tmp/perlbug3/dir/file
    /tmp/perlbug3/dir/
    /tmp/perlbug3/dir//file

Looking at the code, this appears not to be a problem for the MacOS.
For some reason, there are divergent algorithms between MacOS, and 
non-MacOS, despite the fact that the only (significant) difference
is the particular directory separator.



*   Using the 'follow' option only reports the first link, or file
    that leads to a particular target. This may be by design, but it's
    not intuitive behaviour. What I expected to see is a complete list
    of the contents, as I would if 'follow' was turned off, but the
    resolved target in $File::Find::fullname, as advertised:


    #!/usr/local/bin/perl5

    use warnings;
    use strict;

    use File::Find;

    my $root = "/tmp/perlbug4";
    -d $root or mkdir $root or die $!;

    -d "$root/dir" or mkdir "$root/dir" or die $!;

    open F, ">$root/a" and close F or die $!;
    open F, ">$root/z" and close F or die $!;
    open F, ">$root/dir/file" and close F or die $!;

    symlink "$root/a", "$root/link_to_a" or die $!;
    symlink "$root/z", "$root/link_to_z" or die $!;
    symlink "$root/dir", "$root/link_to_dir" or die $!;

    my $pat = "   %-30s => %-30s\n";

    print "Found:\n";

    File::Find::find({
        wanted      => sub { printf $pat, $_, $File::Find::fullname },
        follow      => 1,
        follow_skip => 2,
        no_chdir    => 1 }, $root);

    print "Expected:\n";

    File::Find::find({
        wanted      => sub { my $n = $_; printf $pat, $n, $_ },
        no_chdir    => 1 }, $root);

    __END__

    lid3213:> perl bug.pl
    Found:
       /tmp/perlbug4                  => /tmp/perlbug4
       /tmp/perlbug4/z                => /tmp/perlbug4/z
       /tmp/perlbug4/a                => /tmp/perlbug4/a
       /tmp/perlbug4/link_to_dir      => /tmp/perlbug4/dir
       /tmp/perlbug4/link_to_dir/file => /tmp/perlbug4/dir/file
    Expected:
       /tmp/perlbug4                  => /tmp/perlbug4
       /tmp/perlbug4/z                => /tmp/perlbug4/z
       /tmp/perlbug4/link_to_z        => /tmp/perlbug4/link_to_z
       /tmp/perlbug4/a                => /tmp/perlbug4/a
       /tmp/perlbug4/link_to_a        => /tmp/perlbug4/link_to_a
       /tmp/perlbug4/link_to_dir      => /tmp/perlbug4/link_to_dir
       /tmp/perlbug4/dir              => /tmp/perlbug4/dir
       /tmp/perlbug4/dir/file         => /tmp/perlbug4/dir/file

There are a couple reasons why this is a problem. First, if the
directory to search ($rootLink) is a link, using either 'follow' or
'follow_fast' is required (otherwise, the only thing found is the
initial directory, a link). Unfortunately, the filesystem I'm using
File::Find to crawl is a custom NFS filesystem where the initial several
directory levels are symlinks generated by the contents of a database.

Second, using 'follow' or 'follow_fast' is the only way to a) ensure
that an lstat has been called before the wanted() function, and b)
to get the resolved target for links ($File::Find::fullname is undef
otherwise) without resorting to writing your own routines.

According to the documentation, when using 'follow', File::Find builds
a hash of the files to avoid reporting duplicated files. However, what
it's really doing is building a hash of resolved target files. The 
reason I expect what I do is because 'a' and 'link_to_a' are not the
same file, regardless that 'link_to_a' eventually resolves to 'a'.
Likewise, 'dir' and 'link_to_dir' or different directories, and both
should be reported. However, I do expect that 'follow' (as opposed to
'follow_fast') is smart enough to recognize that 'link_to_dir' resolves
to 'dir', and that it doesn't need to descend into 'link_to_dir' once
'dir' has already been trawled.

There is only one way to achieve the expected output:

    my %seenFiles = ();
    sub wanted {
        my ($dev, $inode) = lstat $_;

        return if $seenFiles{$dev, $inode}++;

        printf $pat, $_, $File::Find::fullname;
    }

    File::Find::find({
        wanted      => \&wanted,
        follow_fast => 1,
        follow_skip => 2,
        no_chdir    => 1 }, $rootLink);

The problem with this solution is two-fold. First, you have to duplicate
the hash being built up internally in File::Find (as mentioned in the
docs for 'follow'). Second, you must re-stat the original file/link,
since the last stat structure done by File::Find is actually the stat
structure for the target file, not the original file. Consequently, not
only does this solution more than double the amount of memory, it also
doubles the number of stat() calls, amounting to a horribly inefficient
work-around.

Obtaining the observed behaviour from a File::Find module that works in
the manner I expect is less onerous. An extra hash lookup still needs to
be made, but not the extra lstat(). 

Although the following patch changes File::Find to exhibit the expected
behaviour, it may not be the best idea, due to the wealth of existing
code depending on the current 'follow' behaviour (it doesn't change the
behaviour of 'follow_fast'). In that instance, the expected behaviour 
should be made available through another option (say, 'follow_dirs').


--- /tmp/Find.pm	2006-09-26 20:26:43.000000000 -0700
+++ Find.pm	2006-09-26 20:30:10.778002000 -0700
@@ -3,7 +3,7 @@
 use strict;
 use warnings;
 use warnings::register;
-our $VERSION = '1.10';
+our $VERSION = '1.12';
 require Exporter;
 require Cwd;
 
@@ -425,6 +425,7 @@
 # Should ideally be my() not our() but local() currently
 # refuses to operate on lexicals
 
+our %FoundFileWith;
 our %SLnkSeen;
 our ($wanted_callback, $avoid_nlink, $bydepth, $no_chdir, $follow,
     $follow_skip, $full_check, $untaint, $untaint_skip, $untaint_pat,
@@ -439,8 +440,9 @@
 
     $fn =~ s|^\./||;
 
-    my $abs_name= $cdir . $fn;
+    $fn .= '/' if $fn eq '..' or substr($fn, -3, 3) eq '/..';
 
+    my $abs_name= $cdir . $fn;
     if (substr($fn,0,3) eq '../') {
        1 while $abs_name =~ s!/[^/]*/\.\./!/!;
     }
@@ -506,7 +508,7 @@
 
 	# (simple) check for recursion
 	if ( ( $Base =~ /^$AbsName/) && (-d $AbsName) ) { # recursion
-	    return undef;
+	    return ( $AbsName, 1 );
 	}
     }
     else {
@@ -518,59 +520,77 @@
 	}
 
 	# (simple) check for recursion
-	my $newlen= length($AbsName);
+	my $newlen = length($AbsName);
 	if ($newlen <= length($Base)) {
 	    if (($newlen == length($Base) || substr($Base,$newlen,1) eq '/')
 		&& $AbsName eq substr($Base,0,$newlen))
 	    {
-		return undef;
+		return ($AbsName, 1);
 	    }
 	}
     }
-    return $AbsName;
+    return ( $AbsName, 0 );
 }
 
 sub Follow_SymLink($) {
     my ($AbsName) = @_;
 
-    my ($NewName,$DEV, $INO);
-    ($DEV, $INO)= lstat $AbsName;
+    my ($NewName, $recursive);
+    my ($DEV, $INO) = lstat $AbsName;
+
+    # Make sure the file exists before resolving links.
+    return (1, $AbsName) unless (defined $DEV);
+
+    if ($full_check) {
+        if ($FoundFileWith{$DEV, $INO}++) {
+            # This thing is being reported a second time.
+            return (1, $AbsName);
+        }
+
+        # Since we're keeping a full list in %FoundFileWith, this can just 
+        # keep track of things found while resolving the current link.
+        %SLnkSeen = ();
+    }
 
     while (-l _) {
 	if ($SLnkSeen{$DEV, $INO}++) {
-	    if ($follow_skip < 2) {
+            if ($follow_skip < 2) {
 		die "$AbsName is encountered a second time";
 	    }
-	    else {
-		return undef;
-	    }
+            else {
+                return (1, undef);
+            }
 	}
-	$NewName= PathCombine($AbsName, readlink($AbsName));
-	unless(defined $NewName) {
-	    if ($follow_skip < 2) {
+
+	($NewName, $recursive) = PathCombine($AbsName, readlink($AbsName));
+        if ($recursive) {
+            if ($follow_skip < 2) {
 		die "$AbsName is a recursive symbolic link";
+            }
+            else {
+		return (1, $NewName);
 	    }
-	    else {
-		return undef;
-	    }
-	}
-	else {
+        }
+        else {
 	    $AbsName= $NewName;
-	}
+        }
 	($DEV, $INO) = lstat($AbsName);
-	return undef unless defined $DEV;  #  dangling symbolic link
+        return (1, $AbsName, 1) unless (defined $DEV); # Dangling symlink
+
+        if ($full_check && exists $FoundFileWith{$DEV, $INO}) {
+            return (1, $AbsName);
+        }
     }
 
     if ($full_check && defined $DEV && $SLnkSeen{$DEV, $INO}++) {
 	if ( ($follow_skip < 1) || ((-d _) && ($follow_skip < 2)) ) {
 	    die "$AbsName encountered a second time";
 	}
-	else {
-	    return undef;
-	}
+        else {
+            return (1, $AbsName);
+        }
     }
-
-    return $AbsName;
+    return (0, $AbsName);
 }
 
 our($dir, $name, $fullname, $prune);
@@ -594,6 +614,7 @@
     # This function must local()ize everything because callbacks may
     # call find() or finddepth()
 
+    local %FoundFileWith;
     local %SLnkSeen;
     local ($wanted_callback, $avoid_nlink, $bydepth, $no_chdir, $follow,
 	$follow_skip, $full_check, $untaint, $untaint_skip, $untaint_pat,
@@ -672,9 +693,10 @@
 		    $abs_dir = contract_name("$cwd/",$top_item);
 		}
 	    }
-	    $abs_dir= Follow_SymLink($abs_dir);
-	    unless (defined $abs_dir) {
-		if ($dangling_symlinks) {
+
+	    my ($skip_entry, $resolved, $dangling) = Follow_SymLink($abs_dir);
+	    if ($skip_entry) {
+		if ($dangling && $dangling_symlinks) {
 		    if (ref $dangling_symlinks eq 'CODE') {
 			$dangling_symlinks->($top_item, $cwd);
 		    } else {
@@ -685,6 +707,7 @@
 	    }
 
 	    if (-d _) {
+                $abs_dir = $resolved;
 		_find_dir_symlnk($wanted, $abs_dir, $top_item);
 		$Is_Dir= 1;
 	    }
@@ -701,7 +724,7 @@
 		$Is_Dir= 1;
 	    }
 	    else {
-		$abs_dir= $top_item;
+		$abs_dir = $top_item;
 	    }
 	}
 
@@ -1019,8 +1042,8 @@
 	$dir_pref = ($p_dir =~ /:$/) ? "$p_dir" : "$p_dir:";
 	$loc_pref = ($dir_loc =~ /:$/) ? "$dir_loc" : "$dir_loc:";
     } else {
-	$dir_pref = ( $p_dir   eq '/' ? '/' : "$p_dir/" );
-	$loc_pref = ( $dir_loc eq '/' ? '/' : "$dir_loc/" );
+	$dir_pref = substr($p_dir,   -1, 1) eq '/' ? $p_dir   : "$p_dir/";
+	$loc_pref = substr($dir_loc, -1, 1) eq '/' ? $dir_loc : "$dir_loc/";
     }
 
     local ($dir, $name, $fullname, $prune, *DIR);
@@ -1114,11 +1137,11 @@
 	    next if $FN =~ $File::Find::skip_pattern;
 
 	    # follow symbolic links / do an lstat
-	    $new_loc = Follow_SymLink($loc_pref.$FN);
-
+	    my ($skip_entry, $resolved, $dangling) = Follow_SymLink($loc_pref.$FN);
+          
 	    # ignore if invalid symlink
-	    unless (defined $new_loc) {
-	        if ($dangling_symlinks) {
+	    if ($skip_entry) {
+                if ($dangling && $dangling_symlinks) {
 	            if (ref $dangling_symlinks eq 'CODE') {
 	                $dangling_symlinks->($FN, $dir_pref);
 	            } else {
@@ -1126,7 +1149,7 @@
 	            }
 	        }
 
-	        $fullname = undef;
+	        $fullname = $resolved;
 	        $name = $dir_pref . $FN;
 	        $_ = ($no_chdir ? $name : $FN);
 	        { $wanted_callback->() };
@@ -1134,10 +1157,11 @@
 	    }
 
 	    if (-d _) {
+                $new_loc = $resolved;
 		push @Stack,[$new_loc,$updir_loc,$dir_name,$FN,1];
 	    }
 	    else {
-		$fullname = $new_loc; # $File::Find::fullname
+		$fullname = $resolved; # $File::Find::fullname
 		$name = $dir_pref . $FN; # $File::Find::name
 		$_ = ($no_chdir ? $name : $FN); # $_
 		{ $wanted_callback->() }; # protect against wild "next"
@@ -1156,9 +1180,10 @@
 		$loc_pref = ($dir_loc =~ /:$/) ? $dir_loc : "$dir_loc:";
 	    }
 	    else {
-		$dir_name = ($p_dir eq '/' ? "/$dir_rel" : "$p_dir/$dir_rel");
+                # $p_dir does *not* always have a trailing '/'.
+		$dir_name = substr($p_dir,    -1, 1) eq '/' ? "$p_dir$dir_rel" : "$p_dir/$dir_rel";
 		$dir_pref = "$dir_name/";
-		$loc_pref = "$dir_loc/";
+		$loc_pref = substr($dir_loc, -1, 1) eq '/' ? $dir_loc : "$dir_loc/";
 	    }
 	    if ( $byd_flag < 0 ) {  # must be finddepth, report dirname now
 		unless ($no_chdir || ($dir_rel eq $File::Find::current_dir)) {
[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=library
    severity=medium
---
This perlbug was built using Perl v5.8.6 - Sat Mar 19 17:33:54 UTC 2005
It is being executed now by  Perl v5.8.6 - Sat Mar 19 17:29:44 UTC 2005.

Site configuration information for perl v5.8.6:

Configured by abuild at Sat Mar 19 17:29:44 UTC 2005.

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=linux, osvers=2.6.9, archname=i586-linux-thread-multi
    uname='linux salieri 2.6.9 #1 smp fri jan 14 15:41:33 utc 2005 i686 athlon i386 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -Doptimize=-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -g -Wall -pipe'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -march=i586 -mcpu=i686 -fmessage-length=0 -Wall -g -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe'
    ccversion='', gccversion='3.3.5 20050117 (prerelease) (SUSE Linux)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =''
    libpth=/lib /usr/lib /usr/local/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.6/i586-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared'

Locally applied patches:
    

---
@INC for perl v5.8.6:
    /usr/local/prod/perl
    /usr/lsd/lib/perl/5.8.6/i586-linux-thread-multi
    /usr/lsd/lib/perl/5.8.6
    /usr/lsd/lib/perl/i586-linux-thread-multi
    /usr/lsd/lib/perl
    /usr/lib/perl5/5.8.6/i586-linux-thread-multi
    /usr/lib/perl5/5.8.6
    /usr/lib/perl5/site_perl/5.8.6/i586-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.6/i586-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl
    .

---
Environment for perl v5.8.6:
    HOME=/home/ammon
    LANG=en_US.ISO-8859-1
    LANGUAGE (unset)
    LC_COLLATE=POSIX
    LD_LIBRARY_PATH=/toast/misc/jets/ammon/lib/
    LOGDIR (unset)
    PATH=/home/ammon/bin:/toast/misc/jets/ammon/bin:/toast/misc/jets/bin:/muse/bin:/usr/apps/bin:/usr/local/prod/bin:/usr/site/bin:/usr/local/bin:/usr/site/spec/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11:/usr/lsd/bin:/opt/gnome/bin:/opt/kde3/bin:/usr/games
    PERL5LIB=/usr/local/prod/perl:/usr/lsd/lib/perl
    PERL_BADLANG (unset)
    SHELL=/bin/tcsh


Thread Next


nntp.perl.org: Perl Programming lists via nntp and http.
Comments to Ask Bjørn Hansen at ask@perl.org | Group listing | About