Efficient way to walk directory tree containing link cycles

Question

Is there a more efficient way to walk a directory tree that contains link cycles than tracking which files have already been visited?

For example consider walking a directory containing these files:

symlink "parent" -> ".."
symlink "uh_oh" -> "/"
regular file "reg"
symlink "reg2" -> "reg"

It should be possibly read a directory entry type (d_type) when using readdir(), in this way the program can determine if its a symlink or directory or file and ignore/continue looking further. Refer to http://linux.die.net/man/3/readdir, again this depends on implementation of the underlying FS. — askb, Oct 19 '14 at 17:00

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

You should also track which directories have been visited, as per your first example, but otherwise there is no better solution than maintaining visited flags for every file.

Maintaining the flags would be easier if there were a portable way of getting a short unique identifier for a mounted filesystem. Even then, you need to think through the consequences of mount and umount operations occurring during the scan, particularly since such a scan might take quite a long time if the filesystem tree includes remote filesystems.

In theory, you can get a "filesystem id" from the stafvfs interface, but in practice that is not totally portable. Quoting man statfs from a Linux distro:

Nobody knows what f_fsid is supposed to contain…

…The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several OSes restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.

This latter restriction -- that f_fsid is presented as 0 to non-privileged users -- does not violate the Posix standard cited above, because that standard includes a very general disclaimer: "It is unspecified whether all members of the statvfs structure have meaningful values on all file systems."

Would just tracking directories work if symlinks can point to already visited regular files? — Matt Joiner, Oct 20 '14 at 01:30
@MattJoiner: Depends on whether you care about visiting files twice :) — rici, Oct 20 '14 at 01:33

score 0 · Answer 2 · answered Oct 20 '14 at 07:55

The tree walk algorithm guarantees that you'll visit every file under a directory, so instead of tracking individual files you can maintain a list of search "roots":

Add the initial directory to the list of roots
Walk the directory tree for each search root
For every symlink you find, check if it's already contained by a search root. If it's not, add it as a new search root.

This way you'll visit every file and directory, will never get stuck in a loop, but may visit files and directories more than once. That can happen only when you find a symlink to an ancestor of an existing root. To avoid doing that, you can check if a directory is a search root before entering it.

Efficient way to walk directory tree containing link cycles

2 Answers2