Warning: this git clean
behavior will slightly change with Git 2.14 (Q3 2017)
"git clean -d"
used to clean directories that has ignored files, even though the command should not lose ignored ones without "-x
".
"git status --ignored
" did not list ignored and untracked files without "-uall
".
See commit 6b1db43 (23 May 2017), and commit bbf504a, commit fb89888, commit df5bcdf, commit 0a81d4a, commit b3487cc (18 May 2017) by Samuel Lijin (sxlijin
).
(Merged by Junio C Hamano -- gitster
-- in commit f4fd99b, 02 Jun 2017)
clean
: teach clean -d
to preserve ignored paths
There is an implicit assumption that a directory containing only untracked and ignored paths should itself be considered untracked. This makes sense in use cases where we're asking if a directory should be added to the git database, but not when we're asking if a directory can be safely removed from the working tree; as a result, clean -d
would assume that an "untracked" directory containing ignored paths could be deleted, even though doing so would also remove the ignored paths.
To get around this, we teach clean -d
to collect ignored paths and skip an untracked directory if it contained an ignored path, instead just removing the untracked contents thereof.
To achieve this, cmd_clean()
has to collect all untracked contents of untracked directories, in addition to all ignored paths, to determine which untracked dirs must be skipped (because they contain ignored paths) and which ones should not be skipped.
But... Since 2017, that change means git status --ignored
hangs indefinitely!
As reported by Martin Melka in this thread, and analyzed by SZEDER Gábor:
with a depth of 120 directories it would take over 6*10^23 years to complete
This slowdown was caused by commit df5bcdf, which was part of a
patch series to fix 'git clean -d
' deleting untracked directories even
if they contained ignored files.
So... a fix is in progress, to be released later in 2020.
Git 2.24 (Q4 2019) illustrates this git clean
behavior change introduces a regression.
See commit 502c386 (25 Aug 2019) by SZEDER Gábor (szeder
).
(Merged by Junio C Hamano -- gitster
-- in commit 026428c, 30 Sep 2019)
t7300-clean
: demonstrate deleting nested repo with an ignored file breakage
'git clean -fd
' must not delete an untracked directory if it belongs
to a different Git repository or worktree.
Unfortunately, if a '.gitignore
' rule in the outer repository happens to match a file in a nested repository or worktree, then something goes awry and 'git clean -fd
' does delete the content of the nested repository's worktree
except that ignored file, potentially leading to data loss.
Add a test to 't7300-clean.sh
' to demonstrate this breakage.
This issue is a regression introduced in 6b1db43 (clean
: teach
clean -d
to preserve ignored paths, 2017-05-23, Git v2.13.2).
Git 2.24 further clarifies git clean -d
:
See commit 69f272b (01 Oct 2019), and commit 902b90c, commit ca8b539, commit 09487f2, commit e86bbcf, commit 3aca580, commit 29b577b, commit 89a1f4a, commit a3d89d8, commit 404ebce, commit a5e916c, commit bbbb6b0, commit 7541cc5 (17 Sep 2019) by Elijah Newren (newren
).
(Merged by Junio C Hamano -- gitster
-- in commit aafb754, 11 Oct 2019)
t7300
: add testcases showing failure to clean specified pathspecs
Someone brought me a testcase where multiple git-clean
invocations were
required to clean out unwanted files:
mkdir d{1,2}
touch d{1,2}/ut
touch d1/t && git add d1/t
With this setup, the user would need to run
git clean -ffd */ut
twice to delete both ut
files.
A little testing showed some interesting variants:
- If only one of those two ut files existed (either one), then only one clean command would be necessary.
- If both directories had tracked files, then only one git clean would be necessary to clean both files.
- If both directories had no tracked files then the clean command above would never clean either of the untracked files despite the pathspec explicitly calling both of them out.
A bisect showed that the failure to clean out the files started with commit cf424f5 ("clean
: respect pathspecs with "-d
", 2014-03-10, Git v1.9.1).
However, that pointed to a separate issue: while the "-d
" flag was used by the original user who showed me this problem, that flag should have been irrelevant to this problem.
Testing again without the "-d
" flag showed that the same buggy behavior exists without using that flag, and has in fact existed since before cf424f5.
So:
clean
: respect pathspecs with "-d
"
git-clean
uses read_directory to fill in a struct dir
with potential hits. However, read_directory does not actually check against our pathspec. It uses a simplified version that may turn up false positives. As a result, we need to
check that any hits match our pathspec.
We do so reliably for non-directories.
For directories, if "-d" is not given we check that the pathspec matched exactly (i.e., we are even stricter, and require an explicit "git clean foo
" to clean "foo/
"). But if "-d
" is given, rather than relaxing the exact match to allow a recursive match, we do not check the pathspec at all.
This regression was introduced in 113f10f (Make git-clean a builtin, 2007-11-11, Git v1.5.4-rc0).
dir
: if our pathspec might match files under a dir, recurse into it
For git clean
, if a directory is entirely untracked and the user did not specify -d
(corresponding to DIR_SHOW_IGNORED_TOO
), then we usually do not want to remove that directory and thus do not recurse into it.
However, if the user manually specified specific (or even globbed) paths somewhere under that directory to remove, then we need to recurse into the directory to make sure we remove the relevant paths under that directory as the user requested.
Note that this does not mean that the recursed-into directory will be added to dir->entries
for later removal; as of a few commits earlier in this series, there is another more strict match check that is run after returning from a recursed-into directory before deciding to add it to the list of entries.
Therefore, this will only result in files underneath the given directory which match one of the pathspecs being added to the entries list.
And:
dir
: also check directories for matching pathspecs
Even if a directory doesn't match a pathspec, it is possible, depending on the precise pathspecs, that some file underneath it might.
So we special case and recurse into the directory for such situations.
However, we previously always added any untracked directory that we recursed into to the list of untracked paths, regardless of whether the directory itself matched the pathspec.
For the case of git-clean
and a set of pathspecs of "dir/file
" and "more
",
this caused a problem because we'd end up with dir entries for both of:
"dir"
"dir/file"
Then correct_untracked_entries()
would try to helpfully prune duplicates for us by removing "dir/file
" since it's under "dir
", leaving us with
"dir"
Since the original pathspec only had "dir/file
", the only entry left doesn't match and leaves nothing to be removed.
(Note that if only one pathspec was specified, e.g. only "dir/file
", then the common_prefix_len
optimizations in fill_directory
would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.)
Fix this by actually checking whether the directory we are about to add to the list of dir entries actually matches the pathspec; only do this matching check after we have already returned from recursing into the directory.
That results in:
clean
: disambiguate the definition of -d
The -d
flag pre-dated git-clean
's ability to have paths specified.
As such, the default for git-clean
was to only remove untracked files in
the current directory, and -d
existed to allow it to recurse into
subdirectories.
The interaction of paths and the -d
option appears to not have been carefully considered, as evidenced by numerous bugs and a dearth of tests covering such pairings in the testsuite.
The definition turns out to be important, so let's look at some of the various ways one could interpret the -d
option:
A) Without -d
, only look in subdirectories which contain tracked files under them; with -d
, also look in subdirectories which are untracked for files to clean.
B) Without specified paths from the user for us to delete, we need to have some kind of default, so...without -d
, only look in subdirectories which contain tracked files under them; with -d
, also look in subdirectories which are untracked for files to clean.
The important distinction here is that choice B says that the presence or absence of '-d
' is irrelevant if paths are specified.
The logic behind option B is that if a user explicitly asked us to clean a
specified pathspec, then we should clean anything that matches that
pathspec.
Some examples may clarify.
Should:
git clean -f untracked_dir/file
remove untracked_dir/file or not?
It seems crazy not to, but a strict reading of option A says it shouldn't be removed.
How about:
git clean -f untracked_dir/file1 tracked_dir/file2
or
git clean -f untracked_dir_1/file1 untracked_dir_2/file2
?
Should it remove either or both of these files?
Should it require multiple runs to remove both the files listed? (If this sounds like a crazy question to even ask, see the commit message of "t7300: Add some
testcases showing failure to clean specified pathspecs" added earlier in
this patch series.)
What if -ffd
were used instead of -f
-- should that allow these to be removed? Should it take multiple invocations with -ffd
?
What if a glob (such as 'tracked') were used instead of spelling out the directory names?
What if the filenames involved globs, such as
git clean -f '*.o'
or
git clean -f '*/*.o'
?
The current documentation actually suggests a definition that is
slightly different than choice A
, and the implementation prior to this
series provided something radically different than either choices A or B.
(The implementation, though, was clearly just buggy).
There may be other choices as well.
However, for almost any given choice of definition for -d
that I can think of, some of the examples above will appear buggy to the user.
The only case that doesn't have negative surprises is choice B: treat a user-specified path as a request to clean all untracked files which match that path specification, including recursing into any untracked directories.
Change the documentation and basic implementation to use this definition.
There were two regression tests that indirectly depended on the current
implementation, but neither was about subdirectory handling.
These two tests were introduced in commit 5b7570c ("git-clean
: add tests for relative path", 2008-03-07, Git v1.5.5-rc0) which was solely created to add coverage for the changes in commit fb328947c8e ("git-clean: correct printing relative path", 2008-03-07).
Both tests specified a directory that happened to have an untracked subdirectory, but both were only checking that the resulting printout of a file that was removed was shown with a relative path.
Update these tests appropriately.
Finally, see "Git clean exclude nested sub directory".
Warning: The directory traversal code had redundant recursive calls which made its performance characteristics exponential with respect to the depth of the tree, which was corrected, with Git 2.27 (Q2 2020).
And that affects git clean
as well.
See commit c0af173, commit 95c11ec, commit 7f45ab2, commit 1684644, commit 8d92fb2, commit 2df179d, commit 0126d14, commit cd129ee, commit 446f46d, commit 7260c7b, commit ce5c61a (01 Apr 2020) by Elijah Newren (newren
).
See commit 0bbd0e8 (01 Apr 2020) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 6eacc39, 29 Apr 2020)
dir
: replace exponential algorithm with a linear one
Signed-off-by: Elijah Newren
dir's read_directory_recursive()
naturally operates recursively in order to walk the directory tree.
Treating of directories is sometimes weird because there are so many different permutations about how to handle directories.
Some examples:
- '
git ls-files -o --directory
' only needs to know that a directory itself is untracked; it doesn't need to recurse into it to see what is underneath.
- '
git status
' needs to recurse into an untracked directory, but only to determine whether or not it is empty.
If there are no files underneath, the directory itself will be omitted from the output.
If it is not empty, only the directory will be listed.
- '
git status --ignored
' needs to recurse into untracked directories and report all the ignored entries and then report the directory as untracked -- UNLESS all the entries under the directory are ignored, in which case we don't print any of the entries under the directory and just report the directory itself as ignored.
(Note that although this forces us to walk all untracked files underneath the directory as well, we strip them from the output, except for users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS
.)
- For '
git clean
', we may need to recurse into a directory that doesn't match any specified pathspecs, if it's possible that there is an entry underneath the directory that can match one of the pathspecs.
In such a case, we need to be careful to omit the directory itself from the list of paths (see commit 404ebceda01c ("dir
: also check directories for matching pathspecs", 2019-09-17, Git v2.24.0-rc0))
Part of the tension noted above is that the treatment of a directory can change based on the files within it, and based on the various settings in dir->flags
.
Trying to keep this in mind while reading over the code, it is easy to think in terms of "treat_directory()
tells us what to do with a directory, and read_directory_recursive()
is the thing that recurses".
Since we need to look into a directory to know how to treat it, though, it is quite easy to decide to (also) recurse into the directory from treat_directory()
by adding a read_directory_recursive()
call.
Adding such a call is actually fine, IF we make sure that read_directory_recursive()
does not also recurse into that same directory.
Unfortunately, commit df5bcdf83aeb ("dir
: recurse into untracked dirs for ignored files", 2017-05-18, Git v2.14.0-rc0 -- merge listed in batch #5), added exactly such a case to the code, meaning we'd have two calls to read_directory_recursive()
for an untracked directory.
So, if we had a file named
one/two/three/four/five/somefile.txt
and nothing in one/
was tracked, then 'git status --ignored
' would call read_directory_recursive()
twice on the directory 'one/
', and each of those would call read_directory_recursive()
twice on the directory 'one/two/
', and so on until read_directory_recursive()
was called 2^5
times for 'one/two/three/four/five/
'.
Avoid calling read_directory_recursive()
twice per level by moving a lot of the special logic into treat_directory()
.
Since dir.c
is somewhat complex, extra cruft built up around this over time.
While trying to unravel it, I noticed several instances where the first call to read_directory_recursive()
would return e.g.
path_untracked
for some directory and a later one would return e.g.
path_none,
despite the fact that the directory clearly should have been considered untracked.
The code happened to work due to the side-effect from the first invocation of adding untracked entries to dir->entries;
this allowed it to get the correct output despite the supposed override in return value by the later call.
I am somewhat concerned that there are still bugs and maybe even testcases with the wrong expectation.
I have tried to carefully document treat_directory()
since it becomes more complex after this change (though much of this complexity came from elsewhere that probably deserved better comments to begin with).
However, much of my work felt more like a game of whackamole while attempting to make the code match the existing regression tests than an attempt to create an implementation that matched some clear design.
That seems wrong to me, but the rules of existing behavior had so many special cases that I had a hard time coming up with some overarching rules about what correct behavior is for all cases, forcing me to hope that the regression tests are correct and sufficient.
Such a hope seems likely to be ill-founded, given my experience with dir.c
-related testcases in the last few months:
Examples where the documentation was hard to parse or even just wrong:
- 3aca58045f4f (
git-clean.txt
: do not claim we will delete files with -n
/--dry-run
, 2019-09-17, Git v2.24.0-rc0)
- 09487f2cbad3 (
clean
: avoid removing untracked files in a nested git
repository, 2019-09-17, v2.24.0-rc0)
- e86bbcf987fa (
clean
: disambiguate the definition of -d
, 2019-09-17)
Examples where testcases were declared wrong and changed:
- 09487f2cbad3 (
clean
: avoid removing untracked files in a nested git repository, 2019-09-17, Git v2.24.0-rc0)
- e86bbcf987fa (
clean
: disambiguate the definition of -d
, 2019-09-17, Git v2.24.0-rc0)
- a2b13367fe55 (Revert "
dir.c
: make 'git-status --ignored
' work within leading directories", 2019-12-10, Git v2.25.0-rc0)
Examples where testcases were clearly inadequate:
- 502c386ff944 (
t7300-clean
: demonstrate deleting nested repo with an ignored file breakage, 2019-08-25, Git v2.24.0-rc0)
- 7541cc530239 (
t7300
: add testcases showing failure to clean specified pathspecs, 2019-09-17, Git v2.24.0-rc0)
- a5e916c7453b (
dir
: fix off-by-one error in match_pathspec_item
, 2019-09-17, Git v2.24.0-rc0)
- 404ebceda01c (
dir
: also check directories for matching pathspecs, 2019-09-17, Git v2.24.0-rc0)
- 09487f2cbad3 (
clean
: avoid removing untracked files in a nested git repository, 2019-09-17, Git v2.24.0-rc0)
- e86bbcf987fa (
clean
: disambiguate the definition of -d
, 2019-09-17, Git v2.24.0-rc0)
- 452efd11fbf6 (
t3011
: demonstrate directory traversal failures, 2019-12-10, Git v2.25.0-rc0)
- b9670c1f5e6b (
dir
: fix checks on common prefix directory, 2019-12-19, Git v2.25.0-rc0)
Examples where "correct behavior" was unclear to everyone:
Other commits of note:
- 902b90cf42bc (
clean
: fix theoretical path corruption, 2019-09-17, Git v2.24.0-rc0)
However, on the positive side, it does make the code much faster.
For the following simple shell loop in an empty repository:
for depth in $(seq 10 25)
do
dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
rm -rf dir
mkdir -p $dirs
$dirs/untracked-file
/usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
done
I saw the following timings, in seconds (note that the numbers are a little noisy from run-to-run, but the trend is very clear with every run):
10: 0.03
11: 0.05
12: 0.08
13: 0.19
14: 0.29
15: 0.50
16: 1.05
17: 2.11
18: 4.11
19: 8.60
20: 17.55
21: 33.87
22: 68.71
23: 140.05
24: 274.45
25: 551.15
For the above run, using strace
I can look for the number of untracked directories opened and can verify that it matches the expected 2^($depth+1)-2
(the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth
).
After this fix, with strace
I can verify that the number of untracked directories that are opened drops to just $depth, and the timings all drop to 0.00.
In fact, it isn't until a depth of 190 nested directories that it sometimes starts reporting a time of 0.01 seconds and doesn't consistently report 0.01 seconds until there are 240 nested directories. The previous code would have taken
17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
to have completed the 240 nested directories case.
It's not often that you get to speed something up by a factor of 3*10^69.