12

According to the help, without -x option git clean should let alone the ignored files, but it doesn't.

[il@reallin test]$ cat .gitignore
*.sar
[il@reallin test]$ mkdir -p conf/sar && touch conf/sar/aaa.sar
[il@reallin test]$ git status
# On branch master
nothing to commit, working directory clean
[il@reallin test]$ git clean -df
Removing conf/

conf/sar/aaa.sar is removed. Is it a bug?

basin
  • 3,407
  • 2
  • 20
  • 46

5 Answers5

5

According to man git clean:

-d
    Remove untracked directories in addition to untracked files.

In your case, directory conf/sar is not tracked - it does not contain any files that are tracked by git. If you did not have gitignore rule and executed git clean -fd, contents of this untracked directory would have been removed - just what documentation says.

Now, if you add .gitignore with rule to ignore *.sar files, it does not change basic fact that your directory conf/sar/ is still untracked, and having untracked file aaa.sar that is eligible for this gitignore rule should not suddenly make it unremovable by git clean -fd.

But, if you add any tracking file next to your ignored aaa.sar, then this directory would not be removed and your file will be left alone.

In other words, while it looks confusing, this is not a bug and git does exactly what documentation says.

mvp
  • 94,368
  • 12
  • 106
  • 137
  • 2
    As a side note: it would probably be best to put a `.gitignore` ignoring `*.sar` files inside the `sar` directory itself. Not only this makes the top-level `.gitignore` cleaner and keep ignore information where they are needed, it also has the added benefit of keeping that directory alive as @mvp mentioned. – Shahbaz Jun 04 '14 at 09:03
4

Warning: this git clean behavior will slightly change with Git 2.14 (Q3 2017)

"git clean -d" used to clean directories that has ignored files, even though the command should not lose ignored ones without "-x".
"git status --ignored" did not list ignored and untracked files without "-uall".

See commit 6b1db43 (23 May 2017), and commit bbf504a, commit fb89888, commit df5bcdf, commit 0a81d4a, commit b3487cc (18 May 2017) by Samuel Lijin (sxlijin).
(Merged by Junio C Hamano -- gitster -- in commit f4fd99b, 02 Jun 2017)

clean: teach clean -d to preserve ignored paths

There is an implicit assumption that a directory containing only untracked and ignored paths should itself be considered untracked. This makes sense in use cases where we're asking if a directory should be added to the git database, but not when we're asking if a directory can be safely removed from the working tree; as a result, clean -d would assume that an "untracked" directory containing ignored paths could be deleted, even though doing so would also remove the ignored paths.

To get around this, we teach clean -d to collect ignored paths and skip an untracked directory if it contained an ignored path, instead just removing the untracked contents thereof.
To achieve this, cmd_clean() has to collect all untracked contents of untracked directories, in addition to all ignored paths, to determine which untracked dirs must be skipped (because they contain ignored paths) and which ones should not be skipped.

But... Since 2017, that change means git status --ignored hangs indefinitely!
As reported by Martin Melka in this thread, and analyzed by SZEDER Gábor:

with a depth of 120 directories it would take over 6*10^23 years to complete

This slowdown was caused by commit df5bcdf, which was part of a patch series to fix 'git clean -d' deleting untracked directories even if they contained ignored files.

So... a fix is in progress, to be released later in 2020.


Git 2.24 (Q4 2019) illustrates this git clean behavior change introduces a regression.

See commit 502c386 (25 Aug 2019) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit 026428c, 30 Sep 2019)

t7300-clean: demonstrate deleting nested repo with an ignored file breakage

'git clean -fd' must not delete an untracked directory if it belongs to a different Git repository or worktree.

Unfortunately, if a '.gitignore' rule in the outer repository happens to match a file in a nested repository or worktree, then something goes awry and 'git clean -fd' does delete the content of the nested repository's worktree except that ignored file, potentially leading to data loss.

Add a test to 't7300-clean.sh' to demonstrate this breakage.

This issue is a regression introduced in 6b1db43 (clean: teach clean -d to preserve ignored paths, 2017-05-23, Git v2.13.2).


Git 2.24 further clarifies git clean -d:

See commit 69f272b (01 Oct 2019), and commit 902b90c, commit ca8b539, commit 09487f2, commit e86bbcf, commit 3aca580, commit 29b577b, commit 89a1f4a, commit a3d89d8, commit 404ebce, commit a5e916c, commit bbbb6b0, commit 7541cc5 (17 Sep 2019) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit aafb754, 11 Oct 2019)

t7300: add testcases showing failure to clean specified pathspecs

Someone brought me a testcase where multiple git-clean invocations were required to clean out unwanted files:

mkdir d{1,2}
touch d{1,2}/ut
touch d1/t && git add d1/t

With this setup, the user would need to run

git clean -ffd */ut

twice to delete both ut files.

A little testing showed some interesting variants:

  • If only one of those two ut files existed (either one), then only one clean command would be necessary.
  • If both directories had tracked files, then only one git clean would be necessary to clean both files.
  • If both directories had no tracked files then the clean command above would never clean either of the untracked files despite the pathspec explicitly calling both of them out.

A bisect showed that the failure to clean out the files started with commit cf424f5 ("clean: respect pathspecs with "-d", 2014-03-10, Git v1.9.1).
However, that pointed to a separate issue: while the "-d" flag was used by the original user who showed me this problem, that flag should have been irrelevant to this problem.
Testing again without the "-d" flag showed that the same buggy behavior exists without using that flag, and has in fact existed since before cf424f5.

So:

clean: respect pathspecs with "-d"

git-clean uses read_directory to fill in a struct dir with potential hits. However, read_directory does not actually check against our pathspec. It uses a simplified version that may turn up false positives. As a result, we need to check that any hits match our pathspec.

We do so reliably for non-directories.

For directories, if "-d" is not given we check that the pathspec matched exactly (i.e., we are even stricter, and require an explicit "git clean foo" to clean "foo/"). But if "-d" is given, rather than relaxing the exact match to allow a recursive match, we do not check the pathspec at all.

This regression was introduced in 113f10f (Make git-clean a builtin, 2007-11-11, Git v1.5.4-rc0).

dir: if our pathspec might match files under a dir, recurse into it

For git clean, if a directory is entirely untracked and the user did not specify -d (corresponding to DIR_SHOW_IGNORED_TOO), then we usually do not want to remove that directory and thus do not recurse into it.

However, if the user manually specified specific (or even globbed) paths somewhere under that directory to remove, then we need to recurse into the directory to make sure we remove the relevant paths under that directory as the user requested.

Note that this does not mean that the recursed-into directory will be added to dir->entries for later removal; as of a few commits earlier in this series, there is another more strict match check that is run after returning from a recursed-into directory before deciding to add it to the list of entries.
Therefore, this will only result in files underneath the given directory which match one of the pathspecs being added to the entries list.

And:

dir: also check directories for matching pathspecs

Even if a directory doesn't match a pathspec, it is possible, depending on the precise pathspecs, that some file underneath it might.
So we special case and recurse into the directory for such situations.
However, we previously always added any untracked directory that we recursed into to the list of untracked paths, regardless of whether the directory itself matched the pathspec.

For the case of git-clean and a set of pathspecs of "dir/file" and "more", this caused a problem because we'd end up with dir entries for both of:

"dir"
"dir/file"

Then correct_untracked_entries() would try to helpfully prune duplicates for us by removing "dir/file" since it's under "dir", leaving us with

"dir"

Since the original pathspec only had "dir/file", the only entry left doesn't match and leaves nothing to be removed.
(Note that if only one pathspec was specified, e.g. only "dir/file", then the common_prefix_len optimizations in fill_directory would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.)

Fix this by actually checking whether the directory we are about to add to the list of dir entries actually matches the pathspec; only do this matching check after we have already returned from recursing into the directory.

That results in:

clean: disambiguate the definition of -d

The -d flag pre-dated git-clean's ability to have paths specified.
As such, the default for git-clean was to only remove untracked files in the current directory, and -d existed to allow it to recurse into subdirectories.

The interaction of paths and the -d option appears to not have been carefully considered, as evidenced by numerous bugs and a dearth of tests covering such pairings in the testsuite.
The definition turns out to be important, so let's look at some of the various ways one could interpret the -d option:

A) Without -d, only look in subdirectories which contain tracked files under them; with -d, also look in subdirectories which are untracked for files to clean.

B) Without specified paths from the user for us to delete, we need to have some kind of default, so...without -d, only look in subdirectories which contain tracked files under them; with -d, also look in subdirectories which are untracked for files to clean.

The important distinction here is that choice B says that the presence or absence of '-d' is irrelevant if paths are specified.
The logic behind option B is that if a user explicitly asked us to clean a specified pathspec, then we should clean anything that matches that pathspec.

Some examples may clarify.

Should:

git clean -f untracked_dir/file

remove untracked_dir/file or not?
It seems crazy not to, but a strict reading of option A says it shouldn't be removed.
How about:

git clean -f untracked_dir/file1 tracked_dir/file2

or

git clean -f untracked_dir_1/file1 untracked_dir_2/file2

?
Should it remove either or both of these files?
Should it require multiple runs to remove both the files listed? (If this sounds like a crazy question to even ask, see the commit message of "t7300: Add some testcases showing failure to clean specified pathspecs" added earlier in this patch series.)
What if -ffd were used instead of -f -- should that allow these to be removed? Should it take multiple invocations with -ffd?
What if a glob (such as 'tracked') were used instead of spelling out the directory names?
What if the filenames involved globs, such as

git clean -f '*.o'

or

git clean -f '*/*.o'

?

The current documentation actually suggests a definition that is slightly different than choice A, and the implementation prior to this series provided something radically different than either choices A or B.
(The implementation, though, was clearly just buggy).

There may be other choices as well.
However, for almost any given choice of definition for -d that I can think of, some of the examples above will appear buggy to the user.
The only case that doesn't have negative surprises is choice B: treat a user-specified path as a request to clean all untracked files which match that path specification, including recursing into any untracked directories.

Change the documentation and basic implementation to use this definition.

There were two regression tests that indirectly depended on the current implementation, but neither was about subdirectory handling.
These two tests were introduced in commit 5b7570c ("git-clean: add tests for relative path", 2008-03-07, Git v1.5.5-rc0) which was solely created to add coverage for the changes in commit fb328947c8e ("git-clean: correct printing relative path", 2008-03-07).
Both tests specified a directory that happened to have an untracked subdirectory, but both were only checking that the resulting printout of a file that was removed was shown with a relative path.
Update these tests appropriately.

Finally, see "Git clean exclude nested sub directory".


Warning: The directory traversal code had redundant recursive calls which made its performance characteristics exponential with respect to the depth of the tree, which was corrected, with Git 2.27 (Q2 2020).

And that affects git clean as well.

See commit c0af173, commit 95c11ec, commit 7f45ab2, commit 1684644, commit 8d92fb2, commit 2df179d, commit 0126d14, commit cd129ee, commit 446f46d, commit 7260c7b, commit ce5c61a (01 Apr 2020) by Elijah Newren (newren).
See commit 0bbd0e8 (01 Apr 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 6eacc39, 29 Apr 2020)

dir: replace exponential algorithm with a linear one

Signed-off-by: Elijah Newren

dir's read_directory_recursive() naturally operates recursively in order to walk the directory tree.

Treating of directories is sometimes weird because there are so many different permutations about how to handle directories.

Some examples:

  • 'git ls-files -o --directory' only needs to know that a directory itself is untracked; it doesn't need to recurse into it to see what is underneath.
  • 'git status' needs to recurse into an untracked directory, but only to determine whether or not it is empty.
    If there are no files underneath, the directory itself will be omitted from the output.
    If it is not empty, only the directory will be listed.
  • 'git status --ignored' needs to recurse into untracked directories and report all the ignored entries and then report the directory as untracked -- UNLESS all the entries under the directory are ignored, in which case we don't print any of the entries under the directory and just report the directory itself as ignored.
    (Note that although this forces us to walk all untracked files underneath the directory as well, we strip them from the output, except for users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.)
  • For 'git clean', we may need to recurse into a directory that doesn't match any specified pathspecs, if it's possible that there is an entry underneath the directory that can match one of the pathspecs.
    In such a case, we need to be careful to omit the directory itself from the list of paths (see commit 404ebceda01c ("dir: also check directories for matching pathspecs", 2019-09-17, Git v2.24.0-rc0))

Part of the tension noted above is that the treatment of a directory can change based on the files within it, and based on the various settings in dir->flags.

Trying to keep this in mind while reading over the code, it is easy to think in terms of "treat_directory() tells us what to do with a directory, and read_directory_recursive() is the thing that recurses".

Since we need to look into a directory to know how to treat it, though, it is quite easy to decide to (also) recurse into the directory from treat_directory() by adding a read_directory_recursive() call.

Adding such a call is actually fine, IF we make sure that read_directory_recursive() does not also recurse into that same directory.

Unfortunately, commit df5bcdf83aeb ("dir: recurse into untracked dirs for ignored files", 2017-05-18, Git v2.14.0-rc0 -- merge listed in batch #5), added exactly such a case to the code, meaning we'd have two calls to read_directory_recursive() for an untracked directory.

So, if we had a file named

one/two/three/four/five/somefile.txt

and nothing in one/ was tracked, then 'git status --ignored' would call read_directory_recursive() twice on the directory 'one/', and each of those would call read_directory_recursive() twice on the directory 'one/two/', and so on until read_directory_recursive() was called 2^5 times for 'one/two/three/four/five/'.

Avoid calling read_directory_recursive() twice per level by moving a lot of the special logic into treat_directory().

Since dir.c is somewhat complex, extra cruft built up around this over time.

While trying to unravel it, I noticed several instances where the first call to read_directory_recursive() would return e.g. path_untracked for some directory and a later one would return e.g. path_none, despite the fact that the directory clearly should have been considered untracked.

The code happened to work due to the side-effect from the first invocation of adding untracked entries to dir->entries; this allowed it to get the correct output despite the supposed override in return value by the later call.

I am somewhat concerned that there are still bugs and maybe even testcases with the wrong expectation.

I have tried to carefully document treat_directory() since it becomes more complex after this change (though much of this complexity came from elsewhere that probably deserved better comments to begin with).

However, much of my work felt more like a game of whackamole while attempting to make the code match the existing regression tests than an attempt to create an implementation that matched some clear design.

That seems wrong to me, but the rules of existing behavior had so many special cases that I had a hard time coming up with some overarching rules about what correct behavior is for all cases, forcing me to hope that the regression tests are correct and sufficient.

Such a hope seems likely to be ill-founded, given my experience with dir.c-related testcases in the last few months:

Examples where the documentation was hard to parse or even just wrong:

  • 3aca58045f4f (git-clean.txt: do not claim we will delete files with -n/--dry-run, 2019-09-17, Git v2.24.0-rc0)
  • 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17, v2.24.0-rc0)
  • e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17)

Examples where testcases were declared wrong and changed:

  • 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17, Git v2.24.0-rc0)
  • e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17, Git v2.24.0-rc0)
  • a2b13367fe55 (Revert "dir.c: make 'git-status --ignored' work within leading directories", 2019-12-10, Git v2.25.0-rc0)

Examples where testcases were clearly inadequate:

  • 502c386ff944 (t7300-clean: demonstrate deleting nested repo with an ignored file breakage, 2019-08-25, Git v2.24.0-rc0)
  • 7541cc530239 (t7300: add testcases showing failure to clean specified pathspecs, 2019-09-17, Git v2.24.0-rc0)
  • a5e916c7453b (dir: fix off-by-one error in match_pathspec_item, 2019-09-17, Git v2.24.0-rc0)
  • 404ebceda01c (dir: also check directories for matching pathspecs, 2019-09-17, Git v2.24.0-rc0)
  • 09487f2cbad3 (clean: avoid removing untracked files in a nested git repository, 2019-09-17, Git v2.24.0-rc0)
  • e86bbcf987fa (clean: disambiguate the definition of -d, 2019-09-17, Git v2.24.0-rc0)
  • 452efd11fbf6 (t3011: demonstrate directory traversal failures, 2019-12-10, Git v2.25.0-rc0)
  • b9670c1f5e6b (dir: fix checks on common prefix directory, 2019-12-19, Git v2.25.0-rc0)

Examples where "correct behavior" was unclear to everyone:

Other commits of note:

  • 902b90cf42bc (clean: fix theoretical path corruption, 2019-09-17, Git v2.24.0-rc0)

However, on the positive side, it does make the code much faster.

For the following simple shell loop in an empty repository:

for depth in $(seq 10 25)
do
  dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
  rm -rf dir
  mkdir -p $dirs

$dirs/untracked-file /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null done

I saw the following timings, in seconds (note that the numbers are a little noisy from run-to-run, but the trend is very clear with every run):

10: 0.03
11: 0.05
12: 0.08
13: 0.19
14: 0.29
15: 0.50
16: 1.05
17: 2.11
18: 4.11
19: 8.60
20: 17.55
21: 33.87
22: 68.71
23: 140.05
24: 274.45
25: 551.15

For the above run, using strace I can look for the number of untracked directories opened and can verify that it matches the expected 2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth).

After this fix, with strace I can verify that the number of untracked directories that are opened drops to just $depth, and the timings all drop to 0.00.

In fact, it isn't until a depth of 190 nested directories that it sometimes starts reporting a time of 0.01 seconds and doesn't consistently report 0.01 seconds until there are 240 nested directories. The previous code would have taken

17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS

to have completed the 240 nested directories case.

It's not often that you get to speed something up by a factor of 3*10^69.

VonC
  • 1,042,979
  • 435
  • 3,649
  • 4,283
0

Yes, git clean does seem to behave contrary to the docs, by removing an ignored file even though -x/-X is not specified.

It seems that option -d overrides the absence of -x/-X. That is, git clean -df will remove untracked directories, even if they contain untracked, but ignored files.

I don't know whether this is an oversight, or intentional, but the manpage is clearly incomplete in this respect. You might consider sending a patch for the manpage to the git mailing list.

BTW, the same problem is discussed in the question How to preserve all ignored files in git clean -fd? . There it is noted that git clean -df will not delete directories which are in .gitignore. So to retain your conf/, you can add it to .gitignore.

Community
  • 1
  • 1
sleske
  • 73,934
  • 32
  • 166
  • 212
0

To get the behavior you want, protecting an untracked directory from git clean -d and selectively deleting content from within those untracked directories, you have to explicitly ignore the entire topmost untracked directory, in your case

echo /conf/ >>.gitignore   # or .git/info/excludes if it's just you

Now, git clean does not recurse into untracked directories, but fortunately it's an easy handroll:

# recursive x-ray git clean with various options:

git ls-files --exclude-standard '-x!*/' -oz  | xargs -0 rm -f   #
git ls-files                            -oz  | xargs -0 rm -f   # -x
git ls-files --exclude-standard '-x!*/' -oiz | xargs -0 rm -f   # -X

(or git ls-files --exclude-standard '-x!/conf/' to skip just the one spec). The single quotes are there because ! is interactive-shell syntax for pulling in pieces of previous command lines.

To clean up empty directories you can get close to your desired behavior with

find -depth -type d -empty -delete
# -delete is -exec rm -f '{}' ';' on non-GNU userlands

But this really belongs in a makefile recipe, followed by a batch of mkdir -ps to recreate any of the structure you want preserved even if empty, since make is built to manage transients like build/test/install products.

jthill
  • 42,819
  • 4
  • 65
  • 113
  • "git entirely ignores conf/sar/" - false. conf is still removed – basin Jun 02 '14 at 06:52
  • "git clean -f" - it does nothing – basin Jun 02 '14 at 06:53
  • You're right, I overlooked that `/conf/` is also untracked, and apparently haven't ever used `git clean -f` without `-d` in a directory where I expected it to be recursive. Better now? – jthill Jun 02 '14 at 23:01
0

In addition of the git clean fixes I mentioned previously, with Git 2.28 (Q3 2020), code clean-up of "git clean" resulted in a fix of recent performance regression.

See commit 7233f17, commit f7f5c6c, commit 351ea1c, commit e6c0be9 (11 Jun 2020) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 5367469, 25 Jun 2020)

clean: optimize and document cases where we recurse into subdirectories

Reported-by: Brian Malehorn
Signed-off-by: Elijah Newren

Commit 6b1db43109 ("clean: teach clean -d to preserve ignored paths", 2017-05-23, Git v2.14.0-rc0 -- merge listed in batch #5) added the following code block (among others) to git-clean:

if (remove_directories)
    dir.flags |= DIR_SHOW_IGNORED_TOO | DIR_KEEP_UNTRACKED_CONTENTS;

The reason for these flags is well documented in the commit message, but isn't obvious just from looking at the code.

Add some explanations to the code to make it clearer.

Further, it appears git-2.26 did not correctly handle this combination of flags from git clean.

With both these flags and without DIR_SHOW_IGNORED_TOO_MODE_MATCHING set, git is supposed to recurse into all untracked AND ignored directories.

git-2.26.0 clearly was not doing that.

I don't know the full reasons for that or whether git < 2.27.0 had additional unknown bugs because of that misbehavior, because I don't feel it's worth digging into.

As per the huge changes and craziness documented in commit 8d92fb2927 ("dir: replace exponential algorithm with a linear one", 2020-04-01, Git v2.27.0-rc0 -- merge listed in batch #5), the old algorithm was a mess and was thrown out.

What I can say is that git-2.27.0 correctly recurses into untracked AND ignored directories with that combination.

However, in clean's case we don't need to recurse into ignored directories; that is just a waste of time.

Thus, when git-2.27.0 started correctly handling those flags, we got a performance regression report.

Rather than relying on other bugs in fill_directory()'s former logic to provide the behavior of skipping ignored directories, make use of the DIR_SHOW_IGNORED_TOO_MODE_MATCHING value specifically added in commit eec0f7f2b7 ("status: add option to show ignored files differently", 2017-10-30, Git v2.16.0-rc0 -- merge listed in batch #4) for this purpose.

VonC
  • 1,042,979
  • 435
  • 3,649
  • 4,283