54

The following bash script is slow when scanning for .git directories because it looks at every directory. If I have a collection of large repositories, it takes a long time for find to churn through every directory looking for .git. It would go much faster if it pruned the directories within repos once a .git directory is found. Any ideas on how to do that, or is there another way to write a bash script that accomplishes the same thing?

#!/bin/bash

# Update all git directories below current directory or specified directory

HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'

DIR=.
if [ "$1" != "" ]; then DIR="$1"; fi
cd "$DIR" > /dev/null; echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"; cd - > /dev/null

for d in `find . -name .git -type d`; do
  cd "$d/.." > /dev/null
  echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
  git pull
  cd - > /dev/null
done

Specifically, how would you use these options? For this problem, you cannot assume that the collection of repos is all in the same directory; they might be within nested directories.

top
  repo1
  dirA
    repo2
    repo3
    repo4
  dirB
    repo5
    dirC
      repo6
Mike Slinn

  • Consider adding the `-maxdepth` option and setting it to `1` (for `find`) – Burhan Khalid Aug 16 '12 at 06:27
  • Just adding the `-prune` option should work. – Xiè Jìléi Aug 16 '12 at 06:56
  • Specifically, how would you use these options? For this problem, you cannot assume that the collection of repos is all in the same directory; they might be within nested directories. top repo1 dirA repo2 repo3 repo4 dirB repo5 dirC repo6 – Mike Slinn Aug 16 '12 at 13:08
  • Is it the `find` that is "slow", or is it the fact that you're doing a `git pull` at each directory? I suspect simply running `find . -type d -name .git -print` should be pretty quick (unless you're running over a slow network file system like NFS or CIFS, or on a floppy drive or something)... – twalberg Aug 16 '12 at 14:13
  • temporarily remove the pull; see if it is still slow – Clayton Stanley Aug 17 '12 at 01:51
  • Clay, you are another victim of groupthink. Try running the script and you will see that you don't understand how it works. – Mike Slinn Aug 17 '12 at 05:07
  • https://unix.stackexchange.com/questions/333862/how-to-find-all-git-repositories-within-given-folders-fast – Adam Feb 21 '19 at 16:33

8 Answers

50

Check out Dennis' answer in this post about find's -prune option:

How to use '-prune' option of 'find' in sh?

find . -name .git -type d -prune

This will speed things up a bit, as find won't descend into .git directories. However, it still descends into the working trees of git repositories, looking for other .git folders, and that could be a costly operation.

What would be cool is if there were some sort of find lookahead pruning mechanism, where if a folder has a subfolder called .git, it would prune on that folder...
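Something close to that lookahead can be expressed with find's -execdir, as later comments point out. A minimal sketch (the test -d {}/.git probe is the "lookahead", and -prune then stops the descent):

# Print each directory that directly contains a .git subfolder,
# then prune it so find never descends into that repository.
find . -type d -execdir test -d {}/.git \; -print -prune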

That said, I'm betting your bottleneck is in the network operation 'git pull', and not in the find command, as others have posted in the comments.

Clayton Stanley
  • Dennis, you also have not tried your answer, and you also don't understand how the script works. You did not identify the bottleneck correctly. If you folks run the script you will see it does not work the way you think it does. I don't intend to be mean, but you folks are all making the same conceptual errors. Negative groupthink. – Mike Slinn Aug 17 '12 at 05:04
  • This works, though I'm not sure it satisfies the "quick" constraint. :P Thanks for this though! – ThorSummoner Oct 03 '14 at 17:00
  • The "lookahead" you mentioned is described by @vaab in his answer; a shorter (but perhaps weaker) version is: `find -type d -execdir test -d {}/.git \; -print -prune`. – krlmlr Feb 19 '16 at 10:20
  • Adding `-maxdepth 2` (1 or 2, depending on how deep you want to go) makes it even faster, especially in git bash under Windows: `find -maxdepth 2 -type d -execdir test -d {}/.git \; -print -prune` – djKianoosh May 23 '16 at 14:08
  • To build on this answer, I found this pretty useful for my scripts (the command is bash-specific though): `find . -name .git -type d -prune -exec dirname {} \;` – ColdLearning Jul 28 '17 at 18:16
14

Here is an optimized solution:

#!/bin/bash
# Update all git directories below current directory or specified directory
# Skips directories that contain a file called .ignore

HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'

function update {
  local d="$1"
  if [ -d "$d" ]; then
    if [ -e "$d/.ignore" ]; then 
      echo -e "\n${HIGHLIGHT}Ignoring $d${NORMAL}"
    else
      cd $d > /dev/null
      if [ -d ".git" ]; then
        echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
        git pull
      else
        scan *
      fi
      cd .. > /dev/null
    fi
  fi
  #echo "Exiting update: pwd=`pwd`"
}

function scan {
  #echo "`pwd`"
  for x in $*; do
    update "$x"
  done
}

if [ "$1" != "" ]; then cd $1 > /dev/null; fi
echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"
scan *
Mike Slinn
  • Looks good, but the fact that find is compiled and bash is interpreted makes me wonder whether this is actually faster than the find method. Also not sure about bash's optimization of recursive functions. If you have time, care to post some benchmarks? I'd be interested to see the results. – Clayton Stanley Aug 18 '12 at 04:13
  • The optimized version is dramatically faster. Try each version for yourself and stop wondering. – Mike Slinn Aug 18 '12 at 11:38
  • @MikeSlinn Your script doesn't work. I have a terminal dump to show you demonstrating the weird behavior, but it's too big to post in a comment here... Basically it is not doing a very good job of finding all the git repos (how many it finds depends on which containing directory I run the script from), yet it still spends an **extremely long time** crawling everything. – Steven Lu Apr 08 '13 at 01:49
  • I have used my script, and improved it, since posting. Works great. – Mike Slinn Apr 08 '13 at 09:39
  • Using `"$PWD"` instead of `pwd` will gain you some substantial time. – vaab Apr 28 '14 at 12:28
  • ${PWD} is essentially the same as $PWD. Note that `pwd` is commented out. Here is a more recent version, with better interop: https://gist.github.com/mslinn/3151915 – Mike Slinn Apr 28 '14 at 16:48
  • To make the script above slightly more robust, replace `cd $d > /dev/null` with `cd -- $d > /dev/null`, since some git repositories like babel contain directories whose names start with a hyphen (e.g., `--extensions`). – mmizutani Jul 01 '16 at 09:38
  • @mmizutani the version in the gist quotes $d, so it is robust: `cd "$d" > /dev/null` – Mike Slinn Jul 01 '16 at 14:16
  • I updated the gist so it does not use ANSI colors if running as a cron job: http://gist.github.com/mslinn/3151915 – Mike Slinn Sep 12 '18 at 14:45
11

I've taken the time to copy-paste the script from your question and compare it to the script in your own answer. Here are some interesting results:

Please note that:

  • I've disabled the git pull commands by prefixing them with echo
  • I've also removed the color codes
  • I've also removed the .ignore file test from the bash solution
  • Removed the unnecessary > /dev/null redirections here and there
  • Removed the pwd calls in both
  • Added -prune, which was obviously lacking in the find example
  • Used while instead of for, as the latter was counterproductive in the find example
  • Considerably untangled the second example to get to the point
  • Added a test to the bash solution to NOT follow symlinks, to avoid cycles and behave like the find solution
  • Added shopt to allow * to expand to dotted directory names as well, matching the find solution's functionality

Thus, we are comparing the find-based solution:

#!/bin/bash

find . -name .git -type d -prune | while read -r d; do
   cd "$d/.."
   echo "$PWD >" git pull
   cd "$OLDPWD"
done

With the bash shell builtin solution:

#!/bin/bash

shopt -s dotglob

update() {
    for d in "$@"; do
        test -d "$d" -a \! -L "$d" || continue
        cd "$d"
        if [ -d ".git" ]; then
            echo "$PWD >" git pull
        else
            update *
        fi
        cd ..
    done
}

update *

Note: builtins (the function and the for loop) are immune to the OS's MAX_ARGS limit for launching processes, so the * won't break even in very large directories.
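For instance (a minimal illustration of that limit; not part of the original comparison):

# The kernel's limit on the total byte size of an exec'ed argument list:
getconf ARG_MAX
# In a directory with enough files, exec'ing an external command can fail:
#   /bin/echo *   => "Argument list too long"
# whereas a builtin loop over the same glob never execs, so it always works:
for f in *; do : "$f"; done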

Technical differences between solutions:

The find-based solution uses C functions to crawl the repositories; it:

  • has to load a new process for the find command.
  • will avoid .git contents but will crawl the workdirs of git repositories, losing some time in them (and possibly finding more matching elements).
  • will have to chdir through several levels of subdirectories for each match, and back.
  • will have to chdir once in the find command and once more in the bash part.

The bash-based solution uses builtins (near-C implementations, but interpreted) to crawl the repositories; note that it:

  • will use only one process.
  • will avoid descending into git workdir subdirectories.
  • will only chdir one level at a time.
  • will only chdir once per directory, for both the search and the command.

Actual speed results between solutions:

I have a working development collection of git repositories on which I launched the scripts:

  • find solution: ~0.080s (bash chdir takes ~0.010s)
  • bash solution: ~0.017s

I have to admit that I wasn't prepared to see such a win from bash builtins. It became more understandable after analyzing what's going on. To add insult to injury, if you change the shell from /bin/bash to /bin/sh (you must comment out the shopt line, and accept that it won't traverse dotted directories), you'll fall to ~0.008s. Beat that!
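For reference, a sketch of that POSIX sh variant (my adaptation of the script above, not vaab's exact code):

#!/bin/sh
# POSIX sh variant: no shopt, so * will not match directories
# whose names start with "." (the limitation noted above).

update() {
    for d in "$@"; do
        [ -d "$d" ] && [ ! -L "$d" ] || continue
        cd "$d" || continue
        if [ -d .git ]; then
            echo "$PWD >" git pull
        else
            update *
        fi
        cd ..
    done
}

update *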

Note that you can be more clever with the find solution by using:

find . -type d \( -exec /usr/bin/test -d "{}/.git" -a "{}" != "." \; -print -prune \
       -o -name .git -prune \)

which effectively avoids crawling all subdirectories of a found git repository, at the price of spawning a process for each directory crawled. The final find solution I came up with was around ~0.030s, more than twice as fast as the previous find version, but still about 2 times slower than the bash solution.

Note that /usr/bin/test is important to avoid a $PATH search, which costs time; and I needed -o -name .git -prune and -a "{}" != "." because my main repository was itself a git subrepository.

In conclusion, I won't be using the bash builtin solution because it has too many corner cases for me (and my first test hit one of its limitations). But it was important for me to explain why it can be (much) faster in some cases; the find solution seems much more robust and consistent to me.

vaab
  • I use the bash solution many times a day, on multiple OSes (Mac, Linux, Windows). It works very well. Its features: it won't parse sub-git repositories, it will follow symbolic links, and it won't look into directories whose names start with ".". Here is an updated script that has better interop: https://gist.github.com/mslinn/3151915 – Mike Slinn Apr 28 '14 at 16:40
7

The answers above all rely on finding a ".git" directory. However, not all git repos have one (e.g. bare repos). The following command will loop through all directories and ask git if it considers each to be a repository. If so, it prunes that directory's subdirectories from the tree and continues.

find . -type d -exec sh -c 'cd "{}"; git rev-parse --git-dir 2> /dev/null 1>&2' \; -prune -print

It's a lot slower than other solutions because it executes a command in each directory, but it doesn't rely on a particular repository structure. It could be useful for finding bare git repositories, for example.
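A sketch of how that output might be consumed (the loop and git -C are my additions, not part of the original answer; fetch is used rather than pull because bare repositories have no work tree to merge into):

# Loop over the repositories found and fetch each one;
# git -C runs the command inside the given directory.
find . -type d -exec sh -c 'cd "{}"; git rev-parse --git-dir 2> /dev/null 1>&2' \; -prune -print |
while IFS= read -r repo; do
    echo "Fetching $repo"
    git -C "$repo" fetch
done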

CharlieB
  • I have never needed to keep a bare git repo up to date by using "git pull", and I have never needed to locate a bare git repo via a recursive search. The performance hit for supporting these features would be significant. – Mike Slinn Sep 26 '17 at 16:13
  • Indeed they are, but I did find myself needing to locate bare repos recursively and, having looked through Stack Overflow for a solution and not found one, I decided to post mine here. – CharlieB Sep 26 '17 at 16:15
  • Cool. Why not make a gist of the entire working script and post the URL? I maintain a similar gist, shown above. – Mike Slinn Sep 26 '17 at 20:14
  • My script is bulky and does a lot of things that aren't really relevant to this question. The heart of it is the line above though: from there it's a simple `for f in ${output_of_above}; do` to loop through the results – CharlieB Sep 27 '17 at 14:05
  • This is a heck of a lot cheaper way to accomplish the same thing: `find docroot -type d -name ".git" -prune -print -exec rm -rf {} \;`. You don't want to target the base directory, but anything under it that might be problematic. In my case I am working with Drupal 8, so everything under the docroot directory is of concern. – Patrick Dec 20 '17 at 20:30
  • @pthurmond correct me if I’m wrong, but I think your method relies on detecting `.git` directories like the other methods. That’s fine in most cases, but some git repos (e.g. bare repos) don’t have one. – CharlieB Dec 20 '17 at 20:35
  • It does, but you should never be embedding other git repos into another repo except via automated processes that download the contents of those repos (such as using PHP composer or adding node modules). In that case you would always get the git directories. – Patrick Dec 20 '17 at 20:51
  • But what if I have a normal folder which contains a load of unrelated bare repos, some of which are in subdirs for organisation? – CharlieB Dec 20 '17 at 20:57
  • You would have to target them specifically. Or wrap this in an iterator that runs this only in each subdirectory, but not the current one. – Patrick Dec 21 '17 at 15:19
  • Or I could use git's own mechanisms to identify valid directories, as in this answer. There are three possibilities for valid repositories: a) a .git folder exists; b) the repository is bare and contains no work tree; or c) the git files live elsewhere and are pointed at by a `.git` text file (see https://git-scm.com/docs/gitrepository-layout). My approach catches all of these, recursively, albeit comparatively slowly. – CharlieB Dec 21 '17 at 15:27
3

For Windows, you can put the following into a batch file called gitlist.bat and put it on your PATH.

@echo off
if {%1}=={} goto :usage
for /r %1 /d %%I in (.) do echo %%I | find ".git\."
goto :eof
:usage
echo usage: gitlist ^<path^>
3

I list all git repositories anywhere under the current directory using:

find . -type d -execdir test -d {}/.git \; -prune -print

This is fast since it stops recursing once it finds a git repository. (Although it does not handle bare repositories.) Of course, you can change the . to whatever directory you want. If you need to, you can change -print to -print0 for null-separated values.
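For example, a sketch of consuming the null-separated output (the while loop and git -C are my additions; this form is safe for repository paths containing spaces or even newlines):

# -print0 emits NUL-separated paths; read -d '' consumes them safely.
find . -type d -execdir test -d {}/.git \; -prune -print0 |
while IFS= read -r -d '' repo; do
    git -C "$repo" pull
done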

To also ignore directories containing a .ignore file:

find . -type d \( -execdir test -e {}/.ignore \; -prune \) -o \( -execdir test -d {}/.git \; -prune -print \)

I've added this alias to my ~/.gitconfig file:

[alias]
  repos =  !"find -type d -execdir test -d {}/.git \\; -prune -print"

Then I just need to execute:

git repos

to get a complete listing of all the git repositories anywhere under my current directory.

Greg Barrett
2

Check out the answer using the locate command: Is there any way to list up git repositories in terminal?

The advantages of using locate instead of a custom script are:

  1. The search is indexed, so it scales
  2. It does not require the use (and maintenance) of a custom bash script

The disadvantages of using locate are:

  1. The db that locate uses is updated weekly, so freshly-created git repositories won't show up (though the index can be refreshed manually; see the sketch below)
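A sketch of refreshing the database by hand instead of waiting for the scheduled run (the exact paths are assumptions and vary by OS version):

# OS X: run the locate database updater manually
sudo /usr/libexec/locate.updatedb
# Most Linux distributions:
sudo updatedb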

Going the locate route, here's how to list all git repositories under a directory, for OS X:

Enable locate indexing (will be different on Linux):

sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist

Run this command after indexing completes (might need some tweaking for Linux):

repoBasePath=$HOME
locate '.git' | egrep '\.git$' | egrep "^$repoBasePath" | xargs -I {} dirname "{}"
Clayton Stanley
  • My solution is really fast, and flexible, and adapts to new repos, and does not require a daemon to index the drive periodically. I'm very happy with it. – Mike Slinn Jan 05 '13 at 04:44
0

This answer combines the partial answer provided by @Greg Barrett with my optimized answer above.

#!/bin/bash

# Update all git directories below current directory or specified directory
# Skips directories that contain a file called .ignore

HIGHLIGHT="\e[01;34m"
NORMAL='\e[00m'

export PATH=${PATH/':./:'/:}
export PATH=${PATH/':./bin:'/:}
#echo "$PATH"

DIRS="$( find "$@" -type d \( -execdir test -e {}/.ignore \; -prune \) -o \( -execdir test -d {}/.git \; -prune -print \) )"

echo -e "${HIGHLIGHT}Scanning ${PWD}${NORMAL}"
for d in $DIRS; do
  cd "$d" > /dev/null
  echo -e "\n${HIGHLIGHT}Updating `pwd`$NORMAL"
  git pull 2> >(sed -e 's/X11 forwarding request failed on channel 0//')
  cd - > /dev/null
done
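One caveat (my observation, not part of the original answer): the unquoted $DIRS splits on whitespace, so repository paths containing spaces would break the loop. A sketch of a drop-in replacement for the DIRS/for portion of the script above, streaming find's output instead:

# Same search, but piped into a loop that is safe for paths
# containing spaces (though not embedded newlines).
find "$@" -type d \( -execdir test -e {}/.ignore \; -prune \) \
          -o \( -execdir test -d {}/.git \; -prune -print \) |
while IFS= read -r d; do
  echo -e "\n${HIGHLIGHT}Updating $d${NORMAL}"
  git -C "$d" pull
done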
Mike Slinn