Pull only latest commits to shallow clone

Question

I've seen many questions about merging into a shallow pull but most seem to be way out of date and others are not clear.

Our git repository history is enormous. This was caused by some erroneous committing of massive files in the past (these have since been removed). Performing a full clone takes forever since it pulls those files and only then deletes them. It also results in a massive .git directory on the computer.

To solve this we performed a shallow clone with depth of 1. This all works great and we are able to work and commit and merge back to the master. However, if there are changes to the master then I need to pull them into my branch. This is where the problem starts. The pull now goes back and fetches all the history of master just like the full clone. What I need is only the changes since my last pull.

So, is there any way to tell it to do that? Will another pull with depth of 1 solve my problem?

The branch I am working on is a special feature which will take some time to complete, so it will be running "parallel" to the master for some time before it is merged in. Will pulling depth 1 disconnect my branch from master and thus prevent me from merging back?

score 2 · Accepted Answer · answered Sep 05 '17 at 14:35

Don't think about pull; that's like trying to rub your belly and pat your head at the same time. It can be done, but it's best done after mastering each half. :-) So, break it up into its constituent parts:

1. fetch
2. merge

Now, remembering that clone is essentially init + remote add + fetch + checkout, we can see that a shallow clone is really a shallow fetch.

This means you want to modify step 1, the fetch, to be shallow.
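As a rough sketch of that split, the helper below runs the two halves one after the other. The function name, the default remote `origin`, the default branch `master`, and the default depth are all assumptions for illustration. Whether the merge half succeeds depends on the depth you pass.

```shell
# Hypothetical helper: run the two halves of "pull" separately.
# Usage: shallow_fetch_then_merge [remote] [branch] [depth]
shallow_fetch_then_merge() {
    remote=${1:-origin}
    branch=${2:-master}
    depth=${3:-1}

    # Step 1: shallow fetch. The explicit refspec updates the
    # remote-tracking name even in a single-branch clone.
    git fetch --depth="$depth" "$remote" \
        "+refs/heads/$branch:refs/remotes/$remote/$branch" || return 1

    # Step 2: merge what was fetched. This fails if the fetch was
    # too shallow for Git to find a merge base.
    git merge "$remote/$branch"
}
```

For example, `shallow_fetch_then_merge origin master 50` fetches the newest 50 commits of master and then merges `origin/master` into the current branch.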

So far, this is no big deal, but now we get to step 2, merge. Doing a merge requires some amount of depth. But—how much depth? Well, this is why we split the problem. The amount of depth required depends on the nodes in the graph, and to get the nodes, we have to fetch them. Assume, for the moment, that we've fetched "enough" of them, however many that is: we'll now have a graph that looks like this:

...--o--*--o--o   <-- yourbranch (HEAD)
         \
          o--o--o   <-- other

The other argument to git merge other, typically a remote tracking branch name like origin/master, specifies some other commit that you intend to merge. Git needs your branch tip commit (the o node to which yourbranch, e.g., master, points), their branch tip commit, and the merge base commit which I marked with *.

For Git to find that merge base, and prove to Git's satisfaction that this is the merge base, Git needs all the commits between * and each branch tip (including * itself).

How many is that? Well, that depends on the actual graph. We drew a graph where the top line—your branch—requires three: the tip, the first step back from the tip, and the second step back from the tip. Our bottom line requires four: the tip, two steps back from the tip, and one more step backwards to reach *.

Hence, for this graph, the --depth required would be 4, because the larger of 3 and 4 is 4.

How many does it take for your graph? Well, that depends on your graph! There's no telling in advance: until you have enough of the graph to find out if you have enough of the graph, you don't have enough of the graph. Once you do have enough of the graph, you can find the merge base, then count the "deepest line".

Note that we drew a very simple graph. It could be more complex, e.g.:

...--o--*--o--...--o   <-- yours (HEAD)
         \
          \      o--o--o
           \    /       \
            o--o         o--o   <-- other
                \       /
                 o-----o

To find this graph's minimum depth, trace the straight line across the top from yours back to *, and trace both lines across the bottom from other back to *. (Obviously the top half of the bottom split will be the longer line, so we can be a little lazy and just count those nodes.)

Where to do this counting

The problem is now obvious: in order to count back to (and including) the merge base, we must find the merge base, which means we need enough depth to be able to find the merge base.

Unfortunately, we have to do this in a repository that has all the commits, and you've made new commits in your repository that aren't in the other repository. It would be easy if we could just push your commits to the main repository and do the work there.

(We don't actually have to get the counts all at once. It suffices to get "your count" and "their count" and take the larger of those two. The details get kind of sticky; I will let you work them out, if you want to go this way. Think about whether the merge base is contained in the sub-graph you've obtained thus far, or not; these are the two cases that you would have to implement.)

One solution, then, is literally to do just that: push your commits to a repository that already has everything else. (See VonC's answer to "Why can't I push from a shallow clone?" for constraints on this operation.) To do this, you will need to write to a name on that more-complete clone that is reserved for this kind of operation. For instance, you could have a reserved branch or tag name, for-blitz-count-trick or some such:

git push $remote HEAD:for-blitz-count-trick

Then have the Git at $remote do the computation of the merge base and count the commits. Then simply delete the name for-blitz-count-trick entirely so that you are ready for the next time you need to do this again.
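The push and the later clean-up might be wrapped as in the sketch below. The two function names are made up for illustration. Spelling the destination as a full `refs/heads/` name is an added precaution, so that the push does not depend on Git guessing what kind of ref to create.

```shell
# Hypothetical helpers around the reserved temporary name.
scratch_ref=refs/heads/for-blitz-count-trick

# Publish the current commit under the reserved name on remote $1.
push_scratch_ref() {
    git push "$1" "HEAD:$scratch_ref"
}

# Remove the reserved name from remote $1 once the counting is done.
delete_scratch_ref() {
    git push "$1" --delete "$scratch_ref"
}
```

Typical use would be `push_scratch_ref origin`, then the counting on the remote side, then `delete_scratch_ref origin`.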

Let's assume for the moment that you plan to run git merge $remote/other, so that the name on $remote is other, and you've done this special push. You now log in to $remote, where you can compute the correct --depth.

If you're willing to overshoot, possibly by a fairly large margin, the command sequence:

base=$(git merge-base for-blitz-count-trick other)
git rev-list --count --ancestry-path for-blitz-count-trick ^$base^@
git rev-list --count --ancestry-path other ^$base^@

should do the job.

I have not actually tested this, but it's all based on the obvious graph operations. For cases like the more complex graph I showed, it counts all the nodes on all fork-and-join sequences, which is why it risks overcounting. I use $base^@ to include commit $base while excluding its parents. It's also worth noting that if there is no common merge base, or if there are multiple merge bases, this method will fail, so it might be a good idea to check that there is exactly one merge base.
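A minimal sketch of that check plus the "take the larger count" step, meant to run in the complete repository. The function name `required_depth` is hypothetical, and the counting here is a variation on the commands above: it counts the commits strictly after the merge base on each side and then adds one for the base itself.

```shell
# Hypothetical helper: print the --depth needed to merge tip_b into
# tip_a. Fails unless the two tips have exactly one merge base.
required_depth() {
    tip_a=$1
    tip_b=$2

    # merge-base exits non-zero when there is no common ancestor.
    bases=$(git merge-base --all "$tip_a" "$tip_b") || return 1
    set -- $bases
    if [ "$#" -ne 1 ]; then
        echo "expected exactly one merge base, found $#" >&2
        return 1
    fi
    base=$1

    # Commits strictly after the base on each side, plus 1 for the base.
    count_a=$(git rev-list --count --ancestry-path "$base..$tip_a") || return 1
    count_b=$(git rev-list --count --ancestry-path "$base..$tip_b") || return 1
    depth_a=$((count_a + 1))
    depth_b=$((count_b + 1))

    # The deeper side decides.
    if [ "$depth_a" -gt "$depth_b" ]; then
        echo "$depth_a"
    else
        echo "$depth_b"
    fi
}
```

On the simple graph drawn earlier, `required_depth yourbranch other` should print 4, the larger of 3 and 4.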

I don't think --ancestry-path can be combined with --left-right, but the similar command:

git rev-list --count --left-right --boundary for-blitz-count-trick...other

should work too, at the risk of overcounting boundary commits in some cases, since --boundary is implemented kind of sloppily. This one won't fail in the presence of multiple merge bases, and gets both counts in one command, so it might be the way to go in practice.
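Here is one way the two counts could be turned into a single number. This sketch deliberately leaves out --boundary and adds one to each side instead, to account for a merge base. That is my variation, not the command above. The function name is hypothetical, and the result should be read as an upper bound, not an exact depth.

```shell
# Hypothetical helper: one rev-list call covers both sides.
# Prints an upper bound on the --depth needed for tips $1 and $2.
depth_upper_bound() {
    # Output has the form "<left count><TAB><right count>".
    counts=$(git rev-list --count --left-right "$1...$2") || return 1
    set -- $counts
    left=$(($1 + 1))     # +1 so that a merge base is included
    right=$(($2 + 1))
    if [ "$left" -gt "$right" ]; then
        echo "$left"
    else
        echo "$right"
    fi
}
```

Run on the remote as, say, `depth_upper_bound for-blitz-count-trick other`.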

If that's impossible (or even just too annoying)

It may be the case that you cannot log in to $remote to do this work there, or that some policy prevents creating a temporary name there, or both. In that case, you can simply repeatedly increase your clone depth until you can find the merge base, or have entirely un-shallowed your clone (the latter occurs if and only if there is no merge base).
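That loop could look roughly like the sketch below. The function name, the defaults (`origin`, `master`, a step of 50), and the explicit refspec are assumptions. It also assumes that the shallow boundary under your own branch is reachable from the remote branch, as it is when your branch started from a shallow clone of that branch.

```shell
# Hypothetical helper: deepen until HEAD and the remote-tracking
# branch share a merge base, or the clone is no longer shallow.
# Usage: deepen_until_mergeable [remote] [branch] [step]
deepen_until_mergeable() {
    remote=${1:-origin}
    branch=${2:-master}
    step=${3:-50}

    until git merge-base HEAD "$remote/$branch" >/dev/null 2>&1; do
        if [ "$(git rev-parse --is-shallow-repository)" = false ]; then
            echo "complete history fetched, but no merge base" >&2
            return 1
        fi
        # Extend the history by $step commits past the current boundary.
        git fetch --deepen="$step" "$remote" \
            "+refs/heads/$branch:refs/remotes/$remote/$branch" || return 1
    done
}
```

Once it returns successfully, `git merge origin/master` (or a rebase) has a merge base to work from.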

Fundamentally, the problem is that you need enough depth to compute how much depth is enough. Once you have that depth, you can "re-shallow" to the exact number, whatever it may be, but there's no real point to this "re-shallowing". As the repository itself grows by adding new commits, the --depth required will also tend to grow, though if your work is repeatedly merged back (and pushed) that will tend to shrink the required --depth.

Practically speaking, it's probably sufficient to add, say, 50 at a time until you have enough depth, then stay at that depth, whatever it is, until it proves too shallow; then increase it. Note that you will need to store this number yourself somewhere: see "How to know the depth of a git's shallow clone?"
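One place to keep that number is the clone's own configuration file. In this sketch the key name `shallowclone.depth` and both function names are invented for illustration. Git itself attaches no meaning to the key.

```shell
# Hypothetical helpers: remember the working depth in this clone's
# config. The key "shallowclone.depth" is made up, not a Git setting.
depth_key=shallowclone.depth

# Record depth $1 for later fetches.
save_depth() {
    git config --local "$depth_key" "$1"
}

# Print the saved depth, or the fallback ($1, default 50) if unset.
load_depth() {
    git config --local --get "$depth_key" || echo "${1:-50}"
}
```

A later fetch can then use `git fetch --depth="$(load_depth)" origin master`, with `save_depth 100` recording each increase.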

Hence the ugly but practical method: Just pick some depth that works, and use it until it doesn't work, then increase it. And never run git pull, just break it into its constituent git fetch and other commands (usually git merge but git rebase is fine too).

"Don't think about pull; that's like trying to rub your belly and pat your head at the same time."... darn, here I am thinking about "pull --rebase", which does a lot of merges! — VonC, Sep 05 '17 at 15:12
Thanks. In the end we decided to skirt the problem by copying the whole project from someone else. A really dirty trick but one that will make life easier in the future. Kinda avoiding the problem I guess. — theblitz, Sep 05 '17 at 15:32
@VonC: yes, but we're highly trained acrobats. Or is it [monkeys](https://en.wikipedia.org/wiki/Code_Monkey_(song))? — torek, Sep 05 '17 at 15:50
@torek we were, but now that role is taken by https://stackoverflow.com/users/3440248/drunken-code-monkey ;) (kidding, I just like the user name) — VonC, Sep 05 '17 at 15:58
