60

We use git to distribute an operating system and keep it up to date. We can't distribute the full repository since it's too large (>2GB), so we have been using shallow clones (~300M). Recently, however, fetching from a shallow clone has started to pull down the entire >2GB repository. This is an untenable waste of bandwidth for deployments.

The git documentation says you cannot fetch from a shallow repository, though that's not strictly true. Are there any workarounds to make a git clone --depth 1 able to fetch just what has changed since it was made? Or some other strategy to keep the distribution size as small as possible whilst having all the bits git needs to do an update?

I tried cloning with --depth 20 to see whether it would update more efficiently, but that didn't work. I also looked into http://git-scm.com/docs/git-bundle, but that seems to create huge bundles.

hendry
  • "but that seems to create huge bundles": only for the first one. After that, you can create incremental bundles. – VonC Oct 14 '13 at 07:57
  • My initial distribution cannot be huge... – hendry Oct 14 '13 at 09:12
  • You will have to try again fetching for your shallow clone with Git 1.9/2.0 (Q1 2014): those operations are now much more efficient. See [my answer below](http://stackoverflow.com/a/21217326/6309) – VonC Jan 19 '14 at 13:25
  • Git 2.5 (Q2 2015) supports a single fetch commit! I have edited my answer below, now referencing "[Pull a specific commit from a remote git repository](http://stackoverflow.com/a/30701724/6309)". – VonC Jun 08 '15 at 05:35

5 Answers

48

--depth is a git fetch option. I see the doc doesn't really highlight that git clone does a fetch.

When you fetch, the two repos swap info on who has what by starting from the remote's heads and searching backward for the most recent shared commit in the fetched refs' histories, then filling in all the missing objects to complete just the new commits between the most recent shared commits and the newly fetched ones.

A --depth=1 fetch just gets the branch tips and no prior history. Further fetches of those histories will fetch everything new by the above procedure, but if the previously-fetched commits aren't in the newly fetched history, fetch will retrieve all of it -- unless you limit the fetch with --depth.

Your client did a depth=1 fetch from one repo and switched urls to a different repo. At least one long ancestry path in this new repo's refs apparently shares no commits with anything currently in your repo. That might be worth investigating, but either way unless there's some particular reason, your clients can just do every fetch --depth=1.
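
The "every fetch is --depth=1" approach can be tried out locally. This is only a minimal sketch under stated assumptions, not the asker's real setup: the `upstream` and `client` directories and the demo identity are made up, and `reset --hard FETCH_HEAD` is just one assumed way to move a deployed checkout onto the fetched tip.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream repository with three commits.
git init -q upstream
for i in 1 2 3; do
  git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "commit $i"
done
branch=$(git -C upstream symbolic-ref --short HEAD)

# Initial shallow distribution. A file:// URL is used because --depth
# is ignored for plain local-path clones.
git clone -q --depth 1 "file://$work/upstream" client

# Upstream moves on.
git -C upstream -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "commit 4"

# Update: fetch only the new tip, then move the checkout onto it.
git -C client fetch -q --depth=1 origin "$branch"
git -C client reset -q --hard FETCH_HEAD

client_commits=$(git -C client rev-list --count HEAD)
echo "commits visible in client: $client_commits"
```

The client ends up on the upstream tip while its visible history stays one commit deep.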

jthill
  • As you can see in my [test](https://github.com/Webconverger/webc/issues/174), I reset hard to a26424 which is in the remote https://github.com/Webconverger/webc/commits/master. So I don't understand why it just doesn't fetch everything new. How can I compare remote refs? `git ls-remote` only shows tags/branches ... – hendry Oct 16 '13 at 05:23
  • You switched repos. You have ten branches and seventeen tags in this new repo, and at least one of them references a long ancestry having no commits in common with any history presently in your repo. – jthill Oct 16 '13 at 05:54
  • So.. IIUC, I should prune the branches/tags on http://github.com/webconverger/webc (the new repo), to ensure everything is in common with say "a26424"? – hendry Oct 16 '13 at 07:01
  • Or fetch only the refs you want (to set defaults see the [`remote.<name>.fetch`](https://www.kernel.org/pub/software/scm/git/docs/git-fetch.html#_named_remote_in_configuration_file) entry in `fetch`'s discussion of how to configure remotes) – jthill Oct 16 '13 at 07:25
  • IIUC --depth 1 is the way to go, though we didn't implement it that way since my colleague discovered a bug, which is now fixed in https://github.com/git/git/commit/238504b014230d0bc244fb0de84990863fcddd59 So we are waiting to hear back from github whether that's deployed, and then we will be using it. – hendry Oct 17 '13 at 01:25
  • That might be easiest. With an unrestricted refspec you'll still be fetching 27 commits the first time. Have you checked what refs the old and new repos have in common, or rather don't? – jthill Oct 17 '13 at 02:14
  • I haven't actually figured out how to check common or excluded refs easily. Any tips? – hendry Oct 21 '13 at 02:23
  • `git ls-remote` will tell you all the remote's refs, `git branch -a` and `git tag` will tell you all the ones you have. – jthill Oct 21 '13 at 03:13
  • `git fetch -v -v -v` I've found to be very useful btw – hendry Oct 23 '13 at 07:42
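
The suggestions in these comments (list the remote's refs with `git ls-remote`, then fetch only the refs you want by default) can be sketched as follows. The repositories, the `unrelated` branch, and the demo identity are assumptions made up for illustration; only `git ls-remote` and the `remote.<name>.fetch` setting come from the discussion above.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream: the default branch plus a branch with unrelated history.
git init -q upstream
g() { git -C upstream -c user.name=demo -c user.email=demo@example.com "$@"; }
g commit -q --allow-empty -m "main history"
branch=$(g symbolic-ref --short HEAD)
g checkout -q --orphan unrelated
g commit -q --allow-empty -m "unrelated history"
g checkout -q "$branch"

git init -q client
git -C client remote add origin "file://$work/upstream"

# Everything the remote advertises, for comparison with local refs.
git -C client ls-remote origin

# Restrict the default fetch to the one branch we care about.
git -C client config remote.origin.fetch \
  "+refs/heads/$branch:refs/remotes/origin/$branch"
git -C client fetch -q --depth=1 origin

wanted_ref=$(git -C client for-each-ref --format='%(refname)' "refs/remotes/origin/$branch")
unrelated_ref=$(git -C client for-each-ref --format='%(refname)' refs/remotes/origin/unrelated)
echo "fetched: $wanted_ref"
```

With the narrowed refspec, a plain `git fetch origin` no longer pulls in the branch whose history shares nothing with the client's.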
35

Just did git clone github.com:torvalds/linux and it took so long that I aborted it with CTRL+C.

Then I did git clone github.com:torvalds/linux --depth 1 and it cloned quite fast, and I have only one commit in git log.

So clone --depth 1 should work. If you need to update an existing repository, use git fetch origin remoteBranch:localBranch --depth 1. It works too, and it fetches only one commit.

Summing up:

Initial clone:

git clone git_url --depth 1

Code update:

git fetch origin remoteBranch:localBranch --depth 1
Sachin Joseph
Waterlink
  • I'd like to add the depth thing to the config, so I can do git fetch origin without needing to remember the depth filter. Is that possible? – artfulrobot Feb 21 '15 at 10:15
  • Yes you may want to create an alias. Here is the manual on aliasing in git: http://git-scm.com/book/en/v2/Git-Basics-Git-Aliases – Waterlink Feb 21 '15 at 19:26
  • Only *this* solution worked for me (--unshallow doesn't work). Key was `branch:branch` – FractalSpace Feb 17 '17 at 19:20
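
The alias idea from the comments might look like the sketch below. The alias name `fetch-shallow` is hypothetical, as are the demo repositories and identity; the alias simply expands to the `fetch --depth 1` command from the answer.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream and an initial shallow clone of it.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "first"
branch=$(git -C upstream symbolic-ref --short HEAD)
git clone -q --depth 1 "file://$work/upstream" client

# Upstream gains a commit after the clone was made.
git -C upstream -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "second"

# Hypothetical alias so the depth does not have to be remembered.
git -C client config alias.fetch-shallow 'fetch --depth 1'
git -C client fetch-shallow -q origin

tracked=$(git -C client rev-parse "refs/remotes/origin/$branch")
depth_seen=$(git -C client rev-list --count "refs/remotes/origin/$branch")
echo "origin/$branch is now $tracked ($depth_seen commit visible)"
```

Use `git config --global` instead of the per-repository setting if the alias should be available in every clone.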
12

Note that Git 1.9/2.0 (Q1 2014) could be more efficient in fetching for a shallow clone.
See commit 82fba2b, from Nguyễn Thái Ngọc Duy (pclouds):

Now that git supports data transfer from or to a shallow clone, these limitations are not true anymore.

All the details are in "shallow.c: the 8 steps to select new commits for .git/shallow".

You can see the consequence in commits like 0d7d285, f2c681c, and c29a7b8, which support clone and send-pack/receive-pack with/from shallow clones.
smart-http now supports shallow fetch/clone too.
You can even clone from a shallow repo.

Update 2015: git 2.5+ (Q2 2015) will even allow for a single commit fetch! See "Pull a specific commit from a remote git repository".
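
A single-commit fetch could be tried along these lines. This is a hedged sketch rather than a recipe from the linked answer: the repositories and identity are made up, and it assumes the serving side opts in through `uploadpack.allowReachableSHA1InWant`, which is set here on the hypothetical upstream.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream with three commits.
git init -q upstream
for i in 1 2 3; do
  git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "commit $i"
done

# Assumed server-side opt-in: allow clients to request a reachable
# commit by its SHA-1 even though no ref points at it directly.
git -C upstream config uploadpack.allowReachableSHA1InWant true

# Ask for the middle commit, which is not the tip of any branch or tag.
wanted=$(git -C upstream rev-parse HEAD~1)

git init -q client
git -C client fetch -q --depth 1 "file://$work/upstream" "$wanted"

fetched=$(git -C client rev-parse FETCH_HEAD)
visible=$(git -C client rev-list --count FETCH_HEAD)
echo "fetched $fetched ($visible commit visible)"
```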

Update 2016 (Oct.): git 2.11+ (Q4 2016) allows for fetching:

VonC
9

If you can select a specific branch, it can be even faster. Here's an example using Spark master branch and latest tag:

Initial clone

git clone git@github.com:apache/spark.git --branch master --single-branch --depth 1

Update to specific tag

git fetch --depth 1 origin tags/v1.6.0

It becomes very fast to switch tags/branch this way.
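
The same two steps can be exercised against a throwaway local repository. This is a sketch under assumptions: the `v1.0` and `v2.0` tags, the directories, and the demo identity are invented, and checking out `FETCH_HEAD` detached is one assumed way to switch to the fetched tag.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream: two tagged releases followed by untagged work.
git init -q upstream
g() { git -C upstream -c user.name=demo -c user.email=demo@example.com "$@"; }
g commit -q --allow-empty -m "release 1"
g tag v1.0
g commit -q --allow-empty -m "release 2"
g tag v2.0
g commit -q --allow-empty -m "ongoing work"
branch=$(g symbolic-ref --short HEAD)

# Initial clone: one branch, one commit deep.
git clone -q --branch "$branch" --single-branch --depth 1 \
  "file://$work/upstream" client

# Update to a specific tag, again only one commit deep.
git -C client fetch -q --depth 1 origin tags/v2.0
git -C client checkout -q --detach FETCH_HEAD

at_tag=$(git -C client rev-parse HEAD)
visible=$(git -C client rev-list --count HEAD)
echo "checked out $at_tag ($visible commit visible)"
```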

Martin Tapp
1

I don't know if it suits your set-up, but what I do is keep a full clone of the repo in a separate directory. Then I do a shallow clone from the remote repository with a reference to the local one.

git clone --depth 1 --reference /path/to/local/clone git@some.com/group/repo.git 

That way only the differences with the reference repository and remote are actually fetched. To make it even quicker you can use the --shared option, but be sure to read about the restrictions in the git documentation (it can be dangerous).

Also, I found out that in some circumstances, when the remote has changed a lot, the clone starts fetching too much data. In that case it is good to interrupt it, update the reference repo (which, strangely, takes much less bandwidth than it took in the first place), and then start the clone again.
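
For anyone who wants to see the reference mechanism in isolation, here is a small local sketch. The `upstream`, `reference`, and `client` directories and the demo identity are assumptions; the only part taken from the answer is the `--depth 1 --reference` clone.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream with some history.
git init -q upstream
g() { git -C upstream -c user.name=demo -c user.email=demo@example.com "$@"; }
g commit -q --allow-empty -m "commit 1"
g commit -q --allow-empty -m "commit 2"

# Full clone kept on the machine to serve as the reference repository.
git clone -q "file://$work/upstream" reference

# Upstream moves on after the reference was taken.
g commit -q --allow-empty -m "commit 3"

# Shallow clone that borrows objects already present in the reference.
git clone -q --depth 1 --reference "$work/reference" \
  "file://$work/upstream" client

# The borrowing is recorded in the alternates file of the new clone.
alternates="$work/client/.git/objects/info/alternates"
cat "$alternates"
client_head=$(git -C client rev-parse HEAD)
```

The new clone depends on the reference repository staying in place, so deleting or pruning the reference can break it.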

Rajish