
We have a number of git repositories which have grown to an unmanageable size due to the historical inclusion of binary test files and Java .jar files.

We are just about to go through the exercise of git filter-branching these repositories and re-cloning them everywhere they are used (from dozens to hundreds of deployments each, depending on the repo), but given the problems with rewriting history I was wondering if there might be any other solutions.

Ideally I would like to externalise the problem files without rewriting the history of each repository. In theory this should be possible, because you are checking out the same files, with the same sizes and the same hashes, just sourcing them from a different place (a remote rather than the local object store). Alas, none of the potential solutions I have found so far appear to allow me to do this.

Starting with git-annex, the closest I could find to a solution to my problem was How to retroactively annex a file already in a git repo, but as with just removing the large files, this requires the history to be rewritten to convert the original git add into a git annex add.

Moving on from there, I started looking at other projects listed on what git-annex is not, so I examined git-bigfiles, git-media and git-fat. Unfortunately we can't use the git-bigfiles fork of git since we are an Eclipse shop and use a mixture of git and EGit. It doesn't look like git-media or git-fat can do what I want either, since while you could replace existing large files with the external equivalents, you would still need to rewrite the history in order to remove large files which had already been committed.

So, is it possible to slim a .git repository without rewriting history, or should we go back to the plan of using git filter-branch and a whole load of redeployments?


As an aside, I believe that this should be possible, but it is probably tied to the same limitations as git's current shallow clone implementation.

Git already supports multiple possible locations for the same blob, since any given blob could be in the loose object store (.git/objects) or in a pack file (.git/objects/pack), so theoretically you would just need something like git-annex to be hooked in at that level rather than higher up (i.e. have the concept of a download-on-demand remote blob, if you like). Unfortunately I can't find anyone having implemented or even suggested anything like this.

Mark Booth
  • As far as I can tell you are asking how to rewrite history without rewriting history. – alternative Jul 11 '13 at 13:52
  • @alternative not quite, I'm asking if there is a way to slim the repository *without* rewriting the history. At the moment it looks like using *shallow clones* might be the only way, but their limitations probably wouldn't work well with our workflow, and even if they did, they would only slim the local (clone) repos, not the remote bare repos. – Mark Booth Jul 11 '13 at 16:32
  • The only way to "slim" the repository would be to delete the content you are slimming - hence, rewriting (which is why every answer says that this is not possible). There are not truly any problems with rewriting history as long as you do it correctly. And yes, shallow clones would only affect the local repositories. – alternative Jul 11 '13 at 19:27
  • @alternative - If you are working in a small team and have few external collaborators (forks on github) then rewriting history isn't a big deal. If you have dozens of developers, collaborators and even more clones, then the cost of forcing all of those ref updates can quickly spiral out of control. – Mark Booth Jul 18 '13 at 12:39

4 Answers


Sort of. You can use Git's replace feature to set aside the big bloated history so that it is only downloaded if needed. It's like a shallow clone, but without a shallow clone's limitations.

The idea is you reboot a branch by creating a new root commit, then cherry-pick the old branch's tip commit. Normally you would lose all of the history this way (which also means you don't have to clone those big .jar files), but if the history is needed you can fetch the historical commits and use git replace to seamlessly stitch them back in.

See Scott Chacon's excellent blog post for a detailed explanation and walk-through.
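
A rough sketch of one way to do the reboot (the branch and remote names `slim-master` and `historical` are illustrative assumptions, not part of any tool; see the post for the exact procedure):

    # Create a new root commit whose tree matches the current tip of master;
    # its message points people at the archived history.
    new_root=$(git commit-tree 'master^{tree}' \
        -m 'Reboot: full history is archived in the historical repo')

    # Start the slimmed branch from that root; future work continues here.
    git branch slim-master "$new_root"

    # Later, anyone who needs the old history can stitch it back in
    # (assumes a remote named historical pointing at the archived repo).
    git fetch historical
    git replace "$new_root" historical/master

Replace mappings live under refs/replace/, which ordinary clones do not fetch by default, so fresh clones stay small unless someone explicitly pulls in the history.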

Advantages of this approach:

  • History is not modified. If you need to go back to an older commit, complete with its big .jars and everything, you still can.
  • If you don't need to look at the old history, the size of your local clone is nice and small, and any fresh clones you make won't require downloading tons of mostly-useless data.

Disadvantages of this approach:

  • The complete history is not available by default—users need to jump through some hoops to get at the history.
  • If you do need frequent access to the history, you'll end up downloading the bloated commits anyway.
  • This approach still has some of the same problems as rewriting history. For example, if your new repository looks like this:

    * modify bar (master)
    |
    * modify foo  <--replace-->  * modify foo (historical/master)
    |                            |
    * instructions               * remove all of the big .jar files
                                 |
                                 * add another jar
                                 |
                                 * modify a jar
                                 |
    

    and someone has an old branch off of the historical branch that they merge in:

    * merge feature xyz into master (master)
    |\__________________________
    |                           \
    * modify bar                 * add feature xyz
    |                            |
    * modify foo  <--replace-->  * modify foo (historical/master)
    |                            |
    * instructions               * remove all of the big .jar files
                                 |
                                 * add another jar
                                 |
                                 * modify a jar
                                 |
    

    then the big historical commits will reappear in your main repository and you're back to where you started. Note that this is no worse than rewriting history—someone might accidentally merge in the pre-rewrite commits.

    This can be mitigated by adding an update hook in your shared repository to reject any pushes that would reintroduce the historical root commit(s).
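
    For example, a minimal sketch of such an update hook (the commit ID below is a placeholder for your actual historical root):

        #!/bin/sh
        # update hook: called with <refname> <old-sha> <new-sha>
        refname=$1 oldrev=$2 newrev=$3

        # Placeholder: the root commit of the archived (historical) history.
        historical_root=0123456789abcdef0123456789abcdef01234567

        # Ref deletions arrive with an all-zero new sha; let them through.
        zero=0000000000000000000000000000000000000000
        [ "$newrev" = "$zero" ] && exit 0

        # Reject the push if the pushed tip has the historical root as an ancestor.
        if git merge-base --is-ancestor "$historical_root" "$newrev" 2>/dev/null; then
            echo "push rejected: would reintroduce the historical commits" >&2
            exit 1
        fi
        exit 0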

Richard Hansen
  • Wow, thanks Richard, this looks like it might be just what I've been looking for. I'll see if I can get it to work next week and if so, there will be a tick coming your way too... – Mark Booth Jul 13 '13 at 12:16
  • Ah, I see, so the example rewrites the history of *recent commits* to remove the large historical commits without needing to rewrite the history of those *historical commits*, but uses `git replace` to allow you to bring back the *historical commits* later if you need to. So, that's not quite what I'm after, but I'll think some more about how I can make use of it to solve my problem. – Mark Booth Jul 13 '13 at 12:39
  • I do wish I'd known about this when we created our `git` repos from our old `svn` repo. Instead of having to choose between starting a new epoch with no history from `svn` or starting our `git` repo with years of accumulated `svn` cruft, we could have just kept our entire `svn` repo in a set of historical `git` repos and then used `git replace` to bring them back when they were needed. In fact, I wonder whether we might still be able to go back and add retrospective `git replace` targets. Interesting, very interesting... – Mark Booth Jul 13 '13 at 12:44
  • @MarkBooth: Yes, you can append old history with `git replace`. It’s not too late ;). – Chronial Jul 18 '13 at 12:13
  • Thanks Richard, I presented this solution to my team today and we have decided to trial this approach with one particularly messy repository. All we need now is for jgit/egit to support `git replace`, which at the moment it doesn't. – Mark Booth Jul 18 '13 at 12:25
  • @MarkBooth you could have a look at grafts – they are very similar and might be supported, since they are a lot older. But note that this approach inherits all of the problems of the history-rewriting approach, so as long as you know there are big files that should not be in the repo, you are probably better off removing them from history. – Chronial Jul 18 '13 at 15:33

No, that is not possible – you will have to rewrite history. But here are some pointers for that:

  • As VonC mentioned: if it fits your scenario, use BFG Repo-Cleaner – it’s a lot easier to use than git filter-branch.
  • You do not need to clone again! Just run these commands instead of git pull and you will be fine (replace origin and master with your remote and branch):

    git fetch origin                  # fetch the rewritten history from the server
    git reset --hard origin/master    # make the local branch match the rewritten remote branch
    

    But note that unlike git pull, this will make you lose all local changes that have not yet been pushed to the server.

  • It helps a lot if you (or somebody else in your team) fully understand how git sees history, and what git pull, git merge and git rebase (and also git rebase --onto) do. Then give everybody involved a quick training on how to handle this rewrite situation (5–10 mins should be enough for the basic dos and don’ts).
  • Be aware that git filter-branch does not cause any harm in itself, but it causes a lot of standard workflows to become harmful. If people don’t act accordingly and merge the old history, you might just have to rewrite history again if you don’t notice soon enough.
  • You can prevent people from merging (more precisely, pushing) the old history by writing a short (about 5 lines) update hook on the server: just check whether the history of the pushed head contains a specific old commit, as in the sketch below.
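
The check itself really is only a few lines. A minimal sketch (the commit ID is a placeholder; in an update hook the pushed tip arrives as `$3`):

    # reject the push if the pushed tip still contains the old pre-rewrite commit
    old_commit=0123456789abcdef0123456789abcdef01234567   # placeholder ID
    if git merge-base --is-ancestor "$old_commit" "$3"; then
        echo "push rejected: contains pre-rewrite history" >&2
        exit 1
    fi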
Chronial
  • Thanks Chronial. The only real problem with *not* re-cloning is having to `reset` each and every branch used locally (to get rid of all local refs to the obsolete branch) and running `git gc --prune=now --aggressive` to actually shrink the repo. If you do this and the repo *doesn't* shrink, then you know that you missed a ref somewhere. Re-cloning removes the need for all of these steps (we deploy our 20 or so `git` repos using `buckminster` so re-cloning *everything* is easy for us). Sadly we also use gitolite for hosting our `git` repos, which reserves the `update` hook for its own use. – Mark Booth Jul 11 '13 at 12:48
  • I don't know *gitolite*, but [hooks and gitolite](http://gitolite.com/gitolite/cust.html#hooks) says that *You can install any hooks except these: (all repos) gitolite reserves the `update` hook* so I will have to wait until our gitolite expert gets back to tell me if there is a way around this. – Mark Booth Jul 11 '13 at 16:26
  • @MarkBooth a custom update hook in gitolite V3 is called a VREF (like in this answer: http://stackoverflow.com/a/11517112/6309), and you can define as many "gitolite-update hooks" (or VRefs) as you need: http://stackoverflow.com/a/10888358/6309. Gitolite V2 would use hook chaining (http://stackoverflow.com/a/15941289/6309). – VonC Jul 11 '13 at 19:32

I don't know of a solution which would avoid rewriting the history.

In that case, cleaning the repo with a tool like BFG Repo-Cleaner is the easiest solution (easier than git filter-branch).

VonC

I honestly can't think of a way to do that. If you think about what Git "promises" you as a user with regard to data integrity, there is no way to remove a file from the repository and keep the same hashes. In other words, if what you're asking were possible, then Git would be a lot less reliable...

Assaf Lavie