152

145M = .git/objects/pack/

I wrote a script that sums the sizes of the diffs between each commit and its parent, walking backwards from the tip of each branch. It comes to 129MB, uncompressed, and without accounting for files that are identical across branches or for history shared between branches.
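
The measurement was roughly along these lines (a sketch, not the exact script; it walks first-parent history and counts uncompressed diff bytes):

# Sketch: sum the uncompressed size of every commit's diff against its
# parent, for each branch (no deduplication across branches).
for branch in $(git for-each-ref --format='%(refname:short)' refs/heads); do
    for c in $(git rev-list --first-parent "$branch"); do
        git show "$c" | wc -c
    done
done | awk '{ total += $1 } END { printf "%.0f MB\n", total / 1024 / 1024 }'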

Git takes all of those things into account, so I would expect a much, much smaller repository. So why is .git so big?

I've done:

git fsck --full
git gc --prune=today --aggressive
git repack

To answer the questions about how many files/commits: I have 19 branches with about 40 files in each, and 287 commits, found using:

git log --oneline --all | wc -l

It should not take tens of megabytes to store information about this.

the Tin Man
Ian Kelling
  • Linus recommends the following over aggressive gc. Does it make a significant difference? `git repack -a -d --depth=250 --window=250` – Greg Bacon Jun 23 '09 at 01:18
  • thanks gbacon, but no difference. – Ian Kelling Jun 23 '09 at 01:21
  • That's because you are missing the -f. http://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/ – spuder Jan 09 '14 at 06:47
  • `git repack -a -d` shrunk my **956MB** repo to **250MB**. Great success! Thanks! – AlexGrafe May 24 '15 at 12:10
  • One caveat I found was that if you have git submodules, the .git repos of the submodules show up in the super module's .git directory, so `du` may be misleading about the super module being large when the space is in fact used by a submodule; the answers below then need to be run in the submodule's directory. – esmit Jan 12 '21 at 22:21

13 Answers

164

Some scripts I use:

git-fatfiles

# List the 40 largest blobs in all of history, annotated with size and path.
# The command substitution builds "-e s/<hash>/<size>/p" sed expressions for
# the 40 biggest blobs; sed then replaces those hashes with byte sizes in
# the outer rev-list output and prints only the matching lines.
git rev-list --all --objects | \
    sed -n $(git rev-list --objects --all | \
    cut -f1 -d' ' | \
    git cat-file --batch-check | \
    grep blob | \
    sort -n -k 3 | \
    tail -n40 | \
    while read hash type size; do
         echo -n "-e s/$hash/$size/p ";
    done) | \
    sort -n -k1
...
89076 images/screenshots/properties.png
103472 images/screenshots/signals.png
9434202 video/parasite-intro.avi

If you want more lines, see also the Perl version in a neighbouring answer: https://stackoverflow.com/a/45366030/266720

git-eradicate (for video/parasite.avi):

# Rewrite every branch, removing the file from each commit's index:
git filter-branch -f  --index-filter \
    'git rm --force --cached --ignore-unmatch video/parasite-intro.avi' \
     -- --all
# Drop the backup refs and reflog entries that still reference the old
# history, then garbage-collect the now-unreachable objects:
rm -Rf .git/refs/original && \
    git reflog expire --expire=now --all && \
    git gc --aggressive && \
    git prune

Note: the second script is designed to remove the information from Git completely (including all reflog entries). Use with caution.
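
To verify that the rewrite actually freed space, compare the output of `git count-objects -v` (in particular the size-pack figure, reported in KiB) before and after:

git count-objects -v   # compare size-pack before and after the rewrite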

Vi.
  • Finally... Ironically I saw this answer earlier in my search but it looked too complicated... after trying other things, this one started to make sense and voila! – msanteler Sep 29 '14 at 04:27
  • @msanteler, The former (`git-fatfiles`) script emerged when I asked the question on IRC (Freenode/#git). I saved the best version to a file, then posted it as an answer here. (I can't find the original author in the IRC logs, though.) – Vi. Sep 30 '14 at 22:04
  • This works very well initially. But when I fetch or pull from the remote again, it just copies all the big files back into the archive. How do I prevent that? – pir Oct 19 '15 at 13:07
  • @felbo, Then the problem is probably not just in your local repository, but in other repositories as well. Maybe you need to do the procedure everywhere, or force everybody to abandon the original branches and switch to the rewritten branches. It is not easy in a big team and needs cooperation between developers and/or manager intervention. Sometimes just leaving the lodestone inside can be the better option. – Vi. Oct 19 '15 at 15:12
  • This function is great, but it's unimaginably slow. It can't even finish on my computer if I remove the 40-line limit. FYI, I just added an answer with a more efficient version of this function. Check it out if you want to use this logic on a big repository, or if you want to see the sizes summed per file or per folder. – piojo Jul 28 '17 at 07:59
  • A true gem! Thanks. – Marcel Stör Aug 16 '18 at 08:48
  • I've committed a 10Mb image, noticed the mess, resized it to 100Kb and committed again with the same name. Your script for listing fat files now lists two files with the same name. When using filter-branch, how does it know which one to delete? – fabda01 Dec 25 '18 at 13:46
  • @yellow01, You'll need more advanced solution. Or filter branch starting from the commit where you had the image removed (then rebase the rest on top of it). – Vi. Dec 26 '18 at 09:18
  • How do I use that script? Is it a terminal command? If so, it did nothing in my case. – Ahmadreza May 08 '19 at 06:30
  • The fastest (and easiest) way to clean up a bloated GIT history is to use the BFG (https://rtyley.github.io/bfg-repo-cleaner/) – Ru887321 Apr 05 '20 at 04:59
69

I recently pulled the wrong remote repository into the local one (git remote add ... and git remote update). After deleting the unwanted remote ref, branches and tags I still had 1.4GB (!) of wasted space in my repository. I was only able to get rid of this by cloning it with git clone file:///path/to/repository. Note that the file:// makes a world of difference when cloning a local repository - only the referenced objects are copied across, not the whole directory structure.
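
In other words (a sketch; substitute your own paths):

git clone file:///path/to/repository /path/to/clean-clone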

Edit: Here's Ian's one liner for recreating all branches in the new repo:

d1=/path/to/original/repo   # original repo (use absolute paths so the cd calls work)
d2=/path/to/new/repo        # new repo (must already exist)
cd "$d1"
for b in $(git branch | cut -c 3-)
do
    git checkout "$b"
    x=$(git rev-parse HEAD)
    cd "$d2"
    git checkout -b "$b" "$x"
    cd "$d1"
done
Ian Kelling
pgs
  • wow. THANK YOU. .git = 15M now!! after cloning, here is a little one-liner for preserving your previous branches: `d1=#original repo; d2=#new repo; cd $d1; for b in $(git branch | cut -c 3-); do git checkout $b; x=$(git rev-parse HEAD); cd $d2; git checkout -b $b $x; cd $d1; done` – Ian Kelling Jun 25 '09 at 00:31
  • if you check this, you could add the 1 liner to your answer so its formatted as code. – Ian Kelling Jun 25 '09 at 00:36
  • I foolishly added a bunch of video files to my repo, and had to reset --soft HEAD^ and recommit. The .git/objects dir was huge after that, and this was the only way that got it back down. However, I didn't like the way the one-liner changed my branch names around (it showed origin/branchname instead of just branchname), so I went a step further and executed some sketchy surgery: I deleted the .git/objects directory from the original, and put in the one from the clone. That did the trick, leaving all of the original branches, refs, etc. intact, and everything seems to work (crossing fingers). – Jack Senechal Jan 04 '11 at 12:01
    thanks for the tip about the file:// clone, that did the trick for me – adam.wulf Apr 02 '12 at 04:32
  • Be careful, `git` just links to the original when cloning locally (to save space, why have the same stuff twice?). Yes, you get a small clone; no, you _can not_ delete the original, that would break the clone. – vonbrand Mar 22 '13 at 19:56
  • @vonbrand if you hard link to a file and delete the original file, nothing happens except that a reference counter gets decremented from 2 to 1. Only if that counter gets decremented to 0 is the space freed for other files on the fs. So no, even if the files were hard linked, nothing would happen if the original got deleted. – stefreak Mar 29 '13 at 11:35
  • @IanKelling please add that the new repo dir should already exist. I just messed up my repo because directory #2 didn't exist... – Maarten Wolzak Dec 10 '14 at 14:31
  • OMGolly! Not sure why this worked but this is fantastic. – Dennis Oct 15 '19 at 16:36
68

git gc already does a git repack, so there is no sense in manually repacking unless you are going to be passing some special options to it.

The first step is to see whether the majority of the space is (as would normally be the case) in your object database.

git count-objects -v

This should give a report of how many unpacked objects there are in your repository, how much space they take up, how many pack files you have and how much space they take up.
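
For illustration, the output looks something like this (the numbers here are invented; size and size-pack are reported in KiB):

$ git count-objects -v
count: 12
size: 48
in-pack: 1570
packs: 1
size-pack: 148576
prune-packable: 0
garbage: 0
size-garbage: 0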

Ideally, after a repack, you would have no unpacked objects and one pack file, but it's perfectly normal for some objects which aren't directly referenced by current branches to still be present and unpacked.

If you have a single large pack and you want to know what is taking up the space then you can list the objects which make up the pack along with how they are stored.

git verify-pack -v .git/objects/pack/pack-*.idx

Note that verify-pack takes an index file and not the pack file itself. This gives a report of every object in the pack, its true size and its packed size, as well as information about whether it has been 'deltified' and, if so, the origin of the delta chain.

To see if there are any unusually large objects in your repository, you can sort the output numerically on the third or fourth column (e.g. | sort -k3n).

From this output you will be able to see the contents of any object using the git show command, although it is not possible to see exactly where in the commit history of the repository the object is referenced. If you need to do this, try something from this question.
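
For example (a sketch; take a hash from the verify-pack listing above):

git cat-file -t <sha1>   # what kind of object is it?
git show <sha1>          # print it: contents for a blob, diff for a commit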

CB Bailey
    This found the big objects great. The accepted answer got rid of them. – Ian Kelling Jun 25 '09 at 00:59
  • The difference between git gc and git repack, according to Linus Torvalds: http://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/ – spuder Jan 09 '14 at 00:17
37

Just FYI, the biggest reason why you may end up with unwanted objects being kept around is that Git maintains a reflog.

The reflog is there to save your butt when you accidentally delete your master branch or somehow otherwise catastrophically damage your repository.

The easiest way to fix this is to truncate your reflogs before compressing (just make sure that you never want to go back to any of the commits in the reflog).

git gc --prune=now --aggressive
git repack

This is different from git gc --prune=today in that it prunes unreachable objects immediately, rather than only those more than a day old.
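
If reflog entries are what is keeping objects alive, you can also expire the reflogs explicitly before compressing (a sketch, using the same commands as the cleanup step in the accepted answer):

git reflog expire --expire=now --all
git gc --prune=now --aggressive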

John Gietzen
  • This one did it for me! I went from about 5GB to 32MB. – Hawkee Sep 28 '16 at 21:03
  • This answer seemed easier to do but unfortunately did not work for me. In my case I was working on a just cloned repository. Is that the reason? – Mert Apr 10 '17 at 08:48
14

If you want to find what files are taking up space in your git repository, run

git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -5

Then, extract the blob reference that takes up the most space (the last line), and check which filename is taking up so much space:

git rev-list --objects --all | grep <reference>

This might even be a file that you removed with git rm, but Git remembers it because there are still references to it, such as tags, remotes and the reflog.

Once you know what file you want to get rid of, I recommend using git forget-blob

https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/

It is easy to use; just run:

git forget-blob file-to-forget

This will remove every reference from git, remove the blob from every commit in history, and run garbage collection to free up the space.

nachoparker
8

The git-fatfiles script from Vi's answer is lovely if you want to see the size of all your blobs, but it's so slow as to be unusable. I removed the 40-line output limit, and it tried to use all my computer's RAM instead of finishing. So I rewrote it: this version is thousands of times faster, adds some optional features, and fixes a strange bug (the old version would give inaccurate counts if you summed the output to see the total space used by a file).

#!/usr/bin/perl
use warnings;
use strict;
use IPC::Open2;
use v5.14;

# Try to get the "format_bytes" function:
my $canFormat = eval {
    require Number::Bytes::Human;
    Number::Bytes::Human->import('format_bytes');
    1;
};
my $format_bytes;
if ($canFormat) {
    $format_bytes = \&format_bytes;
}
else {
    $format_bytes = sub { return shift; };
}

# parse arguments:
my ($directories, $sum);
{
    my $arg = $ARGV[0] // "";
    if ($arg eq "--sum" || $arg eq "-s") {
        $sum = 1;
    }
    elsif ($arg eq "--directories" || $arg eq "-d") {
        $directories = 1;
        $sum = 1;
    }
    elsif ($arg) {
        print "Usage: $0 [ --sum, -s | --directories, -d ]\n";
        exit 1;
    } 
}

# the format is [hash, file]
my %revList = map { (split(' ', $_))[0 => 1]; } qx(git rev-list --all --objects);
my $pid = open2(my $childOut, my $childIn, "git cat-file --batch-check");

# The format is (hash => size)
my %hashSizes = map {
    print $childIn $_ . "\n";
    my @blobData = split(' ', <$childOut>);
    if ($blobData[1] eq 'blob') {
        # [hash, size]
        $blobData[0] => $blobData[2];
    }
    else {
        ();
    }
} keys %revList;
close($childIn);
waitpid($pid, 0);

# Need to filter because some aren't files--there are useless directories in this list.
# Format is name => size.
my %fileSizes =
    map { exists($hashSizes{$_}) ? ($revList{$_} => $hashSizes{$_}) : () } keys %revList;


my @sortedSizes;
if ($sum) {
    my %fileSizeSums;
    if ($directories) {
        while (my ($name, $size) = each %fileSizes) {
            # strip off the trailing part of the filename:
            $fileSizeSums{$name =~ s|/[^/]*$||r} += $size;
        }
    }
    else {
        while (my ($name, $size) = each %fileSizes) {
            $fileSizeSums{$name} += $size;
        }
    }

    @sortedSizes = map { [$_, $fileSizeSums{$_}] }
        sort { $fileSizeSums{$a} <=> $fileSizeSums{$b} } keys %fileSizeSums;
}
else {
    # Print the space taken by each file/blob, sorted by size
    @sortedSizes = map { [$_, $fileSizes{$_}] }
        sort { $fileSizes{$a} <=> $fileSizes{$b} } keys %fileSizes;

}

for my $fileSize (@sortedSizes) {
    printf "%s\t%s\n", $format_bytes->($fileSize->[1]), $fileSize->[0];
}

Name this git-fatfiles.pl and run it. To see the disk space used by all revisions of each file, use the --sum option. To see the same thing summed per directory, use the --directories option. If you install the Number::Bytes::Human CPAN module (run "cpan Number::Bytes::Human"), the sizes will be human-readable: "21M /path/to/file.mp4".
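
For example (assuming the script is saved in the current directory and run from inside a repository):

chmod +x git-fatfiles.pl
./git-fatfiles.pl --sum | tail -n 5   # the five files using the most space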

piojo
4

Are you sure you are counting just the .pack files and not the .idx files? They are in the same directory as the .pack files, but do not contain any of the repository data (as the extension indicates, they are nothing more than indexes for the corresponding pack). In fact, if you know the correct command, you can easily recreate them from the pack file, and Git itself does this when cloning, as only a pack file is transferred using the native Git protocol.

As a representative sample, I took a look at my local clone of the linux-2.6 repository, running du inside .git/objects/pack:

$ du -c *.pack
505888  total

$ du -c *.idx
34300   total

This indicates that an overhead of around 7% should be common.

There are also files outside objects/; in my personal experience, index and gitk.cache tend to be the biggest of them (totaling 11M in my clone of the linux-2.6 repository).

CesarB
3

Other Git objects stored in .git include trees, commits, and tags. Commits and tags are small, but trees can get big, particularly if you have a very large number of small files in your repository. How many files and how many commits do you have?
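
A quick way to get those numbers (a sketch):

git ls-files | wc -l          # files tracked on the current branch
git rev-list --all --count    # commits reachable from any ref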

Greg Hewgill
  • Good question. 19 branches with about 40 files in each. git count-objects -v says "in-pack: 1570". Not sure exactly what that means or how to count how many commits I have. A few hundred I'd guess. – Ian Kelling Jun 23 '09 at 01:00
  • Ok, it doesn't sound like that is the answer then. A few hundred will be insignificant compared to 145 MB. – Greg Hewgill Jun 23 '09 at 01:13
2

Did you try using git repack?
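
If a plain git repack doesn't help, the variant recommended in the comments on the question may (the -f forces Git to recompute all deltas):

git repack -a -d -f --depth=250 --window=250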

baudtack
2

Before running git filter-branch and git gc, you should review the tags that are present in your repo. Any real system which has automatic tagging for things like continuous integration and deployments will keep unwanted objects referenced by these tags, so gc can't remove them, and you will keep wondering why the repository is still so big.

The best way to get rid of all the unwanted stuff is to run git filter-branch and git gc, and then push master to a new bare repository. The new bare repository will have the cleaned-up tree.
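
For instance, to find and drop a tag that is pinning old objects (the tag name here is hypothetical):

git tag                                      # list all tags
git tag -d ci-build-1234                     # delete the unwanted tag locally
git push origin :refs/tags/ci-build-1234     # and on the remote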

v_abhi_v
1

This can happen if you added a big chunk of files accidentally and staged them, without necessarily committing them. It can happen in a Rails app when you run bundle install --deployment and then accidentally git add ., so you see all the files added under vendor/bundle. You unstage them, but they have already made it into Git's history, so you have to apply Vi's answer, replace video/parasite-intro.avi with vendor/bundle, and then run the second command he provides.
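
That is, adapting Vi's first command (a sketch; note the added -r, since vendor/bundle is a directory):

git filter-branch -f --index-filter \
    'git rm -r --force --cached --ignore-unmatch vendor/bundle' \
    -- --all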

You can see the difference with git count-objects -v: in my case, before applying the script the size-pack was 52K, and afterwards it was 3.8K.

juliangonzalez
1

It is worth checking the stacktrace.log. It is basically an error log for tracing commits that failed. I recently found out that my stacktrace.log was 65.5GB while my app was 66.7GB.

Nes
-1

Create a new branch in which the current commit is the initial commit, with all history gone, to reduce the size of the Git objects and history.

Note: Please read the comment before running the code.

  1. git checkout --orphan latest_branch
  2. git add -A
  3. git commit -a -m "Initial commit message" # commit the changes
  4. git branch -D master # delete the master branch
  5. git branch -m master # rename the current branch to master
  6. git push -f origin master # force-push to master
  7. git gc --aggressive --prune=all # remove the old files
cyperpunk