78

The scenario

Imagine I am forced to work with some of my files always stored inside .zip files. Some of the files inside the zip are small text files and change often, while others are larger but luckily rather static (e.g. images).

If I want to place these zip files inside a git repository, each zip is treated as a blob, so whenever I commit the repository grows by the size of the zip file... even if only one small text file inside changed!

Why this is realistic

MS Word 2007/2010 .docx and Excel .xlsx files are ZIP files...

What I want

Is there, by any chance, a way to tell git to not treat zips as files, but rather as directories and treat their contents as files?

The advantages

But it couldn't work, you say?

I realize that without extra metadata this would lead to some amount of ambiguity: on a git checkout git would have to decide whether to create foo.zip/bar.txt as a file in a regular directory or a zip file. However this could be solved through config options, I would think.

Two ideas how it could be done (if it doesn't exist yet)

  • using a library such as minizip or IO::Compress::Zip inside git
  • somehow adding a filesystem layer such that git actually sees zip files as directories to start with
Community
Jonas Heidelberg
  • The scenario with `.docx` files makes sense, but in many other cases you might want to consider tracking the individual files normally with git and only *building* the resulting `.zip` using an appropriate build tool like `make`. – pixelistik Nov 28 '13 at 10:11
  • Considering that two zip files that look different to each other can hold the exact same data (for example a text file zipped two times with two different compression levels), this becomes much trickier. While it is easy to represent the diff between the two versions of the unzipped files with little information, I guess representing the diff between the two versions of the archive (which is essentially what git has to do) with about as little information would be non-trivial. – HelloGoodbye Dec 23 '13 at 14:53
  • Did you ever end up with an implemented solution of [Jeff's answer](https://stackoverflow.com/a/8001900/321973) or any thing else? I'm wondering about basically the same except [for tar archives](https://stackoverflow.com/q/37000849/321973), which should yield a compatible answer... – Tobias Kienzler May 03 '16 at 09:56
  • SAP's Information Design Tool (IDT) creates a similar file structure for its `UNX` format. It's also recursive: it contains a `BLX` file and a `DFX` file, which are both archives, which correspond to its 'business layer' and 'data foundation', respectively. I'd like to have a solution as well. – craig Mar 02 '17 at 22:22
  • JetBrains' built-in VCS does allow you to look inside zip-type files. Very useful, but it requires you to review e.g. PRs inside the IDE. Now that Microsoft has taken over, we might see this in the GitHub PR diff as well. – vincent Sep 21 '18 at 12:57
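
The point raised in the comments, that two archives can differ byte for byte while holding identical data, can be checked with a small sketch. It uses Python's standard `zipfile` module; the helper names `make_zip` and `read_member` are made up for this illustration, and it compares "stored" against "deflated" members rather than two compression levels, simply because that difference is guaranteed.

```python
import io
import zipfile


def make_zip(payload: bytes, method: int) -> bytes:
    """Build an in-memory zip with a single member, using the given method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Fixed timestamp, so only the compression method differs.
        info = zipfile.ZipInfo("bar.txt", date_time=(2020, 1, 1, 0, 0, 0))
        zf.writestr(info, payload, compress_type=method)
    return buf.getvalue()


def read_member(archive: bytes, name: str) -> bytes:
    """Return the uncompressed content of one member."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


payload = b"hello world\n" * 100
stored = make_zip(payload, zipfile.ZIP_STORED)
deflated = make_zip(payload, zipfile.ZIP_DEFLATED)

# The archives are different blobs as far as git is concerned...
archives_identical = stored == deflated
# ...although the data inside them is the same.
contents_identical = read_member(stored, "bar.txt") == read_member(deflated, "bar.txt")
```

Any scheme that unpacks on commit and repacks on checkout therefore has to decide which of the many possible byte representations it writes back.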

8 Answers

26

This doesn't exist, but it could easily exist in the current framework. Just as git displays binary and ASCII files differently when performing a diff, it could be told to offer special treatment to certain file types through the configuration interface.

If you don't want to change the code base (although this is kind of a cool idea you've got), you could also script it for yourself by using pre-commit and post-checkout hooks: unzip and store the files on commit, then return them to their .zip state on checkout. You would have to restrict these actions to the files (blobs / index entries) that were actually specified by git add.

Either way is a bit of work -- it's just a question of whether the other git commands are aware of what's going on and play nicely.
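
As a rough sketch of what such a pair of hooks would have to do, here is the unpack/repack round trip in Python, using only the standard library. The function names and the `foo.zip.content` directory naming are assumptions made for this illustration; nothing here is part of git, and wiring it into real hooks (including staging the extracted files) is left out.

```python
import shutil
import zipfile
from pathlib import Path


def content_dir(archive: Path) -> Path:
    """Hypothetical naming scheme: foo.zip -> foo.zip.content, next to the archive."""
    return archive.with_name(archive.name + ".content")


def unpack(archive: Path) -> Path:
    """What a pre-commit hook could do: extract the archive so git sees its members."""
    target = content_dir(archive)
    if target.exists():
        # Drop stale files from an earlier extraction.
        shutil.rmtree(target)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


def repack(directory: Path, archive: Path) -> None:
    """What a post-checkout hook could do: rebuild the archive from the tracked files."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sorted order keeps the member order stable between runs.
        for path in sorted(p for p in directory.rglob("*") if p.is_file()):
            zf.write(path, path.relative_to(directory).as_posix())
```

Note that `repack` does not reproduce the original archive byte for byte (timestamps, member order and compression settings may differ), which matters for applications that are picky about the layout of their files.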

hoijui
Jeff Ferland
  • Hooks do seem like a good direction to look in; I thought briefly of that but wasn't sure whether it could work. The pre-commit hook can modify both the file system and the staging area? – Jonas Heidelberg Nov 03 '11 at 22:50
  • @Jonas Did you ever end up doing this and is there a chance of you posting a worked solution? I would love to usefully track changes to spreadsheets in git and CSV is just not fit for our purposes. – Ruben Nov 13 '13 at 10:46
  • Note that using scripts that would unzip archived files before committing them to the repository and compress the files again upon checkout, a commit immediately followed by a checkout would be likely to modify the archives, even though the files stored inside of the archive would be unchanged. – HelloGoodbye Dec 23 '13 at 15:06
  • I just wrote some hooks to do this. Still working on the rough edges, but could be helpful: https://github.com/ckrf/xlsx-git – katriel Apr 24 '15 at 05:39
15

Zippey - A solution using git file filter

My solution is to use a filter to "flatten" the zip file into a monolithic, expanded (and possibly huge) text file. During git add/commit the zip file is automatically expanded to this text format for normal text diffing, and during checkout it is automatically zipped up again.

The text file is composed of records, each of which represents a file in the zip, so you can think of this text file as a text-based image of the original zip. If a file in the zip is indeed text, it is copied into the text file as-is; otherwise, it is base64 encoded before being copied into the text file. This keeps the text file always a text file.

Although this filter does not make each file in the zip a blob, text files are mapped line to line (the line being the unit of the diff), while changes to binary files show up as updates of their corresponding base64 text. I think this is equivalent to what the OP imagines.
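
To make the record idea concrete, here is a minimal sketch of such a flattening in Python. It is not Zippey's actual code or file format: the JSON header line, the function names `flatten` and `unflatten`, and the rule "whatever decodes as UTF-8 counts as text" are all assumptions chosen to keep the example short.

```python
import base64
import io
import json
import zipfile


def flatten(archive: bytes) -> str:
    """Turn a zip into text: one JSON header line per member, then its payload."""
    out = []
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for name in sorted(zf.namelist()):
            data = zf.read(name)
            try:
                mode, body = "text", data.decode("utf-8")
            except UnicodeDecodeError:
                # Binary members are stored as base64 so the result stays text.
                mode, body = "base64", base64.b64encode(data).decode("ascii")
            header = json.dumps({"name": name, "mode": mode, "length": len(body)})
            out.append(header + "\n" + body + "\n")
    return "".join(out)


def unflatten(text: str) -> bytes:
    """Rebuild a zip from the text produced by flatten()."""
    buf = io.BytesIO()
    pos = 0
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        while pos < len(text):
            end = text.index("\n", pos)
            header = json.loads(text[pos:end])
            body = text[end + 1 : end + 1 + header["length"]]
            pos = end + 1 + header["length"] + 1  # skip the newline after the body
            if header["mode"] == "text":
                data = body.encode("utf-8")
            else:
                data = base64.b64decode(body)
            zf.writestr(header["name"], data)
    return buf.getvalue()
```

In a git setup, `flatten` would play the role of the clean filter and `unflatten` that of the smudge filter. A one-line change in a text member then appears as a one-line change in the flattened file.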

For details and prototype code, see the following link:

Zippey Git file filter

Also, credit to the place that inspired me about this solution: Description of how file filter works

hoijui
Sippey
  • This filter is still under development, if you have questions or any suggestions let me know. – Sippey Apr 18 '14 at 09:24
  • I tried this out and I think it should work well for me. I would just add something to the documentation that the text file list zippey.py has to be modified to include whatever file types you want zippey.py to recognize as text files. – mteng Nov 22 '14 at 01:33
  • Huge files like that are not friendly with many tools. I'm especially thinking of github 50 MB limit – PPC Apr 19 '18 at 10:10
  • I'm no fan of a monolithic file, because it would build a file too large to push to github (100MB), and does not allow fine tracking – PPC Jun 26 '18 at 14:07
  • It is worth noting that you have no `LICENSE` file or anything equivalent in your repository. [No license = all rights reserved](https://choosealicense.com/no-permission/). – user5532169 Nov 16 '18 at 09:52
  • some small improvements to zippey on [my repo](https://bitbucket.org/hoijui/zippey/src/master/) – hoijui Feb 28 '19 at 11:32
13

Use bup (presented in detail in GitMinutes #24)

It is the only git-like system designed to deal with large (even very, very large) files, which means every new version of a zip file only increases the repo by its delta (instead of by a full additional copy).

The result is an actual git repo, that a regular Git command can read.

I detail how bup differs from Git in "git with large files".


Any other workaround (like git-annex) isn't entirely satisfactory, as detailed in "git-annex with large files".

Community
VonC
  • This seems very much geared towards very large files, the scenario was geared more towards XML such as docx and xlsx (which are frequently fairly small) zipped up. You'd get a smaller repo size with bup, but would you get to diff actual changes in the XML? – Ruben Nov 21 '13 at 19:33
  • @Ruben this is geared toward large files in size or in number. But it isn't much different from git in term of diff. – VonC Nov 21 '13 at 19:56
  • Looks interesting, but can you use it with your actual git repo? – kutschkem Mar 20 '15 at 14:20
  • @kutschkem I don't think so: a bup repo is a git repo (https://raw.githubusercontent.com/bup/bup/master/DESIGN), but the reverse doesn't seem to be true. – VonC Mar 20 '15 at 14:24
7

http://tante.cc/2010/06/23/managing-zip-based-file-formats-in-git/

(Note: per comment from Ruben, this is only about getting a proper diff though, not about committing unzipped files.)

Open your ~/.gitconfig file (create it if it does not exist yet) and add the following stanza:

[diff "zip"]
    textconv = unzip -c -a

This uses “unzip -c -a FILENAME” to convert your zip file into ASCII text (unzip -c unzips to STDOUT). The next step is to create/modify the file REPOSITORY/.gitattributes and add the following:

*.pptx diff=zip

which tells git to use the zip-diffing description from the config for files matching the given mask (in this case everything ending with .pptx). Now git diff automatically unzips the files and diffs the ASCII output, which is a little better than just “binary files differ”. On the other hand, due to the convoluted mess that the corresponding XML of pptx files is, it doesn’t help a lot there; but for ZIP files containing text (for example source code archives) this is actually quite handy.
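
If the raw output of unzip is too noisy, the textconv program can be any command that takes a file name as its argument and prints text to STDOUT. The following is a minimal sketch of such a converter in Python; the name `zip_to_text` and the header layout are made up for this example, and binary members are only summarized rather than dumped.

```python
import sys
import zipfile


def zip_to_text(path) -> str:
    """Render a zip as text: a header per member, followed by its decoded content."""
    parts = []
    with zipfile.ZipFile(path) as zf:
        # Sort by name so that a changed member order does not show up as a diff.
        for info in sorted(zf.infolist(), key=lambda i: i.filename):
            if info.is_dir():
                continue
            parts.append(
                f"=== {info.filename} ({info.file_size} bytes, crc {info.CRC:08x}) ===\n"
            )
            try:
                text = zf.read(info).decode("utf-8")
            except UnicodeDecodeError:
                text = "(binary content omitted)"
            parts.append(text if text.endswith("\n") else text + "\n")
    return "".join(parts)


if __name__ == "__main__" and len(sys.argv) == 2:
    # git invokes the textconv command with the path of the file to convert.
    sys.stdout.write(zip_to_text(sys.argv[1]))
```

Saved under a path of your choosing, the script would replace `unzip -c -a` as the textconv value (for example `python3` followed by that path). Because a binary member still shows its size and CRC in the header line, a change to an image is visible in the diff even though its bytes are not.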

Community
  • This is only about getting a proper diff though, not about committing unzipped files. – Ruben Nov 28 '13 at 12:44
  • Thanks. This answers the question I wanted to solve, of showing changes to the text files inside my gzip files when `git diff`ing. I used `[diff "gzip"] = zcat` and `*.gz diff=gzip`. – spazm Jan 30 '17 at 22:42
7

Rezip / ReZipDoc, similar to Zippey by Sippey, makes it possible to handle ZIP files in a nicer way with git.

How it works

When adding/committing a ZIP based file, Rezip unpacks it and repacks it without compression, before adding it to the index/commit. In an uncompressed ZIP file, the archived files appear as-is in its content (together with some binary meta-info before each file). If those archived files are plain-text files, this method will play nicely with git.
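
The effect of repacking without compression can be illustrated with a few lines of Python. Rezip itself is a Java tool and this is not its code; `repack_stored` is a hypothetical helper that only shows the principle, using the standard `zipfile` module.

```python
import io
import zipfile


def repack_stored(archive: bytes) -> bytes:
    """Rewrite a zip so that every member is stored uncompressed, in the same order."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            # Keep name, timestamp and attributes; change only the compression method.
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            stored.external_attr = info.external_attr
            dst.writestr(stored, src.read(info), compress_type=zipfile.ZIP_STORED)
    return out.getvalue()


# Build a small compressed archive to demonstrate the difference.
text = b"<p>Hello, world!</p>\n" * 50
original = io.BytesIO()
with zipfile.ZipFile(original, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", text)

compressed = original.getvalue()
uncompressed = repack_stored(compressed)

# The member text is hidden in the compressed archive but appears verbatim
# in the stored one, which is what git's delta compression can work with.
visible_before = text in compressed
visible_after = text in uncompressed
```

Since the result is still a valid ZIP file, it can be opened directly, at the cost of a larger working file unless a smudge filter recompresses it on checkout.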

Benefits

The main benefit of Rezip over Zippey is that the actual file stored in the repository is still a ZIP file. Thus, in many cases, it will still work as-is with the respective application (for example OpenOffice), even if it is obtained without going through a re-packing-with-compression filter.

How to use

Install the filter(s) on your system:

mkdir -p ~/bin
cd ~/bin

# Download the filter executable (use the raw URL; the "blob" URL returns an HTML page)
wget https://github.com/costerwi/rezip/raw/master/Rezip.class

# Install the add/commit filter
git config --global --replace-all filter.rezip.clean "java -cp ~/bin Rezip --store"

# (optionally) Install the checkout filter
git config --global --add filter.rezip.smudge "java -cp ~/bin Rezip"

Use the filter in your repository, by adding lines like these to your <repo-root>/.gitattributes file:

[attr]textual     diff merge text
[attr]rezip       filter=rezip textual

# MS Office
*.docx  rezip
*.xlsx  rezip
*.pptx  rezip
# OpenOffice
*.odt   rezip
*.ods   rezip
*.odp   rezip
# Misc
*.mcdx  rezip
*.slx   rezip

The textual part is so that these files are actually shown as text files in diffs.

hoijui
  • This sounds really cool! I haven't had the need for this in a while, so never got around to implementing something, but this would definitely be something I would try out. – Jonas Heidelberg Jan 06 '20 at 09:37
2

Here is my approach:

  • Using git clean/smudge filters to replace the archive files with a content summary:

    git config filter.zip.clean "unzip -v %f | tail -n +4 | head -n -2 | awk '{ print \$7,\$8 }' | grep -vE /$ | sort -k 2"
    git config filter.zip.smudge "unzip -v %f | tail -n +4 | head -n -2 | awk '{ print \$7,\$8 }' | grep -vE /$ | sort -k 2"
    
  • Using a pre-commit hook to extract and add the archive content:

    #!/bin/sh
    #
    # Git archive extraction pre commit hook
    #
    # Created: 2021 by Vivien Richter <vivien-richter@outlook.de>
    # License: CC-BY-4.0
    # Version: 1.0.0
    
    # Configuration
    ARCHIVE_EXTENSIONS=$(cat .gitattributes | grep "zip" | tr -d [][:upper:] | cut -d " " -f1 | cut -d. -f2 | head -c -1 | tr "\n" "|")
    
    # Processing
    for STAGED_FILE in $(git diff --name-only --cached | grep -iE "\.($ARCHIVE_EXTENSIONS)$")
    do
        if [ ! -f "$STAGED_FILE" ]; then
            # Deletes the archive content, if the archive itself is removed
            rm -r ".$(basename $STAGED_FILE).content"
        else
            # Extracts archives
            unzip -o $STAGED_FILE -d ".$(basename $STAGED_FILE).content"
        fi
        # Adds extracted or deleted archive content to the stage
        git add ".$(basename $STAGED_FILE).content"
    done
    
  • Using a post-checkout hook for packing the archives again for usage:

    #!/bin/sh
    #
    # Git archive packing post checkout hook
    #
    # Created: 2021 by Vivien Richter <vivien-richter@outlook.de>
    # License: CC-BY-4.0
    # Version: 1.0.0
    
    # Configuration
    ARCHIVE_EXTENSIONS=$(cat .gitattributes | grep "zip" | tr -d [][:upper:] | cut -d " " -f1 | cut -d. -f2 | head -c -1 | tr "\n" "|")
    
    # Remembers the working tree root, so that archives in subdirectories resolve correctly
    ROOT=$(pwd)

    # Processing
    for EXTRACTED_ARCHIVE in $(git ls-tree -dr --full-tree --name-only HEAD | grep -iE "\.($ARCHIVE_EXTENSIONS)\.content$")
    do
        # Gets filename
        FILENAME=$(dirname "$EXTRACTED_ARCHIVE")/$(basename "$EXTRACTED_ARCHIVE" | cut -d. -f2- | awk -F '.content' '{ print $1 }')
        # Removes the dummy archive file
        rm -f "$FILENAME"
        # Creates the real archive file from inside the extracted archive;
        # the subshell restores the working directory afterwards
        (cd "$EXTRACTED_ARCHIVE" && zip -r9 "$ROOT/$FILENAME" $(find . -type f))
    done
    
  • Apply the filter at the .gitattributes file:

    # Macro for all file types that should be treated as ZIP archives.
    [attr]zip filter=zip
    
    # OpenDocument
    *.[oO][dD][tT] zip
    *.[oO][dD][sS] zip
    *.[oO][dD][gG] zip
    *.[oO][dD][pP] zip
    *.[oO][dD][mM] zip
    
    # Krita
    *.[kK][rR][aA] zip
    
    # VRoid Studio
    *.[vV][rR][oO][iI][dD] zip
    *.[fF][vV][pP] zip
    
  • Add some binary treatment to the .gitattributes file:

    # Macro for all binary files that should use Git LFS.
    [attr]bin -text filter=lfs diff=lfs merge=lfs lockable
    
    # Images
    *.[jJ][pP][gG] bin
    *.[jJ][pP][eE][gG] bin
    *.[pP][nN][gG] bin
    *.[aA][pP][nN][gG] bin
    *.[gG][iI][fF] bin
    *.[bB][mM][pP] bin
    *.[tT][gG][aA] bin
    *.[tT][iI][fF] bin
    *.[tT][iI][fF][fF] bin
    *.[sS][vV][gG][zZ] bin
    
  • Add some stuff to the .gitignore file:

    # Auto generated LFS hooks
    .githooks/pre-push
    
    # Temporary files
    *~
    
  • Finally, set everything up:

    1. Install Git LFS
    2. Prepare LFS by issuing the command git lfs install once.
    3. Install the hooks by issuing the command git config core.hooksPath .githooks.
    4. Apply the checkout hook once by issuing the command .githooks/post-checkout.
    5. Apply the filter once by issuing the command git add -A.

For an example see here: https://github.com/vivi90/git-zip

Important BUGFIXes from 14. March & 16. March 2021

Please edit your commit hook (for me .githooks/pre-commit) as follows:

#!/bin/sh
#
# Git archive extraction pre commit hook
#
# Created: 2021 by Vivien Richter <vivien-richter@outlook.de>
# License: CC-BY-4.0
# Version: 1.0.2

# Configuration
ARCHIVE_EXTENSIONS=$(cat .gitattributes | grep "zip" | tr -d [][:upper:] | cut -d " " -f1 | cut -d. -f2 | head -c -1 | tr "\n" "|")

# Processing
for STAGED_FILE in $(git diff --name-only --cached | grep -iE "\.($ARCHIVE_EXTENSIONS)$")
do
    # Directory that holds the extracted content, next to the archive
    CONTENT_DIR="$(dirname "$STAGED_FILE")/.$(basename "$STAGED_FILE").content"
    # Deletes the old archive content
    rm -rf "$CONTENT_DIR"
    # Extracts the archive content, if the archive itself is not removed
    if [ -f "$STAGED_FILE" ]; then
        unzip -o "$STAGED_FILE" -d "$CONTENT_DIR"
    fi
    # Adds extracted or deleted archive content to the stage
    git add "$CONTENT_DIR"
done

Known issues

Sukombu
2

I think you're going to need to mount a zip file to the filesystem. I haven't used it, but consider FUSE:

http://code.google.com/p/fuse-zip/

There is also ZFS for Windows and Linux:

http://users.telenet.be/tfautre/softdev/zfs/

Brad
  • If I understand it correctly, fuse-zip could layer between the file system and git, but zfs would have to be built *into* `git`, right? Too bad I'm not always under Linux with that repo, otherwise fuse-zip would be a really nice idea. – Jonas Heidelberg Nov 03 '11 at 21:35
2

Often there are problems with pre-zipped files for applications, as they expect the zip compression method and file order to be the ones they chose. I believe that OpenOffice .odf files have that problem.

That said, if you are simply using any-old-zip as a method for keeping stuff together, then you should be able to create a few simple aliases which will unzip and re-zip when required. The very latest Msysgit (aka Git for Windows) now has both zip and unzip on the shell code side, so you can use them in aliases.

The project I'm currently working on uses zips as the main local version control / archive, so I'm also trying to get a workable set of aliases for sucking these hundreds of zips into git (and getting them out again ;-) so that the co-workers are happy.

Philip Oakley
  • I just did a few tests for Word 2010 - it seems quite tolerant (`deflate` with different word sizes, `deflate64` and changing file order in the zip file produced by 7zip all did not throw Word off). About using aliases, I was hoping to avoid any extra manual step... currently most of my commits go through TortoiseGit. – Jonas Heidelberg Nov 03 '11 at 22:43