When using code from unknown third parties on GitHub, I always check the code to make sure it contains no obvious backdoors that could compromise the security of my system.

The specific state of the repository I am reviewing is typically identified by a git tag and a commit hash. As we know, a git tag can easily be changed to point at a different commit. So downloading the source code again and trusting it based on the version tag is definitely not secure.

My question is: when doing a fresh download of the source code, can I trust that if I check out a specific commit based on its full commit hash, it is 100% the same code I reviewed before?
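To make the check concrete, here is a minimal sketch of pinning a review to a full commit id and comparing it with what a fresh clone has checked out. It assumes a SHA-1 repository (40 hex digits) and a `git` binary on the PATH; the function names are hypothetical, chosen only for this illustration.

```python
import re
import subprocess

# A full SHA-1 commit id is exactly 40 lowercase hex digits.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def is_full_commit_id(value: str) -> bool:
    """Reject tags, branch names and abbreviated hashes."""
    return FULL_SHA1.fullmatch(value) is not None


def head_commit(repo_dir: str) -> str:
    """Ask git which commit is currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def matches_review(actual: str, reviewed: str) -> bool:
    """True only if `actual` equals a fully specified reviewed commit id."""
    if not is_full_commit_id(reviewed):
        raise ValueError("pin the full 40-character commit id, not a tag or prefix")
    return actual == reviewed
```

After a fresh clone and checkout, something like `matches_review(head_commit("path/to/clone"), reviewed_id)` would then confirm that the checked-out commit is the one that was reviewed.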

The focus of this question is not on the probability of sha1 collisions occurring at all (as a collision is a lot easier to compute than computing a specific sha1 hash - which is - hopefully - pretty much impossible at the moment?), but on whether each and every file is part of this sha1 sum, so that a change would always trigger a different hash.

Chris Maes
Zulakis

2 Answers

In short: yes.

On this page you can see how this sha1 sum is formed. It is composed of:

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit sha1
  • The author info
  • The committer info (right, those are different!)
  • The commit message

So every change in every file feeds into the calculation of the sha1 sum. AFAIK, barring a hash collision, any change to any file gives a different sha1 sum.
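As a rough illustration of the list above, the sketch below rebuilds the text of a simple commit object and hashes it the way Git hashes objects: SHA-1 over a `<type> <size>` header, a NUL byte, and the body. The tree and parent ids are the ones from the example in this answer; the author, committer and message values are made-up placeholders, and the helper names are hypothetical. Real commits can carry extra header lines (a signature, for instance), which this sketch ignores.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over "<type> <size>", a NUL byte, then the object body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parent, author, committer, message):
    """Assemble the text of a simple single-parent commit object."""
    lines = [
        f"tree {tree}",
        f"parent {parent}",
        f"author {author}",
        f"committer {committer}",
        "",  # blank line separates the headers from the message
        message,
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


# Tree and parent ids come from the example in this answer;
# the identity lines and the message are hypothetical placeholders.
BASE = dict(
    tree="563ccb5109fbf0a01d99517ca1dbe15db349592d",
    parent="3c6f0800708aeaaeaba804273406ddcd0b3175ad",
    author="A U Thor <author@example.com> 1456138000 +0100",
    committer="C O Mitter <committer@example.com> 1456138000 +0100",
    message="Example commit",
)

# One arbitrary alternative value per field, to show that each field matters.
VARIANTS = dict(
    tree="37ad71e959c6dadd0e4b7aff8a0c6e85a0eff789",
    parent="256db03954535d25d5f340603e707207170f199c",
    author="Someone Else <else@example.com> 1456138000 +0100",
    committer="C O Mitter <committer@example.com> 1456138999 +0100",
    message="Example commit (amended)",
)

base_id = git_object_id("commit", commit_body(**BASE))
print(f"{'base':<10} {base_id}")

# Changing any single field yields a different commit id.
for field, value in VARIANTS.items():
    changed = dict(BASE, **{field: value})
    print(f"{field:<10} {git_object_id('commit', commit_body(**changed))}")
```

Because the tree id is itself a hash over every file and subdirectory, a change to any file shows up here as a different `tree` line and therefore a different commit id.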

EDIT: I started working through one of my commits:

git cat-file commit HEAD

gives:

tree 563ccb5109fbf0a01d99517ca1dbe15db349592d
parent 3c6f0800708aeaaeaba804273406ddcd0b3175ad
...

now git cat-file -p 563ccb5109fbf0a01d99517ca1dbe15db349592d:

100644 blob d8fe4fa70f618843e9ab2df67167b49565c71f25    .gitignore
100644 blob dba1ba3a31837debf7a28eceb194e86916b88cbc    README
040000 tree 37ad71e959c6dadd0e4b7aff8a0c6e85a0eff789    conf
040000 tree 60eca667ab8b5852ecd2dd2d91d198a3956a8b73    etc
040000 tree 634c4c2ec34aec14142b5991bd3a5126110f2cae    sbin
040000 tree 256db03954535d25d5f340603e707207170f199c    spec
040000 tree 9e1e156f88b842da471f52d4c135f391319b4991    usr

and I can continue deeper: git cat-file -p d8fe4fa70f618843e9ab2df67167b49565c71f25:

/.project

(which is the content of my .gitignore file) or git cat-file -p 256db03954535d25d5f340603e707207170f199c:

100644 blob 591367a913adbeb1c86d674d240fb08ab8ccf78b    base.spec

(which is the content of my "spec" directory).

So as you can see, the contents of each and every file are recursively included: first in the sha1 sum of the file (blob), then in the sha1 sum of the source tree, and finally in the sha1 sum of the commit.
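The chain described above can be sketched in a few lines. This is a simplified model, not Git's implementation: it hashes a tiny hypothetical repository (a `README` plus `spec/base.spec`, with invented contents and identity lines) and shows how editing one nested file changes the id at every level. The helper names are assumptions made for the illustration.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over "<type> <size>", a NUL byte, then the object body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries as stored in a tree object."""
    body = b""
    # Simplification: plain byte-wise sort by name. Git additionally
    # compares directory names as if they ended with "/".
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        # Each entry: "<mode> <name>", a NUL byte, then the raw 20-byte id.
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return body


def hash_snapshot(spec_content: bytes) -> dict:
    """Hash blob -> subtree -> root tree -> commit for a toy repository."""
    blob = git_object_id("blob", spec_content)
    spec_tree = git_object_id("tree", tree_body([("100644", "base.spec", blob)]))
    readme = git_object_id("blob", b"demo\n")
    root_tree = git_object_id("tree", tree_body([
        ("100644", "README", readme),
        ("40000", "spec", spec_tree),  # stored as 40000; cat-file -p prints 040000
    ]))
    commit = (
        f"tree {root_tree}\n"
        "author A U Thor <author@example.com> 1456138000 +0100\n"
        "committer A U Thor <author@example.com> 1456138000 +0100\n"
        "\n"
        "Example commit\n"
    ).encode()
    return {
        "blob": blob,
        "spec tree": spec_tree,
        "root tree": root_tree,
        "commit": git_object_id("commit", commit),
    }


before = hash_snapshot(b"Version: 1\n")
after = hash_snapshot(b"Version: 2\n")
for level in before:
    print(f"{level:<10} {before[level]} -> {after[level]}")
```

Every line of the output differs between the two snapshots, even though only the content of `spec/base.spec` changed.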

Chris Maes
  • Thanks! Can you give an example how to generate the (file-)hashes you are using with `git cat-file -p` from the file content? They don't seem to be a `sha1sum` of the file. – Zulakis Feb 22 '16 at 10:56
  • `git hash-object ` – Chris Maes Feb 22 '16 at 11:03
  • Note the answer of @ChrisMartin: as you said you didn't want to address the probability of collision, the answer is YES as far as I am concerned. As ChrisMartin points out, this possibility still exists, so there is no solid **guarantee** that a collision will never occur. – Chris Maes Feb 22 '16 at 11:05
  • The docs weren't completely transparent on this, but I assume that the hash that `git hash-object` returns is a sha1sum done over the full file contents + something else? (How would I reproduce it manually?) About the collision: from what we know, I would assume that finding a matching collision is possible, but hasn't been done and would require absurd amounts of (today's) computing power - correct? I know that the hash is not a security feature but an integrity feature, but in the absence of a peer review I trust, abusing the integrity feature "commit-hash" is probably the best way to go. – Zulakis Feb 22 '16 at 11:14
  • You can see in this answer: http://stackoverflow.com/a/7225329/2082964 how to recreate that using shasum – Chris Maes Feb 22 '16 at 11:20
  • Be careful of submodules. If the repo has a submodule tracking a branch, that submodule code can change without the parent repo's git commit hash changing. So using the commit hash is safe as long as you confirm there aren't submodules tracking a branch and you aren't careless with things like `git clone --recursive`. – JDiMatteo Oct 23 '17 at 21:45
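
On the question raised in the comments about reproducing the `git hash-object` value by hand: the "something else" is a short header. As a minimal sketch (the function name is hypothetical), Git hashes the word `blob`, a space, the content length in bytes, a NUL byte, and then the raw file content, which is why the result differs from a plain `sha1sum` of the file.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object id Git assigns to a blob with this content."""
    # Header: "blob", a space, the byte length as decimal text, a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


data = b"hello world\n"
print("git blob id :", git_blob_id(data))
print("plain sha1  :", hashlib.sha1(data).hexdigest())
```

For a real file, reading it in binary mode and passing the bytes to `git_blob_id` should give the same id as `git hash-object` on that file, assuming no line-ending or filter conversions are configured for it.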

Git hashes everything, so to your headline and bottom line question: yes.


a collision is a lot easier to compute than computing a specific sha1 hash - which is - hopefully - pretty much impossible at the moment?

Correct on both counts. You could even lose the "pretty much" part, the answer to "is it possible to construct a message having a given SHA1 hash code" is properly "lol, no."

jthill