
We are trying to build an analytics dataset for our Bitbucket repositories. We are relatively new to this and have been trying to work out whether Git (or Bitbucket) provides unique identifiers for branches and commits.

The hashes are not unique - as two commits with the same 'content' have the same hash (see e.g. What is a Git commit ID?)

We are considering hashing the hash with other attributes to generate our own unique identifier, but would rather use a Git native one if it exists.

Chris Maes
wvoq
  • What other attributes do you have in mind? Could you explain the use case? If you re-read the link you've provided, you will see that there can be no _two_ commits with the same hash (unless the same author created the same two changes at the exact same moment in time from the same parent commit - highly unlikely?). But as I said, please provide more info and we'll be able to offer specific advice. – mimikrija Oct 11 '19 at 11:17
  • "two commits with the same 'content' have the same hash". No they have **not**. This is the mistake here, see Chris' answer. – RomainValeri Oct 11 '19 at 11:52
  • @RomainValeri Or, equivalently, the mistake is assuming that the content of a commit is the content of *just* the files stored in the commit, but in fact, it also includes the metadata: the hash ID of the parent commit(s) and the two date-and-time stamps and so on. – torek Oct 11 '19 at 15:56
  • @mimikrija - as I say to Chris below, the problem is a practical one. When we query the commits, the same commit comes back in multiple rows with different branch hashes, which makes it difficult to use the commit hashes as row identifiers. The other 'attributes' in the query - e.g. author - are the same, as you'd expect from the post I linked to. At the moment, we are considering combining the commit and branch hash to get uniqueness. – wvoq Oct 12 '19 at 16:42

1 Answer


Git does create a unique hash for every commit. (Theoretically there is a very small chance of a collision, but it is highly improbable.)

As explained in this answer, Git uses the following information to generate the hash:

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit SHA-1(s)
  • The author info (with timestamp)
  • The committer info (right, those are different, also with timestamp)
  • The commit message

Since the timestamps are part of the hashed data, in practice you will never get the same hash for two commits just because they have the same "content".
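To make the list above concrete, here is a minimal Python sketch of how those fields feed into the hash in a SHA-1 repository. It is a simplification, not Git's actual code: the function name `commit_hash` and the sample author are made up for illustration, and real commits can carry extra headers (for example a signature) that this sketch ignores.

```python
import hashlib


def commit_hash(tree, parents, author, committer, message):
    """Hash a simplified commit object (hypothetical helper, SHA-1 repos only)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")           # "Name <email> unix-time tz"
    lines.append(f"committer {committer}")     # same shape, may differ from author
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # Git prefixes the object with its type and byte length, then a NUL byte.
    header = f"commit {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# Well-known hash of Git's empty tree; the person below is invented.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who_at_t0 = "A U Thor <author@example.com> 1570792620 +0000"
who_at_t1 = "A U Thor <author@example.com> 1570792621 +0000"

first = commit_hash(EMPTY_TREE, [], who_at_t0, who_at_t0, "Initial commit\n")
again = commit_hash(EMPTY_TREE, [], who_at_t0, who_at_t0, "Initial commit\n")
later = commit_hash(EMPTY_TREE, [], who_at_t1, who_at_t1, "Initial commit\n")
```

Here `first` and `again` are identical because every input is identical, while `later` differs even though the tree and message are the same: only the timestamp moved by one second.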

You can, however, have multiple references pointing to the same commit hash: tags and branches.

Conclusion

Depending on what you want you can:

  • Just use the Git commit hashes, since they are unique for each commit. If two branches point to the same commit, you will analyze it only once; depending on your needs, a per-branch analysis may be unnecessary anyway, since it really is the exact same commit.
  • Combine the Git commit hash with the branch name if you really want an analysis for each branch, even when several branches point to the exact same code.
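As a rough illustration of the two options, the Python sketch below works on rows shaped like a query result that returns one row per (branch, commit) pair. The row layout, the field names `commit`, `branch` and `author`, and the shortened hashes are all assumptions made for the example; adapt them to whatever your Bitbucket query actually returns.

```python
from collections import defaultdict

# Hypothetical query result: the same commit appears once per branch containing it.
rows = [
    {"commit": "a1b2c3", "branch": "master", "author": "alice"},
    {"commit": "a1b2c3", "branch": "feature/x", "author": "alice"},
    {"commit": "d4e5f6", "branch": "feature/x", "author": "bob"},
]


def by_commit(rows):
    """Option 1: one record per commit, with branches kept as a separate mapping."""
    commits = {}
    branches = defaultdict(set)
    for row in rows:
        # Keep the commit-level attributes once; they are the same on every row.
        attrs = {k: v for k, v in row.items() if k != "branch"}
        commits.setdefault(row["commit"], attrs)
        branches[row["commit"]].add(row["branch"])
    return commits, branches


def by_commit_and_branch(rows):
    """Option 2: one record per (commit, branch) pair, using a composite key."""
    return {(row["commit"], row["branch"]): row for row in rows}


commits, branches = by_commit(rows)
per_branch = by_commit_and_branch(rows)
```

With this sample data, `commits` holds two entries and `branches["a1b2c3"]` lists both branches, while `per_branch` keeps all three rows under distinct keys.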
Chris Maes
  • Thanks for the answer. I'm aware of this; I linked to a post with the same information. The problem we are having is that when we query the commits, the same commit comes back in multiple rows with different branch hashes, which makes it difficult to use the commit hashes as row identifiers. Maybe the issue here is, as you say, that we are getting the references back rather than the commits. As you say, we can combine the commit and branch hash to get uniqueness. – wvoq Oct 12 '19 at 16:37