Appending git history from a new repository to an old one

Question

I want to sort of "append" the Git history of a new repo to an old one. I have two repositories:

old_repository: remote: remote_old, commits: 400 commits
new_repository: remote: remote_new, commits: 200 commits

These two repositories are completely different and are based in different accounts with different remotes. I've added the contents of the new repository to the old repository. Now I also want to "merge" their histories, i.e., I want to take all the commits from the new repository and append them to those of the old repository.

I tried to look for an answer, but couldn't find anything definitive. I do not want to mess around with the old repository because its contents run in production, which is why I'd like to know the safest way to do this. Any help or direction would be really helpful!

Thanks!

Have you checked [this](https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories) post? — biqarboy, Feb 10 '21 at 04:55
@biqarboy Yes I have. But in this scenario, A is a SUB-PART of B. In my case the NEW is the OLD repository. I do not wish to create sub-trees. — hrishikeshpaul, Feb 10 '21 at 05:03

score 0 · Answer 1 · answered Feb 10 '21 at 14:11

There is no simple, easy, one-size-fits-all answer to this problem, because different people have different constraints and tastes as to what results are acceptable.

The (or a) root problem here is that in Git, history is the set of commits in the repository, in the form of a commit graph ... but at the same time, the commits are numbered by hash ID and the numbers take into account the previous commit graph so far.

Let's make a small illustration. Instead of 400 commits in repo O (for old) and 200 commits in repo N (for new), let's have just four commits in O and just two in N, like this:

A <-B <-C <-D   <-- main

E <-F   <-- main

Obviously commits A through D are the contents of O, while E and F are the contents of N.

It's trivial to put both sets of commits into any repository. There's only one minor issue with this, and that's that both O and N use the branch name main to find the "last" commit (D and F respectively). But branch names don't matter: we can change them to branch-O and branch-N if we like, and put all of this in our build-up-new-history repository:

A--B--C--D   <-- branch-O

E--------F   <-- branch-N
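As a minimal sketch of that step, the following builds two throwaway repositories shaped like the illustration and fetches each one's main into its own branch name in a third repository. The directory names O, N, and combined, the demo identity, and the make_repo helper are all made up for this example; with real repositories you would fetch from their remote URLs instead of ../O and ../N.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: make_repo <dir> <name>... creates one commit per name
# on a branch called main.
make_repo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for name in "$@"; do
    echo "$name" > "$dir/$name.txt"
    git -C "$dir" add "$name.txt"
    git -C "$dir" commit -q -m "$name"
  done
}
make_repo O A B C D   # stand-in for the old repository
make_repo N E F       # stand-in for the new repository

# The build-up repository: same commits, different branch names.
git init -q combined && cd combined
git fetch -q ../O main:branch-O
git fetch -q ../N main:branch-N

# List both branch tips; the hash IDs match the tips in O and N.
git for-each-ref --format='%(refname:short) %(objectname:short)' refs/heads
```

Nothing in O or N is modified by this; fetching only copies commits into the new repository.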

One simple solution to your existing problem—to combine these into a single history—is to use git merge or equivalent to build a merge commit M whose parents, plural, are commits D and F, in either order:

A--B--C--D   <-- branch-O
          \
           M   <-- main
          /
E--------F   <-- branch-N

The snapshot for commit M is up to you. You can build it from scratch, or do whatever you like. For example:

git branch main branch-O
git switch main
git merge --no-commit --allow-unrelated-histories branch-N
git rm -rf .
cp -R /tmp/already-prepared-files/* .
git add .
git commit

will build a new merge commit whose contents are the prepared files. The git rm -rf . step wipes out all the work that git merge did to merge files, including any and all conflicts that git merge encountered. The cp -R step drops the prepared files here, and the git add . step adds them as the content for new merge commit M. (There are more elegant ways to do this: the above is a straightforward, brute-force nuke-and-pave approach that's meant to be easy to understand.)

What's good about this approach is that all the existing commit numbers, in all the existing repositories, are retained. The hash ID of commit A is still valid, because in the assembled repository with merge commit M, commit A is still there, exactly as it was before. The same holds for all other existing commits. But what's bad about this approach is the history bifurcates at this merge. The content of merge commit M was, as we've just seen, totally arbitrary. Somehow you came up with what you claim is the Correct Merge, and just jammed it into place. If your Correct Merge is truly correct, that's fine, but nobody can see how you came up with this Correct Merge, hence nobody can see why it's (supposedly) correct.
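To check these claims on scratch data, here is a sketch that performs a plain merge (letting Git combine the files rather than replacing them with prepared ones) and then inspects the result. The repositories, the placeholder branch name, and the make_repo helper are invented for the demonstration, and the two toy histories touch different files, so no conflicts arise.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: one commit per name, on a branch called main.
make_repo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for name in "$@"; do
    echo "$name" > "$dir/$name.txt"
    git -C "$dir" add "$name.txt"
    git -C "$dir" commit -q -m "$name"
  done
}
make_repo O A B C D
make_repo N E F

git init -q combined && cd combined
# Park HEAD on an unused name so the demo does not depend on init.defaultBranch.
git symbolic-ref HEAD refs/heads/placeholder
git fetch -q ../O main:branch-O
git fetch -q ../N main:branch-N

# Create main at D and merge F into it, producing merge commit M.
git checkout -q -b main branch-O
git merge -q --allow-unrelated-histories -m "M: join O and N" branch-N

# Prints M's hash followed by the hashes of its two parents (D and F).
git rev-list --parents -n 1 main
# The root commit A has the same hash ID here as in the original O.
git rev-list --max-parents=0 branch-O
git -C ../O rev-list --max-parents=0 main
```

The only new commit is M; every other hash ID is the one the commit already had in O or N.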

You mentioned not liking "How do you merge two Git repositories?" because:

I do not wish to create sub-trees

but if you read all the answers there, you'll see a variant of the one I just provided above, that uses git merge --allow-unrelated-histories and then tweaks the resulting merge (rather than just wiping it out and replacing it). That, too, works; it has the advantages and disadvantages I outlined here, and if you describe, in the log message for your merge commit, just what you had to do, you can leave a message for future source code archaeologists that explains why your merge is the Correct Merge.

Again, you should go back and read all the answers to that other question, but for now, let's move on to another relatively simple option—relatively simple, but not quite so simple after all, in real repositories. This option amounts to deciding which commits from N should be added to the commits in O.

The simple version is to add all N commits to O. That is, we'd like to have Git start with commit F, work backwards one step to commit E, and then—instead of stopping, which Git would normally do because commit E has no parents—somehow jump to commit D:

A--B--C--D   <-- branch-O
          .
           .
            .
             E--F   <-- branch-N, main

where the three diagonal dots are some kind of ghostly "make Git jump tracks" operator. The train—the viewing of history—starts at F and moves all the way to the end of the track at E, but instead of stopping because we have run out of track, the train now jumps over to the D-C-B-A track.

To make Git do this in one repository, we can use git replace with the --graft option. What this does is to copy commit E to a new and improved replacement commit E', which Git stores in the repository under a special name in the refs/replace/ namespace:

A--B--C--D   <-- branch-O
          \
           E'   <-- refs/replace/<big-ugly-hash-ID>
           :
           E--F   <-- branch-N, main

When git log is doing its thing, showing commits one at a time, it encounters commit F first and shows it. Then it moves back one step to commit E. This time, it notices that there's a refs/replace/ entry with the hash ID of E (some big ugly random-looking string of letters and digits). It's at this point that Git "jumps tracks", as it were: instead of looking at commit E, it looks at this new-and-improved replacement copy E'. This replacement commit is a different commit, so it has a different hash ID; Git finds the replacement through the refs/replace/hash name. And now git log is on this other track, so it shows commit E' instead of commit E, and unlike original commit E, commit E' does have a parent commit: D. So git log goes on to show D, then C, and so on.
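A sketch of the graft on scratch repositories may make this concrete. Everything named here (O, N, combined, placeholder, make_repo, first_n) is invented for the example; the Git commands themselves are git replace --graft and the --no-replace-objects option, which tells Git to ignore refs/replace/ for one command.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: one commit per name, on a branch called main.
make_repo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for name in "$@"; do
    echo "$name" > "$dir/$name.txt"
    git -C "$dir" add "$name.txt"
    git -C "$dir" commit -q -m "$name"
  done
}
make_repo O A B C D
make_repo N E F

git init -q combined && cd combined
git symbolic-ref HEAD refs/heads/placeholder   # keep HEAD off the names used below
git fetch -q ../O main:branch-O
git fetch -q ../N main:branch-N
git checkout -q -b main branch-N

# Find commit E (the parentless commit of N) and graft it onto D.
first_n=$(git rev-list --max-parents=0 branch-N)
git replace --graft "$first_n" branch-O

git replace -l                                # lists the hash ID of E
git log --oneline main                        # walks F, E, D, C, B, A
git --no-replace-objects log --oneline main   # walks only F and E
```

Running git replace -d with the hash ID of E removes the graft again, which makes this a low-risk thing to try.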

This has all the advantages that the original solution had, plus one more: there's no Magic Merge at all. It has one immediately obvious disadvantage, though: the contents of commits F and then E are purely based on whatever was in F and then E, without taking into account any necessary merging that might arise from drift between O and N. If that's not a problem for your case, that's fine; if it is, it's a problem.

This has one other disadvantage as well: cloning this repository does not, by default, copy the refs/replace/ name that points to commit E'. So in a clone, you will see two separate histories, rather than one single grafted history. If that's a problem for your use case, you must now solve this problem.
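The sketch below reproduces this on scratch repositories and shows one way around it: fetching the refs/replace/ namespace explicitly in the clone. The repository names, the make_repo helper, and the before/after variables are assumptions made for the example.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: one commit per name, on a branch called main.
make_repo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for name in "$@"; do
    echo "$name" > "$dir/$name.txt"
    git -C "$dir" add "$name.txt"
    git -C "$dir" commit -q -m "$name"
  done
}
make_repo O A B C D
make_repo N E F

# Grafted repository, built as in the illustration.
git init -q combined && cd combined
git symbolic-ref HEAD refs/heads/placeholder
git fetch -q ../O main:branch-O
git fetch -q ../N main:branch-N
git checkout -q -b main branch-N
git replace --graft "$(git rev-list --max-parents=0 branch-N)" branch-O

# A plain clone gets the branches but not refs/replace/.
cd "$work"
git clone -q combined clone
cd clone
before=$(git rev-list --count main)   # history stops at E

# Ask for the replacement refs explicitly.
git fetch -q origin 'refs/replace/*:refs/replace/*'
after=$(git rev-list --count main)    # history now continues into D..A

echo "commits visible from main: before=$before after=$after"
```

Every clone would need the same extra fetch (or an equivalent configured refspec), which is why a permanent rewrite is often preferred for a shared repository.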

There is a simple solution: use git filter-branch, or its new replacement git filter-repo, to make the graft "permanent", as it were. This works by copying all the commits at and after the graft point. That is, we start with this:

A--B--C--D   <-- branch-O
          \
           E'   <-- refs/replace/<big-ugly-hash-ID>
           :
           E--F   <-- branch-N, main

Then we have Git walk through the entire graph, much the same way git log would, copying each commit to a new-and-improved, or maybe identical-to-original, commit. If the copy really is absolutely, 100%, bit-for-bit identical, we re-use the original. If the copy changes at all, in any way, we use the copy.

The tricky part here is that when we do this walk, we first just list out all the commit hash IDs. That list goes: F, E', D, C, B, A. Then we put this list into topological order, which in this particular case just means reversing it: A, B, C, D, E', and F.

So, first we copy commit A. We make no changes to commit A, so that the copy is commit A again. We repeat for B, C, D, and E'. These all don't change either. But when we go to copy F, the copy we make uses E'—not E—as its parent. So this makes a new-and-improved commit F':

A--B--C--D
          \
           E'-F'

We never bother copying commit E here because we don't "see" it (having jumped tracks during the list-out-commit-hash-IDs step). We don't need to, because we copied that earlier, with git replace --graft: that's what made E' in the first place.

Having finished all the copying, git filter-branch or git filter-repo will now take the names—branch-O, branch-N, and so on—and make them point to the result of copying. Since branch-O used to point to D and the result of copying D is D, branch-O still points to D:

A--B--C--D   <-- branch-O
          \
           E'-F'

The names branch-N and main, however, used to point to F. The result of copying F is F' so these two names are now moved, to point to F':

A--B--C--D   <-- branch-O
          \
           E'-F'  <-- branch-N, main

Commits E and F remain in the repository for a while, but are now useless junk. They won't be copied by git clone, and once we clean up—if we used git filter-branch there's a manual cleanup step you must invoke—those two originals, E and F, will eventually be discarded entirely.
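Here is a sketch of the whole graft-then-rewrite sequence on scratch repositories, using git filter-branch because it ships with Git (git filter-repo is a separate install). The names O, N, combined, placeholder, make_repo, first_n, and old_tip are invented for the example. FILTER_BRANCH_SQUELCH_WARNING=1 only suppresses the startup warning that recent Git versions print.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: one commit per name, on a branch called main.
make_repo() {
  dir=$1; shift
  git init -q "$dir"
  git -C "$dir" symbolic-ref HEAD refs/heads/main
  for name in "$@"; do
    echo "$name" > "$dir/$name.txt"
    git -C "$dir" add "$name.txt"
    git -C "$dir" commit -q -m "$name"
  done
}
make_repo O A B C D
make_repo N E F

git init -q combined && cd combined
git symbolic-ref HEAD refs/heads/placeholder
git fetch -q ../O main:branch-O
git fetch -q ../N main:branch-N
git checkout -q -b main branch-N

# Temporary graft: E gets D as its parent via refs/replace/.
first_n=$(git rev-list --max-parents=0 branch-N)
git replace --graft "$first_n" branch-O
old_tip=$(git rev-parse main)   # original F, before the rewrite

# Rewrite every ref; filter-branch honors refs/replace/, so the graft
# becomes part of the real parent links of the copied commits.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -- --all

# The replacement ref is no longer needed.
git replace -d "$first_n"

git log --oneline main   # six commits, with no replace ref involved
# filter-branch keeps backups of the old tips under refs/original/;
# deleting those is part of the manual cleanup.
git for-each-ref refs/original/
```

Running this on a fresh clone rather than on the production checkout keeps the original repository untouched until the result has been inspected.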

This might be what you want. Its disadvantage is that the new and improved replacement commits, E' and F', have different commit numbers from those in repo N. If you ever connect this combined repository (call it C) to a Git holding a copy of N, for instance by fetching from N into C, the Git working on C will say: Ooh, new commits! Gimme a copy of E and F so I can add those to my collection! Now you'll have duplicates of all the commits you saved (by copying) from N, because you'll have all the originals back again.

If this disadvantage—the renumbering of all the N commits—is not a problem for you, this may be the approach you'd like.

Conclusion

Your real problem is that you must decide what set of commits you want to have in your new combined repository.

The existing commits, in repositories O and N, are the way they are. They will be that way forever. They will have those hash IDs forever. That's what a hash ID is: it's the identity of a commit. All commits are frozen forever. You can make new and improved commits that have different snapshots and/or different parents; this is new history; the old commits in O and N are the existing history. That's all there is! You can do a lot with that, and the (many) answers in the linked question provide different ways of doing these different things with these histories.

It's up to you to decide what you want done. Then, look there (and here) for ways to achieve it.
