
Imagine you have one tree with one file. Suppose this file has only two possible states, a and b, plus ø if it's missing or nonexistent. I'm trying to build a table to understand all the possible git statuses. I believe what I have makes sense; however, I've marked with ** the rows I'm unsure about:

head    index   working status
a       a       a       no changes**
a       a       b       unstaged:modified**
a       a       ø       unstaged:deleted**
a       b       a       staged:modified, unstaged:modified
a       b       b       staged:modified
a       b       ø       staged:modified, unstaged:deleted
a       ø       a       no changes**
a       ø       b       unstaged:modified**
a       ø       ø       staged:deleted**
ø       a       a       staged:new file
ø       a       b       staged:new file, unstaged: modified
ø       a       ø       staged:new file, unstaged: deleted
ø       ø       a       untracked

For any of the *, ø, * I almost feel like it depends on the parent tree, and whether or not that is in the index... for example, a, ø, ø it's as if you've removed the blob from the working tree, and also the index. But, what does a removal from the index look like? Is it just the parent tree is added into the staging area with the tree entry removed? If that's the case, then it makes sense that there is no entry in the index for the blob itself.

For any record where index = head, (a,a,a, a,a,b, a,a,ø) I'm assuming this state can't actually occur unless you were playing around with the plumbing commands.

If you see errors in my table, and/or any light shed on this would be great! Thanks in advance.

pyramation

1 Answer


Think of the index as "the proposed next commit"

Commit-based version control systems such as Mercurial and Git need a way to distinguish between what's in the current commit (which, like any commit, can never be changed) and what will be in the next commit we make, which of course has to be changeable up until we make the commit. Mercurial essentially uses the work-tree for this, but Git adds an extra layer it calls the index. Git is then able to attach some extra properties to the index: a file is tracked if and only if it is in the index, for instance. During merges, the index takes on extra roles (which we shall ignore here :-) ). There's one last complication I will leave for the end.

But, what does a removal from the index look like?

Removing a file from the index amounts to (rather literally) removing the file's entry from the index. Try running git ls-files --stage to see what I mean: for your first row (a, a, a = no changes) you will find an entry for the file in the index. For your row a, ø, a, the file simply has no index entry any more (and will therefore not be in a new commit you make now).

As a result, calling a file "staged" could be a bit misleading. If the file is not in the index at all (but is in HEAD), the file is "staged for removal", but it's simpler to just say "not in the index". Once a file is not in the index, it's also not tracked, so the work-tree version becomes an untracked file! Your a, ø, a row is therefore not "no changes": the removal is staged, and the work-tree copy is untracked.

This means your a, ø, b entry is also wrong: here the file is staged for removal, and the work-tree variant with b is an untracked file.

The a, a, ø entry is perhaps the most difficult to name. The file is still in the index, so it will be in every commit you make from here forward until you remove it from the index. But, the file is not in the work-tree at all, so you can't see that it is going into commits. If you run git add file while in this state, Git copies the non-existence of the work-tree file into the index by removing the index entry.
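As a rough illustration of that last point, here is a toy Python model of what git add does to a single path. The dictionaries and the git_add function are hypothetical stand-ins for the index and the work-tree, not anything Git itself exposes; real Git stores blob hash IDs and modes rather than raw content.

```python
def git_add(index: dict, worktree: dict, path: str) -> None:
    """Toy model of `git add path`: make the index entry match the work-tree."""
    if path in worktree:
        # The file exists: copy its content into the index (it is now tracked).
        index[path] = worktree[path]
    elif path in index:
        # The file is gone from the work-tree: copy that absence into the
        # index by dropping the entry.
        del index[path]
    else:
        # Neither tracked nor present: real Git reports a pathspec error here.
        raise FileNotFoundError(path)


# The a, a, ø row: the index holds content "a", the work-tree file is missing.
index = {"file": "a"}
worktree = {}
git_add(index, worktree, "file")
print(index)  # {}  (the row has become a, ø, ø)
```

After the call the file has no index entry, so the next commit will not contain it.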

(Mercurial has a similar state, as there's a hidden internal data structure called the manifest that plays part of the same role as Git's index. If the file is missing from the work-tree, but is in the manifest, Mercurial calls the file missing. Mercurial tries to treat the work-tree as what goes into the next commit, so you would think that if the file has simply vanished like this, it should vanish from the next commit as well. According to the documentation, Mercurial behaved this way originally, but this was found to be too error-prone.)

Low level tools for poking around

  • Use git ls-tree -r HEAD to view the entire tree of the current commit (if there's only one tree you don't need -r).
  • Use git ls-files --stage to view the entire tree of the current index: the index is like a flattened tree, where if you have a subdirectory (subtree) named dir with files d1 and d2, you get index entries named dir/d1 and dir/d2 (vs the commit, where the top tree will have a subtree named dir and the subtree will have two blobs named d1 and d2).
  • Use the OS's ordinary tools (ls for instance) to view your work-tree. Your work-tree itself has very little significance to git commit, which simply turns the existing index, whatever is in it, into one or more tree objects to store into a new commit. (This all changes if you run git commit with file pathname arguments or -a or similar. Here Git may add files to the index, or even switch to a temporary alternate index that it uses just until the commit is done. This depends on whether, when supplying additional paths, you use --include or --only.)
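The difference between a commit's nested trees and the flat index can be sketched in a few lines of Python. This is only a model: the nested dictionary stands in for tree objects, the strings such as "blob-d1" are placeholder blob IDs, and the top-level entry named file is a hypothetical addition for the example.

```python
def flatten(tree: dict, prefix: str = "") -> dict:
    """Turn a nested tree (dict of dicts) into index-style full path names."""
    entries = {}
    for name, node in sorted(tree.items()):
        path = prefix + name
        if isinstance(node, dict):
            # A subtree: recurse, extending the path with "name/".
            entries.update(flatten(node, path + "/"))
        else:
            # A blob: record it under its full path.
            entries[path] = node
    return entries


# Commit view: a top-level tree containing a subtree "dir" with two blobs.
commit_tree = {"dir": {"d1": "blob-d1", "d2": "blob-d2"}, "file": "blob-file"}

# Index view: one flat entry per file.
for path, blob in flatten(commit_tree).items():
    print(blob, path)
```

The loop prints one line each for dir/d1, dir/d2 and file, which is the shape of the listing git ls-files --stage gives, minus the mode and stage-number columns.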

One more wrinkle

Because Git has, and exposes, the index, it can and does expose one more feature, in two different ways. The index has two flag bits per entry called assume-unchanged and skip-worktree. To see these flag bits with git ls-files you must add the --debug argument, but what they do can be described relatively simply—a little bit too simply, it turns out—as:

  • If the assume-unchanged or skip-worktree flags are set on an index entry, Git should just "close its eyes" to what's in the work-tree when doing operations like git status.

  • This can speed Git up a lot, but has certain side effects. The side effects may be what we are using the bits for.

When you run git status, Git runs two git diffs. One compares HEAD vs index, and the second compares index vs work-tree. It's the first diff that determines the first column of a git status --short output, and the second diff that determines the second column.
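The two-diff idea is enough to rebuild the whole table from the question. Below is a minimal Python sketch, assuming a single file whose state in HEAD, index and work-tree is either a content string or None (for ø). The function names are made up for this example, and the model ignores renames, merge conflicts and the flag bits discussed next.

```python
from typing import List, Optional


def diff_code(old: Optional[str], new: Optional[str]) -> str:
    """One-letter result of comparing two versions of a file (None = absent)."""
    if old == new:
        return " "
    if old is None:
        return "A"  # added
    if new is None:
        return "D"  # deleted
    return "M"      # modified


def short_status(head: Optional[str], index: Optional[str],
                 work: Optional[str]) -> List[str]:
    """Toy model of the `git status --short` codes for one file."""
    lines = []
    if index is None:
        # Not in the index, so not tracked.
        if head is not None:
            lines.append("D ")  # first diff: deleted in index relative to HEAD
        if work is not None:
            lines.append("??")  # whatever sits in the work-tree is untracked
        return lines
    x = diff_code(head, index)  # first diff: HEAD vs index
    y = diff_code(index, work)  # second diff: index vs work-tree
    if x != " " or y != " ":
        lines.append(x + y)
    return lines


print(short_status("a", "b", None))  # ['MD']
print(short_status("a", None, "b"))  # ['D ', '??']
print(short_status(None, None, "a"))  # ['??']
```

Note how a, ø, b produces two lines: a staged deletion and a separate untracked file.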

The assume-unchanged and skip-worktree bits tell Git not to bother comparing the file during the second diff (see footnote 1). Note that for these bits to be set, the index must have an entry for the file, i.e., the file must be tracked to get skipped over like this. We can probably assume that the index entry matches the HEAD entry (if it doesn't, it will after the next commit!), so the effect of these flag bits is that we never see the file as modified, and git add generally skips over the file as well: it does not copy the work-tree version back into the index.

Our assumption—that the index entry matches the commit—leads us astray in some corner cases though, and is the reason there are two bits. For more about this, see Git - Difference Between 'assume-unchanged' and 'skip-worktree'


Footnote 1: The first diff is very fast because of the special form files (blobs) have when stored in a commit or in the index. Specifically, Git can tell whether the contents of any one file match the contents of any other file just by comparing their hash IDs. If the hash IDs match, the files are the same; if not, the files are different. Git is not looking for a complete diff at this point, but rather just a --name-status style diff: "are the files the same, or not?"

The second diff is much slower because Git must, in the worst case, open and read the entire contents of each file. Even just asking the file system about the file (calling the lstat system call) is much slower than Git's internal compare-hash-IDs trick.
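To make the hash-ID comparison concrete, here is a small Python sketch of how a blob's ID is derived from its content. It assumes the default SHA-1 object format (a repository can also be set up to use SHA-256), and blob_id is a name invented for this example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash ID Git gives a blob: SHA-1 over a 'blob <size>' header, a NUL
    byte, and then the file content (assumes the SHA-1 object format)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The well-known ID of the empty blob.
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Same content means same ID, so comparing IDs is enough to answer
# "are the files the same, or not?" without reading either file again.
print(blob_id(b"a\n") == blob_id(b"a\n"))  # True
print(blob_id(b"a\n") == blob_id(b"b\n"))  # False
```

Since HEAD and the index both already store these IDs, the first diff never has to touch file contents.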

torek