
I am trying to extract git logs from a few repositories like this:

git log --pretty=format:%H\t%ae\t%an\t%at\t%s --numstat

For larger repositories (like rails/rails) it takes a solid 35+ seconds to generate the log.

Is there a way to improve this performance?

mjuarez
George L
    Try `--max-count=30` as [described in the git-log documentation](https://git-scm.com/docs/git-log). Do you really need to see all 56'000 commits to the rails project? – msw Feb 03 '16 at 20:40
  • @msw for this project, unfortunately, yes. – George L Feb 03 '16 at 20:43
  • Git 2.18 (Q2 2018) should improve `git log` performance by *a lot*. See [my answer below](https://stackoverflow.com/a/49826884/6309). – VonC May 10 '18 at 14:31

4 Answers


TL;DR, as mentioned in GitMerge 2019:

git config --global core.commitGraph true
git config --global gc.writeCommitGraph true
cd /path/to/repo
git commit-graph write

Actually (see at the end), the first two config settings are not needed with Git 2.24+ (Q4 2019): they default to true.


Git 2.18 (Q2 2018) will improve git log performance:

See commit 902f5a2 (24 Mar 2018) by René Scharfe (rscharfe).
See commit 0aaf05b, commit 3d475f4 (22 Mar 2018) by Derrick Stolee (derrickstolee).
See commit 626fd98 (22 Mar 2018) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit 51f813c, 10 Apr 2018)

sha1_name: use bsearch_pack() for abbreviations

When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search.
This is one of several implementations.
One issue with this implementation is that it ignores the fanout table in the pack-index.

Translate this binary search to use the existing bsearch_pack() method that correctly uses a fanout table.

Due to the use of the fanout table, the abbreviation computation is slightly faster than before.

For a fully-repacked copy of the Linux repo, the following 'git log' commands improved:

* git log --oneline --parents --raw
  Before: 59.2s
  After:  56.9s
  Rel %:  -3.8%

* git log --oneline --parents
  Before: 6.48s
  After:  5.91s
  Rel %: -8.9%

The same Git 2.18 adds a commit graph: precompute and store information necessary for ancestry traversal in a separate file to optimize graph walking.

See commit 7547b95, commit 3d5df01, commit 049d51a, commit 177722b, commit 4f2542b, commit 1b70dfd, commit 2a2e32b (10 Apr 2018), and commit f237c8b, commit 08fd81c, commit 4ce58ee, commit ae30d7b, commit b84f767, commit cfe8321, commit f2af9f5 (02 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b10edb2, 08 May 2018)

commit: integrate commit graph with commit parsing

Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently().
This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date.

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks.

Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  8.31s |  0.94s | -88%  |
| branch -vv                       |  1.02s |  0.14s | -86%  |
| rev-list --all                   |  5.89s |  1.07s | -81%  |
| rev-list --all --objects         | 66.15s | 58.45s | -11%  |

To know more about commit graph, see "How does 'git log --graph' work?".
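
As a concrete illustration, here is a hedged sketch of generating and checking that file by hand on Git 2.18+ (the subcommands below are the documented git-commit-graph ones):

```shell
# Enable commit-graph reads and write the graph for all reachable commits
git config core.commitGraph true
git commit-graph write --reachable

# Sanity-check the result
git commit-graph verify
ls -lh .git/objects/info/commit-graph
```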


The same Git 2.18 (Q2 2018) adds lazy-loading of trees.

The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so.

See commit 279ffad (30 Apr 2018) by SZEDER Gábor (szeder).
See commit 7b8a21d, commit 2e27bd7, commit 5bb03de, commit 891435d (06 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c89b6e1, 23 May 2018)

commit-graph: lazy-load trees for commits

The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits.

Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree().
Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative performance improvement when trees are not needed is small.

On the Linux repository, performance tests were run for the following command:

git log --graph --oneline -1000

Before: 0.92s
After:  0.66s
Rel %: -28.3%

Git 2.21 (Q1 2019) adds a loose objects cache.

See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)

object-store: use one oid_array per subdirectory for loose cache

The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add operation.
So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.

Use one oid_array for each subdirectory.
This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.

The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:

$ git count-objects
26733 objects, 68808 kilobytes

Test                        HEAD^             HEAD
--------------------------------------------------------------------
4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%

Git 2.22 (Q2 2019) checks errors before using data read from the commit-graph file.

See commit 93b4405, commit 43d3561, commit 7b8ce9c, commit 67a530f, commit 61df89c, commit 2ac138d (25 Mar 2019), and commit 945944c, commit f6761fa (21 Feb 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit a5e4be2, 25 Apr 2019)

commit-graph write: don't die if the existing graph is corrupt

When the commit-graph is written we end up calling parse_commit(). This will in turn invoke code that'll consult the existing commit-graph about the commit, if the graph is corrupted we die.

We thus get into a state where a failing "commit-graph verify" can't be followed-up with a "commit-graph write" if core.commitGraph=true is set, the graph either needs to be manually removed to proceed, or core.commitGraph needs to be set to "false".

Change the "commit-graph write" codepath to use a new parse_commit_no_graph() helper instead of parse_commit() to avoid this.
parse_commit() will in turn call repo_parse_commit_internal() with use_commit_graph=1, as seen in 177722b ("commit: integrate commit graph with commit parsing", 2018-04-10, Git v2.18.0-rc0).

Not using the old graph at all slows down the writing of the new graph by some small amount, but is a sensible way to prevent an error in the existing commit-graph from spreading.
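
On a Git older than that fix, the manual recovery described above can be sketched like this (assuming a plain, non-split graph file at its default path):

```shell
# If verification fails, remove the corrupt graph, then rewrite it
git commit-graph verify || rm -f .git/objects/info/commit-graph
git commit-graph write --reachable
```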


With Git 2.24+ (Q4 2019), the commit-graph is active by default:

See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)

commit-graph: turn on commit-graph by default

The commit-graph feature has seen a lot of activity in the past year or so since it was introduced.
The feature is a critical performance enhancement for medium- to large-sized repos, and does not significantly hurt small repos.

Change the defaults for core.commitGraph and gc.writeCommitGraph to true so users benefit from this feature by default.


Still with Git 2.24 (Q4 2019), a configuration variable tells "git fetch" to write the commit graph after finishing.

See commit 50f26bd (03 Sep 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 5a53509, 30 Sep 2019)

fetch: add fetch.writeCommitGraph config setting

The commit-graph feature is now on by default, and is being written during 'git gc' by default.
Typically, Git only writes a commit-graph when a 'git gc --auto' command passes the gc.auto setting to actually do work. This means that a commit-graph will typically fall behind the commits that are being used every day.

To stay updated with the latest commits, add a step to 'git fetch' to write a commit-graph after fetching new objects.
The fetch.writeCommitGraph config setting enables writing a split commit-graph, so on average the cost of writing this file is very small. Occasionally, the commit-graph chain will collapse to a single level, and this could be slow for very large repos.

For additional use, adjust the default to be true when feature.experimental is enabled.
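
In practice, opting in looks like this (a minimal sketch; both settings are documented in git-config):

```shell
# Keep the commit-graph up to date on every fetch (Git 2.24+)
git config fetch.writeCommitGraph true

# Alternatively, feature.experimental also turns it on
git config feature.experimental true
```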


And still with Git 2.24 (Q4 2019), the commit-graph is more robust.

See commit 6abada1, commit fbab552 (12 Sep 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 098e8c6, 07 Oct 2019)

commit-graph: bump DIE_ON_LOAD check to actual load-time

Commit 43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25, Git v2.22.0-rc0) added an environment variable we use only in the test suite, $GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD.
But it put the check for this variable at the very top of prepare_commit_graph(), which is called every time we want to use the commit graph.
Most importantly, it comes before we check the fast-path "did we already try to load?", meaning we end up calling getenv() for every single use of the commit graph, rather than just when we load.

getenv() is allowed to have unexpected side effects, but that shouldn't be a problem here; we're lazy-loading the graph so it's clear that at least one invocation of this function is going to call it.

But it is inefficient. getenv() typically has to do a linear search through the environment space.

We could memoize the call, but it's simpler still to just bump the check down to the actual loading step. That's fine for our sole user in t5318, and produces this minor real-world speedup:

[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
  Time (mean ± σ):      1.460 s ±  0.017 s    [User: 1.174 s, System: 0.285 s]
  Range (min … max):    1.440 s …  1.491 s    10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
  Time (mean ± σ):      1.391 s ±  0.005 s    [User: 1.118 s, System: 0.273 s]
  Range (min … max):    1.385 s …  1.399 s    10 runs

Git 2.24 (Q4 2019) also includes a regression fix.

See commit cb99a34, commit e88aab9 (24 Oct 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit dac1d83, 04 Nov 2019)

commit-graph: fix writing first commit-graph during fetch

Reported-by: Johannes Schindelin
Helped-by: Jeff King
Helped-by: Szeder Gábor
Signed-off-by: Derrick Stolee

The previous commit includes a failing test for an issue around fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we fix that bug and set the test to "test_expect_success".

The problem arises with this set of commands when the remote repo at <url> has a submodule. Note that --recurse-submodules is not needed to demonstrate the bug.

$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
Computing commit graph generation numbers: 100% (12/12), done.
BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
Aborted (core dumped)

As an initial fix, I converted the code in builtin/fetch.c that calls write_commit_graph_reachable() to instead launch a "git commit-graph write --reachable --split" process. That code worked, but is not how we want the feature to work long-term.

That test did demonstrate that the issue must be something to do with internal state of the 'git fetch' process.

The write_commit_graph() method in commit-graph.c ensures the commits we plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING flag to mark which commits have already been visited. This allows the walk to take O(N) time, where N is the number of commits, instead of O(P) time, where P is the number of paths. (The number of paths can be exponential in the number of commits.)

However, the UNINTERESTING flag is used in lots of places in the codebase. This flag usually means some barrier to stop a commit walk, such as in revision-walking to compare histories.
It is not often cleared after the walk completes because the starting points of those walks do not have the UNINTERESTING flag, and clear_commit_marks() would stop immediately.

This is happening during a 'git fetch' call with a remote. The fetch negotiation is comparing the remote refs with the local refs and marking some commits as UNINTERESTING.

I tested running clear_commit_marks_many() to clear the UNINTERESTING flag inside close_reachable(), but the tips did not have the flag, so that did nothing.

It turns out that the calculate_changed_submodule_paths() method is at fault. Thanks, Peff, for pointing out this detail! More specifically, for each submodule, the collect_changed_submodules() runs a revision walk to essentially do file-history on the list of submodules. That revision walk marks commits UNININTERESTING if they are simplified away by not changing the submodule.

Instead, I finally arrived on the conclusion that I should use a flag that is not used in any other part of the code. In commit-reach.c, a number of flags were defined for commit walk algorithms. The REACHABLE flag seemed like it made the most sense, and it seems it was not actually used in the file.
The REACHABLE flag was used in early versions of commit-reach.c, but was removed by 4fbcca4 ("commit-reach: make can_all_from_reach... linear", 2018-07-20, v2.20.0-rc0).

Add the REACHABLE flag to commit-graph.c and use it instead of UNINTERESTING in close_reachable().
This fixes the bug in manual testing.


Fetching from multiple remotes into the same repository in parallel had a bad interaction with the recent change to (optionally) update the commit-graph after a fetch job finishes, as these parallel fetches compete with each other.

That has been corrected with Git 2.25 (Q1 2020).

See commit 7d8e72b, commit c14e6e7 (03 Nov 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit bcb06e2, 01 Dec 2019)

fetch: add the command-line option --write-commit-graph

Signed-off-by: Johannes Schindelin

This option overrides the config setting fetch.writeCommitGraph, if both are set.
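
For a one-off fetch, the flag can be passed directly; a minimal sketch:

```shell
# Write the commit-graph after this fetch only,
# overriding fetch.writeCommitGraph for this invocation
git fetch --write-commit-graph origin
```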

And:

fetch: avoid locking issues between fetch.jobs/fetch.writeCommitGraph

Signed-off-by: Johannes Schindelin

When both fetch.jobs and fetch.writeCommitGraph are set, we currently try to write the commit graph in each of the concurrent fetch jobs, which frequently leads to error messages like this one:

fatal: Unable to create '.../.git/objects/info/commit-graphs/commit-graph-chain.lock': File exists.

Let's avoid this by holding off from writing the commit graph until all fetch jobs are done.


The code to write split commit-graph file(s) upon fetching computed bogus value for the parameter used in splitting the resulting files, which has been corrected with Git 2.25 (Q1 2020).

See commit 63020f1 (02 Jan 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 037f067, 06 Jan 2020)

commit-graph: prefer default size_mult when given zero

Signed-off-by: Derrick Stolee

In 50f26bd ("fetch: add fetch.writeCommitGraph config setting", 2019-09-02, Git v2.24.0-rc0 -- merge listed in batch #4), the fetch builtin added the capability to write a commit-graph using the "--split" feature.
This feature creates multiple commit-graph files, and those can merge based on a set of "split options" including a size multiple.
The default size multiple is 2, which intends to provide a log_2 N depth of the commit-graph chain where N is the number of commits.

However, I noticed during dogfooding that my commit-graph chains were becoming quite large when left only to builds by 'git fetch'.
It turns out that in split_graph_merge_strategy(), we default the size_mult variable to 2, except we override it with the context's split_opts if they exist.
In builtin/fetch.c, we create such a split_opts, but do not populate it with values.

This problem is due to two failures:

  1. It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT with a NULL split_opts.
  2. If we have a non-NULL split_opts, then we override the default values even if a zero value is given.

Correct both of these issues.

  • First, do not override size_mult when the options provide a zero value.
  • Second, stop creating a split_opts in the fetch builtin.

Note that git log was broken between Git 2.22 (May 2019) and Git 2.27 (Q2 2020), when using magic pathspec.

The command line parsing of "git log :/a/b/" was broken for about a full year without anybody noticing, which has been corrected.

See commit 0220461 (10 Apr 2020) by Jeff King (peff).
See commit 5ff4b92 (10 Apr 2020) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 95ca489, 22 Apr 2020)

sha1-name: do not assume that the ref store is initialized

Reported-by: Érico Rolim

c931ba4e ("sha1-name.c``: remove the_repo from handle_one_ref()", 2019-04-16, Git v2.22.0-rc0 -- merge listed in batch #8) replaced the use of for_each_ref() helper, which works with the main ref store of the default repository instance, with refs_for_each_ref(), which can work on any ref store instance, by assuming that the repository instance the function is given has its ref store already initialized.

But it is possible that nobody has initialized it, in which case, the code ends up dereferencing a NULL pointer.

And:

repository: mark the "refs" pointer as private

Signed-off-by: Jeff King

The "refs" pointer in a struct repository starts life as NULL, but then is lazily initialized when it is accessed via get_main_ref_store().
However, it's easy for calling code to forget this and access it directly, leading to code which works some of the time, but fails if it is called before anybody else accesses the refs.

This was the cause of the bug fixed by 5ff4b920eb ("sha1-name: do not assume that the ref store is initialized", 2020-04-09, Git v2.27.0 -- merge listed in batch #3). In order to prevent similar bugs, let's more clearly mark the "refs" field as private.

VonC

My first thought was to improve your IO, but I tested against the rails repository using an SSD and got a similar result: 30 seconds.

--numstat is what's slowing everything down, otherwise git-log can complete in 1 second even with the formatting. Doing a diff is expensive, so if you can remove that from your process that will speed things up immensely. Perhaps do it after the fact.

Otherwise, if you filter the log entries using git-log's own search facilities, that will reduce the number of entries which need a diff. For example, git log --grep=foo --numstat takes just one second. These options are in the docs under "Commit Limiting". They can greatly reduce the number of entries git has to format. Revision ranges, date filters, author filters, log message grepping... all of these can improve the performance of git-log on a large repository while doing an expensive operation.
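
A few hedged examples combining such filters with the original command (the option names are git-log's documented commit-limiting options; %x09 is git's escape for a tab character):

```shell
# Only commits from 2016 onward
git log --since=2016-01-01 --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat

# Only the last 1000 commits by a given author
git log --author=someone@example.com -n 1000 --numstat

# Only commits whose message matches a pattern
git log --grep=foo --numstat
```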

Schwern

You are correct, it does take somewhere between 20 and 35 seconds to generate the report on 56'000 commits generating 224'000 lines (15MiB) of output. I actually think that's pretty decent performance but you don't; okay.

Because you are generating a report using a constant format from an unchanging database, you only have to do it once. Afterwards, you can use the cached result of git log and skip the time-consuming generation. For example:

git log --pretty=format:%H\t%ae\t%an\t%at\t%s --numstat > log-pretty.txt

You might wonder how long it takes to search that entire report for data of interest. That's a worthy question:

$ tail -1 log-pretty.txt
30  0   railties/test/webrick_dispatcher_test.rb
$ time grep railties/test/webrick_dispatcher_test.rb log-pretty.txt 
…
30  0   railties/test/webrick_dispatcher_test.rb

real    0m0.012s
…

Not bad, the introduction of a "cache" has reduced the time needed from 35+ seconds to a dozen milliseconds. That's almost 3000 times as fast.
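
To keep such a cache from going stale, one option is to regenerate it only when HEAD has moved; a minimal sketch (log-pretty.txt follows the example above, and %x09 is git's tab escape, so the first field stays extractable with cut):

```shell
# Regenerate the cached log only if the newest cached commit is no longer HEAD
cached=$( (head -n 1 log-pretty.txt 2>/dev/null || true) | cut -f1 )
if [ "$cached" != "$(git rev-parse HEAD)" ]; then
  git log --pretty=format:'%H%x09%ae%x09%an%x09%at%x09%s' --numstat > log-pretty.txt
fi
```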

msw

There is another avenue to increase git log performance, and it builds upon the commit graphs mentioned in the previous answer.

Git 2.27 (Q2 2020) introduces an extension to the commit-graph to make it efficient to check for paths that were modified at each commit, using Bloom filters.

See commit caf388c (09 Apr 2020), and commit e369698 (30 Mar 2020) by Derrick Stolee (derrickstolee).
See commit d5b873c, commit a759bfa, commit 42e50e7, commit a56b946, commit d38e07b, commit 1217c03, commit 76ffbca (06 Apr 2020), and commit 3d11275, commit f97b932, commit ed591fe, commit f1294ea, commit f52207a, commit 3be7efc (30 Mar 2020) by Garima Singh (singhgarima).
See commit d21ee7d (30 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 9b6606f, 01 May 2020)

revision.c: use Bloom filters to speed up path based revision walks

Helped-by: Derrick Stolee
Helped-by: SZEDER Gábor
Helped-by: Jonathan Tan
Signed-off-by: Garima Singh

Revision walk will now use Bloom filters for commits to speed up revision walks for a particular path (for computing history for that path), if they are present in the commit-graph file.

We load the Bloom filters during the prepare_revision_walk step, currently only when dealing with a single pathspec.
Extending it to work with multiple pathspecs can be explored and built on top of this series in the future.

While comparing trees in rev_compare_trees(), if the Bloom filter says that the file is not different between the two trees, we don't need to compute the expensive diff.
This is where we get our performance gains.

The other response of the Bloom filter is 'maybe', in which case we fall back to the full diff calculation to determine if the path was changed in the commit.

We do not try to use Bloom filters when the '--walk-reflogs' option is specified.
The '--walk-reflogs' option does not walk the commit ancestry chain like the rest of the options.
Incorporating the performance gains when walking reflog entries would add more complexity, and can be explored in a later series.

Performance Gains: We tested the performance of git log -- <path> on the git repo, the linux and some internal large repos, with a variety of paths of varying depths.

On the git and linux repos:

  • we observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree:

  • we observed 10x to 20x speed ups, with some paths going up to 28 times faster.
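
Taking advantage of this requires the filters to actually be present in the commit-graph file; a hedged sketch of opting in on Git 2.27+ (the path argument below is illustrative):

```shell
# Write changed-path Bloom filters into the commit-graph
git commit-graph write --reachable --changed-paths

# Path-limited history walks can now skip most tree diffs
git log -- path/to/file
```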

But: Fix (with Git 2.27, Q2 2020) a leak noticed by fuzzer.

See commit fbda77c (04 May 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 95875e0, 08 May 2020)

commit-graph: avoid memory leaks

Signed-off-by: Jonathan Tan
Reviewed-by: Derrick Stolee

A fuzzer running on the entry point provided by fuzz-commit-graph.c revealed a memory leak when parse_commit_graph() creates a struct bloom_filter_settings and then returns early due to error.

Fix that error by always freeing that struct first (if it exists) before returning early due to error.

While making that change, I also noticed another possible memory leak - when the BLOOMDATA chunk is provided but not BLOOMINDEXES.
Also fix that error.


Git 2.27 (Q2 2020) improves bloom filter again:

See commit b928e48 (11 May 2020) by SZEDER Gábor (szeder).
See commit 2f6775f, commit 65c1a28, commit 8809328, commit 891c17c (11 May 2020), and commit 54c337b, commit eb591e4 (01 May 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 4b1e5e5, 14 May 2020)

bloom: de-duplicate directory entries

Signed-off-by: Derrick Stolee

When computing a changed-path Bloom filter, we need to take the files that changed from the diff computation and extract the parent directories. That way, a directory pathspec such as "Documentation" could match commits that change "Documentation/git.txt".

However, the current code does a poor job of this process.

The paths are added to a hashmap, but we do not check if an entry already exists with that path.
This can create many duplicate entries and cause the filter to have a much larger length than it should.
This means that the filter is more sparse than intended, which helps the false positive rate, but wastes a lot of space.

Properly use hashmap_get() before hashmap_add().
Also be sure to include a comparison function so these can be matched correctly.

This has an effect on a test in t0095-bloom.sh.
This makes sense: there are ten changes inside "smallDir", so the total number of paths in the filter should be 11.
This would result in 11 * 10 bits required, and with 8 bits per byte, this results in 14 bytes.


With Git 2.28 (Q3 2020), "git log -L..." now takes advantage of the "which paths are touched by this commit?" info stored in the commit-graph system.

For that, the bloom filter is used.

See commit f32dde8 (11 May 2020) by Derrick Stolee (derrickstolee).
See commit 002933f, commit 3cb9d2b, commit 48da94b, commit d554672 (11 May 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit c3a0282, 09 Jun 2020)

line-log: integrate with changed-path Bloom filters

Signed-off-by: Derrick Stolee

The previous changes to the line-log machinery focused on making the first result appear faster. This was achieved by no longer walking the entire commit history before returning the early results.
There is still another way to improve the performance: walk most commits much faster. Let's use the changed-path Bloom filters to reduce time spent computing diffs.

Since the line-log computation requires opening blobs and checking the content-diff, there is still a lot of necessary computation that cannot be replaced with changed-path Bloom filters.
The part that we can reduce is most effective when checking the history of a file that is deep in several directories and those directories are modified frequently.
In this case, the computation to check if a commit is TREESAME to its first parent takes a large fraction of the time.
That is ripe for improvement with changed-path Bloom filters.

We must ensure that prepare_to_use_bloom_filters() is called in revision.c so that the bloom_filter_settings are loaded into the struct rev_info from the commit-graph.
Of course, some cases are still forbidden, but in the line-log case the pathspec is provided in a different way than normal.

Since multiple paths and segments could be requested, we compute the struct bloom_key data dynamically during the commit walk. This could likely be improved, but adds code complexity that is not valuable at this time.

There are two cases to care about: merge commits and "ordinary" commits.

  • Merge commits have multiple parents, but if we are TREESAME to our first parent in every range, then pass the blame for all ranges to the first parent.
  • Ordinary commits have the same condition, but each is done slightly differently in the process_ranges_[merge|ordinary]_commit() methods.

By checking if the changed-path Bloom filter can guarantee TREESAME, we can avoid that tree-diff cost. If the filter says "probably changed", then we need to run the tree-diff and then the blob-diff if there was a real edit.

The Linux kernel repository is a good testing ground for the performance improvements claimed here.
There are two different cases to test:

  • The first is the "entire history" case, where we output the entire history to /dev/null to see how long it would take to compute the full line-log history.
  • The second is the "first result" case, where we find how long it takes to show the first value, which is an indicator of how quickly a user would see responses when waiting at a terminal.

To test, I selected the paths that were changed most frequently in the top 10,000 commits using this command (stolen from StackOverflow):

git log --pretty=format: --name-only -n 10000 | sort | \
  uniq -c | sort -rg | head -10

which results in

121 MAINTAINERS
 63 fs/namei.c
 60 arch/x86/kvm/cpuid.c
 59 fs/io_uring.c
 58 arch/x86/kvm/vmx/vmx.c
 51 arch/x86/kvm/x86.c
 45 arch/x86/kvm/svm.c
 42 fs/btrfs/disk-io.c
 42 Documentation/scsi/index.rst

(along with a bogus first result).
It appears that the path arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the following results for the real command time:

|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |

It is worth noting that the least speedup comes for the MAINTAINERS file which is:

  • edited frequently,
  • low in the directory hierarchy, and
  • quite a large file.

All of those points lead to spending more time doing the blob diff and less time doing the tree diff.
Still, we see some improvement in that case and significant improvement in other cases.
A 2-4x speedup is likely the more typical case as opposed to the small 5% change for that file.


With Git 2.29 (Q4 2020), the changed-path Bloom filter is improved using ideas from an independent implementation.

See commit 7fbfe07, commit bb4d60e, commit 5cfa438, commit 2ad4f1a, commit fa79653, commit 0ee3cb8, commit 1df15f8, commit 6141cdf, commit cb9daf1, commit 35a9f1e (05 Jun 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit de6dda0, 30 Jul 2020)

commit-graph: simplify parse_commit_graph() #1

Signed-off-by: SZEDER Gábor
Signed-off-by: Derrick Stolee

While we iterate over all entries of the Chunk Lookup table we make sure that we don't attempt to read past the end of the mmap-ed commit-graph file, and check in each iteration that the chunk ID and offset we are about to read is still within the mmap-ed memory region. However, these checks in each iteration are not really necessary, because the number of chunks in the commit-graph file is already known before this loop from the just parsed commit-graph header.

So let's check that the commit-graph file is large enough for all entries in the Chunk Lookup table before we start iterating over those entries, and drop those per-iteration checks.
While at it, take into account the size of everything that is necessary to have a valid commit-graph file, i.e. the size of the header, the size of the mandatory OID Fanout chunk, and the size of the signature in the trailer as well.

Note that this necessitates a change of the error message as well.

And, still in commit-graph:

The Chunk Lookup table stores the chunks' starting offset in the commit-graph file, not their sizes.
Consequently, the size of a chunk can only be calculated by subtracting its offset from the offset of the subsequent chunk (or that of the terminating label).
This is currently implemented in a bit complicated way: as we iterate over the entries of the Chunk Lookup table, we check the id of each chunk and store its starting offset, then we check the id of the last seen chunk and calculate its size using its previously saved offset.
At the moment there is only one chunk for which we calculate its size, but this patch series will add more, and the repeated chunk id checks are not that pretty.

Instead let's read ahead the offset of the next chunk on each iteration, so we can calculate the size of each chunk right away, right where we store its starting offset.

VonC