Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rm/sparse-index-integration #6

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

ffyuanda
Copy link
Owner

No description provided.

@ffyuanda ffyuanda force-pushed the rm/sparse-index-integration branch from 647361d to c59fc38 Compare July 31, 2022 08:58
@ffyuanda ffyuanda changed the title Rm/sparse index integration rm/sparse-index-integration Jul 31, 2022
ffyuanda added 4 commits July 31, 2022 17:28
Turn off command_requires_full_index for `git-rm`.

Test `git-rm` when the target pathspec is in-cone and out-of-cone.

Ensure the sparse index is not expanded when operating inside of
cone area.

Signed-off-by: Shaoxuan Yuan <[email protected]>
pathspec_needs_expanded_index() is reusable when we need to verify if
the index needs to be expanded when the command is utilizing a pathspec
rather than a literal path. Move it for reusability.

Signed-off-by: Shaoxuan Yuan <[email protected]>
Originally, rm a pathspec that is out-of-cone in a sparse-index
environment, Git dies with "pathspec '<x>' did not match any files",
mainly because it does not expand the index so nothing is matched.

Expand the index when the pathspec needs an expanded index, i.e. the
pathspec contains wildcard that may need a full-index or the pathspec
is simply out-of-cone.

Signed-off-by: Shaoxuan Yuan <[email protected]>
The `p2000` tests demonstrate a ~96% execution time reduction for
'git rm' using a sparse index.

Test                                     before  after
-------------------------------------------------------------
2000.74: git rm -f f2/f4/a (full-v3)     0.66    0.88 +33.0%
2000.75: git rm -f f2/f4/a (full-v4)     0.67    0.75 +12.0%
2000.76: git rm -f f2/f4/a (sparse-v3)   1.99    0.08 -96.0%
2000.77: git rm -f f2/f4/a (sparse-v4)   2.06    0.07 -96.6%

Signed-off-by: Shaoxuan Yuan <[email protected]>
@ffyuanda ffyuanda force-pushed the rm/sparse-index-integration branch from c59fc38 to b01a786 Compare July 31, 2022 09:29
`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

	rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

	rm 'folder1/0/0/0'
	rm 'folder1/0/1'
	rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

Signed-off-by: Shaoxuan Yuan <[email protected]>
Copy link

@derrickstolee derrickstolee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked in Slack about how we could make git rm -r sparse/dir (where sparse-dir is exactly a sparse directory entry) have the same output without expanding the index (using read_tree[_at]()). This would similarly work for git rm -r sparse/ even if there are some non-sparse directories.

In the case of t1092, I think you could test this behavior with

test_sparse_match git sparse-checkout set deep/deeper1/deepest
test_sparse_match git rm -r deep/deeper2
test_sparse_match git rm -r deep/deeper1

I also have some thoughts on your patch organization:

  1. Add your compatibility tests in t1092 first. Make them all be test_expect_success and never make a change that breaks them. This might mean that you only mark the builtin as ready for the sparse index until everything else is completely implemented.
  2. Move the pathspec_needs_expanded_index() method.
  3. Add the use of pathspec_needs_expanded_index() and drop the ensure_full_index() within git-rm.
  4. Use read_tree[_at]() to give the same error message when deleting a sparse directory.
  5. Add the p2000 test at the same time as the "final" change to mark git rm as compatible with sparse-index. That allows us to demonstrate the performance improvement in the commit that introduces that improvement.

The code you have here is good, but you have some tests to add and a little more code to write. You're very close to submitting this to the list. Feel free to send it there when you think it's ready.

builtin/reset.c Show resolved Hide resolved
Copy link

@vdye vdye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! It looks like this is largely a clean/straightforward sparse index integration, with the only quirk being what was found in the last commit. I added a note there with my thoughts, but ultimately I think you should submit to the mailing list with your current approach, call out in the cover letter why you decided to do it, and collect opinions from reviewers on what they think it should do. If other contributors agree that the message doesn't need to be identical, you can avoid doing a bunch of unnecessary tree traversal work.

The only other note I have is: make sure tests pass after every commit (you can check by running git rebase -i -x "make -j12 DEVELOPER=1 && cd t && ./t1092-sparse-checkout-compatibility.sh" <your base>). The last commit doesn't change a test from test_expect_failure to test_expect_success, so I'm guessing there's something that will break in the middle of the series.

t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
pathspec.c Show resolved Hide resolved
pathspec.h Show resolved Hide resolved
t/perf/p2000-sparse-operations.sh Show resolved Hide resolved
t/t1092-sparse-checkout-compatibility.sh Show resolved Hide resolved
ffyuanda added a commit that referenced this pull request Aug 3, 2022
Enable the sparse index within the `git-rm` command.

The `p2000` tests demonstrate a ~96% execution time reduction for
'git rm' using a sparse index.

Test                                     before  after
-------------------------------------------------------------
2000.74: git rm -f f2/f4/a (full-v3)     0.66    0.88 +33.0%
2000.75: git rm -f f2/f4/a (full-v4)     0.67    0.75 +12.0%
2000.76: git rm -f f2/f4/a (sparse-v3)   1.99    0.08 -96.0%
2000.77: git rm -f f2/f4/a (sparse-v4)   2.06    0.07 -96.6%

----
Also, normalize a behavioral difference of `git-rm` under sparse-index.
See related discussion [1].

`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

        rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

        rm 'folder1/0/0/0'
        rm 'folder1/0/1'
        rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

[1] #6 (comment)

Signed-off-by: Shaoxuan Yuan <[email protected]>
ffyuanda added a commit that referenced this pull request Aug 3, 2022
Enable the sparse index within the `git-rm` command.

The `p2000` tests demonstrate a ~96% execution time reduction for
'git rm' using a sparse index.

Test                                     before  after
-------------------------------------------------------------
2000.74: git rm -f f2/f4/a (full-v3)     0.66    0.88 +33.0%
2000.75: git rm -f f2/f4/a (full-v4)     0.67    0.75 +12.0%
2000.76: git rm -f f2/f4/a (sparse-v3)   1.99    0.08 -96.0%
2000.77: git rm -f f2/f4/a (sparse-v4)   2.06    0.07 -96.6%

----
Also, normalize a behavioral difference of `git-rm` under sparse-index.
See related discussion [1].

`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

        rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

        rm 'folder1/0/0/0'
        rm 'folder1/0/1'
        rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

[1] #6 (comment)

Signed-off-by: Shaoxuan Yuan <[email protected]>
ffyuanda added a commit that referenced this pull request Aug 6, 2022
Enable the sparse index within the `git-rm` command.

The `p2000` tests demonstrate a ~92% execution time reduction for
'git rm' using a sparse index.

Test                              HEAD~1            HEAD
--------------------------------------------------------------------------
2000.74: git rm ... (full-v3)     0.41(0.37+0.05)   0.43(0.36+0.07) +4.9%
2000.75: git rm ... (full-v4)     0.38(0.34+0.05)   0.39(0.35+0.05) +2.6%
2000.76: git rm ... (sparse-v3)   0.57(0.56+0.01)   0.05(0.05+0.00) -91.2%
2000.77: git rm ... (sparse-v4)   0.57(0.55+0.02)   0.03(0.03+0.00) -94.7%

----
Also, normalize a behavioral difference of `git-rm` under sparse-index.
See related discussion [1].

`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

        rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

        rm 'folder1/0/0/0'
        rm 'folder1/0/1'
        rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

[1] #6 (comment)

Helped-by: Victoria Dye <[email protected]>
Helped-by: Derrick Stolee <[email protected]>
Signed-off-by: Shaoxuan Yuan <[email protected]>
ffyuanda added a commit that referenced this pull request Aug 7, 2022
Enable the sparse index within the `git-rm` command.

The `p2000` tests demonstrate a ~92% execution time reduction for
'git rm' using a sparse index.

Test                              HEAD~1            HEAD
--------------------------------------------------------------------------
2000.74: git rm ... (full-v3)     0.41(0.37+0.05)   0.43(0.36+0.07) +4.9%
2000.75: git rm ... (full-v4)     0.38(0.34+0.05)   0.39(0.35+0.05) +2.6%
2000.76: git rm ... (sparse-v3)   0.57(0.56+0.01)   0.05(0.05+0.00) -91.2%
2000.77: git rm ... (sparse-v4)   0.57(0.55+0.02)   0.03(0.03+0.00) -94.7%

----
Also, normalize a behavioral difference of `git-rm` under sparse-index.
See related discussion [1].

`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

        rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

        rm 'folder1/0/0/0'
        rm 'folder1/0/1'
        rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

[1] #6 (comment)

Helped-by: Victoria Dye <[email protected]>
Helped-by: Derrick Stolee <[email protected]>
Signed-off-by: Shaoxuan Yuan <[email protected]>
ffyuanda added a commit that referenced this pull request Aug 7, 2022
Enable the sparse index within the `git-rm` command.

The `p2000` tests demonstrate a ~92% execution time reduction for
'git rm' using a sparse index.

Test                              HEAD~1            HEAD
--------------------------------------------------------------------------
2000.74: git rm ... (full-v3)     0.41(0.37+0.05)   0.43(0.36+0.07) +4.9%
2000.75: git rm ... (full-v4)     0.38(0.34+0.05)   0.39(0.35+0.05) +2.6%
2000.76: git rm ... (sparse-v3)   0.57(0.56+0.01)   0.05(0.05+0.00) -91.2%
2000.77: git rm ... (sparse-v4)   0.57(0.55+0.02)   0.03(0.03+0.00) -94.7%

----
Also, normalize a behavioral difference of `git-rm` under sparse-index.
See related discussion [1].

`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

        rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

        rm 'folder1/0/0/0'
        rm 'folder1/0/1'
        rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

[1] #6 (comment)

Helped-by: Victoria Dye <[email protected]>
Helped-by: Derrick Stolee <[email protected]>
Signed-off-by: Shaoxuan Yuan <[email protected]>
gitster pushed a commit to gitster/git that referenced this pull request Aug 8, 2022
Enable the sparse index within the `git-rm` command.

The `p2000` tests demonstrate a ~92% execution time reduction for
'git rm' using a sparse index.

Test                              HEAD~1            HEAD
--------------------------------------------------------------------------
2000.74: git rm ... (full-v3)     0.41(0.37+0.05)   0.43(0.36+0.07) +4.9%
2000.75: git rm ... (full-v4)     0.38(0.34+0.05)   0.39(0.35+0.05) +2.6%
2000.76: git rm ... (sparse-v3)   0.57(0.56+0.01)   0.05(0.05+0.00) -91.2%
2000.77: git rm ... (sparse-v4)   0.57(0.55+0.02)   0.03(0.03+0.00) -94.7%

----
Also, normalize a behavioral difference of `git-rm` under sparse-index.
See related discussion [1].

`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

        rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

        rm 'folder1/0/0/0'
        rm 'folder1/0/1'
        rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

[1] ffyuanda#6 (comment)

Helped-by: Victoria Dye <[email protected]>
Helped-by: Derrick Stolee <[email protected]>
Signed-off-by: Shaoxuan Yuan <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
gitster pushed a commit to git/git that referenced this pull request Aug 8, 2022
Enable the sparse index within the `git-rm` command.

The `p2000` tests demonstrate a ~92% execution time reduction for
'git rm' using a sparse index.

Test                              HEAD~1            HEAD
--------------------------------------------------------------------------
2000.74: git rm ... (full-v3)     0.41(0.37+0.05)   0.43(0.36+0.07) +4.9%
2000.75: git rm ... (full-v4)     0.38(0.34+0.05)   0.39(0.35+0.05) +2.6%
2000.76: git rm ... (sparse-v3)   0.57(0.56+0.01)   0.05(0.05+0.00) -91.2%
2000.77: git rm ... (sparse-v4)   0.57(0.55+0.02)   0.03(0.03+0.00) -94.7%

----
Also, normalize a behavioral difference of `git-rm` under sparse-index.
See related discussion [1].

`git-rm` a sparse-directory entry within a sparse-index enabled repo
behaves differently from a sparse directory within a sparse-checkout
enabled repo.

For example, in a sparse-index repo, where 'folder1' is a
sparse-directory entry, `git rm -r --sparse folder1` provides this:

        rm 'folder1/'

Whereas in a sparse-checkout repo *without* sparse-index, doing so
provides this:

        rm 'folder1/0/0/0'
        rm 'folder1/0/1'
        rm 'folder1/a'

Because `git rm` a sparse-directory entry does not need to expand the
index, therefore we should accept the current behavior, which is faster
than "expand the sparse-directory entry to match the sparse-checkout
situation".

Modify a previous test so such difference is not considered as an error.

[1] ffyuanda#6 (comment)

Helped-by: Victoria Dye <[email protected]>
Helped-by: Derrick Stolee <[email protected]>
Signed-off-by: Shaoxuan Yuan <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
ffyuanda pushed a commit that referenced this pull request Aug 26, 2022
Since commit fcc07e9 (is_promisor_object(): free tree buffer after
parsing, 2021-04-13), we'll always free the buffers attached to a
"struct tree" after searching them for promisor links. But there's an
important case where we don't want to do so: if somebody else is already
using the tree!

This can happen during a "rev-list --missing=allow-promisor" traversal
in a partial clone that is missing one or more trees or blobs. The
backtrace for the free looks like this:

      #1 free_tree_buffer tree.c:147
      #2 add_promisor_object packfile.c:2250
      #3 for_each_object_in_pack packfile.c:2190
      #4 for_each_packed_object packfile.c:2215
      #5 is_promisor_object packfile.c:2272
      #6 finish_object__ma builtin/rev-list.c:245
      #7 finish_object builtin/rev-list.c:261
      #8 show_object builtin/rev-list.c:274
      #9 process_blob list-objects.c:63
      git#10 process_tree_contents list-objects.c:145
      git#11 process_tree list-objects.c:201
      git#12 traverse_trees_and_blobs list-objects.c:344
      [...]

We're in the middle of walking through the entries of a tree object via
process_tree_contents(). We see a blob (or it could even be another tree
entry) that we don't have, so we call is_promisor_object() to check it.
That function loops over all of the objects in the promisor packfile,
including the tree we're currently walking. When we're done with it
there, we free the tree buffer. But as we return to the walk in
process_tree_contents(), it's still holding on to a pointer to that
buffer, via its tree_desc iterator, and it accesses the freed memory.

Even a trivial use of "--missing=allow-promisor" triggers this problem,
as the included test demonstrates (it's just a vanilla --blob:none
clone).

We can detect this case by only freeing the tree buffer if it was
allocated on our behalf. This is a little tricky since that happens
inside parse_object(), and it doesn't tell us whether the object was
already parsed, or whether it allocated the buffer itself. But by
checking for an already-parsed tree beforehand, we can distinguish the
two cases.

That feels a little hacky, and does incur an extra lookup in the
object-hash table. But that cost is fairly minimal compared to actually
loading objects (and since we're iterating the whole pack here, we're
likely to be loading most objects, rather than reusing cached results).

It may also be a good direction for this function in general, as there
are other possible optimizations that rely on doing some analysis before
parsing:

  - we could detect blobs and avoid reading their contents; they can't
    link to other objects, but parse_object() doesn't know that we don't
    care about checking their hashes.

  - we could avoid allocating object structs entirely for most objects
    (since we really only need them in the oidset), which would save
    some memory.

  - promisor commits could use the commit-graph rather than loading the
    object from disk

This commit doesn't do any of those optimizations, but I think it argues
that this direction is reasonable, rather than relying on parse_object()
and trying to teach it to give us more information about whether it
parsed.

The included test fails reliably under SANITIZE=address just when
running "rev-list --missing=allow-promisor". Checking the output isn't
strictly necessary to detect the bug, but it seems like a reasonable
addition given the general lack of coverage for "allow-promisor" in the
test suite.

Reported-by: Andrew Olsen <[email protected]>
Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
ffyuanda pushed a commit that referenced this pull request Sep 1, 2022
Fix a memory leak occuring in case of pathspec copy in preload_index.

Direct leak of 8 byte(s) in 8 object(s) allocated from:
    #0 0x7f0a353ead47 in __interceptor_malloc (/usr/lib/gcc/x86_64-pc-linux-gnu/11.3.0/libasan.so.6+0xb5d47)
    #1 0x55750995e840 in do_xmalloc /home/anthony/src/c/git/wrapper.c:51
    #2 0x55750995e840 in xmalloc /home/anthony/src/c/git/wrapper.c:72
    #3 0x55750970f824 in copy_pathspec /home/anthony/src/c/git/pathspec.c:684
    #4 0x557509717278 in preload_index /home/anthony/src/c/git/preload-index.c:135
    #5 0x55750975f21e in refresh_index /home/anthony/src/c/git/read-cache.c:1633
    #6 0x55750915b926 in cmd_status builtin/commit.c:1547
    #7 0x5575090e1680 in run_builtin /home/anthony/src/c/git/git.c:466
    #8 0x5575090e1680 in handle_builtin /home/anthony/src/c/git/git.c:720
    #9 0x5575090e284a in run_argv /home/anthony/src/c/git/git.c:787
    git#10 0x5575090e284a in cmd_main /home/anthony/src/c/git/git.c:920
    git#11 0x5575090dbf82 in main /home/anthony/src/c/git/common-main.c:56
    git#12 0x7f0a348230ab  (/lib64/libc.so.6+0x290ab)

Signed-off-by: Anthony Delannoy <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants