rm/sparse-index-integration #6

ffyuanda · 2022-07-31T08:56:06Z

No description provided.

Turn off command_requires_full_index for `git-rm`. Test `git-rm` when the target pathspec is in-cone and out-of-cone. Ensure the sparse index is not expanded when operating inside of cone area. Signed-off-by: Shaoxuan Yuan <[email protected]>

pathspec_needs_expanded_index() is reusable when we need to verify if the index needs to be expanded when the command is utilizing a pathspec rather than a literal path. Move it for reusability. Signed-off-by: Shaoxuan Yuan <[email protected]>

Originally, rm a pathspec that is out-of-cone in a sparse-index environment, Git dies with "pathspec '<x>' did not match any files", mainly because it does not expand the index so nothing is matched. Expand the index when the pathspec needs an expanded index, i.e. the pathspec contains wildcard that may need a full-index or the pathspec is simply out-of-cone. Signed-off-by: Shaoxuan Yuan <[email protected]>

The `p2000` tests demonstrate a ~96% execution time reduction for 'git rm' using a sparse index. Test before after ------------------------------------------------------------- 2000.74: git rm -f f2/f4/a (full-v3) 0.66 0.88 +33.0% 2000.75: git rm -f f2/f4/a (full-v4) 0.67 0.75 +12.0% 2000.76: git rm -f f2/f4/a (sparse-v3) 1.99 0.08 -96.0% 2000.77: git rm -f f2/f4/a (sparse-v4) 2.06 0.07 -96.6% Signed-off-by: Shaoxuan Yuan <[email protected]>

`git-rm` a sparse-directory entry within a sparse-index enabled repo behaves differently from a sparse directory within a sparse-checkout enabled repo. For example, in a sparse-index repo, where 'folder1' is a sparse-directory entry, `git rm -r --sparse folder1` provides this: rm 'folder1/' Whereas in a sparse-checkout repo *without* sparse-index, doing so provides this: rm 'folder1/0/0/0' rm 'folder1/0/1' rm 'folder1/a' Because `git rm` a sparse-directory entry does not need to expand the index, therefore we should accept the current behavior, which is faster than "expand the sparse-directory entry to match the sparse-checkout situation". Modify a previous test so such difference is not considered as an error. Signed-off-by: Shaoxuan Yuan <[email protected]>

derrickstolee

We talked in Slack about how we could make git rm -r sparse/dir (where sparse-dir is exactly a sparse directory entry) have the same output without expanding the index (using read_tree[_at]()). This would similarly work for git rm -r sparse/ even if there are some non-sparse directories.

In the case of t1092, I think you could test this behavior with

test_sparse_match git sparse-checkout set deep/deeper1/deepest
test_sparse_match git rm -r deep/deeper2
test_sparse_match git rm -r deep/deeper1

I also have some thoughts on your patch organization:

Add your compatibility tests in t1092 first. Make them all be test_expect_success and never make a change that breaks them. This might mean that you only mark the builtin as ready for the sparse index until everything else is completely implemented.
Move the pathspec_needs_expanded_index() method.
Add the use of pathspec_needs_expanded_index() and drop the ensure_full_index() within git-rm.
Use read_tree[_at]() to give the same error message when deleting a sparse directory.
Add the p2000 test at the same time as the "final" change to mark git rm as compatible with sparse-index. That allows us to demonstrate the performance improvement in the commit that introduces that improvement.

The code you have here is good, but you have some tests to add and a little more code to write. You're very close to submitting this to the list. Feel free to send it there when you think it's ready.

builtin/reset.c

vdye

Looks great! It looks like this is largely a clean/straightforward sparse index integration, with the only quirk being what was found in the last commit. I added a note there with my thoughts, but ultimately I think you should submit to the mailing list with your current approach, call out in the cover letter why you decided to do it, and collect opinions from reviewers on what they think it should do. If other contributors agree that the message doesn't need to be identical, you can avoid doing a bunch of unnecessary tree traversal work.

The only other note I have is: make sure tests pass after every commit (you can check by running git rebase -i -x "make -j12 DEVELOPER=1 && cd t && ./t1092-sparse-checkout-compatibility.sh" <your base>). The last commit doesn't change a test from test_expect_failure to test_expect_success, so I'm guessing there's something that will break in the middle of the series.

t/t1092-sparse-checkout-compatibility.sh

pathspec.c

pathspec.h

t/perf/p2000-sparse-operations.sh

t/t1092-sparse-checkout-compatibility.sh

Enable the sparse index within the `git-rm` command. The `p2000` tests demonstrate a ~96% execution time reduction for 'git rm' using a sparse index. Test before after ------------------------------------------------------------- 2000.74: git rm -f f2/f4/a (full-v3) 0.66 0.88 +33.0% 2000.75: git rm -f f2/f4/a (full-v4) 0.67 0.75 +12.0% 2000.76: git rm -f f2/f4/a (sparse-v3) 1.99 0.08 -96.0% 2000.77: git rm -f f2/f4/a (sparse-v4) 2.06 0.07 -96.6% ---- Also, normalize a behavioral difference of `git-rm` under sparse-index. See related discussion [1]. `git-rm` a sparse-directory entry within a sparse-index enabled repo behaves differently from a sparse directory within a sparse-checkout enabled repo. For example, in a sparse-index repo, where 'folder1' is a sparse-directory entry, `git rm -r --sparse folder1` provides this: rm 'folder1/' Whereas in a sparse-checkout repo *without* sparse-index, doing so provides this: rm 'folder1/0/0/0' rm 'folder1/0/1' rm 'folder1/a' Because `git rm` a sparse-directory entry does not need to expand the index, therefore we should accept the current behavior, which is faster than "expand the sparse-directory entry to match the sparse-checkout situation". Modify a previous test so such difference is not considered as an error. [1] #6 (comment) Signed-off-by: Shaoxuan Yuan <[email protected]>

Enable the sparse index within the `git-rm` command. The `p2000` tests demonstrate a ~92% execution time reduction for 'git rm' using a sparse index. Test HEAD~1 HEAD -------------------------------------------------------------------------- 2000.74: git rm ... (full-v3) 0.41(0.37+0.05) 0.43(0.36+0.07) +4.9% 2000.75: git rm ... (full-v4) 0.38(0.34+0.05) 0.39(0.35+0.05) +2.6% 2000.76: git rm ... (sparse-v3) 0.57(0.56+0.01) 0.05(0.05+0.00) -91.2% 2000.77: git rm ... (sparse-v4) 0.57(0.55+0.02) 0.03(0.03+0.00) -94.7% ---- Also, normalize a behavioral difference of `git-rm` under sparse-index. See related discussion [1]. `git-rm` a sparse-directory entry within a sparse-index enabled repo behaves differently from a sparse directory within a sparse-checkout enabled repo. For example, in a sparse-index repo, where 'folder1' is a sparse-directory entry, `git rm -r --sparse folder1` provides this: rm 'folder1/' Whereas in a sparse-checkout repo *without* sparse-index, doing so provides this: rm 'folder1/0/0/0' rm 'folder1/0/1' rm 'folder1/a' Because `git rm` a sparse-directory entry does not need to expand the index, therefore we should accept the current behavior, which is faster than "expand the sparse-directory entry to match the sparse-checkout situation". Modify a previous test so such difference is not considered as an error. [1] #6 (comment) Helped-by: Victoria Dye <[email protected]> Helped-by: Derrick Stolee <[email protected]> Signed-off-by: Shaoxuan Yuan <[email protected]>

Enable the sparse index within the `git-rm` command. The `p2000` tests demonstrate a ~92% execution time reduction for 'git rm' using a sparse index. Test HEAD~1 HEAD -------------------------------------------------------------------------- 2000.74: git rm ... (full-v3) 0.41(0.37+0.05) 0.43(0.36+0.07) +4.9% 2000.75: git rm ... (full-v4) 0.38(0.34+0.05) 0.39(0.35+0.05) +2.6% 2000.76: git rm ... (sparse-v3) 0.57(0.56+0.01) 0.05(0.05+0.00) -91.2% 2000.77: git rm ... (sparse-v4) 0.57(0.55+0.02) 0.03(0.03+0.00) -94.7% ---- Also, normalize a behavioral difference of `git-rm` under sparse-index. See related discussion [1]. `git-rm` a sparse-directory entry within a sparse-index enabled repo behaves differently from a sparse directory within a sparse-checkout enabled repo. For example, in a sparse-index repo, where 'folder1' is a sparse-directory entry, `git rm -r --sparse folder1` provides this: rm 'folder1/' Whereas in a sparse-checkout repo *without* sparse-index, doing so provides this: rm 'folder1/0/0/0' rm 'folder1/0/1' rm 'folder1/a' Because `git rm` a sparse-directory entry does not need to expand the index, therefore we should accept the current behavior, which is faster than "expand the sparse-directory entry to match the sparse-checkout situation". Modify a previous test so such difference is not considered as an error. [1] ffyuanda#6 (comment) Helped-by: Victoria Dye <[email protected]> Helped-by: Derrick Stolee <[email protected]> Signed-off-by: Shaoxuan Yuan <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Since commit fcc07e9 (is_promisor_object(): free tree buffer after parsing, 2021-04-13), we'll always free the buffers attached to a "struct tree" after searching them for promisor links. But there's an important case where we don't want to do so: if somebody else is already using the tree! This can happen during a "rev-list --missing=allow-promisor" traversal in a partial clone that is missing one or more trees or blobs. The backtrace for the free looks like this: #1 free_tree_buffer tree.c:147 #2 add_promisor_object packfile.c:2250 #3 for_each_object_in_pack packfile.c:2190 #4 for_each_packed_object packfile.c:2215 #5 is_promisor_object packfile.c:2272 #6 finish_object__ma builtin/rev-list.c:245 #7 finish_object builtin/rev-list.c:261 #8 show_object builtin/rev-list.c:274 #9 process_blob list-objects.c:63 git#10 process_tree_contents list-objects.c:145 git#11 process_tree list-objects.c:201 git#12 traverse_trees_and_blobs list-objects.c:344 [...] We're in the middle of walking through the entries of a tree object via process_tree_contents(). We see a blob (or it could even be another tree entry) that we don't have, so we call is_promisor_object() to check it. That function loops over all of the objects in the promisor packfile, including the tree we're currently walking. When we're done with it there, we free the tree buffer. But as we return to the walk in process_tree_contents(), it's still holding on to a pointer to that buffer, via its tree_desc iterator, and it accesses the freed memory. Even a trivial use of "--missing=allow-promisor" triggers this problem, as the included test demonstrates (it's just a vanilla --blob:none clone). We can detect this case by only freeing the tree buffer if it was allocated on our behalf. This is a little tricky since that happens inside parse_object(), and it doesn't tell us whether the object was already parsed, or whether it allocated the buffer itself. But by checking for an already-parsed tree beforehand, we can distinguish the two cases. That feels a little hacky, and does incur an extra lookup in the object-hash table. But that cost is fairly minimal compared to actually loading objects (and since we're iterating the whole pack here, we're likely to be loading most objects, rather than reusing cached results). It may also be a good direction for this function in general, as there are other possible optimizations that rely on doing some analysis before parsing: - we could detect blobs and avoid reading their contents; they can't link to other objects, but parse_object() doesn't know that we don't care about checking their hashes. - we could avoid allocating object structs entirely for most objects (since we really only need them in the oidset), which would save some memory. - promisor commits could use the commit-graph rather than loading the object from disk This commit doesn't do any of those optimizations, but I think it argues that this direction is reasonable, rather than relying on parse_object() and trying to teach it to give us more information about whether it parsed. The included test fails reliably under SANITIZE=address just when running "rev-list --missing=allow-promisor". Checking the output isn't strictly necessary to detect the bug, but it seems like a reasonable addition given the general lack of coverage for "allow-promisor" in the test suite. Reported-by: Andrew Olsen <[email protected]> Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Fix a memory leak occuring in case of pathspec copy in preload_index. Direct leak of 8 byte(s) in 8 object(s) allocated from: #0 0x7f0a353ead47 in __interceptor_malloc (/usr/lib/gcc/x86_64-pc-linux-gnu/11.3.0/libasan.so.6+0xb5d47) #1 0x55750995e840 in do_xmalloc /home/anthony/src/c/git/wrapper.c:51 #2 0x55750995e840 in xmalloc /home/anthony/src/c/git/wrapper.c:72 #3 0x55750970f824 in copy_pathspec /home/anthony/src/c/git/pathspec.c:684 #4 0x557509717278 in preload_index /home/anthony/src/c/git/preload-index.c:135 #5 0x55750975f21e in refresh_index /home/anthony/src/c/git/read-cache.c:1633 #6 0x55750915b926 in cmd_status builtin/commit.c:1547 #7 0x5575090e1680 in run_builtin /home/anthony/src/c/git/git.c:466 #8 0x5575090e1680 in handle_builtin /home/anthony/src/c/git/git.c:720 #9 0x5575090e284a in run_argv /home/anthony/src/c/git/git.c:787 git#10 0x5575090e284a in cmd_main /home/anthony/src/c/git/git.c:920 git#11 0x5575090dbf82 in main /home/anthony/src/c/git/common-main.c:56 git#12 0x7f0a348230ab (/lib64/libc.so.6+0x290ab) Signed-off-by: Anthony Delannoy <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

ffyuanda force-pushed the rm/sparse-index-integration branch from 647361d to c59fc38 Compare July 31, 2022 08:58

ffyuanda changed the title ~~Rm/sparse index integration~~ rm/sparse-index-integration Jul 31, 2022

ffyuanda added 4 commits July 31, 2022 17:28

rm: integrate with sparse-index

b5a0ac6

Turn off command_requires_full_index for `git-rm`. Test `git-rm` when the target pathspec is in-cone and out-of-cone. Ensure the sparse index is not expanded when operating inside of cone area. Signed-off-by: Shaoxuan Yuan <[email protected]>

ffyuanda force-pushed the rm/sparse-index-integration branch from c59fc38 to b01a786 Compare July 31, 2022 09:29

derrickstolee reviewed Aug 1, 2022

View reviewed changes

builtin/reset.c Show resolved Hide resolved

vdye reviewed Aug 1, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rm/sparse-index-integration #6

rm/sparse-index-integration #6

ffyuanda commented Jul 31, 2022

derrickstolee left a comment

vdye left a comment •

edited

Loading

rm/sparse-index-integration #6

Are you sure you want to change the base?

rm/sparse-index-integration #6

Conversation

ffyuanda commented Jul 31, 2022

derrickstolee left a comment

Choose a reason for hiding this comment

vdye left a comment • edited Loading

Choose a reason for hiding this comment

vdye left a comment •

edited

Loading