Skip to content

Commit

Permalink
Merge branch 'en/fix-evil-degenerate-merge-pruning'
Browse files Browse the repository at this point in the history
Years ago, the find FreeBSD folks filed an important bug about
git-filter-repo with a clear way to reproduce (even if in a large
repository).  Sadly, I repeatedly got stumped each time I looked into
it.  Turns out it was a royal mess to work through, even if the fixes
aren't all that large.  Weeks and weeks.  My apologies to the FreeBSD
folks who were always very helpful reporters and who didn't get this
proper bugfix for years.

Signed-off-by: Elijah Newren <[email protected]>
  • Loading branch information
newren committed Oct 21, 2024
2 parents 6123159 + ddbc6c3 commit cd14700
Show file tree
Hide file tree
Showing 3 changed files with 183 additions and 46 deletions.
132 changes: 86 additions & 46 deletions git-filter-repo
Original file line number Diff line number Diff line change
Expand Up @@ -514,14 +514,6 @@ class _IDs(object):
else:
return old_id

def reverse_translate(self, new_id):
"""
If new_id is an alternate id mapping, return the list of original ids.
(Since commits get pruned, multiple old commit ids map map to the same
new commit id.)
"""
return self._reverse_translation.get(new_id, [new_id])

def __str__(self):
"""
Convert IDs to string; used for debugging
Expand Down Expand Up @@ -3293,11 +3285,11 @@ class RepoFilter(object):
return old_hash
return new_hash

def _trim_extra_parents(self, orig_parents, parents):
def _maybe_trim_extra_parents(self, orig_parents, parents):
'''Due to pruning of empty commits, some parents could be non-existent
(None) or otherwise redundant. Remove the non-existent parents, and
remove redundant parents so long as that doesn't transform a merge
commit into a non-merge commit.
remove redundant parents ***SO LONG AS*** that doesn't transform a
merge commit into a non-merge commit.
Returns a tuple:
(parents, new_first_parent_if_would_become_non_merge)'''
Expand Down Expand Up @@ -3672,31 +3664,93 @@ class RepoFilter(object):
external_parents = parents
else:
external_parents = [p for p in parents if not isinstance(p, int)]
# The use of 'reversed' is intentional here; there is a risk that we have
# duplicates in parents, and we want to map from parents to the first
# entry we find in orig_parents in such cases.
parent_reverse_dict = dict(zip(reversed(parents), reversed(orig_parents)))

self._graph.record_external_commits(external_parents)
self._orig_graph.record_external_commits(external_parents)
self._graph.add_commit_and_parents(commit.id, parents) # new githash unknown
self._orig_graph.add_commit_and_parents(commit.old_id, orig_parents,
commit.original_id)

# Prune parents (due to pruning of empty commits) if relevant
old_1st_parent = parents[0] if parents else None
parents, new_1st_parent = self._trim_extra_parents(orig_parents, parents)
# Prune parents (due to pruning of empty commits) if relevant, note that
# new_1st_parent is None unless this was a merge commit that is becoming
# a non-merge
prev_1st_parent = parents[0] if parents else None
parents, new_1st_parent = self._maybe_trim_extra_parents(orig_parents,
parents)
commit.parents = parents

# If parents were pruned, then we need our file changes to be relative
# to the new first parent
if parents and old_1st_parent != parents[0]:
#
# Notes:
# * new_1st_parent and new_1st_parent != parents[0] uniquely happens for example when:
# working on merge, selecting subset of files and merge base still
# valid while first parent history doesn't touch any of those paths,
# but second parent history does. prev_1st_parent had already been
# rewritten to the non-None first ancestor and it remains valid.
# self._maybe_trim_extra_parents() avoids removing this first parent
# because it'd make the commit a non-merge. However, if there are
# no file_changes of note, we'll drop this commit and mark
# new_1st_parent as the new replacement. To correctly determine if
# there are no file_changes of note, we need to have the list of
# file_changes relative to new_1st_parent.
# (See t9390#3, "basic -> basic-ten using '--path ten'")
# * prev_1st_parent != parents[0] happens for example when:
# similar to above, but the merge base is no longer valid and was
# pruned away as well. Then parents started as e.g. [None, $num],
# and both prev_1st_parent and new_1st_parent are None, while parents
# after self._maybe_trim_extra_parents() becomes just [$num].
# (See t9390#67, "degenerate merge with non-matching filename".)
# Since $num was originally a second parent, we need to rewrite
# file changes to be relative to parents[0].
# * TODO: We should be getting the changes relative to the new first
# parent even if self._fep is None, BUT we can't. Our method of
# getting the changes right now is an external git diff invocation,
# which we can't do if we just have a fast export stream. We can't
# really work around it by querying the fast-import stream either,
# because the 'ls' directive only allows us to list info about
# specific paths, but we need to find out which paths exist in two
# commits and then query them. We could maybe force checkpointing in
# fast-import, then doing a diff from what'll be the new first parent
# back to prev_1st_parent (which may be None, i.e. empty tree), using
# the fact that in A->{B,C}->D, where D is merge of B & C, the diff
# from C->D == C->A + A->B + B->D, and in these cases A==B, so it
# simplifies to C->D == C->A + B->D, and C is our new 1st parent
# commit, A is prev_1st_commit, and B->D is commit.file_changes that
# we already have. However, checkpointing the fast-import process
# and figuring out how long to wait before we can run our diff just
# seems excessive. For now, just punt and assume the merge wasn't
# "evil" (i.e. that it's remerge-diff is empty, as is true for most
# merges). If the merge isn't evil, no further steps are necessary.
if parents and self._fep and (
prev_1st_parent != parents[0] or
new_1st_parent and new_1st_parent != parents[0]):
# Get the id from the original fast export stream corresponding to the
# new 1st parent. Since pruning of commits can cause multiple old
# commits to map to the same new commit, we have a uniqueness problem,
# but pruning means there are no relevant file changes between the
# commit so we can just take the first old commit id.
new_1st_parent_old_id = _IDS.reverse_translate(parents[0])[0]
# new 1st parent. As noted above, that new 1st parent might be
# new_1st_parent, or if that is None, it'll be parents[0].
will_be_1st = new_1st_parent or parents[0]
old_id = parent_reverse_dict[will_be_1st]
# Now, translate that to a hash
new_1st_parent_old_hash = self._orig_graph.map_to_hash(new_1st_parent_old_id)
will_be_1st_commit_hash = self._orig_graph.map_to_hash(old_id)
# Get the changes from what is going to be the new 1st parent to this
# merge commit. Note that since we are going from the new 1st parent
# to the merge commit, we can just replace the existing
# commit.file_changes rather than getting something we need to combine
# with the existing commit.file_changes. Also, we can just replace
# because prev_1st_parent is an ancestor of will_be_1st_commit_hash
# (or prev_1st_parent is None and first parent history is gone), so
# even if we retain prev_1st_parent and do not prune it, the changes
# will still work given the snapshot-based way fast-export/fast-import
# work.
commit.file_changes = GitUtils.get_file_changes(self._repo_working_dir,
new_1st_parent_old_hash,
will_be_1st_commit_hash,
commit.original_id)

# Save these and filter them
orig_file_changes = set(commit.file_changes)
self._filter_files(commit)

Expand Down Expand Up @@ -4145,8 +4199,8 @@ class RepoFilter(object):
original_hash = old_commit_unrenames.get(old_hash, old_hash)
old_ref_map[refname] = (original_hash, deleted_hash)

batch_check_process = None
batch_check_output_re = re.compile(b'^([0-9a-f]{40}) ([a-z]+) ([0-9]+)$')
new_refs = {}
new_refs_initialized = False
ref_maps = {}
self._orig_graph._ensure_reverse_maps_populated()
for refname, pair in old_ref_map.items():
Expand All @@ -4160,36 +4214,22 @@ class RepoFilter(object):
else:
new_hash = intermediate
else: # Must be either an annotated tag, or a ref whose tip was pruned
if not batch_check_process:
cmd = 'git cat-file --batch-check'.split()
if not new_refs_initialized:
target_working_dir = self._args.target or b'.'
batch_check_process = subproc.Popen(cmd,
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
cwd=target_working_dir)
batch_check_process.stdin.write(refname+b"\n")
batch_check_process.stdin.flush()
line = batch_check_process.stdout.readline()
m = batch_check_output_re.match(line)
if m and m.group(2) in (b'tag', b'commit'):
new_hash = m.group(1)
elif line.endswith(b' missing\n'):
new_hash = deleted_hash
new_refs = GitUtils.get_refs(target_working_dir)
if refname in new_refs:
new_hash = new_refs[refname]
else:
raise SystemExit(_("Failed to find new id for %(refname)s "
"(old id was %(old_hash)s)")
% ({'refname': refname, 'old_hash': old_hash})
) # pragma: no cover
new_hash = deleted_hash
ref_maps[refname] = (old_hash, new_hash)
if self._args.source or self._args.target:
new_refs = GitUtils.get_refs(self._args.target or b'.')
if not new_refs_initialized:
target_working_dir = self._args.target or b'.'
new_refs = GitUtils.get_refs(target_working_dir)
for ref, new_hash in new_refs.items():
if ref not in orig_refs and not ref.startswith(b'refs/replace/'):
old_hash = b'0'*len(new_hash)
ref_maps[ref] = (old_hash, new_hash)
if batch_check_process:
batch_check_process.stdin.close()
batch_check_process.wait()

#
# Third, handle first_changes
Expand Down
16 changes: 16 additions & 0 deletions t/t9390-filter-repo.sh
Original file line number Diff line number Diff line change
Expand Up @@ -1875,6 +1875,22 @@ test_expect_success 'degenerate merge with typechange' '
)
'

test_expect_success 'degenerate evil merge' '
test_create_repo degenerate_evil_merge &&
(
cd degenerate_evil_merge &&
cat $DATA/degenerate-evil-merge | git fast-import --quiet &&
git filter-repo --force --subdirectory-filter module-of-interest &&
test_path_is_missing module-of-interest &&
test_path_is_missing other-module &&
test_path_is_missing irrelevant &&
test_path_is_file file1 &&
test_path_is_file file2 &&
test_path_is_file file3
)
'

test_expect_success 'Filtering a blob to make it match previous version' '
test_create_repo remove_unique_bits_of_blob &&
(
Expand Down
81 changes: 81 additions & 0 deletions t/t9390/degenerate-evil-merge
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
feature done
blob
mark :1
data 0

reset refs/heads/master
commit refs/heads/master
mark :2
author Full Name <[email protected]> 2000000000 +0100
committer Full Name <[email protected]> 2000000000 +0100
data 2
A
M 100644 :1 irrelevant
M 100644 :1 module-of-interest/file1
M 100644 :1 module-of-interest/file2
M 100644 :1 module-of-interest/file3
M 100644 :1 other-module/files
M 100644 :1 other-module/are
M 100644 :1 other-module/fun

commit refs/heads/master
mark :3
author Full Name <[email protected]> 2000030000 +0100
committer Full Name <[email protected]> 2000030000 +0100
data 2
B
from :2
D irrelevant
D module-of-interest/file1
D module-of-interest/file2
D module-of-interest/file3

blob
mark :4
data 8
content

commit refs/heads/master
mark :5
author Full Name <[email protected]> 2000040000 +0100
committer Full Name <[email protected]> 2000040000 +0100
data 2
D
from :3
M 100644 :4 other-module/fun

commit refs/heads/master
mark :6
author Full Name <[email protected]> 2000020000 +0100
committer Full Name <[email protected]> 2000020000 +0100
data 2
C
from :2
M 100644 :4 irrelevant

commit refs/heads/master
mark :7
author Full Name <[email protected]> 2000050000 +0100
committer Full Name <[email protected]> 2000050000 +0100
data 31
Merge and ignore the deletions
from :6
merge :5
M 100644 :4 irrelevant
M 100644 :4 other-module/fun

blob
mark :8
data 6
final

commit refs/heads/master
mark :7
author Full Name <[email protected]> 2000060000 +0100
committer Full Name <[email protected]> 2000060000 +0100
data 13
Final change
from :7
M 100644 :8 module-of-interest/file2

done

0 comments on commit cd14700

Please sign in to comment.