From 92b3bece1579dc41636b2e280e3d1d0e856c199c Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Tue, 1 Oct 2024 13:34:31 -0700 Subject: [PATCH 01/10] pyproject.toml: mark filter-repo stable Should have switched this over about 4 years ago, but of course I forgot all about it... Signed-off-by: Elijah Newren --- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index 9a689a2b..7157e75f 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,7 +6,7 @@ authors = [ ] readme = "README.md" classifiers = [ - "Development Status :: 4 - Beta", + "Development Status :: 5 - Production/Stable", "Operating System :: OS Independent", "Programming Language :: Python", "License :: OSI Approved :: MIT License", From cc093e87343327c69c61cea9c5c3aeafcb11963b Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Tue, 17 Sep 2024 12:25:24 -0700 Subject: [PATCH 02/10] git-filter-repo.txt: note additional default capabilities git-filter-repo has heeded refs/replace/ since the beginning (although that isn't saying much since git-fast-export does all the real lifting), and has automatically updated the stash since commit 5cfd52f02514 (filter-repo: rewrite the stash too, 2024-07-31). Note both in the documentation. Signed-off-by: Elijah Newren --- Documentation/git-filter-repo.txt | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Documentation/git-filter-repo.txt b/Documentation/git-filter-repo.txt index b3c9726b..1eb2385d 100644 --- a/Documentation/git-filter-repo.txt +++ b/Documentation/git-filter-repo.txt @@ -54,6 +54,9 @@ can be overridden, but they are all on by default): * pruning commits which become empty due to the above filters (also handles edge cases like pruning of merge commits which become degenerate and empty) + * rewriting stashes + * baking the changes made by refs/replace/ refs into the permanent + history and removing the replace refs * stripping of original history to avoid mixing old and new history * repacking the repository post-rewrite to shrink the repo for the user From c4f4f2dec6aa686cf6ad057899286d7d6c8b85ad Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Thu, 3 Oct 2024 13:15:51 -0700 Subject: [PATCH 03/10] git-filter-repo.txt: remove case specific explanation of commit-map file The explanation of the commit-map file from commit f867bb6ad745 (git-filter-repo.txt: document mapping output, 2020-06-18) included an implicit assumption that the repository filtering operation being performed was the excise of some large file(s). While some rewrites may fall into that category, it is not reasonable to assume that all rewrites are of that type. Correct the wording. Signed-off-by: Elijah Newren --- Documentation/git-filter-repo.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/git-filter-repo.txt b/Documentation/git-filter-repo.txt index 1eb2385d..16f1465a 100644 --- a/Documentation/git-filter-repo.txt +++ b/Documentation/git-filter-repo.txt @@ -388,8 +388,8 @@ commits were (or were not) changed. * A header is the first line with the text "old" and "new" * Commit mappings are in no particular order * All commits in range of the rewrite will be listed, even commits - that are unchanged (e.g. because the commit pre-dated when the - large file(s) were introduced to the repo). + that are unchanged (e.g. because the commit pre-dated when files + the filtering operation are removing were introduced to the repo). * An all-zeros hash, or null SHA, represents a non-existent object. When in the "new" column, this means the commit was removed entirely. From d48bab6753036c03e09d9da15e5119cf69bb74c9 Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Thu, 3 Oct 2024 13:45:05 -0700 Subject: [PATCH 04/10] git-filter-repo.txt: correct location of metadata files The explanation of the commit-map and ref-map files from commit f867bb6ad745 (git-filter-repo.txt: document mapping output, 2020-06-18) included an implicit assumption that "${GIT_DIR}" == ".git/". While this may often be true, filtering can also be done in a bare repository, for example, so this is misleading. Correct the wording. Signed-off-by: Elijah Newren --- Documentation/git-filter-repo.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/git-filter-repo.txt b/Documentation/git-filter-repo.txt index 16f1465a..7413dda6 100644 --- a/Documentation/git-filter-repo.txt +++ b/Documentation/git-filter-repo.txt @@ -382,7 +382,7 @@ directory. These files are overwritten unconditionally on every run. Commit map ~~~~~~~~~~ -The `.git/filter-repo/commit-map` file contains a mapping of how all +The `$GIT_DIR/filter-repo/commit-map` file contains a mapping of how all commits were (or were not) changed. * A header is the first line with the text "old" and "new" @@ -397,7 +397,7 @@ commits were (or were not) changed. Reference map ~~~~~~~~~~~~~ -The `.git/filter-repo/ref-map` file contains a mapping of which local +The `$GIT_DIR/filter-repo/ref-map` file contains a mapping of which local references were changed. * A header is the first line with the text "old", "new" and "ref" From 7dc3f3839bc4b89aef771d240d7320c6be98afd2 Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Mon, 21 Oct 2024 09:24:40 -0700 Subject: [PATCH 05/10] filter-repo: make debug ref deletion output easier to scan through Signed-off-by: Elijah Newren --- git-filter-repo | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/git-filter-repo b/git-filter-repo index b3f3fec6..1788560a 100755 --- a/git-filter-repo +++ b/git-filter-repo @@ -3909,7 +3909,7 @@ class RepoFilter(object): refs_to_nuke = set() if refs_to_nuke and self._args.debug: print("[DEBUG] Deleting the following refs:\n "+ - decode(b"\n ".join(refs_to_nuke))) + decode(b"\n ".join(sorted(refs_to_nuke)))) p.stdin.write(b''.join([b"delete %s\n" % x for x in refs_to_nuke])) From 981f80ab8bcb9ecda755ec57aa337cfdb6a2a3f9 Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Thu, 3 Oct 2024 10:50:41 -0700 Subject: [PATCH 06/10] filter-repo: add some clarifying comments about the purpose of the _IDs class Signed-off-by: Elijah Newren --- git-filter-repo | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/git-filter-repo b/git-filter-repo index 1788560a..115000e2 100755 --- a/git-filter-repo +++ b/git-filter-repo @@ -354,17 +354,24 @@ class ProgressWriter(object): class _IDs(object): """ A class that maintains the 'name domain' of all the 'marks' (short int - id for a blob/commit git object). The reason this mechanism is necessary - is because the text of fast-export may refer to an object using a different - mark than the mark that was assigned to that object using IDS.new(). This - class allows you to translate the fast-export marks (old) to the marks - assigned from IDS.new() (new). - - Note that there are two reasons why the marks may differ: (1) The - user manually creates Blob or Commit objects (for insertion into the - stream) (2) We're reading the data from two different repositories - and trying to combine the data (git fast-export will number ids from - 1...n, and having two 1's, two 2's, two 3's, causes issues). + id for a blob/commit git object). There are two reasons this mechanism + is necessary: + (1) the output text of fast-export may refer to an object using a different + mark than the mark that was assigned to that object using IDS.new(). + (This class allows you to translate the fast-export marks, "old" to + the marks assigned from IDS.new(), "new"). + (2) when we prune a commit, its "old" id becomes invalid. Any commits + which had that commit as a parent needs to use the nearest unpruned + ancestor as its parent instead. + + Note that for purpose (1) above, this typically comes about because the user + manually creates Blob or Commit objects (for insertion into the stream). + It could also come about if we attempt to read the data from two different + repositories and trying to combine the data (git fast-export will number ids + from 1...n, and having two 1's, two 2's, two 3's, causes issues; granted, we + this scheme doesn't handle the two streams perfectly either, but if the first + fast export stream is entirely processed and handled before the second stream + is started, this mechanism may be sufficient to handle it). """ def __init__(self): From 1060c0a6b5e4f9a13e066456b5c655286427adcb Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Thu, 3 Oct 2024 10:51:10 -0700 Subject: [PATCH 07/10] filter-repo: fix gnarly bug in _IDs.record_rename When a commit is pruned, we don't output it to the fast-import stream. However, that means the ID of that commit is non-existent, and any commit which had the pruned commit as a parent needs to instead be rewritten to have the first unpruned ancestor of the pruned commit as a parent. The _IDs class helps with this. The _IDs class also helps us when random objects are inserted into the stream, meaning we need to increment the ID of every object in the stream. The `lint-history` script in contrib is a good example that does this. Now, combining the two causes a corner-case problem because of the following: * When we create a commit, we immediately record a rename if the old_id and new id do not match. * If we prune a commit, we record a rename so that commits with it as a parent instead get its most recent unpruned ancestor The problem happens as follows: * Say we have a stream with commit IDs 1, 2, 3, & 4. * The program inserts a blob between each commit, so these commit IDs have to be remapped to 1, 3, 5, & 7. * We prune the third commit. In such a case, when we create the third commit, we would immediately record a remapping of 3->5. Then, when we later discover after callbacks that the commit should be pruned, we call record_rename to update the mapping to instead be 3->3. The old code would notice our little optimization that 3==3 and figure that no rename needs to be recorded. That would leave the fourth commit trying to use "5" as its ancestor, which wouldn't exist. So, while we like avoiding recording renames in general when old_id == new_id, if we already have a rename for old_id, then we absolutely need to update it. Signed-off-by: Elijah Newren --- git-filter-repo | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/git-filter-repo b/git-filter-repo index 115000e2..0fba166f 100755 --- a/git-filter-repo +++ b/git-filter-repo @@ -406,7 +406,7 @@ class _IDs(object): """ Record that old_id is being renamed to new_id. """ - if old_id != new_id: + if old_id != new_id or old_id in self._translation: # old_id -> new_id self._translation[old_id] = new_id From 4fda88e52ae2fb5f7b44ee5a156798e95548f768 Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Mon, 9 Sep 2024 14:54:43 -0700 Subject: [PATCH 08/10] filter-repo: fix bug in _IDs.__str__() If some commits are pruned away, then _IDs._translation will record the mapping from original id -> None This means that _IDs._reverse_translation, needs to have a mapping from None -> [] and thus that at least one key of _IDs._reverse_translation is not an integer. Trying to sort the keys of _IDs._reverse_translation thus makes no sense. Let's print all the numeric entries first, then put None at the end. Signed-off-by: Elijah Newren --- git-filter-repo | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/git-filter-repo b/git-filter-repo index 0fba166f..22ebe681 100755 --- a/git-filter-repo +++ b/git-filter-repo @@ -441,7 +441,12 @@ class _IDs(object): rv += " %d -> %s\n" % (k, self._translation[k]) rv += "Reverse translation:\n" - for k in sorted(self._reverse_translation): + reverse_keys = list(self._reverse_translation.keys()) + if None in reverse_keys: # pragma: no cover + reverse_keys.remove(None) + reverse_keys = sorted(reverse_keys) + reverse_keys.append(None) + for k in reverse_keys: rv += " " + str(k) + " -> " + str(self._reverse_translation[k]) + "\n" return rv From 8bd88034bc8ac6f43cc047ea40cc351e0b83a8be Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Thu, 3 Oct 2024 16:34:32 -0700 Subject: [PATCH 09/10] filter-repo: fix some unfortunate local variable names Referring to hashes with the names of new_id and orig_id are horribly misleading since everywhere else "id" means fast export ids, i.e. integers. Rename these local variables to new_hash and orig_hash. Signed-off-by: Elijah Newren --- git-filter-repo | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/git-filter-repo b/git-filter-repo index 22ebe681..7d960f05 100755 --- a/git-filter-repo +++ b/git-filter-repo @@ -3153,10 +3153,10 @@ class RepoFilter(object): return fi_input, fi_output = self._import_pipes while self._pending_renames: - orig_id, ignore = self._pending_renames.popitem(last=False) - new_id = fi_output.readline().rstrip() - self._commit_renames[orig_id] = new_id - if old_hash == orig_id: + orig_hash, ignore = self._pending_renames.popitem(last=False) + new_hash = fi_output.readline().rstrip() + self._commit_renames[orig_hash] = new_hash + if old_hash == orig_hash: return if limit and len(self._pending_renames) < limit: return From 8a243ae8adb02223f74fcc082c84c76d66853040 Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Thu, 3 Oct 2024 12:34:35 -0700 Subject: [PATCH 10/10] filter-repo: only conditionally heed the already_ran file When users run filter-repo, we first check if they already did a previous run. If they recently ran filter-repo previously, then they are probably just trying to do their history rewrite in multiple steps, so we bypass the fresh clone check for these subsequent runs. However, it is possible that the user finishes their rewrite, continues as normal for months or years, and then decides to do another rewrite. It may be that there is still a $GIT_DIR/filter-repo/already_ran file present from the repository filtering from long ago, and we do not want the presence of that file to result in the fresh clone check being bypassed. Implement a middle ground -- if the already_ran file is older than a day, ask the user whether to consider the current run a continuation of an existing rewrite or if it should be considered a completely new rewrite. Add documentation explaining this as well. Signed-off-by: Elijah Newren --- Documentation/git-filter-repo.txt | 27 +++++++++ git-filter-repo | 16 +++++- t/t9393-rerun.sh | 79 +++++++++++++++++++++++++ t/t9393/simple | 96 +++++++++++++++++++++++++++++++ 4 files changed, 217 insertions(+), 1 deletion(-) create mode 100755 t/t9393-rerun.sh create mode 100644 t/t9393/simple diff --git a/Documentation/git-filter-repo.txt b/Documentation/git-filter-repo.txt index 7413dda6..fec5e013 100644 --- a/Documentation/git-filter-repo.txt +++ b/Documentation/git-filter-repo.txt @@ -405,6 +405,33 @@ references were changed. * An all-zeros hash, or null SHA, represents a non-existent object. When in the "new" column, this means the ref was removed entirely. +Already Ran +~~~~~~~~~~~ + +The `$GIT_DIR/filter-repo/already_ran` file contains a file recording that +git-filter-repo has been run. When this file is present, future runs will +be treated as an extension of the previous filtering operation. + +Concretely, this means: + * The "Fresh Clone" check is bypassed + + This is done because past runs would cause the repository to no longer + look like a fresh clone, and thus fail the fresh clone check, but doing + filtering via multiple invocations of git-filter-repo is an intended + and support usecase. You already passed or bypassed the "Fresh Clone" + check on your initial run. + +However, if the already_ran file exists but is older than 1 day when they +invoke git-filter-repo, the user will be prompted for whether the new run +should be considered a continuation of the previous run. If they do not +answer in the affirmative, then the above bullet will not apply. +This prompt exists because users might do a history rewrite in a repository, +forget about it and leave the $GIT_DIR/filter-repo directory around, and +then some months or years later need to do another rewrite. If commits +have been made public and shared from the previous rewrite, then the next +filter-repo run should not be considered a continuation of the previous +filtering run. + [[FRESHCLONE]] FRESH CLONE SAFETY CHECK AND --FORCE ------------------------------------ diff --git a/git-filter-repo b/git-filter-repo index 7d960f05..f9e72b29 100755 --- a/git-filter-repo +++ b/git-filter-repo @@ -2942,7 +2942,21 @@ class RepoFilter(object): # Determine if this is second or later run of filter-repo tmp_dir = self.results_tmp_dir(create_if_missing=False) - already_ran = os.path.isfile(os.path.join(tmp_dir, b'already_ran')) + ran_path = os.path.join(tmp_dir, b'already_ran') + already_ran = os.path.isfile(ran_path) + if already_ran: + current_time = time.time() + file_mod_time = os.path.getmtime(ran_path) + file_age = current_time - file_mod_time + if file_age > 86400: # file older than a day + msg = (f"The previous run is older than a day ({decode(ran_path)} already exists).\n" + f"See \"Already Ran\" section in the manual for more information.\n" + f"Treat this run as a continuation of filtering in the previous run (Y/N)? ") + response = input(msg) + + if response.lower() != 'y': + os.remove(ran_path) + already_ran = False # Default for --replace-refs if not self._args.replace_refs: diff --git a/t/t9393-rerun.sh b/t/t9393-rerun.sh new file mode 100755 index 00000000..1bc88dce --- /dev/null +++ b/t/t9393-rerun.sh @@ -0,0 +1,79 @@ +#!/bin/bash + +test_description='filter-repo tests with reruns' + +. ./test-lib.sh + +export PATH=$(dirname $TEST_DIRECTORY):$PATH # Put git-filter-repo in PATH + +DATA="$TEST_DIRECTORY/t9393" +DELETED_SHA="0000000000000000000000000000000000000000" # FIXME: sha256 support + +test_expect_success 'a re-run that is treated as a clean slate' ' + test_create_repo clean_slate_rerun && + ( + cd clean_slate_rerun && + git fast-import --quiet <$DATA/simple && + + FIRST_ORPHAN=$(git rev-parse orphan-me~1) && + FINAL_ORPHAN=$(git rev-parse orphan-me) && + FILE_A_CHANGE=$(git rev-list -1 HEAD -- fileA) && + FILE_B_CHANGE=$(git rev-list -1 HEAD -- fileB) && + FILE_C_CHANGE=$(git rev-list -1 HEAD -- fileC) && + FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) && + ORIGINAL_TAG=$(git rev-parse v1.0) && + + git filter-repo --invert-paths --path fileB --force && + NEW_FILE_C_CHANGE=$(git rev-list -1 HEAD -- fileC) && + NEW_FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) && + FINAL_TAG=$(git rev-parse v1.0) && + + cat <<-EOF | sort >sha-expect && + ${FIRST_ORPHAN} ${FIRST_ORPHAN} + ${FINAL_ORPHAN} ${FINAL_ORPHAN} + ${FILE_A_CHANGE} ${FILE_A_CHANGE} + ${FILE_B_CHANGE} ${DELETED_SHA} + ${FILE_C_CHANGE} ${NEW_FILE_C_CHANGE} + ${FILE_D_CHANGE} ${NEW_FILE_D_CHANGE} + EOF + printf "%-40s %s\n" old new >expect && + cat sha-expect >>expect && + test_cmp <(sort expect) <(sort .git/filter-repo/commit-map) && + + cat <<-EOF | sort -k 3 >sha-expect && + ${FILE_D_CHANGE} ${NEW_FILE_D_CHANGE} $(git symbolic-ref HEAD) + ${FINAL_ORPHAN} ${FINAL_ORPHAN} refs/heads/orphan-me + ${ORIGINAL_TAG} ${FINAL_TAG} refs/tags/v1.0 + EOF + printf "%-40s %-40s %s\n" old new ref >expect && + cat sha-expect >>expect && + test_cmp expect .git/filter-repo/ref-map && + + touch -t 197001010000 .git/filter-repo/already_ran && + echo no | git filter-repo --invert-paths --path fileC --force && + FINAL_FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) && + REALLY_FINAL_TAG=$(git rev-parse v1.0) && + + cat <<-EOF | sort >sha-expect && + ${FIRST_ORPHAN} ${FIRST_ORPHAN} + ${FINAL_ORPHAN} ${FINAL_ORPHAN} + ${FILE_A_CHANGE} ${FILE_A_CHANGE} + ${NEW_FILE_C_CHANGE} ${DELETED_SHA} + ${NEW_FILE_D_CHANGE} ${FINAL_FILE_D_CHANGE} + EOF + printf "%-40s %s\n" old new >expect && + cat sha-expect >>expect && + test_cmp <(sort expect) <(sort .git/filter-repo/commit-map) && + + cat <<-EOF | sort -k 3 >sha-expect && + ${NEW_FILE_D_CHANGE} ${FINAL_FILE_D_CHANGE} $(git symbolic-ref HEAD) + ${FINAL_ORPHAN} ${FINAL_ORPHAN} refs/heads/orphan-me + ${FINAL_TAG} ${REALLY_FINAL_TAG} refs/tags/v1.0 + EOF + printf "%-40s %-40s %s\n" old new ref >expect && + cat sha-expect >>expect && + test_cmp expect .git/filter-repo/ref-map + ) +' + +test_done diff --git a/t/t9393/simple b/t/t9393/simple new file mode 100644 index 00000000..3206e02c --- /dev/null +++ b/t/t9393/simple @@ -0,0 +1,96 @@ +feature done +# Simple repo with a few files, and two branches with no common history. +# Note that the original-oid directives are very fake, but make it easy to +# track things. +blob +mark :1 +original-oid 0000000000000000000000000000000000000001 +data 16 +file 1 contents + +blob +mark :2 +original-oid 0000000000000000000000000000000000000002 +data 16 +file 2 contents + +blob +mark :3 +original-oid 0000000000000000000000000000000000000003 +data 16 +file 3 contents + +blob +mark :4 +original-oid 0000000000000000000000000000000000000004 +data 16 +file 4 contents + +reset refs/heads/orphan-me +commit refs/heads/orphan-me +mark :5 +original-oid 0000000000000000000000000000000000000009 +author Little O. Me 1535228562 -0700 +committer Little O. Me 1535228562 -0700 +data 8 +Initial +M 100644 :1 nuke-me + +commit refs/heads/orphan-me +mark :6 +original-oid 000000000000000000000000000000000000000A +author Little 'ol Me 1535229544 -0700 +committer Little 'ol Me 1535229544 -0700 +data 9 +Tweak it +from :5 +M 100644 :4 nuke-me + +reset refs/heads/master +commit refs/heads/master +mark :7 +original-oid 000000000000000000000000000000000000000B +author Little O. Me 1535229523 -0700 +committer Little O. Me 1535229523 -0700 +data 15 +Initial commit +M 100644 :1 fileA + +commit refs/heads/master +mark :8 +original-oid 000000000000000000000000000000000000000C +author Lit.e Me 1535229559 -0700 +committer Lit.e Me 1535229580 -0700 +data 10 +Add fileB +from :7 +M 100644 :2 fileB + +commit refs/heads/master +mark :9 +original-oid 000000000000000000000000000000000000000D +author Little Me 1535229601 -0700 +committer Little Me 1535229601 -0700 +data 10 +Add fileC +from :8 +M 100644 :3 fileC + +commit refs/heads/master +mark :10 +original-oid 000000000000000000000000000000000000000E +author Little Me 1535229618 -0700 +committer Little Me 1535229618 -0700 +data 10 +Add fileD +from :9 +M 100644 :4 fileD + +tag v1.0 +from :10 +original-oid 000000000000000000000000000000000000000F +tagger Little John 1535229637 -0700 +data 5 +v1.0 + +done