Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter-repo: parse blobs after commits #609

Closed
wants to merge 1 commit into from

Conversation

vapaamies
Copy link
Contributor

@vapaamies vapaamies commented Oct 25, 2024

Performance gain on blob callbacks, now they applied only to filtered commits.

Performance gain on blob callbacks, now they applied only to filtered commits.

Signed-off-by: Vladislav Javadov <[email protected]>
@newren
Copy link
Owner

newren commented Oct 25, 2024

Performance gain on blob callbacks, now they applied only to filtered commits.

You didn't test this to verify the performance improvements you claimed, did you?

This doesn't change the order that blobs & commits are emitted by fast-export, it merely changes the order in which we check whether the current line of output from fast-export is a blob/tag/commit/etc. But changing the order in which we check what the current line represents does not change what the current line represents. Since fast-export will always output blobs before the commits that use them, and we parse the information in the order it appears in the stream, we continue parsing blobs before commits with your patch.

Further, the idea of parsing blobs after commits fundamentally makes little sense. Commits have to refer to the blobs used in the commit, otherwise they cannot be written. There's no way for us to have blobs listed later and handle it, unless we completely rearchitected the system to do some kind of deferred resolution. That'd be a pretty big change to the code, likely increase memory pressure and computation overhead, and likely cost more than what you're attempting to save. Especially since...

Even if we created some kind of deferred resolution mechanism so we could parse commits before blobs, doing that would have little performance benefit, because it still forces fast-export to process the blobs we'll end up not using (they'll still be inflated and written to the stream, and parsed by us), and we'll still have to parse it, and we'll likely still write it to the fast-import stream (it just won't be used since the commit will filter out the path), and fast-import will still need to deflate it. I suspect that unless you have an expensive blob callback, its operation is probably in the noise compared to these other steps that you haven't removed.

If you want a way to avoid filtering blobs that are irrelevant (e.g. because you are also using --path directives to trim your repository down), then you have a couple options:

  • Do the filtering in multiple steps. In the first call to git filter-repo, pass the arguments that trim the repository to the relevant subset. In the second, use the blob callback to do whatever modifications you need.
  • Use the new --file-info-callback I'll be providing soon instead of using --blob-callback. That one doesn't make fast-export send all the blobs, but instead looks them up on demand and inserts modifications of those blobs into the stream. Doesn't help you today but should be available "soon".

@newren newren closed this Oct 25, 2024
@vapaamies vapaamies deleted the blobs-after branch October 25, 2024 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants