Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement parallel ItemBlock processing via backup_controller goroutines #8659

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sseago
Copy link
Collaborator

@sseago sseago commented Jan 29, 2025

Thank you for contributing to Velero!

Please add a summary of your change

Parallel ItemBlock processing via backup_controller goroutines as described in phase 2 of https://github.com/vmware-tanzu/velero/blob/main/design/backup-performance-improvements.md

Does your change fix a particular issue?

Fixes #8334

Please indicate you've done the following:

@sseago sseago marked this pull request as draft January 29, 2025 21:22
@sseago
Copy link
Collaborator Author

sseago commented Jan 29, 2025

Currently in draft, as I have not yet tested the changes in a cluster env.

Copy link

codecov bot commented Jan 29, 2025

Codecov Report

Attention: Patch coverage is 93.02326% with 12 lines in your changes missing coverage. Please review.

Project coverage is 59.51%. Comparing base (223e1fc) to head (097941e).
Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
pkg/backup/backup.go 90.43% 9 Missing and 2 partials ⚠️
pkg/controller/backup_controller.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8659      +/-   ##
==========================================
+ Coverage   59.17%   59.51%   +0.34%     
==========================================
  Files         370      371       +1     
  Lines       39812    40127     +315     
==========================================
+ Hits        23558    23881     +323     
  Misses      14753    14753              
+ Partials     1501     1493       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@shawn-hurley shawn-hurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have just one concern otherwise nothing stands out!

pkg/backup/backup.go Show resolved Hide resolved
Copy link
Member

@kaovilai kaovilai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some typos. other than that no objections atm.

Note: I am not as familiar with select-case channel statements.

pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup.go Show resolved Hide resolved
pkg/backup/backup.go Outdated Show resolved Hide resolved
@sseago sseago force-pushed the parallel-itemblocks branch from 5cee460 to d08bb75 Compare February 5, 2025 00:36
@sseago
Copy link
Collaborator Author

sseago commented Feb 5, 2025

Updated in response to PR comments. Will test in cluster tomorrow.

@sseago sseago force-pushed the parallel-itemblocks branch from d08bb75 to dad8770 Compare February 5, 2025 17:44
@sseago sseago force-pushed the parallel-itemblocks branch from dad8770 to 097941e Compare February 6, 2025 00:29
@sseago sseago marked this pull request as ready for review February 6, 2025 00:30
@sseago
Copy link
Collaborator Author

sseago commented Feb 6, 2025

I've tested this in an aws cluster. Two namespaces, 4 pods, 2 volumes, and additional related resources. Using vmware-tanzu/velero-plugin-example#75 to force each item backup to take 5 seconds allowed me to verify parallel processing. Starting with 1 configured worker, up to 16 -- each time dropped the total backup time, and I was able to verify from the logs that the 5-second duration BIA actions were happening in parallel to each other in the expected number.

Summary of backup time/results is below. Note that the differing item count on subsequent backups was mainly due to the inclusion of events. If those had been excluded, the time drop from one backup to the next would be more significant. Also note that with a regular real-world backup of this size, the time differences would be much smaller, as I've introduced an artificial time duration per item here.

The real-world use case where parallel backup will be most pronounced would be a backup with a large number of small PVCs using CSI snapshots or datamover.

mysql-persistent in two namespaces. 5 second BIA delay per item

1 worker: mysql-1worker-01
Started:    2025-02-05 15:08:57 +0000 UTC
Completed:  2025-02-05 15:15:58 +0000 UTC
Elapsed: 07:01
Items backed up:              77

2 workers: mmysql-2worker-01
Started:    2025-02-05 16:04:21 +0000 UTC
Completed:  2025-02-05 16:08:32 +0000 UTC
Elapsed:  04:01
Items backed up:              83

4 workers: mmysql-4worker-01
Started:    2025-02-05 16:33:26 +0000 UTC
Completed:  2025-02-05 16:36:01 +0000 UTC
Elapsed: 02:35
Items backed up:              89

8 workers: mmysql-8worker-01
Started:    2025-02-05 16:48:15 +0000 UTC
Completed:  2025-02-05 16:50:01 +0000 UTC
Elapsed: 01:46
Total items to be backed up:  96

16 workers: mmysql-16worker-01
Started:    2025-02-05 16:58:00 +0000 UTC
Completed:  2025-02-05 16:59:37 +0000 UTC
Elapsed:  01:37
Items backed up:              102

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Phase 2 implementation for backup performance improvements
4 participants