Defer resilver only when progress is above a threshold #15810

Merged
1 commit merged into openzfs:master from scan-restart-thresh on Oct 4, 2024

Conversation

@snajpa (Contributor) commented Jan 23, 2024

Motivation and Context

#14505
#15249
#15739

Description

Defer resilver only when progress is above a threshold

Restart a resilver from scratch if the one already running has progressed less than a new tunable, zfs_resilver_defer_percent (defaulting to 10%).

The original rationale for deferring additional resilvers while one is already in progress was to help achieve data redundancy sooner for the data that gets scanned toward the end of the resilver.

But when an admin attaches multiple disks to a single vdev, it isn't immediately obvious that they are expected to run zpool resilver afterwards to reset the deferred resilvers and start a new one from scratch.
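
To make the intended behaviour concrete, below is a minimal sketch of the defer-versus-restart decision. It is not the merged dsl_scan.c code: only the zfs_resilver_defer_percent tunable comes from this PR, while the helper name, its arguments, and the progress calculation are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tunable introduced by this PR (a module parameter in the real code). */
static unsigned int zfs_resilver_defer_percent = 10;

/*
 * Illustrative helper, not the merged dsl_scan.c implementation: decide
 * whether a newly requested resilver should be deferred behind the one
 * already running, based on how far along the current scan is.
 */
static bool
resilver_should_defer(uint64_t bytes_examined, uint64_t bytes_to_examine)
{
	/* No work queued yet, so there is no useful progress figure. */
	if (bytes_to_examine == 0)
		return (false);

	uint64_t pct = bytes_examined * 100 / bytes_to_examine;

	/*
	 * Defer only once the in-progress resilver has covered at least
	 * zfs_resilver_defer_percent of the data; below that threshold,
	 * restarting from scratch loses little work and picks up the
	 * newly attached disk immediately.
	 */
	return (pct >= zfs_resilver_defer_percent);
}
```

Regardless of the threshold, zpool resilver can still be run manually to drop any deferred resilvers and start a fresh one.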

How Has This Been Tested?

Tested manually on a ZVOL-backed, single-disk, single-vdev pool, attaching additional ZVOLs consecutively at various points of resilver progress to verify that the resilver is in fact restarted when progress is below zfs_resilver_defer_percent.
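
For context on what "progress" means in the test above, the scan statistics exposed to userspace (pool_scan_stat_t in include/sys/fs/zfs.h, with the pss_examined and pss_to_examine fields) allow the percentage to be derived roughly as sketched below; the helper is only an illustration and not part of this PR.

```c
#include <stdint.h>
#include <sys/fs/zfs.h>	/* pool_scan_stat_t, assuming ZFS development headers */

/*
 * Illustration only, not code from this PR: the percentage of data already
 * examined by the in-progress scan, i.e. the figure compared against
 * zfs_resilver_defer_percent when deciding whether to defer or restart.
 */
static uint64_t
scan_progress_percent(const pool_scan_stat_t *ps)
{
	if (ps->pss_to_examine == 0)
		return (0);
	return (ps->pss_examined * 100 / ps->pss_to_examine);
}
```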

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@bwatkinson (Contributor) left a comment

The idea is interesting; however, there seem to be issues showing up in the ZTS tests with this commit.

(Several follow-up comments from @snajpa were marked as outdated.)

@snajpa (Contributor, Author) commented Jan 24, 2024

I wonder what the difference between the test setup and my dev setup here is... I get all the zpool_events tests passing just fine (debian/sid, vanilla 6.6.13):

snajpa@snajpadev:~/zfs (scan-restart-thresh)$ ./scripts/zfs-tests.sh -x -T zpool_events
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/setup (run as root) [00:01] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_clear (run as root) [00:02] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_cliargs (run as root) [00:26] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_follow (run as root) [00:02] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_poolname (run as root) [00:04] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_errors (run as root) [00:10] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_duplicates (run as root) [00:02] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_clear_retained (run as root) [00:01] [PASS]
Test: /home/snajpa/zfs/tests/zfs-tests/tests/functional/cli_root/zpool_events/cleanup (run as root) [00:00] [PASS]

Results Summary
PASS       9

Running Time:   00:00:52
Percent passed: 100.0%
Log directory:  /var/tmp/test_results/20240124T175749

Tests with results other than PASS that are expected:

Tests with result of PASS that are unexpected:

Tests with results other than PASS that are unexpected:

@bwatkinson (Contributor) replied, quoting @snajpa's comment and test output above:

Looking at the failure on the CentOS 8 build, it might be worth running the zpool_resilver test tag to see if that helps trace the issue down.

@snajpa (Contributor, Author) commented Jan 24, 2024

I missed that, thank you :) Meanwhile, my local test run got stuck right there too (it didn't hit the send/recv problem this time).

@snajpa marked this pull request as draft on January 24, 2024 at 22:26.
@behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label on Jan 26, 2024.
@snajpa force-pushed the scan-restart-thresh branch from da4d202 to 129988f on February 5, 2024 at 09:45.
@snajpa force-pushed the scan-restart-thresh branch from 129988f to 26818ad on February 18, 2024 at 17:28.
@snajpa force-pushed the scan-restart-thresh branch from 26818ad to 2869830 on September 25, 2024 at 21:02.
@behlendorf (Contributor) left a comment

@snajpa this looked pretty close to me, just a couple comments inline. You can drop the "Draft" status when you're happy with it.

Three inline review threads on module/zfs/dsl_scan.c (resolved; two marked outdated).
Commit message:

Restart a resilver from scratch if the one already running has
progressed less than a new tunable, zfs_resilver_defer_percent
(defaulting to 10%).

The original rationale for deferring additional resilvers while one is
already in progress was to help achieve data redundancy sooner for the
data that gets scanned toward the end of the resilver.

But when an admin attaches multiple disks to a single vdev, it isn't
immediately obvious that they are expected to run `zpool resilver`
afterwards to reset the deferred resilvers and start a new one from
scratch.

Signed-off-by: Pavel Snajdr <[email protected]>
@snajpa force-pushed the scan-restart-thresh branch from 2869830 to 6b99e43 on October 3, 2024 at 08:14.
@snajpa marked this pull request as ready for review on October 3, 2024 at 08:22.
Another inline review thread on module/zfs/dsl_scan.c (resolved).
@behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label on Oct 3, 2024.
@behlendorf merged commit 0d77e73 into openzfs:master on Oct 4, 2024 (20 of 21 checks passed).
@amotin (Member) commented Oct 31, 2024

@snajpa Could you take a look at the resilver_restart_001 test? It seems to be unstable in the newly added case: https://github.com/openzfs/zfs/actions/runs/11614641502/job/32343310086?pr=16694

@snajpa (Contributor, Author) commented Nov 2, 2024

@amotin does it only fail on FreeBSD? I can't get it to fail on Linux 6.11; I guess it's time to install FreeBSD? \o/
