
[Feature Request] Beesd to run a full dedup cycle and then end #279

Open

JaviVilarroig opened this issue Mar 30, 2024 · 9 comments

@JaviVilarroig
Contributor

I have a script that creates a backup on an external btrfs hard drive.

Once the backup is done, I would like the option to run a dedup cycle to remove data redundancy and then unmount the volume.

Currently I must manually watch the journal and wait for the dedup run to end before unmounting the volume.

Thanks!

@kakra
Contributor

kakra commented Mar 30, 2024

Just let it run for 1-2 hours. bees is best-effort; it has no concept of a full cycle, because while it modifies the filesystem it adds new transactions and thus generates new work for itself.

Then inspect your journals to see whether 1-2 hours is enough. If bees falls behind, you'd need to increase the time.

See man timeout.
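A minimal sketch of that timeout approach (the UUID, mount point, and the idea of wrapping it in a helper function are my own, not from bees itself):

```shell
# A thin wrapper around timeout(1): returns the command's own exit status,
# or 124 if the time budget ran out (GNU coreutils default behaviour).
run_bounded() {
    budget="$1"; shift
    timeout "$budget" "$@"
}

# intended use in a backup script (UUID and mount point are placeholders):
#   run_bounded 2h beesd "$UUID"
#   umount /mnt/backup
```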

@JaviVilarroig
Contributor Author

In this use case, we are talking about a filesystem that only receives the backup and otherwise has no activity. So bees ends its cycle and all the crawlers finish.

No more transactions, no more activity.

I can see that in the logs, but I want the backup script to unmount the filesystem automatically.

I know it's a niche use case, but I think it's legitimate :)

@kakra
Contributor

kakra commented Mar 31, 2024

In this use case, we are talking about a filesystem that only receives the backup and otherwise has no activity.

In this case, it may actually work.

How do you watch the logs? Maybe we could implement something like this in the beesd wrapper?

@JaviVilarroig
Contributor Author

I just watch journalctl until I get something similar to this:

mar 31 15:51:14 gondor beesd[50771]: crawl_more[50806]: crawl_more ran out of data after 0.326411s
mar 31 15:51:15 gondor beesd[50771]: crawl_writeback[50815]: Saved crawl state in 0.256s

After that there's no more activity; that's why I think this must be possible.

I think that adding a flag to the bees binary that makes it quit when it runs out of data would do the trick.

I can help with writing a new launcher script or updating the existing one, if needed.
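Pending such a flag, the journal-watching step itself can be scripted along these lines (a rough sketch; the unit name, mount point, and helper name are assumptions about how bees is launched, not something bees provides):

```shell
# Return as soon as the crawler's "ran out of data" marker appears on stdin.
wait_for_idle() {
    grep -q 'ran out of data'
}

# intended use (unit name and mount point are placeholders):
#   journalctl -f -n0 -u "beesd@$UUID.service" | wait_for_idle
#   umount /mnt/backup
```

Note this only proves one crawler ran dry once, which is the weakness discussed below.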

@Zygo
Owner

Zygo commented Mar 31, 2024

The condition is something like hitting the "ran out of data" condition twice in a row on every crawler, without encountering any new extents in between in any crawler. The trick is that a new extent almost inevitably appears as a result of bees's own activity, so the condition is never met. Maybe something like "fewer than N extents" where N is a command-line option would work.

In BeesRoots::crawl_thread we could add a check to see if any root exists with m_finished false after the first time the crawl_more task runs. If there is, keep looping; otherwise, send SIGTERM to pid 0 to trigger the termination code. Technically that will terminate too early, but if you're going to run bees again the following day, then all the second-pass work will be done the next day, with yesterday's bees's pass 2 data mixed with today's bees's pass 1. There can also be false positives as technically there is nothing synchronizing the state of the crawlers (i.e. a crawler could restart and find new extents while you're iterating over the crawler list).

Somewhere in the issues here, there's a clever script that measures the amount of IO that bees does, and if that drops to zero reads for a few seconds, it terminates the bees process.
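The IO-watching idea could be sketched like this on Linux (names, thresholds, and structure are my own, not the script referenced above): /proc/<pid>/io exposes a cumulative read_bytes counter, so a value that stops changing between polls means no read IO in that interval. Reading another process's io file requires the same uid or root.

```shell
# Cumulative bytes read by a process, from /proc/<pid>/io.
read_bytes() {
    awk '/^read_bytes:/ {print $2}' "/proc/$1/io"
}

# wait_until_idle PID POLLS INTERVAL
# Returns once read_bytes has been unchanged for POLLS consecutive polls,
# or nonzero if the process disappears.
wait_until_idle() {
    pid=$1 need=$2 interval=$3 idle=0 last=-1
    while [ "$idle" -lt "$need" ]; do
        cur=$(read_bytes "$pid") || return 1
        if [ "$cur" = "$last" ]; then idle=$((idle + 1)); else idle=0; fi
        last=$cur
        sleep "$interval"
    done
}

# intended use (process name is an assumption):
#   wait_until_idle "$(pidof bees)" 5 2 && kill "$(pidof bees)"
```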

@JaviVilarroig
Contributor Author

Hummm. I can try the script measuring IO activity.

Thanks for the idea.

@mischaelschill

mischaelschill commented Dec 10, 2024

Could this be done for active filesystems too? Like recording the maximum transid when it starts, and stopping when it reaches this transid?

@kakra
Contributor

kakra commented Dec 10, 2024

The v0.11 RC has a progress indicator, you could read that from the status file in /run/bees.

@Zygo
Owner

Zygo commented Dec 10, 2024

Could this be done for active filesystems too? Like recording the maximum transid when it starts, and stopping when it reaches this transid?

That idea doesn't work on the subvol scanners, since a full dedupe cycle is multiple passes (and many, many transids) for those. You'd get "one pass of dedupe"--which is probably plenty, but each pass leaves a little bit of work behind to be completed during the following pass.

But! It does work with the extent scanner, because every pass is a complete, never-look-back dedupe cycle for that scan mode. Once the min_transid hits the startup transid, the scanner can be paused, and when they're all paused, bees exits.
