retry bulk rm #608
Conversation
This looks good - I tried it on my example in #558 and added some additional logging:
- only a few random files end up in `remaining`
- all the errors I'm seeing are 429
- the failed files all go through on first retry, and the overall request then succeeds
Thanks!
I did some cleanup here, if people wouldn't mind trying again.
Thanks @martindurant! I'll try reproducing tomorrow.
On the latest commit I'm getting failures like this now:
When I use the previous commit, today I am also getting 503 instead of 429, but the failed files are correctly filtered out and retried.
I made a small change to batch the leftovers, if any, instead of repeating for each batch. I don't see how a 503 is escaping, unless you are running out of retries (could add logging here):

```python
if code in [200, 204]:
    out.append(path)
elif code in errs and i < 5:
    remaining.append(path)
else:
    ...
```

which should mean a 503 either gets the path back into `remaining` or hits the final `else`.
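For context, a minimal sketch of the retry pattern under discussion: paths that fail with a retriable status are collected into one leftover batch and re-submitted together, rather than re-running each original batch. The callable `delete_batch`, the `RETRIABLE` set, and the retry/backoff numbers are illustrative assumptions, not the actual gcsfs code:

```python
import asyncio


async def bulk_delete_with_retry(delete_batch, paths, retries=5):
    """Delete paths, retrying retriable failures as one batched leftover list.

    `delete_batch` is assumed to be an async callable that returns one
    (path, status_code) pair per submitted path.
    """
    RETRIABLE = {429, 502, 503, 504}  # assumed retriable status codes
    out, remaining = [], list(paths)
    for i in range(retries):
        if not remaining:
            break
        statuses = await delete_batch(remaining)
        retry_next = []
        for path, code in statuses:
            if code in (200, 204):
                out.append(path)            # deleted successfully
            elif code in RETRIABLE and i < retries - 1:
                retry_next.append(path)     # collect for the next round
            else:
                raise OSError(f"failed to delete {path}: HTTP {code}")
        remaining = retry_next
        if remaining:
            await asyncio.sleep(2 ** i)     # simple backoff between rounds
    return out
```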
Of course I can't trip any errors now that I try again (regardless of commit). I'll try again tomorrow. When I hit the 503 earlier, I did have a print statement that should have logged if anything went into `remaining`.
Just had the wrong enumerator for the current retry state. With this change it seems to work well.
Any particular reason for dropping the `batchsize` to 20? You can apparently go as high as 2000, although I expect failures may become more likely if this is too big. Don't know if it makes any difference to speed, and this is way faster than `gsutil` already.
On batch size, Google says this:
I was thinking that with the requests concurrent, smaller batches would be better. Maybe that's wrong, since sending the requests should take about the same bandwidth regardless. We have two batch sizes here: in the outer `_rm` and the inner `_rm_files`. Should they be the same?
Co-authored-by: Sam Levang <[email protected]>
I'm not sure, but that seems ok. Then the outer `_rm` would handle the batching. Also, should a user be able to configure the `batchsize`?
Actually, it's this one: https://github.com/fsspec/filesystem_spec/blob/master/fsspec/asyn.py#L319. And you are right: if `_rm_files` is not designed to be called from outside, there's no reason it should do its own internal looping.
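As a rough sketch of that arrangement (the function bodies and default below are assumptions, not the gcsfs implementation): `_rm_files` issues one batched request for exactly the paths it is given, while the outer `_rm` chunks the full list and runs the chunks concurrently. The real code relies on fsspec's chunked-coroutine helper linked above, which also caps concurrency; plain `asyncio.gather` is used here only to keep the sketch self-contained:

```python
import asyncio

BATCH_SIZE = 100  # assumed default per-request batch size


async def _rm_files(delete_batch, paths):
    # Inner layer: a single batched delete request for the paths given,
    # with no internal looping over sub-batches.
    return await delete_batch(paths)


async def _rm(delete_batch, paths, batchsize=BATCH_SIZE):
    # Outer layer: split into chunks and run one request per chunk concurrently.
    chunks = [paths[i:i + batchsize] for i in range(0, len(paths), batchsize)]
    results = await asyncio.gather(*(_rm_files(delete_batch, c) for c in chunks))
    return [status for chunk in results for status in chunk]
```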
Co-authored-by: Sam Levang <[email protected]>
Thanks!
This updates the xee dataflow example to prevent users from accidentally deleting their storage bucket when running the example. It is a simple fix for a bug in a recent [push to gcsfs](fsspec/gcsfs#608), paired with some [logic in the zarr library for writing datasets](https://github.com/zarr-developers/zarr-python/blob/df4c25f70c8a1e2b43214d7f26e80d34df502e7e/src/zarr/v2/storage.py#L567), which together allow users to accidentally remove their bucket when writing to the root of a cloud storage bucket. This is problematic because users may have other data in the bucket they write to, and accidental deletion of the bucket removes everything. Changes in this PR include:
1. pinning the `gcsfs` version to `<=2024.2.0`, before the PR that introduced the bug
2. pointing the example to write to a subdirectory of the bucket

PiperOrigin-RevId: 656046609
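As an illustration of the workaround described above (the bucket and dataset names are made up; the version pin matches the one in this change), the example writes the Zarr store to a prefix inside the bucket rather than the bucket root, so overwrite/cleanup logic can only touch objects under that prefix:

```python
# Assumes: pip install "gcsfs<=2024.2.0" xarray zarr
import xarray as xr

ds = xr.Dataset({"temperature": ("time", [280.1, 281.4, 279.9])})

# Risky: ds.to_zarr("gs://my-bucket", mode="w") targets the bucket root.
# Safer: write to a subdirectory of the bucket instead.
ds.to_zarr("gs://my-bucket/examples/output.zarr", mode="w")
```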
Fixes #558