Add support for Cloudflare's R2 storage #888
Conversation
I'll fix the whitespace reported by pre-commit. @martindurant, do you know what causes "Your proposed upload is smaller than the minimum allowed object size."?
There are test failures with "Your proposed upload is smaller than the minimum allowed object size".
Per https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html, every part of a multipart upload except the last must be at least 5MB. So apparently a non-last part here is <5MB. We should not be reading the whole of the file's internal buffer if there are fewer than `blocksize` bytes left.

Actually, there is a change of behaviour here (I think). Previously, `_upload_chunk` would send as much data as it had whenever `write()`/`flush()` happened. Always splitting this into `blocksize` chunks (5MB!) may cause many more calls.
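To make the trade-off concrete, here is a minimal illustrative sketch of the fixed-size strategy under discussion (this is not the PR's actual `_upload_chunk` code): non-final parts are always exactly `blocksize` bytes, and any smaller tail is held back until the final flush.

```python
import io

BLOCKSIZE = 5 * 2**20  # 5 MiB, the S3 minimum size for a non-final part

def fixed_size_parts(buffer: io.BytesIO, final: bool) -> list[bytes]:
    """Split buffered data into parts of exactly BLOCKSIZE bytes.

    A tail smaller than BLOCKSIZE is included only when `final` is True
    (only the last part of a multipart upload may be smaller); otherwise
    it stays in the buffer to be merged with the next write()/flush().
    """
    data = buffer.getvalue()
    cut = (len(data) // BLOCKSIZE) * BLOCKSIZE  # largest BLOCKSIZE multiple
    parts = [data[i:i + BLOCKSIZE] for i in range(0, cut, BLOCKSIZE)]
    remainder = data[cut:]
    if final and remainder:
        parts.append(remainder)
        remainder = b""
    # keep only the unsent tail buffered
    buffer.seek(0)
    buffer.truncate()
    buffer.write(remainder)
    return parts
```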
@martindurant please take a look at the updated version.
That's correct. If we want to support R2, we need to make a decision about chunk size beforehand. What's your recommendation?
Is there a reliable way to check this? I guess this is the `tell`/`seek` you're talking about, but not all file-like objects have them.
It sounds like it might need to be a filesystem-wide optional flag to choose between the old behaviour and this new one.
The internal buffer is an `io.BytesIO`, so `len(buffer.getbuffer())` is a zero-cost way to get the size.
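A quick demonstration of that zero-cost check: `getbuffer()` returns a `memoryview` over the `BytesIO`'s internal storage, so taking its `len()` copies nothing and leaves the stream position untouched (unlike a `seek`-to-end-and-`tell` dance).

```python
import io

buf = io.BytesIO()
buf.write(b"x" * 1024)
buf.seek(100)  # move the stream position somewhere in the middle

assert len(buf.getbuffer()) == 1024  # total bytes held, no copy made
assert buf.tell() == 100             # position is unaffected
```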
How about a kwarg …?
@martindurant I've updated the suggested implementation. Note: I now fetch up to `part_max` bytes from the input.
@martindurant please have a look at the updated version; it manually creates a new buffer. I've also added the test that we discussed. It isn't clear from fsspec what …
Exactly.
Ping, LMK if you want to add anything (also feel free to edit/rewrite the PR the way you like).
The mamba setup seems to be having trouble, so please add this patch:

```diff
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -24,11 +24,10 @@ jobs:
           fetch-depth: 0
 
       - name: Setup conda
-        uses: mamba-org/setup-micromamba@v1
+        uses: conda-incubator/setup-miniconda@v3
         with:
           environment-file: ci/env.yaml
-          create-args: >-
-            python=${{ matrix.PY }}
+          python-version: ${{ matrix.PY }}
```
Feel free to make these or any other changes yourself; GitHub will allow you to push to this branch.
@martindurant is it possible to use this change without a release happening? I want to use it, but some dependencies pin a previous version; I am not even able to install directly from git. E.g. …
Assuming …, pip will think it has 2024.3.1 but will use the current code. Of course, you might want to do the same with fsspec too. Note that I intend to make a release, perhaps as soon as this week.
@geekodour … but a formal release would be much nicer :)
This is in 2024.10.0, no? Or do you mean an update to the …
Ah, great. It's there, just not mentioned in the changelog.
(oops) |
Fixes #789 (see that issue for the previous discussion).
Cloudflare's R2 storage has an additional restriction on multipart uploads: all parts must be the same size, with the exception of the last one, which can be smaller. I don't think this behaviour is well documented, but I've verified it empirically.
This unfortunately means that the clever merging of the last chunk used by s3fs can't be applied, so I suggest this simple implementation (I believe boto's default transfer follows the same fixed chunking logic).
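For comparison, boto3's managed transfers use one fixed `multipart_chunksize` for every part except the last, which matches R2's restriction. A minimal sketch, with placeholder endpoint, bucket, and file names:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Every non-final part is exactly multipart_chunksize bytes; only the
# last part may be smaller, which is what R2 requires.
config = TransferConfig(
    multipart_threshold=5 * 2**20,  # switch to multipart above 5 MiB
    multipart_chunksize=5 * 2**20,  # fixed 5 MiB parts
)

# Placeholder endpoint/bucket/key; credentials come from the environment.
s3 = boto3.client("s3", endpoint_url="https://<account_id>.r2.cloudflarestorage.com")
s3.upload_file("local.bin", "my-bucket", "remote.bin", Config=config)
```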
I have tested that this works by uploading and downloading random files of lengths [0, 1, 10, 1024, 6_000_000, 12_000_000, 123_456_789] to both S3 and R2.
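A rough sketch of that round-trip check; the endpoint and bucket are placeholders, and `fixed_upload_size` is assumed to be the flag this PR introduces (the final kwarg name may differ):

```python
import os
import s3fs

# Placeholder endpoint/bucket; fixed_upload_size is an assumption here.
fs = s3fs.S3FileSystem(
    endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
    fixed_upload_size=True,
)

for size in [0, 1, 10, 1024, 6_000_000, 12_000_000, 123_456_789]:
    data = os.urandom(size)
    path = f"my-bucket/roundtrip-{size}.bin"
    with fs.open(path, "wb", block_size=5 * 2**20) as f:
        f.write(data)
    assert fs.cat(path) == data, f"mismatch at size {size}"
```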
BTW @martindurant thanks again for the lib.