Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for cloudflare's R2 storage #888

Merged
merged 12 commits into from
Sep 30, 2024
Merged

Conversation

arogozhnikov
Copy link
Contributor

fix #789 (also see for previous discussion)

Cloudflare's R2 storage has an additional restriction during multi-part uploads that all chunks are of the same size, with the exception of the last one. Last can be smaller than previous. I don't think this behavior is well documented, but I've checked it empirically.

This unfortunately means that clever merging of last chunk used by s3fs can't be used, and I suggest this simple implementation (I guess by default boto follows the same chunking logic).

I have tested this to work by uploading/downloading to S3 and R2 random files of lengths: [0, 1, 10, 1024, 6_000_000, 12_000_000, 123_456_789]

BTW @martindurant thanks again for the lib.

s3fs/core.py Outdated Show resolved Hide resolved
@arogozhnikov
Copy link
Contributor Author

I'll fix whitespace reported by pre-commit.

@martindurant do you know what causes "Your proposed upload is smaller than the minimum allowed object size." ?

@martindurant
Copy link
Member

There are test failures with "Your proposed upload is smaller than the minimum allowed object size".

@martindurant
Copy link
Member

In https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html

Part size | 5 MiB to 5 GiB. There is no minimum size limit on the last part of your multipart upload.

So apparently a non-last part is <5MB. We should not be reading the whole of the file internal buffer if there are less than blocksize bytes left and final is False - they should stay in the buffer for next time.

Actually, there is a change of behaviour here (I think). Previously, _uload_chunk would send as much data as it has whenever write()/flush() happens. Always splitting this into blocksize chunks (5MB!) may cause many more calls.

@arogozhnikov
Copy link
Contributor Author

arogozhnikov commented Aug 3, 2024

@martindurant please take a look at updated version.

Always splitting this into blocksize chunks (5MB!) may cause many more calls.

That's correct. If we want to support R2, we need to make a decision about chunk size beforehand. What's your recommendation?

We should not be reading the whole of the file internal buffer if there are less than blocksize bytes left

Is there a reliable way to check this? I guess this is tell/seek you talk about, but not all file-like objects have them.

@martindurant
Copy link
Member

That's correct. If we want to support R2, we need to make a decision about chunk size beforehand. What's your recommendation?

It sounds like it might need to be a filesystem-wide optional flag between the old behaviour and this new one.

@martindurant
Copy link
Member

Is there a reliable way to check this? I guess this is tell/seek you talk about, but not all file-like objects have them.

The internal buffer is an io.BytesIO, so len(buffer.getbuffer()) is a zero-cost way to get the size.

@arogozhnikov
Copy link
Contributor Author

arogozhnikov commented Aug 3, 2024

It sounds like it might need to be a filesystem-wide optional flag between the old behaviour and this new one.

how about a kwarg fixed_upload_size: int | None?

@arogozhnikov
Copy link
Contributor Author

@martindurant updated suggested implementation.

Note: I fetch up to part_max bytes from input now.

s3fs/core.py Outdated Show resolved Hide resolved
s3fs/core.py Show resolved Hide resolved
s3fs/core.py Show resolved Hide resolved
@arogozhnikov
Copy link
Contributor Author

@martindurant please have a look at updated version, it manually creates new buffer. I've also added a test that we discussed.

It isn't clear from fsspec what self.offset is supposed to mean, so I interpreted it as 'how many bytes were already uploaded'.

@martindurant
Copy link
Member

It isn't clear from fsspec what self.offset is supposed to mean, so I interpreted it as 'how many bytes were already uploaded'.

Exactly.

@arogozhnikov
Copy link
Contributor Author

ping, LMK if you want to add anything (also feel free to edit/rewrite PR the way you like)

@martindurant
Copy link
Member

The mamba setup seems to be having trouble, so please add this patch:

--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -24,11 +24,10 @@ jobs:
           fetch-depth: 0

       - name: Setup conda
-        uses: mamba-org/setup-micromamba@v1
+        uses: conda-incubator/setup-miniconda@v3
         with:
           environment-file: ci/env.yaml
-          create-args: >-
-            python=${{ matrix.PY }}
+          python-version: ${{ matrix.PY }}

@arogozhnikov
Copy link
Contributor Author

feel free to make these/any other changes yourself, Github will allow you to push to this branch

@martindurant martindurant merged commit def207e into fsspec:main Sep 30, 2024
21 checks passed
@geekodour
Copy link

@martindurant is it possible to use this change without a release happening? I want to use this change but some dependencies use a previous version, I am not even able to install directly from git.

Eg. Because datasets (2.20.0) depends on fsspec[http] (>=2023.1.0,<=2024.5.0)

@martindurant
Copy link
Member

Assuming datasets doesn't actually have conflicting calls with the current codebase, you can always install s3fs with --no-deps (I think all the installers accept this flag).
If datasets actually tests the package metadata, you could fool it with

pip install datasets # grabs old version of s3fs
git clone https://github.com/fsspec/s3fs/
cd s3fs
git checkout 2024.3.1
pip install -e .
git checkout main

pip will think it has 2024.3.1, but will use the current code. Of course, you might want to do the same with fsspec too.

Note that I intend to make a release perhaps yet this week.

@arogozhnikov
Copy link
Contributor Author

@geekodour
we use this in requirements file:

s3fs @ git+https://github.com/fsspec/s3fs@def207eef16607643dcdaf02d2dc01b439ccfc8a

but formal release would be much nicer :)

@martindurant
Copy link
Member

formal release would be much nicer

This is in 2024.10.0, no? Or do you mean an update to the datasets requirements?

@arogozhnikov
Copy link
Contributor Author

This is in 2024.10.0, no?

Ah, great. It's there, just not mentioned in changelog.

@martindurant
Copy link
Member

not mentioned in changelog

(oops)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect MultiPartUpload Chunksize
3 participants