Add support for Cloudflare's R2 storage #888
Conversation
I'll fix the whitespace reported by pre-commit. @martindurant, do you know what causes "Your proposed upload is smaller than the minimum allowed object size."?
There are test failures with "Your proposed upload is smaller than the minimum allowed object size".
Per https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html, every part of a multipart upload except the last must be at least 5MB. So apparently a non-last part here is <5MB. We should not be reading the whole of the file's internal buffer if there are fewer than `blocksize` bytes left.

Actually, there is a change of behaviour here (I think). Previously, `_upload_chunk` would send as much data as it had whenever `write()`/`flush()` happened. Always splitting this into `blocksize` chunks (5MB!) may cause many more calls.
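To make the trade-off concrete, here is a minimal illustrative sketch of the fixed-size strategy under discussion (this is not the PR's actual `_upload_chunk` code): non-final parts are always exactly `blocksize` bytes, and any smaller tail is held back until the final flush.

```python
import io

BLOCKSIZE = 5 * 2**20  # 5 MiB, the S3 minimum size for a non-final part

def fixed_size_parts(buffer: io.BytesIO, final: bool) -> list[bytes]:
    """Split buffered data into parts of exactly BLOCKSIZE bytes.

    A tail smaller than BLOCKSIZE is included only when `final` is True
    (only the last part of a multipart upload may be smaller); otherwise
    it stays in the buffer to be merged with the next write()/flush().
    """
    data = buffer.getvalue()
    cut = (len(data) // BLOCKSIZE) * BLOCKSIZE  # largest BLOCKSIZE multiple
    parts = [data[i:i + BLOCKSIZE] for i in range(0, cut, BLOCKSIZE)]
    remainder = data[cut:]
    if final and remainder:
        parts.append(remainder)
        remainder = b""
    # keep only the unsent tail buffered
    buffer.seek(0)
    buffer.truncate()
    buffer.write(remainder)
    return parts
```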
@martindurant please take a look at the updated version.
That's correct. If we want to support R2, we need to make a decision about chunk size beforehand. What's your recommendation?
Is there a reliable way to check this? I guess this is the `tell`/`seek` you're talking about, but not all file-like objects have them.
It sounds like it might need to be a filesystem-wide optional flag to choose between the old behaviour and this new one.
The internal buffer is an `io.BytesIO`, so `len(buffer.getbuffer())` is a zero-cost way to get the size.
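A quick demonstration of that zero-cost check: `getbuffer()` returns a `memoryview` over the `BytesIO`'s internal storage, so taking its `len()` copies nothing and leaves the stream position untouched (unlike a `seek`-to-end-and-`tell` dance).

```python
import io

buf = io.BytesIO()
buf.write(b"x" * 1024)
buf.seek(100)  # move the stream position somewhere in the middle

assert len(buf.getbuffer()) == 1024  # total bytes held, no copy made
assert buf.tell() == 100             # position is unaffected
```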
How about a kwarg …?
@martindurant I've updated the suggested implementation. Note: I now fetch up to `part_max` bytes from the input.
@martindurant please have a look at the updated version; it manually creates a new buffer. I've also added the test that we discussed. It isn't clear from fsspec what …
Exactly.
Ping, LMK if you want to add anything (also feel free to edit/rewrite the PR the way you like).
The mamba setup seems to be having trouble, so please add this patch:

```diff
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -24,11 +24,10 @@ jobs:
           fetch-depth: 0
 
       - name: Setup conda
-        uses: mamba-org/setup-micromamba@v1
+        uses: conda-incubator/setup-miniconda@v3
         with:
           environment-file: ci/env.yaml
-          create-args: >-
-            python=${{ matrix.PY }}
+          python-version: ${{ matrix.PY }}
```
Feel free to make these or any other changes yourself; GitHub will allow you to push to this branch.
@martindurant is it possible to use this change without a release happening? I want to use it, but some dependencies pin a previous version; I am not even able to install directly from git. E.g. …
Assuming …, pip will think it has 2024.3.1 but will use the current code. Of course, you might want to do the same with fsspec too. Note that I intend to make a release, perhaps as soon as this week.
@geekodour … but a formal release would be much nicer :)
This is in 2024.10.0, no? Or do you mean an update to the …
Ah, great. It's there, just not mentioned in the changelog.
(oops) |
Fixes #789 (see that issue for the previous discussion).
Cloudflare's R2 storage has an additional restriction on multipart uploads: all parts must be the same size, with the exception of the last one, which can be smaller. I don't think this behaviour is well documented, but I've verified it empirically.
This unfortunately means that the clever merging of the last chunk used by s3fs can't be applied, so I suggest this simple implementation (I believe boto's default transfer follows the same fixed chunking logic).
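For comparison, boto3's managed transfers use one fixed `multipart_chunksize` for every part except the last, which matches R2's restriction. A minimal sketch, with placeholder endpoint, bucket, and file names:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Every non-final part is exactly multipart_chunksize bytes; only the
# last part may be smaller, which is what R2 requires.
config = TransferConfig(
    multipart_threshold=5 * 2**20,  # switch to multipart above 5 MiB
    multipart_chunksize=5 * 2**20,  # fixed 5 MiB parts
)

# Placeholder endpoint/bucket/key; credentials come from the environment.
s3 = boto3.client("s3", endpoint_url="https://<account_id>.r2.cloudflarestorage.com")
s3.upload_file("local.bin", "my-bucket", "remote.bin", Config=config)
```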
I have tested that this works by uploading and downloading random files of lengths [0, 1, 10, 1024, 6_000_000, 12_000_000, 123_456_789] to both S3 and R2.
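A rough sketch of that round-trip check; the endpoint and bucket are placeholders, and `fixed_upload_size` is assumed to be the flag this PR introduces (the final kwarg name may differ):

```python
import os
import s3fs

# Placeholder endpoint/bucket; fixed_upload_size is an assumption here.
fs = s3fs.S3FileSystem(
    endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
    fixed_upload_size=True,
)

for size in [0, 1, 10, 1024, 6_000_000, 12_000_000, 123_456_789]:
    data = os.urandom(size)
    path = f"my-bucket/roundtrip-{size}.bin"
    with fs.open(path, "wb", block_size=5 * 2**20) as f:
        f.write(data)
    assert fs.cat(path) == data, f"mismatch at size {size}"
```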
BTW @martindurant thanks again for the lib.