From 8b3b1c191db88777fff99f406aa81cdd81dd6bd7 Mon Sep 17 00:00:00 2001
From: Kel Markert
Date: Wed, 24 Jul 2024 13:36:17 -0700
Subject: [PATCH] Update xee dataflow example

This updates the xee dataflow example to prevent users from accidentally deleting their storage bucket when running the example.

This is a simple fix for a bug introduced by a recent [change to gcsfs](https://github.com/fsspec/gcsfs/pull/608). Paired with some [logic in the zarr library for writing datasets](https://github.com/zarr-developers/zarr-python/blob/df4c25f70c8a1e2b43214d7f26e80d34df502e7e/src/zarr/v2/storage.py#L567), that change lets users accidentally remove their bucket when writing to the root of a cloud storage bucket. This is a problem because users may have other data in the bucket they write to, and deleting the bucket removes everything.

Changes in this PR include:
1. pinning the `gcsfs` version to `<=2024.2.0`, which predates the PR that introduced the bug
2. pointing the example output at a subdirectory of the bucket instead of the bucket root

PiperOrigin-RevId: 655683820
---
 examples/dataflow/README.md | 2 +-
 pyproject.toml              | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/examples/dataflow/README.md b/examples/dataflow/README.md
index 083229e..45ec7f9 100644
--- a/examples/dataflow/README.md
+++ b/examples/dataflow/README.md
@@ -104,7 +104,7 @@ This example is focused on pulling data from Earth Engine, transforming the data
 ```shell
 python ee_to_zarr_dataflow.py \
   --input NASA/GPM_L3/IMERG_V06 \
-  --output gs://xee-out-${PROJECT} \
+  --output gs://xee-out-${PROJECT}/output/ \
   --target_chunks='time=6' \
   --runner DataflowRunner \
   --project $PROJECT \
diff --git a/pyproject.toml b/pyproject.toml
index fe2af00..a47a897 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -44,7 +44,7 @@ tests = [
 dataflow = [
     "absl-py",
     "apache-beam[gcp]",
-    "gcsfs",
+    "gcsfs<=2024.2.0",
     "xarray-beam",
 ]
 examples = [
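
Reviewer note: the patch itself only pins `gcsfs` and changes the README command. As a possible follow-up, the example script could also refuse an `--output` value that names the bucket root. The sketch below is an assumption about how such a check might look, not code from xee, gcsfs, or zarr; the helper name `require_bucket_subdirectory` is hypothetical and uses only the Python standard library.

```python
from urllib.parse import urlparse


def require_bucket_subdirectory(output: str) -> str:
    """Return `output` unchanged if it points below the bucket root, else raise.

    Hypothetical helper for illustration only; it is not part of xee.
    """
    parsed = urlparse(output)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"expected a gs://bucket/path URL, got {output!r}")
    # An empty path, or one made only of slashes, refers to the bucket root.
    if not parsed.path.strip("/"):
        raise ValueError(
            f"{output!r} is the bucket root; write to a subdirectory such as "
            f"gs://{parsed.netloc}/output/ instead"
        )
    return output
```

With this check, `gs://xee-out-demo/output/` would pass through unchanged, while `gs://xee-out-demo` and `gs://xee-out-demo/` would raise `ValueError` before any store is opened for writing.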