diff --git a/CHANGELOG.md b/CHANGELOG.md
index 995a2be77..6d2c16d48 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -33,6 +33,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Dependencies
 
 - Remove the numpy dependency upper bound.
+- Introduce Multi-Storage Client (MSC) as an optional dependency.
 
 ## [0.9.0] - 2024-12-04
 
diff --git a/examples/multi_storage_client/README.md b/examples/multi_storage_client/README.md
new file mode 100644
index 000000000..00fa346ae
--- /dev/null
+++ b/examples/multi_storage_client/README.md
@@ -0,0 +1,98 @@
+# Training from Object Storage using Multi-Storage Client
+
+## What is Multi-Storage Client (MSC)?
+
+[Multi-Storage Client](https://github.com/NVIDIA/multi-storage-client) is a Python
+library that provides a unified interface for accessing various object stores and
+file systems. It makes it easy for ML workloads to use object stores by providing
+a familiar file-like interface without sacrificing performance. The library adds
+functionality such as caching and client-side observability, and leverages the native
+SDKs specific to each object store for optimal performance.
+
+## Getting Started
+
+### Installation
+
+```bash
+pip install -r requirements.txt
+```
+
+Or install the extra dependencies that match your object storage backend:
+
+```bash
+# POSIX file systems.
+pip install multi-storage-client
+
+# NVIDIA AIStore.
+pip install "multi-storage-client[aistore]"
+
+# Azure Blob Storage.
+pip install "multi-storage-client[azure-storage-blob]"
+
+# AWS S3 and S3-compatible object stores.
+pip install "multi-storage-client[boto3]"
+
+# Google Cloud Storage (GCS).
+pip install "multi-storage-client[google-cloud-storage]"
+
+# Oracle Cloud Infrastructure (OCI) Object Storage.
+pip install "multi-storage-client[oci]"
+```
+
+### Configuration File
+
+The MSC configuration file defines profiles, each of which includes a storage provider configuration.
+An example MSC configuration file can be found at [msc_config.yaml](./msc_config.yaml).
+In this example, we're pointing to the [CMIP6 archive on AWS](https://registry.opendata.aws/cmip6/).
+
+## Usage Example
+
+MSC supports fsspec and integrates with frameworks such as Zarr and Xarray via
+the fsspec interface. The following example demonstrates how to use Zarr to
+access the CMIP6 dataset stored in AWS S3:
+
+```bash
+export MSC_CONFIG=./msc_config.yaml
+python
+>>> import zarr
+>>> zarr_group = zarr.open("msc://cmip6-pds/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp119/r1i1p1f1/day/tas/gr1/v20180701")
+>>> zarr_group.tree()
+/
+ ├── bnds (2,) float64
+ ├── height () float64
+ ├── lat (180,) float64
+ ├── lat_bnds (180, 2) float64
+ ├── lon (288,) float64
+ ├── lon_bnds (288, 2) float64
+ ├── tas (31390, 180, 288) float32
+ ├── time (31390,) int64
+ └── time_bnds (31390, 2) float64
+```
+
+## Update Existing Code Path with MSC
+
+For other Modulus examples, where Zarr is commonly used in training workflows,
+migrating to MSC is a straightforward process involving only configuration changes.
+For example, in the [CorrDiff](../generative/corrdiff/) training example, data
+currently accessed from the file system can be read through MSC by changing the
+input path from `/code/2023-01-24-cwb-4years.zarr` to `msc://cwb-diffusions/2023-01-24-cwb-4years.zarr`,
+assuming the locally stored data has been moved to an S3 bucket named `cwb-diffusions`
+and MSC has a profile `cwb-diffusions` pointing to this S3 bucket.
+
+### Current code path (Training from File System)
+
+```python
+input_path = "/code/2023-01-24-cwb-4years.zarr"
+zarr.open_consolidated(input_path)
+```
+
+### Updated code path (Training from Object Store using MSC)
+
+```python
+input_path = "msc://cwb-diffusions/2023-01-24-cwb-4years.zarr"
+zarr.open_consolidated(input_path)
+```
+
+## Additional Information
+
+- [Multi-Storage Client Documentation](https://nvidia.github.io/multi-storage-client/)
diff --git a/examples/multi_storage_client/msc_config.yaml b/examples/multi_storage_client/msc_config.yaml
new file mode 100644
index 000000000..b043fc9ba
--- /dev/null
+++ b/examples/multi_storage_client/msc_config.yaml
@@ -0,0 +1,30 @@
+# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
+# SPDX-FileCopyrightText: All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# This is an example MSC configuration file for accessing the CMIP6 archive on AWS:
+# https://registry.opendata.aws/cmip6/
+profiles:
+  cmip6-pds:
+    storage_provider:
+      type: s3
+      options:
+        region_name: us-west-2
+        base_path: cmip6-pds
+        signature_version: UNSIGNED
+cache:
+  location: /tmp/.cache
+  size_mb: 5000
diff --git a/examples/multi_storage_client/requirements.txt b/examples/multi_storage_client/requirements.txt
new file mode 100644
index 000000000..0c472bafc
--- /dev/null
+++ b/examples/multi_storage_client/requirements.txt
@@ -0,0 +1 @@
+multi-storage-client[boto3]
diff --git a/pyproject.toml b/pyproject.toml
index 24d4da7cf..8a086d88a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -77,6 +77,10 @@ fignet = [
     "webdataset>=0.2",
 ]
 
+storage = [
+    "multi-storage-client>=0.14.0",
+]
+
 all = [
     "h5py>=3.7.0",
     "netcdf4>=1.6.3",
@@ -94,6 +98,7 @@ all = [
     "nvidia-modulus[dev]",
     "nvidia-modulus[makani]",
     "nvidia-modulus[fignet]",
+    "nvidia-modulus[storage]",
 ]
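
The migration described in the README swaps a local path for a URL of the form `msc://<profile>/<path>`, as in both of its examples. A minimal sketch of that mapping follows. It assumes the object keys under the profile's `base_path` mirror the layout below a local root directory. `to_msc_url` is a hypothetical helper written for illustration and is not part of MSC or Modulus.

```python
from pathlib import PurePosixPath


def to_msc_url(local_path: str, local_root: str, profile: str) -> str:
    """Map a path under local_root to msc://<profile>/<relative path>.

    Hypothetical helper for illustration only; it is not an MSC or Modulus API.
    Assumes the bucket mirrors the directory layout below local_root.
    """
    # relative_to raises ValueError when local_path is not under local_root.
    relative = PurePosixPath(local_path).relative_to(PurePosixPath(local_root))
    return f"msc://{profile}/{relative.as_posix()}"


if __name__ == "__main__":
    # Path and profile name taken from the README's CorrDiff example.
    print(to_msc_url("/code/2023-01-24-cwb-4years.zarr", "/code", "cwb-diffusions"))
```

With the README's values this yields the same `msc://cwb-diffusions/2023-01-24-cwb-4years.zarr` path that the updated code passes to `zarr.open_consolidated`. Keeping the mapping in one place may help when several datasets move to object storage at once.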