Introduce Multi-Storage Client (MSC) as an optional dependency
dreamtalen committed Jan 13, 2025
1 parent 263d7b1 commit 11bfea4
Showing 5 changed files with 96 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -27,6 +27,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Dependencies

- Remove the numpy dependency upper bound.
- Add Multi-Storage Client (MSC) as an optional dependency.

## [0.9.0] - 2024-12-04

73 changes: 73 additions & 0 deletions examples/multi_storage_client/README.md
@@ -0,0 +1,73 @@
# Training from Object Storage using Multi-Storage Client

## What is Multi-Storage Client (MSC)?

[Multi-Storage Client](https://github.com/NVIDIA/multi-storage-client) is a Python
library that provides a unified interface for accessing various object stores and
file systems. It makes it easy for ML workloads to use object stores by providing
a familiar file-like interface without sacrificing performance. The library adds
functionality such as caching and client-side observability, and it uses the native
SDK of each object store for optimal performance.

## Getting Started

### Installation

```bash
pip install -r requirements.txt
```

Alternatively, install only the extra that matches your object storage backend:

```bash
# POSIX file systems.
pip install multi-storage-client

# NVIDIA AIStore.
pip install "multi-storage-client[aistore]"

# Azure Blob Storage.
pip install "multi-storage-client[azure-storage-blob]"

# AWS S3 and S3-compatible object stores.
pip install "multi-storage-client[boto3]"

# Google Cloud Storage (GCS).
pip install "multi-storage-client[google-cloud-storage]"

# Oracle Cloud Infrastructure (OCI) Object Storage.
pip install "multi-storage-client[oci]"
```
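
Because MSC is an optional dependency, code that may run without it can check for the package before relying on `msc://` paths. The following is a minimal sketch, assuming the package imports as `multi_storage_client`; the helper name `has_module` is hypothetical and not part of MSC or Modulus.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Assumption: the installed package is importable as `multi_storage_client`.
if not has_module("multi_storage_client"):
    print("multi-storage-client is not installed; msc:// paths will not work.")
```

Checking with `find_spec` avoids importing the package just to test for its presence.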

### Configuration File

The MSC configuration file defines profiles, each of which includes a storage provider configuration.
An example MSC configuration file can be found at [msc_config.yaml](./msc_config.yaml).
In this example, the data is stored in the bucket `cwb-diffusions` in an S3-compatible
object store, and credentials are read from the environment variables `S3_KEY` and `S3_SECRET`.
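
Since the example configuration refers to credentials through `${S3_KEY}` and `${S3_SECRET}`, it can help to confirm that those variables are set before starting a training job. The sketch below is a hypothetical preflight check that uses only the standard library. It does not call MSC, and the function name `missing_env_vars` is an assumption made for illustration.

```python
import os
import re

# Matches placeholders of the form ${NAME}, as used in the example configuration.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(config_text: str, env=None) -> list[str]:
    """Return names referenced as ${NAME} in config_text that are unset or empty in env."""
    env = os.environ if env is None else env
    names = sorted(set(_PLACEHOLDER.findall(config_text)))
    return [name for name in names if not env.get(name)]


# Illustrative input: two lines in the style of the credentials section.
example = "access_key: ${S3_KEY}\nsecret_key: ${S3_SECRET}\n"
print(missing_env_vars(example, env={"S3_KEY": "abc"}))  # prints ['S3_SECRET']
```

In practice, the text of `msc_config.yaml` would be passed in place of `example`, with `env` left at its default so that the real environment is checked.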

## Update Code Path with MSC

For Modulus’s use cases, where Zarr is commonly used in training workflows,
migrating to MSC is straightforward and requires only configuration changes.
For example, in the [CorrDiff](../generative/corrdiff/) training example, data
currently read from the file system can be served through MSC by changing the
input path from `/code/2023-01-24-cwb-4years.zarr` to `msc://cwb-diffusions/2023-01-24-cwb-4years.zarr`,
with the MSC configuration file defined in [msc_config.yaml](./msc_config.yaml).
This assumes the data stored on the local file system has been copied to an S3 bucket named `cwb-diffusions`.

### Current code path (Training from File System):

```python
import zarr

input_path = "/code/2023-01-24-cwb-4years.zarr"
zarr.open_consolidated(input_path)
```

### Updated code path (Training from Object Store using MSC):

```python
import zarr

input_path = "msc://cwb-diffusions/2023-01-24-cwb-4years.zarr"
zarr.open_consolidated(input_path)
```
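
In the updated path, the first component after `msc://` matches the profile name in the configuration file (`cwb-diffusions`), and the remainder is the object path within that profile. The sketch below only illustrates that naming convention with the standard library. The helper `split_msc_url` is hypothetical and is not how MSC itself resolves URLs.

```python
from urllib.parse import urlsplit


def split_msc_url(url: str) -> tuple[str, str]:
    """Split 'msc://<profile>/<path>' into (profile, path)."""
    parts = urlsplit(url)
    if parts.scheme != "msc" or not parts.netloc:
        raise ValueError(f"not an msc:// URL: {url!r}")
    # The URL authority is read as the profile name (assumed convention).
    return parts.netloc, parts.path.lstrip("/")


profile, path = split_msc_url("msc://cwb-diffusions/2023-01-24-cwb-4years.zarr")
print(profile, path)  # prints: cwb-diffusions 2023-01-24-cwb-4years.zarr
```

A plain file system path such as `/code/2023-01-24-cwb-4years.zarr` has no `msc` scheme, so the helper raises `ValueError` for it.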

## Additional Information

- [Multi-Storage Client Documentation](https://nvidia.github.io/multi-storage-client/)
19 changes: 19 additions & 0 deletions examples/multi_storage_client/msc_config.yaml
@@ -0,0 +1,19 @@
# This is an example MSC configuration file for accessing the cwb datasets stored
# in an S3-compatible bucket cwb-diffusions.
# The credentials are inferred from the environment variables S3_KEY and S3_SECRET.
profiles:
  cwb-diffusions:
    storage_provider:
      type: s3
      options:
        region_name: us-east-1
        endpoint_url: https://pbss.s8k.io
        base_path: cwb-diffusions
    credentials_provider:
      type: S3Credentials
      options:
        access_key: ${S3_KEY}
        secret_key: ${S3_SECRET}
cache:
  location: /tmp/.cache
  size_mb: 5000
1 change: 1 addition & 0 deletions examples/multi_storage_client/requirements.txt
@@ -0,0 +1 @@
multi-storage-client[boto3]
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -59,6 +59,7 @@ dev = [
    "interrogate==1.5.0",
    "coverage==6.5.0",
    "ruff==0.0.290",
    "multi-storage-client>=0.12.2",
]

makani = [
@@ -94,6 +95,7 @@ all = [
    "nvidia-modulus[dev]",
    "nvidia-modulus[makani]",
    "nvidia-modulus[fignet]",
    "multi-storage-client[boto3]",
]

