Introduce Multi-Storage Client (MSC) as an optional dependency #754

Open
wants to merge 2 commits into
base: main

@dreamtalen dreamtalen commented Jan 9, 2025

Modulus Pull Request

Description

This PR introduces Multi-Storage Client (MSC) as an optional dependency for Modulus, with examples.

The Multi-Storage Client (MSC) is a unified, high-performance Python client designed to seamlessly interface with various object and file storage systems. It supports:

  • Cloud Object Stores: AWS S3, Azure Blob Storage, Google Cloud Storage (GCS), Oracle Cloud Infrastructure (OCI) Object Storage.
  • NVIDIA AIStore.
  • POSIX file systems.

MSC provides a generic interface to interact with objects and files across various storage services. This lets you spend less time learning each storage service's unique interface and lets you change where data is stored without having to change how your code accesses it.
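The idea of one interface over several storage services can be sketched with the standard library alone. This is not MSC's API: `StorageBackend`, `PosixBackend`, `InMemoryBackend`, and `copy_object` are hypothetical names used only to illustrate why calling code does not need to change when the storage location does.

```python
import tempfile
from pathlib import Path
from typing import Dict, Protocol


class StorageBackend(Protocol):
    """Hypothetical minimal interface that every backend implements."""

    def read(self, key: str) -> bytes: ...

    def write(self, key: str, data: bytes) -> None: ...


class PosixBackend:
    """Stores each key as a file under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)


class InMemoryBackend:
    """Stand-in for an object store; keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def write(self, key: str, data: bytes) -> None:
        self._objects[key] = data


def copy_object(src: StorageBackend, dst: StorageBackend, key: str) -> None:
    """Caller code depends only on the shared interface, not on a backend."""
    dst.write(key, src.read(key))


def roundtrip_demo() -> bytes:
    """Write to a POSIX directory, copy to the in-memory store, read back."""
    with tempfile.TemporaryDirectory() as tmp:
        posix = PosixBackend(tmp)
        memory = InMemoryBackend()
        posix.write("data/sample.bin", b"hello")
        copy_object(posix, memory, "data/sample.bin")
        return memory.read("data/sample.bin")
```

Swapping `InMemoryBackend` for a real object-store backend would leave `copy_object` untouched, which is the property the description above refers to.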

We have successfully completed several PoCs (ICON, CorrDiff) that use MSC to train Modulus workloads from S3-compatible object stores.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

@dreamtalen dreamtalen force-pushed the msc-image branch 2 times, most recently from 08a450a to 11bfea4 Compare January 13, 2025 22:20
@dreamtalen dreamtalen changed the title Add Multi-Storage Client (MSC) as an optional dependency Introduce Multi-Storage Client (MSC) as an optional dependency Jan 13, 2025
@dreamtalen dreamtalen force-pushed the msc-image branch 3 times, most recently from 941e074 to 40e72aa Compare January 13, 2025 22:26
@dreamtalen dreamtalen marked this pull request as ready for review January 13, 2025 22:29
@ktangsali ktangsali self-assigned this Jan 13, 2025
@ktangsali ktangsali requested a review from akshaysubr January 13, 2025 23:34
@dreamtalen
Author

/blossom-ci

@akshaysubr akshaysubr requested a review from NickGeneva January 27, 2025 20:13
pyproject.toml Outdated
@@ -59,6 +59,7 @@ dev = [
"interrogate==1.5.0",
"coverage==6.5.0",
"ruff==0.0.290",
"multi-storage-client>=0.12.2",
Collaborator

Why is this only in the dev optional dependency list? Wouldn't it be better to put it in some storage specific optional dependency list?

Author

Makes sense. I moved it to a storage-specific optional dependency list.
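
For context, a minimal sketch of how calling code can guard an optional dependency so that the core package still imports without it. The import name `multistorageclient` is an assumption, and `optional_module_available` and `require_optional` are hypothetical helpers, not part of Modulus or MSC.

```python
import importlib.util


def optional_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_optional(module_name: str, extra_hint: str) -> None:
    """Raise a descriptive ImportError when an optional dependency is missing."""
    if not optional_module_available(module_name):
        raise ImportError(
            f"'{module_name}' is required for this feature. {extra_hint}"
        )


# Assumed import name for the multi-storage-client package.
MSC_MODULE = "multistorageclient"
HAS_MSC = optional_module_available(MSC_MODULE)
```

Code paths that need remote storage could call `require_optional(MSC_MODULE, "Install the storage optional dependencies.")` before use, while everything else keeps working on a plain install.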

pyproject.toml Outdated
@@ -94,6 +95,7 @@ all = [
"nvidia-modulus[dev]",
"nvidia-modulus[makani]",
"nvidia-modulus[fignet]",
"multi-storage-client[boto3]",
Collaborator

Why only install boto3 and not other backends in the all dep list?

Author

Thanks for pointing that out! I removed boto3 here, as it should be installed in the example folder instead.

type: s3
options:
  region_name: us-east-1
  endpoint_url: https://pbss.s8k.io
Collaborator

Please remove this endpoint reference. A Modulus example should be runnable by anyone, without depending on a specific network domain. Please refactor this example to point to a publicly available Zarr dataset. Here are a couple of dataset suggestions:

  1. CMIP6 archive on AWS: https://registry.opendata.aws/cmip6/
  2. ARCO ERA5 dataset on Google Cloud: https://github.com/google-research/arco-era5?tab=readme-ov-file#025-pressure-and-surface-level-data

Author

Thanks, I added examples that use the CMIP6 archive.
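
As a rough sketch of what a profile for a public bucket might look like once the private `endpoint_url` is dropped, the snippet below builds the configuration as a plain dict. Only `type`, `options`, and `region_name` appear in the snippet under review; the `profiles`, `storage_provider`, and `base_path` keys are assumptions about the MSC config layout, `build_public_profile` is a hypothetical helper, and the bucket name and region are placeholders to confirm against the registry page linked above.

```python
import json


def build_public_profile(name: str, bucket: str, region: str) -> dict:
    """Build a storage profile with no custom endpoint (key names assumed)."""
    return {
        "profiles": {
            name: {
                "storage_provider": {
                    "type": "s3",
                    "options": {
                        "base_path": bucket,    # assumed key name
                        "region_name": region,  # key shown in the reviewed snippet
                    },
                }
            }
        }
    }


# Placeholder values; verify the bucket and region before relying on them.
config = build_public_profile("cmip6", "cmip6-pds", "us-west-2")
print(json.dumps(config, indent=2))
```

Credential handling for anonymous access is not shown here and would need to follow the MSC documentation.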

@dreamtalen dreamtalen force-pushed the msc-image branch 2 times, most recently from e479897 to e20346a Compare January 29, 2025 23:18