
Clowder Dataset Loader

This Dagster project automates the process of loading datasets into Clowder, a data management system. It scans specified directories for new datasets and triggers jobs to upload them to Clowder.

Features

- Automatic scanning of directory structures for new datasets
- Sensor-based triggering of dataset upload jobs
- Configurable dataset volume paths
- Integration with the Clowder API for dataset uploads

Getting started

First, install your Dagster code location as a Python package. Using pip's `--editable` (`-e`) flag installs the package in editable mode, so local code changes apply automatically as you develop:

```bash
pip install -e ".[dev]"
```

Environment variables

Set up the following environment variables:

- `CLOWDER_URL`: URL of your Clowder instance
- `CLOWDER_TOKEN`: authentication token for the Clowder API
- `SPACE_ID`: ID of the Clowder space where datasets will be uploaded
- `DATASET_DIRECTORY`: path to the root directory containing datasets
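These variables can be wired into the project's resources at load time. Below is a minimal sketch using Dagster's `EnvVar`, which defers reading each variable until runtime; the import path `dagster_clowder.resources` and the exact resource field names are assumptions based on the components described later in this README:

```python
from dagster import Definitions, EnvVar

# Assumed import path; the real definitions live in this project's code location.
from dagster_clowder.resources import ClowderResource, DatasetVolume

defs = Definitions(
    resources={
        # EnvVar reads the environment variable at run launch, not at import time.
        "clowder": ClowderResource(
            clowder_url=EnvVar("CLOWDER_URL"),
            clowder_token=EnvVar("CLOWDER_TOKEN"),
            space_id=EnvVar("SPACE_ID"),
        ),
        "dataset_volume": DatasetVolume(base_path=EnvVar("DATASET_DIRECTORY")),
    },
)
```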

Then, start the Dagster UI web server:

```bash
dagster dev
```

Open http://localhost:3000 with your browser to see the project.

Project Structure

The main components of this project are:

- `DatasetVolume`: a configurable resource that represents the directory structure of datasets (see the sketch below)
- `file_sensor`: a sensor that detects new datasets and triggers upload jobs
- `load_clowder`: a job definition for loading datasets into Clowder
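As an illustration, `DatasetVolume` might be shaped like the following `ConfigurableResource`; the `list_datasets` helper is hypothetical, shown only to make the directory-scanning idea concrete:

```python
from pathlib import Path

from dagster import ConfigurableResource


class DatasetVolume(ConfigurableResource):
    """Represents the root directory that holds one subdirectory per dataset."""

    base_path: str  # root directory scanned for new datasets

    def list_datasets(self) -> list[Path]:
        # Hypothetical helper: treat each immediate subdirectory as a dataset.
        return [p for p in Path(self.base_path).iterdir() if p.is_dir()]
```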

How It Works

1. The `file_sensor` periodically scans the specified dataset directory.
2. When new datasets are detected, it creates a run request for each dataset (see the sensor sketch below).
3. The `load_clowder` job is triggered for each run request.
4. Datasets are uploaded to Clowder using the provided API credentials.
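A hedged sketch of what `file_sensor` could look like follows. The `run_config` shape and config keys are assumptions based on the `DatasetRequest` fields described later in this README, and the import path for `load_clowder` is illustrative:

```python
import os

from dagster import RunRequest, sensor

from dagster_clowder.jobs import load_clowder  # assumed import path


@sensor(job=load_clowder)
def file_sensor(context):
    """Emit one run request per dataset directory under DATASET_DIRECTORY."""
    root = os.environ["DATASET_DIRECTORY"]
    for name in sorted(os.listdir(root)):
        dataset_dir = os.path.join(root, name)
        if not os.path.isdir(dataset_dir):
            continue
        files = [
            os.path.join(dataset_dir, f)
            for f in sorted(os.listdir(dataset_dir))
            if os.path.isfile(os.path.join(dataset_dir, f))
        ]
        # Dagster skips any RunRequest whose run_key it has already seen,
        # so datasets that were previously uploaded are not re-requested.
        yield RunRequest(
            run_key=name,
            run_config={
                "ops": {
                    "clowder_dataset": {
                        "config": {"dataset_title": name, "files": files}
                    }
                }
            },
        )
```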

ClowderResource

The `ClowderResource` is a configurable resource that handles communication with the Clowder API.

Configuration

- `clowder_url`: the base URL of the Clowder instance
- `clowder_token`: authentication token for the Clowder API
- `space_id`: ID of the Clowder space to use
- `timeout`: request timeout in seconds (default: 5)
- `retries`: number of retries for failed requests (default: 3)
- `ssl`: whether to verify SSL certificates (default: True)

Methods

- `_get`: performs GET requests against the Clowder API
- `_post`: performs POST requests against the Clowder API
- `get_or_create_space`: retrieves an existing Clowder space or creates one
- `create_dataset`: creates a new dataset in Clowder
- `add_files_to_dataset`: adds files to a Clowder dataset
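A condensed sketch of this resource is shown below, covering only `_post` and `create_dataset`. The endpoint path and `X-API-Key` header follow Clowder v1 REST conventions but should be treated as illustrative rather than a definitive implementation:

```python
import requests
from dagster import ConfigurableResource


class ClowderResource(ConfigurableResource):
    clowder_url: str
    clowder_token: str
    space_id: str
    timeout: int = 5   # seconds
    retries: int = 3
    ssl: bool = True   # verify SSL certificates

    def _post(self, path: str, **kwargs) -> dict:
        """POST to the Clowder API, retrying failed requests."""
        last_error: Exception = RuntimeError("no attempts made")
        for _ in range(self.retries):
            try:
                response = requests.post(
                    f"{self.clowder_url}{path}",
                    headers={"X-API-Key": self.clowder_token},
                    timeout=self.timeout,
                    verify=self.ssl,
                    **kwargs,
                )
                response.raise_for_status()
                return response.json()
            except requests.RequestException as err:
                last_error = err
        raise last_error

    def create_dataset(self, title: str) -> str:
        """Create an empty dataset in the configured space and return its id."""
        result = self._post(
            "/api/datasets/createempty",  # illustrative Clowder v1 route
            json={"name": title, "space": [self.space_id]},
        )
        return result["id"]
```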

Clowder Dataset Asset

The `clowder_dataset` asset is responsible for creating datasets and uploading files to Clowder.

Configuration

The asset uses a `DatasetRequest` configuration class with the following fields:

- `category`: category for the dataset
- `dataset_title`: title of the dataset (default: "Test Dataset")
- `files`: list of file paths to upload

Functionality

1. Creates a new dataset in Clowder.
2. Uploads the specified files to the created dataset.
3. Returns metadata about the uploaded files.
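Putting the pieces together, the asset might be shaped roughly like this; the `MaterializeResult` metadata keys, the `add_files_to_dataset` signature, and the default for `category` are assumptions:

```python
from dagster import Config, MaterializeResult, asset

from dagster_clowder.resources import ClowderResource  # assumed path (sketched above)


class DatasetRequest(Config):
    category: str = "uncategorized"  # assumed default; the README lists none
    dataset_title: str = "Test Dataset"
    files: list[str] = []


@asset
def clowder_dataset(config: DatasetRequest, clowder: ClowderResource) -> MaterializeResult:
    """Create a Clowder dataset and upload the configured files."""
    dataset_id = clowder.create_dataset(config.dataset_title)
    clowder.add_files_to_dataset(dataset_id, config.files)  # assumed signature
    return MaterializeResult(
        metadata={"dataset_id": dataset_id, "num_files": len(config.files)}
    )
```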

Development

Adding new Python dependencies

You can specify new Python dependencies in `setup.py`.
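For example, a new dependency would typically be added to `install_requires` in a scaffolded Dagster `setup.py`; the package names below are illustrative:

```python
from setuptools import find_packages, setup

setup(
    name="dagster_clowder",
    packages=find_packages(exclude=["dagster_clowder_tests"]),
    install_requires=[
        "dagster",
        "requests",  # illustrative existing dependency
        "pandas",    # example of a newly added dependency
    ],
    extras_require={"dev": ["dagster-webserver", "pytest"]},
)
```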

Unit testing

Tests live in the `dagster_clowder_tests` directory; run them with pytest:

```bash
pytest dagster_clowder_tests
```
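For instance, a test could exercise the sensor against a temporary directory using Dagster's `build_sensor_context`; the import path and the sensor's reliance on `DATASET_DIRECTORY` are assumptions carried over from the sensor sketch above:

```python
from dagster import build_sensor_context

from dagster_clowder.sensors import file_sensor  # assumed import path


def test_file_sensor_requests_one_run_per_dataset(tmp_path, monkeypatch):
    # Arrange: two dataset directories under a temporary root.
    (tmp_path / "dataset_a").mkdir()
    (tmp_path / "dataset_b").mkdir()
    monkeypatch.setenv("DATASET_DIRECTORY", str(tmp_path))

    # Act: evaluate the sensor directly with a test context.
    run_requests = list(file_sensor(build_sensor_context()))

    # Assert: one RunRequest per dataset directory.
    assert len(run_requests) == 2
```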

Schedules and sensors

If you want to enable Dagster Schedules or Sensors for your jobs, the Dagster Daemon process must be running. This is done automatically when you run `dagster dev`.

Once your Dagster Daemon is running, you can start turning on schedules and sensors for your jobs.
