
Commit

[SageMaker] Add GraphStorm SageMaker Pipeline creation and execution (#1108)

*Issue #, if available:*

*Description of changes:*

* Add support for customizable SageMaker pipelines.
* We use one file for parameter parsing and validation, one to create the pipeline, and one to execute it.
* I used GenAI to create the first README, then hand-edited. 

By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>
2 people authored and jalencato committed Jan 13, 2025
1 parent d5ae70a commit 2a99ca2
Showing 9 changed files with 2,157 additions and 11 deletions.
@@ -38,6 +38,7 @@ Full argument list of the ``gconstruct.construct_graph`` command
* **-\-add-reverse-edges**: boolean value to decide whether to add reverse edges for the given graph. Adding this argument sets it to true; otherwise, it defaults to false. It is **strongly** suggested to include this argument for graph construction, as some nodes in the original data may not have in-degrees, and thus cannot update their representations by aggregating messages from their neighbors. Adding this argument helps prevent this issue.
* **-\-output-format**: the format of the constructed graph; options are ``DGL`` and ``DistDGL``. Default is ``DistDGL``. It also accepts multiple graph formats at the same time, separated by a space, for example ``--output-format "DGL DistDGL"``. The output format is explained in the :ref:`Output <gcon-output-format>` section above.
* **-\-num-parts**: an integer value that specifies the number of graph partitions to produce. This is only valid if the output format is ``DistDGL``.
* **-\-part-method**: the partitioning method to use. We support 'metis' and 'random'. Default is 'metis'.
* **-\-skip-nonexist-edges**: boolean value to decide whether to skip edges whose endpoint nodes don't exist. Default is true.
* **-\-ext-mem-workspace**: the directory where the tool can store intermediate data during graph construction. We suggest using a high-speed SSD as the external memory workspace.
* **-\-ext-mem-feat-size**: the minimum number of feature dimensions required for a feature to be stored in external memory. Default is 64.
1 change: 1 addition & 0 deletions python/graphstorm/gconstruct/construct_graph.py
@@ -928,6 +928,7 @@ def process_graph(args):
help="The number of graph partitions. " + \
"This is only valid if the output format is DistDGL.")
argparser.add_argument("--part-method", type=str, default='metis',
choices=['metis', 'random'],
help="The partition method. Currently, we support 'metis' and 'random'.")
argparser.add_argument("--skip-nonexist-edges", action='store_true',
help="Skip edges that whose endpoint nodes don't exist.")
277 changes: 277 additions & 0 deletions sagemaker/pipeline/README.md
@@ -0,0 +1,277 @@
# GraphStorm SageMaker Pipeline

This project provides a set of tools to create and execute SageMaker pipelines for GraphStorm, a library for large-scale graph neural networks. The pipeline automates the process of graph construction, partitioning, training, and inference using Amazon SageMaker.

## Table of Contents

1. [Overview](#overview)
2. [Prerequisites](#prerequisites)
3. [Project Structure](#project-structure)
4. [Installation](#installation)
5. [Usage](#usage)
- [Creating a Pipeline](#creating-a-pipeline)
- [Executing a Pipeline](#executing-a-pipeline)
6. [Pipeline Components](#pipeline-components)
7. [Configuration](#configuration)
8. [Advanced Usage](#advanced-usage)
9. [Troubleshooting](#troubleshooting)

## Overview

This project simplifies the process of running GraphStorm workflows on Amazon SageMaker. It provides scripts to:

1. Define and create SageMaker pipelines for GraphStorm tasks
2. Execute these pipelines with customizable parameters
3. Manage different stages of graph processing, including construction, partitioning, training, and inference

## Prerequisites

- Python 3.8+
- AWS account with appropriate permissions. See the official
[SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-access.html) docs
for detailed permissions needed to create and run SageMaker Pipelines.
- Familiarity with SageMaker AI and
[SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).
- Basic understanding of graph neural networks and [GraphStorm](https://graphstorm.readthedocs.io/en/latest/index.html).
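
As an optional sanity check, assuming you have the AWS CLI installed (it is not required by the pipeline scripts themselves), you can confirm that your credentials resolve to the expected account and that your Python version meets the requirement:

```bash
aws sts get-caller-identity  # shows the account and identity your credentials resolve to
python3 --version            # should report Python 3.8 or newer
```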

## Project Structure

The project consists of three main Python scripts:

1. `create_sm_pipeline.py`: Defines the structure of a SageMaker pipeline
2. `pipeline_parameters.py`: Manages the configuration and parameters for the pipeline
3. `execute_sm_pipeline.py`: Executes created pipelines

## Installation

To construct and execute GraphStorm SageMaker pipelines, you need the repository code
available locally and a Python environment with the SageMaker SDK and `boto3` installed.

1. Clone the GraphStorm repository:
```
git clone https://github.com/awslabs/graphstorm.git
cd graphstorm/sagemaker/pipeline
```

2. Install the required dependencies:
```
pip install sagemaker boto3
```
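
Optionally, verify that both packages are importable from your environment:

```bash
python -c "import sagemaker, boto3; print(sagemaker.__version__, boto3.__version__)"
```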

## Usage

### Creating a Pipeline

To create a new SageMaker pipeline for GraphStorm:

```bash
python create_sm_pipeline.py \
--graph-construction-config-filename my_gconstruct_config.json \
--graph-name my-graph \
--graphstorm-pytorch-cpu-image-url 123456789012.dkr.ecr.us-west-2.amazonaws.com/graphstorm:sagemaker-cpu \
--input-data-s3 s3://input-bucket/data \
--instance-count 2 \
--jobs-to-run gconstruct train inference \
--output-prefix s3://output-bucket/results \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--role arn:aws:iam::123456789012:role/SageMakerExecutionRole \
--train-inference-task node_classification \
--train-yaml-s3 s3://config-bucket/train.yaml
```

This command creates a new pipeline with the specified configuration. The pipeline will
include one GConstruct job, one training job, and one inference job.
The `--role` argument provides the execution role SageMaker will use to
run the jobs, and `--graphstorm-pytorch-cpu-image-url` provides
the Docker image to use during GConstruct and training.
The pipeline will use the configuration defined in `s3://input-bucket/data/my_gconstruct_config.json`
to construct the graph, and the training configuration file at `s3://config-bucket/train.yaml`
to run training and inference.

The `--instance-count` parameter determines the number of partitions to create and the number of
worker instances to use during partitioning and training.

You can customize various aspects of the pipeline using additional command-line arguments. Refer to the script's help message for a full list of options:

```bash
python create_sm_pipeline.py --help
```

### Executing a Pipeline

To execute a created pipeline:

```bash
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2
```

You can override the default pipeline parameters during execution:

```bash
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--instance-count 4 \
--gpu-instance-type ml.g4dn.12xlarge
```

For a full list of execution options:

```bash
python execute_sm_pipeline.py --help
```

For more fine-grained execution options, like selective execution, please refer to the
[SageMaker AI documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-selective-ex.html).

## Pipeline Components

The GraphStorm SageMaker pipeline can include the following steps:

1. **Graph Construction (GConstruct)**: Builds the partitioned graph from input data on a single instance.
2. **Graph Processing (GSProcessing)**: Processes the graph data using PySpark, preparing it for distributed graph partitioning.
3. **Graph Partitioning (DistPart)**: Partitions the graph using multiple instances.
4. **GraphBolt Conversion**: Converts the partitioned data (usually generated from DistPart) to GraphBolt format.
5. **Training**: Trains the graph neural network model.
6. **Inference**: Runs inference on the trained model.

Each step is configurable and can be customized based on your specific requirements.
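
Which of these steps run is controlled by the `--jobs-to-run` argument when creating the pipeline (see [Configuration](#configuration) below). As a quick illustration, the two typical construction paths look like this:

```bash
# Single-instance graph construction with GConstruct (smaller graphs)
--jobs-to-run gconstruct train inference

# Distributed construction with GSProcessing and DistPart; add gb_convert
# when using GraphBolt (larger graphs)
--jobs-to-run gsprocessing dist_part gb_convert train inference
```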

## Configuration

The pipeline's behavior is controlled by various configuration parameters, including:

- AWS configuration (region, roles, image URLs)
- Instance configuration (instance types, counts)
- Task configuration (graph name, input/output locations)
- Training and inference configurations

### AWS Configuration
- `--execution-role`: SageMaker execution IAM role ARN. (Required)
- `--region`: AWS region. (Required)
- `--graphstorm-pytorch-cpu-image-uri`: GraphStorm GConstruct/dist_part/train/inference CPU ECR image URI. (Required)
- `--graphstorm-pytorch-gpu-image-uri`: GraphStorm GConstruct/dist_part/train/inference GPU ECR image URI.
- `--gsprocessing-pyspark-image-uri`: GSProcessing SageMaker PySpark ECR image URI.

### Instance Configuration
- `--instance-count` / `--num-parts`: Number of worker instances and graph partitions used for partitioning, training, and inference. (Required)
- `--cpu-instance-type`: CPU instance type. (Default: ml.m5.4xlarge)
- `--gpu-instance-type`: GPU instance type. (Default: ml.g5.4xlarge)
- `--train-on-cpu`: Run training and inference on CPU instances instead of GPU. (Flag)
- `--graph-construction-instance-type`: Instance type for graph construction.
- `--gsprocessing-instance-count`: Number of GSProcessing instances.
- `--volume-size-gb`: Additional volume size for SageMaker instances in GB. (Default: 100)

### Task Configuration
- `--graph-name`: Name of the graph. (Required)
- `--input-data-s3`: S3 path to the input graph data. (Required)
- `--output-prefix-s3`: S3 prefix for the output data. (Required)
- `--pipeline-name`: Name for the pipeline.
- `--base-job-name`: Base job name for SageMaker jobs. (Default: 'gs')
- `--jobs-to-run`: Space-separated string of jobs to run in the pipeline.
Possible values are: "gconstruct", "gsprocessing", "dist_part", "gb_convert", "train", "inference" (Required).
- `--log-level`: Logging level for the jobs. (Default: INFO)
- `--step-cache-expiration`: Expiration time for the step cache. (Default: 30d)
- `--update-pipeline`: Update an existing pipeline instead of creating a new one. (Flag)

### Graph Construction Configuration
- `--graph-construction-config-filename`: Filename for the graph construction config.
- `--graph-construction-args`: Parameters to be passed directly to the GConstruct job.

### Partition Configuration
- `--partition-algorithm`: Partitioning algorithm. (Default: random)
- `--partition-output-json`: Name for the output JSON file that describes the partitioned data. (Default: metadata.json)
- `--partition-input-json`: Name for the JSON file that describes the input data for partitioning. (Default: updated_row_counts_metadata.json)

### Training Configuration
- `--model-output-path`: S3 path for model output.
- `--num-trainers`: Number of trainers to use during training/inference. (Default: 4)
- `--train-inference-task-type`: Task type for training and inference. (Required)
- `--train-yaml-s3`: S3 path to train YAML configuration file.
- `--use-graphbolt`: Whether to use GraphBolt for GConstruct, training and inference. (Default: false)

### Inference Configuration
- `--inference-yaml-s3`: S3 path to inference YAML configuration file.
- `--inference-model-snapshot`: Which model snapshot to choose to run inference with.
- `--save-predictions`: Whether to save predictions to S3 during inference. (Flag)
- `--save-embeddings`: Whether to save embeddings to S3 during inference. (Flag)

### Script Paths
- `--dist-part-script`: Path to DistPartition SageMaker entry point script.
- `--gb-convert-script`: Path to GraphBolt partition conversion script.
- `--train-script`: Path to training SageMaker entry point script.
- `--inference-script`: Path to inference SageMaker entry point script.
- `--gconstruct-script`: Path to GConstruct SageMaker entry point script.
- `--gsprocessing-script`: Path to GSProcessing SageMaker entry point script.
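
As an illustrative sketch, several of the parameters above can be combined when creating a pipeline; the values shown are placeholders, and the `...` stands for the required AWS and task arguments shown in [Creating a Pipeline](#creating-a-pipeline):

```bash
python create_sm_pipeline.py \
    ... \
    --partition-algorithm random \
    --num-trainers 8 \
    --gpu-instance-type ml.g5.12xlarge \
    --volume-size-gb 200
```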

## Advanced Usage

### Using GraphBolt

To use GraphBolt for improved performance:

```bash
python create_sm_pipeline.py \
... \
--use-graphbolt true
```

When you use GSProcessing for graph construction and want to use GraphBolt, you need to include a `gb_convert` step in your
job sequence, i.e. to get a partitioned graph you need the sequence `"gsprocessing dist_part gb_convert [train] [inference]"`,
as shown in the next section.
### Custom Job Sequences

You can customize the sequence of jobs in the pipeline using the `--jobs-to-run` argument when creating the pipeline. For example:

```bash
python create_sm_pipeline.py \
... \
--jobs-to-run gsprocessing dist_part gb_convert train inference \
--use-graphbolt true
```

This will create a pipeline that uses GSProcessing to process and prepare the data for partitioning,
uses DistPart to partition the data, converts the partitioned data to the GraphBolt format,
then runs a training and an inference job in sequence.
Use this job sequence when your graph is too large to partition on a single instance using
GConstruct; as a rule of thumb, move to distributed partitioning when your graph has more than
10 billion edges or your features are larger than 1 TB.

### Asynchronous Execution

To start a pipeline execution without waiting for it to complete:

```bash
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--region us-west-2 \
--async-execution
```

### Local Execution

For testing purposes, you can execute the pipeline locally:

```bash
python execute_sm_pipeline.py \
--pipeline-name my-graphstorm-pipeline \
--local-execution
```

Note that local execution requires a GPU if the pipeline is configured to use GPU instances.

## Troubleshooting

- Ensure all required AWS permissions are correctly set up
- Check SageMaker execution logs for detailed error messages (a CLI sketch for finding the failing step is shown below)
- Verify that all S3 paths are correct and accessible
- Ensure that the specified EC2 instance types are available in your region

See also [Troubleshooting Amazon SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-troubleshooting.html)

For more detailed information about GraphStorm, refer to the [GraphStorm documentation](https://graphstorm.readthedocs.io/).

If you encounter any issues or have questions, please open an issue in the project's [GitHub repository](https://github.com/awslabs/graphstorm/issues).