diff --git a/examples/sagemaker-pipelines-graphbolt/Dockerfile.processing b/examples/sagemaker-pipelines-graphbolt/Dockerfile.processing index 0470c21b2b..093ca98d01 100644 --- a/examples/sagemaker-pipelines-graphbolt/Dockerfile.processing +++ b/examples/sagemaker-pipelines-graphbolt/Dockerfile.processing @@ -4,7 +4,7 @@ FROM public.ecr.aws/ubuntu/ubuntu:22.04 ENV DEBIAN_FRONTEND=noninteractive # Install Python and other dependencies -RUN apt update && apt install -y \ +RUN apt-get update && apt-get install -y \ axel \ curl \ python3 \ @@ -13,9 +13,9 @@ RUN apt update && apt install -y \ unzip \ && rm -rf /var/lib/apt/lists/* - +# Copy and install ripunzip COPY ripunzip_2.0.0-1_amd64.deb ripunzip_2.0.0-1_amd64.deb -RUN apt install -y ./ripunzip_2.0.0-1_amd64.deb +RUN apt-get install -y ./ripunzip_2.0.0-1_amd64.deb RUN python3 -m pip install --no-cache-dir --upgrade pip==24.3.1 && \ python3 -m pip install --no-cache-dir \ @@ -25,11 +25,11 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip==24.3.1 && \ tqdm==4.67.1 \ tqdm-loggable==0.2 +# Install aws cli RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \ && unzip awscliv2.zip \ && ./aws/install - # Copy processing scripts COPY process_papers100M.sh /opt/ml/code/ COPY convert_ogb_papers100M_to_gconstruct.py /opt/ml/code/ diff --git a/examples/sagemaker-pipelines-graphbolt/README.md b/examples/sagemaker-pipelines-graphbolt/README.md new file mode 100644 index 0000000000..69337a3eae --- /dev/null +++ b/examples/sagemaker-pipelines-graphbolt/README.md @@ -0,0 +1,531 @@ +# Faster distributed graph neural network training with GraphStorm 0.4 + +GraphStorm is a low-code enterprise graph machine learning (ML) framework that provides ML practitioners a simple way of building, training and deploying graph ML solutions on industry-scale graph data. While GraphStorm can run efficiently on single instances for small graphs, it truly shines when scaling to enterprise-level graphs in distributed mode using a cluster of EC2 instances or Amazon SageMaker. + +GraphStorm 0.4 introduced integration with DGL-GraphBolt, a new graph storage and sampling framework that uses a compact graph representation and pipelined sampling to reduce memory requirements and speed up Graph Neural Network (GNN) training by up to 3x. In this example we'll show how GraphStorm 0.4 brings training and inference speedups of up to 3x. + +In this example, you will: + +1. Learn how to use SageMaker Pipelines with GraphStorm. +2. Understand how GraphBolt enhances GraphStorm's performance in distributed settings. +3. Follow a hands-on example of using GraphStorm with GraphBolt on Amazon SageMaker for distributed training. + +## Background: challenges of graph training + +Before diving into our hands-on example, it's important to understand some challenges associated with graph training, especially as graphs grow in size and complexity: + +1. Memory Constraints: As graphs grow larger, they may no longer fit into the memory of a single machine. A graph with 1B nodes with 512 features per node and 10B edges will require more than 4TB of memory to store, even with optimal representation. This necessitates distributed processing and more efficient graph representation. +2. Graph Sampling: In GNN mini-batch training, you need to sample neighbors for each node to propagate their representations. 
For multi-layer GNNs, this can lead to exponential growth in the number of nodes sampled, potentially visiting the entire graph to compute a single node's representation. Efficient sampling methods become necessary.
+3. Remote Data Access: When training on multiple machines, retrieving node features and sampling neighborhoods from other machines significantly impacts performance due to network latency. For example, reading a 1024-feature vector from main memory takes around 3μs, while reading that vector from a remote key/value store can take 50-100x longer.
+
+GraphStorm and GraphBolt help address these challenges through efficient graph representations, smart sampling techniques, and sophisticated partitioning algorithms like ParMETIS.
+
+## GraphBolt: pipeline-driven graph sampling
+
+GraphBolt is a new data loading and graph sampling framework developed by the [DGL](https://www.dgl.ai/) team. It streamlines the operations needed to sample efficiently from a heterogeneous graph and fetch the corresponding features.
+
+GraphBolt introduces a new, more compact graph structure representation for heterogeneous graphs, called fused Compressed Sparse Column (fCSC). This can reduce the memory cost of storing a heterogeneous graph by up to 56%, allowing users to fit larger graphs in memory and potentially use smaller, more cost-efficient instances for GNN model training.
+
+### Integration with GraphStorm
+
+GraphStorm 0.4.0 seamlessly integrates with GraphBolt, allowing users to leverage these performance improvements in their GNN workflows. This integration enables GraphStorm to handle larger graphs more efficiently and accelerate both training and inference.
+
+The integration of GraphBolt into GraphStorm's workflow means that users can now:
+
+1. Load and process larger graphs with fewer hardware resources.
+2. Achieve faster training and inference times with a more efficient graph sampling framework.
+3. Utilize GPU resources more effectively for graph learning.
+
+### Performance improvements
+
+Our benchmarks show significant improvements in both memory usage and training speed when using GraphStorm with GraphBolt:
+
+* We've observed up to 1.8x training speedup on the [ogbn-papers100M dataset](https://ogb.stanford.edu/docs/nodeprop/#ogbn-papers100M), which has 111M nodes and 3.2B edges.
+* At the same time, memory usage for storing the graph structure has been reduced by up to 56% for heterogeneous graphs like ogbn-papers100M.
+
+## Example model development lifecycle for GraphStorm on SageMaker
+
+Figure 1: GraphStorm SageMaker architecture.
+
+A common model development process is to perform model exploration locally on a subset of your full data and, once satisfied with the results, train the full-scale model. GraphStorm and SageMaker Pipelines allow you to do that by creating a model pipeline that you can execute locally to retrieve model metrics, and, when ready, execute on the full data to produce models, predictions, and graph embeddings for downstream tasks. In the next section you'll learn how to set up such pipelines for GraphStorm.
+
+## Set up environment for SageMaker distributed training
+
+You'll be using SageMaker Bring-Your-Own-Container (BYOC) to launch processing and training jobs. You need to create a PyTorch Docker image for distributed training; the same image is used to process and prepare the graph for training.
+You will use SageMaker Pipelines to automate the jobs needed for GNN training.
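+
+To give a sense of what such a pipeline looks like at the SDK level, here is a minimal, generic sketch using the SageMaker Python SDK. The role ARN, image URI, and script name below are placeholders; the actual GraphStorm pipeline in this example is generated for you by the `create_sm_pipeline.py` script used later, so treat this only as an illustration of the building blocks.
+
+```python
+from sagemaker.processing import ScriptProcessor
+from sagemaker.workflow.pipeline import Pipeline
+from sagemaker.workflow.steps import ProcessingStep
+
+# Placeholder values -- replace with your own role, image, and script
+role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"
+image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest"
+
+# A single processing step; the GraphStorm pipeline chains several such steps
+processor = ScriptProcessor(
+    image_uri=image_uri,
+    command=["python3"],
+    role=role,
+    instance_count=1,
+    instance_type="ml.m5.xlarge",
+)
+step = ProcessingStep(name="ExampleStep", processor=processor, code="my_script.py")
+
+pipeline = Pipeline(name="example-pipeline", steps=[step])
+# pipeline.upsert(role_arn=role)  # register or update the pipeline definition
+# pipeline.start()                # launch an execution
+```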
+As a prerequisite, you'll need to have access to a [SageMaker Domain](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html) in order to access [SageMaker Studio](https://aws.amazon.com/sagemaker-ai/studio/) and [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).
+
+### Create a SageMaker Domain
+
+To use SageMaker Studio you will need to have a SageMaker Domain available. If you don't have one already, follow the steps in the [quick setup](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html) to create one:
+
+1. Sign in to the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).
+2. Open the left navigation pane.
+3. Under **Admin configurations**, choose **Domains**.
+4. Choose **Create domain**.
+5. Choose **Set up for single user (Quick setup)**. Your domain and user profile are created automatically.
+
+### Set up appropriate roles to use with SageMaker Pipelines
+
+To set up SageMaker Pipelines you will need permissions to create ECR repositories, pull and push to them, pull from the AWS ECR Public Gallery, launch SageMaker jobs, manage SageMaker Pipelines, and interact with data on S3. We will create a role for Amazon EC2 on the AWS console, which also creates an associated instance profile to use with an EC2 instance.
+
+You will also need access to a SageMaker execution role that your jobs assume during execution. You can use the [Amazon SageMaker Role Manager](https://docs.aws.amazon.com/sagemaker/latest/dg/role-manager.html) to streamline the creation of the necessary roles.
+
+### Set up the pipeline management environment
+
+For this example you can either use your existing development environment or set up a new EC2 instance. If you plan to use a new instance to prepare the large-scale data for this example, ensure it has at least 300GB of disk space available.
+To set up an EC2 instance with the appropriate environment:
+
+1. Launch an EC2 instance:
+
+```bash
+# Use the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) DLAMI
+aws ec2 run-instances \
+    --image-id "ami-0907e5206d941612f" \
+    --instance-type "m6in.4xlarge" \
+    --key-name my-key-name \
+    --block-device-mappings '[{
+        "DeviceName": "/dev/sdf",
+        "Ebs": {
+            "VolumeSize": 300,
+            "VolumeType": "gp3",
+            "DeleteOnTermination": true
+        }
+    }]'
+```
+
+This command creates an instance using the "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04) 20241116" AMI, in the default VPC with the default security group. Make your instance accessible through SSH, using an appropriate security group or [AWS Systems Manager Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html), and log in to the instance. You can also use the [AWS Console](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/tutorial-launch-my-first-ec2-instance.html) to create a new EC2 instance.
+
+> NOTE: You may need to update the --image-id to the latest available. See https://docs.aws.amazon.com/dlami/latest/devguide/find-dlami-id.html for instructions.
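+
+If the AMI ID above is no longer available in your region, one way to look up the current DLAMI programmatically is sketched below using boto3. The name filter is an assumption based on the AMI name quoted above and may need adjusting if AWS renames the DLAMI series.
+
+```python
+import boto3
+
+ec2 = boto3.client("ec2", region_name="us-east-1")
+
+# The name pattern is an assumption based on the DLAMI name used above
+response = ec2.describe_images(
+    Owners=["amazon"],
+    Filters=[
+        {"Name": "name", "Values": ["Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4.1 (Ubuntu 22.04)*"]},
+        {"Name": "state", "Values": ["available"]},
+    ],
+)
+
+# Pick the most recently created matching image
+latest = max(response["Images"], key=lambda image: image["CreationDate"])
+print(latest["ImageId"], latest["Name"])
+```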
+
+Once logged in, set up your Python environment to run GraphStorm:
+
+```bash
+conda init
+eval $SHELL
+conda create -y --name gsf python=3.10
+conda activate gsf
+
+# Install dependencies
+pip install sagemaker boto3 ogb pyarrow
+
+# Clone the GraphStorm repository to access the example code
+git clone https://github.com/awslabs/graphstorm.git ~/graphstorm
+cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
+```
+
+### Download and prepare datasets
+
+In this example you will use two related datasets to demonstrate the scalability of GraphStorm. The Open Graph Benchmark (OGB) project hosts a number of graph datasets that can be used to benchmark the performance of graph learning systems. You will use two citation network datasets: the ogbn-arxiv dataset for a small-scale demo, and the ogbn-papers100M dataset for a demonstration of GraphStorm's large-scale learning capabilities.
+
+Because the two datasets have similar schemas and the same task (node classification), they allow us to emulate a typical data science pipeline, where we first do some model development and testing on a smaller dataset locally, and once ready launch SageMaker jobs to train on the full-scale data.
+
+#### Prepare the ogbn-arxiv dataset
+
+You'll download the smaller-scale [ogbn-arxiv](https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv) dataset to run a local test before launching larger-scale SageMaker jobs on AWS. This dataset has ~170K nodes and ~1.2M edges. You will use the following script to download the arxiv data and prepare it for GraphStorm.
+
+```bash
+# Provide the S3 bucket to use for output
+BUCKET_NAME=
+```
+
+You will use this script to directly download, transform, and upload the data to S3:
+
+```bash
+python convert_ogb_arxiv_to_gconstruct.py \
+    --output-prefix s3://$BUCKET_NAME/ogb-arxiv-input
+```
+
+This will create the tabular graph data on S3, which you can verify by running:
+
+```bash
+aws s3 ls s3://$BUCKET_NAME/ogb-arxiv-input/
+                           PRE edges/
+                           PRE nodes/
+                           PRE splits/
+2024-12-11 02:13:27       1269 gconstruct_config_arxiv.json
+```
+
+Finally, upload the GraphStorm configuration files for arxiv to use for training and inference:
+
+```bash
+# Upload the training configurations to S3
+aws s3 cp ~/graphstorm/training_scripts/gsgnn_np/arxiv_nc.yaml \
+    s3://$BUCKET_NAME/yaml/arxiv_nc_train.yaml
+aws s3 cp ~/graphstorm/inference_scripts/np_infer/arxiv_nc.yaml \
+    s3://$BUCKET_NAME/yaml/arxiv_nc_inference.yaml
+```
+
+#### Prepare the ogbn-papers100M dataset on SageMaker
+
+The papers-100M dataset is a large-scale graph dataset with 111M nodes and ~3.2B edges when we add reverse edges. The data size is ~57GB, so to make efficient use of our AWS resources we'll download and unzip the data in parallel using multiple threads, and upload it directly to S3. To do so we will use the [axel](https://github.com/axel-download-accelerator/axel) and [ripunzip](https://github.com/google/ripunzip/) libraries.
+
+You can either run this job as a SageMaker Processing job, or run the processing locally in the background while you work on building the GraphStorm Docker image and training a local model for the ogbn-arxiv dataset.
+
+To run this process as a SageMaker Processing step, follow the steps below. You can launch the job, let it execute in the background while you proceed through the rest of the steps, and come back to this dataset later.
+
+```bash
+# Navigate to the example code and ensure Docker is installed
+cd ~/graphstorm/examples/sagemaker-pipelines-graphbolt
+sudo apt update
+sudo apt install -y docker.io
+docker -v
+
+# Build and push a Docker image to download and process the papers100M data
+bash build_and_push_papers100M_image.sh
+# This creates an ECR repository at
+# $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/papers100m-processor
+
+# Run a SageMaker job to do the processing and upload the output to S3
+SAGEMAKER_EXECUTION_ROLE=
+ACCOUNT_ID=
+REGION=us-east-1
+python sagemaker_convert_papers100M.py \
+    --output-bucket $BUCKET_NAME \
+    --execution-role-arn $SAGEMAKER_EXECUTION_ROLE \
+    --region $REGION \
+    --instance-type ml.m5.4xlarge \
+    --image-uri $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/papers100m-processor
+```
+
+This will produce the processed data at `s3://$BUCKET_NAME/ogb-papers100M-input`, which can then be used as input to GraphStorm.
+
+#### [Optional] Prepare the ogbn-papers100M dataset locally
+
+If you prefer to pre-process the data locally, you can use the commands below on an Ubuntu 22.04 instance.
+
+```bash
+# Install axel for parallel downloads
+sudo apt update
+sudo apt -y install axel
+
+# Download and install ripunzip for parallel unzipping
+curl -L -O https://github.com/google/ripunzip/releases/download/v2.0.0/ripunzip_2.0.0-1_amd64.deb
+sudo apt install -y ./ripunzip_2.0.0-1_amd64.deb
+
+# Download and unzip the data using multiple threads; this will take 10-20 minutes
+mkdir ~/papers100M-raw-data
+cd ~/papers100M-raw-data
+axel -n 16 http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip
+ripunzip unzip-file papers100M-bin.zip
+ripunzip unzip-file papers100M-bin/raw/data.npz && rm papers100M-bin/raw/data.npz
+
+# Install process script dependencies
+python -m pip install \
+    numpy==1.26.4 \
+    psutil==6.1.0 \
+    pyarrow==18.1.0 \
+    tqdm==4.67.1 \
+    tqdm-loggable==0.2
+
+# Process and upload to S3; this will take around 20 minutes
+python convert_ogb_papers100M_to_gconstruct.py \
+    --input-dir ~/papers100M-raw-data \
+    --output-dir s3://$BUCKET_NAME/ogb-papers100M-input
+```
+
+### Build a GraphStorm Docker Image
+
+Next, you will build and push the GraphStorm PyTorch Docker image that you'll use to run the graph construction, training, and inference jobs. If you have the papers-100M data downloading in the background, open a new terminal to build and push the GraphStorm image.
+
+```bash
+# Ensure Docker is installed
+sudo apt update
+sudo apt install -y docker.io
+docker -v
+
+# Enter your account ID here
+ACCOUNT_ID=
+REGION=us-east-1
+
+cd ~/graphstorm
+
+bash ./docker/build_graphstorm_image.sh --environment sagemaker --device cpu
+
+bash docker/push_graphstorm_image.sh -e sagemaker -r $REGION -a $ACCOUNT_ID -d cpu
+# This will push an image to
+# ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/graphstorm:sagemaker-cpu
+
+# Install sagemaker with support for local mode
+pip install "sagemaker[local]"
+```
+
+Next, you will create a SageMaker Pipeline to run the jobs that are necessary to train GNN models with GraphStorm.
+
+## Create SageMaker Pipeline
+
+In this section, you will create a [SageMaker Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-overview.html) on Amazon SageMaker. The pipeline will run the following jobs in sequence:
+
+* Launch a GConstruct processing job. This prepares and partitions the data for distributed training.
+* Launch a GraphStorm training job. This trains the model and creates model output on S3.
+* Launch a GraphStorm inference job. This generates predictions and embeddings for every node in the input.
+
+```bash
+PIPELINE_NAME="ogbn-arxiv-gs-pipeline"
+BUCKET_NAME="my-s3-bucket"
+bash deploy_arxiv_pipeline.sh \
+    --account "" \
+    --bucket-name $BUCKET_NAME --role "" \
+    --pipeline-name $PIPELINE_NAME \
+    --use-graphbolt false
+```
+
+### Inspect pipeline
+
+Running the above will create a SageMaker Pipeline configured to run 3 SageMaker jobs in sequence:
+
+* A GConstruct job that converts the tabular file input to a binary partitioned graph on S3.
+* A GraphStorm training job that trains a node classification model and saves the model to S3.
+* A GraphStorm inference job that produces predictions for all nodes in the test set, and creates embeddings for all nodes.
+
+To review the pipeline, navigate to [SageMaker AI Studio](https://us-east-1.console.aws.amazon.com/sagemaker/home?region=us-east-1#/studio-landing) on the AWS Console, select the domain and user profile you used to create the pipeline in the drop-down menus on the top right, then select **Open Studio**.
+
+On the left navigation menu, select **Pipelines**. There should be a pipeline named **ogbn-arxiv-gs-pipeline**. Select that, which will take you to the **Executions** tab for the pipeline. Select **Graph** to view the pipeline steps.
+
+### Execute SageMaker pipeline locally for ogbn-arxiv
+
+The ogbn-arxiv data are small enough that you can execute the pipeline locally. Execute the following command to start a local execution of the pipeline:
+
+```bash
+PIPELINE_NAME="ogbn-arxiv-gs-pipeline"
+cd ~/graphstorm/sagemaker/pipeline
+python execute_sm_pipeline.py \
+    --pipeline-name $PIPELINE_NAME \
+    --region us-east-1 \
+    --local-execution | tee arxiv-local-logs.txt
+```
+
+Note that we save the log output to `arxiv-local-logs.txt`. We'll use that later to analyze the training speed.
+
+Once the pipeline finishes, it will print a message like:
+
+```
+Pipeline execution 655b9357-xxx-xxx-xxx-4fc691fcce94 SUCCEEDED
+```
+
+You can inspect its output on S3. Every pipeline execution will be stored under the prefix `s3://$BUCKET_NAME/pipelines-output/ogbn-arxiv-gs-pipeline/`.
+
+Executions that share the same input arguments are written under the same execution-identifying subpath, so the particular execution subpath might be different in your case.
+
+```bash
+aws s3 ls s3://$BUCKET_NAME/pipelines-output/ogbn-arxiv-gs-pipeline/
+
+# 761a4ff194198d49469a3bb223d5f26e
+
+# There should only be one execution subpath; copy that into a new env variable
+EXECUTION_SUBPATH="761a4ff194198d49469a3bb223d5f26e"
+aws s3 ls --recursive \
+    s3://$BUCKET_NAME/pipelines-output/ogbn-arxiv-gs-pipeline/$EXECUTION_SUBPATH
+
+# gconstruct:
+# data_transform_new.json edge_label_stats.json edge_mapping.pt node_label_stats.json node_mapping.pt ogbn-arxiv.json part0 part1
+
+# inference:
+# embeddings predictions
+
+# model:
+# epoch-0 epoch-1 epoch-2 epoch-3 epoch-4 epoch-5 epoch-6 epoch-7 epoch-8 epoch-9
+```
+
+You'll be able to see the output of each step in the pipeline. The GConstruct job created the partitioned graph, the training job saved a model checkpoint for each of the 10 epochs, and the inference job created embeddings for all nodes and predictions for the nodes in the test set.
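+
+If you want to inspect the inference outputs themselves, you can copy the prediction files locally and peek at them. The sketch below assumes the predictions are written as Parquet files under the `inference/predictions` prefix; check the actual file listing for your GraphStorm version, since the layout and file format may differ.
+
+```python
+import pyarrow.dataset as ds
+
+# Download the predictions locally first, for example:
+#   aws s3 sync \
+#     s3://$BUCKET_NAME/pipelines-output/ogbn-arxiv-gs-pipeline/$EXECUTION_SUBPATH/inference/predictions \
+#     ./predictions
+table = ds.dataset("./predictions", format="parquet").to_table()
+print(table.schema)
+print(table.num_rows, "rows")
+print(table.slice(0, 5).to_pydict())
+```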
+
+You can inspect the mean epoch and evaluation time using the provided `analyze_training_time.py` script and the log file you created:
+
+```bash
+python analyze_training_time.py --log-file arxiv-local-logs.txt
+
+Reading logs from file: arxiv-local-logs.txt
+
+=== Training Epochs Summary ===
+Total epochs completed: 10
+Average epoch time: 7.43 seconds
+
+=== Evaluation Summary ===
+Total evaluations: 11
+Average evaluation time: 2.25 seconds
+```
+
+Note that these numbers will vary depending on your instance type.
+
+### Create GraphBolt Pipeline
+
+Now that you have established a baseline for performance, you can create another pipeline that uses the GraphBolt graph representation to compare the performance.
+
+You can use the same pipeline creation script, but change two arguments: provide a new pipeline name, and set `--use-graphbolt` to `true`.
+
+```bash
+# Deploy the GraphBolt-enabled pipeline
+PIPELINE_NAME="ogbn-arxiv-gs-graphbolt-pipeline"
+BUCKET_NAME="my-s3-bucket"
+bash deploy_arxiv_pipeline.sh \
+    --account "" \
+    --bucket-name $BUCKET_NAME --role "" \
+    --pipeline-name $PIPELINE_NAME \
+    --use-graphbolt true
+# Execute the pipeline locally
+python execute_sm_pipeline.py \
+    --pipeline-name $PIPELINE_NAME \
+    --region us-east-1 \
+    --local-execution | tee arxiv-local-gb-logs.txt
+```
+
+Analyzing the training logs, you can see that the per-epoch time has dropped somewhat:
+
+```bash
+python analyze_training_time.py --log-file arxiv-local-gb-logs.txt
+
+Reading logs from file: arxiv-local-gb-logs.txt
+
+=== Training Epochs Summary ===
+Total epochs completed: 10
+Average epoch time: 6.83 seconds
+
+=== Evaluation Summary ===
+Total evaluations: 11
+Average evaluation time: 1.99 seconds
+```
+
+For such a small graph the performance gains are modest, a reduction of roughly 8% in per-epoch time. Moving on to larger data, however, the potential gains are much greater. In the next section you will create a pipeline and train a model for `papers-100M`, a citation graph with 111M nodes and 3.2B edges.
+
+## Create SageMaker Pipeline for distributed training
+
+Once the papers-100M data have finished processing and exist on S3, either through your local job or the SageMaker Processing job, you can set up a pipeline to train a model on that dataset.
+
+### Build the GraphStorm GPU image
+
+For this job you will use large GPU instances, so you will build and push the GPU image this time:
+
+```bash
+cd ~/graphstorm
+
+bash ./docker/build_graphstorm_image.sh --environment sagemaker --device gpu
+
+bash docker/push_graphstorm_image.sh -e sagemaker -r $REGION -a $ACCOUNT_ID -d gpu
+```
+
+### Deploy and execute pipelines for papers-100M
+
+Before you deploy your new pipeline, upload the training YAML configuration for papers-100M to S3, using the key the deploy script expects:
+
+```bash
+aws s3 cp \
+    ~/graphstorm/training_scripts/gsgnn_np/papers_100M_nc.yaml \
+    s3://$BUCKET_NAME/yaml/papers100M_nc.yaml
+```
+
+Now you are ready to deploy your initial pipeline for papers-100M:
+
+```bash
+PIPELINE_NAME="ogb-papers100M-pipeline"
+bash deploy_papers100M_pipeline.sh \
+    --account \
+    --bucket-name --role \
+    --pipeline-name $PIPELINE_NAME \
+    --use-graphbolt false
+```
+
+Execute the pipeline and let it run in the background.
+
+```bash
+python execute_sm_pipeline.py \
+    --pipeline-name $PIPELINE_NAME \
+    --region us-east-1 \
+    --async-execution
+```
+
+> Note that your account needs to meet the required quotas for the requested instances. Here the defaults are set to four `ml.g5.48xlarge` instances for training jobs and one `ml.r5.24xlarge` instance for the graph construction processing job.
+> To adjust your SageMaker service quotas you can use the [Service Quotas console UI](https://us-east-1.console.aws.amazon.com/servicequotas/home/services/sagemaker/quotas). To run both pipelines in parallel you will need quota for eight `ml.g5.48xlarge` training instances and two `ml.r5.24xlarge` processing instances.
+
+Next, you can deploy and execute another pipeline, now with GraphBolt enabled:
+
+```bash
+PIPELINE_NAME="ogb-papers100M-graphbolt-pipeline"
+bash deploy_papers100M_pipeline.sh \
+    --account \
+    --bucket-name --role \
+    --pipeline-name $PIPELINE_NAME \
+    --use-graphbolt true
+
+# Execute the GraphBolt-enabled pipeline on SageMaker
+python execute_sm_pipeline.py \
+    --pipeline-name $PIPELINE_NAME \
+    --region us-east-1 \
+    --async-execution
+```
+
+### Compare performance for GraphBolt-enabled training
+
+Once both pipelines have finished executing, which should take approximately 4 hours, you can compare the training times for both cases. To do so, you will need to find the pipeline execution display names for the two papers-100M pipelines.
+
+The easiest way to do so is through the Studio pipeline interface. On the Pipelines page you visited previously, there should be two new pipelines named **ogb-papers100M-pipeline** and **ogb-papers100M-graphbolt-pipeline**. Select **ogb-papers100M-pipeline**, which will take you to the **Executions** tab for the pipeline. Copy the name of the latest successful execution and use that to run the training analysis script:
+
+```bash
+python analyze_training_time.py \
+    --pipeline-name ogb-papers100M-pipeline \
+    --execution-name execution-1734404366941
+```
+
+Your output will look like:
+
+```bash
+=== Training Epochs Summary ===
+Total epochs completed: 15
+Average epoch time: 73.95 seconds
+
+=== Evaluation Summary ===
+Total evaluations: 15
+Average evaluation time: 15.07 seconds
+```
+
+Now do the same for the GraphBolt-enabled pipeline:
+
+```bash
+python analyze_training_time.py \
+    --pipeline-name ogb-papers100M-graphbolt-pipeline \
+    --execution-name execution-1734463209078
+```
+
+You will see the improved per-epoch and evaluation times:
+
+```bash
+=== Training Epochs Summary ===
+Total epochs completed: 15
+Average epoch time: 54.54 seconds
+
+=== Evaluation Summary ===
+Total evaluations: 15
+Average evaluation time: 4.13 seconds
+```
+
+Without loss in accuracy, the latest version of GraphStorm achieved a **~1.4x speedup per epoch and a ~3.6x speedup in evaluation time!**
+
+## Conclusion: Accelerate Your Graph ML with GraphStorm
+
+This example showcased how GraphStorm 0.4, integrated with DGL-GraphBolt, significantly speeds up large-scale graph neural network training and inference.
+
+We encourage ML practitioners working with large graph data to try GraphStorm. Its low-code interface simplifies building, training, and deploying graph ML solutions on AWS, allowing you to focus on modeling rather than infrastructure.
+
+To get started, visit the GraphStorm [documentation](https://graphstorm.readthedocs.io/en/) and the GraphStorm [GitHub repository](https://github.com/awslabs/graphstorm).
diff --git a/examples/sagemaker-pipelines-graphbolt/analyze_training_time.py b/examples/sagemaker-pipelines-graphbolt/analyze_training_time.py index c17674b825..f477ecb5ad 100644 --- a/examples/sagemaker-pipelines-graphbolt/analyze_training_time.py +++ b/examples/sagemaker-pipelines-graphbolt/analyze_training_time.py @@ -259,17 +259,9 @@ def print_training_summary( if epochs_data: total_epochs = len(epochs_data) avg_time = sum(e["time"] for e in epochs_data) / total_epochs - min_time = min(epochs_data, key=lambda x: x["time"]) - max_time = max(epochs_data, key=lambda x: x["time"]) print(f"Total epochs completed: {total_epochs}") print(f"Average epoch time: {avg_time:.2f} seconds") - print( - f"Fastest epoch: Epoch {min_time['epoch']} ({min_time['time']:.2f} seconds)" - ) - print( - f"Slowest epoch: Epoch {max_time['epoch']} ({max_time['time']:.2f} seconds)" - ) if verbose: print("\nEpoch Details:") @@ -283,17 +275,9 @@ def print_training_summary( if eval_data: total_evals = len(eval_data) avg_eval_time = sum(e["time"] for e in eval_data) / total_evals - min_eval = min(eval_data, key=lambda x: x["time"]) - max_eval = max(eval_data, key=lambda x: x["time"]) print(f"Total evaluations: {total_evals}") print(f"Average evaluation time: {avg_eval_time:.2f} seconds") - print( - f"Fastest evaluation: Step {min_eval['step']} ({min_eval['time']:.2f} seconds)" - ) - print( - f"Slowest evaluation: Step {max_eval['step']} ({max_eval['time']:.2f} seconds)" - ) if verbose: print("\nEvaluation Details:") diff --git a/examples/sagemaker-pipelines-graphbolt/build_and_push_papers100M_image.sh b/examples/sagemaker-pipelines-graphbolt/build_and_push_papers100M_image.sh index 4c6fadaee3..fceb5dadef 100644 --- a/examples/sagemaker-pipelines-graphbolt/build_and_push_papers100M_image.sh +++ b/examples/sagemaker-pipelines-graphbolt/build_and_push_papers100M_image.sh @@ -9,11 +9,49 @@ cleanup() { } -ACCOUNT=$(aws sts get-caller-identity --query Account --output text) -REGION=$(aws configure get region) -REGION=${REGION:-us-east-1} +die() { + local msg=$1 + local code=${2-1} # default exit status 1 + msg "$msg" + exit "$code" +} + +parse_params() { + # default values of variables set from params + ACCOUNT=$(aws sts get-caller-identity --query Account --output text || true) + REGION=$(aws configure get region || true) + REGION=${REGION:-"us-east-1"} + + while :; do + case "${1-}" in + -h | --help) usage ;; + -x | --verbose) set -x ;; + -a | --account) + ACCOUNT="${2-}" + shift + ;; + -r | --region) + REGION="${2-}" + shift + ;; + -?*) die "Unknown option: $1" ;; + *) break ;; + esac + shift + done + + # check required params and arguments + [[ -z "${ACCOUNT-}" ]] && die "Missing required parameter: -a/--account " + [[ -z "${REGION-}" ]] && die "Missing required parameter: -r/--region " + + return 0 +} + +parse_params "$@" + IMAGE=papers100m-processor +# Download ripunzip to copy to image curl -L -O https://github.com/google/ripunzip/releases/download/v2.0.0/ripunzip_2.0.0-1_amd64.deb # Auth to AWS public ECR gallery @@ -22,7 +60,6 @@ aws ecr-public get-login-password --region $REGION | docker login --username AWS # Build and tag image docker build -f Dockerfile.processing -t $IMAGE . - # Create repository if it doesn't exist echo "Getting or creating container repository: $IMAGE" if ! 
$(aws ecr describe-repositories --repository-names $IMAGE --region ${REGION} > /dev/null 2>&1); then
diff --git a/examples/sagemaker-pipelines-graphbolt/deploy_arxiv_pipeline.sh b/examples/sagemaker-pipelines-graphbolt/deploy_arxiv_pipeline.sh
new file mode 100644
index 0000000000..e43b4f4335
--- /dev/null
+++ b/examples/sagemaker-pipelines-graphbolt/deploy_arxiv_pipeline.sh
@@ -0,0 +1,129 @@
+#!/usr/bin/env bash
+set -euox pipefail
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd -P)
+
+msg() {
+    echo >&2 -e "${1-}"
+}
+
+die() {
+    local msg=$1
+    local code=${2-1} # default exit status 1
+    msg "$msg"
+    exit "$code"
+}
+
+parse_params() {
+    # default values of variables set from params
+    ACCOUNT=$(aws sts get-caller-identity --query Account --output text || true)
+    REGION=$(aws configure get region || true)
+    REGION=${REGION:-"us-east-1"}
+    PIPELINE_NAME=""
+
+    while :; do
+        case "${1-}" in
+        -h | --help) usage ;;
+        -x | --verbose) set -x ;;
+        -r | --role)
+            ROLE="${2-}"
+            shift
+            ;;
+        -a | --account)
+            ACCOUNT="${2-}"
+            shift
+            ;;
+        -b | --bucket | --bucket-name)
+            BUCKET_NAME="${2-}"
+            shift
+            ;;
+        -n | --pipeline-name)
+            PIPELINE_NAME="${2-}"
+            shift
+            ;;
+        -g | --use-graphbolt)
+            USE_GRAPHBOLT="${2-}"
+            shift
+            ;;
+        -?*) die "Unknown option: $1" ;;
+        *) break ;;
+        esac
+        shift
+    done
+
+    # check required params and arguments
+    [[ -z "${ACCOUNT-}" ]] && die "Missing required parameter: -a/--account "
+    [[ -z "${BUCKET_NAME-}" ]] && die "Missing required parameter: -b/--bucket-name "
+    [[ -z "${ROLE-}" ]] && die "Missing required parameter: -r/--role "
+    [[ -z "${USE_GRAPHBOLT-}" ]] && die "Missing required parameter: -g/--use-graphbolt "
+
+    return 0
+}
+
+cleanup() {
+    trap - SIGINT SIGTERM ERR EXIT
+    # script cleanup here
+}
+
+parse_params "$@"
+
+DATASET_S3_PATH="s3://${BUCKET_NAME}/ogb-arxiv-input"
+OUTPUT_PATH="s3://${BUCKET_NAME}/pipelines-output"
+GRAPH_NAME="ogbn-arxiv"
+INSTANCE_COUNT="2"
+REGION="us-east-1"
+NUM_TRAINERS=4
+
+PARTITION_OUTPUT_JSON="$GRAPH_NAME.json"
+PARTITION_ALGORITHM="metis"
+GCONSTRUCT_INSTANCE="ml.r5.4xlarge"
+GCONSTRUCT_CONFIG="gconstruct_config_arxiv.json"
+
+TRAIN_CPU_INSTANCE="ml.m5.4xlarge"
+TRAIN_YAML_S3="s3://$BUCKET_NAME/yaml/arxiv_nc_train.yaml"
+INFERENCE_YAML_S3="s3://$BUCKET_NAME/yaml/arxiv_nc_inference.yaml"
+
+TASK_TYPE="node_classification"
+INFERENCE_MODEL_SNAPSHOT="epoch-9"
+
+JOBS_TO_RUN="gconstruct train inference"
+GSF_CPU_IMAGE_URI=${ACCOUNT}.dkr.ecr.$REGION.amazonaws.com/graphstorm:sagemaker-cpu
+GSF_GPU_IMAGE_URI=${ACCOUNT}.dkr.ecr.$REGION.amazonaws.com/graphstorm:sagemaker-gpu
+VOLUME_SIZE=50
+
+if [[ -z "${PIPELINE_NAME-}" ]]; then
+    if [[ $USE_GRAPHBOLT == "true" ]]; then
+        PIPELINE_NAME="ogbn-arxiv-gs-graphbolt-pipeline"
+    else
+        PIPELINE_NAME="ogbn-arxiv-gs-pipeline"
+    fi
+fi
+
+python3 $SCRIPT_DIR/../../sagemaker/pipeline/create_sm_pipeline.py \
+    --cpu-instance-type ${TRAIN_CPU_INSTANCE} \
+    --graph-construction-args "--num-processes 8" \
+    --graph-construction-instance-type ${GCONSTRUCT_INSTANCE} \
+    --graph-construction-config-filename ${GCONSTRUCT_CONFIG} \
+    --graph-name ${GRAPH_NAME} \
+    --graphstorm-pytorch-cpu-image-url "${GSF_CPU_IMAGE_URI}" \
+    --graphstorm-pytorch-gpu-image-url "${GSF_GPU_IMAGE_URI}" \
+    --inference-model-snapshot "${INFERENCE_MODEL_SNAPSHOT}" \
+    --inference-yaml-s3 ${INFERENCE_YAML_S3} \
+    --input-data-s3 ${DATASET_S3_PATH} \
+    --instance-count ${INSTANCE_COUNT} \
+    --jobs-to-run ${JOBS_TO_RUN} \
+    --num-trainers ${NUM_TRAINERS} \
+    --output-prefix-s3 "${OUTPUT_PATH}" \
+    --pipeline-name
"${PIPELINE_NAME}" \ + --partition-output-json ${PARTITION_OUTPUT_JSON} \ + --partition-algorithm ${PARTITION_ALGORITHM} \ + --region ${REGION} \ + --role "${ROLE}" \ + --train-on-cpu \ + --train-inference-task ${TASK_TYPE} \ + --train-yaml-s3 "${TRAIN_YAML_S3}" \ + --save-embeddings \ + --save-predictions \ + --volume-size-gb ${VOLUME_SIZE} \ + --use-graphbolt "${USE_GRAPHBOLT}" diff --git a/examples/sagemaker-pipelines-graphbolt/deploy_papers100M_pipeline.sh b/examples/sagemaker-pipelines-graphbolt/deploy_papers100M_pipeline.sh new file mode 100644 index 0000000000..d85a2edd1c --- /dev/null +++ b/examples/sagemaker-pipelines-graphbolt/deploy_papers100M_pipeline.sh @@ -0,0 +1,139 @@ +#!/bin/env bash +set -euox pipefail + +SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd -P) + + +msg() { + echo >&2 -e "${1-}" +} + +die() { + local msg=$1 + local code=${2-1} # default exit status 1 + msg "$msg" + exit "$code" +} + +parse_params() { + # default values of variables set from params + ACCOUNT=$(aws sts get-caller-identity --query Account --output text || true) + REGION=$(aws configure get region || true) + REGION=${REGION:-"us-east-1"} + PIPELINE_NAME="" + + + while :; do + case "${1-}" in + -h | --help) usage ;; + -x | --verbose) set -x ;; + -r | --role) + ROLE="${2-}" + shift + ;; + -a | --account) + ACCOUNT="${2-}" + shift + ;; + -b | --bucket) + BUCKET_NAME="${2-}" + shift + ;; + -n | --pipeline-name) + PIPELINE_NAME="${2-}" + shift + ;; + -g | --use-graphbolt) + USE_GRAPHBOLT="${2-}" + shift + ;; + -?*) die "Unknown option: $1" ;; + *) break ;; + esac + shift + done + + # check required params and arguments + [[ -z "${ACCOUNT-}" ]] && die "Missing required parameter: -a/--account " + [[ -z "${BUCKET-}" ]] && die "Missing required parameter: -b/--bucket " + [[ -z "${ROLE-}" ]] && die "Missing required parameter: -r/--role " + [[ -z "${USE_GRAPHBOLT-}" ]] && die "Missing required parameter: -g/--use-graphbolt " + + return 0 +} + +cleanup() { + trap - SIGINT SIGTERM ERR EXIT + # script cleanup here +} + +parse_params "$@" + +if [[ ${USE_GRAPHBOLT} == "true" || ${USE_GRAPHBOLT} == "false" ]]; then + : # Do nothing +else + die "-g/--use-graphbolt parameter needs to be one of 'true' or 'false', got ${USE_GRAPHBOLT}" +fi + + +JOBS_TO_RUN="gconstruct train inference" + +OUTPUT_PATH="s3://${BUCKET_NAME}/pipelines-output" +GRAPH_NAME="papers-100M" +INSTANCE_COUNT="4" + +CPU_INSTANCE_TYPE="ml.r5.24xlarge" +TRAIN_GPU_INSTANCE="ml.g5.48xlarge" +GCONSTRUCT_INSTANCE="ml.r5.24xlarge" +NUM_TRAINERS=8 + +GSF_CPU_IMAGE_URI=${ACCOUNT}.dkr.ecr.$REGION.amazonaws.com/graphstorm:sagemaker-cpu +GSF_GPU_IMAGE_URI=${ACCOUNT}.dkr.ecr.$REGION.amazonaws.com/graphstorm:sagemaker-gpu + +GCONSTRUCT_CONFIG="gconstruct_config_papers100m.json" +GRAPH_CONSTRUCTION_ARGS="--add-reverse-edges False --num-processes 16" + +PARTITION_OUTPUT_JSON="metadata.json" +PARTITION_OUTPUT_JSON="$GRAPH_NAME.json" +PARTITION_ALGORITHM="metis" +TRAIN_YAML_S3="s3://$BUCKET_NAME/yaml/papers100M_nc.yaml" +INFERENCE_YAML_S3="s3://$BUCKET_NAME/yaml/papers100M_nc.yaml" +TASK_TYPE="node_classification" +INFERENCE_MODEL_SNAPSHOT="epoch-14" +VOLUME_SIZE=400 + +if [[ -z "${PIPELINE_NAME-}" ]]; then + if [[ $USE_GRAPHBOLT == "true" ]]; then + PIPELINE_NAME="papers100M-gs-graphbolt-pipeline" + else + PIPELINE_NAME="papers100M-gs-pipeline" + fi +fi + +python3 $SCRIPT_DIR/../../sagemaker/pipeline/create_sm_pipeline.py \ + --execution-role "${ROLE}" \ + --cpu-instance-type ${CPU_INSTANCE_TYPE} \ + --gpu-instance-type ${TRAIN_GPU_INSTANCE} \ 
+ --graph-construction-args "${GRAPH_CONSTRUCTION_ARGS}" \ + --graph-construction-instance-type ${GCONSTRUCT_INSTANCE} \ + --graph-construction-config-filename ${GCONSTRUCT_CONFIG} \ + --graph-name ${GRAPH_NAME} \ + --graphstorm-pytorch-cpu-image-url "${GSF_CPU_IMAGE_URI}" \ + --graphstorm-pytorch-gpu-image-url "${GSF_GPU_IMAGE_URI}" \ + --inference-model-snapshot "${INFERENCE_MODEL_SNAPSHOT}" \ + --inference-yaml-s3 "${INFERENCE_YAML_S3}" \ + --input-data-s3 "${DATASET_S3_PATH}" \ + --instance-count ${INSTANCE_COUNT} \ + --jobs-to-run "${JOBS_TO_RUN}" \ + --num-trainers ${NUM_TRAINERS} \ + --output-prefix-s3 "${OUTPUT_PATH}" \ + --pipeline-name "${PIPELINE_NAME}" \ + --partition-output-json ${PARTITION_OUTPUT_JSON} \ + --partition-algorithm ${PARTITION_ALGORITHM} \ + --region ${REGION} \ + --train-inference-task ${TASK_TYPE} \ + --train-yaml-s3 "${TRAIN_YAML_S3}" \ + --save-embeddings \ + --save-predictions \ + --volume-size-gb ${VOLUME_SIZE} \ + --use-graphbolt ${USE_GRAPHBOLT}