graphstorm images utilisation in SageMaker TuningJobs #1072

Open
milianru opened this issue Oct 21, 2024 · 6 comments · May be fixed by #1133
Labels: 0.4.1, help wanted, sagemaker

Comments

@milianru

Hello,

I ran into colliding model artifacts when using the GraphStorm training image in a SageMaker TuningJob.

Are the GraphStorm Docker images generally intended for use in SageMaker TuningJobs? If so, is there an example of how to organize the model artifacts of each training job?

When we constructed a HyperparameterTuner with a PyTorch estimator using the training images, the result was unexpected at first. All training jobs wrote their artifacts directly to --model-artifact-s3, without creating a separate S3 prefix for each training run, as I knew it from other training images.

After digging into the code and the AWS documentation, I found that SageMaker uploads everything stored under /opt/ml/model to the output prefix specified on the estimator object. In contrast to this common interface, the GraphStorm container implements its own upload logic controlled by --model-artifact-s3, which works perfectly fine for a single training job but results in model artifact collisions in a TuningJob.
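
To make the collision concrete, here is a minimal sketch comparing the two layouts. The bucket, prefixes, job names, and function names are made up for illustration; the per-job layout follows SageMaker's default of uploading the contents of /opt/ml/model to <output_path>/<job-name>/output/model.tar.gz.

```python
def sagemaker_default_key(output_path: str, job_name: str) -> str:
    """S3 location SageMaker uses by default for the packed /opt/ml/model contents."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


def fixed_artifact_prefix(model_artifact_s3: str, job_name: str) -> str:
    """A fixed --model-artifact-s3 value does not depend on the job name."""
    return model_artifact_s3.rstrip("/")


# Hypothetical job names, as a tuning job might generate them.
jobs = ["tuning-001-aaaa", "tuning-002-bbbb"]

per_job = {sagemaker_default_key("s3://my-bucket/out", j) for j in jobs}
shared = {fixed_artifact_prefix("s3://my-bucket/models", j) for j in jobs}

print(len(per_job), "distinct locations with the SageMaker default layout")
print(len(shared), "distinct location(s) with a fixed --model-artifact-s3")
```

With the default layout each job gets its own location, while the fixed prefix is shared by every job in the tuning run.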

We found a way to work around the issue by creating a custom training entry script. It basically extends the regular train_entry.py with a final copy of the training artifacts to /opt/ml/model. Having the model artifacts stored there on the container allowed the TuningJob to organize all training artifacts successfully.
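
A minimal sketch of such a final copy step could look like this. It is an illustration, not the actual script: the function name is hypothetical, the default source path /tmp/gsgnn_model is the local directory used in sagemaker_train.py, and SM_MODEL_DIR is the environment variable the SageMaker training toolkit sets to the model directory (normally /opt/ml/model).

```python
import os
import shutil
from typing import Optional


def copy_artifacts_to_model_dir(local_model_dir: str = "/tmp/gsgnn_model",
                                sm_model_dir: Optional[str] = None) -> str:
    """Copy everything under local_model_dir into SageMaker's model directory.

    SageMaker packs the contents of that directory and uploads them to the
    estimator's output path once the training job finishes.
    """
    # Fall back to the conventional path if SM_MODEL_DIR is not set.
    target = sm_model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    os.makedirs(target, exist_ok=True)
    # dirs_exist_ok (Python 3.8+) allows copying into an existing directory.
    shutil.copytree(local_model_dir, target, dirs_exist_ok=True)
    return target
```

Calling this at the very end of the entry script, after training has finished, would hand the artifacts over to SageMaker's own upload. For very large models this upload may be slow.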

Is there a particular reason why graphstorm/sagemaker/sagemaker_train.py stores the model under /tmp/gsgnn_model/ rather than /opt/ml/model (here)? If so, would it be possible to add a new argument for the local storage directory, or something similar?

In case this is something that you don't have on your agenda, but you see the purpose, I could also assist with a PR.

@classicsong added the "help wanted" label Oct 23, 2024
@thvasilo
Contributor

Hi @milianru, I think this is actually a shortcoming of our SageMaker training output implementation.

If you're willing to contribute a fix I can shepherd the PR and try to get it merged for the next release.

@classicsong
Contributor

classicsong commented Oct 23, 2024

Hi @milianru, regarding where GraphStorm stores the model artifacts: we use /tmp/gsgnn_model/ because GraphStorm uploads the model to an S3 bucket (defined by --model-artifact-s3) by itself. GraphStorm does not rely on SageMaker to upload the model artifact, as the artifact can sometimes be very large.

You are free to adjust the training entry script if it works :).
Another option is to check whether HyperparameterTuner can automatically generate a distinct --model-artifact-s3 argument for each job.

@milianru
Author

Hi @thvasilo,
thanks for your response. I would have liked to contribute, but I am not allowed to participate in open source using my company's cloud resources. Even though the change would be small, I cannot execute an end-to-end test run.

@milianru
Author

Hi @classicsong,
Modifying --model-artifact-s3 during hyperparameter tuning was also something I had in mind while looking for a good workaround for us.
However, I did not find any documentation on how this can be achieved.
But maybe it is worth spending some time researching options (e.g. counting up a ParameterRange or similar).

@classicsong
Contributor

classicsong commented Nov 13, 2024

> Hi @thvasilo, thanks for your response. I would have liked to contribute, but I am not allowed to participate in open source using my company's cloud resources. Even though the change would be small, I cannot execute an end-to-end test run.

Hi @milianru,
Does your company have an AWS SA to support your work? The SA may be able to help you find a SageMaker expert.
The SA can also help create a ticket for the GraphStorm team.

@classicsong
Contributor

> Hi @classicsong, modifying --model-artifact-s3 during hyperparameter tuning was also something I had in mind while looking for a good workaround for us. However, I did not find any documentation on how this can be achieved. But maybe it is worth spending some time researching options (e.g. counting up a ParameterRange or similar).

Another workaround is to change sagemaker_train.py to customize the S3 path at
https://github.com/awslabs/graphstorm/blob/main/python/graphstorm/sagemaker/sagemaker_train.py#L232.
You can add the SageMaker job name to the S3 path.

The SageMaker job name is available in train_env (https://github.com/awslabs/graphstorm/blob/main/python/graphstorm/sagemaker/sagemaker_train.py#L178).
Here is the doc for the training environment parameter:
https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_training_env. There is a field called "job_name".
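
A rough sketch of that idea, assuming the SM_TRAINING_ENV environment variable holds the JSON described in the doc linked above. The helper name job_scoped_s3_path is hypothetical and not part of GraphStorm; how the result is wired into sagemaker_train.py is left open here.

```python
import json
import os
from typing import Optional


def job_scoped_s3_path(model_artifact_s3: str,
                       training_env_json: Optional[str] = None) -> str:
    """Append the SageMaker job name to the S3 prefix so that each job
    of a tuning run writes to its own location."""
    # Read SM_TRAINING_ENV unless a JSON string is passed in explicitly.
    raw = training_env_json
    if raw is None:
        raw = os.environ.get("SM_TRAINING_ENV", "{}")
    job_name = json.loads(raw).get("job_name")
    base = model_artifact_s3.rstrip("/")
    # Without a job name (e.g. outside SageMaker), keep the prefix unchanged.
    return f"{base}/{job_name}" if job_name else base


# Example with a made-up bucket and job name:
print(job_scoped_s3_path("s3://my-bucket/models/", '{"job_name": "tuning-001-aaaa"}'))
```

Since every training job in a tuning run has its own job name, the resulting prefixes should not collide.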

3 participants