SageMaker training job fails in production after working for some time #4221

tom-whitehead · 2023-10-23T10:03:36Z

Hello,

We have SageMaker training jobs that run tens of times a day in production as part of a pipeline. They work absolutely fine for exactly 9 days each time, and then they all start failing. I have to manually redeploy the pipeline to get the jobs to succeed again, until the error repeats.

In the logs, I get the following boto3 exception:

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Which is shortly preceded by the following stack trace:

Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_sklearn_container/training.py", line 31, in train
    entry_point.run(uri=training_environment.module_dir,
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_training/entry_point.py", line 92, in run
    files.download_and_extract(uri=uri, path=environment.code_dir)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_training/files.py", line 138, in download_and_extract
    s3_download(uri, dst)
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_training/files.py", line 174, in s3_download
    s3.Bucket(bucket).download_file(key, dst)
  File "/miniconda3/lib/python3.8/site-packages/boto3/s3/inject.py", line 277, in bucket_download_file
    return self.meta.client.download_file(
  File "/miniconda3/lib/python3.8/site-packages/boto3/s3/inject.py", line 190, in download_file
    return transfer.download_file(
  File "/miniconda3/lib/python3.8/site-packages/boto3/s3/transfer.py", line 320, in download_file
    future.result()
  File "/miniconda3/lib/python3.8/site-packages/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "/miniconda3/lib/python3.8/site-packages/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "/miniconda3/lib/python3.8/site-packages/s3transfer/tasks.py", line 269, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "/miniconda3/lib/python3.8/site-packages/s3transfer/download.py", line 354, in _submit
    response = client.head_object(
  File "/miniconda3/lib/python3.8/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/miniconda3/lib/python3.8/site-packages/botocore/client.py", line 915, in _make_api_call
    raise error_class(parsed_response, operation_name)

It seems that SageMaker is trying to download our training code into the container, but the files are not found in S3 where they're supposed to be. I'm not sure how they can be there for a week or so and then suddenly not. Given that the failure doesn't occur in any of our code, it's very difficult to debug, as we cannot add log lines to figure out what is happening.
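
Since the failing call is a HeadObject on the code archive, one check that can be run outside the container is whether that object still exists in S3. Below is a minimal sketch, assuming the archive's S3 URI is already known. The helper names `split_s3_uri` and `s3_object_exists` are illustrative and not part of SageMaker or boto3, and the client is passed in so that any object with a boto3-style `head_object` method works.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def s3_object_exists(s3_client, uri):
    """Return True if HeadObject succeeds, False if S3 answers 'not found'.

    `s3_client` is expected to behave like boto3.client("s3").
    Any error other than 'not found' is re-raised unchanged.
    """
    bucket, key = split_s3_uri(uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
    except Exception as exc:
        # botocore's ClientError exposes the parsed error on `.response`;
        # for HeadObject a missing key is reported with code "404".
        response = getattr(exc, "response", None) or {}
        code = response.get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise
    return True
```

With real credentials this could be called as `s3_object_exists(boto3.client("s3"), "s3://my-bucket/path/to/sourcedir.tar.gz")`, where the URI is a hypothetical placeholder. Running it on a schedule would at least show when the archive disappears relative to the last deployment.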

Any help greatly appreciated! Thank you.

