SageMaker training job fails in production after working for some time #4221
Unanswered
tom-whitehead
asked this question in
Help
Replies: 0 comments
Hello,
We have SageMaker training jobs that run tens of times a day in production as part of a pipeline. They work absolutely fine for exactly 9 days, every time, and then they all start failing. I have to manually redeploy the pipeline to get the jobs to succeed again, until the error repeats.
In the logs, I get the following boto3 exception:
This is shortly preceded by the following stack trace:
It seems that SageMaker is trying to download our training code into the container, but the files are not found in S3 where they're supposed to be. I'm not sure how they can be there for a week or so and then suddenly not. Because this doesn't involve any of our own code, it's very difficult to debug: we cannot add log lines to figure out what is happening.
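One way to narrow this down without touching the training code might be to check, from outside the job, whether the code archive still exists in S3 and whether the bucket has a lifecycle rule that expires objects. The sketch below assumes boto3 is available. The bucket and key are hypothetical placeholders, and an expiring lifecycle rule is only one possible explanation for files disappearing after a fixed number of days, not a confirmed cause.

```python
def _error_code(exc):
    # botocore's ClientError keeps the S3 error details in `.response`.
    return getattr(exc, "response", {}).get("Error", {}).get("Code")


def object_exists(s3_client, bucket, key):
    """Return True if s3://bucket/key exists, False if S3 reports it missing."""
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:
        if _error_code(exc) in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # permissions or other problems should stay visible


def expiration_rules(s3_client, bucket):
    """Return (rule id, prefix, days) for enabled lifecycle rules that expire objects."""
    try:
        config = s3_client.get_bucket_lifecycle_configuration(Bucket=bucket)
    except Exception as exc:
        if _error_code(exc) == "NoSuchLifecycleConfiguration":
            return []  # the bucket has no lifecycle rules at all
        raise
    rules = []
    for rule in config.get("Rules", []):
        if rule.get("Status") != "Enabled":
            continue
        days = rule.get("Expiration", {}).get("Days")
        if days is not None:
            # Newer rules put the prefix under Filter; older ones use a top-level Prefix.
            prefix = rule.get("Filter", {}).get("Prefix", rule.get("Prefix", ""))
            rules.append((rule.get("ID"), prefix, days))
    return rules


if __name__ == "__main__":
    import boto3

    # Hypothetical values: replace with the bucket and key from the failing job's logs.
    bucket = "my-sagemaker-bucket"
    key = "path/to/source/sourcedir.tar.gz"

    s3 = boto3.client("s3")
    print("code archive exists:", object_exists(s3, bucket, key))
    for rule_id, prefix, days in expiration_rules(s3, bucket):
        print(f"rule {rule_id!r} expires objects under {prefix!r} after {days} days")
```

If a rule's prefix covers the location of the code archive and its expiry matches the interval between failures, that would be worth investigating first.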
Any help greatly appreciated! Thank you.