Replies: 1 comment 1 reply
-
Hey, how are you setting the `TS_DEFAULT_RESPONSE_TIMEOUT` config?
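For reference, a minimal sketch of one way this TorchServe setting can be passed to a SageMaker PyTorch endpoint, assuming deployment through the SageMaker Python SDK; the model data path, role ARN, entry point, versions, and timeout value below are placeholders, not details from this thread:

```python
# Sketch only: TS_DEFAULT_RESPONSE_TIMEOUT is a TorchServe option that can be supplied
# through the model's environment variables. All paths, versions, and the ARN are placeholders.
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",              # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder
    entry_point="inference.py",
    framework_version="1.12",
    py_version="py38",
    env={
        # Give long-running video inference more time before TorchServe times out (seconds).
        "TS_DEFAULT_RESPONSE_TIMEOUT": "900",
    },
)
```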
-
I've deployed a custom model with an async endpoint. I want to process video files with it, and because videos can be ~5-10 minutes long I can't load all the frames into memory. Of course, I want to run inference on each frame.
I've written:

- `input_fn` - downloads the video file from S3 using boto and creates a generator (written with OpenCV) that loads video frames with a given batch size, then returns the generator
- `predict_fn` - iterates over the generator's batched frames, runs the model on each batch, and saves the predictions in a list
- `output_fn` - transforms the predictions into JSON format and gzips everything to reduce the size
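Roughly, the handler layout described above might look like the sketch below, assuming the standard SageMaker PyTorch inference toolkit contract (`model_fn` omitted); the request payload shape, batch size, and preprocessing are illustrative assumptions, not the actual code from this post:

```python
# Sketch only: an input_fn/predict_fn/output_fn layout matching the description above.
import gzip
import json
import tempfile

import boto3
import cv2
import numpy as np
import torch

BATCH_SIZE = 16  # assumed frame batch size


def input_fn(request_body, content_type):
    # Assume the async request body carries the S3 location of the video.
    payload = json.loads(request_body)
    local_path = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False).name
    boto3.client("s3").download_file(payload["bucket"], payload["key"], local_path)

    def frame_batches():
        # Stream frames with OpenCV so the whole video never sits in memory.
        cap = cv2.VideoCapture(local_path)
        batch = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            batch.append(frame)
            if len(batch) == BATCH_SIZE:
                yield np.stack(batch)
                batch = []
        if batch:
            yield np.stack(batch)
        cap.release()

    return frame_batches()  # a generator of frame batches


def predict_fn(batches, model):
    predictions = []
    with torch.no_grad():
        for batch in batches:
            # NHWC uint8 -> NCHW float; adapt preprocessing to the actual model.
            x = torch.from_numpy(batch).permute(0, 3, 1, 2).float() / 255.0
            predictions.append(model(x).cpu().tolist())
    return predictions


def output_fn(predictions, accept):
    # Serialize to JSON and gzip the payload to keep the async output object small.
    return gzip.compress(json.dumps(predictions).encode("utf-8"))
```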
The endpoint works well, but the problem is concurrency: the SageMaker endpoint processes one request after another (judging from CloudWatch and the S3 output file timestamps), and I don't know why this happens. `max_concurrent_invocations_per_instance` is set to 1000. The other settings for PyTorch serving are as follows:

And still, it doesn't work. So how can I create an async inference endpoint with PyTorch that handles requests concurrently?
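For context, `max_concurrent_invocations_per_instance` is normally passed through `AsyncInferenceConfig` when the endpoint is deployed with the SageMaker Python SDK; a minimal sketch, where the output path and instance type are placeholders and `model` is assumed to be an already-constructed `PyTorchModel`, could look like this:

```python
# Sketch only: the concurrency limit from the post is supplied via AsyncInferenceConfig
# at deploy time. Output path and instance type are placeholders, and `model` is assumed
# to be an already-constructed PyTorchModel.
from sagemaker.async_inference import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path="s3://my-bucket/async-output/",   # placeholder
    max_concurrent_invocations_per_instance=1000,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",               # placeholder
    async_inference_config=async_config,
)
```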