Unable to create/update endpoint for a model created from a training job #2592

Pryanga · 2021-08-24T15:40:14Z

I trained a model and deployed it successfully by running the example notebook (https://github.com/huggingface/notebooks/blob/master/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb).
Now I am trying to rerun the training job and deploy the new model to the same endpoint. I can create a new training job whose specs are identical to the training job produced by the example notebook, and I can create a new model and a new endpoint configuration. But when I update the endpoint with the new endpoint configuration, I get the following error:

The primary container for production variant did not pass the ping health check

How I create the model and endpoint configuration, and update the endpoint:

The following code runs in my AWS Lambda function (Python 3.6) to trigger the creation process:

import datetime
import os


def deploy_model(sm_client, train_job_name, train_job_descr):
    # Timestamp suffix (e.g. 2021-08-24-14-06-07) keeps model and config names unique
    timestamp = datetime.datetime.today().strftime('%Y-%m-%d-%H-%M-%S')

    ## create model
    model_name = os.environ['model_name_prefix'] + timestamp
    model_creation_descr = sm_client.create_model(
        ModelName=model_name,
        PrimaryContainer={
            # image URI and model artifacts are both read from the training job description
            'Image': train_job_descr['AlgorithmSpecification']['TrainingImage'],
            'Mode': 'SingleModel',
            'ModelDataUrl': train_job_descr['ModelArtifacts']['S3ModelArtifacts'],
            'Environment': {
                'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
                'SAGEMAKER_REGION': 'ap-southeast-1',
            },
        },
        ExecutionRoleArn=train_job_descr['RoleArn'],
    )
    print("\nModel Creation Response:\n")
    print(model_creation_descr)

    ## create endpoint configuration
    endpoint_config_name = os.environ['endpoint_config_name_prefix'] + timestamp
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ProductionVariants=[
            {
                'VariantName': 'AllTraffic',
                'ModelName': model_name,
                'InitialInstanceCount': 1,
                'InstanceType': 'ml.g4dn.xlarge',
                'InitialVariantWeight': 1.0,
            },
        ],
    )
    print("\nEndpoint Configuration Creation Response:\n")
    print(endpoint_config_response)

    ## update endpoint if an endpoint already exists and is in service
    endpoint_name = os.environ['endpoint_name']
    endpoint_descrip = sm_client.describe_endpoint(EndpointName=endpoint_name)
    if endpoint_descrip['EndpointStatus'] == 'InService':
        # point the existing endpoint at the configuration created above
        sm_client.update_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name,
        )
    else:
        print(f"Endpoint {endpoint_name} is not in service (status: {endpoint_descrip['EndpointStatus']})")

    return

Other approaches:
I received the same error at the endpoint when I created the model and the endpoint configuration through the AWS SageMaker console instead.

Kindly assist.

Pryanga · 2021-08-25T09:53:34Z

I resolved the problem. The issue is in the sm_client.create_model() call:

PrimaryContainer={
            'Image': train_job_descr['AlgorithmSpecification']['TrainingImage'] ,

I used the Deep Learning Container Images to train the model. So for PrimaryContainer['Image'] (the code above), I have to pass an inference image, not the training image.
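
One way to catch this mistake before the endpoint fails its health check is a small guard on the image URI. This is only a sketch: the helper names `is_training_image` and `require_inference_image` are hypothetical, and the check relies on the assumption that the Deep Learning Container repository name ends in `-training` (as in `huggingface-pytorch-training`). It is a heuristic and would not recognize custom images or URIs pinned by digest.

```python
def is_training_image(image_uri: str) -> bool:
    """Heuristic check: does the repository name end in '-training'?"""
    # An ECR URI looks like '<registry>/<repository>:<tag>'.
    # Keep only the repository part, without registry host or tag.
    repository = image_uri.rsplit("/", 1)[-1].split(":", 1)[0]
    return repository.endswith("-training")


def require_inference_image(image_uri: str) -> str:
    """Return the URI unchanged, or raise if it looks like a training image."""
    if is_training_image(image_uri):
        raise ValueError(
            f"{image_uri} looks like a training image; "
            "create_model needs an inference image"
        )
    return image_uri
```

In the Lambda code, the value passed as PrimaryContainer['Image'] could be wrapped in `require_inference_image(...)` so that a training image is rejected before create_model is called.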

As long as my training image does not change, my inference image does not need to change either. I chose the matching inference image from the AWS SageMaker Deep Learning Container Images list and passed it to the create_model API.

PrimaryContainer={
            'Image': "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.7.1-transformers4.6.1-cpu-py36-ubuntu18.04",
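
The URI above follows the usual ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a rough sketch, it can be assembled from its parts so that the region is not hard-coded (the code in the question sets SAGEMAKER_REGION to ap-southeast-1, while this URI points at us-east-1). The helper name `build_dlc_image_uri` is hypothetical. The default account ID and repository are simply the ones from the URI above, and they are an assumption: some regions use a different account ID or domain, so check the Deep Learning Container Images list for your region. If the SageMaker Python SDK is installed, its `sagemaker.image_uris.retrieve` function may be a better way to look up the URI.

```python
def build_dlc_image_uri(
    region: str,
    tag: str,
    repository: str = "huggingface-pytorch-inference",
    account_id: str = "763104351884",  # assumption: taken from the URI in this thread
) -> str:
    """Assemble an ECR image URI from its components."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Reproduces the URI used in the fix above.
image_uri = build_dlc_image_uri(
    region="us-east-1",
    tag="1.7.1-transformers4.6.1-cpu-py36-ubuntu18.04",
)
print(image_uri)
```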

PROBLEM RESOLVED :)
