I'm using a custom entrypoint on the pre-built scikit-learn container to use scikit-learn's HashingVectorizer. A batch transform then writes the vectorized data to S3.
When I try to run Amazon's KMeans algorithm on that output file, I get the following error:
The header of the MXNet RecordIO record at position 334 in the dataset does not start with a valid magic number.
Full error message:
starting train job:10
training artifacts will be uploaded to: s3://sagemaker-studio-254827122652-9bjw2ki1mrk/taylorc/data/ubuntu-dialogue/kmeans-lowlevel-2021-11-10-20-43-58
InProgress
Training job ended with status: Failed
Training failed with the following error: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError)
Caused by: [20:47:57] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.4399.0/AL2_x86_64/generic-flavor/src/src/aialgs/io/iterator_base.cpp:100: (Input Error) The header of the MXNet RecordIO record at position 334 in the dataset does not start with a valid magic number.
Stack trace returned 10 entries:
[bt] (0) /opt/amazon/lib/libaialgs.so(+0x9d1b) [0x7f23d53b4d1b]
[bt] (1) /opt/amazon/lib/libaialgs.so(+0xa549) [0x7f23d53b5549]
[bt] (2) /opt/amazon/lib/libaialgs.so(aialgs::iterator_base::Next()+0x448) [0x7f23d53c2128]
[bt] (3) /opt/amazon/lib/libmxnet.so(MXDataIterNext+0x21) [0x7f23bc868051]
[bt] (4) /opt/amazon/lib/libffi.so.6(ffi_call_unix64+0x4c) [0x7f23d5699078]
[bt] (5) /opt/amazon/lib/libffi.so.6(ffi_call+0x186) [0x7f23d5698206]
[bt] (6) /opt/amazon/python3.7/lib/python3.7/lib-dynlo
Note that my dataset has 334 datapoints, so it seems like the problem may be with the last record in the file. I have seen this issue come up on Stack Overflow and the MXNet forum but cannot find a solution, and I'm not sure how to debug it. On SO there is advice to upload the file to S3 using a different method, but I don't think that applies here, since the data is pushed to S3 by the batch transform from the sklearn container.
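One debugging idea is to download the assembled transform output and try to parse it locally, to see whether it really is the last record that is malformed. A rough sketch (the local filename is a placeholder for a file copied down from the transform output prefix; read_records is the sagemaker SDK's recordio-protobuf parser):

# Sketch: parse the batch transform output locally and see where it breaks.
# 'transform_output.csv.out' is a placeholder for a file downloaded from S3.
from sagemaker.amazon.common import read_records

with open('transform_output.csv.out', 'rb') as f:
    records = read_records(f)  # raises if a record header has an invalid magic number
print('parsed {} records'.format(len(records)))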
The relevant part of the inference code from the entrypoint script is here:
The float32 conversion is there because, without it, I got an error saying float64 cannot be used for KMeans.
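Roughly, the idea in output_fn is: cast the HashingVectorizer output to float32 and serialize it as recordio-protobuf. A simplified sketch, not the exact code (write_spmatrix_to_sparse_tensor from the sagemaker SDK is assumed to do the serialization, and prediction is the sparse matrix produced by HashingVectorizer):

# Hypothetical sketch of the entrypoint's output_fn (names are placeholders).
import io
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

def output_fn(prediction, accept):
    if accept == 'application/x-recordio-protobuf':
        buf = io.BytesIO()
        # KMeans only accepts Float32 tensors, hence the explicit cast
        write_spmatrix_to_sparse_tensor(buf, prediction.astype('float32'))
        buf.seek(0)
        return buf.getvalue()
    raise ValueError('Unsupported accept type: {}'.format(accept))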
The code in my notebook used for triggering training and the transform is here:
from sagemaker.sklearn.estimator import SKLearn

instance_type = 'ml.m5.4xlarge'
model_data_uri = f's3://{bucket_name}/taylorc/model/ubuntu-dialogue/'

estimator = SKLearn(
    entry_point='scikit_learn_tfidf.py',
    framework_version="0.20.0",
    py_version='py3',
    instance_type=instance_type,
    role=role,
    output_path=model_data_uri,
    base_job_name='sklearn-tfidf',
    hyperparameters={'ngram_range': '1_1'}
)
estimator.fit({'train': training_data_uri})

transformer = estimator.transformer(
    instance_count=1,
    instance_type=instance_type,
    assemble_with='Line',
    accept='application/x-recordio-protobuf',
    strategy='SingleRecord',
    max_payload=100,
    max_concurrent_transforms=None,
)
transformer.transform(s3_path_to_test_data, content_type='text/csv')
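As a sanity check, I also want to list what actually ends up under the transform output prefix, since the KMeans train channel below points at that prefix (S3Prefix). A rough sketch, reusing the transformer object from above and simplifying the bucket/prefix split:

# Sketch: list the objects the 'train' channel would pick up from the transform output.
import boto3

out_bucket, _, out_prefix = transformer.output_path.replace('s3://', '').partition('/')
s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket=out_bucket, Prefix=out_prefix).get('Contents', []):
    print(obj['Key'], obj['Size'])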
from time import gmtime, strftime
import boto3

output_time = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_folder = 'kmeans-lowlevel-' + output_time
K = range(10, 11)  # change the range to be used for k
INSTANCE_COUNT = 1
# make this False to run jobs one at a time, especially if you do not want to
# create too many EC2 instances at once and hit account limits
run_parallel_jobs = False
job_names = []

# launching jobs for all k
for k in K:
    print('starting train job:' + str(k))
    output_location = f's3://{bucket_name}/{data_folder}/{output_folder}'
    print('training artifacts will be uploaded to: {}'.format(output_location))
    job_name = output_folder + str(k)

    create_training_params = {
        "AlgorithmSpecification": {
            "TrainingImage": '382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:latest',
            "TrainingInputMode": "File"
        },
        "RoleArn": role,
        "OutputDataConfig": {
            "S3OutputPath": output_location
        },
        "ResourceConfig": {
            "InstanceCount": INSTANCE_COUNT,
            "InstanceType": "ml.c4.8xlarge",
            "VolumeSizeInGB": 50
        },
        "TrainingJobName": job_name,
        "HyperParameters": {
            "k": str(k),
            "feature_dim": "1048576",
            "mini_batch_size": "10"
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 60 * 60
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": f'{transformer.output_path}/',
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "CompressionType": "None",
                "RecordWrapperType": "None"
            }
        ]
    }

    sagemaker = boto3.client('sagemaker')
    sagemaker.create_training_job(**create_training_params)

    status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)

    if not run_parallel_jobs:
        try:
            sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
        finally:
            status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
            print("Training job ended with status: " + status)
            if status == 'Failed':
                message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
                print('Training failed with the following error: {}'.format(message))
                raise Exception('Training job failed')

    job_names.append(job_name)
If anyone has an idea about how I could go about debugging this further, please advise. Thanks.