I'm using a custom entrypoint on the pre-built scikit-learn container to use scikit-learn's HashingVectorizer. A batch transform then writes the vectorized data to S3.
When I try to run Amazon's KMeans algorithm on that output file, I get the following error:
The header of the MXNet RecordIO record at position 334 in the dataset does not start with a valid magic number.
Full error message:
starting train job:10
training artifacts will be uploaded to: s3://sagemaker-studio-254827122652-9bjw2ki1mrk/taylorc/data/ubuntu-dialogue/kmeans-lowlevel-2021-11-10-20-43-58
InProgress
Training job ended with status: Failed
Training failed with the following error: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError)
Caused by: [20:47:57] /opt/brazil-pkg-cache/packages/AIAlgorithmsCppLibs/AIAlgorithmsCppLibs-2.0.4399.0/AL2_x86_64/generic-flavor/src/src/aialgs/io/iterator_base.cpp:100: (Input Error) The header of the MXNet RecordIO record at position 334 in the dataset does not start with a valid magic number.
Stack trace returned 10 entries:
[bt] (0) /opt/amazon/lib/libaialgs.so(+0x9d1b) [0x7f23d53b4d1b]
[bt] (1) /opt/amazon/lib/libaialgs.so(+0xa549) [0x7f23d53b5549]
[bt] (2) /opt/amazon/lib/libaialgs.so(aialgs::iterator_base::Next()+0x448) [0x7f23d53c2128]
[bt] (3) /opt/amazon/lib/libmxnet.so(MXDataIterNext+0x21) [0x7f23bc868051]
[bt] (4) /opt/amazon/lib/libffi.so.6(ffi_call_unix64+0x4c) [0x7f23d5699078]
[bt] (5) /opt/amazon/lib/libffi.so.6(ffi_call+0x186) [0x7f23d5698206]
[bt] (6) /opt/amazon/python3.7/lib/python3.7/lib-dynlo
Note that my dataset has 334 datapoints, so it seems like the problem may be with the last record in the file. I have seen this issue come up on Stack Overflow and the MXNet forum but cannot find a solution, and I'm not sure how to debug it. On SO there is advice to upload the file to S3 using a different method, but I don't think that applies here, since the data is pushed to S3 by the batch transform from the sklearn container.
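One debugging idea is to download the assembled transform output and try to parse it locally, to see whether it really is the last record that is malformed. A rough sketch (the local filename is a placeholder for a file copied down from the transform output prefix; read_records is the sagemaker SDK's recordio-protobuf parser):

# Sketch: parse the batch transform output locally and see where it breaks.
# 'transform_output.csv.out' is a placeholder for a file downloaded from S3.
from sagemaker.amazon.common import read_records

with open('transform_output.csv.out', 'rb') as f:
    records = read_records(f)  # raises if a record header has an invalid magic number
print('parsed {} records'.format(len(records)))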
The relevant part of the inference code from the entrypoint script is here:
The float32 conversion is there because, without it, I got an error saying float64 cannot be used for KMeans.
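Roughly, the idea in output_fn is: cast the HashingVectorizer output to float32 and serialize it as recordio-protobuf. A simplified sketch, not the exact code (write_spmatrix_to_sparse_tensor from the sagemaker SDK is assumed to do the serialization, and prediction is the sparse matrix produced by HashingVectorizer):

# Hypothetical sketch of the entrypoint's output_fn (names are placeholders).
import io
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

def output_fn(prediction, accept):
    if accept == 'application/x-recordio-protobuf':
        buf = io.BytesIO()
        # KMeans only accepts Float32 tensors, hence the explicit cast
        write_spmatrix_to_sparse_tensor(buf, prediction.astype('float32'))
        buf.seek(0)
        return buf.getvalue()
    raise ValueError('Unsupported accept type: {}'.format(accept))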
The code in my notebook used for triggering training and the transform is here:
from sagemaker.sklearn.estimator import SKLearn

instance_type = 'ml.m5.4xlarge'
model_data_uri = f's3://{bucket_name}/taylorc/model/ubuntu-dialogue/'

estimator = SKLearn(
    entry_point='scikit_learn_tfidf.py',
    framework_version="0.20.0",
    py_version='py3',
    instance_type=instance_type,
    role=role,
    output_path=model_data_uri,
    base_job_name='sklearn-tfidf',
    hyperparameters={'ngram_range': '1_1'}
)
estimator.fit({'train': training_data_uri})

transformer = estimator.transformer(
    instance_count=1,
    instance_type=instance_type,
    assemble_with='Line',
    accept='application/x-recordio-protobuf',
    strategy='SingleRecord',
    max_payload=100,
    max_concurrent_transforms=None,
)
transformer.transform(s3_path_to_test_data, content_type='text/csv')
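As a sanity check, I also want to list what actually ends up under the transform output prefix, since the KMeans train channel below points at that prefix (S3Prefix). A rough sketch, reusing the transformer object from above and simplifying the bucket/prefix split:

# Sketch: list the objects the 'train' channel would pick up from the transform output.
import boto3

out_bucket, _, out_prefix = transformer.output_path.replace('s3://', '').partition('/')
s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket=out_bucket, Prefix=out_prefix).get('Contents', []):
    print(obj['Key'], obj['Size'])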
from time import gmtime, strftime
import boto3

output_time = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_folder = 'kmeans-lowlevel-' + output_time
K = range(10, 11)  # change the range to be used for k
INSTANCE_COUNT = 1
# make this False to run jobs one at a time, especially if you do not want to
# create too many EC2 instances at once and hit account limits
run_parallel_jobs = False
job_names = []

# launching jobs for all k
for k in K:
    print('starting train job:' + str(k))
    output_location = f's3://{bucket_name}/{data_folder}/{output_folder}'
    print('training artifacts will be uploaded to: {}'.format(output_location))
    job_name = output_folder + str(k)

    create_training_params = {
        "AlgorithmSpecification": {
            "TrainingImage": '382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:latest',
            "TrainingInputMode": "File"
        },
        "RoleArn": role,
        "OutputDataConfig": {
            "S3OutputPath": output_location
        },
        "ResourceConfig": {
            "InstanceCount": INSTANCE_COUNT,
            "InstanceType": "ml.c4.8xlarge",
            "VolumeSizeInGB": 50
        },
        "TrainingJobName": job_name,
        "HyperParameters": {
            "k": str(k),
            "feature_dim": "1048576",
            "mini_batch_size": "10"
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 60 * 60
        },
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": f'{transformer.output_path}/',
                        "S3DataDistributionType": "FullyReplicated"
                    }
                },
                "CompressionType": "None",
                "RecordWrapperType": "None"
            }
        ]
    }

    sagemaker = boto3.client('sagemaker')
    sagemaker.create_training_job(**create_training_params)

    status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print(status)

    if not run_parallel_jobs:
        try:
            sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
        finally:
            status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
            print("Training job ended with status: " + status)
            if status == 'Failed':
                message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
                print('Training failed with the following error: {}'.format(message))
                raise Exception('Training job failed')

    job_names.append(job_name)
If anyone has an idea about how I could go about debugging this further, please advise. Thanks.