Using PyTorch image produces an argparse bug #2819

VictorJouault · 2022-01-03T22:27:12Z


Hi,

I am trying to run a job on SageMaker that requires both MXNet (I am using GluonTS, which depends on MXNet) and PyTorch. However, whichever deep learning container image I select (from here), the job fails, and I have not been able to fix it.

In particular, when using a PyTorch container (either 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker or 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04), the following messages are printed at the beginning of the script:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

This seems to cause a problem with argparse when the experiment script is called. The same call worked with another image, but with these images it fails with the error message below.

Traceback (most recent call last):
  File "experiment.py", line 358, in <module>
    args.hyper_params = json.loads(args.hyper_params)
  File "/opt/conda/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-01-03 21:58:00,347 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/opt/conda/bin/python3.8 experiment.py --data-bucket sagemaker-us-east-1-XXX --data-prefix sample_dataset --estimator CustomEstimator --hyper-params {"prediction_length": 168, "context_length": 672, "trainer_kwargs": {"max_epochs": 200}} --job-config {}"
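
For anyone debugging the same traceback: `json.loads` reports "Expecting property name enclosed in double quotes" at char 1 when the character right after the opening `{` is not a `"`. That happens, for example, when the value arrives as `{'prediction_length': 168}` (a Python dict rendering) or as `{prediction_length: 168}` (double quotes stripped on the way in). Which of these, if either, applies here is not established by the log above. The sketch below is only one possible way to make the script tolerant of the first form. `parse_json_arg` and `build_parser` are hypothetical helper names, and only the two option names are taken from the logged command.

```python
import argparse
import ast
import json


def parse_json_arg(raw):
    """Parse a command-line value that is expected to hold a JSON object.

    Hypothetical helper: strict JSON is tried first, then Python literal
    syntax, so that a str(dict) rendering such as "{'a': 1}" is accepted.
    """
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        try:
            # Accepts only literals; a bare name like {a: 1} is rejected.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError) as exc:
            raise argparse.ArgumentTypeError(
                f"not valid JSON or a Python literal: {raw!r}"
            ) from exc
    if not isinstance(value, dict):
        raise argparse.ArgumentTypeError(f"expected an object, got: {raw!r}")
    return value


def build_parser():
    """Assumed shape of the script's parser; only the JSON options are shown."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hyper-params", type=parse_json_arg, default={})
    parser.add_argument("--job-config", type=parse_json_arg, default={})
    return parser
```

With this in place, `build_parser().parse_args(["--hyper-params", "{'prediction_length': 168}"])` yields a dict instead of raising inside `json.loads`. Logging `repr(sys.argv)` at the top of the script is a quick way to see which form the container actually passes.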

However, I also cannot use the MXNet image because of a Horovod bug.

Any idea is appreciated, thanks!
