distributed/cancer_classifier.py only works in one Docker container.
It works in one container:
# both in 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=ps --task_index=0
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=worker --task_index=0
But it does not work in two containers:
# ps in 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=ps --task_index=0
# worker in 127.17.0.4
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=worker --task_index=0
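For reference, this is how such flags are usually turned into a cluster spec and an in-process server (a minimal sketch of the standard tf.train pattern; the actual script may differ):

import tensorflow as tf

# Values mirror the two-container run above.
ps_hosts = "127.17.0.3:8222".split(",")
worker_hosts = "127.17.0.4:8223".split(",")

# Every process builds the same cluster description...
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# ...but starts a server only for its own job/task; this would be the
# worker process running on 127.17.0.4:
server = tf.train.Server(cluster, job_name="worker", task_index=0)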
The error message I got in the worker:
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> 127.17.0.3:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:8222
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py:344 in __init__.: __init__ (from tensorflow.python.training.summary_io) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session 91acfc1008531f4d with config:
Traceback (most recent call last):
File "cancer_classifier_new.py", line 241, in <module>
tf.app.run(main=main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "cancer_classifier_new.py", line 209, in main
with sv.managed_session(server.target) as sess:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 802, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 720, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 227, in prepare_session
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 173, in _restore_checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
[[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]
Caused by op u'save/RestoreV2_8', defined at:
File "cancer_classifier_new.py", line 241, in <module>
tf.app.run(main=main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "cancer_classifier_new.py", line 191, in main
saver = tf.train.Saver()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
[[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]
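The ps device string in the error comes from where the variables live: with the usual replica_device_setter pattern, variables are placed on /job:ps/task:0, and the Saver's restore ops are colocated with them, so the worker's session needs the ps task to be registered and reachable. A minimal sketch of that placement, assuming the script follows this pattern (the variable below is just a placeholder):

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["127.17.0.3:8222"],
                                "worker": ["127.17.0.4:8223"]})

# With replica_device_setter, variables land on /job:ps/task:0 while compute
# ops stay on the worker. The Saver's RestoreV2 ops are colocated with the
# variables, which is why they carry the ps device string in the error above.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights = tf.Variable(tf.zeros([9, 2]), name="weights")  # placeholder variable

saver = tf.train.Saver()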
It may be a bug in the distributed training script with the latest TensorFlow.
I will refactor the distributed code soon. If you want to run a distributed TensorFlow application, please try tobegit3hub/distributed_tensorflow, which is in much better shape now.
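For reference, the usual between-graph replication skeleton looks roughly like this (a rough sketch only, not the exact code from either repository):

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "", "comma-separated host:port pairs")
flags.DEFINE_string("worker_hosts", "", "comma-separated host:port pairs")
flags.DEFINE_string("job_name", "", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of the task within the job")
FLAGS = flags.FLAGS


def main(_):
    cluster = tf.train.ClusterSpec({"ps": FLAGS.ps_hosts.split(","),
                                    "worker": FLAGS.worker_hosts.split(",")})
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()  # the parameter server blocks here and serves variables
        return

    # Variables go to the ps job, compute ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        global_step = tf.Variable(0, trainable=False, name="global_step")
        w = tf.Variable(tf.zeros([1]), name="w")   # toy model standing in for the classifier
        loss = tf.reduce_sum(tf.square(w - 1.0))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)
        init_op = tf.global_variables_initializer()

    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             init_op=init_op, global_step=global_step)
    with sv.managed_session(server.target) as sess:
        step = 0
        while not sv.should_stop() and step < 1000:
            _, step = sess.run([train_op, global_step])


if __name__ == "__main__":
    tf.app.run(main=main)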