distributed/cancer_classifier.py only works in one Docker container.
It works in one container:
# both in 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=ps --task_index=0
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=worker --task_index=0
But it does not work in two containers:
# ps in 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=ps --task_index=0
# worker in 127.17.0.4
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=worker --task_index=0
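For reference, this is how such flags are usually turned into a cluster spec and an in-process server (a minimal sketch of the standard tf.train pattern; the actual script may differ):

import tensorflow as tf

# Values mirror the two-container run above.
ps_hosts = "127.17.0.3:8222".split(",")
worker_hosts = "127.17.0.4:8223".split(",")

# Every process builds the same cluster description...
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# ...but starts a server only for its own job/task; this would be the
# worker process running on 127.17.0.4:
server = tf.train.Server(cluster, job_name="worker", task_index=0)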
The error message I got in the worker:
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> 127.17.0.3:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:8222
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py:344 in __init__.: __init__ (from tensorflow.python.training.summary_io) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session 91acfc1008531f4d with config:
Traceback (most recent call last):
File "cancer_classifier_new.py", line 241, in <module>
tf.app.run(main=main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "cancer_classifier_new.py", line 209, in main
with sv.managed_session(server.target) as sess:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 802, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 720, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 227, in prepare_session
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 173, in _restore_checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
[[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]
Caused by op u'save/RestoreV2_8', defined at:
File "cancer_classifier_new.py", line 241, in <module>
tf.app.run(main=main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "cancer_classifier_new.py", line 191, in main
saver = tf.train.Saver()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
[[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]
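The ps device string in the error comes from where the variables live: with the usual replica_device_setter pattern, variables are placed on /job:ps/task:0, and the Saver's restore ops are colocated with them, so the worker's session needs the ps task to be registered and reachable. A minimal sketch of that placement, assuming the script follows this pattern (the variable below is just a placeholder):

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["127.17.0.3:8222"],
                                "worker": ["127.17.0.4:8223"]})

# With replica_device_setter, variables land on /job:ps/task:0 while compute
# ops stay on the worker. The Saver's RestoreV2 ops are colocated with the
# variables, which is why they carry the ps device string in the error above.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights = tf.Variable(tf.zeros([9, 2]), name="weights")  # placeholder variable

saver = tf.train.Saver()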
It may be a bug in the distributed training script with the latest TensorFlow.
I will refactor the distributed code soon. If you want to run a distributed TensorFlow application, please try tobegit3hub/distributed_tensorflow, which is in much better shape now.
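For reference, the usual between-graph replication skeleton looks roughly like this (a rough sketch only, not the exact code from either repository):

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "", "comma-separated host:port pairs")
flags.DEFINE_string("worker_hosts", "", "comma-separated host:port pairs")
flags.DEFINE_string("job_name", "", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of the task within the job")
FLAGS = flags.FLAGS


def main(_):
    cluster = tf.train.ClusterSpec({"ps": FLAGS.ps_hosts.split(","),
                                    "worker": FLAGS.worker_hosts.split(",")})
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()  # the parameter server blocks here and serves variables
        return

    # Variables go to the ps job, compute ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        global_step = tf.Variable(0, trainable=False, name="global_step")
        w = tf.Variable(tf.zeros([1]), name="w")   # toy model standing in for the classifier
        loss = tf.reduce_sum(tf.square(w - 1.0))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)
        init_op = tf.global_variables_initializer()

    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             init_op=init_op, global_step=global_step)
    with sv.managed_session(server.target) as sess:
        step = 0
        while not sv.should_stop() and step < 1000:
            _, step = sess.run([train_op, global_step])


if __name__ == "__main__":
    tf.app.run(main=main)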