[HW2 Notice] Bug fix for using the cpu_enable branch on a GPU-equipped machine #25

Open
gyeongin opened this issue Dec 4, 2018 · 10 comments

Contributor

gyeongin commented Dec 4, 2018

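On a machine with a GPU installed, there was a bug when using the cpu_enable branch with a resource info such as

localhost
localhost

to run two CPU workers. This has now been fixed.
If you want to train using only the CPU on a machine that has a GPU, please pull the cpu_enable branch again.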
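
For reference, the two-worker resource info above is just a text file with one host per line, repeated once per CPU worker. A minimal sketch of generating it follows; the helper name `write_resource_info` and the file name `resource_info` are assumptions for illustration, not part of the homework code.

```python
def write_resource_info(path, hosts):
    """Write one host name per line; listing a host twice requests two workers on it."""
    with open(path, "w") as f:
        for host in hosts:
            f.write(host + "\n")


# Example (hypothetical file name): two CPU workers on the local machine.
# write_resource_info("resource_info", ["localhost", "localhost"])
```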

jeeyung commented Dec 4, 2018

I pulled the cpu_enable branch, but the same error still occurs.

Contributor Author

gyeongin commented Dec 4, 2018

Which resource info did you use?
What error message did you get?

jeeyung commented Dec 5, 2018

My resource info is

localhost
localhost

and the error message is as follows.

WARNING:tensorflow:From /home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: __init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
INFO:140116569343744:PARALLAX:parallel_run(PARALLAX_RUN_MPI)
INFO:140116569343744:PARALLAX:resource ps_localhost:46849:+localhost:37463:^worker_localhost:43741:0+localhost:38105:0
INFO:139728530999040:PARALLAX:parallel_run(PARALLAX_RUN_MPI)
INFO:139728530999040:PARALLAX:resource ps_localhost:46849:+localhost:37463:^worker_localhost:43741:0+localhost:38105:0
Traceback (most recent call last):
File "/home/jeeyung/Dropbox/school_materials/large_scale_data/hw2_cp/run_parallax.py", line 58, in <module>
parallax_config=parallax_config)
Traceback (most recent call last):
File "/home/jeeyung/Dropbox/school_materials/large_scale_data/hw2_cp/run_parallax.py", line 58, in <module>
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/common/runner.py", line 154, in parallel_run
parallax_config=parallax_config)
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/common/runner.py", line 154, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/runner.py", line 137, in parallax_run_mpi
return parallax_run_mpi(**kwargs)
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/runner.py", line 137, in parallax_run_mpi
graph_transform_mpi(single_gpu_meta_graph_def, config)
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 101, in graph_transform_mpi
graph_transform_mpi(single_gpu_meta_graph_def, config)
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 101, in graph_transform_mpi
_add_aggregation_ops(gradients_info, op_to_control_consumer_ops, config)
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 43, in _add_aggregation_ops
_add_aggregation_ops(gradients_info, op_to_control_consumer_ops, config)
File "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/mpi/graph_transform.py", line 43, in _add_aggregation_ops
use_allgatherv=config.communication_config.mpi_config.use_allgatherv)
use_allgatherv=config.communication_config.mpi_config.use_allgatherv)
TypeError: allreduce() got an unexpected keyword argument 'average_dense'
TypeError: allreduce() got an unexpected keyword argument 'average_dense'

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[10466,1],1]
Exit code: 1

Contributor Author

gyeongin commented Dec 5, 2018

How did you install Horovod?

jeeyung commented Dec 5, 2018

To use two CPU workers with Horovod as well, I installed it with

pip install horovod

When I instead installed Horovod like this

HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_WITHOUT_PYTORCH=True pip install --no-cache-dir dist/horovod-*.tar.gz

and ran parallax, I got this error message:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1

Contributor Author

gyeongin commented Dec 5, 2018

The TypeError: allreduce() got an unexpected keyword argument 'average_dense' you mentioned is caused by installing with pip install horovod.
Please install it the original way instead:

python setup.py sdist
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=NCCL HOROVOD_WITHOUT_PYTORCH=True pip install --no-cache-dir dist/horovod-*.tar.gz
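
One way to tell which Horovod build is currently installed is to check whether its `allreduce` accepts the `average_dense` keyword that the traceback complains about. The sketch below is only an illustration: the helper names `accepts_keyword` and `check_horovod_allreduce` are hypothetical, and it is written for Python 3 (`inspect.signature` is not available in the Python 2.7 environment shown in the traceback, where `inspect.getargspec` would be the rough equivalent).

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in named_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def check_horovod_allreduce():
    """Return True/False for the installed Horovod, or None if it cannot be imported."""
    try:
        import horovod.tensorflow as hvd
    except ImportError:
        return None
    return accepts_keyword(hvd.allreduce, "average_dense")
```

If `check_horovod_allreduce()` returns False, the installed package is most likely the one from `pip install horovod` rather than the source build described above.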

Separately, the error
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
is a bug that occurs "when you try to use the CPU even though a GPU is available" 😢
It looks like the corner case was not handled properly even though we made a separate CPU-support branch...
I have just pushed a hot fix, so please pull again and test.
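
The message is consistent with a common pattern in Horovod-style TensorFlow scripts, where each worker pins itself to the GPU whose index equals its local rank: with two workers on a one-GPU machine, the worker with local rank 1 asks for GPU '1'. The sketch below illustrates one way to guard against that. It is not the actual hot fix, and the helper name `session_device_settings` is hypothetical; it returns plain settings so it can run without TensorFlow.

```python
def session_device_settings(local_rank, num_gpus, cpu_only):
    """Pick session settings for one worker.

    The returned dict is meant to be applied to a TF 1.x session config, e.g.
      tf.ConfigProto(device_count={"GPU": 0})             # CPU-only worker
      config.gpu_options.visible_device_list = "<rank>"   # GPU worker
    """
    if cpu_only:
        # Hide every GPU from this worker instead of indexing one by rank.
        return {"device_count": {"GPU": 0}}
    if local_rank >= num_gpus:
        raise ValueError(
            "local rank %d has no matching GPU (only %d visible)"
            % (local_rank, num_gpus)
        )
    return {"visible_device_list": str(local_rank)}
```

With `cpu_only=True`, both workers get the same GPU-free settings regardless of rank, which is the situation the two-line localhost resource info is meant to express.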

jeeyung commented Dec 5, 2018

I pulled again, but I still get the same error... :(

Contributor Author

gyeongin commented Dec 5, 2018

I cannot reproduce that error. Could you check whether you rebuilt parallax and ran pip install --upgrade, using the following method:
Check whether line 192 of /home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/parallax/core/python/hybrid/runner.py matches the link.
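
A quick way to do this check without opening an editor is to print the installed line and compare it by eye with the linked one. This is a minimal sketch using the standard `linecache` module; the helper name `installed_line` is hypothetical.

```python
import linecache


def installed_line(path, lineno):
    """Return line `lineno` (1-indexed) of the file at `path`, or '' if it does not exist."""
    linecache.checkcache(path)  # drop any stale cached copy after a reinstall
    return linecache.getline(path, lineno).rstrip("\n")


# Example, using the path from the comment above:
# print(installed_line(
#     "/home/jeeyung/parallax_venv/local/lib/python2.7/site-packages/"
#     "parallax/core/python/hybrid/runner.py", 192))
```

If the printed line differs from the linked one, the virtualenv is still running the old build and parallax needs to be rebuilt and reinstalled.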

jeeyung commented Dec 5, 2018

It's resolved! Thank you!!

Contributor

bgchun commented Dec 5, 2018

@gyeongin Thanks!
