Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA backend for asynchronous distributed multi-trainer segfaults (race condition) #56

Open
bwasti opened this issue Oct 2, 2022 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@bwasti
Copy link
Contributor

bwasti commented Oct 2, 2022

When running the distributed test with a CUDA backend, there's a segfault. It can be fixed easily with CUDA_LAUNCH_BLOCKING, but that's not ideal. Below are the commands to repro:

Server:

$ bash examples/distributed/serve.sh

Client:

$ bash examples/distributed/client.sh
@bwasti bwasti added the bug Something isn't working label Oct 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants