Communication via compressed spikes #1338
Conversation
@jarsi and I have been working to verify that MPI buffers are sensibly sized when compression is active. Keeping the network size fixed while increasing the number of virtual processes, we find that the size of the MPI buffers holding compressed spikes approaches (and eventually equals) the MPI buffer size without compression. This check demonstrates, as expected, that the original behaviour, i.e. the case of no compression, is recovered when there is no opportunity for compression.
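The convergence described above can be illustrated with a toy model (plain Python, not the NEST implementation; all names and numbers here are made up for illustration). With the number of targets per neuron held fixed, spreading the targets over ever more MPI processes means each process eventually holds at most one target per neuron, so the compressed and uncompressed entry counts coincide:

```python
import random

def spike_entries(num_targets, num_procs, threads_per_proc, rng):
    """Count per-neuron MPI send entries under both schemes (toy model)."""
    # Assign each postsynaptic target to a random (process, thread) slot.
    slots = [(rng.randrange(num_procs), rng.randrange(threads_per_proc))
             for _ in range(num_targets)]
    uncompressed = len(set(slots))           # one entry per distinct (process, thread)
    compressed = len({p for p, _ in slots})  # one entry per distinct process
    return uncompressed, compressed

rng = random.Random(42)
K = 100  # targets per neuron, held fixed ("network size" fixed)
for num_procs in (1, 4, 16, 64, 256):
    u, c = spike_entries(K, num_procs, threads_per_proc=8, rng=rng)
    print(f"{num_procs:4d} processes: {u:3d} uncompressed vs {c:3d} compressed entries")
```

With one process, compression collapses everything to a single entry; with many processes, the two counts approach each other and the compression gain vanishes, matching the observation above.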
Can someone try running the HPC benchmark with compressed spikes? My runs consistently fail. Configuration: scale set to 10, with 5-10 nodes on Jureca, 24 MPI ranks and 2 OpenMP threads. The same occurs with 1 rank and 48 threads. This configuration is fine without compressed spikes.
@7eia thanks for your comments and testing the branch. could you please specify how your runs fail? are you using the
@jakobj yes, I'm using
@jakobj @H27CK I just tested
A fix is now in: jakobj#133
awesome, thank you @jhnnsnk 🎉 🎉 i've merged your fix and rebased onto the current master, now everything should be up to date.
I removed the WIP-tag from the title and instead turned the PR into Draft state. |
I have one tiny further suggestion on the exception message. Apart from that, I would suggest that I perform another test run as soon as the warnings are fixed.
Thanks, @jakobj!
Once Travis is happy with this PR, I will approve.
@jakobj Travis seems happy except for PEP8 problems in
@jakobj Please add documentation for users for the "use_spike_compression" flag to the
Could you also add, at a suitable place (connection_manager.h|cpp), some information for developers about how the compression mechanism works, so that we won't have to decode it from the code five years from now? Or is that written up elsewhere?
@jakobj This looks quite fine; I mainly have suggestions for more documentation and some improvements for better code clarity.
i'm really unsure how to read the massive travis output, but from what i can decipher, these errors have nothing to do with this PR:

MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:42:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:49:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:60:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:89:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:164:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:251:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:287:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:310:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:357:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:375:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:409:9: E741 ambiguous variable name 'l'

i guess fixing this should be a separate PR?
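For context, E741 flags single-character names like 'l' that are easily confused with the digit '1' in many fonts; the fix is simply a more descriptive name. A minimal illustration (hypothetical code, not the actual contents of test_basics.py):

```python
# Before (triggers E741, since 'l' is visually ambiguous):
#   for l in layers:
#       use(l)

# After: rename the loop variable to something unambiguous.
layers = ["layer_a", "layer_b", "layer_c"]
for layer in layers:
    print(layer.upper())
```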
done. may i point out that the duplication of information between python docstrings and c++ files is suboptimal. maybe there's some way to merge the two in the future?
thanks @jhnnsnk for the benchmarks and @heplesser for the comments! 🙇 i have addressed all of them, please have another look.
@jakobj I just sent you a PR that fixes the "ambiguous l". The problem popped up here because you made an unrelated change to
@jakobj Very nice! I approve, merging should wait until Travis is happy (see also my PR to you fixing the l-problem).
@jakobj I just noticed that
In situations where we have such flags, maybe we should have a test of whether the status can be read back anyway.
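The suggested round-trip check could look roughly like the sketch below. A plain dict stands in for the kernel status here so the example is self-contained; an actual NEST test would go through nest.SetKernelStatus / nest.GetKernelStatus instead, and the flag name is the one discussed in this PR:

```python
def set_flag(kernel_status, name, value):
    """Stand-in for SetKernelStatus: write one kernel flag."""
    kernel_status[name] = value

def get_flag(kernel_status, name):
    """Stand-in for GetKernelStatus: read one kernel flag back."""
    return kernel_status[name]

def test_flag_round_trip():
    # Every writable flag should be readable back with the value just set.
    kernel_status = {"use_spike_compression": True}
    for value in (False, True):
        set_flag(kernel_status, "use_spike_compression", value)
        assert get_flag(kernel_status, "use_spike_compression") == value

test_flag_round_trip()
```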
Hi @jakobj - just piping in here to say that reducing the redundancy in information is something the documentation team is trying to work on. Thanks for pointing out a specific instance of the problem - it's especially helpful to know where in the C++ code we have this issue. A possible way to merge the two in the future is that we are planning to put the developer docs on ReadtheDocs, alongside the user docs (#1844). I think this might give us a better chance to remove some redundancy and make it easier to cross-reference information.
Duplication is one thing, complementarity another. Python docstrings need to tell the user what she needs to know to use NEST features. C++ documentation tells the developer what she needs to know to understand how the code works, what data structures represent and how they interact, so she can maintain and evolve the code. Parameter names in PyNEST may match variable names in the C++ code, but that is not a law of nature: the C++ implementation could change while PyNEST stays the same, so some duplication of information cannot reasonably be avoided. Also, when working with the C++ code, you will not always want to have to look at PyNEST to find information.
alright, i've now implemented getting the kernel setting as suggested by @jhnnsnk and included the pep8 fixes from @heplesser. also, i've rewritten the history to be nice and clean. please have another look.
thanks for the quick fix! :) great teamwork, thanks to everyone who has contributed to this PR. everything looks good now, doesn't it? ps: i've tried to make the commit history a bit nicer
I approve! 👍
My final benchmarks were successful: this PR greatly improves simulation times for models similar to the microcircuit. I did not encounter any further issues.
Thank you, @jakobj, very much for your efforts!
Congratulations and a big thanks to all of you.
In the current ("5g") communication scheme, if a neuron has multiple targets on different threads of a target process, spikes for all of those threads are generated on the presynaptic side and sent to the postsynaptic side via MPI. For example, with 48 threads, a neuron with targets on all threads will generate 48 spikes presynaptically. This leads to unnecessarily large MPI buffers for massively multithreaded simulations (buffer sizes scale with the number of threads per process).
By introducing an additional postsynaptic data structure and the concept of a "compressed spike", this PR modifies the communication infrastructure so that this redundancy is avoided. A neuron with targets on all threads of a process then only needs to send a single spike, which is multiplexed on the postsynaptic side, significantly reducing the load on the communication network. Initial benchmarks show significant improvements over 5g on specific benchmark instances with many threads (measured and reported by @jarsi and @jhnnsnk).
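The two schemes can be contrasted with a small toy model (plain Python, not the NEST code; the data layout here is invented for illustration only):

```python
from collections import defaultdict

# (process, thread) of each postsynaptic target of one spiking neuron.
targets = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 2)]

# 5g-style: one MPI entry per distinct (process, thread) pair.
entries_5g = set(targets)

# Compressed: one MPI entry per distinct target process; each receiving
# process keeps a table mapping the source neuron to its local target
# threads and multiplexes the single incoming spike to them.
threads_by_process = defaultdict(set)
for proc, thread in targets:
    threads_by_process[proc].add(thread)
entries_compressed = set(threads_by_process)

print(len(entries_5g))          # 5 entries without compression
print(len(entries_compressed))  # 2 entries with compression

# Postsynaptic multiplexing on process 0: deliver to each local thread.
delivered = sorted(threads_by_process[0])  # threads 0, 1 and 2
```

The MPI traffic shrinks from one entry per target thread to one entry per target process; the per-process table pays for that saving with extra postsynaptic bookkeeping.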
The current implementation relies on heuristics to select the connections that qualify for compressed spikes; these heuristics should be critically reviewed.
@jhnnsnk @terhorstd