Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Communication via compressed spikes #1338

Merged
merged 3 commits into from
May 3, 2021
Merged

Conversation

jakobj
Copy link
Contributor

@jakobj jakobj commented Nov 15, 2019

In the current ("5g") communication scheme if a neuron has multiple targets on different threads of a target process, spikes for all different threads are generated on the presynaptic side and send to the postsynaptic side via MPI. For example for 48threads, a neuron with targets on all threads will generate 48spikes presynaptically. This leads to unnecessarily large MPI buffers for massively multihreaded simulations (buffer sizes scale ~ #threads per process).

By introducing an additional postsynaptic data structure and the concept of a "compressed spike" this PR modifies the communication infrastructure such that this redundancy is avoided. A neuron with targets on all threads of a process hence only needs to send a single spike which is then multiplexed on the postsynaptic side, significantly reducing the load on the communication network. Initial benchmarks show significant improvements over 5g on specific benchmark instances with many threads (measured and reported by @jarsi @jhnnsnk ).
The current implementation relies on heuristics to select the connections which qualify for using compressed spikes, these should be critically reviewed.

@jhnnsnk @terhorstd

@jakobj
Copy link
Contributor Author

jakobj commented Nov 15, 2019

Sketch of the modified postsynaptic infrastructure. Note that spikes can be delivered regularly (green arrow) or via the new data structure (blue arrows) depending on the number of target threads.
infrastructure_5g_compressed_spikedata

@terhorstd terhorstd added ZC: Kernel DO NOT USE THIS LABEL ZP: PR Created DO NOT USE THIS LABEL S: Normal Handle this with default priority T: Enhancement New functionality, model or documentation I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) labels Nov 15, 2019
@terhorstd terhorstd requested review from H27CK and jhnnsnk November 18, 2019 10:32
nestkernel/source_table.cpp Outdated Show resolved Hide resolved
nestkernel/source_table.cpp Outdated Show resolved Hide resolved
@heplesser heplesser changed the title Communication via compressed spikes [WIP] Communication via compressed spikes Dec 9, 2019
@H27CK
Copy link

H27CK commented Dec 13, 2019

@jarsi and I have been working to verify that MPI buffer sizes are sensibly sized when compression is active. By maintaining the network size and increasing the number of virtual processes, we find that the size of MPI buffers holding compressed spikes approaches (and eventually equals) the MPI buffer size in the case of no compression. This check demonstrates, as expected, that the original behaviour, i.e. in the case of no compression, is recovered when there is no opportunity for compression.

@H27CK
Copy link

H27CK commented Jan 6, 2020

Can someone try running the HPC benchmark with compressed spikes. My runs consistently fail. Configuration: scale set to 10, with 5-10 nodes on Jureca, 24 MPI ranks and 2 OpenMP threads. The same occurs with 1 rank and 48 threads. This configuration is fine without compressed spikes.

@jakobj
Copy link
Contributor Author

jakobj commented Jan 9, 2020

@7eia thanks for your comments and testing the branch. could you please specify how your runs fail? are you using the hpc_benchmark.sli from the examples folder?

@H27CK
Copy link

H27CK commented Jan 9, 2020

@jakobj yes, I'm using hpc_benchmark.sli. The runs fail silently after NodeManager::prepare_nodes [Info]. Standard error shows segmentation faults and bus errors, so nothing too helpful. This could indicate an OOM issue, but that should still be sorted.

@jhnnsnk
Copy link
Contributor

jhnnsnk commented Mar 5, 2020

@jakobj @H27CK I just tested hpc_benchmark.sli, the microcircuit and some other examples. I compared spike results for NEST2.20 and the branch of this PR. If spike compression is not active [local_num_threads=1 and >= 1 MPI process(es)], spike results always coincide. With >1 threads, microcircuit results are also good.
However, if both >1 threads and synapse models created with CopyModel are used (as in the case of hpc_benchmark.sli), this branch fails with a Segmentation fault.
The reason is that a dimension for different synapse types is missing, most likely in the compressed_spike_data_map_ (@jakobj already mentioned that this might become an issue). I could try to propose a fix next week.

@jhnnsnk
Copy link
Contributor

jhnnsnk commented Mar 26, 2020

A fix is now in: jakobj#133

@jakobj
Copy link
Contributor Author

jakobj commented Mar 26, 2020

awesome, thanks you @jhnnsnk 🎉 🎉 i've merged your fix and rebased onto the current master, now everything should be up to date.

nestkernel/source_table.cpp Outdated Show resolved Hide resolved
@heplesser heplesser removed ZC: Kernel DO NOT USE THIS LABEL ZP: PR Created DO NOT USE THIS LABEL labels Apr 7, 2020
@heplesser heplesser changed the title [WIP] Communication via compressed spikes Communication via compressed spikes May 20, 2020
@heplesser heplesser marked this pull request as draft May 20, 2020 08:51
@heplesser
Copy link
Contributor

I removed the WIP-tag from the title and instead turned the PR into Draft state.

Copy link
Contributor

@jhnnsnk jhnnsnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have here one tiny further suggestion on the Exception message. Apart from that, I would suggest that I perform another test run as soon as the Warnings are fixed.

nestkernel/connection_manager.cpp Outdated Show resolved Hide resolved
nestkernel/connection_manager.cpp Outdated Show resolved Hide resolved
@jhnnsnk
Copy link
Contributor

jhnnsnk commented Apr 20, 2021

Thanks, @jakobj!
I have now performed new microcircuit benchmark runs comparing this spike compression branch to its master ancestor on up to 10 nodes, and the results are only positive:

  1. The simulations are now correct according to the local spike counter.
  2. The simulation times (communication phase) are drastically reduced in line with previous benchmarks.
  3. The required total memory is even slightly reduced with spike compression (in contrast to earlier implementations).

Once Travis is happy with this PR, I will approve.

@heplesser
Copy link
Contributor

@jakobj Travis seems happy except for PEP8 problems in pynest/nest/tests/test_spatial/test_basics.py.

@heplesser
Copy link
Contributor

@jakobj Please add documentation for users for the "use_spike_compression" flag to the SetKernelStatus docstring in pynest/nest/lib/hl_api_simulation.py.

Could you also add at a suitable place (connection_manager.h|cpp) some information for developers about how the compression mechanism works, so that we won't have to decode this from the code five years from now. Or is that written up elsewhere?

Copy link
Contributor

@heplesser heplesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakobj This looks quite fine, I mainly have suggestions for more documentation and some improvements for better code clarity.

nestkernel/connection_manager.h Show resolved Hide resolved
nestkernel/connection_manager.h Show resolved Hide resolved
nestkernel/connection_manager.cpp Outdated Show resolved Hide resolved
nestkernel/connection_manager.cpp Outdated Show resolved Hide resolved
nestkernel/connection_manager.h Outdated Show resolved Hide resolved
nestkernel/source_table.cpp Show resolved Hide resolved
nestkernel/source_table.cpp Outdated Show resolved Hide resolved
nestkernel/source_table.h Show resolved Hide resolved
nestkernel/source_table.h Show resolved Hide resolved
nestkernel/event_delivery_manager.cpp Outdated Show resolved Hide resolved
@jakobj
Copy link
Contributor Author

jakobj commented Apr 22, 2021

@jakobj Travis seems happy except for PEP8 problems in pynest/nest/tests/test_spatial/test_basics.py.

i'm really unsure how to read the massive travis output, but from what i can decipher, these errors have nothing to do with this PR:

MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:42:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:49:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:60:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:89:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:164:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:251:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:287:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:310:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:357:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:375:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:409:9: E741 ambiguous variable name 'l'

i guess fixing this should be a separate PR?

@jakobj
Copy link
Contributor Author

jakobj commented Apr 22, 2021

@jakobj Please add documentation for users for the "use_spike_compression" flag to the SetKernelStatus docstring in pynest/nest/lib/hl_api_simulation.py.

Could you also add at a suitable place (connection_manager.h|cpp) some information for developers about how the compression mechanism works, so that we won't have to decode this from the code five years from now. Or is that written up elsewhere?

done.

may i point out that the duplication of information between python docstrings and c++ files is suboptimal. maybe there's some way to merge the two in the future?

@jakobj
Copy link
Contributor Author

jakobj commented Apr 22, 2021

thanks @jhnnsnk for the benchmarks and @heplesser for the comments! 🙇

i have addressed all of them, please have another look.

@heplesser
Copy link
Contributor

@jakobj I just sent you a PR that fixes the "ambiguous l". The problem popped up here because you made an unrelated change to test_basic.py and thus exposed the file to PEP8's radar.

Copy link
Contributor

@heplesser heplesser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakobj Very nice! I approve, merging should wait until Travis is happy (see also my PR to you fixing the l-problem).

@jhnnsnk
Copy link
Contributor

jhnnsnk commented Apr 23, 2021

@jakobj I just noticed that use_compressed_spikes is not accessible from the kernel status dictionary. Apparently the flag can be set ( nest.SetKernelStatus({'sort_connections_by_source': False, 'use_compressed_spikes': False} works ) but not read out ( nest.GetKernelStatus('use_compressed_spikes') fails with a KeyError). This behavior is not intended, isn't it?

@diesmann
Copy link

@jakobj I just noticed that use_compressed_spikes is not accessible from the kernel status dictionary. Apparently the flag can be set ( nest.SetKernelStatus({'sort_connections_by_source': False, 'use_compressed_spikes': False} works ) but not read out ( nest.GetKernelStatus('use_compressed_spikes') fails with a KeyError). This behavior is not intended, isn't it?

In situations where we have such flags, maybe we should have a test whether the status can be read out anyway.

@jessica-mitchell
Copy link
Contributor

may i point out that the duplication of information between python docstrings and c++ files is suboptimal. maybe there's some way to merge the two in the future?

Hi @jakobj - just piping in here to say that reducing the redundancy in information is something the documentation team is trying to work on. Thanks for pointing out a specific instance of the problem - it's especially helpful to know where in the c++ code we have this issue. A possible way to merge the two in future is that we are planning to put the developer docs on ReadtheDocs, alongside the userdocs (#1844). I think this might give us a better chance to remove some redundancy and make it easier to cross-reference information.

@heplesser
Copy link
Contributor

may i point out that the duplication of information between python docstrings and c++ files is suboptimal. maybe there's some way to merge the two in the future?

Hi @jakobj - just piping in here to say that reducing the redundancy in information is something the documentation team is trying to work on. Thanks for pointing out a specific instance of the problem - it's especially helpful to know where in the c++ code we have this issue. A possible way to merge the two in future is that we are planning to put the developer docs on ReadtheDocs, alongside the userdocs (#1844). I think this might give us a better chance to remove some redundancy and make it easier to cross-reference information.

Duplication is one thing, complementarity another. Python docstrings need to tell the user what she needs to know to use NEST features. C++ documentation tells the developer what she needs to know to understand how the code works, what data structures represent and how they interact, to be able to maintain and evolve the code. Parameter names in PyNEST may match variable names in C++ code, but that is not a law of nature, the C++ implementation could change while PyNEST stays the same, so some duplicity of information cannot reasonably be avoided. Also, when working with C++ code, you will not always want to have to look at PyNEST to find information.

@jakobj
Copy link
Contributor Author

jakobj commented Apr 27, 2021

alright, i've now implemented getting the kernel setting as suggested by @jhnnsnk and included the pep8 fixes from @heplesser. also, i've rewritten the history to be nice and clean. please have another look.

@heplesser
Copy link
Contributor

@jakobj The CI failure in the MPI_ONLY case is due to failing ticket-459.sli (the long list of failed tests at the end is a testsuite error, see #1945). I believe this may be due to a variable not properly initialised and have just sent you a PR to fix this.

@jakobj
Copy link
Contributor Author

jakobj commented Apr 28, 2021

@jakobj The CI failure in the MPI_ONLY case is due to failing ticket-459.sli (the long list of failed tests at the end is a testsuite error, see #1945). I believe this may be due to a variable not properly initialised and have just sent you a PR to fix this.

thanks for the quick fix! :)

great teamwork, thanks to everyone who has contributed to this PR. everything looks good now, doesn't it?

ps: i've tried to make the commit history a bit nicer

Copy link
Contributor

@jhnnsnk jhnnsnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve! 👍
My final benchmarks were successful: this PR greatly improves simulation times for models similar to the microcircuit. I did not encounter any issues any more.
Thank you, @jakobj, very much for your efforts!

@heplesser heplesser merged commit aa0095a into nest:master May 3, 2021
@diesmann
Copy link

diesmann commented May 3, 2021

Congratulations and a big Thanks to all of you.

@jakobj jakobj deleted the enh/com-com branch May 3, 2021 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: Normal Handle this with default priority T: Enhancement New functionality, model or documentation
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

8 participants