Communication via compressed spikes #1338

jakobj · 2019-11-15T16:24:12Z

In the current ("5g") communication scheme if a neuron has multiple targets on different threads of a target process, spikes for all different threads are generated on the presynaptic side and send to the postsynaptic side via MPI. For example for 48threads, a neuron with targets on all threads will generate 48spikes presynaptically. This leads to unnecessarily large MPI buffers for massively multihreaded simulations (buffer sizes scale ~ #threads per process).

By introducing an additional postsynaptic data structure and the concept of a "compressed spike" this PR modifies the communication infrastructure such that this redundancy is avoided. A neuron with targets on all threads of a process hence only needs to send a single spike which is then multiplexed on the postsynaptic side, significantly reducing the load on the communication network. Initial benchmarks show significant improvements over 5g on specific benchmark instances with many threads (measured and reported by @jarsi @jhnnsnk ).
The current implementation relies on heuristics to select the connections which qualify for using compressed spikes, these should be critically reviewed.

@jhnnsnk @terhorstd

jakobj · 2019-11-15T16:27:53Z

Sketch of the modified postsynaptic infrastructure. Note that spikes can be delivered regularly (green arrow) or via the new data structure (blue arrows) depending on the number of target threads.

nestkernel/event_delivery_manager.cpp

nestkernel/source_table.cpp

H27CK · 2019-12-13T14:55:40Z

@jarsi and I have been working to verify that MPI buffer sizes are sensibly sized when compression is active. By maintaining the network size and increasing the number of virtual processes, we find that the size of MPI buffers holding compressed spikes approaches (and eventually equals) the MPI buffer size in the case of no compression. This check demonstrates, as expected, that the original behaviour, i.e. in the case of no compression, is recovered when there is no opportunity for compression.

H27CK · 2020-01-06T15:25:18Z

Can someone try running the HPC benchmark with compressed spikes. My runs consistently fail. Configuration: scale set to 10, with 5-10 nodes on Jureca, 24 MPI ranks and 2 OpenMP threads. The same occurs with 1 rank and 48 threads. This configuration is fine without compressed spikes.

jakobj · 2020-01-09T10:12:59Z

@7eia thanks for your comments and testing the branch. could you please specify how your runs fail? are you using the hpc_benchmark.sli from the examples folder?

H27CK · 2020-01-09T12:07:12Z

@jakobj yes, I'm using hpc_benchmark.sli. The runs fail silently after NodeManager::prepare_nodes [Info]. Standard error shows segmentation faults and bus errors, so nothing too helpful. This could indicate an OOM issue, but that should still be sorted.

jhnnsnk · 2020-03-05T18:18:54Z

@jakobj @H27CK I just tested hpc_benchmark.sli, the microcircuit and some other examples. I compared spike results for NEST2.20 and the branch of this PR. If spike compression is not active [local_num_threads=1 and >= 1 MPI process(es)], spike results always coincide. With >1 threads, microcircuit results are also good.
However, if both >1 threads and synapse models created with CopyModel are used (as in the case of hpc_benchmark.sli), this branch fails with a Segmentation fault.
The reason is that a dimension for different synapse types is missing, most likely in the compressed_spike_data_map_ (@jakobj already mentioned that this might become an issue). I could try to propose a fix next week.

jhnnsnk · 2020-03-26T17:35:43Z

A fix is now in: jakobj#133

jakobj · 2020-03-26T21:31:23Z

awesome, thanks you @jhnnsnk 🎉 🎉 i've merged your fix and rebased onto the current master, now everything should be up to date.

nestkernel/source_table.cpp

heplesser · 2020-05-20T08:52:08Z

I removed the WIP-tag from the title and instead turned the PR into Draft state.

jhnnsnk

I have here one tiny further suggestion on the Exception message. Apart from that, I would suggest that I perform another test run as soon as the Warnings are fixed.

nestkernel/connection_manager.cpp

jhnnsnk · 2021-04-20T14:39:40Z

Thanks, @jakobj!
I have now performed new microcircuit benchmark runs comparing this spike compression branch to its master ancestor on up to 10 nodes, and the results are only positive:

The simulations are now correct according to the local spike counter.
The simulation times (communication phase) are drastically reduced in line with previous benchmarks.
The required total memory is even slightly reduced with spike compression (in contrast to earlier implementations).

Once Travis is happy with this PR, I will approve.

heplesser · 2021-04-21T05:20:36Z

@jakobj Travis seems happy except for PEP8 problems in pynest/nest/tests/test_spatial/test_basics.py.

heplesser · 2021-04-21T14:38:25Z

@jakobj Please add documentation for users for the "use_spike_compression" flag to the SetKernelStatus docstring in pynest/nest/lib/hl_api_simulation.py.

Could you also add at a suitable place (connection_manager.h|cpp) some information for developers about how the compression mechanism works, so that we won't have to decode this from the code five years from now. Or is that written up elsewhere?

heplesser

@jakobj This looks quite fine, I mainly have suggestions for more documentation and some improvements for better code clarity.

nestkernel/connection_manager.h

nestkernel/connection_manager.cpp

nestkernel/connection_manager.h

nestkernel/source_table.cpp

nestkernel/source_table.h

nestkernel/event_delivery_manager.cpp

jakobj · 2021-04-22T15:52:11Z

@jakobj Travis seems happy except for PEP8 problems in pynest/nest/tests/test_spatial/test_basics.py.

i'm really unsure how to read the massive travis output, but from what i can decipher, these errors have nothing to do with this PR:

MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:42:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:49:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:60:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:89:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:164:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:251:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:287:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:310:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:357:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:375:9: E741 ambiguous variable name 'l'
MSGBLD0195: [PEP8] pynest/nest/tests/test_spatial/test_basics.py:409:9: E741 ambiguous variable name 'l'

i guess fixing this should be a separate PR?

jakobj · 2021-04-22T15:58:01Z

@jakobj Please add documentation for users for the "use_spike_compression" flag to the SetKernelStatus docstring in pynest/nest/lib/hl_api_simulation.py.

Could you also add at a suitable place (connection_manager.h|cpp) some information for developers about how the compression mechanism works, so that we won't have to decode this from the code five years from now. Or is that written up elsewhere?

done.

may i point out that the duplication of information between python docstrings and c++ files is suboptimal. maybe there's some way to merge the two in the future?

jakobj · 2021-04-22T16:54:49Z

thanks @jhnnsnk for the benchmarks and @heplesser for the comments! 🙇

i have addressed all of them, please have another look.

heplesser · 2021-04-22T20:50:41Z

@jakobj I just sent you a PR that fixes the "ambiguous l". The problem popped up here because you made an unrelated change to test_basic.py and thus exposed the file to PEP8's radar.

heplesser

@jakobj Very nice! I approve, merging should wait until Travis is happy (see also my PR to you fixing the l-problem).

jhnnsnk · 2021-04-23T08:07:30Z

@jakobj I just noticed that use_compressed_spikes is not accessible from the kernel status dictionary. Apparently the flag can be set ( nest.SetKernelStatus({'sort_connections_by_source': False, 'use_compressed_spikes': False} works ) but not read out ( nest.GetKernelStatus('use_compressed_spikes') fails with a KeyError). This behavior is not intended, isn't it?

diesmann · 2021-04-24T07:49:57Z

@jakobj I just noticed that use_compressed_spikes is not accessible from the kernel status dictionary. Apparently the flag can be set ( nest.SetKernelStatus({'sort_connections_by_source': False, 'use_compressed_spikes': False} works ) but not read out ( nest.GetKernelStatus('use_compressed_spikes') fails with a KeyError). This behavior is not intended, isn't it?

In situations where we have such flags, maybe we should have a test whether the status can be read out anyway.

jessica-mitchell · 2021-04-26T18:35:38Z

may i point out that the duplication of information between python docstrings and c++ files is suboptimal. maybe there's some way to merge the two in the future?

Hi @jakobj - just piping in here to say that reducing the redundancy in information is something the documentation team is trying to work on. Thanks for pointing out a specific instance of the problem - it's especially helpful to know where in the c++ code we have this issue. A possible way to merge the two in future is that we are planning to put the developer docs on ReadtheDocs, alongside the userdocs (#1844). I think this might give us a better chance to remove some redundancy and make it easier to cross-reference information.

heplesser · 2021-04-27T06:10:57Z

may i point out that the duplication of information between python docstrings and c++ files is suboptimal. maybe there's some way to merge the two in the future?

Hi @jakobj - just piping in here to say that reducing the redundancy in information is something the documentation team is trying to work on. Thanks for pointing out a specific instance of the problem - it's especially helpful to know where in the c++ code we have this issue. A possible way to merge the two in future is that we are planning to put the developer docs on ReadtheDocs, alongside the userdocs (#1844). I think this might give us a better chance to remove some redundancy and make it easier to cross-reference information.

Duplication is one thing, complementarity another. Python docstrings need to tell the user what she needs to know to use NEST features. C++ documentation tells the developer what she needs to know to understand how the code works, what data structures represent and how they interact, to be able to maintain and evolve the code. Parameter names in PyNEST may match variable names in C++ code, but that is not a law of nature, the C++ implementation could change while PyNEST stays the same, so some duplicity of information cannot reasonably be avoided. Also, when working with C++ code, you will not always want to have to look at PyNEST to find information.

jakobj · 2021-04-27T15:13:27Z

alright, i've now implemented getting the kernel setting as suggested by @jhnnsnk and included the pep8 fixes from @heplesser. also, i've rewritten the history to be nice and clean. please have another look.

heplesser · 2021-04-27T21:17:58Z

@jakobj The CI failure in the MPI_ONLY case is due to failing ticket-459.sli (the long list of failed tests at the end is a testsuite error, see #1945). I believe this may be due to a variable not properly initialised and have just sent you a PR to fix this.

jakobj · 2021-04-28T17:37:22Z

@jakobj The CI failure in the MPI_ONLY case is due to failing ticket-459.sli (the long list of failed tests at the end is a testsuite error, see #1945). I believe this may be due to a variable not properly initialised and have just sent you a PR to fix this.

thanks for the quick fix! :)

great teamwork, thanks to everyone who has contributed to this PR. everything looks good now, doesn't it?

ps: i've tried to make the commit history a bit nicer

jhnnsnk

I approve! 👍
My final benchmarks were successful: this PR greatly improves simulation times for models similar to the microcircuit. I did not encounter any issues any more.
Thank you, @jakobj, very much for your efforts!

diesmann · 2021-05-03T07:00:46Z

Congratulations and a big Thanks to all of you.

terhorstd requested review from H27CK and jhnnsnk November 18, 2019 10:32

H27CK reviewed Dec 2, 2019

View reviewed changes

nestkernel/event_delivery_manager.cpp Outdated Show resolved Hide resolved

H27CK reviewed Dec 2, 2019

View reviewed changes

nestkernel/event_delivery_manager.cpp Show resolved Hide resolved

H27CK reviewed Dec 2, 2019

View reviewed changes

nestkernel/source_table.cpp Outdated Show resolved Hide resolved

H27CK reviewed Dec 2, 2019

View reviewed changes

nestkernel/source_table.cpp Outdated Show resolved Hide resolved

heplesser changed the title ~~Communication via compressed spikes~~ [WIP] Communication via compressed spikes Dec 9, 2019

jakobj force-pushed the enh/com-com branch from ce1c37a to b6246d3 Compare March 26, 2020 21:29

jakobj commented Mar 26, 2020

View reviewed changes

nestkernel/source_table.cpp Outdated Show resolved Hide resolved

heplesser assigned jakobj Apr 7, 2020

heplesser removed ZC: Kernel DO NOT USE THIS LABEL ZP: PR Created DO NOT USE THIS LABEL labels Apr 7, 2020

heplesser changed the title ~~[WIP] Communication via compressed spikes~~ Communication via compressed spikes May 20, 2020

heplesser marked this pull request as draft May 20, 2020 08:51

jhnnsnk suggested changes Apr 19, 2021

View reviewed changes

nestkernel/connection_manager.cpp Outdated Show resolved Hide resolved

nestkernel/connection_manager.cpp Outdated Show resolved Hide resolved

heplesser requested changes Apr 21, 2021

View reviewed changes

heplesser approved these changes Apr 22, 2021

View reviewed changes

jakobj force-pushed the enh/com-com branch from 0696194 to d7113fc Compare April 27, 2021 15:12

jakobj and others added 3 commits April 28, 2021 19:41

Implement compressed spikes

b8f978e

Replaced "l" by "layer".

bcd8cbf

Correctly initialize ArchivingNode::trace_.

07e1b46

jakobj force-pushed the enh/com-com branch from 365b15b to 07e1b46 Compare April 28, 2021 17:42

heplesser requested a review from jhnnsnk April 30, 2021 12:40

jhnnsnk approved these changes May 1, 2021

View reviewed changes

heplesser merged commit aa0095a into nest:master May 3, 2021

jakobj deleted the enh/com-com branch May 3, 2021 15:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Communication via compressed spikes #1338

Communication via compressed spikes #1338

jakobj commented Nov 15, 2019

jakobj commented Nov 15, 2019

H27CK commented Dec 13, 2019

H27CK commented Jan 6, 2020

jakobj commented Jan 9, 2020

H27CK commented Jan 9, 2020

jhnnsnk commented Mar 5, 2020

jhnnsnk commented Mar 26, 2020

jakobj commented Mar 26, 2020

heplesser commented May 20, 2020

jhnnsnk left a comment

jhnnsnk commented Apr 20, 2021

heplesser commented Apr 21, 2021

heplesser commented Apr 21, 2021

heplesser left a comment

jakobj commented Apr 22, 2021

jakobj commented Apr 22, 2021

jakobj commented Apr 22, 2021

heplesser commented Apr 22, 2021

heplesser left a comment

jhnnsnk commented Apr 23, 2021

diesmann commented Apr 24, 2021

jessica-mitchell commented Apr 26, 2021

heplesser commented Apr 27, 2021

jakobj commented Apr 27, 2021

heplesser commented Apr 27, 2021

jakobj commented Apr 28, 2021 •

edited

Loading

jhnnsnk left a comment

diesmann commented May 3, 2021

Communication via compressed spikes #1338

Communication via compressed spikes #1338

Conversation

jakobj commented Nov 15, 2019

jakobj commented Nov 15, 2019

H27CK commented Dec 13, 2019

H27CK commented Jan 6, 2020

jakobj commented Jan 9, 2020

H27CK commented Jan 9, 2020

jhnnsnk commented Mar 5, 2020

jhnnsnk commented Mar 26, 2020

jakobj commented Mar 26, 2020

heplesser commented May 20, 2020

jhnnsnk left a comment

Choose a reason for hiding this comment

jhnnsnk commented Apr 20, 2021

heplesser commented Apr 21, 2021

heplesser commented Apr 21, 2021

heplesser left a comment

Choose a reason for hiding this comment

jakobj commented Apr 22, 2021

jakobj commented Apr 22, 2021

jakobj commented Apr 22, 2021

heplesser commented Apr 22, 2021

heplesser left a comment

Choose a reason for hiding this comment

jhnnsnk commented Apr 23, 2021

diesmann commented Apr 24, 2021

jessica-mitchell commented Apr 26, 2021

heplesser commented Apr 27, 2021

jakobj commented Apr 27, 2021

heplesser commented Apr 27, 2021

jakobj commented Apr 28, 2021 • edited Loading

jhnnsnk left a comment

Choose a reason for hiding this comment

diesmann commented May 3, 2021

jakobj commented Apr 28, 2021 •

edited

Loading