Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SP compute() const, thread safe #560

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from
Draft

SP compute() const, thread safe #560

wants to merge 10 commits into from

Conversation

breznak
Copy link
Member

@breznak breznak commented Jul 12, 2019

This PR will benefit from boosting cleanup #544 , is part of parallelization effort #214 and will make PR parallel MNIST usable #559 .

The concept is to make SP thread-safe, suitable for parallel execution. This is done by making SP.compute() const. Eventually that is not feasible, as SP.adaptSegments() calls Conn.updateSynapsePermanence() which cannot be const. But it serves as proof of concepts and shows what can be const, and what needs to be synced.

Note that parallelizing SP.compute() like this won't be useful, only in MNIST #559 , in the end we want parallelized SP.compute() - the inside of it. Likely inhibition.

Takeaways:

  • focus more on functional programming. Functions are const and pass needed arguments. Less/none state of the object (this.numColumns_). Connections work like this.
  • SP.adaptSegment will not be const -> needs lock
  • research atomic, thread-safe data structures. Eg. Folly?
  • efficient parallelism is likely incompatible with deterministic builds. So for single-thread we get that, of for parallel we get speed.

@breznak
Copy link
Member Author

breznak commented Jul 12, 2019

@marty1885 FYI, this is my take on c++17 TS parallel. Do you have any advice on atomic, thread safe data-structures? (Ive worked a bit with facebook/folly) to avoid need of locks?

@marty1885
Copy link

marty1885 commented Jul 12, 2019

@breznak The main reason I made all compute() functions in Etaler const is so that parallelizing then becomes trivial. And the DoD/functional approach also helps in a huge way.

Some guidelines:

  • const functions are awesome to parallelize
  • avoid atomics and mutex unless needed. They are costly
  • paralleling programming is easy as long as you don't care about speed

Copy link
Member Author

@breznak breznak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review if you are interested.

  • don't take too seriously the SP.compute() const
  • but the other changes I'd like to implement, as this is nicer, safer, faster.


void computeActivity(std::vector<SynapseIdx> &numActiveConnectedSynapsesForSegment,
const std::vector<CellIdx> &activePresynapticCells);
const std::vector<CellIdx> &activePresynapticCells) const;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connections::computeActivity can be made const. With the "fake" (mutable) that for timeseries_ the curent/previousUpdates are mutable & not locked (I may use mutex, or just let it be for now)

@@ -176,18 +176,6 @@ void SpatialPooler::setBoostStrength(Real boostStrength) {
boostStrength_ = boostStrength;
}

UInt SpatialPooler::getIterationNum() const { return iterationNum_; }

void SpatialPooler::setIterationNum(UInt iterationNum) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SP removed iteration stuff. not really needed. And connections will have the iteration from #537

@@ -462,16 +441,18 @@ void SpatialPooler::initialize(
}


void SpatialPooler::compute(const SDR &input, const bool learn, SDR &active) {
vector<SynapseIdx> SpatialPooler::compute(const SDR &input, const bool learn, SDR &active) const {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP compute() is const.
But that will not be feasible. (adaptSegment -> conn.updateSynapsePermanence)

@@ -934,7 +901,7 @@ void SpatialPooler::inhibitColumnsLocal_(const vector<Real> &overlaps,


bool SpatialPooler::isUpdateRound_() const {
return (iterationNum_ % updatePeriod_) == 0;
return (rng_.getReal64() < 1.0/updatePeriod_); //approx every updatePeriod steps
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is one place where iterationNum_ was needed, I have replaced with probability, which I think is nicer approach.

// calls to RNG
std::mt19937 gen; //Standard mersenne_twister_engine 64bit seeded with seed_
mutable std::mt19937 gen; //Standard mersenne_twister_engine 64bit seeded with seed_
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all Random methods can be made const (thread safe), with the exception in mutable here. Result, will work OK, but for effective parallel algorithm, we likely won't have deterministic builds. So when finished, I imagine we run parallel by default, but when stuff breaks, we switch to single thread and deterministic operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marty1885 thank you for your advice. What do you think of the mutable in Random here? Shouldn't make too much damage if some collision messed the state.

but it results in the non-deterministic builds. (Currently we support set seed and deterministic operation for the whole HTM, which is cool for some usages).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll say that using a shared RNG across threads is not a good idea. Besides making things nondeterministic. It also leads to false sharing and a tremendous amount of cache invalidation. (Even though I do share a RNG across threads in Etaler)

One solution is to have a master RNG, and then you use the master RNG to seed light weight RNGs local to each thread. This will make your results deterministic as long as you have the same # of thread.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: you can have a master RNG, then use the master RNG to seed a secondary RNG local to each function. Then use the secondary to seed thread-local RNGs. This way you can have deterministic results no matter the amount of threads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PRs that prepare for this functionality, like #552 , I'm removing dependency on local variables / state. That is a good thing overall. With the suggested approach as above, that is "passing all that is needed as parameters" I need to balance convenience and usability of the public API (for single threaded use), I don't want users who want to call SDR result = SP.compute(input) to need to pass 10s parameters (Random, ...).

So I'll need to figure some compromise. Maybe function overloads for user,parallel? compute(simple, params) x compute(many,needed,for,parallel)

Also whole SP.compute is quite bad idea to parallelize, rather some inner methods: inhibition, overlaps,...

@breznak
Copy link
Member Author

breznak commented Jul 12, 2019

avoid atomics and mutex unless needed. They are costly

will keep in mind

paralleling programming is easy as long as you don't care about speed

this makes the code more complex, I'm doing it only for the speed.

@breznak
Copy link
Member Author

breznak commented Jul 13, 2019

Follow up,

  • should I be consistent in the meaning const = 100% thread-safe? Like not abusing mutable, so revert the const changes to Random?

Thread-count management

How do you deal with some heuristic on when to parallelize?
There are 2 cases:

  • too small loops: if num columns > 10.000, it'd make sense, but for 100s a single thread would be faster.
  • total number of threads used by HTM. On CPUs I'm much more limited by available parallel computation resources. Say for a 2 thread system: If the task is MNIST, only SP, then I can parallelize some computation. If I use a Network (SP+TM+...) then I'm better off only running each of them in a single thread.

In real world it'll be something inbetween, say newest AMD Zen CPUs with 32threads. If I have HTM with SP-TM-SP-TM. I want to give 32/4 threads to each object.

  • now I realize c++17 ParallelTS/TBB is not able to specify the number of threads the loop should use. Is this already done some smart way?

    • I will want to implement some control over threading, so for debug we can disable it, probably in a header define mode = std::execution_policy::par and redifine it to ::sequential if parallelization should be disabled
  • parallelizing SP.compute is probably pointless (as it'd be only used in MNIST and the likes, not for sequential usecases) (but this PR is excercise for me to get on with the TBB)

    • should I focus on parallelizing stuff inside SP-compute? ie the inhibition?
      • or focus solely on making Connections (which is the workhorse) work seamlessly with threads?

@marty1885
Copy link

marty1885 commented Jul 13, 2019

Regarding to managing the amount of threads.
TBB can do so. Please refer to: https://stackoverflow.com/questions/3786408/number-of-threads-used-by-intel-tbb

And the problem of to too much threads. My solution in Etaler (not implemented) is to have a NUMA aware backend. For example. A 32 thread ThreadRipper is made of 4 NUMA nodes. And the InfinityFabric link is the main bottleneck. So just make your backend run on one of the NUMA modes. And spawn multiple of them.
Ref: etaler/Etaler#14

@breznak
Copy link
Member Author

breznak commented Jul 13, 2019

Regarding to managing the amount of threads.
TBB can do so. Please refer to: https://stackoverflow.com/questions/3786408/number-of-threads-used-by-intel-tbb

Thank you for this link! I assume Parallel TS is TBB, so I can use this.
For the hack I described (re)defining mode to switch par/seq, I guess the manager is that smart to use seq when #threads =1

@marty1885
Copy link

No, Parallel TS is not TBB. They do look similar, but they are not the same thing.

@breznak
Copy link
Member Author

breznak commented Jul 13, 2019

No, Parallel TS is not TBB.

oh, I'm aiming for the c++17 standard. Seems the whole parallel_for is TBB specific, I should use transform

@breznak
Copy link
Member Author

breznak commented Jul 13, 2019

OT It looks like 3+ years, but I like where c++ is heading with concurrency, parallelism and heterogenous computing :
https://www.codeplay.com/portal/10-06-17-whats-in-cpp-20-and-cpp17-final-score-card

It looks like eventually we'll be able to run on GPUs and even FPGAs.

@marty1885
Copy link

marty1885 commented Jul 13, 2019

My experiences with FPGA tells me to not expect that to work well. FPGA is difficult to optimize and takes forever to compile (> 6Hr compilation time for a mid-range FPGA and a 10K compiler license fee)

The GPU part sounds interesting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants