Generalize matrix multiply to support C = f.(A * B + x)
#56
Comments
I'm assuming that here
Yes, in the example. Although of course, I think this is a compelling use case for Octavian. While performance is close to MKL at smallish sizes, I think we'd have a nice advantage over the current dense layer implementation, which looks like this on the forward pass:

```julia
C = A * B
@. C = f(C .+ x)
```

Note that the broadcast is single threaded. I imagine we could get a nice performance advantage over MKL at the sizes people would actually consider for CPU training, and definitely over the default OpenBLAS, through this fusion and by threading the entire operation.
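For reference, here is a minimal plain-Julia sketch of the semantics being proposed, not the fused, threaded kernel itself; `dense_forward!` is an illustrative name, and treating `x` as a per-row bias vector is an assumption for the example:

```julia
using LinearAlgebra

# Reference semantics for C = f.(A * B .+ x) (illustration only): a fused
# version would apply the bias and activation inside the matmul kernel's
# write-out, instead of in a separate single-threaded broadcast afterwards.
function dense_forward!(f, C, A, B, x)
    mul!(C, A, B)                        # C = A * B
    @inbounds for j in axes(C, 2), i in axes(C, 1)
        C[i, j] = f(C[i, j] + x[i])      # assumes x is a per-row bias vector
    end
    return C
end

# Example usage with an arbitrary elementwise activation:
A = randn(Float32, 128, 64); B = randn(Float32, 64, 32)
x = randn(Float32, 128);     C = Matrix{Float32}(undef, 128, 32)
dense_forward!(tanh, C, A, B, x)
```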
This is really exciting!

Two questions:
Yes and yes. The goal would be to just launch threads once per pass (i.e., once for forward, and once for back).
This is for the sake of implementing dense layers for neural networks.
The reverse pass also needs `g.(Cbar) * B'` and `A' * g.(Cbar)`, but the fact that we have two instances of `g.(Cbar)` here means we should probably evaluate `g` just once per element of `Cbar`.

However, perhaps somewhat related to #40, we should add support for batching matrix operations of different sizes as well -- in particular, the reverse pass of `C = f.(A * B + x)` should perhaps be evaluated with one function call that can minimize the (already low) threading and synchronization overhead.
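A minimal sketch of that reverse pass in plain Julia, assuming `Cbar` is the gradient with respect to `C`; `dense_reverse` is an illustrative name, not an Octavian API:

```julia
using LinearAlgebra

# Sketch of the reverse pass described above (illustration only):
# evaluate g exactly once per element of Cbar, then reuse the result
# for both products in a single call.
function dense_reverse(g, Cbar, A, B)
    G = g.(Cbar)       # g evaluated once per element of Cbar
    Abar = G * B'      # gradient with respect to A
    Bbar = A' * G      # gradient with respect to B
    return Abar, Bbar
end
```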
Ideally, we'd have an API for doing this a little more generically, but it'd help for allocating threads to know that a lot of the array dimensions here are the same.
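One purely hypothetical shape such a batched call could take (none of these names exist in Octavian; this only illustrates paying the threading and synchronization cost once for several independent products):

```julia
using LinearAlgebra, Base.Threads

# Hypothetical batched interface: evaluate several independent matmuls,
# possibly of different sizes, under a single threaded call so scheduling
# overhead is incurred once rather than once per product.
function batched_matmul!(triples::Vector{<:Tuple})
    @threads for k in eachindex(triples)
        C, A, B = triples[k]
        mul!(C, A, B)        # stand-in for the per-task kernel
    end
    return triples
end
```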