
Parallelization of ConstProp compilation #3042

Open · wants to merge 14 commits into base: main
Conversation

@imaihal (Collaborator) commented Jan 14, 2025

To accelerate compilation time, this PR parallelizes the compilation of ConstProp using the parallelFor() API in MLIR. This mainly improves constant propagation for reduction computations.

@imaihal marked this pull request as ready for review January 17, 2025 06:29
@sorenlassen (Member) left a comment


LGTM

src/Dialect/ONNX/ElementsAttr/ElementsAttrBuilder.hpp (outdated, resolved)
src/Dialect/ONNX/ElementsAttr/ElementsAttrBuilder.hpp (outdated, resolved)
src/Dialect/ONNX/ElementsAttr/ElementsAttrBuilder.cpp (outdated, resolved)
}
}
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
});
}
A collaborator left a comment:


I assume the work above performs batch.size() independent reductions, all run in parallel.

Since quantization includes "whole tensor" quantization, there are cases with only 1 reduction. That can also be parallelized: say you have 1000 elements and 10 threads. Each thread processes its own 100 numbers and saves its result in its own slot in an array of 10 partial sums. Then, after the parallel region, just reduce those 10 values sequentially. You will still get a near 10x speedup.

Also, should we run sequentially when batch.size() is small? That would probably help when we have a few very small tensors. You can easily print the sizes to stderr for a few benchmarks and see whether such cases occur.

src/Dialect/ONNX/ElementsAttr/ElementsAttrBuilder.hpp (outdated, resolved)
for (WideNum &n : batch)
n = fun(n);
};
parallelFor(ctx, 0, ctx->getNumThreads(), work);
A collaborator left a comment:


As mentioned before, please check that there is enough work to justify going parallel. I suspect that if the reduction is very small, we really want to do it sequentially, and that will be faster.

@AlexandreEichenberger (Collaborator) commented:

@imaihal please ping me when you have implemented the changes, and I will review it again. Thanks for working on accelerating the compiler; much appreciated.

If you know of other opportunities that are not exploited yet, maybe you can add a "todo" in the code or in the description of this PR so that we don't lose such opportunities.

Signed-off-by: Haruki Imai <[email protected]>
@AlexandreEichenberger (Collaborator) left a comment


All the changes look good to me now.

Do you want to test whether there are cases with very little work where the code should run sequentially? Maybe report here if you have seen this in benchmarks?

I approved the PR, but it would be good to know whether we should do some of the work sequentially. This PR could help 90% of the cases but hurt performance for small ones, so while it may look good in general, we would still leave performance on the table by not handling the small case sequentially.

@imaihal (Collaborator, Author) commented Jan 28, 2025

@AlexandreEichenberger

Do you want to test if there are cases where there is very little work and the code would go in sequential mode? Maybe report here if you have seen this in benchmarks?

Yes. I will test and add a threshold (e.g., if the loop length is small, it should run in sequential mode).
Also, I will add the test code you suggested in another PR.

@AlexandreEichenberger (Collaborator) left a comment


LGTM, thanks for adding the new algorithm for selecting the batch bounds.

Will you add a lit test to make sure the parallel version of the algorithm works?

@imaihal (Collaborator, Author) commented Feb 4, 2025

@AlexandreEichenberger

Will you add a lit test to make sure parallel version of the algo works?

I added a threshold to avoid parallelization for small inputs. I also added lit tests. Thanks!
