Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Migrate to Thrust nosync stream policy for performance. #12086

Open
1 of 12 tasks
bdice opened this issue Nov 7, 2022 · 0 comments
Open
1 of 12 tasks

[FEA] Migrate to Thrust nosync stream policy for performance. #12086

bdice opened this issue Nov 7, 2022 · 0 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@bdice
Copy link
Contributor

bdice commented Nov 7, 2022

Motivation & Description

Thrust 1.16 added an execution policy thrust::cuda::par_nosync that removes unnecessary internal stream synchronizations, except when required for correctness (e.g. if the algorithm returns a value to the host like thrust::reduce, a sync is required).

In PR #11577, I experimented with a bulk find-replace of rmm::exec_policy with rmm::exec_policy_nosync. This resulted in notable performance improvements, particularly for small data sizes. See: https://gist.github.com/bdice/bbeae4d28a45bedf0f53a13304714f70

However, a plain find-replace may leave issues with stream safety. For example, an internal detail API could be constructing host memory whose lifetime doesn't guarantee a safe copy to device without a stream sync before returning (thought experiment identified by @jrhemstad). Changing to nosync execution policies requires analysis of every use case individually, at both the detail and public API levels.

(As a reminder of the current stream policy: libcudf APIs called on the host do not guarantee that the stream is synchronized before returning, but this does not mean we can always use nosync safely.)

Tasks

I am planning to open PRs to use nosync across the libcudf codebase. Below is a list of benchmarks that showed improvements. I will need to analyze each benchmark to identify the primary algorithms being called that should be refactored to use nosync. This list is loosely sorted by largest impact for small data sizes (or whatever fixed data sizes are in the benchmark), which indicates overhead that we can systematically eliminate. Note that these performance improvements use nosync everywhere, which may not be stream-safe in all cases, so the real performance gains may be lower if not all executions can use nosync. Additionally, the performance improvements for a given algorithm may rely on improvements in other algorithms, so the full improvement may not be achieved until all tasks are complete.

Other notes:

  • I excluded I/O benchmarks from the prioritized list of algorithms above, because I/O benchmarks are a little noisy on the system I used for benchmarking.

Further work

After addressing the major tasks above where we have clearly identified speedups resulting from nosync policies, I will re-assess the rest of the code base to evaluate a broader replacement to use nosync policies everywhere in libcudf.

@bdice bdice added feature request New feature or request Needs Triage Need team to review and classify labels Nov 7, 2022
@bdice bdice self-assigned this Nov 7, 2022
@bdice bdice added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Nov 7, 2022
@GregoryKimball GregoryKimball added this to the Enable streams milestone Nov 19, 2022
@bdice bdice added the Performance Performance related issue label Jun 6, 2023
rapids-bot bot pushed a commit that referenced this issue Oct 23, 2024
Closes #17117

Related to #12086

This PR replaces the synchronous execution policy with an asynchronous one to eliminate unnecessary synchronization.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Jason Lowe (https://github.com/jlowe)
  - Nghia Truong (https://github.com/ttnghia)

URL: #17146
@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Oct 23, 2024
@GregoryKimball GregoryKimball removed this from the Enable streams milestone Oct 23, 2024
@GregoryKimball GregoryKimball moved this from Needs owner to Story Issue in libcudf Oct 29, 2024
rapids-bot bot pushed a commit that referenced this issue Dec 3, 2024
Part of #12086
Replaced `rmm::exec_policy` with `rmm::exec_policy_nosync` in write_json

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17445
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
Status: Story Issue
Development

No branches or pull requests

2 participants