Add IB verbs logging and enable traces through install.sh #1511

Merged
merged 10 commits into from
Jan 31, 2025

@mustafabar mustafabar commented Jan 28, 2025

Details

This PR adds logging of IB Verbs calls so that work requests can be traced at the network level.

Work item: Internal.

What were the changes?

  1. Enable logging of every ibv_post_send call together with its underlying QP, source and destination NIC, message length, etc.
  2. Add VERBS as a NCCL_DEBUG_SUBSYS type. The NET subsystem has broader coverage and adds a lot of unneeded output when the goal is only to view IB verbs call traces.
  3. Add the remote IBV device index to the metadata struct ncclIbDevInfo so that the sender side can show both the sender and receiver NIC indices.
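As a rough illustration of changes 1 and 2 together, the sketch below formats one trace line for a posted send and emits it only when a VERBS bit is set in the enabled-subsystem mask. It is not the code in this PR: the bit values, the `PostSendInfo` struct, the `formatPostSend` helper, and the output format are all hypothetical names chosen for the example.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical subsystem bits; the real values are defined in RCCL's headers.
enum DebugSubsys : uint64_t {
  SUBSYS_NET   = 1ull << 0,
  SUBSYS_VERBS = 1ull << 1,
};

// Hypothetical summary of one posted send work request.
struct PostSendInfo {
  uint32_t qpNum;     // queue pair number the request was posted on
  int      localDev;  // sender-side IB device index
  int      remoteDev; // receiver-side IB device index (exchanged at connect time)
  uint32_t length;    // message length in bytes
  bool     isFifo;    // true for a FIFO post, false for a data post
};

// Returns the trace line, or an empty string when VERBS is not enabled in the mask.
std::string formatPostSend(uint64_t enabledMask, const PostSendInfo& info) {
  if ((enabledMask & SUBSYS_VERBS) == 0) return "";
  char buf[160];
  std::snprintf(buf, sizeof(buf),
                "ibv_post_send %s qp=%u dev=%d -> remote_dev=%d len=%u",
                info.isFifo ? "fifo" : "data",
                static_cast<unsigned>(info.qpNum), info.localDev,
                info.remoteDev, static_cast<unsigned>(info.length));
  return buf;
}
```

Keeping VERBS as its own bit means a run filtered to that subsystem does not also pull in the rest of the NET output.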

Why were the changes made?
Many users want to understand how RCCL interacts with the network fabric at a fine-grained level. This change also makes it possible to add logging for other IB verbs interactions incrementally.

How was the outcome achieved?
By adding TRACE calls for both FIFO and data ibv_post_send calls, and by adding a flag to the install script (--log-trace) that enables traces.
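A minimal sketch of why a build-time switch is needed for the traces, assuming that trace calls are compiled out unless a define is set at build time. The define `SKETCH_ENABLE_TRACE`, the macro `SKETCH_TRACE`, and the functions `traceLog` and `postSendSketch` are hypothetical; how --log-trace maps onto RCCL's actual build options is not shown here.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sink standing in for the library's debug logger.
std::vector<std::string> g_traceLines;

void traceLog(const char* fmt, ...) {
  char buf[256];
  va_list args;
  va_start(args, fmt);
  std::vsnprintf(buf, sizeof(buf), fmt, args);
  va_end(args);
  g_traceLines.emplace_back(buf);
}

// SKETCH_ENABLE_TRACE is a hypothetical build-time define. Without it the
// macro expands to nothing, so trace call sites cost nothing at runtime.
#ifdef SKETCH_ENABLE_TRACE
#define SKETCH_TRACE(...) traceLog(__VA_ARGS__)
#else
#define SKETCH_TRACE(...) do {} while (0)
#endif

// Hypothetical call site: the real post would happen here; only the trace is shown.
int postSendSketch(unsigned qpNum, unsigned length) {
  (void)qpNum;   // silence unused warnings when tracing is compiled out
  (void)length;
  SKETCH_TRACE("ibv_post_send qp=%u len=%u", qpNum, length);
  return 0;
}
```

Compiling the sketch with `-DSKETCH_ENABLE_TRACE` records one line per call; compiling without it leaves `g_traceLines` empty.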

Additional Documentation:
What else should the reviewer know?

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@mustafabar mustafabar changed the title Add verbs logging and enable traces through install.sh Add IB verbs logging and enable traces through install.sh Jan 28, 2025
@mustafabar mustafabar marked this pull request as ready for review January 30, 2025 18:44
@mustafabar mustafabar merged commit dc75209 into ROCm:develop Jan 31, 2025
20 of 21 checks passed