prov/verbs: establishing verbs connection using the GID doesn't work #10472

Open
SnaKyEyeS opened this issue Oct 18, 2024 · 1 comment

Comments

@SnaKyEyeS

Describe the bug
Opening a domain (via a call to fi_domain) on a NIC with IPoIB disabled fails when using the verbs provider with RxM enabled. This relies on @sydidelot's feature (#5605). The likely cause is that RxM assumes dest_addr = FI_SOCKADDR rather than FI_SOCKADDR_IB.

To Reproduce
Steps to reproduce the behavior:

  • Use MPICH on a NIC where IPoIB is disabled (with FI_PROVIDER=verbs,ofi_rxm)
  • I don't have a simple reproducer, but the call that fails is fi_domain on a NIC addressed by its GID rather than via IPoIB

Expected behavior
Replacing this line with .addr_format = FI_SOCKADDR_IB makes everything work as expected, i.e. the call to fi_domain on a NIC with IPoIB disabled (thus using its GID instead) succeeds.

Output
OFI fails with ENODATA

[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_rai_id():301<warn> rdma_resolve_addr: Invalid argument (22)
[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_rai_id():303<info> src addr: fi_sockaddr_ib://[fe80::88e9:a4ff:ff1c:5860]:0xffff:0x13f:0x0
[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_rai_id():305<info> dst addr: (null)
[3] libfabric:1347797:1729070846:ofi_rxm:verbs:fabric:vrb_get_match_infos():1825<info> handling of the socket address fails - -22

Environment:
fi_info's relevant output:

fi_info:
    caps: [ FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_HMEM ]
    mode: [ FI_BUFFERED_RECV ]
    addr_format: FI_SOCKADDR_IB
    src_addrlen: 48
    dest_addrlen: 0
    src_addr: fi_sockaddr_ib://[fe80::88e9:a4ff:ff4a:997c]:0xffff:0x13f:0x0
    dest_addr: (null)
    handle: (nil)
[...]
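
For readers unfamiliar with the printed address, here is a rough sketch of how a string such as the src_addr above could be split into its parts. This is not libfabric code: sketch_ib_addr and sketch_parse_ib_addr are hypothetical names, and the meaning assigned to the three trailing hex fields (P_Key, port space, scope id) is an assumption based on my reading of how the address is printed.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for the fields of a printed fi_sockaddr_ib address. */
struct sketch_ib_addr {
	uint8_t  gid[16];   /* GID, printed in IPv6 notation */
	unsigned pkey;      /* assumption: first hex field is the P_Key */
	unsigned ps;        /* assumption: second hex field is the port space */
	unsigned scope_id;  /* assumption: third hex field is the scope id */
};

/* Parse "fi_sockaddr_ib://[GID]:0x..:0x..:0x..".
 * Returns 0 on success, -1 if the string does not match. */
int sketch_parse_ib_addr(const char *str, struct sketch_ib_addr *out)
{
	char gid_str[INET6_ADDRSTRLEN];

	assert(str != NULL && out != NULL);

	/* %45[^]] reads up to 45 characters that are not ']'. */
	if (sscanf(str, "fi_sockaddr_ib://[%45[^]]]:%x:%x:%x",
		   gid_str, &out->pkey, &out->ps, &out->scope_id) != 4)
		return -1;

	/* The GID uses IPv6 text form, so inet_pton can decode it. */
	return inet_pton(AF_INET6, gid_str, out->gid) == 1 ? 0 : -1;
}
```

Under those assumptions, the src_addr above would decode to a link-local GID (fe80::...), a first field of 0xffff, a second field of 0x13f, and a third field of 0.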
@shefty
Member

shefty commented Jan 29, 2025

Address format FI_SOCKADDR should support sockaddr_ib. That is, the address is expected to be castable to struct sockaddr, with sa_family indicating the address format. There is a wrong or missing check buried somewhere in the code that does not properly account for sockaddr_ib. I looked into the verbs code to see where the check is needed, but the code is too confusing to pinpoint where the problem might be.
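
To make the expectation concrete, here is a minimal sketch, not the actual libfabric or verbs code, of a check that dispatches on sa_family and includes AF_IB. The names sketch_sockaddr_ib and sketch_addr_len are hypothetical. The struct is a stand-in that mirrors the 48-byte layout I understand struct sockaddr_ib to have on Linux (consistent with src_addrlen: 48 in the report), and the fallback value 27 for AF_IB is an assumption used only if the system headers do not define it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef AF_IB
#define AF_IB 27 /* assumption: Linux value, used only if headers lack it */
#endif

/* Hypothetical stand-in for struct sockaddr_ib (48 bytes). */
struct sketch_sockaddr_ib {
	unsigned short sib_family;  /* AF_IB */
	uint16_t sib_pkey;
	uint32_t sib_flowinfo;
	uint8_t  sib_addr[16];      /* GID */
	uint64_t sib_sid;
	uint64_t sib_sid_mask;
	uint64_t sib_scope_id;
};

/* Return the address size implied by sa_family, or 0 if unsupported. */
size_t sketch_addr_len(const struct sockaddr *sa)
{
	assert(sa != NULL);

	switch (sa->sa_family) {
	case AF_INET:
		return sizeof(struct sockaddr_in);
	case AF_INET6:
		return sizeof(struct sockaddr_in6);
	case AF_IB: /* the case a wrong or missing check would leave out */
		return sizeof(struct sketch_sockaddr_ib);
	default:
		return 0;
	}
}
```

A check of this shape that handled only AF_INET and AF_INET6 would reject a GID-based address even though it was passed as FI_SOCKADDR, which is the kind of omission described above.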
