prov/verbs: Helgrind detected data race in rdma_create_id/rdma_destroy_id during vrb_getinfo #10736

Open
piotrchmiel opened this issue Jan 28, 2025 · 0 comments

Describe the bug
Helgrind reports a possible data race in libfabric when multiple threads concurrently call fi_getinfo, fi_fabric, and fi_domain with the verbs provider and then close the resulting objects. The conflicting accesses are unsynchronized reads and writes of shared memory in idm_set and idm_clear in librdmacm (rdma-core).

To Reproduce

  1. In a multi-threaded program, call fi_getinfo, fi_fabric, and fi_domain concurrently from multiple threads, using the verbs provider with the rxm utility provider.
  2. Call fi_close on each object opened by these calls.
  3. Use Valgrind with Helgrind to detect potential data races.
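
The steps above can be sketched as a small C harness. This is a minimal sketch under stated assumptions, not the program that produced the report: the provider string "verbs;ofi_rxm", the FI_EP_RDM endpoint type, the NULL node and service arguments, and the helper names open_close_once and run_concurrently are all illustrative choices, since the exact hints used are not given here. The libfabric calls are compiled only with -DHAVE_LIBFABRIC; without it the worker body is an empty stub, so the threading harness alone can be built on a machine without libfabric.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef HAVE_LIBFABRIC
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* One getinfo/fabric/domain/close cycle. Returns 0 on success. */
static int open_close_once(void)
{
    struct fi_info *hints, *info = NULL;
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return -1;
    /* Assumption: select rxm layered over verbs by provider name. */
    hints->fabric_attr->prov_name = strdup("verbs;ofi_rxm");
    if (!hints->fabric_attr->prov_name) {
        fi_freeinfo(hints);
        return -1;
    }
    hints->ep_attr->type = FI_EP_RDM; /* assumption */

    ret = fi_getinfo(FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION),
                     NULL, NULL, 0, hints, &info);
    if (!ret)
        ret = fi_fabric(info->fabric_attr, &fabric, NULL);
    if (!ret)
        ret = fi_domain(fabric, info, &domain, NULL);

    /* Close in reverse order of creation. */
    if (domain)
        fi_close(&domain->fid);
    if (fabric)
        fi_close(&fabric->fid);
    if (info)
        fi_freeinfo(info);
    fi_freeinfo(hints);
    return ret;
}
#else
/* Stub used when libfabric is not available at build time. */
static int open_close_once(void)
{
    return 0;
}
#endif

/* Thread body: repeat the cycle and report how many iterations failed. */
static void *worker(void *arg)
{
    int iterations = *(int *)arg;
    intptr_t failures = 0;

    for (int i = 0; i < iterations; i++)
        if (open_close_once())
            failures++;
    return (void *)failures;
}

/* Run the cycle on nthreads threads at once; returns the failure count. */
int run_concurrently(int nthreads, int iterations)
{
    pthread_t *tids = calloc((size_t)nthreads, sizeof(*tids));
    int started = 0, failures = 0;

    if (!tids)
        return -1;
    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, worker, &iterations))
            break;
    failures += nthreads - started; /* threads that never started */
    for (int i = 0; i < started; i++) {
        void *res;

        pthread_join(tids[i], &res);
        failures += (int)(intptr_t)res;
    }
    free(tids);
    return failures;
}
```

Calling run_concurrently from main (the thread and iteration counts are arbitrary), linking with -lfabric -lpthread, and running the binary under valgrind --tool=helgrind covers step 3.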

Expected behavior
No data race should occur during the execution of these functions, and the library should handle multiple concurrent accesses to shared memory correctly without causing crashes or corruption.

Output
Valgrind/Helgrind detects the following possible data race:

==2602226== Possible data race during write of size 8 at 0xEC712670 by thread #5
==2602226== Locks held: none
==2602226==    at 0x5297C7A: idm_set (indexer.c:151)
==2602226==    by 0x52912C4: ucma_insert_id (cma.c:686)
==2602226==    by 0x52916B3: rdma_create_id2 (cma.c:784)
==2602226==    by 0x5291732: rdma_create_id (cma.c:800)
==2602226==    by 0x4A63CE0: vrb_get_rai_id (prov/verbs/src/verbs_init.c:281)
==2602226==    by 0x4A7BDA8: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1797)
==2602226==    by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226==    by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226==    by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226==    by 0x4A34F0B: ofi_get_core_info (prov/util/src/util_attr.c:319)
==2602226==    by 0x4A3528F: ofix_getinfo (prov/util/src/util_attr.c:342)
==2602226==    by 0x4A8C312: rxm_getinfo (prov/rxm/src/rxm_init.c:558)
==2602226==
==2602226== This conflicts with a previous write of size 8 by thread #7
==2602226== Locks held: none
==2602226==    at 0x5297CE1: idm_clear (indexer.c:162)
==2602226==    by 0x5291312: ucma_remove_id (cma.c:693)
==2602226==    by 0x5291356: ucma_free_id (cma.c:703)
==2602226==    by 0x5291900: rdma_destroy_id (cma.c:839)
==2602226==    by 0x4A7BE19: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1810)
==2602226==    by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226==    by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226==    by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226==  Address 0xec712670 is 0 bytes inside a block of size 8,192 alloc'd
==2602226==    at 0x4852274: calloc (vg_replace_malloc.c:1675)
==2602226==    by 0x5297B9D: idm_grow (indexer.c:125)
==2602226==    by 0x5297C3A: idm_set (indexer.c:146)
==2602226==    by 0x52912C4: ucma_insert_id (cma.c:686)
==2602226==    by 0x52916B3: rdma_create_id2 (cma.c:784)
==2602226==    by 0x5291732: rdma_create_id (cma.c:800)
==2602226==    by 0x528FFDE: ucma_set_af_ib_support (cma.c:250)
==2602226==    by 0x5290726: ucma_init (cma.c:411)
==2602226==    by 0x5291553: rdma_create_id2 (cma.c:762)
==2602226==    by 0x5291732: rdma_create_id (cma.c:800)
==2602226==    by 0x4A79DAA: vrb_ifa_rdma_info (prov/verbs/src/verbs_info.c:959)
==2602226==    by 0x4A790A6: vrb_getifaddrs (prov/verbs/src/verbs_info.c:1181)
==2602226==  Block was alloc'd by thread #7
==2602226==
==2602226== ----------------------------------------------------------------
==2602226==
==2602226== Possible data race during read of size 8 at 0xEC712670 by thread #5
==2602226== Locks held: none
==2602226==    at 0x5297CC1: idm_clear (indexer.c:161)
==2602226==    by 0x5291312: ucma_remove_id (cma.c:693)
==2602226==    by 0x5291356: ucma_free_id (cma.c:703)
==2602226==    by 0x5291900: rdma_destroy_id (cma.c:839)
==2602226==    by 0x4A7BE19: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1810)
==2602226==    by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226==    by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226==    by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226==    by 0x4A34F0B: ofi_get_core_info (prov/util/src/util_attr.c:319)
==2602226==    by 0x4A3528F: ofix_getinfo (prov/util/src/util_attr.c:342)
==2602226==    by 0x4A8C312: rxm_getinfo (prov/rxm/src/rxm_init.c:558)
==2602226==    by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226==
==2602226== This conflicts with a previous write of size 8 by thread #7
==2602226== Locks held: none
==2602226==    at 0x5297CE1: idm_clear (indexer.c:162)
==2602226==    by 0x5291312: ucma_remove_id (cma.c:693)
==2602226==    by 0x5291356: ucma_free_id (cma.c:703)
==2602226==    by 0x5291900: rdma_destroy_id (cma.c:839)
==2602226==    by 0x4A7BE19: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1810)
==2602226==    by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226==    by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226==    by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226==  Address 0xec712670 is 0 bytes inside a block of size 8,192 alloc'd
==2602226==    at 0x4852274: calloc (vg_replace_malloc.c:1675)
==2602226==    by 0x5297B9D: idm_grow (indexer.c:125)
==2602226==    by 0x5297C3A: idm_set (indexer.c:146)
==2602226==    by 0x52912C4: ucma_insert_id (cma.c:686)
==2602226==    by 0x52916B3: rdma_create_id2 (cma.c:784)
==2602226==    by 0x5291732: rdma_create_id (cma.c:800)
==2602226==    by 0x528FFDE: ucma_set_af_ib_support (cma.c:250)
==2602226==    by 0x5290726: ucma_init (cma.c:411)
==2602226==    by 0x5291553: rdma_create_id2 (cma.c:762)
==2602226==    by 0x5291732: rdma_create_id (cma.c:800)
==2602226==    by 0x4A79DAA: vrb_ifa_rdma_info (prov/verbs/src/verbs_info.c:959)
==2602226==    by 0x4A790A6: vrb_getifaddrs (prov/verbs/src/verbs_info.c:1181)
==2602226==  Block was alloc'd by thread #7

Environment:

  • OS: Ubuntu 22.04 (Jammy Jellyfish)
  • glibc version: 2.35-0ubuntu3.8
  • Provider: Verbs with RXM utility provider
  • Relevant functions: fi_getinfo, fi_fabric, fi_domain, fi_close
  • rdma-core 43

Additional context
Analyzed:

Tagging @shefty, the author of the rdma-core functions mentioned above, who may be able to provide more insight into this issue:

  1. Why is ucma_insert_id protected by a mutex, but ucma_remove_id is not?
  2. Why do two threads in libfabric receive the same id from rdma-core?
  3. Why is the id returned by rdma_create_id created and destroyed so frequently?
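
To make question 1 concrete, here is a minimal sketch of an index map in which the insert and remove paths take the same mutex. It is not rdma-core's code and is not proposed as the fix: struct id_map, id_map_set, and id_map_clear are hypothetical names, and the fixed 1024-slot array is a simplification chosen to match the 8,192-byte block of 8-byte entries in the trace. Whether locking the remove path is the right change in rdma-core is for the maintainers to say.

```c
#include <pthread.h>
#include <stddef.h>

#define MAP_SLOTS 1024 /* hypothetical: 8,192 bytes / 8 bytes per entry */

/* Hypothetical index map: one lock guards every slot. */
struct id_map {
    pthread_mutex_t lock;
    void *slot[MAP_SLOTS];
};

/* Store item at index. Returns index on success, -1 if out of range. */
int id_map_set(struct id_map *m, int index, void *item)
{
    if (index < 0 || index >= MAP_SLOTS)
        return -1;
    pthread_mutex_lock(&m->lock);
    m->slot[index] = item;
    pthread_mutex_unlock(&m->lock);
    return index;
}

/*
 * Clear the slot at index and return what it held. The read and the
 * write happen under the same lock that id_map_set takes, so a
 * concurrent set on the same slot is ordered before or after both.
 */
void *id_map_clear(struct id_map *m, int index)
{
    void *item;

    if (index < 0 || index >= MAP_SLOTS)
        return NULL;
    pthread_mutex_lock(&m->lock);
    item = m->slot[index];
    m->slot[index] = NULL;
    pthread_mutex_unlock(&m->lock);
    return item;
}
```

A map declared as struct id_map m = { .lock = PTHREAD_MUTEX_INITIALIZER }; starts with every slot empty. With both paths locked, Helgrind would see a common lock held on each access to a slot, which is what the "Locks held: none" lines in the report show is missing.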

@piotrchmiel changed the title from "prov/verbs: helgrind data using rdma_create_id/rdma_destroy_id during vrb_getinfo" to "prov/verbs: Helgrind detected data race in rdma_create_id/rdma_destroy_id during vrb_getinfo" on Jan 28, 2025