Describe the bug
A race condition has been detected in libfabric when multiple threads concurrently call fi_getinfo, fi_fabric, and fi_domain using the verbs provider and then close the related objects. Helgrind reports a possible data race on shared memory accessed by functions such as idm_set and idm_clear in librdmacm (rdma-core).
To Reproduce
In a multi-threaded environment, call fi_getinfo, fi_fabric, and fi_domain concurrently from multiple threads, using the verbs provider with the rxm utility provider.
Then call fi_close on the objects opened by these calls.
Run the program under Valgrind's Helgrind tool to detect potential data races.
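The steps above could be sketched roughly as follows. This is not the original reproducer: the thread count, the `FI_VERSION(1, 7)` API version, the `FI_EP_RDM` endpoint type, and the `verbs;ofi_rxm` provider string are assumptions, and `open_and_close` and `run_repro` are hypothetical helper names. The libfabric calls are compiled only when `HAVE_LIBFABRIC` (also a made-up macro) is defined, so the harness still builds on a machine without libfabric.

```c
/* Build with libfabric:  cc -DHAVE_LIBFABRIC repro.c -lfabric -lpthread
 * Without -DHAVE_LIBFABRIC the fabric calls are skipped and only the
 * thread harness runs. */
#include <pthread.h>
#include <string.h>
#ifdef HAVE_LIBFABRIC
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>
#endif

#define NUM_THREADS 8 /* assumption: the report does not give a count */

/* Per-thread body: fi_getinfo -> fi_fabric -> fi_domain -> fi_close. */
static void *open_and_close(void *arg)
{
    int *status = arg;
    *status = 0;
#ifdef HAVE_LIBFABRIC
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;

    if (!hints) {
        *status = -FI_ENOMEM;
        return NULL;
    }
    /* Assumed provider string for verbs layered under rxm. */
    hints->fabric_attr->prov_name = strdup("verbs;ofi_rxm");
    hints->ep_attr->type = FI_EP_RDM;

    *status = fi_getinfo(FI_VERSION(1, 7), NULL, NULL, 0, hints, &info);
    if (*status == 0)
        *status = fi_fabric(info->fabric_attr, &fabric, NULL);
    if (*status == 0)
        *status = fi_domain(fabric, info, &domain, NULL);

    /* Close in reverse order of creation. */
    if (domain)
        fi_close(&domain->fid);
    if (fabric)
        fi_close(&fabric->fid);
    if (info)
        fi_freeinfo(info);
    fi_freeinfo(hints); /* also frees prov_name */
#endif
    return NULL;
}

/* Starts the threads, joins them, and returns how many failed. */
int run_repro(void)
{
    pthread_t threads[NUM_THREADS];
    int status[NUM_THREADS];
    int started = 0, failures = 0;

    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, open_and_close, &status[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        if (status[i] != 0)
            failures++;
    }
    return failures + (NUM_THREADS - started);
}
```

Calling `run_repro()` from `main` and running the binary as `valgrind --tool=helgrind ./repro` should exercise the same call path as the trace below, provided a verbs-capable device is present.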
Expected behavior
No data race should occur during the execution of these functions, and the library should handle multiple concurrent accesses to shared memory correctly without causing crashes or corruption.
Output
Valgrind/Helgrind detects the following possible data race:
==2602226== Possible data race during write of size 8 at 0xEC712670 by thread #5
==2602226== Locks held: none
==2602226== at 0x5297C7A: idm_set (indexer.c:151)
==2602226== by 0x52912C4: ucma_insert_id (cma.c:686)
==2602226== by 0x52916B3: rdma_create_id2 (cma.c:784)
==2602226== by 0x5291732: rdma_create_id (cma.c:800)
==2602226== by 0x4A63CE0: vrb_get_rai_id (prov/verbs/src/verbs_init.c:281)
==2602226== by 0x4A7BDA8: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1797)
==2602226== by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226== by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226== by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226== by 0x4A34F0B: ofi_get_core_info (prov/util/src/util_attr.c:319)
==2602226== by 0x4A3528F: ofix_getinfo (prov/util/src/util_attr.c:342)
==2602226== by 0x4A8C312: rxm_getinfo (prov/rxm/src/rxm_init.c:558)
==2602226==
==2602226== This conflicts with a previous write of size 8 by thread #7
==2602226== Locks held: none
==2602226== at 0x5297CE1: idm_clear (indexer.c:162)
==2602226== by 0x5291312: ucma_remove_id (cma.c:693)
==2602226== by 0x5291356: ucma_free_id (cma.c:703)
==2602226== by 0x5291900: rdma_destroy_id (cma.c:839)
==2602226== by 0x4A7BE19: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1810)
==2602226== by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226== by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226== by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226== Address 0xec712670 is 0 bytes inside a block of size 8,192 alloc'd
==2602226== at 0x4852274: calloc (vg_replace_malloc.c:1675)
==2602226== by 0x5297B9D: idm_grow (indexer.c:125)
==2602226== by 0x5297C3A: idm_set (indexer.c:146)
==2602226== by 0x52912C4: ucma_insert_id (cma.c:686)
==2602226== by 0x52916B3: rdma_create_id2 (cma.c:784)
==2602226== by 0x5291732: rdma_create_id (cma.c:800)
==2602226== by 0x528FFDE: ucma_set_af_ib_support (cma.c:250)
==2602226== by 0x5290726: ucma_init (cma.c:411)
==2602226== by 0x5291553: rdma_create_id2 (cma.c:762)
==2602226== by 0x5291732: rdma_create_id (cma.c:800)
==2602226== by 0x4A79DAA: vrb_ifa_rdma_info (prov/verbs/src/verbs_info.c:959)
==2602226== by 0x4A790A6: vrb_getifaddrs (prov/verbs/src/verbs_info.c:1181)
==2602226== Block was alloc'd by thread #7
==2602226==
==2602226== ----------------------------------------------------------------
==2602226==
==2602226== Possible data race during read of size 8 at 0xEC712670 by thread #5
==2602226== Locks held: none
==2602226== at 0x5297CC1: idm_clear (indexer.c:161)
==2602226== by 0x5291312: ucma_remove_id (cma.c:693)
==2602226== by 0x5291356: ucma_free_id (cma.c:703)
==2602226== by 0x5291900: rdma_destroy_id (cma.c:839)
==2602226== by 0x4A7BE19: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1810)
==2602226== by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226== by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226== by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226== by 0x4A34F0B: ofi_get_core_info (prov/util/src/util_attr.c:319)
==2602226== by 0x4A3528F: ofix_getinfo (prov/util/src/util_attr.c:342)
==2602226== by 0x4A8C312: rxm_getinfo (prov/rxm/src/rxm_init.c:558)
==2602226== by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226==
==2602226== This conflicts with a previous write of size 8 by thread #7
==2602226== Locks held: none
==2602226== at 0x5297CE1: idm_clear (indexer.c:162)
==2602226== by 0x5291312: ucma_remove_id (cma.c:693)
==2602226== by 0x5291356: ucma_free_id (cma.c:703)
==2602226== by 0x5291900: rdma_destroy_id (cma.c:839)
==2602226== by 0x4A7BE19: vrb_handle_sock_addr (prov/verbs/src/verbs_info.c:1810)
==2602226== by 0x4A787E0: vrb_get_match_infos (prov/verbs/src/verbs_info.c:1832)
==2602226== by 0x4A77FE6: vrb_getinfo (prov/verbs/src/verbs_info.c:1904)
==2602226== by 0x4A0547E: fi_getinfo@@FABRIC_1.7 (src/fabric.c:1365)
==2602226== Address 0xec712670 is 0 bytes inside a block of size 8,192 alloc'd
==2602226== at 0x4852274: calloc (vg_replace_malloc.c:1675)
==2602226== by 0x5297B9D: idm_grow (indexer.c:125)
==2602226== by 0x5297C3A: idm_set (indexer.c:146)
==2602226== by 0x52912C4: ucma_insert_id (cma.c:686)
==2602226== by 0x52916B3: rdma_create_id2 (cma.c:784)
==2602226== by 0x5291732: rdma_create_id (cma.c:800)
==2602226== by 0x528FFDE: ucma_set_af_ib_support (cma.c:250)
==2602226== by 0x5290726: ucma_init (cma.c:411)
==2602226== by 0x5291553: rdma_create_id2 (cma.c:762)
==2602226== by 0x5291732: rdma_create_id (cma.c:800)
==2602226== by 0x4A79DAA: vrb_ifa_rdma_info (prov/verbs/src/verbs_info.c:959)
==2602226== by 0x4A790A6: vrb_getifaddrs (prov/verbs/src/verbs_info.c:1181)
==2602226== Block was alloc'd by thread #7
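Both reports show two threads touching the same 8-byte slot with "Locks held: none". As an illustration of that access pattern only, here is a minimal sketch of a shared index map whose set and clear operations are serialized by a mutex. `struct index_map`, `map_set`, `map_clear`, and `run_map_demo` are hypothetical names, and the fixed-size array is not librdmacm's indexer layout or a proposed patch for it.

```c
#include <pthread.h>
#include <stddef.h>

#define MAP_SLOTS   16   /* hypothetical size, not librdmacm's layout */
#define DEMO_THREADS 4
#define DEMO_ROUNDS 1000

/* Hypothetical stand-in for an index map: slot i holds the item for id i. */
struct index_map {
    pthread_mutex_t lock;
    void *slot[MAP_SLOTS];
};

static struct index_map demo_map = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Stores item at index under the lock; returns index, or -1 if out of range. */
int map_set(struct index_map *m, int index, void *item)
{
    if (index < 0 || index >= MAP_SLOTS)
        return -1;
    pthread_mutex_lock(&m->lock);
    m->slot[index] = item;
    pthread_mutex_unlock(&m->lock);
    return index;
}

/* Empties the slot under the lock and returns what it held. */
void *map_clear(struct index_map *m, int index)
{
    void *item;

    if (index < 0 || index >= MAP_SLOTS)
        return NULL;
    pthread_mutex_lock(&m->lock);
    item = m->slot[index];
    m->slot[index] = NULL;
    pthread_mutex_unlock(&m->lock);
    return item;
}

/* Every thread sets and clears the same slot, as ids reusing an index would. */
static void *set_then_clear(void *arg)
{
    for (int i = 0; i < DEMO_ROUNDS; i++) {
        map_set(&demo_map, 0, arg);
        map_clear(&demo_map, 0);
    }
    return NULL;
}

/* Returns the number of slots still occupied after all threads finish. */
int run_map_demo(void)
{
    pthread_t threads[DEMO_THREADS];
    int started = 0, occupied = 0;

    for (int i = 0; i < DEMO_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, set_then_clear, &threads[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < MAP_SLOTS; i++)
        if (demo_map.slot[i] != NULL)
            occupied++;
    return occupied;
}
```

With the lock calls in place, Helgrind should see every access to the slot ordered by the same mutex. Removing them would be expected to produce reports of the same shape as the ones above.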
piotrchmiel changed the title from "prov/verbs: helgrind data using rdma_create_id/rdma_destroy_id during vrb_getinfo" to "prov/verbs: Helgrind detected data race in rdma_create_id/rdma_destroy_id during vrb_getinfo" on Jan 28, 2025
Environment:
Additional context
Analyzed:
Tagging @shefty, the author of the functions mentioned above in rdma-core, who may be able to provide more insight or answers regarding this issue: