librdmacm: prevent NULL pointer access during device initialization #1547

Open
wants to merge 1 commit into
base: master
@dragonJACson dragonJACson commented Jan 18, 2025

When an RNIC with node_guid 0 is present, rdma_resolve_addr succeeds with ADDR_RESOLVED but subsequent device initialization can fail. This occurs because ucma_query_addr and ucma_query_route skip device initialization when the kernel returns a zero node_guid, leading to NULL pointer access in ucma_process_addr_resolved.

Add explicit NULL checks for id->verbs after ucma_query_addr and ucma_query_route calls. Return ENODEV error if device initialization fails, ensuring proper error propagation instead of crashes.

Note: ucma_query_addr must still return success in this case as it's used for probing AF_IB support, which intentionally skips device initialization.
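The control flow described above can be sketched in isolation. This is a simplified model, not the actual patch: every `sketch_*` name and type is a hypothetical stand-in for librdmacm internals, and the only behavior taken from this description is that a zero node_guid leaves the verbs pointer unset while the query still reports success, and that the caller then returns ENODEV.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for librdmacm internals; not the real definitions. */
struct sketch_verbs {
    int unused;
};

struct sketch_cm_id {
    struct sketch_verbs *verbs; /* stays NULL until a device is bound */
};

static struct sketch_verbs sketch_device;

/*
 * Models the query step: a zero node_guid means device initialization is
 * skipped, yet the call still reports success, because callers that only
 * probe for AF_IB support rely on that.
 */
static int sketch_query_addr(struct sketch_cm_id *id, uint64_t node_guid)
{
    assert(id != NULL);
    if (node_guid != 0)
        id->verbs = &sketch_device;
    return 0;
}

/*
 * Models the address-resolved path with the added check. Returns 0 on
 * success or a positive errno value (an assumption of this sketch).
 */
static int sketch_process_addr_resolved(struct sketch_cm_id *id,
                                        uint64_t node_guid)
{
    int ret = sketch_query_addr(id, node_guid);

    if (ret)
        return ret;

    /* The query succeeded but no device was bound: fail cleanly. */
    if (!id->verbs)
        return ENODEV;

    return 0;
}
```

In this model, calling `sketch_process_addr_resolved` on a fresh id with a zero GUID yields ENODEV and leaves `verbs` as NULL, while a non-zero GUID yields 0 with `verbs` set. Keeping the check in the caller rather than in the query function is what lets the AF_IB probe continue to see success.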

This is easily reproducible with this RNIC configuration and C code:

$ ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         22.41.1000
        node_guid:                      0000:0000:0000:0000
        sys_image_guid:                 b8ce:f603:00e9:d18e
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000430
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <rdma/rdma_cma.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    struct rdma_cm_id *cm_id;
    struct rdma_event_channel *channel;
    struct rdma_cm_event *event;
    struct sockaddr_in addr;
    int ret;

    // Create event channel
    channel = rdma_create_event_channel();
    if (!channel) {
        perror("rdma_create_event_channel failed");
        return 1;
    }

    // Create RDMA ID
    ret = rdma_create_id(channel, &cm_id, NULL, RDMA_PS_TCP);
    if (ret) {
        perror("rdma_create_id failed");
        rdma_destroy_event_channel(channel);
        return 1;
    }

    // Setup address
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7471); 
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    // Resolve address
    ret = rdma_resolve_addr(cm_id, NULL, (struct sockaddr *)&addr, 2000);
    if (ret) {
        perror("rdma_resolve_addr failed");
        goto cleanup;
    }

    // Get the address resolved event
    ret = rdma_get_cm_event(channel, &event);
    if (ret) {
        perror("rdma_get_cm_event failed");
        goto cleanup;
    }

    if (event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
        fprintf(stderr, "Unexpected event: %s, status: %d\n", rdma_event_str(event->event), event->status);
        rdma_ack_cm_event(event);
        ret = 1; // report failure; ret is still 0 from rdma_get_cm_event
        goto cleanup;
    }

    printf("Address resolved successfully\n");
    rdma_ack_cm_event(event);

cleanup:
    rdma_destroy_id(cm_id);
    rdma_destroy_event_channel(channel);
    return ret;
}

With the original librdmacm:

$ ./test_cm
[1]    44206 segmentation fault  ./test_cm

After applying the fix:

$ LD_LIBRARY_PATH=~/workspace/rdma-core/build/lib ./test_cm
Unexpected event: RDMA_CM_EVENT_ADDR_ERROR, status: -1

Signed-off-by: Luke Yue <[email protected]>