Virtio VF PCIe devices can be attached to the guest VM using vhost acceleration software stack. This enables performing live migration of guest VMs. This document describes how to achieve this.
- Minimum hypervisor kernel version - Linux kernel 5.7 (for VFIO SR-IOV support).
- To use high-availability (the additional vfe-vhostd-ha service which can persist datapath when vfe-vhostd crashes), this linux kernel patch must be applied on hypervisor.
Vhost acceleration software stack is built using open-source BSD licensed DPDK.
Pre-built release package is installed to /opt/mellanox/dpdk-vhost-vfe
. To use vfe-vhost-cli
, make sure /opt/mellanox/dpdk-vhost-vfe/bin
is in enviroment variable PATH
.
-
Clone the software source code, or install previous releases:
[host]# git clone https://github.com/Mellanox/dpdk-vhost-vfe
-
Build software:
[host]# apt-get install libev-dev [host]# yum install -y numactl-devel libev-devel [host]# meson build --debug -Denable_drivers=vdpa/virtio,common/virtio,common/virtio_mi,common/virtio_ha [host]# ninja -C build install
-
Start vhostd and vhostd-ha(optional) service:
[host]# systemctl start vfe-vhostd [host]# systemctl start vfe-vhostd-ha
Upstream QEMU later than 8.1 can be used or the following NVIDIA QEMU.
-
Clone NVIDIA QEMU sources:
[host]# git clone https://github.com/Mellanox/qemu -b stable-8.1-presetup
-
Build NVIDIA QEMU:
[host]# mkdir bin [host]# cd bin [host]# ../configure --target-list=x86_64-softmmu --enable-kvm [host]# make -j24
-
Set the DPU nvconfig to enable virtio static PF with SR-IOV VF. Make sure virtio PF exists on host and SR-IOV enabled. For detail, refer NVIDIA BlueField Virtio-net and NVIDIA BlueField-3 SNAP for NVMe and Virtio-blk.
-
Configure huge pages:
[host]# mkdir /dev/hugepages1G [host]# mount -t hugetlbfs -o pagesize=1G none /dev/hugepages1G [host]# echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages [host]# echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
-
Configure libvirt VM XML.
-
Open the VM’s configuration XML for editing:
virsh edit <domain name>
-
Change the top line to:
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
-
Assign a memory amount and use 1GB page size for huge pages:
<memory unit='GiB'>4</memory> <currentMemory unit='GiB'>4</currentMemory> <memoryBacking> <hugepages> <page size='1' unit='GiB'/> </hugepages> </memoryBacking>
-
Set the memory access for the CPUs to be shared:
<cpu mode='custom' match='exact' check='partial'> <model fallback='allow'>Skylake-Server-IBRS</model> <numa> <cell id='0' cpus='0-1' memory='4' unit='GiB' memAccess='shared'/> </numa> </cpu>
-
Add a virtio-net interface in VM XML:
<qemu:commandline> <qemu:arg value='-chardev'/> <qemu:arg value='socket,id=char0,path=/tmp/vhost-net0,server=on'/> <qemu:arg value='-netdev'/> <qemu:arg value='type=vhost-user,id=vhost1,chardev=char0,queues=4'/> <qemu:arg value='-device'/> <qemu:arg value='virtio-net-pci,netdev=vhost1,mac=00:00:00:00:33:00,vectors=10,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024,mq=on,disable-legacy=on,disable-modern=off'/> </qemu:commandline>
-
Add a a virtio-blk interface in VM XML:
<qemu:commandline> <qemu:arg value='-chardev'/> <qemu:arg value='socket,id=char1,path=/tmp/vhost-blk0,server=on'/> <qemu:arg value='-device'/> <qemu:arg value='vhost-user-blk-pci,chardev=char1,page-per-vq=on,num-queues=4,disable-legacy=on,disable-modern=off'/> </qemu:commandline>
-
-
Bind PCIe VF device to vfio-pci driver and enable SR-IOV.
-
Bind the virtio-net PF devices to vfio-pci driver and create 1 virtio-net VF:
[host]# lspci -s 0000:af:00.2 af:00.2 Ethernet controller: Red Hat, Inc. Virtio network device (rev 01) [host]# modprobe vfio vfio_pci [host]# echo 1 > /sys/module/vfio_pci/parameters/enable_sriov [host]# echo 0x1af4 0x1041 > /sys/bus/pci/drivers/vfio-pci/new_id [host]# echo 0000:af:00.2 > /sys/bus/pci/drivers/vfio-pci/bind [host]# lspci -vvv -s 0000:af:00.2 | grep "Kernel driver" Kernel driver in use: vfio-pci Enable SR-IOV and create a virtio-net VF(s): [host]# echo 1 > /sys/bus/pci/devices/0000:af:00.2/sriov_numvfs [host]# lspci | grep Virtio af:00.2 Ethernet controller: Red Hat, Inc. Virtio network device af:04.5 Ethernet controller: Red Hat, Inc. Virtio network device
-
Bind the virtio-blk PF devices to vfio-pci driver and create 1 virtio-blk VF:
[host]# lspci -s 0000:af:00.3 af:00.3 Non-Volatile memory controller: Red Hat, Inc. Virtio block device (rev 01) [host]# modprobe vfio vfio_pci [host]# echo 1 > /sys/module/vfio_pci/parameters/enable_sriov [host]# echo 0x1af4 0x1042 > /sys/bus/pci/drivers/vfio-pci/new_id [host]# echo 0000:af:00.3 > /sys/bus/pci/drivers/vfio-pci/bind [host]# lspci -vvv -s 0000:af:00.3 | grep "Kernel driver" Kernel driver in use: vfio-pci Enable SR-IOV and create a virtio-blkVF(s): [host]# echo 1 > /sys/bus/pci/devices/0000:af:00.3/sriov_numvfs [host]# lspci | grep Virtio af:00.3 Non-Volatile memory controller: Red Hat, Inc. Virtio block device af:05.1 Non-Volatile memory controller: Red Hat, Inc. Virtio block device
-
-
Provision and add VF device to vhostd service:
-
For virtio-net VF:
-
Add a VF representor to the OVS bridge on the DPU:
[dpu]# virtnet query -p 0 -v 0 | grep sf_rep_net_device "sf_rep_net_device": "en3f0pf0sf3000", [dpu]# ovs-vsctl add-port ovsbr1 en3f0pf0sf3000
-
Add PF device and wait virtio-net-controller finishing handle PF FLR:
[host]# vfe-vhost-cli mgmtpf -a 0000:af:00.2
-
Provision VF on DPU(optional):
[dpu]# virtnet modify -p 0 -v 0 device -m 00:00:00:00:33:00
-
Add VF to vhostd service:
[host]# vfe-vhost-cli vf -a 0000:af:04.5 -v /tmp/vhost-net0
-
-
For virtio-blk VF:
-
Create block device on the DPU:
On BlueField-3 SNAP: [dpu]# spdk_rpc.py bdev_null_create Null0 1024 512 [dpu]# snap_rpc.py virtio_blk_controller_create --pf_id 0 --bdev Null0 --num_queues 1 --admin_q For BlueField-3 feature: shared memory based recovery(environment variable VBLK_RECOVERY_SHM) is developed to substitute --force_in_order. On BlueField-2 SNAP: [dpu]# spdk_rpc.py bdev_null_create Null0 1024 512 [dpu]# snap_rpc.py controller_virtio_blk_create --pf_id 0 --bdev_type spdk mlx5_0 --bdev Null0 --num_queues 1 --admin_q --force_in_order
-
Add virtio-blk PF to vhost acceleration service:
[host]# vfe-vhost-cli mgmtpf -a 0000:af:00.3 # Wait on SNAP controller to finish handling PF FLR # On DPU, the user must create a VF device controller before adding the VF device to the vhostd # service or after pf or vf device delete from vhostd service , or vhostd service restart: # For BlueField-3, the VF controller is automatically recreated # For BlueField-2, the VF controller must be manually recreated # Use snap_rpc.py controller_list to check for controller exsistence and create controller if it's not there [dpu]# snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --vf_id 0 --bdev_type spdk --bdev Null0 --force_in_order
-
Add virtio-blk VF to vhostd service:
[host]# vfe-vhost-cli vf -a 0000:af:05.1 -v /tmp/vhost-blk0
-
-
-
Start the VM:
virsh start <domain-name>
-
Remove Device.
When finished using the virtio device, use following commands to remove them from vhostd service:
[host]# vfe-vhost-cli vf -r 0000:af:04.5 [host]# vfe-vhost-cli mgmtpf -r 0000:af:00.2 [host]# vfe-vhost-cli vf -r 0000:af:05.1 [host]# vfe-vhost-cli vf -r 0000:af:00.3
Prepare two identical hosts and perform the steps of adding the virtio devices to vhostd service on both server. Boot the virtual machine on one server and live migration it to another with command like:
[host]# virsh migrate --verbose --live --persistent gen-l-vrt-440-162-CentOS-7.4 qemu+ssh://gen-l-vrt-439/system --unsafe
This service communicates to QEMU vhost-user front-end and programs virtio VF through VFIO and virtio driver.
-
Start vfe-vhostd service:
[host]# systemctl start vfe-vhostd
-
Stop vfe-vhostd service:
[host]# systemctl start vfe-vhostd
-
Check vfe-vhostd service log:
[host]# journalctl -u vfe-vhostd
Running vfe-vhostd-ha service allows datapath to persist in case vfe-vhostd crash. vhostd service and vhostd-ha service will connect each other through unix domain socket. So vhostd-ha service can get information from vhostd service and give back to vhostd service for recovery.
-
Start/Stop/Log of the vfe-vhostd-ha service
[host]# systemctl start vfe-vhostd-ha [host]# systemctl stop vfe-vhostd-ha [host]# journalctl -u vfe-vhostd-ha
After install new software package:
systemctl restart vfe-vhostd-ha
systemctl restart vfe-vhostd
[host]# vfe-vhost-cli version
{
"vfe-vhostd version": "DPDK 22.07.0-rc0.mlx_vfe_vdpa-1.2",
"vfe-vhostd build time": "May 20 2024 18:08:50",
"vfe-vhostd-ha version": "DPDK 22.07.0-rc0.mlx_vfe_vdpa-1.2",
"vfe-vhostd-ha build time": "May 20 2024 18:08:50",
"errno": 0,
"errstr": "Success"
}
Before adding any VF device, PF device must be added in advance. Before removing any PF device, all VF devices must be removed in advance.
[host]# vfe-vhost-cli mgmtpf -h
usage: vfe-vhost-cli mgmtpf [-h] [-a | -r | -l] [DEVICE]
positional arguments:
DEVICE Device specified as PCI "domain:bus:slot.func" syntax or "bus:slot.func" syntax. For device
add/remove to drivers, they may be referred to by interface name.
optional arguments:
-h, --help show this help message and exit
-a, --add add a pci device
-r, --remove remove a pci device
-l, --list list all PF devices
[host]# vfe-vhost-cli vf -h
usage: vfe-vhost-cli vf [-h] [-a | -r | -l | -i | -d] [-o test_operation] [-b test_size_mode] [-v vhost_socket]
[-u vm_uuid]
[DEVICE]
positional arguments:
DEVICE Device specified as PCI "domain:bus:slot.func" syntax or "bus:slot.func" syntax. For device
add/remove to drivers, they may be referred to by interface name.
optional arguments:
-h, --help show this help message and exit
-a, --add add a pci device
-r, --remove remove a pci device
-l, --list list all VF devices of PF device
-i, --info show specified VF device information
-d, --debug test VF device debug
-v vhost_socket Vhost socket file name
-u vm_uuid Virtual machine UUID
QEMU's direct doorbell mapping feature can reduce vmexit when guest driver rings VQ doorbell. To use this feature, virtio-device must work in modern mode and page-per-vq must be set.
The offload properties (host_tso4, csum, ...) or VQ properties (tx_queue_size, rx_queue_size...) should be properly set to QEMU. Full property list can be get from following QEMU command:
[host]# qemu-system-x86_64 -device virtio-net,help
[host]# qemu-system-x86_64 -device vhost-user-blk-pci,help
For virtio-net, guest driver will assign each TX queue or RX queue a dedicated MSIX interrupt ONLY IF device has enough MSIX resource. Make sure: QEMU virtio-net property vectors = 2*N_qpair+2, give QEMU emulated VF device enough MSIX resource.
Configure virtio-net device MTU on DPU, set QEMU virtio-net device property host_mtu
to the same value.
QEMU can send out virtio device state in early live migration stage, so vhostd backend can apply device state before VM is down. This can reduce live migration downtime a lot. Only NVIDIA QEMU(branch stable-8.1-presetup) has this feature.
Add x-early-migration=on
in VM XML:
<qemu:arg value='-device'/>
<qemu:arg value='virtio-net-pci,netdev=vhost1,mac=00:00:00:00:33:00,vectors=10,page-per-vq=on,rx_queue_size=1024,tx_queue_size=1024,mq=on,disable-legacy=on,disable-modern=off,x-early-migration=on'/>
Through doca_snap.yaml environment variable APP_ARGS:
env:
- name: APP_ARGS
value: "-m 0xfff"
Example: vblk(3 cores) and vnet (8 cores).
In virtio-blk controller snap container yaml file /etc/kubelet.d/doca_snap.yaml:
env:
- name: dpa_virtq_split_core_mask
value: "0xffffffffffff"
In virtio-net controller configuration file /opt/mellanox/mlnx_virtnet/virtnet.conf:
{
"dpa_core_start": 3,
"dpa_core_end": 10
}
-
Error on host:
Can't bind virtio device to VFIO
.Solution: Add intel_iommu=on iommu=pt to /proc/cmdline.
-
Error on host:
vfio-pci 0001:86:00.3: can't enable 127 VFs (bus 87 out of range of [bus 86])
Solution: Add pci=realloc,assign-busses to /proc/cmdline.
For NVIDIA BlueField virtio-net PCIe devices and NVIDIA BlueField-3 virtio-blk configuration, Please refer lastest document in BlueField DPUs / SuperNICs & DOCA: