
Fio End-to-End Data Protection, Part 2: Fio Support


This blog post is co-authored by Vincent Fu and Ankit Kumar. We are grateful to Klaus Jensen, Adam Manzanares, and Krishna Kanth Reddy for their support and feedback.

This is Part 2 of our series on fio's support for NVMe end to end data protection (E2EDP). Part 1 of this series provided background information on NVMe E2EDP. This second part describes fio's support for E2EDP and the related ecosystem.

Setup

For our test environment we rely on QEMU and nvme-cli. QEMU provides emulated NVMe devices and we use nvme-cli to manage them.

QEMU

QEMU supports E2EDP in its emulated PCIe NVMe devices. We recommend using v8.1 or later as this version fixes a bug in 64b Guard protection information support. With QEMU v8.1 the options below bring up a guest with three different PCIe NVMe devices:

-device "nvme,id=nvme0,serial=deadbeef" \
   -drive "id=nvm-0,file=nvme.img,format=raw,if=none,discard=unmap,media=disk" \
   -device "nvme-ns,id=nvm-0,drive=nvm-0,bus=nvme0,nsid=1" \
-device "nvme,id=nvme1,serial=deadbeee" \
   -drive "id=nvm-1,file=nvme1.img,format=raw,if=none,discard=unmap,media=disk" \
   -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme1,nsid=1,pif=2,ms=16,mset=1,pi=1,pil=0,logical_block_size=4096,physical_block_size=4096" \
-device "nvme,id=nvme2,serial=deadeeef" \
   -drive "id=nvm-2,file=nvme2.img,format=raw,if=none,discard=unmap,media=disk" \
   -device "nvme-ns,id=nvm-2,drive=nvm-2,bus=nvme2,nsid=1,pif=2,ms=64,mset=1,pi=1,pil=0,logical_block_size=4096,physical_block_size=4096" \

The first device supports 16b Guard protection information for all LBA formats. The second device supports 64b Guard protection information for the LBA format with 4096 bytes of data and 16 bytes of metadata per LBA. The third device supports 64b Guard protection information for the LBA format with 4096 bytes of data and 64 bytes of metadata per LBA. For details see QEMU's documentation on NVMe Emulation.

A complete guide to using QEMU as a test platform is beyond the scope of this blog post. For more background on QEMU, see the many resources available online.

nvme-cli

How can we manage the NVMe devices in our guest VM? We use nvme-cli to inspect the capabilities a device supports and to format a device with a supported E2EDP configuration.

Querying device supported features

The first nvme-cli command to use is the Identify Namespace command, id-ns. This provides information about the current namespace format as well as capabilities supported by the controller. At the very end of the output will also be a list of the LBA formats supported by the device. Run with the -H option for human-readable details. For our test platform the capabilities will be identical for each device.

root@localhost:~# nvme id-ns -H /dev/nvme2n1
NVME Identify Namespace 1:
...
nlbaf   : 7
flbas   : 0x7
  [6:5] : 0     Most significant 2 bits of Current LBA Format Selected
  [4:4] : 0     Metadata Transferred in Separate Contiguous Buffer
  [3:0] : 0x7   Least significant 4 bits of Current LBA Format Selected
 
mc      : 0x3
  [1:1] : 0x1   Metadata Pointer Supported
  [0:0] : 0x1   Metadata as Part of Extended Data LBA Supported
 
dpc     : 0x1f
  [4:4] : 0x1   Protection Information Transferred as Last 8 Bytes of Metadata Supported
  [3:3] : 0x1   Protection Information Transferred as First 8 Bytes of Metadata Supported
  [2:2] : 0x1   Protection Information Type 3 Supported
  [1:1] : 0x1   Protection Information Type 2 Supported
  [0:0] : 0x1   Protection Information Type 1 Supported
 
dps     : 0x3
  [3:3] : 0     Protection Information is Transferred as Last 8 Bytes of Metadata
  [2:0] : 0x3   Protection Information Type 3 Enabled
...
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format  2 : Metadata Size: 16  bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format  3 : Metadata Size: 64  bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format  4 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  5 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  6 : Metadata Size: 16  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  7 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)

The Formatted LBA Size (FLBAS) field's bit 4 indicates whether the namespace was formatted in extended LBA mode or with metadata in a separate buffer. In this case, bit 4 is 0, indicating that metadata is in a separate buffer. The Metadata Capabilities (MC) bits indicate whether the device supports extended LBA mode (bit 0) and metadata in a separate buffer (bit 1). Here both bits are 1, indicating that both modes are supported.

The End-to-end Data Protection Capabilities (DPC) field describes support for the different locations of the protection information within the metadata buffer as well as for data protection Types 1, 2, and 3. All of these bits are 1 for the emulated NVMe device: it supports Type 1 (bit 0), Type 2 (bit 1), and Type 3 (bit 2) data protection, and it can locate the protection information at the start (bit 3) or the end (bit 4) of the metadata buffer.

The End-to-end Data Protection Type Settings (DPS) field gives the location of the protection information and the data protection type for the current namespace format. In this case, bits 0-2 indicate that the namespace is currently formatted with Type 3 data protection, and bit 3 indicates that the protection information is at the end of the metadata buffer.

Finally, at the bottom of the id-ns output is a list of the LBA formats supported by the namespace. Each QEMU emulated NVMe device supports eight different LBA formats with data sizes of 512 and 4096 bytes and metadata sizes of 0, 8, 16, and 64 bytes.
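If you only need this list, the id-ns output can be filtered with standard shell tools, for example:

root@localhost:~# nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"

This prints the eight supported formats along with the "(in use)" marker for the current one.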

Differences among the devices will be apparent with the nvme-cli NVMe Identify Namespace NVM Command Set command, nvm-id-ns. Run this with the -v option for verbose output:

root@localhost:~# nvme nvm-id-ns /dev/nvme0n1 -v
NVMe NVM Identify Namespace 1:
lbstm : 0
pic   : 0
  [2:2] : 0     Storage Tag Check Read Support
  [1:1] : 0     16b Guard Protection Information Storage Tag Mask
  [0:0] : 0     16b Guard Protection Information Storage Tag Support
 
Extended LBA Format  0 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  1 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  2 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  3 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  4 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  5 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  6 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  7 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0  (in use)
root@localhost:~# nvme nvm-id-ns /dev/nvme1n1 -v
NVMe NVM Identify Namespace 1:
lbstm : 0
pic   : 0
  [2:2] : 0     Storage Tag Check Read Support
  [1:1] : 0     16b Guard Protection Information Storage Tag Mask
  [0:0] : 0     16b Guard Protection Information Storage Tag Support
 
Extended LBA Format  0 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  1 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  2 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  3 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  4 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  5 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  6 : Protection Information Format: 64b Guard(2) - Storage Tag Size (MSB): 0  (in use)
Extended LBA Format  7 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
root@localhost:~# nvme nvm-id-ns /dev/nvme2n1 -v
NVMe NVM Identify Namespace 1:
lbstm : 0
pic   : 0
  [2:2] : 0     Storage Tag Check Read Support
  [1:1] : 0     16b Guard Protection Information Storage Tag Mask
  [0:0] : 0     16b Guard Protection Information Storage Tag Support
 
Extended LBA Format  0 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  1 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  2 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  3 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  4 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  5 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  6 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format  7 : Protection Information Format: 64b Guard(2) - Storage Tag Size (MSB): 0  (in use)

For the first device, the output shows that all LBA formats use the 16b Guard protection information format. The second device is identical except that LBAF 6 supports 64b Guard PI with a data size of 4096 bytes and a metadata size of 16 bytes. The third device likewise differs from the first only in that LBAF 7 supports 64b Guard PI with a data size of 4096 bytes and a metadata size of 64 bytes.

Formatting an NVMe Namespace

We can use nvme-cli's format command to select the namespace's on-disk format. The relevant options are described in the help text available for nvme format:

root@localhost:~# nvme format --help
Usage: nvme format <device> [OPTIONS]
 
Re-format a specified namespace on the
given device. Can erase all data in namespace (user
data erase) or delete data encryption key if specified.
Can also be used to change LBAF to change the namespaces reported physical
block format.
 
Options:
  [  --namespace-id=<NUM>, -n <NUM> ]   --- identifier of desired namespace
  [  --timeout=<NUM>, -t <NUM> ]        --- timeout value, in milliseconds
  [  --lbaf=<NUM>, -l <NUM> ]           --- LBA format to apply (required)
  [  --ses=<NUM>, -s <NUM> ]            --- [0-2]: secure erase
  [  --pi=<NUM>, -i <NUM> ]             --- [0-3]: protection info off/Type
                                            1/Type 2/Type 3
  [  --pil=<NUM>, -p <NUM> ]            --- [0-1]: protection info location
                                            last/first 8 bytes of metadata
  [  --ms=<NUM>, -m <NUM> ]             --- [0-1]: extended format off/on
  [  --reset, -r ]                      --- Automatically reset the
                                            controller after successful
                                            format
  [  --force ]                          --- The "I know what I'm doing" flag,
                                            skip confirmation before sending
                                            command
  [  --block-size=<IONUM>, -b <IONUM> ] --- target block size

The first option to consider is --lbaf. The LBA format chosen (based on the output of nvme id-ns and nvme nvm-id-ns) determines the LBA data size (512 or 4096 bytes for our test platform), metadata buffer size (8, 16, or 64 bytes), and Guard protection information format (16b, 32b, or 64b).

Then, select the remaining on-device E2EDP format parameters; a combined example command follows the list:

  • E2EDP Type
    • --pi=1 selects Type 1 data protection
    • --pi=2 selects Type 2 data protection
    • --pi=3 selects Type 3 data protection
  • Protection Information Location
    • --pil=0 places protection information at the end of the metadata buffer
    • --pil=1 places protection information at the beginning of the metadata buffer
  • Extended LBA vs Separate Metadata Buffer
    • --ms=0 formats the device so that LBA data and metadata are stored in separate buffers
    • --ms=1 formats the device in extended LBA mode (metadata is contiguous with the LBA data)
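Putting these together: to format the first emulated device with 512-byte LBA data and 16 bytes of metadata (LBA Format 2 in the id-ns output above), Type 1 protection, protection information at the start of the metadata buffer, and a separate metadata buffer, a command along these lines should work:

root@localhost:~# nvme format /dev/nvme0n1 --lbaf=2 --pi=1 --pil=1 --ms=0 --force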

Fio

With all of the preliminaries out of the way, let us now detail how fio supports E2EDP.

End-to-end Data Protection Options

NVMe E2EDP support was added to fio via the io_uring_cmd ioengine. This ioengine provides a means to submit commands directly to NVMe devices. The new E2EDP options were developed as engine-specific options and are listed below:

md_per_io_size (int)
Size in bytes of the separate metadata buffer allocated per IO. Default: 0. This option must be set if the namespace is formatted with a separate metadata buffer, and
1. protection information is disabled, or
2. protection information is enabled, and the job in question will use pi_act=0, or
3. protection information is enabled, the job in question will use pi_act=1, and the formatted namespace metadata size is greater than the protection information size.

pi_act (int)
Action to take when the NVMe namespace is formatted with protection information. Default: 1. If this is set to 1 and the namespace is formatted with a metadata size equal to the protection information size, fio will not use a separate metadata buffer or extended logical block. If this is set to 1 and the namespace is formatted with a metadata size greater than the protection information size, fio will not generate (for writes) or verify (for reads) the protection information portion of the metadata. If this is set to 0, fio generates protection information for writes and verifies it for reads.

pi_chk (str)
Controls protection information checking. This can take one or more of the values below. Default: none.
1. GUARD - enables protection information checking of the Guard field
2. REFTAG - enables protection information checking of the Logical Block Reference Tag field
3. APPTAG - enables protection information checking of the Application Tag field

apptag (int)
Specifies the Logical Block Application Tag value. Default: 0x1234.

apptag_mask (int)
Specifies the Logical Block Application Tag Mask value. Default: 0xffff.

The new md_per_io_size option directs fio to allocate an extra metadata buffer for each IO. Note that this buffer's size differs from the device's LBA format metadata buffer size in that this buffer is sized per IO rather than per LBA. In other words, for an LBA format with 4096 bytes of data and 16 bytes of metadata, if fio issues 16KiB read and write requests, md_per_io_size will need to be set to 64 bytes.
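In general, the required size is the number of logical blocks per IO multiplied by the per-LBA metadata size. For the example just given:

md_per_io_size >= (bs / LBA data size) * metadata size
               = (16384 / 4096) * 16
               = 64 bytes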

The new pi_act option sets the NVMe command's PRACT bit. Setting this to 1 directs the controller to be responsible for protection information generation. Setting this to 0 directs fio to be responsible for protection information generation. For more details see Part 1 of this series.

The new pi_chk option sets the NVMe command's PRCHK bits. The option accepts a string value. If the string contains GUARD, REFTAG, APPTAG, or any combination thereof, then respectively the Guard, Reference Tag, and Application Tag bits of the PRCHK field will be set.

The final two options are apptag and apptag_mask. These are only relevant if the pi_chk option sets the Application Tag bit. If this bit is set then the specified Application Tag and mask are included in the NVMe command.

These option names are consistent with those used for the SPDK external ioengine. The only difference is that md_per_io_size defaults to 0 in fio's io_uring_cmd ioengine and to 4096 in the SPDK external ioengine.

For best performance, ensure that the ISA-L library is installed and detected when building fio. Fio's source code includes its own CRC calculation functions, but ISA-L provides optimized versions of these routines.

Supported Protection Information Configurations

Fio's E2EDP support does not include the full set of E2EDP options outlined in the NVMe specification. The table below lists the support status for combinations of different parameters.

LBA Data Size  Metadata Size  Extended LBA or     PI Size  Guard   PI Location  Fio Support
(bytes)        (bytes)        Separate Buffer     (bytes)  Format  (Start/End)
512            8              Both                8        16b     NA           Yes
512            16             Both                8        16b     Both         Yes
512            64             Both                8        16b     Both         Yes
4096           8              Both                8        16b     NA           Yes
4096           16             Both                8        16b     Both         Yes
4096           64             Both                8        16b     Both         Yes
4096           16             Both                16       32b     NA           No
4096           16             Both                16       64b     NA           Yes (without Storage Tags)
4096           64             Both                16       32b     Both         No
4096           64             Both                16       64b     Both         Yes (without Storage Tags)

Fio supports the 16b and 64b Guard protection information formats; the most notable omissions are the 32b Guard format and Storage Tags. No E2EDP configuration using the 32b Guard format is supported, and LBA formats with 16b or 64b Guard protection information are supported only if they have no Storage Tags.

Examples

Now let us go through three examples covering the new fio options. With the io_uring_cmd ioengine, we must use the cmd_type=nvme ioengine option to specify that we wish to issue NVMe commands. With this configuration, the filename option also needs to specify the NVMe character device. All of this requires version 5.19 or later of the Linux kernel.
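The NVMe character devices have names of the form /dev/ngXnY, parallel to the /dev/nvmeXnY block devices; they can be listed with, for example:

root@localhost:~# ls -l /dev/ng*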

Example 1: 512-byte LBA data size with 8 bytes of metadata (16b Guard PI)

Our first example is for a namespace formatted with a 512-byte LBA data size with 8 bytes of metadata in a separate buffer. We use the first device from our QEMU invocation above. In the command below we format the device selecting LBA Format 1 since that corresponds to our desired combination of data and metadata size (see the nvme id-ns output from above). We select Type 2 protection information with --pi=2 in our nvme format command. And finally we specify that metadata resides in a separate buffer with the --ms=0 option.

root@localhost:~# nvme format /dev/ng0n1 --lbaf=1 --pi=2 --ms=0 --force
Success formatting namespace:1

We formatted this device with a metadata size equal to the protection information size, since LBAF 1 uses 16b Guard PI with a PI size of 8 bytes. The pi-sb-512.fio configuration file below carries out data integrity checks with pi_chk set to include the GUARD and REFTAG bits; Application Tag checking is not enabled in this example. Because pi_act is set to 1 and the metadata size equals the protection information size, fio does not need to send a separate metadata buffer, so there is no need to set md_per_io_size. With this configuration all of the protection information checking happens behind the scenes. The fio job first carries out a sequential write of the entire device and then sequentially reads back the just-written data.

pi-sb-512.fio

[global]
filename=/dev/ng0n1
ioengine=io_uring_cmd
cmd_type=nvme
size=1G
iodepth=1
bs=512
pi_act=1
pi_chk=GUARD,REFTAG
thread=1
stonewall=1
 
[write]
rw=write
 
[read]
rw=read

Expand the collapsed section below to see the output from running pi-sb-512.fio. No errors were reported, indicating that all data integrity checks passed.

pi-sb-512.fio output
root@localhost:~# fio pi-sb-512.fio
write: (g=0): rw=write, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=io_uring_cmd, iodepth=1
read: (g=1): rw=read, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=io_uring_cmd, iodepth=1
fio-3.35-126-ge2c5f
Starting 2 threads
Jobs: 1 (f=1): [_(1),R(1)][100.0%][r=9.88MiB/s][r=20.2k IOPS][eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=4474: Wed Sep 20 01:20:43 2023
  write: IOPS=19.1k, BW=9543KiB/s (9772kB/s)(1024MiB/109876msec); 0 zone resets
    slat (usec): min=7, max=305, avg= 8.85, stdev= 1.13
    clat (nsec): min=415, max=66444k, avg=42774.55, stdev=70168.33
     lat (usec): min=40, max=66453, avg=51.62, stdev=70.23
    clat percentiles (usec):
     |  1.00th=[   36],  5.00th=[   37], 10.00th=[   37], 20.00th=[   38],
     | 30.00th=[   39], 40.00th=[   40], 50.00th=[   42], 60.00th=[   43],
     | 70.00th=[   44], 80.00th=[   46], 90.00th=[   49], 95.00th=[   57],
     | 99.00th=[   76], 99.50th=[   79], 99.90th=[   90], 99.95th=[  100],
     | 99.99th=[  161]
   bw (  KiB/s): min= 6988, max=10922, per=99.99%, avg=9542.78, stdev=629.86, samples=219
   iops        : min=13976, max=21844, avg=19085.63, stdev=1259.71, samples=219
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 20=0.01%, 50=91.74%, 100=8.21%
  lat (usec)   : 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 50=0.01%, 100=0.01%
  cpu          : usr=6.71%, sys=22.88%, ctx=2097233, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2097152,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
read: (groupid=1, jobs=1): err= 0: pid=4479: Wed Sep 20 01:20:43 2023
  read: IOPS=18.6k, BW=9290KiB/s (9512kB/s)(1024MiB/112877msec)
    slat (usec): min=7, max=847, avg= 9.19, stdev= 5.25
    clat (nsec): min=422, max=1061.9k, avg=43866.26, stdev=40023.71
     lat (usec): min=39, max=1131, avg=53.05, stdev=45.02
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   36], 10.00th=[   36], 20.00th=[   37],
     | 30.00th=[   38], 40.00th=[   38], 50.00th=[   39], 60.00th=[   40],
     | 70.00th=[   43], 80.00th=[   44], 90.00th=[   46], 95.00th=[   51],
     | 99.00th=[   80], 99.50th=[  469], 99.90th=[  553], 99.95th=[  562],
     | 99.99th=[  644]
   bw (  KiB/s): min=  845, max=11050, per=99.98%, avg=9288.39, stdev=2542.99, samples=225
   iops        : min= 1690, max=22100, avg=18576.77, stdev=5085.97, samples=225
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 20=0.01%, 50=94.71%, 100=4.47%
  lat (usec)   : 250=0.05%, 500=0.41%, 750=0.35%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=6.65%, sys=23.36%, ctx=2097220, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2097152,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=9543KiB/s (9772kB/s), 9543KiB/s-9543KiB/s (9772kB/s-9772kB/s), io=1024MiB (1074MB), run=109876-109876msec

Run status group 1 (all jobs):
   READ: bw=9290KiB/s (9512kB/s), 9290KiB/s-9290KiB/s (9512kB/s-9512kB/s), io=1024MiB (1074MB), run=112877-112877msec

Example 2: 4096-byte LBA data size with 16 bytes of metadata (64b Guard PI)

Our second example uses the second device from our QEMU invocation above. We format the device using LBAF 6 which has a logical block data size of 4096 bytes and metadata size of 16 bytes (see the nvme id-ns output above). We select Type 1 protection information with the --pi=1 option and designate a separate buffer for metadata with --ms=0 in the nvme format command below.

root@localhost:~# nvme format /dev/nvme1n1 --lbaf=6 --pi=1 --ms=0 --force
Success formatting namespace:1

The selected LBAF uses 64b Guard protection information (see the nvme nvm-id-ns output for this device above), so the metadata size is the same as the protection information size. The pi-sb-4096.fio configuration file below carries out a data integrity test with the GUARD, REFTAG, and APPTAG bits of pi_chk set; all three protection information fields will be checked. The Application Tag is set to 0x0888, and apptag_mask instructs the controller to check all bits of the Application Tag. Since pi_act is set to 0, fio computes the 64-bit CRC value and fills in the appropriate protection information fields in the separate metadata buffer for write commands. For the block size of 8192 bytes (equivalent to two logical blocks), md_per_io_size must be set to 32 bytes (or more). As in the first example, this job is a sequential write of the entire device followed by a full sequential read.

pi-sb-4096.fio

[global]
filename=/dev/ng1n1
ioengine=io_uring_cmd
cmd_type=nvme
size=1G
iodepth=32
bs=8192
pi_act=0
md_per_io_size=32
pi_chk=GUARD,APPTAG,REFTAG
apptag=0x0888
apptag_mask=0xFFFF
thread=1
stonewall=1
 
[write]
rw=write
 
[read]
rw=read

Expand the collapsed section below to see the output from pi-sb-4096.fio. No data integrity errors occurred during the run.

pi-sb-4096.fio output
root@localhost:~# fio pi-sb-4096.fio
write: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring_cmd, iodepth=32
read: (g=1): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring_cmd, iodepth=32
fio-3.35-126-ge2c5f
Starting 2 threads
Jobs: 1 (f=1): [_(1),R(1)][100.0%][r=146MiB/s][r=18.7k IOPS][eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=4511: Wed Sep 20 01:36:02 2023
  write: IOPS=33.8k, BW=264MiB/s (277MB/s)(1024MiB/3875msec); 0 zone resets
    slat (usec): min=16, max=1121, avg=23.03, stdev=22.80
    clat (nsec): min=581, max=6162.4k, avg=921957.70, stdev=351174.47
     lat (usec): min=126, max=6256, avg=944.98, stdev=352.07
    clat percentiles (usec):
     |  1.00th=[  289],  5.00th=[  363], 10.00th=[  441], 20.00th=[  586],
     | 30.00th=[  709], 40.00th=[  824], 50.00th=[  922], 60.00th=[ 1029],
     | 70.00th=[ 1139], 80.00th=[ 1237], 90.00th=[ 1352], 95.00th=[ 1467],
     | 99.00th=[ 1713], 99.50th=[ 1795], 99.90th=[ 1975], 99.95th=[ 2114],
     | 99.99th=[ 4817]
   bw (  KiB/s): min=255104, max=278800, per=99.92%, avg=270392.71, stdev=7805.51, samples=7
   iops        : min=31888, max=34850, avg=33799.00, stdev=975.69, samples=7
  lat (nsec)   : 750=0.01%
  lat (usec)   : 50=0.02%, 100=0.03%, 250=0.16%, 500=13.32%, 750=20.08%
  lat (usec)   : 1000=23.67%
  lat (msec)   : 2=42.63%, 4=0.07%, 10=0.02%
  cpu          : usr=61.15%, sys=20.81%, ctx=2737, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,131072,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
read: (groupid=1, jobs=1): err= 0: pid=4512: Wed Sep 20 01:36:02 2023
  read: IOPS=19.5k, BW=153MiB/s (160MB/s)(1024MiB/6711msec)
    slat (usec): min=8, max=2833, avg=34.42, stdev=13.37
    clat (usec): min=46, max=12183, avg=1586.98, stdev=277.91
     lat (usec): min=84, max=12194, avg=1621.40, stdev=283.86
    clat percentiles (usec):
     |  1.00th=[ 1500],  5.00th=[ 1500], 10.00th=[ 1516], 20.00th=[ 1516],
     | 30.00th=[ 1532], 40.00th=[ 1532], 50.00th=[ 1582], 60.00th=[ 1598],
     | 70.00th=[ 1614], 80.00th=[ 1631], 90.00th=[ 1663], 95.00th=[ 1680],
     | 99.00th=[ 1696], 99.50th=[ 1713], 99.90th=[ 8291], 99.95th=[ 9372],
     | 99.99th=[10159]
   bw (  KiB/s): min=144496, max=163680, per=100.00%, avg=156384.00, stdev=6257.99, samples=13
   iops        : min=18062, max=20460, avg=19548.15, stdev=782.43, samples=13
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.84%, 4=0.01%, 10=0.12%, 20=0.03%
  cpu          : usr=33.61%, sys=66.38%, ctx=11, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=131072,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=264MiB/s (277MB/s), 264MiB/s-264MiB/s (277MB/s-277MB/s), io=1024MiB (1074MB), run=3875-3875msec

Run status group 1 (all jobs):
   READ: bw=153MiB/s (160MB/s), 153MiB/s-153MiB/s (160MB/s-160MB/s), io=1024MiB (1074MB), run=6711-6711msec

Example 3: 4096-byte LBA data size with 64 bytes of metadata (64b Guard PI)

Our final example uses the third device from our QEMU invocation above. We format the device using LBAF 7 which has a logical block data size of 4096 bytes and metadata size of 64 bytes (see the nvme id-ns output above). We select Type 1 protection information with the --pi=1 option, place the protection information at the end of the metadata buffer with the --pil=0 option, and select extended LBA mode for the namespace with --ms=1 in the nvme format command below.

root@localhost:~# nvme format /dev/ng2n1 --lbaf=7 --pi=1 --pil=0 --ms=1 --force
Success formatting namespace:1

The LBAF for this namespace uses the 64b Guard protection information format (see the nvme nvm-id-ns output for this device above), and its metadata size (64 bytes) is greater than the protection information size (16 bytes). The pi-ext-4160.fio configuration file below carries out a data integrity test with the GUARD, REFTAG, and APPTAG bits of pi_chk set. As in the previous example, all three protection information fields will be checked. The Application Tag is set to 0x0888, and apptag_mask instructs the controller to check all bits of the Application Tag.

This namespace requires the block size to be a multiple of the extended logical block size. For this particular case fio will issue commands reading and writing four extended logical blocks (16640 bytes) at a time. With pi_act set to 0, for each write command fio computes the 64-bit CRC value and fills in the appropriate protection information fields in the 64-byte metadata region that follows each 4096-byte segment of LBA data.
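The bs value in the job file below follows from this arithmetic:

extended logical block size = 4096 + 64 = 4160 bytes
bs = 4 * 4160 = 16640 bytes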

pi-ext-4160.fio

[global]
filename=/dev/ng2n1
ioengine=io_uring_cmd
cmd_type=nvme
size=1G
iodepth=32
bs=16640
pi_act=0
pi_chk=GUARD,APPTAG,REFTAG
apptag=0x0888
apptag_mask=0xFFFF
thread=1
stonewall=1
 
[write]
rw=write
 
[read]
rw=read

Expand the collapsed section below for the output from pi-ext-4160.fio. There were no errors reported during the run, indicating that the data passed all of the device's integrity checks.

pi-ext-4160.fio output
root@localhost:~# fio pi-ext-4160.fio
write: (g=0): rw=write, bs=(R) 16.2KiB-16.2KiB, (W) 16.2KiB-16.2KiB, (T) 16.2KiB-16.2KiB, ioengine=io_uring_cmd, iodepth=32
read: (g=1): rw=read, bs=(R) 16.2KiB-16.2KiB, (W) 16.2KiB-16.2KiB, (T) 16.2KiB-16.2KiB, ioengine=io_uring_cmd, iodepth=32
fio-3.35-126-ge2c5f
Starting 2 threads
Jobs: 1 (f=1): [_(1),R(1)][87.5%][r=224MiB/s][r=14.1k IOPS][eta 00m:01s]
write: (groupid=0, jobs=1): err= 0: pid=4522: Wed Sep 20 01:39:46 2023
  write: IOPS=21.9k, BW=348MiB/s (365MB/s)(1024MiB/2944msec); 0 zone resets
    slat (usec): min=32, max=1259, avg=35.32, stdev=20.54
    clat (nsec): min=441, max=28864k, avg=1422643.07, stdev=869099.64
     lat (usec): min=188, max=28897, avg=1457.97, stdev=869.68
    clat percentiles (usec):
     |  1.00th=[  355],  5.00th=[  457], 10.00th=[  545], 20.00th=[  775],
     | 30.00th=[  996], 40.00th=[ 1205], 50.00th=[ 1385], 60.00th=[ 1582],
     | 70.00th=[ 1795], 80.00th=[ 2024], 90.00th=[ 2278], 95.00th=[ 2376],
     | 99.00th=[ 2606], 99.50th=[ 2704], 99.90th=[ 7177], 99.95th=[10814],
     | 99.99th=[28443]
   bw (  KiB/s): min=356770, max=365625, per=100.00%, avg=362566.80, stdev=3665.68, samples=5
   iops        : min=21955, max=22500, avg=22311.80, stdev=225.61, samples=5
  lat (nsec)   : 500=0.01%, 750=0.01%
  lat (usec)   : 50=0.01%, 100=0.02%, 250=0.11%, 500=7.46%, 750=11.14%
  lat (usec)   : 1000=11.52%
  lat (msec)   : 2=48.96%, 4=20.48%, 10=0.24%, 20=0.01%, 50=0.05%
  cpu          : usr=70.64%, sys=9.45%, ctx=1818, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,64528,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
read: (groupid=1, jobs=1): err= 0: pid=4523: Wed Sep 20 01:39:46 2023
  read: IOPS=14.1k, BW=224MiB/s (235MB/s)(1024MiB/4576msec)
    slat (usec): min=9, max=4466, avg=39.02, stdev=24.15
    clat (usec): min=72, max=16822, avg=2196.45, stdev=522.75
     lat (usec): min=107, max=16890, avg=2235.47, stdev=533.10
    clat percentiles (usec):
     |  1.00th=[ 2089],  5.00th=[ 2089], 10.00th=[ 2089], 20.00th=[ 2147],
     | 30.00th=[ 2180], 40.00th=[ 2180], 50.00th=[ 2180], 60.00th=[ 2180],
     | 70.00th=[ 2180], 80.00th=[ 2212], 90.00th=[ 2212], 95.00th=[ 2245],
     | 99.00th=[ 2278], 99.50th=[ 2311], 99.90th=[14877], 99.95th=[15139],
     | 99.99th=[15926]
   bw (  KiB/s): min=201760, max=240565, per=99.95%, avg=229041.78, stdev=10731.73, samples=9
   iops        : min=12416, max=14804, avg=14094.89, stdev=660.42, samples=9
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=99.63%, 10=0.17%, 20=0.16%
  cpu          : usr=44.70%, sys=55.28%, ctx=13, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=64528,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=1024MiB (1074MB), run=2944-2944msec

Run status group 1 (all jobs):
   READ: bw=224MiB/s (235MB/s), 224MiB/s-224MiB/s (235MB/s-235MB/s), io=1024MiB (1074MB), run=4576-4576msec

Conclusion

These three examples obviously do not cover the entire set of possibilities that these new options provide. For a fuller (albeit still not exhaustive) set of examples, fio also provides a Python-based test script that includes many cases beyond the examples here.
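At the time of writing the script lives at t/nvmept_pi.py in the fio source tree. As a sketch (the --dut option naming the character device under test is our reading of the script; consult its help output for current usage), it can be run like so:

root@localhost:~# python3 t/nvmept_pi.py --dut /dev/ng0n1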

Notes

  • Verify is disabled for namespaces formatted in extended LBA mode with protection information enabled because the protection information portion of the data buffer conflicts with the checksum calculated for verification.
  • This is yet another demonstration of the flexibility of the io_uring command interface. The Linux kernel only supports a subset of the protection information capabilities in the NVMe specification, but with the io_uring command interface we are able to submit NVMe commands directly and exercise a broader range of protection information capabilities.