Fio End-to-End Data Protection, Part 2: Fio Support
This blog post is co-authored by Vincent Fu and Ankit Kumar. We are grateful to Klaus Jensen, Adam Manzanares, and Krishna Kanth Reddy for their support and feedback.
This is Part 2 of our series on fio's support for NVMe end-to-end data protection (E2EDP). Part 1 of this series provided background information on NVMe E2EDP. This second part describes fio's support for E2EDP and the related ecosystem.
For our test environment we rely on QEMU and nvme-cli. QEMU provides emulated NVMe devices and we use nvme-cli to manage them.
QEMU supports E2EDP in its emulated PCIe NVMe devices. We recommend using v8.1 or later as this version fixes a bug in 64b Guard protection information support. With QEMU v8.1 the options below bring up a guest with three different PCIe NVMe devices:
-device "nvme,id=nvme0,serial=deadbeef" \
-drive "id=nvm-0,file=nvme.img,format=raw,if=none,discard=unmap,media=disk" \
-device "nvme-ns,id=nvm-0,drive=nvm-0,bus=nvme0,nsid=1" \
-device "nvme,id=nvme1,serial=deadbeee" \
-drive "id=nvm-1,file=nvme1.img,format=raw,if=none,discard=unmap,media=disk" \
-device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme1,nsid=1,pif=2,ms=16,mset=1,pi=1,pil=0,logical_block_size=4096,physical_block_size=4096" \
-device "nvme,id=nvme2,serial=deadeeef" \
-drive "id=nvm-2,file=nvme2.img,format=raw,if=none,discard=unmap,media=disk" \
-device "nvme-ns,id=nvm-2,drive=nvm-2,bus=nvme2,nsid=1,pif=2,ms=64,mset=1,pi=1,pil=0,logical_block_size=4096,physical_block_size=4096" \
The first device supports 16b Guard protection information for all LBA formats. The second device supports 64b Guard protection information for the LBA format with 4096 bytes of data and 16 bytes of metadata per LBA. The third device supports 64b Guard protection information for the LBA format with 4096 bytes of data and 64 bytes of metadata per LBA. For details see QEMU's documentation on NVMe Emulation.
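For reference, here is a rough sketch of how the device options above might be embedded in a complete QEMU invocation. The machine type, memory size, and boot disk below are placeholder assumptions for illustration and will differ on your system:

qemu-system-x86_64 \
    -machine q35,accel=kvm \
    -m 4G -smp 4 \
    -drive "file=guest.img,format=raw,if=virtio" \
    -device "nvme,id=nvme0,serial=deadbeef" \
    -drive "id=nvm-0,file=nvme.img,format=raw,if=none,discard=unmap,media=disk" \
    -device "nvme-ns,id=nvm-0,drive=nvm-0,bus=nvme0,nsid=1" \
    ...

The remaining NVMe controller and namespace options from the listing above follow in the same way.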
A complete guide to using QEMU as a test platform is beyond the scope of this blog post; many tutorials and references are available online.
How can we manage the NVMe devices in our guest VM? nvme-cli can be used to inspect the capabilities supported by the device and also to format the device to select a supported E2EDP configuration.
The first nvme-cli command to use is the Identify Namespace command, id-ns. This provides information about the current namespace format as well as capabilities supported by the controller. At the very end of the output is a list of the LBA formats supported by the device. Run it with the -H option for human-readable details. For our test platform the capabilities are identical for each device.
root@localhost:~# nvme id-ns -H /dev/nvme2n1
NVME Identify Namespace 1:
...
nlbaf : 7
flbas : 0x7
[6:5] : 0 Most significant 2 bits of Current LBA Format Selected
[4:4] : 0 Metadata Transferred in Separate Contiguous Buffer
[3:0] : 0x7 Least significant 4 bits of Current LBA Format Selected
mc : 0x3
[1:1] : 0x1 Metadata Pointer Supported
[0:0] : 0x1 Metadata as Part of Extended Data LBA Supported
dpc : 0x1f
[4:4] : 0x1 Protection Information Transferred as Last 8 Bytes of Metadata Supported
[3:3] : 0x1 Protection Information Transferred as First 8 Bytes of Metadata Supported
[2:2] : 0x1 Protection Information Type 3 Supported
[1:1] : 0x1 Protection Information Type 2 Supported
[0:0] : 0x1 Protection Information Type 1 Supported
dps : 0x3
[3:3] : 0 Protection Information is Transferred as Last 8 Bytes of Metadata
[2:0] : 0x3 Protection Information Type 3 Enabled
...
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 1 : Metadata Size: 8 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 2 : Metadata Size: 16 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 3 : Metadata Size: 64 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 4 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format 5 : Metadata Size: 8 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format 6 : Metadata Size: 16 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format 7 : Metadata Size: 64 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
The Formatted LBA Size (FLBAS) field's bit 4 indicates whether the namespace was formatted in extended LBA mode or has metadata in a separate buffer. In this case, bit 4 is 0, indicating that metadata is in a separate buffer. The Metadata Capabilities (MC) bits indicate whether the device supports extended LBA mode (bit 0) or metadata in a separate buffer (bit 1). Here both bits are 1, indicating that both modes are supported.

The End-to-end Data Protection Capabilities (DPC) field describes support for different locations for the protection information inside the metadata buffer as well as data protection Types 1, 2, and 3. All of the bits are 1 for the emulated NVMe device, indicating that the device supports Type 1 (bit 0), Type 2 (bit 1), and Type 3 (bit 2) data protection, as well as locating the protection information at the start of the metadata buffer (bit 3) and at the end of the metadata buffer (bit 4).

The End-to-end Data Protection Type Settings (DPS) field describes the location of the protection information and the data protection type for the current namespace format. In this case, bits 0-2 indicate that the namespace is currently formatted with Type 3 data protection, and bit 3 indicates that protection information is at the end of the metadata buffer.
Finally, at the bottom of the id-ns output is a list of the LBA formats supported by the namespace. Each QEMU emulated NVMe device supports eight different LBA formats with data sizes of 512 and 4096 bytes and metadata sizes of 0, 8, 16, and 64 bytes.
Differences among the devices become apparent with nvme-cli's Identify Namespace NVM Command Set command, nvm-id-ns. Run this with the -v option for verbose output:
root@localhost:~# nvme nvm-id-ns /dev/nvme0n1 -v
NVMe NVM Identify Namespace 1:
lbstm : 0
pic : 0
[2:2] : 0 Storage Tag Check Read Support
[1:1] : 0 16b Guard Protection Information Storage Tag Mask
[0:0] : 0 16b Guard Protection Information Storage Tag Support
Extended LBA Format 0 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 1 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 2 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 3 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 4 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 5 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 6 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 7 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0 (in use)
root@localhost:~# nvme nvm-id-ns /dev/nvme1n1 -v
NVMe NVM Identify Namespace 1:
lbstm : 0
pic : 0
[2:2] : 0 Storage Tag Check Read Support
[1:1] : 0 16b Guard Protection Information Storage Tag Mask
[0:0] : 0 16b Guard Protection Information Storage Tag Support
Extended LBA Format 0 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 1 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 2 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 3 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 4 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 5 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 6 : Protection Information Format: 64b Guard(2) - Storage Tag Size (MSB): 0 (in use)
Extended LBA Format 7 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
root@localhost:~# nvme nvm-id-ns /dev/nvme2n1 -v
NVMe NVM Identify Namespace 1:
lbstm : 0
pic : 0
[2:2] : 0 Storage Tag Check Read Support
[1:1] : 0 16b Guard Protection Information Storage Tag Mask
[0:0] : 0 16b Guard Protection Information Storage Tag Support
Extended LBA Format 0 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 1 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 2 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 3 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 4 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 5 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 6 : Protection Information Format: 16b Guard(0) - Storage Tag Size (MSB): 0
Extended LBA Format 7 : Protection Information Format: 64b Guard(2) - Storage Tag Size (MSB): 0 (in use)
The output shows that for the first device all LBA formats use 16b Guard protection information. The second device is identical except that LBAF 6 supports 64b Guard PI with a data size of 4096 bytes and a metadata size of 16 bytes. The third device likewise differs from the first only in that LBAF 7 supports 64b Guard PI with a data size of 4096 bytes and a metadata size of 64 bytes.
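As a quick convenience (a plain shell filter, not an nvme-cli feature), the LBA format currently in use can be confirmed by filtering the human-readable id-ns output. For the second device this shows LBAF 6:

root@localhost:~# nvme id-ns -H /dev/nvme1n1 | grep "in use"
LBA Format  6 : Metadata Size: 16  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)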
We can use nvme-cli's format command to select the namespace's on-disk format. The relevant options are described in the help text available for nvme format:
root@localhost:~# nvme format --help
Usage: nvme format <device> [OPTIONS]
Re-format a specified namespace on the
given device. Can erase all data in namespace (user
data erase) or delete data encryption key if specified.
Can also be used to change LBAF to change the namespaces reported physical
block format.
Options:
[ --namespace-id=<NUM>, -n <NUM> ] --- identifier of desired namespace
[ --timeout=<NUM>, -t <NUM> ] --- timeout value, in milliseconds
[ --lbaf=<NUM>, -l <NUM> ] --- LBA format to apply (required)
[ --ses=<NUM>, -s <NUM> ] --- [0-2]: secure erase
[ --pi=<NUM>, -i <NUM> ] --- [0-3]: protection info off/Type
1/Type 2/Type 3
[ --pil=<NUM>, -p <NUM> ] --- [0-1]: protection info location
last/first 8 bytes of metadata
[ --ms=<NUM>, -m <NUM> ] --- [0-1]: extended format off/on
[ --reset, -r ] --- Automatically reset the
controller after successful
format
[ --force ] --- The "I know what I'm doing" flag,
skip confirmation before sending
command
[ --block-size=<IONUM>, -b <IONUM> ] --- target block size
The first option to consider is --lbaf. The LBA format chosen (based on the output of nvme id-ns and nvme nvm-id-ns) determines the LBA data size (512 or 4096 bytes for our test platform), metadata buffer size (8, 16, or 64 bytes), and Guard protection information format (16b, 32b, or 64b).
Then, select the remaining on-device E2EDP format parameters (an example format command follows this list):
- E2EDP Type: --pi=1 selects Type 1 data protection, --pi=2 selects Type 2, and --pi=3 selects Type 3.
- Protection Information Location: --pil=0 places protection information at the end of the metadata buffer; --pil=1 places it at the beginning.
- Extended LBA vs Separate Metadata Buffer: --ms=0 formats the device so that LBA data and metadata are stored in separate buffers; --ms=1 formats the device in extended LBA mode (metadata is contiguous with the LBA data).
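For example, to select LBA Format 6 with Type 1 protection, protection information at the end of the metadata buffer, and a separate metadata buffer (a combination our second device supports), one could run:

root@localhost:~# nvme format /dev/nvme1n1 --lbaf=6 --pi=1 --pil=0 --ms=0 --force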
With all of the preliminaries out of the way, let us now detail how fio supports E2EDP.
NVMe E2EDP support was added to fio via the io_uring_cmd ioengine. This ioengine provides a means to submit commands directly to NVMe devices. The new E2EDP options were developed as engine-specific options and are listed below:
| Option | Description |
|---|---|
| md_per_io_size (int) | Size in bytes of the separate metadata buffer allocated per IO. Default: 0. This option must be set if the namespace is formatted with a separate metadata buffer and (1) protection information is disabled, or (2) protection information is enabled and the job will use pi_act=0, or (3) protection information is enabled, the job will use pi_act=1, and the formatted namespace metadata size is greater than the protection information size. |
| pi_act (int) | Action to take when the NVMe namespace is formatted with protection information. Default: 1. If set to 1 and the namespace is formatted with metadata size equal to protection information size, fio will not use a separate metadata buffer or extended logical block. If set to 1 and the namespace is formatted with metadata size greater than protection information size, fio will not generate or verify the protection information portion of metadata for writes or reads, respectively. If set to 0, fio generates protection information for writes and verifies it for reads. |
| pi_chk (str) | Controls protection information checking. Accepts one or more of the following values. Default: none. GUARD enables protection information checking of the Guard field; REFTAG enables checking of the Logical Block Reference Tag field; APPTAG enables checking of the Application Tag field. |
| apptag (int) | Specifies the Logical Block Application Tag value. Default: 0x1234. |
| apptag_mask (int) | Specifies the Logical Block Application Tag Mask value. Default: 0xffff. |
The new md_per_io_size option directs fio to allocate an extra metadata buffer for each IO. Note that this buffer's size differs from the device's LBA format metadata size in that this buffer is sized per IO rather than per LBA. In other words, for an LBA format with 4096 bytes of data and 16 bytes of metadata, if fio issues 16KiB read and write requests, md_per_io_size will need to be set to 64 bytes.
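As a rule of thumb (our shorthand, not a fio option):

md_per_io_size >= (bs / lba_data_size) * metadata_size

For the example above: 16384 / 4096 * 16 = 64 bytes.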
The new pi_act option sets the NVMe command's PRACT bit. Setting it to 1 makes the controller responsible for protection information generation; setting it to 0 makes fio responsible. For more details see Part 1 of this series.
The new pi_chk option sets the NVMe command's PRCHK bits. The option accepts a string value. If the string contains GUARD, REFTAG, APPTAG, or any combination thereof, then the Guard, Reference Tag, and Application Tag bits of the PRCHK field will be set, respectively.
The final two options are apptag and apptag_mask. These are only relevant if the pi_chk option sets the Application Tag bit. If this bit is set, the specified Application Tag and mask are included in the NVMe command.
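As a quick illustration of the mask semantics (a job-file fragment with example values; bits set to 1 in apptag_mask select which Application Tag bits the controller checks):

apptag=0x0888
apptag_mask=0xffff
# check all 16 bits of the Application Tag
# apptag_mask=0x00ff would check only the low byte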
These option names are consistent with those used by the SPDK external ioengine. The only difference is that md_per_io_size defaults to 0 for fio's io_uring_cmd ioengine and to 4096 in the SPDK external ioengine.
For best performance ensure that the ISA-L library is installed and detected when building fio. Fio's source code includes CRC calculation functions, but ISA-L has optimized versions of these routines.
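A typical build sequence might look like the following; the ISA-L development package name here is a Debian/Ubuntu assumption and varies by distribution:

root@localhost:~# apt install libisal-dev
root@localhost:~# ./configure
root@localhost:~# make
root@localhost:~# make install

Check the configure output to confirm that ISA-L was detected.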
Fio's E2EDP support does not include the full set of E2EDP options outlined in the NVMe specification. The table below lists the support status for combinations of different parameters.
| LBA Data Size (bytes) | Metadata Size (bytes) | Metadata at End of LBA Data (Extended LBA) or Separate Metadata Buffer | PI Size (bytes) | Guard Format | PI Location (Start/End) | Fio Support |
|---|---|---|---|---|---|---|
| 512 | 8 | Both | 8 | 16 bit | NA | Yes |
| 512 | 16 | Both | 8 | 16 bit | Both | Yes |
| 512 | 64 | Both | 8 | 16 bit | Both | Yes |
| 4096 | 8 | Both | 8 | 16 bit | NA | Yes |
| 4096 | 16 | Both | 8 | 16 bit | Both | Yes |
| 4096 | 64 | Both | 8 | 16 bit | Both | Yes |
| 4096 | 16 | Both | 16 | 32 bit | NA | No |
| 4096 | 16 | Both | 16 | 64 bit | NA | Yes (without Storage Tags) |
| 4096 | 64 | Both | 16 | 32 bit | Both | No |
| 4096 | 64 | Both | 16 | 64 bit | Both | Yes (without Storage Tags) |
Fio supports the 16b and 64b Guard protection information formats; the most notable omissions are the 32b Guard protection information format and Storage Tags. No E2EDP configuration with 32b Guard protection information is supported, and fio supports LBA formats with 16b or 64b Guard protection information only when those formats have no Storage Tags.
Now let us go through three examples covering the new fio options. With the io_uring_cmd ioengine, we must use the cmd_type=nvme ioengine option to specify that we wish to issue NVMe commands. With this configuration, the filename option also needs to specify the NVMe character device. All of this requires Linux kernel 5.19 or later.
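Note that the NVMe character devices appear as /dev/ngXnY alongside the familiar block devices /dev/nvmeXnY. For our three-device guest we expect something like:

root@localhost:~# ls /dev/ng*n1
/dev/ng0n1  /dev/ng1n1  /dev/ng2n1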
Our first example is for a namespace formatted with a 512-byte LBA data size and 8 bytes of metadata in a separate buffer. We use the first device from our QEMU invocation above. In the command below we format the device selecting LBA Format 1, since that corresponds to our desired combination of data and metadata size (see the nvme id-ns output above). We select Type 2 protection information with --pi=2 in our nvme format command. And finally we specify that metadata resides in a separate buffer with the --ms=0 option.
root@localhost:~# nvme format /dev/ng0n1 --lbaf=1 --pi=2 --ms=0 --force
Success formatting namespace:1
We formatted this device with a metadata size equal to the protection information size, since LBAF 1 uses 16b Guard PI with a PI size of 8 bytes. The pi-sb-512.fio configuration file below carries out data integrity checks with pi_chk set to include the GUARD and REFTAG bits. Application Tag checking is not enabled in this example. As pi_act is set to 1 and the metadata size equals the protection information size, fio does not need to send a separate metadata buffer, so there is no need to set md_per_io_size. With this configuration all of the protection information checking happens behind the scenes. The fio job first carries out a sequential write of the entire device and then sequentially reads back the just-written data.
[global]
filename=/dev/ng0n1
ioengine=io_uring_cmd
cmd_type=nvme
size=1G
iodepth=1
bs=512
pi_act=1
pi_chk=GUARD,REFTAG
thread=1
stonewall=1
[write]
rw=write
[read]
rw=read
Expand the collapsed section below to see the output from running pi-sb-512.fio. No errors were reported, indicating that all data integrity checks passed.

pi-sb-512.fio output
root@localhost:~# fio pi-sb-512.fio
write: (g=0): rw=write, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=io_uring_cmd, iodepth=1
read: (g=1): rw=read, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=io_uring_cmd, iodepth=1
fio-3.35-126-ge2c5f
Starting 2 threads
Jobs: 1 (f=1): [_(1),R(1)][100.0%][r=9.88MiB/s][r=20.2k IOPS][eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=4474: Wed Sep 20 01:20:43 2023
write: IOPS=19.1k, BW=9543KiB/s (9772kB/s)(1024MiB/109876msec); 0 zone resets
slat (usec): min=7, max=305, avg= 8.85, stdev= 1.13
clat (nsec): min=415, max=66444k, avg=42774.55, stdev=70168.33
lat (usec): min=40, max=66453, avg=51.62, stdev=70.23
clat percentiles (usec):
| 1.00th=[ 36], 5.00th=[ 37], 10.00th=[ 37], 20.00th=[ 38],
| 30.00th=[ 39], 40.00th=[ 40], 50.00th=[ 42], 60.00th=[ 43],
| 70.00th=[ 44], 80.00th=[ 46], 90.00th=[ 49], 95.00th=[ 57],
| 99.00th=[ 76], 99.50th=[ 79], 99.90th=[ 90], 99.95th=[ 100],
| 99.99th=[ 161]
bw ( KiB/s): min= 6988, max=10922, per=99.99%, avg=9542.78, stdev=629.86, samples=219
iops : min=13976, max=21844, avg=19085.63, stdev=1259.71, samples=219
lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=91.74%, 100=8.21%
lat (usec) : 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 50=0.01%, 100=0.01%
cpu : usr=6.71%, sys=22.88%, ctx=2097233, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2097152,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
read: (groupid=1, jobs=1): err= 0: pid=4479: Wed Sep 20 01:20:43 2023
read: IOPS=18.6k, BW=9290KiB/s (9512kB/s)(1024MiB/112877msec)
slat (usec): min=7, max=847, avg= 9.19, stdev= 5.25
clat (nsec): min=422, max=1061.9k, avg=43866.26, stdev=40023.71
lat (usec): min=39, max=1131, avg=53.05, stdev=45.02
clat percentiles (usec):
| 1.00th=[ 35], 5.00th=[ 36], 10.00th=[ 36], 20.00th=[ 37],
| 30.00th=[ 38], 40.00th=[ 38], 50.00th=[ 39], 60.00th=[ 40],
| 70.00th=[ 43], 80.00th=[ 44], 90.00th=[ 46], 95.00th=[ 51],
| 99.00th=[ 80], 99.50th=[ 469], 99.90th=[ 553], 99.95th=[ 562],
| 99.99th=[ 644]
bw ( KiB/s): min= 845, max=11050, per=99.98%, avg=9288.39, stdev=2542.99, samples=225
iops : min= 1690, max=22100, avg=18576.77, stdev=5085.97, samples=225
lat (nsec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 20=0.01%, 50=94.71%, 100=4.47%
lat (usec) : 250=0.05%, 500=0.41%, 750=0.35%, 1000=0.01%
lat (msec) : 2=0.01%
cpu : usr=6.65%, sys=23.36%, ctx=2097220, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2097152,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=9543KiB/s (9772kB/s), 9543KiB/s-9543KiB/s (9772kB/s-9772kB/s), io=1024MiB (1074MB), run=109876-109876msec
Run status group 1 (all jobs):
READ: bw=9290KiB/s (9512kB/s), 9290KiB/s-9290KiB/s (9512kB/s-9512kB/s), io=1024MiB (1074MB), run=112877-112877msec
Our second example uses the second device from our QEMU invocation above. We format the device using LBAF 6, which has a logical block data size of 4096 bytes and a metadata size of 16 bytes (see the nvme id-ns output above). We select Type 1 protection information with the --pi=1 option and designate a separate buffer for metadata with --ms=0 in the nvme format command below.
root@localhost:~# nvme format /dev/nvme1n1 --lbaf=6 --pi=1 --ms=0 --force
Success formatting namespace:1
The selected LBAF uses 64b Guard protection information (see the nvme nvm-id-ns output for this device above) and thus has a metadata size equal to the protection information size. The pi-sb-4096.fio configuration file below carries out a data integrity test with pi_chk having the GUARD, REFTAG, and APPTAG bits set. All three protection information fields will be checked. The Application Tag is set to 0x0888 and apptag_mask instructs the controller to check all bits of the Application Tag. As pi_act is set to 0, fio computes the 64-bit CRC value and fills in the appropriate protection information fields in the separate metadata buffer for write commands. For the block size of 8192 bytes (equivalent to two logical blocks), md_per_io_size must be set to 32 bytes or more. As in the first example, this is a sequential write of the entire device followed by a full sequential read.
[global]
filename=/dev/ng1n1
ioengine=io_uring_cmd
cmd_type=nvme
size=1G
iodepth=32
bs=8192
pi_act=0
md_per_io_size=32
pi_chk=GUARD,APPTAG,REFTAG
apptag=0x0888
apptag_mask=0xFFFF
thread=1
stonewall=1
[write]
rw=write
[read]
rw=read
Expand the collapsed section below to see the output from pi-sb-4096.fio. No data integrity errors occurred during the run.

pi-sb-4096.fio output
root@localhost:~# fio pi-sb-4096.fio
write: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring_cmd, iodepth=32
read: (g=1): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring_cmd, iodepth=32
fio-3.35-126-ge2c5f
Starting 2 threads
Jobs: 1 (f=1): [_(1),R(1)][100.0%][r=146MiB/s][r=18.7k IOPS][eta 00m:00s]
write: (groupid=0, jobs=1): err= 0: pid=4511: Wed Sep 20 01:36:02 2023
write: IOPS=33.8k, BW=264MiB/s (277MB/s)(1024MiB/3875msec); 0 zone resets
slat (usec): min=16, max=1121, avg=23.03, stdev=22.80
clat (nsec): min=581, max=6162.4k, avg=921957.70, stdev=351174.47
lat (usec): min=126, max=6256, avg=944.98, stdev=352.07
clat percentiles (usec):
| 1.00th=[ 289], 5.00th=[ 363], 10.00th=[ 441], 20.00th=[ 586],
| 30.00th=[ 709], 40.00th=[ 824], 50.00th=[ 922], 60.00th=[ 1029],
| 70.00th=[ 1139], 80.00th=[ 1237], 90.00th=[ 1352], 95.00th=[ 1467],
| 99.00th=[ 1713], 99.50th=[ 1795], 99.90th=[ 1975], 99.95th=[ 2114],
| 99.99th=[ 4817]
bw ( KiB/s): min=255104, max=278800, per=99.92%, avg=270392.71, stdev=7805.51, samples=7
iops : min=31888, max=34850, avg=33799.00, stdev=975.69, samples=7
lat (nsec) : 750=0.01%
lat (usec) : 50=0.02%, 100=0.03%, 250=0.16%, 500=13.32%, 750=20.08%
lat (usec) : 1000=23.67%
lat (msec) : 2=42.63%, 4=0.07%, 10=0.02%
cpu : usr=61.15%, sys=20.81%, ctx=2737, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,131072,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
read: (groupid=1, jobs=1): err= 0: pid=4512: Wed Sep 20 01:36:02 2023
read: IOPS=19.5k, BW=153MiB/s (160MB/s)(1024MiB/6711msec)
slat (usec): min=8, max=2833, avg=34.42, stdev=13.37
clat (usec): min=46, max=12183, avg=1586.98, stdev=277.91
lat (usec): min=84, max=12194, avg=1621.40, stdev=283.86
clat percentiles (usec):
| 1.00th=[ 1500], 5.00th=[ 1500], 10.00th=[ 1516], 20.00th=[ 1516],
| 30.00th=[ 1532], 40.00th=[ 1532], 50.00th=[ 1582], 60.00th=[ 1598],
| 70.00th=[ 1614], 80.00th=[ 1631], 90.00th=[ 1663], 95.00th=[ 1680],
| 99.00th=[ 1696], 99.50th=[ 1713], 99.90th=[ 8291], 99.95th=[ 9372],
| 99.99th=[10159]
bw ( KiB/s): min=144496, max=163680, per=100.00%, avg=156384.00, stdev=6257.99, samples=13
iops : min=18062, max=20460, avg=19548.15, stdev=782.43, samples=13
lat (usec) : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=99.84%, 4=0.01%, 10=0.12%, 20=0.03%
cpu : usr=33.61%, sys=66.38%, ctx=11, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=131072,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=264MiB/s (277MB/s), 264MiB/s-264MiB/s (277MB/s-277MB/s), io=1024MiB (1074MB), run=3875-3875msec
Run status group 1 (all jobs):
READ: bw=153MiB/s (160MB/s), 153MiB/s-153MiB/s (160MB/s-160MB/s), io=1024MiB (1074MB), run=6711-6711msec
Our final example uses the third device from our QEMU invocation above. We format the device using LBAF 7, which has a logical block data size of 4096 bytes and a metadata size of 64 bytes (see the nvme id-ns output above). We select Type 1 protection information with the --pi=1 option, place the protection information at the end of the metadata buffer with the --pil=0 option, and select extended LBA mode for the namespace with --ms=1 in the nvme format command below.
root@localhost:~# nvme format /dev/ng2n1 --lbaf=7 --pi=1 --pil=0 --ms=1 --force
Success formatting namespace:1
The LBAF for this namespace uses the 64b Guard protection information format (see the nvme nvm-id-ns output for this device above), and its metadata size (64 bytes) is greater than the protection information size (16 bytes). The pi-ext-4160.fio configuration file below carries out a data integrity test with pi_chk having the GUARD, REFTAG, and APPTAG bits set. As with the previous example, all three protection information fields will be checked. The Application Tag is set to 0x0888 and apptag_mask instructs the controller to check all bits of the Application Tag.

This namespace requires the block size to be a multiple of the extended logical block size. For this particular case fio will issue commands reading and writing four extended logical blocks (16640 bytes) at a time. With pi_act set to 0, for each write command fio computes the 64-bit CRC value and fills in the appropriate protection information fields in the 64-byte metadata buffer following each 4096-byte segment of LBA data.
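As a quick check of the block size arithmetic:

extended logical block size = 4096 bytes of data + 64 bytes of metadata = 4160 bytes
bs = 4 * 4160 = 16640 bytes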
[global]
filename=/dev/ng2n1
ioengine=io_uring_cmd
cmd_type=nvme
size=1G
iodepth=32
bs=16640
pi_act=0
pi_chk=GUARD,APPTAG,REFTAG
apptag=0x0888
apptag_mask=0xFFFF
thread=1
stonewall=1
[write]
rw=write
[read]
rw=read
Expand the collapsed section below for the output from pi-ext-4160.fio. There were no errors reported during the run, indicating that the data passed all of the device's integrity checks.

pi-ext-4160.fio output
root@localhost:~# fio pi-ext-4160.fio
write: (g=0): rw=write, bs=(R) 16.2KiB-16.2KiB, (W) 16.2KiB-16.2KiB, (T) 16.2KiB-16.2KiB, ioengine=io_uring_cmd, iodepth=32
read: (g=1): rw=read, bs=(R) 16.2KiB-16.2KiB, (W) 16.2KiB-16.2KiB, (T) 16.2KiB-16.2KiB, ioengine=io_uring_cmd, iodepth=32
fio-3.35-126-ge2c5f
Starting 2 threads
Jobs: 1 (f=1): [_(1),R(1)][87.5%][r=224MiB/s][r=14.1k IOPS][eta 00m:01s]
write: (groupid=0, jobs=1): err= 0: pid=4522: Wed Sep 20 01:39:46 2023
write: IOPS=21.9k, BW=348MiB/s (365MB/s)(1024MiB/2944msec); 0 zone resets
slat (usec): min=32, max=1259, avg=35.32, stdev=20.54
clat (nsec): min=441, max=28864k, avg=1422643.07, stdev=869099.64
lat (usec): min=188, max=28897, avg=1457.97, stdev=869.68
clat percentiles (usec):
| 1.00th=[ 355], 5.00th=[ 457], 10.00th=[ 545], 20.00th=[ 775],
| 30.00th=[ 996], 40.00th=[ 1205], 50.00th=[ 1385], 60.00th=[ 1582],
| 70.00th=[ 1795], 80.00th=[ 2024], 90.00th=[ 2278], 95.00th=[ 2376],
| 99.00th=[ 2606], 99.50th=[ 2704], 99.90th=[ 7177], 99.95th=[10814],
| 99.99th=[28443]
bw ( KiB/s): min=356770, max=365625, per=100.00%, avg=362566.80, stdev=3665.68, samples=5
iops : min=21955, max=22500, avg=22311.80, stdev=225.61, samples=5
lat (nsec) : 500=0.01%, 750=0.01%
lat (usec) : 50=0.01%, 100=0.02%, 250=0.11%, 500=7.46%, 750=11.14%
lat (usec) : 1000=11.52%
lat (msec) : 2=48.96%, 4=20.48%, 10=0.24%, 20=0.01%, 50=0.05%
cpu : usr=70.64%, sys=9.45%, ctx=1818, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,64528,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
read: (groupid=1, jobs=1): err= 0: pid=4523: Wed Sep 20 01:39:46 2023
read: IOPS=14.1k, BW=224MiB/s (235MB/s)(1024MiB/4576msec)
slat (usec): min=9, max=4466, avg=39.02, stdev=24.15
clat (usec): min=72, max=16822, avg=2196.45, stdev=522.75
lat (usec): min=107, max=16890, avg=2235.47, stdev=533.10
clat percentiles (usec):
| 1.00th=[ 2089], 5.00th=[ 2089], 10.00th=[ 2089], 20.00th=[ 2147],
| 30.00th=[ 2180], 40.00th=[ 2180], 50.00th=[ 2180], 60.00th=[ 2180],
| 70.00th=[ 2180], 80.00th=[ 2212], 90.00th=[ 2212], 95.00th=[ 2245],
| 99.00th=[ 2278], 99.50th=[ 2311], 99.90th=[14877], 99.95th=[15139],
| 99.99th=[15926]
bw ( KiB/s): min=201760, max=240565, per=99.95%, avg=229041.78, stdev=10731.73, samples=9
iops : min=12416, max=14804, avg=14094.89, stdev=660.42, samples=9
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.02%, 4=99.63%, 10=0.17%, 20=0.16%
cpu : usr=44.70%, sys=55.28%, ctx=13, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=64528,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=1024MiB (1074MB), run=2944-2944msec
Run status group 1 (all jobs):
READ: bw=224MiB/s (235MB/s), 224MiB/s-224MiB/s (235MB/s-235MB/s), io=1024MiB (1074MB), run=4576-4576msec
These three examples obviously do not cover the entire set of possibilities that these new options provide. For a fuller (albeit still not exhaustive) set of examples, fio also provides a Python-based test script that includes many cases beyond the examples here.
- Verify is disabled for namespaces formatted in extended LBA mode with protection information enabled because the protection information portion of the data buffer conflicts with the checksum calculated for verification.
- This is yet another demonstration of the flexibility of the io_uring command interface. The Linux kernel only supports a subset of the protection information capabilities in the NVMe specification, but with the io_uring command interface we are able to directly submit NVMe commands to exercise a broader range of protection information capabilities.