fio waf
This is a guest post by Joel Granados. Many thanks to the people who helped with their reviews: Vincent Fu, Adam Manzanares and Javier Gonzalez.
Write amplification factor (WAF) is the ratio between the number of bytes written at two levels of an application-to-device storage stack for a given workload. It is a value that can give an indication of how a device will age over time. In this article we explore what WAF is, the different ways of calculating it, and how we can take advantage of fio to make those calculations easy. We also provide a recipe that can hopefully be used as a building block in a more general storage telemetry system.
WAF is a unitless ratio between the number of bytes written at two levels of the storage hierarchy. In the absence of things like compression and encryption, the best we can hope for is that the number of bytes written to store a set of data at one level of the storage stack is the same as the number of bytes written to represent it further down. This, of course, is not always the case, as we have metadata, structures to handle the metadata, idiosyncrasies of the physical medium holding the data, and so on.
We calculate WAF by dividing the lower layer write count (LLWC) by the upper layer write count (ULWC). The ideal value for WAF is one, which means that no additional bytes are needed in the lower storage layers to represent the bytes coming from the upper layers. If the ULWC is 1MiB and the LLWC is 1.5MiB because of garbage collection, then the WAF is 1.5. There is no unique value representing a particular storage stack configuration; instead there are several values depending on where the LLWC and ULWC are sampled.
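The division above is trivial, but it helps to pin it down in code; here is a minimal sketch (the function name is mine, not part of any tool discussed here):

```python
def waf(llwc_bytes, ulwc_bytes):
    """Write amplification factor: lower-layer writes over upper-layer writes."""
    return llwc_bytes / ulwc_bytes

# The example from the text: 1 MiB written by the application that becomes
# 1.5 MiB at the lower layer (e.g. due to garbage collection) gives a WAF of 1.5.
MIB = 2**20
print(waf(1.5 * MIB, 1 * MIB))  # 1.5
```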
Let us use a database application to shed light on the specifics of sampling the LLWC and ULWC and how this can result in several WAF values. RocksDB is a log-structured database that stores key-value pairs. Depending on the workload, RocksDB can write an order of magnitude more to the lower storage layers, increasing the WAF to as high as 10. Let's assume that our ULWC is the key/value pairs defined by the user to be stored in the database. These are measured by counting bytes in the "Put" call from the user to the RocksDB API. With this as our starting point we will give three examples of where to place the LLWC.
- We can measure the LLWC at the interface between user and kernel space. Metadata storage, the write-ahead log and internal RocksDB processes like compaction influence this value. To measure these we count writes before RocksDB executes a system call in functions such as "PositionedAppend" and "Append".
- If we measure the LLWC further down, at the interface between the operating system and the device, we get a different value that includes filesystem metadata and journaling on top of the aforementioned factors.
- Finally, if we assume that our main storage device is a solid state drive (SSD) and move the LLWC further down the storage hierarchy, between the flash translation layer (FTL) and the physical NAND dies, we get a value that adds garbage collection, over-provisioning and wear leveling (among others) to the factors already mentioned.
In the previous examples we moved the LLWC down the stack, but we could just as easily have adjusted the ULWC. If we place the ULWC at the operating system/device interface (as in the second example) and the LLWC at the die level (as in the third), we measure Device WAF. This represents the additional writes needed by a device to execute any write operation.
It is worth mentioning that one way to access internal values within an NVMe-specification-compliant SSD is the SMART / Health Information log page, which contains a measure of the writes that reach the controller. Unfortunately, that value is defined in units of 500 KiB, which is far too coarse to be used in WAF calculations. Luckily, the Open Compute Project (OCP) Datacenter NVMe SSD Specification defines an extended SMART log that has a value suited to measuring WAF called "Physical Media Units Written". This extended OCP SMART log is not supported by all vendors; some vendors may report physical media writes via other means, but the details depend on the device and vendor. That is the third and last option to get the writes at the device level.
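To see why the standard counter is too coarse: the SMART "Data Units Written" field counts in units of 1000 × 512 bytes, i.e. 512,000 bytes (exactly 500 KiB) per unit. A quick sketch of the conversion (the helper name is mine):

```python
# NVMe SMART "Data Units Written" granularity: 1000 blocks of 512 bytes.
DATA_UNIT_BYTES = 1000 * 512  # 512,000 bytes == 500 KiB

def data_units_to_bytes(data_units_written):
    """Convert the coarse SMART counter to bytes."""
    return data_units_written * DATA_UNIT_BYTES

# A single unit already represents half a mebibyte, so short intervals in
# a WAF calculation would be dominated by quantization error.
print(data_units_to_bytes(1))  # 512000
```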
RocksDB is great but it does not really give you a lot of control over your workload, and that might not be what you want when you are benchmarking a particular storage stack. This is where fio comes into the picture. In this section I describe a recipe that you can follow to arrive at a WAF value, and identify which fio features can be used for WAF calculation.
I assume that you already have an idea of what you want to measure. Fio has so many options that it is a fair assumption that whatever you are thinking of measuring, fio already has some way of expressing it. You will have to decide on the I/O pattern (random vs. sequential), block size (512 B vs. 4096 B), how much data is generated (e.g. 50% vs. 90% of the device), the path the data takes to the device (io_uring vs. libaio vs. SPDK), the I/O depth if it is asynchronous, which file (device) or file system to target, and how many threads to use concurrently. Our recipe section provides a simple starting point.
You do need to ensure the following to get an accurate measure:
- The device being tested must be used exclusively. No other process can access it during the test.
- Since WAF is about write amplification, you must use one of the write workloads that fio offers.
Fio is a user-space application that generates I/O in many different ways, but it is not always directly involved in gathering WAF data. For Device WAF, for example, fio generates the writes but is not involved in the actual calculation. We will not discuss measuring Device WAF as it happens orthogonally.
Fio does not generate writes at higher levels of the stack. It cannot create key/value pairs for a RocksDB database, for example (though one could probably write an ioengine for it). This restricts how high in the stack our ULWC can go. For the examples in this post we will not need to go very high.
Fio has a very compelling set of knobs to adjust its behavior, as mentioned in the "Before Starting" section. This post is not meant to be exhaustive; it describes how to calculate WAF, not how to use fio in general.
We generate and record writes with fio using the "json" output format, which logs and timestamps the number of bytes written in an interval. This is effectively the ULWC. Since fio cannot give us the writes that arrived at the device, we use a third-party Python script for that job, invoked through fio's exec ioengine. This is effectively the LLWC.
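To give an idea of what extracting the ULWC from fio's JSON output involves, here is a sketch. It assumes fio's documented JSON report layout, where cumulative write statistics appear under `jobs[N].write`; the sample string is a hand-trimmed stand-in for real fio output, not actual tool output:

```python
import json

def ulwc_from_fio_json(json_text, job_index=0):
    """Pull runtime (ms) and cumulative bytes written from one fio JSON report."""
    report = json.loads(json_text)
    write = report["jobs"][job_index]["write"]
    return write["runtime"], write["io_bytes"]

# Hand-written, heavily trimmed stand-in for a fio JSON report:
sample = '{"jobs": [{"write": {"runtime": 10001, "io_bytes": 556687360}}]}'
print(ulwc_from_fio_json(sample))  # (10001, 556687360)
```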
Looking at the fio configuration file (fio.cfg) we see that the job runs for the number of seconds defined by the FIO_TOTAL_TIME environment variable and creates up to 1000 files of size 500MiB on top of whatever filesystem is mounted on /mnt. The files are randomly written with direct I/O. A second job is spawned through the exec ioengine that runs the measure_dev script, which collects the LLWC data.
fio.cfg
[global]
# Total run time
runtime=${FIO_TOTAL_TIME}
# So it stops after runtime
time_based=1
# One line per report
# (the line for WAF_monitor does not really measure anything)
group_reporting

# Section that measures WAF
[WAF_monitor]
# Engine to execute third-party apps
ioengine=exec
# Path to the executable
program=./measure_dev
# Arguments to the executable
arguments=/dev/nvme0n1 ${FIO_INTERVAL}

# The actual fio benchmark
[WAF_filesystem]
# We are testing whatever is mounted on /mnt
directory=/mnt
# We want to see performance without caching
direct=1
# fio writes in random order
rw=randwrite
thread=1
# fio will write files of 500 MiB
filesize=500M
# fio will write up to 1000 files. It will stop
# when the time expires.
nrfiles=1000
Next is the Python script doing the LLWC calculations. It extracts the physical writes at the start and end of every interval by using the ocp plugin in nvme-cli, which implements the extended health log defined in the Datacenter NVMe SSD Specification. The command requires specifying the device file and returns lots of information related to the device "health"; the line containing the physical writes is named "Physical media units written". The script outputs the last measured writes as soon as it receives the kill signal from fio, so as not to lose any writes happening between the last sample and the end of the experiment. We run the fio job with the --status-interval option, which provides cumulative fio output periodically. To enable the WAF calculation, the script displays elapsed time and cumulative write counts at approximately the same frequency as the fio output.
measure_dev script
#!/bin/python3
import subprocess
import sys
import time
from datetime import datetime
from datetime import timedelta
import signal


class WAFInterval:
    def __init__(self, dev):
        self.dev = dev
        self.phy_writes_init = None
        self.time_stamp = None

    def get_physical_media_units_writes(self):
        # Parse the "Physical media units written" line from the OCP
        # extended SMART log reported by nvme-cli.
        value = 0
        cmd = "nvme ocp smart-add-log {0}".format(self.dev)
        out = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
        pattern = "Physical media units written"
        for line in out.stdout.readlines():
            if pattern in str(line):
                value = int(line.decode("utf-8").split()[-1])
        return value

    def start_interval(self):
        self.phy_writes_init = self.get_physical_media_units_writes()
        self.time_stamp = datetime.now()

    def lap_interval(self, last=False):
        now = datetime.now()
        if last:
            time.sleep(1)
        phy_writes_now = self.get_physical_media_units_writes()
        phy_writes = phy_writes_now - self.phy_writes_init
        return "{0} {1} {2} {3}".format(
            (now - self.time_stamp) / timedelta(milliseconds=1),
            self.phy_writes_init, phy_writes_now, phy_writes)

    def colnames(self):
        return "TIME(ms) P_WRITE_PREV P_WRITE_NOW P_WRITTEN(ACCUM)"


def signal_handler(signum, frame):
    # Emit one final sample so writes between the last interval and the
    # end of the fio run are not lost.
    global waf_int
    row = waf_int.lap_interval(last=True)
    print(row)
    sys.exit(1)


signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("arguments should be: DEVICE INTERVAL(seconds)")
        sys.exit(1)
    dev = sys.argv[1]
    interval = int(sys.argv[2])
    waf_int = WAFInterval(dev)
    print(waf_int.colnames())
    waf_int.start_interval()
    while True:
        try:
            time.sleep(interval)
            row = waf_int.lap_interval()
        except Exception as e:
            print("Error : %s" % e)
            sys.exit(1)
        print(row)
Finally we have a run.sh script that puts everything together. It tests an EXT4 mount on a predefined device. Note that this script requires awk, tail and jq to be installed on your machine. The script outputs the timestamped ULWC and LLWC values at the end of execution, and it also generates two files (fio.json and WAF_monitor.stdout) that contain the raw output of the experiment.
run.sh script
#!/bin/bash
DEV="/dev/nvme0n1"
MOUNT="/mnt"
FIO_PATH="./fio"
umount ${DEV} 2>/dev/null
mkfs.ext4 ${DEV} -F -q 2>/dev/null
mount ${DEV} ${MOUNT} 2>/dev/null
${FIO_PATH} --status-interval ${FIO_INTERVAL} --output-format json fio.cfg > fio.json
umount ${DEV} 2>/dev/null
echo "ULWC:"
tail --lines=+3 fio.json | jq -r '[.jobs[0].write.runtime,.jobs[0].write.io_bytes] | @csv'
echo "-------------"
echo "LLWC:"
tail --lines=+2 WAF_monitor.stdout | awk '{print $1","$4}'
Let's give it a go with short time intervals so we can quickly see what it does. Let's say that the total time and measuring interval are both 10 seconds (FIO_TOTAL_TIME=10 FIO_INTERVAL=10). The example below generates random writes to /dev/nvme0n1 within a QEMU VM on the following platform:
| Item | Description |
|---|---|
| fio | 3.32 |
| QEMU | Patched QEMU. Patch on master. Base commit |
| Host Kernel | 5.19.0 |
| Host Distro | Debian 11.4 |
| Guest Kernel | 6.1.0-rc3+ |
| Guest Distro | Arch Linux |
Remember to adjust the files to your local configuration. You might have to change "program" and "arguments" in the fio.cfg file, and $DEV and FIO_PATH in the run.sh file. Let's quickly go through the generated output and how to read it. The first thing to notice is that there are two lines of ULWC output and only one of LLWC. This is because fio printed output just before it stopped execution; we can safely ignore the first line because fio's values are cumulative, so the last line already includes the writes reported in the first. The second thing to notice is that the times in ULWC and LLWC do not match; they differ by approximately 74 milliseconds. This comes from small timing differences between the two processes and can mostly be ignored for the purposes of this post.
archlinux# FIO_TOTAL_TIME=10 FIO_INTERVAL=10 ./run.sh
ULWC:
9884,546570240
10001,556687360
-------------
LLWC:
9926.939,557273088
The ULWC values come from the fio.json output, where we take the timestamp from jobs[0].write.runtime and the bytes written from jobs[0].write.io_bytes. The LLWC values come from the Python script and are taken directly from Python's datetime library and the output of the nvme-cli command. We already have all the information necessary to calculate WAF in this particular case: 557273088 (LLWC) / 556687360 (ULWC) = 1.001052167 (WAF). The ratio is not very pronounced here, but our test was very short and did not push the limits of the filesystem.
Now let's see how to get a more fine-grained measurement. The recipe allows for this by just changing the values of the environment variables. Let's leave the total runtime at 10 seconds but use 5-second intervals to see how WAF changes every 5 seconds. When we run the modified scripts we see that an additional line has appeared in the output. As before, fio outputs one extra line, but this time we simply ignore the second-to-last one. Also notice that the written bytes are cumulative, so to calculate the WAF for a certain interval we need to subtract the previous interval's count to get the bytes written during that interval. Other than that, the calculation is exactly the same as before. In this particular case we get two WAF values: 1.035346438 and 0.971488105. So what happened with the last value? Did it really go below 1? The last interval is always the one with the most error: it is where the timing differences are most visible, because the two processes (fio and measure_dev) stop at different times. The difference is also exacerbated by the size of our test; because the time periods are so short, any timing discrepancy can have a large impact on WAF. For longer runs you can simply drop the last interval, as it will carry the most error.
archlinux# FIO_TOTAL_TIME=10 FIO_INTERVAL=5 ./run.sh
ULWC:
4883,265252864
9882,566841344
10001,575844352
-------------
LLWC:
5000.792,274628608
9925.093,576364544
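The per-interval arithmetic described above can be sketched as follows; the function name is mine, and the lists hold the cumulative byte counts from this run (with the extra fio line at 9882 ms dropped, as explained above):

```python
def interval_wafs(ulwc_cum, llwc_cum):
    """Per-interval WAF from two equally long lists of cumulative byte counts."""
    wafs = []
    prev_u, prev_l = 0, 0
    for u, l in zip(ulwc_cum, llwc_cum):
        # Subtract the previous cumulative count to get this interval's bytes.
        wafs.append((l - prev_l) / (u - prev_u))
        prev_u, prev_l = u, l
    return wafs

# Cumulative ULWC and LLWC byte counts at the two 5-second marks:
ulwc = [265252864, 575844352]
llwc = [274628608, 576364544]
print(interval_wafs(ulwc, llwc))  # roughly [1.035, 0.971]
```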
Hopefully this post has clarified WAF, provided some tools to facilitate its calculation and demonstrated how fio can help. A good next step after working through this post would be to increase the total time of the experiment and of the intervals; my experiments usually run for a couple of hours with 5-minute intervals, but this can vary depending on your needs. You should also change fio.cfg so that it executes a more representative workload.