fio waf
This is a guest post by Joel Granados. Many thanks to the people who helped with their reviews: Vincent Fu, Adam Manzanares and Javier Gonzalez.
Write amplification factor (WAF) is the ratio between the number of bytes written at two levels of an application-to-device storage stack for a given workload. It is a value that can give an indication of how a device will age over time. In this article we explore what WAF is, the different ways of calculating it, and how we can take advantage of fio to make those calculations easy. We also provide a recipe that can hopefully be used as a building block in a more general storage telemetry system.
WAF is a unitless ratio between the number of bytes written at two levels of the storage hierarchy. In the absence of things like compression and encryption, the best we can hope for is that the number of bytes written to store a set of data at one level of the storage stack is the same as the number of bytes written to represent it further down. This, of course, is not always the case, as we have metadata, structures to handle the metadata, idiosyncrasies of the physical medium holding the data, and so on.
We calculate WAF by dividing the lower layer write count (LLWC) by the upper layer write count (ULWC). The ideal value for WAF is one, which means that no additional bytes are needed in the lower storage layers to represent the bytes coming from the upper layers. If the ULWC is 1MiB and the LLWC is 1.5MiB because of garbage collection, then the WAF is 1.5. There is no unique value representing a particular storage stack configuration; instead there are several values depending on where the LLWC and ULWC are sampled.
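The division above is trivial, but it helps to pin it down in code; here is a minimal sketch (the function name is mine, not part of any tool discussed here):

```python
def waf(llwc_bytes, ulwc_bytes):
    """Write amplification factor: lower-layer writes over upper-layer writes."""
    return llwc_bytes / ulwc_bytes

# The example from the text: 1 MiB written by the application that becomes
# 1.5 MiB at the lower layer (e.g. due to garbage collection) gives a WAF of 1.5.
MIB = 2**20
print(waf(1.5 * MIB, 1 * MIB))  # 1.5
```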
Let us use a database application to shed light on the specifics of sampling the LLWC and ULWC and how this can result in several WAF values. RocksDB is a log-structured database that stores key-value pairs. Depending on the workload, RocksDB can write an order of magnitude more to the lower storage layers, increasing the WAF to as high as 10. Let's assume that our ULWC is the key/value pairs defined by the user to be stored in the database. These are measured by counting bytes in the "Put" call from the user to the RocksDB API. With this as our starting point we will give three examples of where to place the LLWC.
- We can measure the LLWC at the interface between user and kernel space. Metadata storage, the write-ahead log and internal RocksDB processes like compaction influence this value. To measure these we count writes before RocksDB executes a system call in functions such as "PositionedAppend" and "Append".
- If we measure the LLWC further down, at the interface between the operating system and the device, we get a different value that includes filesystem metadata and journaling on top of the aforementioned factors.
- Finally, if we assume that our main storage device is a solid state drive (SSD) and move the LLWC further down the storage hierarchy, between the flash translation layer (FTL) and the physical NAND dies, we get a value that adds garbage collection, over-provisioning and wear leveling (among others) to the factors already mentioned.
In the previous examples we moved the LLWC down the stack, but we could just as easily have adjusted the ULWC. If we place the ULWC at the operating system/device interface (as in the second example) and the LLWC at the die level (as in the third), we measure Device WAF. This represents the additional writes needed by a device to execute any write operation.
It is worth mentioning that one way to access internal values within an NVMe-specification-compliant SSD is the SMART / Health Information log page, which contains a measure of the writes that reach the controller. Unfortunately, that value is defined in units of 500 KiB, which is far too coarse to be used in WAF calculations. Luckily, the Open Compute Project (OCP) Datacenter NVMe SSD Specification defines an extended SMART log that has a value suited to measuring WAF called "Physical Media Units Written". This extended OCP SMART log is not supported by all vendors; some vendors may report physical media writes via other means, but the details depend on the device and vendor. That is the third and last option to get the writes at the device level.
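To see why the standard counter is too coarse: the SMART "Data Units Written" field counts in units of 1000 × 512 bytes, i.e. 512,000 bytes (exactly 500 KiB) per unit. A quick sketch of the conversion (the helper name is mine):

```python
# NVMe SMART "Data Units Written" granularity: 1000 blocks of 512 bytes.
DATA_UNIT_BYTES = 1000 * 512  # 512,000 bytes == 500 KiB

def data_units_to_bytes(data_units_written):
    """Convert the coarse SMART counter to bytes."""
    return data_units_written * DATA_UNIT_BYTES

# A single unit already represents half a mebibyte, so short intervals in
# a WAF calculation would be dominated by quantization error.
print(data_units_to_bytes(1))  # 512000
```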
RocksDB is great but it does not really give you a lot of control over your workload, and that might not be what you want when you are benchmarking a particular storage stack. This is where fio comes into the picture. In this section I describe a recipe that you can follow to arrive at a WAF value, and identify which fio features can be used for WAF calculation.
I assume that you already have an idea of what you want to measure. Fio has so many options that it is a fair assumption that whatever you are thinking of measuring, fio already has some way of expressing it. You will have to decide on the I/O pattern (random vs. sequential), block size (512 B vs. 4096 B), how much data is generated (e.g. 50% vs. 90% of the device), the path the data takes to the device (io_uring vs. libaio vs. SPDK), the I/O depth if it is asynchronous, which file (device) or file system to target, and how many threads to use concurrently. Our recipe section provides a simple starting point.
You do need to ensure the following to get an accurate measure:
- The device being tested must be used exclusively. No other process can access it during the test.
- Since WAF is about write amplification, you must use one of the write workloads that fio offers.
Fio is a user-space application that generates I/O in many different ways, but it is not always directly involved in gathering WAF data. For Device WAF, for example, fio generates the writes but is not involved in the actual calculation. We will not discuss measuring Device WAF as it happens orthogonally.
Fio does not generate writes at higher levels of the stack. It cannot create key/value pairs for a RocksDB database, for example (though one could probably write an ioengine for it). This restricts how high in the stack our ULWC can go. For the examples in this post we will not need to go very high.
Fio has a very compelling set of knobs to adjust its behavior, as mentioned in the "Before Starting" section. This post is not meant to be exhaustive; it describes how to calculate WAF, not how to use fio in general.
We generate and record writes with fio using the "json" output format, which logs and timestamps the number of bytes written in an interval. This is effectively the ULWC. Since fio cannot give us the writes that arrived at the device, we use a third-party Python script for that job, invoked through fio's exec ioengine. This is effectively the LLWC.
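To give an idea of what extracting the ULWC from fio's JSON output involves, here is a sketch. It assumes fio's documented JSON report layout, where cumulative write statistics appear under `jobs[N].write`; the sample string is a hand-trimmed stand-in for real fio output, not actual tool output:

```python
import json

def ulwc_from_fio_json(json_text, job_index=0):
    """Pull runtime (ms) and cumulative bytes written from one fio JSON report."""
    report = json.loads(json_text)
    write = report["jobs"][job_index]["write"]
    return write["runtime"], write["io_bytes"]

# Hand-written, heavily trimmed stand-in for a fio JSON report:
sample = '{"jobs": [{"write": {"runtime": 10001, "io_bytes": 556687360}}]}'
print(ulwc_from_fio_json(sample))  # (10001, 556687360)
```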
Looking at the fio configuration file (fio.cfg) we see that the job runs for the number of seconds defined by the FIO_TOTAL_TIME environment variable and creates up to 1000 files of size 500MiB on top of whatever filesystem is mounted on /mnt. The files are randomly written with direct I/O. A second job is spawned through the exec ioengine that runs the measure_dev script, which collects the LLWC data.
fio.cfg
[global]
# Total run time
runtime=${FIO_TOTAL_TIME}
# So it stops after runtime
time_based=1
# One line per report
# (the line for WAF_monitor does not really measure anything)
group_reporting

# Section that measures WAF
[WAF_monitor]
# Engine to execute third-party apps
ioengine=exec
# Path to the executable
program=./measure_dev
# Arguments to the executable
arguments=/dev/nvme0n1 ${FIO_INTERVAL}

# The actual fio benchmark
[WAF_filesystem]
# We are testing whatever is mounted on /mnt
directory=/mnt
# We want to see performance without caching
direct=1
# fio writes in random order
rw=randwrite
thread=1
# fio will write files of 500 MiB
filesize=500M
# fio will write up to 1000 files. It will stop
# when the time expires.
nrfiles=1000
Next is the Python script doing the LLWC calculations. It extracts the physical writes at the start and end of every interval by using the ocp plugin in nvme-cli, which implements the extended health log defined in the Datacenter NVMe SSD Specification. The command requires specifying the device file and returns lots of information related to the device "health"; the line containing the physical writes is named "Physical media units written". The script outputs the last measured writes as soon as it receives the kill signal from fio, so as not to lose any writes happening between the last sample and the end of the experiment. We run the fio job with the --status-interval option, which provides cumulative fio output periodically. To enable the WAF calculation, the script displays elapsed time and cumulative write counts at approximately the same frequency as the fio output.
measure_dev script
#!/bin/python3
import subprocess
import sys
import time
from datetime import datetime
from datetime import timedelta
import signal


class WAFInterval:
    def __init__(self, dev):
        self.dev = dev
        self.phy_writes_init = None
        self.time_stamp = None

    def get_physical_media_units_writes(self):
        # Parse the "Physical media units written" line from the OCP
        # extended SMART log reported by nvme-cli.
        value = 0
        cmd = "nvme ocp smart-add-log {0}".format(self.dev)
        out = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
        pattern = "Physical media units written"
        for line in out.stdout.readlines():
            if pattern in str(line):
                value = int(line.decode("utf-8").split()[-1])
        return value

    def start_interval(self):
        self.phy_writes_init = self.get_physical_media_units_writes()
        self.time_stamp = datetime.now()

    def lap_interval(self, last=False):
        now = datetime.now()
        if last:
            time.sleep(1)
        phy_writes_now = self.get_physical_media_units_writes()
        phy_writes = phy_writes_now - self.phy_writes_init
        return "{0} {1} {2} {3}".format(
            (now - self.time_stamp) / timedelta(milliseconds=1),
            self.phy_writes_init, phy_writes_now, phy_writes)

    def colnames(self):
        return "TIME(ms) P_WRITE_PREV P_WRITE_NOW P_WRITTEN(ACCUM)"


def signal_handler(signum, frame):
    # Emit one final sample so writes between the last interval and the
    # end of the fio run are not lost.
    global waf_int
    row = waf_int.lap_interval(last=True)
    print(row)
    sys.exit(1)


signal.signal(signal.SIGTERM, signal_handler)
signal.signal(signal.SIGINT, signal_handler)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("arguments should be: DEVICE INTERVAL(seconds)")
        sys.exit(1)
    dev = sys.argv[1]
    interval = int(sys.argv[2])
    waf_int = WAFInterval(dev)
    print(waf_int.colnames())
    waf_int.start_interval()
    while True:
        try:
            time.sleep(interval)
            row = waf_int.lap_interval()
        except Exception as e:
            print("Error : %s" % e)
            sys.exit(1)
        print(row)
Finally we have a run.sh script that puts everything together. It tests an EXT4 mount on a predefined device. Note that this script requires awk, tail and jq to be installed on your machine. The script outputs the timestamped ULWC and LLWC values at the end of execution, and it also generates two files (fio.json and WAF_monitor.stdout) that contain the raw output of the experiment.
run.sh script
#!/bin/bash
DEV="/dev/nvme0n1"
MOUNT="/mnt"
FIO_PATH="./fio"
umount ${DEV} 2>/dev/null
mkfs.ext4 ${DEV} -F -q 2>/dev/null
mount ${DEV} ${MOUNT} 2>/dev/null
${FIO_PATH} --status-interval ${FIO_INTERVAL} --output-format json fio.cfg > fio.json
umount ${DEV} 2>/dev/null
echo "ULWC:"
tail --lines=+3 fio.json | jq -r '[.jobs[0].write.runtime,.jobs[0].write.io_bytes] | @csv'
echo "-------------"
echo "LLWC:"
tail --lines=+2 WAF_monitor.stdout | awk '{print $1","$4}'
Let's give it a go with short time intervals so we can quickly see what it does. Let's say that the total time and measuring interval are both 10 seconds (FIO_TOTAL_TIME=10 FIO_INTERVAL=10). The example below generates random writes to /dev/nvme0n1 within a QEMU VM on the following platform:
| Item | Description |
|---|---|
| fio | 3.32 |
| QEMU | Patched QEMU. Patch on master. Base commit |
| Host Kernel | 5.19.0 |
| Host Distro | Debian 11.4 |
| Guest Kernel | 6.1.0-rc3+ |
| Guest Distro | Arch Linux |
Remember to adjust the files to your local configuration. You might have to change "program" and "arguments" in the fio.cfg file, and $DEV and FIO_PATH in the run.sh file. Let's quickly go through the generated output and how to read it. The first thing to notice is that there are two lines of ULWC output and only one of LLWC. This is because fio printed output just before it stopped execution; we can safely ignore the first line because fio's values are cumulative, so the last line already includes the writes reported in the first. The second thing to notice is that the times in ULWC and LLWC do not match; they differ by approximately 74 milliseconds. This comes from small timing differences between the two processes and can mostly be ignored for the purposes of this post.
archlinux# FIO_TOTAL_TIME=10 FIO_INTERVAL=10 ./run.sh
ULWC:
9884,546570240
10001,556687360
-------------
LLWC:
9926.939,557273088
The ULWC values come from the fio.json output, where we take the timestamp from jobs[0].write.runtime and the bytes written from jobs[0].write.io_bytes. The LLWC values come from the Python script and are taken directly from Python's datetime library and the output of the nvme-cli command. We already have all the information necessary to calculate WAF in this particular case: 557273088 (LLWC) / 556687360 (ULWC) = 1.001052167 (WAF). The ratio is not very pronounced here, but our test was very short and did not push the limits of the filesystem.
Now let's see how to get a more fine-grained measurement. The recipe allows for this by just changing the values of the environment variables. Let's leave the total runtime at 10 seconds but use 5-second intervals to see how WAF changes every 5 seconds. When we run the modified scripts we see that an additional line has appeared in the output. As before, fio outputs one extra line, but this time we simply ignore the second-to-last one. Also notice that the written bytes are cumulative, so to calculate the WAF for a certain interval we need to subtract the previous interval's count to get the bytes written during that interval. Other than that, the calculation is exactly the same as before. In this particular case we get two WAF values: 1.035346438 and 0.971488105. So what happened with the last value? Did it really go below 1? The last interval is always the one with the most error: it is where the timing differences are most visible, because the two processes (fio and measure_dev) stop at different times. The difference is also exacerbated by the size of our test; because the time periods are so short, any timing discrepancy can have a large impact on WAF. For longer runs you can simply drop the last interval, as it will carry the most error.
archlinux# FIO_TOTAL_TIME=10 FIO_INTERVAL=5 ./run.sh
ULWC:
4883,265252864
9882,566841344
10001,575844352
-------------
LLWC:
5000.792,274628608
9925.093,576364544
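The per-interval arithmetic described above can be sketched as follows; the function name is mine, and the lists hold the cumulative byte counts from this run (with the extra fio line at 9882 ms dropped, as explained above):

```python
def interval_wafs(ulwc_cum, llwc_cum):
    """Per-interval WAF from two equally long lists of cumulative byte counts."""
    wafs = []
    prev_u, prev_l = 0, 0
    for u, l in zip(ulwc_cum, llwc_cum):
        # Subtract the previous cumulative count to get this interval's bytes.
        wafs.append((l - prev_l) / (u - prev_u))
        prev_u, prev_l = u, l
    return wafs

# Cumulative ULWC and LLWC byte counts at the two 5-second marks:
ulwc = [265252864, 575844352]
llwc = [274628608, 576364544]
print(interval_wafs(ulwc, llwc))  # roughly [1.035, 0.971]
```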
Hopefully this post has clarified WAF, provided some tools to facilitate its calculation and demonstrated how fio can help. A good next step after working through this post would be to increase the total time of the experiment and of the intervals; my experiments usually run for a couple of hours with 5-minute intervals, but this can vary depending on your needs. You should also change fio.cfg so that it executes a more representative workload.