Storage Performance Testing: What to Run, What You're Reading, and What It Actually Means
Your colleagues are debating RAIDZ2 vs RAID 10. Someone runs a benchmark. Numbers appear. A decision gets made. Half the time the benchmark measured the wrong thing entirely — wrong block size, wrong queue depth, read from cache instead of disk, compared averages instead of tail latency.
This post covers storage performance testing from the ground up. What to install, what to run, how to read every number in the output, and the two concepts underneath every serious storage conversation — queue depth and block size. Real commands, real output, honest explanations.
Debian only. NVMe and SATA SSD only — HDD is dead for this conversation.
What Are You Actually Measuring?
Before any command, understand that storage performance has four distinct dimensions. Your colleagues are usually debating one of these without saying which one:
Throughput (MB/s) — how much data moves per second. Matters for large sequential operations — backups, file copies, database dumps, video.
IOPS (I/O Operations Per Second) — how many read/write operations per second, regardless of size. Matters for databases, VMs, anything doing lots of small random reads and writes.
Latency (ms/µs) — how long a single operation takes from request to completion. Matters for anything interactive. A database doing 50,000 IOPS at 0.3ms latency is completely different from 50,000 IOPS at 15ms latency.
Queue Depth — how many operations are in flight simultaneously. SSDs love deep queues. This is why NVMe feels so much faster than SATA SSD under concurrent load — not just interface speed, but queuing architecture. More on this later.
"Is RAIDZ2 fast enough for our database?" is really an IOPS and latency question. "Which is faster for backups?" is a throughput question. Different questions need different tests. Running the wrong test gives you confident but useless numbers.
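These dimensions are tied together by Little's law: sustained IOPS is roughly the number of operations in flight divided by average latency. A quick sanity check you can run on any benchmark's numbers (128 in-flight operations and 330 µs are example values):

```shell
# Little's law: IOPS ~ operations in flight / average latency.
# Example: 128 ops in flight, 330 microseconds average latency.
awk -v inflight=128 -v lat_us=330 \
    'BEGIN { printf "IOPS ~ %.0f\n", inflight / (lat_us / 1e6) }'
# prints: IOPS ~ 387879
```

If a benchmark reports IOPS, latency, and queue depth that don't roughly satisfy this relation, one of the numbers is being misread or misreported.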
Install the Tools
apt install fio hdparm iotop sysstat ioping -y
What each one does:
fio — the professional benchmark tool. Simulates real workloads with full control over block size, queue depth, read/write ratio, concurrency. This is what your colleagues use and what this post focuses on.
hdparm — quick raw disk speed test. Good for a 30-second sanity check before anything serious.
ioping — like ping but for disk latency. Measures how long individual I/O operations take. Simple and revealing.
iotop — like top but for disk I/O. Shows which process is hammering storage in real time.
sysstat — provides iostat. Monitors ongoing I/O metrics over time. Run this while fio runs to see what's actually happening at the device level.
Step 1 — Quick Sanity Check with hdparm
Before serious benchmarking, get a raw read speed baseline:
# Test raw read speed — bypasses filesystem and OS cache
hdparm -Tt /dev/nvme0n1
# -T = cached reads (RAM speed — ignore this number completely)
# -t = buffered disk reads (actual device speed)
Output:
/dev/nvme0n1:
Timing cached reads: 24512 MB in 2.00 seconds = 12256 MB/sec ← RAM, ignore
Timing buffered disk reads: 6842 MB in 3.00 seconds = 2280 MB/sec ← real speed
What to expect:
| Device | Typical Result |
|---|---|
| SATA SSD | 400–550 MB/s |
| NVMe (mid-range) | 2000–3500 MB/s |
| NVMe (high-end) | 5000–7000 MB/s |
| RAID 10 (4× SATA SSD) | 800–1800 MB/s |
| RAIDZ2 (4× NVMe) | 4000–12000 MB/s |
This is your ceiling. Actual filesystem performance will always be lower. If your filesystem performance is close to this number, your storage stack is healthy. If it's dramatically lower, something is wrong in the layers above.
Step 2 — Latency Check with ioping
Check latency before throughput. A storage system with great throughput but terrible latency will feel sluggish for anything interactive:
# Latency on the filesystem — what your app actually experiences
ioping -c 30 /data
# Raw device latency — bypass filesystem to see hardware latency
ioping -c 30 /dev/nvme0n1
# Compare both — the difference is your filesystem overhead
Output:
4 KiB <<< /data (ext4 /dev/nvme0n1p1): request=1 time=198.7 us
4 KiB <<< /data (ext4 /dev/nvme0n1p1): request=2 time=201.3 us
4 KiB <<< /data (ext4 /dev/nvme0n1p1): request=3 time=187.4 us
...
--- /data (ext4 /dev/nvme0n1p1) ioping statistics ---
29 requests completed in 28.8 ms, 116 KiB read, 1.01 k iops, 3.96 MiB/s
min/avg/max/mdev = 187.4 us / 243.7 us / 412.3 us / 48.2 us
The bottom line is what matters. avg is your baseline latency. mdev (mean deviation) is consistency — high mdev means unpredictable spikes. An average of 200µs with mdev of 5µs is great. An average of 200µs with mdev of 180µs means the storage is spiking wildly and your application will feel it.
What to expect:
| Device | Expected Latency |
|---|---|
| SATA SSD | 100–500 µs |
| NVMe (mid-range) | 20–100 µs |
| NVMe (high-end) | 10–40 µs |
| RAIDZ2 (NVMe, random) | 50–200 µs (parity overhead) |
| Mirror vdevs (NVMe) | ~same as single NVMe |
RAIDZ2 has higher random I/O latency than mirrors because of parity calculation overhead on every write. This is one of the core arguments for using mirrors instead of RAIDZ for database workloads.
Step 3 — Serious Benchmarking with fio
fio is the tool that actually matters. Every parameter controls something specific — get them right or your numbers are fiction.
Key parameters you need to understand:
--rw — workload type: read, write, randread, randwrite, randrw (mixed)
--bs — block size per operation. 4K for database-like workloads. 1M for sequential large-file workloads. More on this later — it's a deep topic.
--iodepth — queue depth. How many operations in flight simultaneously. Critical for NVMe. More on this later too.
--numjobs — parallel workers. Simulates concurrent users or processes.
--runtime — how long to run. 60 seconds minimum. 10 seconds catches cache, not disk.
--filename — what to test. Always use your actual filesystem mount path, not the raw device. Raw device numbers are theoretical. Your application uses the filesystem.
Critical before every read test — flush the cache:
# Needs root. With sudo, the redirect itself must run in a root shell:
# sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'
sync && echo 3 > /proc/sys/vm/drop_caches
Linux caches disk reads in RAM. Without this, your second read test reads from RAM and gives you RAM speeds — completely useless. Do this before every single read benchmark.
Test 1 — Sequential Read (backup/file server scenario)
sync && echo 3 > /proc/sys/vm/drop_caches
fio --name=seq-read \
--filename=/data/fio-test \
--rw=read \
--bs=1M \
--size=4G \
--numjobs=1 \
--iodepth=8 \
--runtime=60 \
--time_based \
--group_reporting
Test 2 — Sequential Write
fio --name=seq-write \
--filename=/data/fio-test \
--rw=write \
--bs=1M \
--size=4G \
--numjobs=1 \
--iodepth=8 \
--runtime=60 \
--time_based \
--group_reporting
Test 3 — Random Read (database reads)
sync && echo 3 > /proc/sys/vm/drop_caches
fio --name=rand-read \
--filename=/data/fio-test \
--rw=randread \
--bs=4K \
--size=4G \
--numjobs=4 \
--iodepth=32 \
--runtime=60 \
--time_based \
--group_reporting
Test 4 — Random Write (database writes, VM disks)
fio --name=rand-write \
--filename=/data/fio-test \
--rw=randwrite \
--bs=4K \
--size=4G \
--numjobs=4 \
--iodepth=32 \
--runtime=60 \
--time_based \
--group_reporting
Test 5 — Mixed Read/Write (most realistic for databases)
sync && echo 3 > /proc/sys/vm/drop_caches
fio --name=mixed \
--filename=/data/fio-test \
--rw=randrw \
--rwmixread=70 \
--bs=4K \
--size=4G \
--numjobs=4 \
--iodepth=32 \
--runtime=60 \
--time_based \
--group_reporting
--rwmixread=70 means 70% reads, 30% writes — typical read-heavy database workload.
Reading fio Output — This Is What Everyone Gets Wrong
Run any fio test and you get output like this for a 4K random read:
rand-read: (g=0): rw=randread, bs=(R) 4096B, iodepth=32
fio-3.28
Starting 4 processes
rand-read: (groupid=0, jobs=4): err= 0: pid=12345
read: IOPS=387k, BW=1512MiB/s (1585MB/s)(88.5GiB/60001msec)
clat (usec): min=48, avg=330.42, max=12834, stdev=187.63
lat (usec): min=49, avg=331.10, max=12835, stdev=187.70
clat percentiles (usec):
| 1.00th=[ 94], 5.00th=[ 131], 10.00th=[ 155],
| 20.00th=[ 186], 30.00th=[ 210], 40.00th=[ 237],
| 50.00th=[ 265], 60.00th=[ 318], 70.00th=[ 383],
| 80.00th=[ 478], 90.00th=[ 553], 95.00th=[ 668],
| 99.00th=[ 1020], 99.50th=[ 1237], 99.90th=[ 2212],
| 99.99th=[ 5800]
cpu : usr=4.21%, sys=18.34%, ctx=3421893, majf=0, minf=42
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.4%
Run status group 0 (all jobs):
READ: bw=1512MiB/s, IOPS=387k, run=60001-60001msec
Let me walk through every important number:
IOPS=387k — 387,000 random 4K read operations per second. For random small-block workloads (databases, VMs) this is your headline number. This is what your colleagues compare when evaluating storage for a database.
BW=1512MiB/s — throughput. For large sequential workloads (backups, file copies) this is your headline number. For 4K random I/O it's just IOPS × block size and less meaningful on its own.
clat avg=330.42 usec — average completion latency. Time from when the I/O was submitted to when it completed. 330µs = 0.33ms. This is what your application waits for on each operation. For NVMe this is excellent. For a RAIDZ2 array under write load you might see 1–5ms.
stdev=187.63 — standard deviation of latency. High relative to the average means inconsistent, spikey behavior. Low means predictable. Predictable storage is often more valuable than fast-but-unpredictable for interactive workloads. A stdev of 187µs against an avg of 330µs is reasonable. A stdev of 2000µs against avg of 330µs would indicate serious spikes.
The percentiles — the most important part that most people completely ignore:
99.00th=[ 1020] ← 99% of all requests finished in under 1.02ms
99.50th=[ 1237] ← 99.5% finished under 1.24ms
99.90th=[ 2212] ← 99.9% finished under 2.2ms
99.99th=[ 5800] ← 99.99% finished under 5.8ms; the slowest 0.01% took even longer
A database with average 0.33ms latency but 99th percentile at 50ms will feel broken to users during load spikes. Applications experience their tail latency, not their averages. Your colleagues should be comparing the 99th and 99.9th percentile numbers — not averages, not peak IOPS in isolation. Tail latency is where the real performance story lives.
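Putting those tail numbers side by side can be made mechanical: save each run's output to a file and extract only the percentile lines. A sketch (the filename is a placeholder; the sample lines mirror the output above):

```shell
# Save a run's output with: fio ... --output=config-a.log
# Here we demo the extraction on the sample percentile lines from above:
cat > config-a.log <<'EOF'
     | 99.00th=[ 1020], 99.50th=[ 1237], 99.90th=[ 2212],
     | 99.99th=[ 5800]
EOF
# Pull out only the 99th-and-up percentiles, the numbers worth comparing:
grep -oE '99\.[0-9]+th=\[ *[0-9]+\]' config-a.log
# prints one entry per line: 99.00th=[ 1020] ... 99.99th=[ 5800]
```

Run the identical fio command on each candidate configuration and compare the extracted lines directly; that keeps the debate on tails rather than averages.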
IO depths : 32=99.4% — confirms 99.4% of operations ran at the full queue depth of 32. If this showed mostly depth 1, your fio configuration wasn't actually stressing the drive.
sys=18.34% — kernel CPU time spent handling I/O. If this is very high (above 30–40%) you may be hitting a CPU bottleneck in the I/O path, not a storage bottleneck. Always check CPU alongside storage metrics.
Step 4 — Monitoring Live I/O with iostat
While fio runs, open a second terminal and watch what's actually happening at the device level:
iostat -xz 2
# -x = extended statistics
# -z = hide devices with zero activity
# 2 = refresh every 2 seconds
Output while a test runs:
Device r/s w/s rkB/s wkB/s r_await w_await aqu-sz %util
nvme0n1 387000 0.0 1548000 0.0 0.33 0.00 12.43 100.00
The columns that matter:
r/s and w/s — reads and writes per second. This is your real-time IOPS.
rkB/s and wkB/s — kilobytes per second. Divide by 1024 for MiB/s.
r_await and w_await — average time in milliseconds for requests to complete, including queue wait time. For NVMe under moderate load this should sit well under a millisecond; sustained multi-millisecond values mean the device is overwhelmed.
aqu-sz — average queue size during that interval. This is how many operations were waiting at the device simultaneously. If this is consistently above your configured iodepth, your storage is saturated. If it's near zero under light load, the storage is keeping up easily. More on why this number is critical in the queue depth section below.
%util — percentage of time the device was busy. 100% means saturated — no spare capacity. 50% means you have headroom. This is the most intuitive number to watch in real time.
# Watch specific devices
iostat -xz 2 nvme0n1
# Include CPU stats — always check this alongside storage
iostat -xzc 2
If %util is 100% and CPU is 20%, your storage is the bottleneck. If CPU is 100% and %util is 40%, your storage is fine — CPU is the problem. Always check both. People blame storage for CPU bottlenecks constantly.
Step 5 — Find What's Hitting the Disk with iotop
# Run as root
iotop -o # -o shows only processes doing actual I/O
# Non-interactive, log output every 5 seconds
iotop -bod 5
Output:
Total DISK READ: 1512.00 M/s | Total DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE COMMAND
12345 be/4 root 1512.00 M/s 0.00 B/s fio --name=rand-read ...
Use this when someone says "something is hammering our disk." iotop -o shows exactly which process, which user, and how much. The top line shows total system I/O. The rows show per-process breakdown. Essential for production debugging.
Part 2: Queue Depth — The Concept Underneath All of This
Now the two deep topics. These are what separate benchmarks that mean something from benchmarks that produce confident but useless numbers.
What Queue Depth Actually Is
Think of a restaurant kitchen. A waiter with queue depth 1 takes one order, walks to the kitchen, waits for the food, brings it back, then takes the next order. The kitchen sits idle while the waiter walks. One thing at a time.
A waiter with queue depth 32 takes 32 orders in one round, hands them all to the kitchen simultaneously, and the kitchen works on all of them in parallel. The kitchen is busy the entire time. Completely different throughput.
SSDs are the kitchen. They have internal parallelism — multiple NAND chips that can all work simultaneously. Queue depth is how many I/O operations you allow to be in flight at the same time. Give the drive enough simultaneous work and it keeps all its internal chips busy. Give it one operation at a time and you're throttling a parallel machine into serial behavior.
SATA SSD vs NVMe — The Protocol Gap
This is the real reason NVMe is faster under concurrent load — not just interface speed, but queuing architecture.
SATA SSD uses the AHCI protocol, designed in 2004 for spinning disks. Maximum queue depth of 32. One single queue. One line into the kitchen regardless of how many CPU cores are submitting work.
NVMe SSD uses the NVMe protocol, designed specifically for flash. Up to 65,535 queues with up to 65,535 depth each. In practice most systems use 4–32 queues with depth 32–1024 each. Each CPU core gets its own dedicated queue with zero contention between cores.
SATA SSD:
All CPU cores → [single queue, max depth 32] → SSD
NVMe SSD:
CPU core 0 → [queue 0, depth 1024] ─┐
CPU core 1 → [queue 1, depth 1024] ─┼──→ NVMe SSD
CPU core 2 → [queue 2, depth 1024] ─┤
CPU core 3 → [queue 3, depth 1024] ─┘
No contention. No serialization between cores. Every core has its own lane. This is called Multi-Queue Block Layer (blk-mq) in the Linux kernel and it's the architectural reason NVMe saturates so much more completely than SATA SSD under real concurrent workloads.
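You can see blk-mq's per-device queue setup in sysfs. A small sketch (queue counts depend on your kernel and hardware; an NVMe drive typically shows one entry per CPU core, a SATA drive just one):

```shell
# Each directory under <device>/mq is one hardware dispatch queue.
for mq in /sys/block/*/mq; do
  [ -d "$mq" ] || continue
  dev=$(basename "$(dirname "$mq")")
  printf '%s: %s hardware queue(s)\n' "$dev" "$(ls "$mq" | wc -l)"
done
```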
Queue Depth and IOPS — The Curve
The relationship between queue depth and IOPS is not linear. It curves and then flattens. A typical NVMe SSD under 4K random reads:
Queue Depth 1: ~15,000 IOPS ← drive mostly idle between requests
Queue Depth 4: ~80,000 IOPS ← internal parallelism starting
Queue Depth 8: ~200,000 IOPS ← significant
Queue Depth 32: ~400,000 IOPS ← near peak
Queue Depth 64: ~420,000 IOPS ← diminishing returns
Queue Depth 128: ~425,000 IOPS ← plateau — no meaningful gain
The drive's IOPS ceiling is only reachable when you give it enough simultaneous work. A single-threaded application with queue depth 1 will never see the drive's real capability. This is why the IOPS figure on your SSD's spec sheet is measured at a deep queue (often 128 or more), and why production performance is often a fraction of that number. Your application rarely generates 128 simultaneous operations.
What Queue Depth Your Production Workload Actually Generates
Theory is one thing. What your application actually sends to storage is what drives the real decision. Check it while the application runs under real load:
iostat -xz 2
# Watch the aqu-sz column
Device r/s w/s rkB/s wkB/s r_await w_await aqu-sz %util
nvme0n1 8432.0 2100.0 33728.0 8400.0 0.38 1.12 4.21 87.00
aqu-sz = 4.21 — your workload generates an average queue depth of about 4. Your drive is nowhere near its parallel capacity. You could handle significantly more load before hitting the drive's ceiling. If aqu-sz is consistently above 32 and %util is 100%, you're genuinely storage-bound. If aqu-sz is below 4 and %util is below 50%, your storage has headroom and the bottleneck is elsewhere.
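To watch aqu-sz over time without hand-counting columns (which shift between sysstat versions), locate the column by its header name. A sketch; the device name is an example:

```shell
# aqu.awk finds the aqu-sz column by header name, then prints the
# first field (device name) and the queue size for every data line.
cat > aqu.awk <<'EOF'
/aqu-sz/ { for (i = 1; i <= NF; i++) if ($i == "aqu-sz") col = i; next }
col && NF { print $1, $col }
EOF
# Live use:  iostat -xzd 5 nvme0n1 | awk -f aqu.awk
# Demo on the sample output from above:
printf '%s\n' \
  'Device r/s w/s rkB/s wkB/s r_await w_await aqu-sz %util' \
  'nvme0n1 8432.0 2100.0 33728.0 8400.0 0.38 1.12 4.21 87.00' |
awk -f aqu.awk
# prints: nvme0n1 4.21
```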
Queue Depth in fio — Set It to Match Reality
--numjobs × --iodepth = total operations in flight.
Single application, sequential workload:
--numjobs=1 --iodepth=8
# Total: 8 in-flight — sequential I/O doesn't need deep queues
Database with moderate concurrent connections:
--numjobs=4 --iodepth=32
# Total: 128 in-flight — realistic for 20–50 concurrent DB connections
Heavy concurrent workload (many VMs, busy database):
--numjobs=8 --iodepth=64
# Total: 512 in-flight — serious pressure test
Find your drive's saturation point:
for QD in 1 4 8 16 32 64 128; do
sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2
echo -n "QD=$QD: "
fio --name=qd-test \
--filename=/data/fio-test \
--rw=randread \
--bs=4K \
--size=4G \
--numjobs=1 \
--iodepth=$QD \
--runtime=20 \
--time_based \
--group_reporting \
--output-format=terse 2>/dev/null | \
awk -F';' '{printf "IOPS=%s\n", $8}'
done
Where IOPS stops growing is the ceiling. Where your production aqu-sz lands on that curve is your operating point. The gap between them is your headroom.
Part 3: Block Size — Two Different Things, Same Name
"Block size" means two different things depending on context. Both matter. People confuse them constantly.
Filesystem block size — the minimum storage unit the filesystem uses. Set at mkfs time. Every file, regardless of how tiny, occupies at least one block. Permanent for ext4 and XFS. Flexible per-dataset for ZFS.
I/O block size — the size of individual read/write requests sent to storage. Decided by the application. PostgreSQL sends 8K reads because its internal page size is 8K. cp might send 128K chunks. fio sends whatever you set with --bs.
They interact with each other but they are independent. Let me take them separately.
Filesystem Block Size
When you format a filesystem you're deciding the granularity of storage allocation. Every file consumes space in multiples of this block size regardless of actual file size.
# ext4 — 4K default, almost always correct
mkfs.ext4 /dev/sdb # 4K blocks (default)
mkfs.ext4 -b 4096 /dev/sdb # explicit 4K
mkfs.ext4 -b 1024 /dev/sdb # 1K — only for massive numbers of tiny files
# XFS — 4K default, can go larger for specific workloads
mkfs.xfs /dev/sdb # 4K default
mkfs.xfs -b size=65536 /dev/sdb # 64K — large file workloads only
# ZFS recordsize — not set at pool creation, set per dataset, changeable anytime
zfs set recordsize=4K mypool/dataset
zfs set recordsize=128K mypool/dataset
zfs set recordsize=1M mypool/dataset
The space waste problem with large block sizes:
A 100-byte config file on a 4K filesystem occupies 4,096 bytes. 3,996 bytes wasted. With a million small files that waste is significant. On a 64K filesystem that same 100-byte file wastes 65,436 bytes — 655× more space than the actual file content.
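You can demonstrate the rounding yourself on any filesystem. A sketch; the allocated figure depends on the filesystem the file lands on:

```shell
# Write exactly 100 bytes, then compare apparent size to allocated space.
printf '%.100d' 0 > tiny-file                  # 100 bytes of content
stat -c 'apparent size: %s bytes' tiny-file    # reports 100
stat -c 'allocated: %b blocks of %B bytes' tiny-file
du -h tiny-file                                # typically 4.0K on a 4K filesystem
```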
The performance problem with small block sizes:
Reading a 1GB file stored in 4K blocks means the filesystem tracks 262,144 individual block pointers in its metadata. The same file in 128K blocks requires 8,192 pointers. Fewer metadata lookups, better sequential read performance, less overhead on the filesystem's internal structures.
The default 4K is a compromise that works well for mixed workloads. When you have a specific dominant workload, you can do better — especially with ZFS.
The ZFS Advantage — Recordsize Per Dataset
ext4 and XFS make you choose one block size for the entire filesystem at creation time. One workload, one choice, forever. ZFS lets you set recordsize per dataset and change it anytime (affects new writes, not existing data).
# Different workloads, different recordsizes, same pool
zfs create mypool/postgres
zfs set recordsize=8K mypool/postgres # matches PostgreSQL 8K page size
zfs create mypool/mysql
zfs set recordsize=16K mypool/mysql # matches InnoDB 16K page size
zfs create mypool/files
zfs set recordsize=128K mypool/files # general file storage
zfs create mypool/backups
zfs set recordsize=1M mypool/backups # maximum sequential throughput
zfs create mypool/vms
zfs set recordsize=64K mypool/vms # VM disk images, mixed I/O
zfs create mypool/logs
zfs set recordsize=32K mypool/logs # append-heavy log files
# Verify all settings
zfs get recordsize mypool
One pool. Six workloads. Each tuned precisely. With ext4 or XFS you pick one setting and accept that it's a compromise for everything. With ZFS you pick the right setting for each use case independently.
I/O Block Size — What the Application Sends
This is what --bs in fio controls. The application decides this, not you. PostgreSQL sends 8K reads. MySQL InnoDB sends 16K. The kernel page cache sends 4K. You can't change what your application sends — but you can match your storage configuration to it.
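If the documentation doesn't say, you can observe the sizes directly. A sketch using strace (the PID and filenames are placeholders, pread64/pwrite64 are the 64-bit Linux syscall names, and the parsing is deliberately naive):

```shell
# Capture ~10 s of a process's file I/O with something like:
#   timeout 10 strace -f -e trace=pread64,pwrite64 -p 1234 -o io.log
# Then histogram the transfer sizes. Demo on a fabricated three-line log:
printf '%s\n' \
  'pread64(42, "\0"..., 8192, 0) = 8192' \
  'pread64(42, "\0"..., 8192, 8192) = 8192' \
  'pwrite64(43, "\0"..., 4096, 0) = 4096' > io.log
# The last field of each line is the byte count actually transferred:
awk '/pread64|pwrite64/ && $(NF-1) == "=" { n[$NF]++ }
     END { for (s in n) print s " bytes: " n[s] " calls" }' io.log
# prints one line per size, e.g.: 8192 bytes: 2 calls
```

If the histogram is dominated by one size, that's the number to feed fio's --bs and your recordsize decision.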
Small blocks (4K–16K):
- Many operations per second required
- IOPS is the bottleneck
- Latency per operation dominates performance
- This is the database world
Large blocks (128K–1M):
- Few operations needed
- Throughput (MB/s) is what matters
- Latency per operation matters less because each transfers so much data
- This is backups, file copies, video storage
The math makes this concrete:
Writing 1GB with 4K blocks:
262,144 individual write operations required
At 100,000 IOPS = 2.6 seconds
Bottleneck: IOPS
Writing 1GB with 1M blocks:
1,024 individual write operations required
At 1,000 IOPS but 1 GB/s throughput = 1 second
Bottleneck: throughput (MB/s)
Same 1GB. Different I/O profile. Different bottleneck. Different things to optimize.
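The arithmetic above is easy to check in shell:

```shell
GIB=$((1024 * 1024 * 1024))                  # 1 GiB in bytes
echo "ops at 4K: $((GIB / 4096))"            # 262144
echo "ops at 1M: $((GIB / 1048576))"         # 1024
# 262,144 operations at 100,000 IOPS:
awk 'BEGIN { printf "time at 4K: %.1f s\n", 262144 / 100000 }'   # 2.6 s
```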
How Filesystem Block Size and I/O Block Size Interact
When the application's I/O size doesn't match the filesystem block size, overhead happens:
Application sends 8K, filesystem block size is 4K — clean:
Application: write 8K
Filesystem (4K blocks):
→ allocates 2 blocks (2 × 4K = 8K)
→ writes block 1, writes block 2
→ 2 metadata entries, clean alignment
Result: no waste, natural
Application sends 4K, filesystem block size is 128K — write amplification:
Application: write 4K
Filesystem (128K record):
→ must allocate 1 full 128K block
→ 4K of real data, 124K of padding or read-modify-write
→ 32× write amplification per operation
Result: catastrophic for a database doing thousands of 4K writes per second
This is write amplification — the filesystem writes far more than the application asked for due to block size mismatch. The principle is simple: match your filesystem block size or ZFS recordsize to your application's dominant I/O size.
SSDs Add Another Layer — The Erase Block
SSDs have their own internal block structure. NAND flash can read and write in small units (typically 4K pages internally) but can only erase in much larger units called erase blocks — typically 256K to 4MB depending on NAND type.
When you write 4K over existing data and the drive has no pre-erased block to remap it into, the SSD internally:
1. Reads the entire 256K–4MB erase block into an internal buffer
2. Erases the full block (cannot erase just 4K)
3. Modifies the 4K you actually wanted to change
4. Writes the entire block back
You asked to write 4K. The SSD moved 256K–4MB internally.
This internal amplification is why random small writes are expensive on SSDs and why filesystem alignment matters for sustained write performance and drive longevity. Modern mkfs tools auto-detect and align correctly for most SSDs, but verify for high-performance builds:
# Check alignment parameters
cat /sys/block/nvme0n1/queue/optimal_io_size
cat /sys/block/nvme0n1/queue/minimum_io_size
cat /sys/block/nvme0n1/alignment_offset
# Verify ext4 aligned correctly after formatting
tune2fs -l /dev/nvme0n1p1 | grep "Block size"
dumpe2fs /dev/nvme0n1p1 2>/dev/null | grep -E "Block size|stride|stripe width"
The Block Size Sweep — See It Yourself
Run this to map the relationship between I/O block size, IOPS, and throughput on your actual hardware:
for BS in 4K 8K 16K 32K 64K 128K 512K 1M; do
sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2
echo -n "BS=$BS: "
fio --name=bs-test \
--filename=/data/fio-test \
--rw=randread \
--bs=$BS \
--size=4G \
--numjobs=4 \
--iodepth=32 \
--runtime=20 \
--time_based \
--group_reporting \
--output-format=terse 2>/dev/null | \
awk -F';' '{printf "IOPS=%s BW=%.0fMiB/s\n", $8, $7/1024}'
done
A typical NVMe SSD produces something like:
BS=4K: IOPS=380000 BW=1484MiB/s
BS=8K: IOPS=210000 BW=1640MiB/s
BS=16K: IOPS=115000 BW=1796MiB/s
BS=32K: IOPS=62000 BW=1937MiB/s
BS=64K: IOPS=32000 BW=2000MiB/s ← throughput plateau begins
BS=128K: IOPS=16000 BW=2000MiB/s ← pipe is full, IOPS keeps dropping
BS=512K: IOPS=4000 BW=2000MiB/s
BS=1M: IOPS=2000 BW=2000MiB/s
This drive's throughput ceiling is ~2000 MiB/s. You hit it at 64K blocks. Beyond 64K, larger block size doesn't move more data — the pipe is already full. Below 64K, IOPS is the constraint. A database at 4K is in the IOPS-constrained region. Backups at 1M are in the throughput-constrained region. Both correct for their workload — the curve tells you whether you have headroom.
Block Size Decision Table
Stop guessing. Match to your workload:
| Workload | Filesystem | Block / Record Size | Why |
|---|---|---|---|
| PostgreSQL | ext4 / XFS | 4K (default) | Safe match for 8K pg pages |
| PostgreSQL | ZFS | recordsize=8K | Exact match to pg page size |
| MySQL / MariaDB | ZFS | recordsize=16K | InnoDB default page size |
| MongoDB | ZFS | recordsize=16K | WiredTiger default |
| General mixed files | ext4 / XFS | 4K (default) | Best all-rounder |
| Large files only | XFS | 64K | Fewer metadata ops |
| Large files only | ZFS | recordsize=128K | Big sequential throughput gain |
| Backup target | ZFS | recordsize=1M | Maximum sequential throughput |
| VM disk images | ZFS | recordsize=64K | Mixed I/O, amortizes overhead |
| Log files | ZFS | recordsize=32K | Append-heavy, medium size |
| Object storage | ZFS | recordsize=1M | Large objects, sequential |
For anything not on this list: find your application's internal I/O size (check its documentation), and match recordsize to it. That is always the right principle.
Putting It All Together — The Full Benchmark Workflow
This is the process. Run it on each storage configuration you're comparing:
Step 1 — Quick ceiling check:
hdparm -t /dev/nvme0n1
ioping -c 30 /data
Step 2 — Map your drive's queue depth curve:
for QD in 1 4 8 32 64; do
sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2
echo -n "QD=$QD: "
fio --name=qd --filename=/data/fio-test \
--rw=randread --bs=4K --size=4G \
--numjobs=1 --iodepth=$QD \
--runtime=20 --time_based --group_reporting \
--output-format=terse 2>/dev/null | awk -F';' '{printf "IOPS=%s\n", $8}'
done
Step 3 — Check your production queue depth:
# Run your actual application under real load, then:
iostat -xz 2
# Note the aqu-sz value — this is what your storage actually sees
Step 4 — Benchmark with your real workload parameters:
# Use the block size your application actually sends
# Use the queue depth your production aqu-sz shows
sync && echo 3 > /proc/sys/vm/drop_caches
fio --name=production-sim \
--filename=/data/fio-test \
--rw=randrw \
--rwmixread=70 \
--bs=8K \
--size=4G \
--numjobs=4 \
--iodepth=16 \
--runtime=60 \
--time_based \
--group_reporting
Step 5 — Monitor while the test runs:
# Second terminal
iostat -xzc 2
# Watch: aqu-sz (queue saturation), %util (capacity headroom),
# r_await/w_await (real latency), sys% CPU (I/O path overhead)
Step 6 — Compare tail latency, not averages:
The decision between RAIDZ2 and mirrors for a database comes down to the 99.00th and 99.90th percentile clat numbers at your production queue depth and block size. Not headline IOPS. Not average latency. The tail. Run the same test on both configurations. The one with lower, more consistent tail latency wins for interactive workloads. The one with higher throughput wins for bulk sequential. These are often different configurations — which is why the answer to "which is better" is always "depends on the workload."
What to Tell Your Colleagues Next Time
When they're debating storage strategies, the right questions are:
- What's the dominant workload — sequential or random?
- What's the application's internal I/O size? Match your block size to it.
- What queue depth does the application generate under real load? Check aqu-sz.
- What matters more — IOPS, throughput, or tail latency?
Run fio with parameters that match your actual workload. Compare 99th percentile latency numbers. Check %util in iostat — if you're hitting 100% before load even peaks, the storage will fall over under production load regardless of how good the benchmark averages look.
The entire debate is one properly configured fio run away from a concrete answer.
Compiled by AI. Proofread by caffeine. ☕