ZFS vs Traditional Stack: What Nobody Explains Until You're Recovering at 2AM
Most Linux admins spend years layering mdadm on top of LVM on top of ext4 without questioning it. Then they watch a colleague spin up a ZFS pool and do something that looks like magic — instant snapshots, block-level replication, silent corruption detection — and suddenly the traditional stack feels like three tools doing one job badly.
This is the explanation I wish I had five years ago. Not a docs summary. The actual concepts, the mental models, the things that confused me at first, and the scenarios that made them click.
Block Storage vs Filesystem — Start Here
Before anything else, you need this distinction clear.
Block storage is dumb. Your hard drive, your SSD, /dev/sda1 — these are just devices that store raw chunks of bytes. No concept of files, folders, or permissions. They don't know or care what's stored in them.
A filesystem is the software that sits on top and gives that raw storage a structure. When you run mkfs.ext4 /dev/sda1, you're building a city on empty land. The filesystem creates files, directories, permissions, timestamps. Without it, you just have bytes with no meaning.
The relationship is always: Block device → Filesystem → Your files.
This matters because ZFS blurs this line intentionally — and understanding why requires understanding what the line even was.
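You can watch this relationship form with nothing but a plain file standing in for a disk. A rough sketch, assuming e2fsprogs is installed (the file name is made up):

```shell
# A raw "block device": just bytes, no structure
truncate -s 64M blockdev.img
file blockdev.img                 # reports plain data, no meaning yet
# Build the "city": mkfs lays down superblocks, inode tables, directories
mkfs.ext4 -q -F blockdev.img
file blockdev.img                 # now reports an ext4 filesystem
```

Same bytes-holding container before and after; only after mkfs does anything on the system know how to find a "file" in it.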
RAID: Combining Disks for Speed and Survival
RAID is about combining multiple physical disks to get redundancy, performance, or both. Before you can understand ZFS RAID, you need to understand the traditional RAID levels your colleagues were debating.
RAID 0 — stripes data across disks. If you have two 1TB disks, you get 2TB and double the write speed. But zero redundancy. One disk dies and everything is gone. Not for anything you care about.
RAID 1 — mirroring. Two disks, identical data. Survive one disk failure. You lose half your space but you have a complete copy at all times. Fast reads (can read from either disk), slower writes.
RAID 5 — stripes data across at least 3 disks with a parity block. You can survive one disk failure. Good balance of space efficiency and redundancy. The math: with 4 disks you get 3 disks worth of usable space.
RAID 6 — like RAID 5 but with double parity. Survive two disk failures simultaneously. Slower writes because of the extra parity calculation, but safer. Minimum 4 disks.
RAID 10 — RAID 1 + RAID 0 combined. You mirror pairs of disks, then stripe across the mirrors. Fast random I/O, survive a disk failure, but you lose half your space. Minimum 4 disks. This is what your colleagues were comparing against ZFS.
The debate your colleagues were having — RAID 10 with LVM vs RAIDZ2 with ZFS — is really a debate about the whole philosophy of how you build storage. Not just which RAID level is faster.
The Traditional Stack: Three Layers, Three Jobs
The standard Linux approach before ZFS became mainstream was stacking three separate tools:
mdadm handles RAID at the block level. It takes your physical disks and presents a virtual RAID device like /dev/md0 to the rest of the system.
LVM (Logical Volume Manager) sits on top of RAID. It lets you pool block devices into a volume group, then carve out logical volumes you can resize on the fly. It adds flexibility that raw RAID doesn't have.
ext4 or XFS sits on top of LVM and handles actual file storage — the thing that turns blocks into files and directories.
Physical Disks → mdadm (RAID) → LVM (volumes) → ext4 (files)
Three separate layers. Each one solid on its own. The problem is they don't communicate. ext4 has no idea it's sitting on RAID. LVM has no idea what the filesystem above it is doing. mdadm doesn't know about files at all. They're strangers working the same job.
This isolation creates a specific dangerous gap: none of them verify data integrity end-to-end. A block can rot on the platter or get corrupted in transit between the disk and the filesystem, and nobody will notice. The data gets written, confirmed, and the corruption sits there silently until the day you actually read that block and your application crashes trying to parse garbage.
What ZFS Does Differently
ZFS throws out the layered model entirely. It handles RAID, volume management, and the filesystem in one integrated system — written to be aware of all three simultaneously.
But integration isn't the main reason people choose ZFS. This is:
Every single block is checksummed. When ZFS writes a block, it calculates a checksum and stores it in the block's parent pointer — not in the block itself. This is deliberate: a corrupted block cannot fake its own checksum because the checksum lives elsewhere. When ZFS reads that block back, it recalculates and compares. Mismatch means corruption detected. With redundancy, ZFS heals it automatically and silently. Without redundancy, it at least tells you immediately rather than letting you trust bad data.
This is called silent data corruption detection. ext4 on a RAID array has no equivalent. ZFS considers data integrity non-negotiable at the storage layer, not something to bolt on later at the application layer.
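The "checksum lives in the parent" idea can be sketched in a few lines of shell. This is a toy model of the principle, not of how ZFS stores anything on disk (file names are illustrative):

```shell
# The "block" holds data; the "parent" records the block's checksum
echo "hello from feb 28" > block_a
sha256sum block_a > parent_pointer
# On read: recompute and compare against what the parent recorded
sha256sum -c --quiet parent_pointer && echo "block verified"
# Simulate bit rot: the block cannot fake its own checksum, it lives elsewhere
echo "garbage" > block_a
sha256sum -c --quiet parent_pointer || echo "CORRUPTION DETECTED"
```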
Copy-on-Write: The Engine Behind Everything
To understand snapshots, clones, and send/receive you need to understand COW first.
ZFS never overwrites a block in place. Ever. When you modify a file, ZFS writes the new content to a completely new block, updates the pointer to reference the new block, and only then marks the old block as free — but only if nothing else is still pointing to it.
Before modification:
test.txt → [Block A: "hello from feb 28"]
After modifying test.txt:
test.txt → [Block E: "updated march 3"] ← new block written first
Block A: still on disk, reference count dropped
If Block A's reference count hits zero (no snapshot, no dataset references it), it gets marked free and ZFS will reuse it for future writes. ZFS is not hoarding old blocks — it's being precise about when it's safe to release them.
This COW approach is why ZFS is always consistent. There is no "partially written" state. Either the write completes and the pointer updates, or the pointer never changes and the old data is intact. You don't need fsck after a crash because the filesystem is never in an inconsistent state.
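A crude analogy for the pointer swap, using a symlink as the "pointer". This sketches the idea, not ZFS internals (file names are made up):

```shell
# Old block stays put; new content lands in a brand-new block
echo "hello from feb 28" > block_A
ln -sfn block_A test.txt          # test.txt is just a pointer to block A
echo "updated march 3" > block_E  # write the new block first...
ln -sfn block_E test.txt          # ...then swap the pointer
cat test.txt                      # new content via the updated pointer
cat block_A                       # old data intact until its refcount hits zero
```

At no point does a reader of test.txt see a half-written state: the pointer targets either the complete old block or the complete new one.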
ZFS Pool Structure: Vdevs and How They Compose
This is where ZFS terminology trips people up. You need to understand two things: pools and vdevs.
A pool (zpool) is your top-level storage container. Think of it like the entire storage system.
A vdev (virtual device) is the building block of a pool. A vdev can be a single disk, a mirror, a RAIDZ group, or one of a few special-purpose devices (log, cache, spare). Your pool is made of one or more vdevs.
zpool "mypool"
├── vdev 1: mirror (disk1 + disk2) ← RAID 1
└── vdev 2: mirror (disk3 + disk4) ← RAID 1
Data is striped across vdevs (like RAID 0 between them), and protected within each vdev (mirror or RAIDZ). This is how ZFS achieves the equivalent of RAID 10 — two mirror vdevs striped.
The critical rule: you cannot change a vdev's type after creation. You can add new vdevs to a pool, and recent OpenZFS releases can widen an existing RAIDZ vdev by one disk at a time, but you cannot convert a mirror vdev into a RAIDZ vdev or vice versa. Plan your vdev layout before you create the pool.
ZFS RAIDZ: The ZFS Take on Parity RAID
ZFS implements its own parity RAID, called RAIDZ, with three variants:
RAIDZ1 — single parity, survives 1 disk failure. Equivalent to RAID 5 but without the "RAID 5 write hole" (a design flaw in traditional RAID 5: a crash between writing the data and writing the parity leaves the two inconsistent; ZFS's copy-on-write transactions eliminate this completely).
RAIDZ2 — double parity, survives 2 simultaneous disk failures. Equivalent to RAID 6. This is what your colleagues were debating. With 6 disks you get 4 disks of usable space.
RAIDZ3 — triple parity, survives 3 disk failures. Used in large disk arrays where the probability of multiple failures during a rebuild is real.
# RAIDZ1 — survive 1 failure (minimum 3 disks)
zpool create mypool raidz /dev/loop1 /dev/loop2 /dev/loop3
# RAIDZ2 — survive 2 failures (minimum 4 disks)
zpool create mypool raidz2 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4
# RAIDZ3 — survive 3 failures (minimum 5 disks)
zpool create mypool raidz3 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5
# Mirror vdevs (ZFS equivalent of RAID 10) — best random I/O
zpool create mypool mirror /dev/loop1 /dev/loop2 mirror /dev/loop3 /dev/loop4
RAIDZ2 vs Mirrors — which to choose?
This was the exact debate your colleagues were having. Here is the honest answer:
Mirrors are better for random I/O (databases, VMs, anything with lots of small reads and writes). Reading from a mirror means ZFS can read from either disk — it picks the one that responds faster. Rebuilding after a disk failure is also faster because you just copy one disk to the replacement.
RAIDZ2 is better for sequential workloads (backup targets, bulk storage, NAS) and for maximizing usable space. With 6 disks in RAIDZ2 you get 4 disks of space. Six disks as mirrors gives you only 3 disks of space.
The rebuild time difference matters more than people think. With RAIDZ2 on large disks, rebuilding (resilvering in ZFS terminology) can take days. During that time you're exposed. With mirrors, a resilver is much faster. Many experienced ZFS users default to mirrors for anything critical and only use RAIDZ2 when space efficiency is the priority.
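The space side of the debate in numbers, using shell arithmetic for the 6-disk case from the text (1 TB disks):

```shell
disks=6; tb=1
echo "RAIDZ2 : $(( (disks - 2) * tb )) TB usable, survives ANY 2 disk failures"
echo "Mirrors: $(( disks / 2 * tb )) TB usable, survives 1 failure PER PAIR"
```

Note the failure semantics differ too: three mirror pairs can ride out up to three failures if each lands in a different pair, but two failures in the same pair lose the pool, while RAIDZ2 survives any two.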
ZFS Storage Types: Datasets vs Volumes
ZFS gives you two types of storage objects inside a pool. Most people learn datasets and never touch volumes. You should know both.
Datasets (the default)
A dataset is a ZFS filesystem — the standard way to store files in ZFS. You create one, ZFS mounts it automatically, you use it like a directory.
zfs create mypool/data
zfs create mypool/backups
zfs create mypool/logs
Datasets are hierarchical. A child dataset inherits properties from its parent unless you override them:
# Set compression on parent — all children inherit it
zfs set compression=lz4 mypool
# Override on a specific child
zfs set compression=off mypool/logs
# Set quota — this dataset cannot use more than 50GB
zfs set quota=50G mypool/data
# Set reservation — guarantee this dataset always has 10GB available
zfs set reservation=10G mypool/backups
Each dataset can have its own snapshot policy, compression, deduplication, quota, and mountpoint. This is the equivalent of having LVM logical volumes but with the filesystem awareness baked in.
Volumes (zvols)
A zvol is a block device backed by ZFS. Instead of a filesystem, ZFS presents a raw block device at /dev/zvol/mypool/myvolume. You use this when something needs a block device directly — like a VM disk, an iSCSI target, or when you want to put a different filesystem on top.
# Create a 20GB zvol
zfs create -V 20G mypool/myvm-disk
# It appears as a block device
ls /dev/zvol/mypool/myvm-disk
# You can put any filesystem on it
mkfs.ext4 /dev/zvol/mypool/myvm-disk
# Or use it as a VM disk image directly in KVM/QEMU
Why use a zvol instead of just giving the VM a dataset? Because some hypervisors and applications expect a raw block device, not a filesystem path. With a zvol, the VM talks directly to the block layer while ZFS still handles checksumming, snapshots, and compression underneath.
When to use which:
| Use Case | Use |
|---|---|
| Files, directories, general storage | Dataset |
| VM disk images | zvol |
| iSCSI targets | zvol |
| Database that manages its own files | Dataset (usually) |
| Docker storage driver | Dataset |
Snapshots: What They Actually Are (The Full Picture)
When you run zfs snapshot mypool/data@snap1, ZFS freezes the current pointer tree by giving it a new, permanent name. No data is copied. No blocks are duplicated. The snapshot costs almost zero space because it references the exact same blocks as the live dataset.
Right after snapshot — cost: ~0:
Live dataset → Block A, Block B, Block C, Block D
@snap1 → Block A, Block B, Block C, Block D (same blocks)
Now you modify a file. COW writes to a new block. The live dataset pointer updates. The snapshot pointer stays frozen on the old block.
After modifying data:
Live dataset → Block E, Block B, Block C, Block D (E replaced A)
@snap1 → Block A, Block B, Block C, Block D (A kept alive by snapshot)
Snapshot now "owns" Block A — that space is charged to @snap1
The snapshot grows not because it tracks new data — it grows because it's preserving old blocks that the live dataset moved away from. The snapshot is a preservation of the past, not a copy of the present.
Delete the snapshot, Block A's reference count drops to zero, it gets freed immediately. That's why destroying an old snapshot can free gigabytes instantly.
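Reference counting is the whole trick, and hard links give a cheap way to feel it. A toy analogy only — ZFS counts block references internally, not via links (file names are made up):

```shell
echo "feb 28 data" > block_A
ln block_A snap1_ref              # "snapshot" takes a second reference
stat -c %h block_A                # link count is now 2: live dataset + snapshot
rm block_A                        # live dataset moves on to a new block...
cat snap1_ref                     # ...but the snapshot keeps the data alive
rm snap1_ref                      # last reference gone: now the space is freed
```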
Multiple Snapshots and the Chain
@snap1 taken: dataset owns A, B, C
@snap2 taken: dataset changed B → F, so @snap2 owns old B
@snap3 taken: dataset changed C → G, so @snap3 owns old C
@snap1 owns: A (from the beginning)
@snap2 owns: B (changed between snap1 and snap2)
@snap3 owns: C (changed between snap2 and snap3)
They don't duplicate each other. Each snapshot only holds the blocks that changed after it was taken and before the next snapshot. This is precisely why incremental send is efficient — zfs send -i @snap2 @snap3 only streams the blocks that @snap3 uniquely owns.
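A toy version of "only stream what the newer snapshot uniquely owns", using directories as stand-ins for snapshots. Purely illustrative:

```shell
mkdir -p snap2 snap3
echo "A" > snap2/f1; echo "B" > snap2/f2
cp -r snap2/. snap3/              # snap3 starts as a reference to the same state
echo "F" > snap3/f2               # only f2 changed between the two snapshots
# The "incremental" is just this delta, not the whole dataset
diff -rq snap2 snap3 || true
```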
Send and Receive: Block-Level Replication With Built-In Integrity
zfs send creates a binary stream of all blocks referenced by a snapshot, including their checksums. zfs receive writes those blocks to the remote pool and verifies every checksum as it lands. If any block is corrupted in transit, receive fails — loudly and completely. Nothing partial gets committed.
# Full send — first time only
zfs send mypool/data@snap1 | ssh user@remote zfs receive backuppool/data
# Incremental — only the diff between two snapshots
zfs send -i mypool/data@snap1 mypool/data@snap2 | ssh user@remote zfs receive backuppool/data
# With verbose progress
zfs send -v mypool/data@snap2 | ssh user@remote zfs receive -v backuppool/data
After a successful receive, the remote has:
backuppool/data ← a real, mountable, live dataset
backuppool/data@snap1 ← the snapshot, kept as the anchor for next incremental
The snapshot on the remote is mandatory — it's the reference point for the next incremental send. If you delete @snap1 on the remote before sending @snap2, the incremental will fail because the remote no longer has the baseline to apply the diff on top of.
The size question: the stream size equals the actual data blocks, not the pool's allocated or provisioned size. A dataset with 100MB of real files produces a ~100MB send stream, regardless of whether the pool is 2TB. You're sending data, not empty space.
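You can see the "data, not provisioned space" principle with a sparse file, a rough stand-in for a mostly-empty pool:

```shell
truncate -s 2G pool.img                        # 2 GB provisioned, nothing written
dd if=/dev/zero of=real_data bs=1M count=100 status=none
du -h --apparent-size pool.img real_data       # what was provisioned: 2G vs 100M
du -h pool.img real_data                       # what actually exists: ~0 vs 100M
```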
The Snapshot Pyramid: Why You Need More Than Just Dailies
The remote faithfully mirrors whatever you send it — corrupted data included. If corruption happens on March 1st and you sync every night, the remote has 21 days of corrupted syncs by March 22nd. The remote is your disaster recovery (disk failure, fire, hardware loss). It is not your time machine.
The local snapshot pyramid is your time machine. The ability to say "give me back March 1st, or February 28th, or last month."
Hourly → keep 24: fine-grained recovery for today
Daily → keep 7: something went wrong this week
Weekly → keep 4: the problem started more than a week ago and the dailies are gone
Monthly → keep 3: what did this look like before that big change last month?
Each tier is a separately named, independently managed snapshot. After day 8 your oldest daily gets pruned — but the weekly from that same day is still alive with its own 4-week lifetime. After 5 weeks the weekly is gone — but the monthly is still there with a 3-month lifetime.
The total space cost is not 38 full copies. It's the cumulative total of what actually changed across that entire window. Stable data = cheap pyramid. High-churn data = more expensive, but that's also the data you most need to be able to recover.
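For the record, the "38" is just the tiers added up:

```shell
hourly=24; daily=7; weekly=4; monthly=3
echo "snapshots kept at steady state: $(( hourly + daily + weekly + monthly ))"
```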
Sanoid manages this automatically. You declare the policy, it handles creation and pruning via cron:
# /etc/sanoid/sanoid.conf
[mypool/data]
use_template = production
[template_production]
hourly = 24
daily = 7
weekly = 4
monthly = 3
autosnap = yes
autoprune = yes
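If your distro package doesn't install a systemd timer for you, a crontab entry drives it. The binary path here is an assumption — check `which sanoid` on your system:

```
# /etc/cron.d/sanoid — run the policy engine regularly; --cron takes and prunes per policy
*/15 * * * * root /usr/sbin/sanoid --cron
```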
Syncoid (Sanoid's companion) handles the automated send/receive to your remote:
# Run manually or via cron — handles incremental automatically
syncoid mypool/data user@remote:backuppool/data
Recovery: The Real Test of Everything Above
Scenario: corruption happened March 1st. Today is March 22nd. Nobody noticed until now. Your 7 daily snapshots are all post-corruption. Your remote received corrupted syncs for 21 days. But your local @weekly-feb-28 is clean.
Step 1 — Stop the app. Non-negotiable.
systemctl stop yourapp
lsof /mypool/data # make sure nothing is still writing
Every second the app runs, the situation gets more complex.
Step 2 — Audit before touching anything
zfs list -t snapshot -o name,creation,used -s creation mypool/data
Map out exactly what you have and when each snapshot was taken. Identify your last clean candidate.
Step 3 — Inspect the candidate snapshot
Mounting a snapshot is instant. ZFS doesn't reconstruct anything — it just reads from the frozen pointer tree, which points to blocks physically sitting on disk right now. Complete, readable, the exact state of the data at that moment.
mkdir /mnt/recovery-check
mount -t zfs mypool/data@weekly-feb-28 /mnt/recovery-check
# no-mount alternative: every snapshot is browsable read-only at /mypool/data/.zfs/snapshot/
ls -la /mnt/recovery-check
# Check your files, your database, whatever matters
umount /mnt/recovery-check
Run a scrub to verify block-level integrity:
zpool scrub mypool
zpool status mypool # wait for completion, 0 errors is what you want
Step 4 — Take a snapshot of the corrupted state before doing anything destructive
zfs snapshot mypool/data@corrupted-mar-22
This costs almost nothing. If Feb 28th turns out to be missing something important you didn't expect, you still have a way back. Never throw away data before you're certain you don't need it.
Step 5a — Clone approach (safer, keeps options open)
# Leave corrupted dataset completely intact
zfs clone mypool/data@weekly-feb-28 mypool/data-recovered
# Point your app at the recovered dataset
# Verify everything works correctly in production
# Only after confirming it's good: promote first, so the clone stops
# depending on the old dataset's snapshot (otherwise the destroy fails
# with "snapshot has dependent clones")
zfs promote mypool/data-recovered
zfs destroy -r mypool/data
zfs rename mypool/data-recovered mypool/data
Step 5b — Rollback approach (faster, more direct)
# Why -r is required:
# Without it ZFS refuses. Intermediate snapshots (daily-mar-01 through mar-22)
# would be left referencing a timeline that no longer exists.
# ZFS won't allow a dataset to have two timelines.
# -r destroys all snapshots newer than your target.
zfs rollback -r mypool/data@weekly-feb-28
Step 6 — Rebuild the remote
The remote's snapshot chain no longer matches the source, so a plain incremental send will fail. The simple fix is to destroy the remote dataset and re-send from scratch (if the remote still holds @weekly-feb-28, zfs receive -F can instead force it back onto that shared baseline):
ssh user@remote zfs destroy -r backuppool/data
zfs send mypool/data@weekly-feb-28 | ssh user@remote zfs receive backuppool/data
Or if you want to keep the corrupted remote data for investigation, rename it first before destroying.
Step 7 — Go live and reestablish the chain
zfs snapshot mypool/data@recovery-mar-22
systemctl start yourapp
# Verify app is healthy, then reestablish incremental chain to remote
zfs send -i mypool/data@weekly-feb-28 mypool/data@recovery-mar-22 | \
ssh user@remote zfs receive backuppool/data
# Let sanoid resume its normal schedule
The Practical Lab: Try Both Stacks on Any VM
You don't need extra disks. Use loop devices — files pretending to be disks.
# Create 4 x 1GB fake disks
for i in 1 2 3 4; do dd if=/dev/zero of=/tmp/disk$i.img bs=1M count=1024; done
for i in 1 2 3 4; do losetup /dev/loop$i /tmp/disk$i.img; done
# if a loop device is already busy, let losetup pick a free one: losetup -f --show /tmp/disk1.img
# Verify
losetup -l
Traditional stack:
apt install mdadm lvm2 -y
# RAID 10
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
/dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4
# LVM on top
pvcreate /dev/md0
vgcreate vgdata /dev/md0
lvcreate -l 100%FREE -n lvdata vgdata
# Filesystem
mkfs.ext4 /dev/vgdata/lvdata
mkdir -p /mnt/traditional
mount /dev/vgdata/lvdata /mnt/traditional
# See the layers
lsblk
ZFS stack:
apt install zfsutils-linux -y
# RAIDZ2 pool
zpool create mypool raidz2 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4
# Dataset with compression
zfs create mypool/data
zfs set compression=lz4 mypool/data
# Check status
zpool status mypool
lsblk # notice: no mdadm, no LVM between disks and pool
Play with snapshots:
echo "original content" > /mypool/data/test.txt
zfs snapshot mypool/data@v1
zfs list -t snapshot # notice ~0 size
echo "changed content" > /mypool/data/test.txt
dd if=/dev/urandom of=/mypool/data/bigfile bs=1M count=50
zfs list -t snapshot # notice snapshot grew — it's holding the old blocks
# Rollback instantly
zfs rollback mypool/data@v1
cat /mypool/data/test.txt # "original content" — bigfile is also gone
Once you see that rollback happen in under a second after writing 50MB, the mental model locks in permanently.
Honest Verdict
| | mdadm + LVM + ext4 | ZFS |
|---|---|---|
| Data integrity | None — silent corruption is real | Every block checksummed, auto-heal |
| Snapshots | LVM snapshots, painful | Instant, space-efficient, powerful |
| Backup transfer | rsync, file by file | zfs send, block level, incremental |
| Backup verification | You manage it | Built into receive protocol |
| RAID write hole | Present in RAID 5/6 | Eliminated by COW |
| Layers to manage | 3 separate tools | 1 integrated system |
| Kernel integration | Native mainline | DKMS (Ubuntu ships it officially) |
| RAM usage | Low | Loves RAM for ARC read cache |
| Resize flexibility | Painful | Add vdevs; within vdev still limited |
| Recovery tooling | Standard Linux tools | ZFS-specific, but well documented |
The traditional stack is not wrong. It is well understood, kernel-native, and every Linux admin already knows the tools. For simple setups where data integrity isn't critical, it does the job.
ZFS is the answer when you need to know — not guess, not hope — that your data is exactly what you wrote. When you're storing databases, backups, medical records, financial data, or anything where silent corruption is a catastrophic failure mode, ZFS removes an entire category of risk from the table permanently.
The debate your colleagues were having isn't really RAID 10 vs RAIDZ2. It's "do I want to know when my data is wrong?" ZFS answers yes by default. The traditional stack makes that your problem.
Where to Start
Install ZFS on a spare VM. Create a pool with loop devices. Write data, take snapshots, modify data, roll back. Mount a snapshot and see that all the data is right there instantly. Then run zpool scrub and watch it report zero errors on every block.
After that, the choice between the traditional stack and ZFS will be obvious for your use case.
Official docs: https://openzfs.github.io/openzfs-docs/
Sanoid/Syncoid (snapshot management): https://github.com/jimsalterjrs/sanoid
Compiled by AI. Proofread by caffeine. ☕