Database Storage in Kubernetes: What Your Colleague Got Wrong and Everything Else You Need to Know
Someone on your team will eventually say "just give all the database replicas the same NFS share — the filesystem handles concurrent writes." It sounds reasonable. It will corrupt your data. This post starts there and covers everything else: storage architectures, operators, HA vs load balancing, inside vs outside cluster storage, performance tradeoffs, PVC naming, and backup strategy.
The Wrong Assumption — Why Shared NFS Corrupts Databases
The claim: NFS handles concurrent writes because the filesystem manages consistency.
This is confusing two completely different things. A filesystem managing its own internal metadata consistency is not the same as providing safe concurrent access for a database engine.
A database is not a simple file writer. PostgreSQL, MySQL, MongoDB — they all do something like this on every write:
1. Read block 42 into memory
2. Modify it in the buffer pool
3. Write it back to disk
4. Update the transaction log
5. Fsync to confirm durability
6. Update internal page cache and lock state
Every one of those steps assumes exclusive ownership of the data files. If two database processes execute step 1 simultaneously, they both read the same block, both modify it independently in their own memory, and whoever writes last silently destroys the other's change. The filesystem wrote both operations successfully — it has no idea what a database transaction is. It just stores bytes at addresses.
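The lost update in steps 1-3 is easy to reproduce in a few lines. A minimal sketch in plain Python (no real database; a dict stands in for the shared files on the NFS export):

```python
# Two "database processes" sharing one block store with no coordination.
# Each does read-modify-write; whoever writes back last silently destroys
# the other's change.
storage = {42: {"balance": 100}}   # block 42 on the shared export

def deposit(amount):
    block = dict(storage[42])      # step 1: read block 42 into local memory
    block["balance"] += amount     # step 2: modify it in a private buffer
    return block

# Both replicas read the original block before either writes back.
copy_a = deposit(50)               # replica A wants balance 150
copy_b = deposit(30)               # replica B wants balance 130

storage[42] = copy_a               # step 3: A writes back
storage[42] = copy_b               # B writes back last and wins

print(storage[42]["balance"])      # 130, not 180 -- A's deposit is gone
```

Both writes "succeeded" from the storage's point of view, which is exactly the failure described above.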
This failure mode is a form of split-brain: multiple writers each believing they exclusively own the data. Three database replicas all writing to the same files with no coordination equals guaranteed silent data corruption. NFS does not prevent this. No general-purpose filesystem prevents this on its own. The filesystem's job is to manage its own structure — not to understand your database's transaction semantics.
The Architecture Your Colleague Was Missing — Shared-Nothing
Most databases are designed as shared-nothing. Each replica owns its storage exclusively. Replicas never share files. They coordinate at the application layer — exchanging WAL records, binary logs, or oplogs through the database's own replication protocol.
WRONG — what your colleague implied:
```
Replica 1 ─┐
Replica 2 ─┼──→ same NFS share ← all writing simultaneously → corruption
Replica 3 ─┘
```
RIGHT — shared-nothing:
```
Replica 1 (primary) → own storage
      │
      │  database replication protocol (WAL streaming)
      ▼
Replica 2 (standby) → own storage
Replica 3 (standby) → own storage
```
PostgreSQL streams WAL records from primary to standbys. MySQL uses binary log replication. MongoDB uses an oplog. Different names, same principle — the database engine handles replication through its own protocol, not through shared files.
The storage itself never knows any of this is happening. Three independent volumes, three independent database processes, kept consistent by the database engine. Not by the filesystem.
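Concretely, PostgreSQL streaming replication is wired up with a handful of settings. A hedged sketch (hostnames and the replication user are illustrative assumptions, not from this post):

```ini
# On the primary (postgresql.conf)
wal_level = replica          ; emit enough WAL for standbys to replay
max_wal_senders = 5          ; concurrent WAL-streaming connections
wal_keep_size = 1GB          ; retain WAL for standbys that fall behind

# On each standby (postgresql.auto.conf, plus an empty standby.signal file)
primary_conninfo = 'host=postgres-0 user=replicator password=...'
hot_standby = on             ; serve read-only queries while replaying WAL
```

Operators generate this configuration for you; it is shown here only to make the mechanism concrete.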
The Only Legitimate Shared Storage Pattern — Shared-Disk Architecture
There is a valid pattern where multiple database nodes access the same physical storage simultaneously. It is called shared-disk architecture and it requires purpose-built databases.
Oracle RAC is the canonical example. Every node has a Distributed Lock Manager (DLM) — software that coordinates block-level access across all nodes in real time. Before any node writes block 42, it asks the DLM for an exclusive lock. If another node holds it, the requester waits. When released, the DLM notifies the waiter. This coordination happens thousands of times per second.
Oracle RAC:
Node A wants block 42 → asks DLM → "Node B has it, wait"
Node B finishes → DLM notifies Node A → Node A proceeds
All nodes share the same SAN block storage
Oracle spent decades building this correctly. It is the most complex part of RAC and a significant portion of the licensing cost. No general-purpose database does this. When your colleague says "the filesystem handles it" — they are describing a system that does not exist. Oracle RAC exists specifically because the filesystem does not handle it.
Database Storage in Kubernetes — The Right Way
StatefulSet with VolumeClaimTemplates
The key is volumeClaimTemplates — not a shared volume, but a template that creates one independent PVC per replica automatically:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres        # headless Service that gives pods stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
  volumeClaimTemplates:        # ← creates one PVC per pod, not shared
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]   # ← only one node can mount this volume
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
```
ReadWriteOnce means Kubernetes enforces exclusive access at the storage layer — only one node can mount this volume. Each replica gets:
```
postgres-0 → PVC: data-postgres-0 → its own 100Gi volume → exclusive access
postgres-1 → PVC: data-postgres-1 → its own 100Gi volume → exclusive access
postgres-2 → PVC: data-postgres-2 → its own 100Gi volume → exclusive access
```
The VolumeClaimTemplate does not handle sync or replication. It does exactly one thing: guarantees exclusive storage per replica. The database engine handles everything else.
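One piece the template above depends on: a StatefulSet requires a headless Service (referenced by its `serviceName` field) so each pod gets a stable DNS name like `postgres-0.postgres`. A minimal sketch, with names assumed to match the StatefulSet:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres         # must match the StatefulSet's serviceName
spec:
  clusterIP: None        # headless: per-pod DNS records, no load-balanced VIP
  selector:
    app: postgres
  ports:
  - port: 5432
```

Stable per-pod DNS is what lets standbys find the primary by name across restarts.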
How Replication Actually Works
When the StatefulSet starts:
```
postgres-0 starts first
  → initializes fresh database on its own PVC
  → becomes PRIMARY — accepts reads and writes
  → starts writing WAL to its own storage

postgres-1 starts
  → empty PVC
  → connects to postgres-0: "give me a base backup"
  → postgres-0 streams a full copy of its data files
  → postgres-1 replays WAL to catch up
  → becomes STANDBY — continuously receiving WAL stream

postgres-2 → same process → second STANDBY
```
After startup, continuously:
```
App writes to postgres-0
  → WAL record written to postgres-0's own storage
  → WAL record streamed in real time to postgres-1 and postgres-2
  → standbys replay and apply to their own storage
  → all three have independent, consistent copies
```
Three volumes. Three independent databases. One replication protocol. No shared files anywhere.
Operators — Why You Should Never Run Databases in Kubernetes Without One
An operator is a piece of software running inside your cluster that understands a specific database deeply. It watches for custom resource definitions you create and does whatever is needed to make a real, working, correctly configured database cluster exist.
```
Normal Kubernetes:
  You define a Deployment/StatefulSet
  → built-in controller runs your containers

Operator pattern:
  You define a PostgresCluster (your intent)
  → operator's controller makes a real, working Postgres cluster exist
    and keeps it that way continuously
```
With CloudNativePG, this is all you write:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  instances: 3
  storage:
    size: 100Gi
    storageClass: fast-ssd
```
The operator automatically:
1. Creates StatefulSet with 3 pods
2. Creates one PVC per pod (ReadWriteOnce)
3. Initializes postgres-0 as primary
4. Configures postgresql.conf and pg_hba.conf correctly
5. Sets up replication user and streaming replication slots
6. Bootstraps standbys from primary base backup
7. Creates a Service pointing to the primary (for writes)
8. Creates a Service pointing to standbys (for reads)
9. Continuously monitors health
10. Primary dies → promotes a standby → updates Service endpoints
11. Old primary recovers → rejoins as standby → resyncs automatically
12. Handles backups, connection pooling, certificate rotation
All from a few lines of YAML. The operator knows the PostgreSQL protocol. It knows the right order of operations. It knows how to do safe failover without split-brain. Running a database StatefulSet manually without an operator means implementing all of that yourself — and getting the edge cases wrong.
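Backups are declarative too. A hedged sketch of CloudNativePG's backup stanza (the bucket path, endpoint, and Secret names are assumptions; check the CloudNativePG documentation for your version):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  instances: 3
  storage:
    size: 100Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://postgres-backups/my-postgres   # assumed bucket
      endpointURL: http://minio.storage:9000               # assumed endpoint
      s3Credentials:
        accessKeyId:
          name: backup-creds        # assumed Secret name
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
```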
Common operators by database:
| Database | Operator |
|---|---|
| PostgreSQL | CloudNativePG, Zalando Postgres Operator |
| MySQL/MariaDB | Percona Operator, Oracle MySQL Operator |
| MongoDB | MongoDB Community Operator |
| Redis | Redis Operator, Spotahome Operator |
| Cassandra | K8ssandra Operator |
HA vs Load Balancing — These Are Different Problems
Most people conflate these. They solve different things.
High Availability
The cluster survives failures. Primary dies, a standby is promoted, cluster continues. Writes always go to one node — the current primary. Standbys are warm spares, not active participants in write traffic.
```
Normal operation:
  All writes → postgres-0 (primary)
  postgres-1, postgres-2 → receiving WAL, ready to take over

Primary fails:
  Operator promotes postgres-1 → new primary
  postgres-0 comes back → rejoins as standby
  Zero manual intervention
```
This is what the StatefulSet + operator gives you by default. Three replicas for HA means you can survive one failure with one remaining standby.
Read Scale-Out
Multiple replicas serving read queries simultaneously. Writes still go to primary only. Reads are distributed across all replicas.
```
Writes → primary only
Reads  → distributed across primary + all standbys
At 3 replicas: read throughput ≈ 3× a single instance
```
Operators expose a separate Service for read traffic that load-balances across standbys. Your application uses two connection strings — one for writes, one for reads.
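The two-connection-string pattern can be sketched in a few lines. The DSNs below are hypothetical, modeled on the `-rw`/`-ro` Services that operators such as CloudNativePG generate:

```python
# Route statements to the write or read endpoint based on what they do.
WRITE_DSN = "postgresql://app@my-postgres-rw:5432/mydb"  # primary only
READ_DSN  = "postgresql://app@my-postgres-ro:5432/mydb"  # standbys

def pick_dsn(sql: str) -> str:
    """Crude read/write split: SELECTs go to standbys, everything else
    (INSERT/UPDATE/DELETE/DDL) goes to the primary."""
    first_word = sql.lstrip().split(None, 1)[0].upper()
    return READ_DSN if first_word == "SELECT" else WRITE_DSN

print(pick_dsn("SELECT * FROM orders"))    # read endpoint
print(pick_dsn("INSERT INTO orders ..."))  # write endpoint
```

Real applications usually delegate this to a pooler like PgBouncer or ProxySQL rather than parsing SQL themselves; the sketch only shows the routing decision.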
Write Load Balancing — The Hard Problem
Multiple nodes accepting writes simultaneously. This is what everyone wants and what is genuinely difficult. The options:
Shared-disk (Oracle RAC) — as discussed. Purpose-built DLM. Expensive. Complex. All nodes write to same storage.
Multi-master replication — multiple nodes accept writes, coordinate via consensus protocol. Each node has its own storage. Galera Cluster and MySQL Group Replication work this way. Any node accepts writes. Synchronous coordination before confirming to client.
```
Galera Cluster:
  Client writes to any node
  → node broadcasts write set to all peers via group communication
  → all nodes certify: no conflicting write from another node?
  → all nodes apply the write
  → client gets confirmation
```
Any node can receive writes
Any node failure is transparent to the application
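The certification step can be sketched as a toy model (this is an illustration of the idea, not Galera's actual implementation): each write set carries the keys it touched, and a node rejects any write set that conflicts with one certified after the writer's snapshot.

```python
# Toy certification-based replication: a write set is certified only if no
# write set committed after our snapshot touched the same keys.
committed = []   # list of (seqno, keys) certified cluster-wide, in order

def certify(writeset_keys, last_seen_seqno):
    """Reject if any write set committed after our snapshot overlaps ours."""
    for seqno, keys in committed:
        if seqno > last_seen_seqno and keys & writeset_keys:
            return False                       # conflict: roll back locally
    committed.append((len(committed) + 1, set(writeset_keys)))
    return True                                # certified: apply on every node

print(certify({"row:1"}, last_seen_seqno=0))   # True  -- first writer wins
print(certify({"row:1"}, last_seen_seqno=0))   # False -- concurrent conflict
print(certify({"row:2"}, last_seen_seqno=0))   # True  -- disjoint keys are fine
```

The key property: every node runs the same deterministic check on the same ordered stream of write sets, so all nodes reach the same accept/reject decision without shared storage.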
Distributed SQL databases — CockroachDB, YugabyteDB, TiDB. Designed from scratch for horizontal write scaling. Any node accepts reads and writes. Data automatically sharded across nodes. Raft consensus keeps replicas consistent. Far more complex operationally but genuinely scales writes horizontally.
Decide based on your actual need:
| Need | Solution |
|---|---|
| Survive primary failure | HA with operator (StatefulSet + streaming replication) |
| Survive failure + scale reads | HA + read replicas + PgBouncer or ProxySQL |
| Multiple nodes accepting writes | Galera Cluster or MySQL Group Replication |
| Genuinely distributed write scaling | CockroachDB, YugabyteDB, TiDB |
For most teams: HA with read replicas covers 90% of real-world requirements. True write scaling is operationally expensive and usually the last resort after exhausting caching, connection pooling, and read/write splitting.
Storage Backend — Inside vs Outside the Cluster
Every PVC needs a backend. Kubernetes has no storage of its own — a CSI driver translates PVC requests into real storage operations on whatever backend you provide.
```
PVC request → StorageClass → CSI driver → storage backend → physical disk
```
Outside the Cluster — Dedicated Storage Server
A separate machine exists purely to serve storage. Kubernetes nodes are compute only.
```
┌──────────────────────────────┐      ┌──────────────────────┐
│  Kubernetes Cluster          │      │  Storage Server      │
│                              │      │                      │
│  Node 1 (compute) ───────────┼──────┼──→ ZFS + NFS/iSCSI   │
│  Node 2 (compute) ───────────┼──────┼──→ NVMe-oF           │
│  Node 3 (compute) ───────────┼──────┼──→ Ceph (external)   │
└──────────────────────────────┘      └──────────────────────┘
```
Advantages: storage lifecycle independent of cluster — rebuild or upgrade the cluster, data is untouched. Easier to size separately. ZFS gives you snapshots, compression, checksumming at the storage layer. Simpler mental model — storage is one place.
Disadvantages: network is in the critical I/O path. Storage server is a potential single point of failure unless clustered. NFS specifically has performance limitations under high concurrency.
Inside the Cluster — Hyperconverged
Worker nodes contribute their local disks to a distributed storage pool. Storage software runs as pods on the same nodes as your application.
```
┌─────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                 │
│                                                     │
│  Node 1: app pods + storage daemon (NVMe SSD)       │
│  Node 2: app pods + storage daemon (NVMe SSD)       │
│  Node 3: app pods + storage daemon (NVMe SSD)       │
│                                                     │
│  Distributed storage pool across all nodes          │
└─────────────────────────────────────────────────────┘
```
Advantages: data locality — Kubernetes can schedule a pod on the node holding its data, eliminating the network hop. No separate infrastructure. Scales with the cluster.
Disadvantages: storage and compute compete for the same resources. Cluster upgrades touch storage nodes. If you need to rebuild the cluster, storage is entangled. Ceph is operationally complex.
ZFS + NFS — Full Configuration
The most common small-to-medium on-prem setup. A dedicated ZFS storage server exporting NFS shares, with a Kubernetes CSI driver creating one directory per PVC automatically.
On the Storage Server
```bash
# Install ZFS and the NFS server
apt install zfsutils-linux nfs-kernel-server -y

# Create pool — mirror vdevs for database workloads
zpool create k8s-storage \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1

# Enable compression
zfs set compression=lz4 k8s-storage

# Create parent dataset for Kubernetes
zfs create k8s-storage/k8s

# Export via NFS
cat >> /etc/exports << 'EOF'
/k8s-storage/k8s *(rw,sync,no_subtree_check,no_root_squash)
EOF
exportfs -rav
showmount -e localhost
```
- `sync` — writes confirmed only after hitting disk. Safer for databases.
- `no_root_squash` — the CSI driver needs to create directories as root.
- `no_subtree_check` — improves reliability when exporting subdirectories.
For Database Replicas — One Dataset Per Pod
For databases you want exclusive storage per replica, not a shared directory. Create one ZFS dataset per replica with its own dedicated export:
```bash
# One dataset per replica, tuned for the database
for i in 0 1 2; do
  zfs create k8s-storage/k8s/postgres-$i
  zfs set recordsize=8K   k8s-storage/k8s/postgres-$i   # match PostgreSQL's 8KB page size
  zfs set compression=lz4 k8s-storage/k8s/postgres-$i
  # Export exclusively to the node running that replica.
  # Node IPs are illustrative: substitute the IP of the node running postgres-$i.
  zfs set sharenfs="rw=@192.168.1.1$i,sync,no_root_squash" \
    k8s-storage/k8s/postgres-$i
done
```
Three NFS exports, not one. Each pod gets its own dedicated export. Each export only accepts connections from the specific Kubernetes node running that replica. Storage isolation is maintained. The database still has exclusive file access.
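With per-replica exports you typically bypass the dynamic provisioner and pre-create one static PV per export, pre-bound to the StatefulSet's predictable PVC name. A hedged sketch for postgres-0 (server IP follows the post's example; the PV name and namespace are assumptions):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-0-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes: [ "ReadWriteOnce" ]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.100
    path: /k8s-storage/k8s/postgres-0
  claimRef:                    # pre-bind to the StatefulSet's PVC
    namespace: default
    name: data-postgres-0
```

Repeat for postgres-1 and postgres-2; the `claimRef` guarantees each export binds only to its intended replica.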
On Each Kubernetes Node
```bash
apt install nfs-common -y

# Test before configuring the CSI driver
mkdir -p /mnt/test
mount -t nfs 192.168.1.100:/k8s-storage/k8s /mnt/test
umount /mnt/test
```
Install NFS CSI Driver in Kubernetes
```bash
helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
helm install nfs-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=192.168.1.100 \
  --set nfs.path=/k8s-storage/k8s \
  --set storageClass.name=nfs-storage \
  --namespace storage \
  --create-namespace

# Verify
kubectl get storageclass
```
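Once the class exists, any PVC that references it gets a directory carved out on the export automatically. A minimal sketch (for shared, non-database data; databases get per-replica exclusive storage as described above):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config
spec:
  accessModes: [ "ReadWriteMany" ]   # fine for config/static files, not DB data
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 5Gi
```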
Other Storage Backends — What They Are and When to Use Them
iSCSI — Block Storage Over Network
Instead of a filesystem (NFS), you export raw block devices over the network. The Kubernetes node receives what looks like a local disk, formats it with ext4 or XFS, and mounts it.
```
ZFS storage server → iSCSI target (zvol per PVC)
        ↓
Kubernetes node iSCSI initiator
        ↓
Appears as /dev/sdb — raw block device
        ↓
Kubernetes formats with ext4, mounts into pod
```
ZFS zvols are ideal iSCSI targets — thin provisioned, snapshotted instantly, checksummed. The democratic-csi driver automates the entire lifecycle: PVC created → zvol created → iSCSI target configured → node connects → PV bound.
Significantly faster than NFS because it is a block protocol — no filesystem overhead at the network layer, the filesystem runs on the Kubernetes node, full queue depth available.
NVMe-oF — Near-Local Performance Over Network
NVMe protocol running over a network (RDMA or TCP) instead of a PCIe bus. The Kubernetes node sees what looks like a local NVMe drive. Network latency is in microseconds rather than milliseconds.
```
Storage server NVMe SSDs → NVMe-oF target
        ↓  (RDMA or TCP network)
Kubernetes node NVMe-oF initiator
        ↓
Appears as /dev/nvme1n1 — full NVMe queue depth, microsecond latency
```
RDMA-capable NICs give you the best performance but cost more. NVMe-oF over TCP is lower cost and still dramatically better than iSCSI. Mayastor (OpenEBS) uses NVMe-oF internally for its high-performance storage engine.
Ceph / Rook — The Complete On-Prem Solution
Ceph is a distributed storage system. Rook is the Kubernetes operator that runs Ceph inside your cluster. Together they provide block storage (RBD), shared filesystem (CephFS), and S3-compatible object storage — all from one system.
```bash
# Install the Rook operator
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/crds.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/common.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/operator.yaml

# Create a Ceph cluster using node-local SSDs
cat << 'EOF' | kubectl apply -f -
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: true
    useAllDevices: false
    devices:
    - name: nvme0n1
    - name: nvme1n1
EOF
```
Ceph then provides via CSI:
- RBD → ReadWriteOnce block storage for databases
- CephFS → ReadWriteMany for shared workloads
- RGW → S3-compatible object storage for backups
Best full-featured on-prem solution. Operationally complex — requires expertise and dedicated resources. Minimum 3 nodes, prefers dedicated storage nodes at scale.
Longhorn — Simpler Distributed Block Storage
Purpose-built for Kubernetes. Each node's local disk contributes to a distributed pool. Easier to operate than Ceph, lighter resource footprint.
```
Node 1 (local NVMe) ─┐
Node 2 (local NVMe) ─┼──→ Longhorn pool → CSI → PVCs
Node 3 (local NVMe) ─┘

Each volume automatically replicated across multiple nodes
Node failure → volume still available from other replicas
```
Good for smaller teams where Ceph's operational complexity is too much. Lower peak performance than Ceph under heavy concurrent load. Install with a single Helm command.
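A hedged sketch of a Longhorn StorageClass (parameter names follow Longhorn's documented defaults; verify against your version):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"        # copies kept on distinct nodes
  staleReplicaTimeout: "2880"  # minutes before a dead replica is rebuilt
reclaimPolicy: Retain          # keep data if the PVC is deleted
allowVolumeExpansion: true
```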
Performance Comparison — Honest Numbers
Network speed is the ceiling for everything except local storage. Invest in network before optimizing protocol.
| Solution | Latency | Throughput | IOPS | Notes |
|---|---|---|---|---|
| Local NVMe | 20–100 µs | 5–7 GB/s | 500k–1M | Baseline — no network |
| NVMe-oF over RDMA | 50–150 µs | 4–6 GB/s | 400k–800k | Near-local, expensive NICs |
| NVMe-oF over TCP | 100–300 µs | 3–5 GB/s | 200k–500k | Good balance of cost/perf |
| iSCSI (10GbE) | 200–500 µs | 1–2 GB/s | 100k–300k | Solid, widely supported |
| Ceph RBD (10GbE) | 500 µs–2 ms | 1–3 GB/s | 50k–200k | Higher latency due to 3× replication |
| Longhorn (10GbE) | 1–5 ms | 0.5–1.5 GB/s | 20k–100k | Easy to operate |
| NFS (10GbE) | 500 µs–3 ms | 0.5–1.5 GB/s | 10k–50k | Worst for concurrent random I/O |
Key observations:
NFS is the slowest for concurrent random I/O — filesystem protocol overhead, metadata round trips, no deep queue pipelining. Degrades fastest under concurrent load. Fine for shared config and static files. Not appropriate for databases under real load.
iSCSI is significantly faster than NFS — block protocol, no filesystem overhead at network layer, full queue depth available. Same physical network, much better database performance.
Ceph RBD latency is higher than iSCSI despite similar raw hardware — each write replicates to 3 OSDs before confirming. The replication overhead is the price for Ceph's built-in redundancy.
NVMe-oF approaches local NVMe — that's the whole point of the protocol. 25GbE network with RDMA-capable NICs and your pods see near-local storage performance.
A 10GbE → 25GbE upgrade does more for storage performance than switching from NFS to iSCSI. Always fix the network bottleneck first.
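The "network is the ceiling" claim is just arithmetic. A quick sanity check (line rate only, ignoring protocol overhead, which only lowers these numbers):

```python
# Theoretical line-rate ceiling: link speed in bits/s divided by 8.
def ceiling_gb_per_s(gbit: float) -> float:
    """Max GB/s a GbE link of the given speed can carry."""
    return gbit * 1e9 / 8 / 1e9

for link in (10, 25, 100):
    print(f"{link} GbE -> {ceiling_gb_per_s(link):.2f} GB/s max")
# A local NVMe drive at 5-7 GB/s saturates a 10 GbE link (1.25 GB/s)
# four times over; 25 GbE (~3.1 GB/s) is the first realistic match.
```

This is why the table's network-attached rows can never beat the link they run over, regardless of protocol.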
PVC Naming — It's Not As Random As It Looks
This confuses everyone at first because there are two different objects: the PVC and the PV.
PVC (PersistentVolumeClaim) — your named handle. For StatefulSets this is always predictable:
```
Pattern: <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>

postgres StatefulSet with "data" template:
  data-postgres-0  ← pod postgres-0's storage
  data-postgres-1  ← pod postgres-1's storage
  data-postgres-2  ← pod postgres-2's storage
```
Always human-readable. Always traceable to the pod.
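The pattern is mechanical enough to compute. A small sketch:

```python
def statefulset_pvc_names(template: str, statefulset: str, replicas: int):
    """PVC names a StatefulSet creates: <template>-<statefulset>-<ordinal>."""
    return [f"{template}-{statefulset}-{i}" for i in range(replicas)]

print(statefulset_pvc_names("data", "postgres", 3))
# ['data-postgres-0', 'data-postgres-1', 'data-postgres-2']
```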
PV (PersistentVolume) — the actual storage allocation on the backend. This gets a UUID name because Kubernetes generates it dynamically. This is the "random string" you were seeing.
```bash
# See PVCs — these names are yours, meaningful
kubectl get pvc -A

# Output:
#   NAMESPACE   NAME              STATUS   VOLUME                                     CAPACITY
#   default     data-postgres-0   Bound    pvc-3f8a2b1c-4d5e-6f7a-8b9c-0d1e2f3a4b5c   100Gi
#
# NAME   = your PVC — meaningful
# VOLUME = the PV — UUID, backend implementation detail
```
Tracing the Full Chain
```bash
# Which pod uses which PVC
kubectl get pods -n your-namespace -o json | \
  jq '.items[] | {
    pod: .metadata.name,
    pvcs: [.spec.volumes[]? | select(.persistentVolumeClaim) | .persistentVolumeClaim.claimName]
  }'

# Describe a PVC — shows everything including which pod uses it
kubectl describe pvc data-postgres-0 -n your-namespace
# Shows: Used By: postgres-0

# Describe the PV — shows where on the backend the data physically lives
kubectl describe pv pvc-3f8a2b1c-4d5e-6f7a-8b9c-0d1e2f3a4b5c
# For an NFS backend shows:
#   NFS:
#     Server: 192.168.1.100
#     Path:   /k8s-storage/k8s/default-data-postgres-0-pvc-3f8a2b1c...
# For Ceph RBD shows:
#   RBD:
#     Pool:  kubernetes
#     Image: csi-vol-3f8a2b1c...
```
The chain is always fully traceable:
```
Pod name (postgres-0)
  → PVC name (data-postgres-0)     kubectl get pvc
  → PV name (pvc-3f8a2b1c...)      kubectl describe pvc
  → Backend path or device         kubectl describe pv
  → Physical storage on disk
```
On the NFS storage server, the CSI driver creates directories named <namespace>-<pvc-name>-<pv-uuid> — both the meaningful PVC name and the UUID together:
```
/k8s-storage/k8s/
  default-data-postgres-0-pvc-3f8a2b1c.../   ← postgres-0's data
  default-data-postgres-1-pvc-7a8b9c0d.../   ← postgres-1's data
  default-data-postgres-2-pvc-9e0f1a2b.../   ← postgres-2's data
```
You can look at the directory name and immediately know which pod and namespace it belongs to.
Pod Deleted — Data Is Safe
PVCs survive pod deletion by design. This is not an accident:
```bash
# Delete the postgres-0 pod — deliberately or accidentally
kubectl delete pod postgres-0 -n your-namespace

# PVC still exists
kubectl get pvc -n your-namespace
#   data-postgres-0   Bound   pvc-3f8a2b1c...   100Gi   ← still here

# The StatefulSet controller recreates the pod.
# The new postgres-0 pod looks for the PVC named "data-postgres-0", finds it,
# and mounts it — all data intact, nothing lost.
```
StatefulSets are specifically designed for this. The predictable naming means the pod always finds its PVC on restart. You don't need to know the UUID — the pod finds its storage by name.
Check your reclaim policy — it determines what happens to the PV if the PVC is explicitly deleted:
```bash
# Check the storage class reclaim policy
kubectl get storageclass
# RECLAIMPOLICY column — should be Retain for important data
#   Retain: even if the PVC is deleted, the PV and data survive
#   Delete: PVC deleted → PV deleted → data gone

# Check a specific PV
kubectl get pv pvc-3f8a2b1c... -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'

# Change an existing PV to Retain
kubectl patch pv pvc-3f8a2b1c... \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```
Set Retain for any production database storage class.
Backup Strategy — Three Layers, Each Solving a Different Problem
Layer 1 — Application Dump (Most Important, Most Portable)
A logical export of the actual data. Independent of storage, Kubernetes, or infrastructure. Restore anywhere.
```bash
# PostgreSQL — one database
kubectl exec -n your-namespace postgres-0 -- \
  pg_dump -U postgres -d mydb > mydb-$(date +%Y%m%d).sql

# PostgreSQL — all databases
kubectl exec -n your-namespace postgres-0 -- \
  pg_dumpall -U postgres > all-dbs-$(date +%Y%m%d).sql

# MySQL — quote the command so $MYSQL_ROOT_PASSWORD expands inside
# the pod, not on your workstation
kubectl exec -n your-namespace mysql-0 -- \
  sh -c 'mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases' \
  > all-dbs-$(date +%Y%m%d).sql

# MongoDB
kubectl exec -n your-namespace mongo-0 -- \
  mongodump --archive > mongodump-$(date +%Y%m%d).archive
```
Restore to a fresh pod after disaster:
```bash
# New pod, fresh empty PVC — restore from the dump
kubectl exec -i -n your-namespace postgres-0 -- \
  psql -U postgres < mydb-20240301.sql
```
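To make Layer 1 automatic, dumps are usually run from a CronJob inside the cluster. A hedged sketch (the image tag, Secret name, host, and destination PVC are assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-dump
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: dump
            image: postgres:16     # client tools should match the server major version
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h postgres-0.postgres -U postgres -d mydb
              > /backup/mydb-$(date +%Y%m%d).sql
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: { name: postgres-creds, key: password }
            volumeMounts:
            - { name: backup, mountPath: /backup }
          volumes:
          - name: backup
            persistentVolumeClaim: { claimName: backup-storage }
```

Ship the resulting files off the storage server as well; a dump that lives only next to the database is not a disaster-recovery plan.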
Layer 2 — Volume Snapshot (Fast, Same Storage System)
If your CSI driver supports it (Ceph, most cloud providers, ZFS-backed CSI):
```yaml
# Take a snapshot of a specific PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-0-snap-20240301
  namespace: your-namespace
spec:
  volumeSnapshotClassName: csi-ceph-blockpool
  source:
    persistentVolumeClaimName: data-postgres-0
```
Restore — create a new PVC from the snapshot:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0-restored
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-0-snap-20240301
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```
Fast (ZFS COW — instant snapshot, instant clone). But the backup lives on the same storage system. Not protection against storage server failure.
Layer 3 — Velero (Full Cluster Backup)
Velero backs up both Kubernetes objects (StatefulSets, Services, ConfigMaps, Secrets, PVC definitions) and the actual volume data. This is what you need to fully restore a workload after a cluster disaster.
```bash
# Install Velero with an S3-compatible backend (MinIO on-prem or AWS S3)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-storage-location-config \
    region=minio,s3ForcePathStyle=true,s3Url=http://minio.storage:9000

# Back up an entire namespace
velero backup create postgres-backup-20240301 \
  --include-namespaces your-namespace

# Check backup status
velero backup describe postgres-backup-20240301
```
Restore to a new or rebuilt cluster:
```bash
velero restore create --from-backup postgres-backup-20240301

# Velero recreates:
#   PVCs with the correct names
#   the StatefulSet and pods
#   Services, ConfigMaps, Secrets
# and restores data into the PVCs.
# Everything comes back with the same names and configuration.
```
Decision Guide — What to Choose for Your Situation
Storage backend:
| Situation | Recommendation |
|---|---|
| Small team, want simplicity | ZFS + iSCSI via democratic-csi |
| Medium team, want everything in Kubernetes | Rook/Ceph hyperconverged |
| Need best performance, budget available | NVMe-oF over TCP with Mayastor |
| Development/staging | Longhorn |
| Already have NAS/ZFS server | ZFS + NFS (one export per pod for databases) |
Database architecture:
| Need | Solution |
|---|---|
| Survive primary failure | HA with operator + ReadWriteOnce PVCs |
| Survive failure + scale reads | HA + read replica Service + PgBouncer |
| Multiple nodes accepting writes | Galera or MySQL Group Replication |
| Horizontal write scaling | CockroachDB, YugabyteDB |
Backup:
| Backup type | Tool | Protects against |
|---|---|---|
| Logical dump | pg_dump, mysqldump | Data corruption, migration, portability |
| Volume snapshot | VolumeSnapshot API | Fast recovery, same storage system |
| Full cluster backup | Velero | Cluster disaster, rebuilds |
The Common Thread
Your colleague's assumption collapsed two separate concerns into one — storage consistency and database consistency — and assumed the filesystem handles both. It handles neither for concurrent database writers.
The correct architecture separates them cleanly: the storage backend provides durable, exclusive block storage. The database engine provides replication and transaction consistency. The operator manages the database lifecycle. Velero manages backup and disaster recovery. Each layer does its own job. None of them substitute for the others.
Compiled by AI. Proofread by caffeine. ☕