Database Storage in Kubernetes: What Your Colleague Got Wrong and Everything Else You Need to Know
Someone on your team will eventually say "just give all the database replicas the same NFS share — the filesystem handles concurrent writes." It sounds reasonable. It will corrupt your data. This post starts there and covers everything else: storage architectures, operators, HA vs load balancing, inside vs outside cluster storage, performance tradeoffs, PVC naming, and backup strategy.
The Wrong Assumption — Why Shared NFS Corrupts Databases
The claim: NFS handles concurrent writes because the filesystem manages consistency.
This is confusing two completely different things. A filesystem managing its own internal metadata consistency is not the same as providing safe concurrent access for a database engine.
A database is not a simple file writer. PostgreSQL, MySQL, MongoDB — they all do something like this on every write:
1. Read block 42 into memory
2. Modify it in the buffer pool
3. Write it back to disk
4. Update the transaction log
5. Fsync to confirm durability
6. Update internal page cache and lock state
Every one of those steps assumes exclusive ownership of the data files. If two database processes execute step 1 simultaneously, they both read the same block, both modify it independently in their own memory, and whoever writes last silently destroys the other's change. The filesystem wrote both operations successfully — it has no idea what a database transaction is. It just stores bytes at addresses.
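The lost update in steps 1-3 is easy to reproduce in a few lines. A minimal sketch in plain Python (no real database; a dict stands in for the shared files on the NFS export):

```python
# Two "database processes" sharing one block store with no coordination.
# Each does read-modify-write; whoever writes back last silently destroys
# the other's change.
storage = {42: {"balance": 100}}   # block 42 on the shared export

def deposit(amount):
    block = dict(storage[42])      # step 1: read block 42 into local memory
    block["balance"] += amount     # step 2: modify it in a private buffer
    return block

# Both replicas read the original block before either writes back.
copy_a = deposit(50)               # replica A wants balance 150
copy_b = deposit(30)               # replica B wants balance 130

storage[42] = copy_a               # step 3: A writes back
storage[42] = copy_b               # B writes back last and wins

print(storage[42]["balance"])      # 130, not 180 -- A's deposit is gone
```

Both writes "succeeded" from the storage's point of view, which is exactly the failure described above.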
This failure mode is a form of split-brain: multiple writers each believing they exclusively own the data. Three database replicas all writing to the same files with no coordination equals guaranteed silent data corruption. NFS does not prevent this. No general-purpose filesystem prevents this on its own. The filesystem's job is to manage its own structure — not to understand your database's transaction semantics.
The Architecture Your Colleague Was Missing — Shared-Nothing
Most databases are designed as shared-nothing. Each replica owns its storage exclusively. Replicas never share files. They coordinate at the application layer — exchanging WAL records, binary logs, or oplogs through the database's own replication protocol.
WRONG — what your colleague implied:
```
Replica 1 ─┐
Replica 2 ─┼──→ same NFS share ← all writing simultaneously → corruption
Replica 3 ─┘
```
RIGHT — shared-nothing:
```
Replica 1 (primary) → own storage
      │
      │  database replication protocol (WAL streaming)
      ▼
Replica 2 (standby) → own storage
Replica 3 (standby) → own storage
```
PostgreSQL streams WAL records from primary to standbys. MySQL uses binary log replication. MongoDB uses an oplog. Different names, same principle — the database engine handles replication through its own protocol, not through shared files.
The storage itself never knows any of this is happening. Three independent volumes, three independent database processes, kept consistent by the database engine. Not by the filesystem.
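Concretely, PostgreSQL streaming replication is wired up with a handful of settings. A hedged sketch (hostnames and the replication user are illustrative assumptions, not from this post):

```ini
# On the primary (postgresql.conf)
wal_level = replica          ; emit enough WAL for standbys to replay
max_wal_senders = 5          ; concurrent WAL-streaming connections
wal_keep_size = 1GB          ; retain WAL for standbys that fall behind

# On each standby (postgresql.auto.conf, plus an empty standby.signal file)
primary_conninfo = 'host=postgres-0 user=replicator password=...'
hot_standby = on             ; serve read-only queries while replaying WAL
```

Operators generate this configuration for you; it is shown here only to make the mechanism concrete.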
The Only Legitimate Shared Storage Pattern — Shared-Disk Architecture
There is a valid pattern where multiple database nodes access the same physical storage simultaneously. It is called shared-disk architecture and it requires purpose-built databases.
Oracle RAC is the canonical example. Every node has a Distributed Lock Manager (DLM) — software that coordinates block-level access across all nodes in real time. Before any node writes block 42, it asks the DLM for an exclusive lock. If another node holds it, the requester waits. When released, the DLM notifies the waiter. This coordination happens thousands of times per second.
Oracle RAC:
Node A wants block 42 → asks DLM → "Node B has it, wait"
Node B finishes → DLM notifies Node A → Node A proceeds
All nodes share the same SAN block storage
Oracle spent decades building this correctly. It is the most complex part of RAC and a significant portion of the licensing cost. No general-purpose database does this. When your colleague says "the filesystem handles it" — they are describing a system that does not exist. Oracle RAC exists specifically because the filesystem does not handle it.
Database Storage in Kubernetes — The Right Way
StatefulSet with VolumeClaimTemplates
The key is volumeClaimTemplates — not a shared volume, but a template that creates one independent PVC per replica automatically:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres        # headless Service that gives pods stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
  volumeClaimTemplates:        # ← creates one PVC per pod, not shared
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]   # ← only one node can mount this volume
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
```
ReadWriteOnce means Kubernetes enforces exclusive access at the storage layer — only one node can mount this volume. Each replica gets:
```
postgres-0 → PVC: data-postgres-0 → its own 100Gi volume → exclusive access
postgres-1 → PVC: data-postgres-1 → its own 100Gi volume → exclusive access
postgres-2 → PVC: data-postgres-2 → its own 100Gi volume → exclusive access
```
The VolumeClaimTemplate does not handle sync or replication. It does exactly one thing: guarantees exclusive storage per replica. The database engine handles everything else.
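One piece the template above depends on: a StatefulSet requires a headless Service (referenced by its `serviceName` field) so each pod gets a stable DNS name like `postgres-0.postgres`. A minimal sketch, with names assumed to match the StatefulSet:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres         # must match the StatefulSet's serviceName
spec:
  clusterIP: None        # headless: per-pod DNS records, no load-balanced VIP
  selector:
    app: postgres
  ports:
  - port: 5432
```

Stable per-pod DNS is what lets standbys find the primary by name across restarts.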
How Replication Actually Works
When the StatefulSet starts:
```
postgres-0 starts first
  → initializes fresh database on its own PVC
  → becomes PRIMARY — accepts reads and writes
  → starts writing WAL to its own storage

postgres-1 starts
  → empty PVC
  → connects to postgres-0: "give me a base backup"
  → postgres-0 streams a full copy of its data files
  → postgres-1 replays WAL to catch up
  → becomes STANDBY — continuously receiving WAL stream

postgres-2 → same process → second STANDBY
```
After startup, continuously:
```
App writes to postgres-0
  → WAL record written to postgres-0's own storage
  → WAL record streamed in real time to postgres-1 and postgres-2
  → standbys replay and apply to their own storage
  → all three have independent, consistent copies
```
Three volumes. Three independent databases. One replication protocol. No shared files anywhere.
Operators — Why You Should Never Run Databases in Kubernetes Without One
An operator is a piece of software running inside your cluster that understands a specific database deeply. It watches for custom resource definitions you create and does whatever is needed to make a real, working, correctly configured database cluster exist.
```
Normal Kubernetes:
  You define a Deployment/StatefulSet
  → built-in controller runs your containers

Operator pattern:
  You define a PostgresCluster (your intent)
  → operator's controller makes a real, working Postgres cluster exist
    and keeps it that way continuously
```
With CloudNativePG, this is all you write:
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  instances: 3
  storage:
    size: 100Gi
    storageClass: fast-ssd
```
The operator automatically:
1. Creates StatefulSet with 3 pods
2. Creates one PVC per pod (ReadWriteOnce)
3. Initializes postgres-0 as primary
4. Configures postgresql.conf and pg_hba.conf correctly
5. Sets up replication user and streaming replication slots
6. Bootstraps standbys from primary base backup
7. Creates a Service pointing to the primary (for writes)
8. Creates a Service pointing to standbys (for reads)
9. Continuously monitors health
10. Primary dies → promotes a standby → updates Service endpoints
11. Old primary recovers → rejoins as standby → resyncs automatically
12. Handles backups, connection pooling, certificate rotation
All from a few lines of YAML. The operator knows the PostgreSQL protocol. It knows the right order of operations. It knows how to do safe failover without split-brain. Running a database StatefulSet manually without an operator means implementing all of that yourself — and getting the edge cases wrong.
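Backups are declarative too. A hedged sketch of CloudNativePG's backup stanza (the bucket path, endpoint, and Secret names are assumptions; check the CloudNativePG documentation for your version):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  instances: 3
  storage:
    size: 100Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://postgres-backups/my-postgres   # assumed bucket
      endpointURL: http://minio.storage:9000               # assumed endpoint
      s3Credentials:
        accessKeyId:
          name: backup-creds        # assumed Secret name
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
```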
Common operators by database:
| Database | Operator |
|---|---|
| PostgreSQL | CloudNativePG, Zalando Postgres Operator |
| MySQL/MariaDB | Percona Operator, Oracle MySQL Operator |
| MongoDB | MongoDB Community Operator |
| Redis | Redis Operator, Spotahome Operator |
| Cassandra | K8ssandra Operator |
HA vs Load Balancing — These Are Different Problems
Most people conflate these. They solve different things.
High Availability
The cluster survives failures. Primary dies, a standby is promoted, cluster continues. Writes always go to one node — the current primary. Standbys are warm spares, not active participants in write traffic.
```
Normal operation:
  All writes → postgres-0 (primary)
  postgres-1, postgres-2 → receiving WAL, ready to take over

Primary fails:
  Operator promotes postgres-1 → new primary
  postgres-0 comes back → rejoins as standby
  Zero manual intervention
```
This is what the StatefulSet + operator gives you by default. Three replicas for HA means you can survive one failure with one remaining standby.
Read Scale-Out
Multiple replicas serving read queries simultaneously. Writes still go to primary only. Reads are distributed across all replicas.
```
Writes → primary only
Reads  → distributed across primary + all standbys
At 3 replicas: read throughput ≈ 3× a single instance
```
Operators expose a separate Service for read traffic that load-balances across standbys. Your application uses two connection strings — one for writes, one for reads.
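The two-connection-string pattern can be sketched in a few lines. The DSNs below are hypothetical, modeled on the `-rw`/`-ro` Services that operators such as CloudNativePG generate:

```python
# Route statements to the write or read endpoint based on what they do.
WRITE_DSN = "postgresql://app@my-postgres-rw:5432/mydb"  # primary only
READ_DSN  = "postgresql://app@my-postgres-ro:5432/mydb"  # standbys

def pick_dsn(sql: str) -> str:
    """Crude read/write split: SELECTs go to standbys, everything else
    (INSERT/UPDATE/DELETE/DDL) goes to the primary."""
    first_word = sql.lstrip().split(None, 1)[0].upper()
    return READ_DSN if first_word == "SELECT" else WRITE_DSN

print(pick_dsn("SELECT * FROM orders"))    # read endpoint
print(pick_dsn("INSERT INTO orders ..."))  # write endpoint
```

Real applications usually delegate this to a pooler like PgBouncer or ProxySQL rather than parsing SQL themselves; the sketch only shows the routing decision.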
Write Load Balancing — The Hard Problem
Multiple nodes accepting writes simultaneously. This is what everyone wants and what is genuinely difficult. The options:
Shared-disk (Oracle RAC) — as discussed. Purpose-built DLM. Expensive. Complex. All nodes write to same storage.
Multi-master replication — multiple nodes accept writes, coordinate via consensus protocol. Each node has its own storage. Galera Cluster and MySQL Group Replication work this way. Any node accepts writes. Synchronous coordination before confirming to client.
```
Galera Cluster:
  Client writes to any node
  → node broadcasts write set to all peers via group communication
  → all nodes certify: no conflicting write from another node?
  → all nodes apply the write
  → client gets confirmation
```
Any node can receive writes
Any node failure is transparent to the application
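The certification step can be sketched as a toy model (this is an illustration of the idea, not Galera's actual implementation): each write set carries the keys it touched, and a node rejects any write set that conflicts with one certified after the writer's snapshot.

```python
# Toy certification-based replication: a write set is certified only if no
# write set committed after our snapshot touched the same keys.
committed = []   # list of (seqno, keys) certified cluster-wide, in order

def certify(writeset_keys, last_seen_seqno):
    """Reject if any write set committed after our snapshot overlaps ours."""
    for seqno, keys in committed:
        if seqno > last_seen_seqno and keys & writeset_keys:
            return False                       # conflict: roll back locally
    committed.append((len(committed) + 1, set(writeset_keys)))
    return True                                # certified: apply on every node

print(certify({"row:1"}, last_seen_seqno=0))   # True  -- first writer wins
print(certify({"row:1"}, last_seen_seqno=0))   # False -- concurrent conflict
print(certify({"row:2"}, last_seen_seqno=0))   # True  -- disjoint keys are fine
```

The key property: every node runs the same deterministic check on the same ordered stream of write sets, so all nodes reach the same accept/reject decision without shared storage.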
Distributed SQL databases — CockroachDB, YugabyteDB, TiDB. Designed from scratch for horizontal write scaling. Any node accepts reads and writes. Data automatically sharded across nodes. Raft consensus keeps replicas consistent. Far more complex operationally but genuinely scales writes horizontally.
Decide based on your actual need:
| Need | Solution |
|---|---|
| Survive primary failure | HA with operator (StatefulSet + streaming replication) |
| Survive failure + scale reads | HA + read replicas + PgBouncer or ProxySQL |
| Multiple nodes accepting writes | Galera Cluster or MySQL Group Replication |
| Genuinely distributed write scaling | CockroachDB, YugabyteDB, TiDB |
For most teams: HA with read replicas covers 90% of real-world requirements. True write scaling is operationally expensive and usually the last resort after exhausting caching, connection pooling, and read/write splitting.
Storage Backend — Inside vs Outside the Cluster
Every PVC needs a backend. Kubernetes has no storage of its own — a CSI driver translates PVC requests into real storage operations on whatever backend you provide.
```
PVC request → StorageClass → CSI driver → storage backend → physical disk
```
Outside the Cluster — Dedicated Storage Server
A separate machine exists purely to serve storage. Kubernetes nodes are compute only.
```
┌──────────────────────────────┐      ┌──────────────────────┐
│  Kubernetes Cluster          │      │  Storage Server      │
│                              │      │                      │
│  Node 1 (compute) ───────────┼──────┼──→ ZFS + NFS/iSCSI   │
│  Node 2 (compute) ───────────┼──────┼──→ NVMe-oF           │
│  Node 3 (compute) ───────────┼──────┼──→ Ceph (external)   │
└──────────────────────────────┘      └──────────────────────┘
```
Advantages: storage lifecycle independent of cluster — rebuild or upgrade the cluster, data is untouched. Easier to size separately. ZFS gives you snapshots, compression, checksumming at the storage layer. Simpler mental model — storage is one place.
Disadvantages: network is in the critical I/O path. Storage server is a potential single point of failure unless clustered. NFS specifically has performance limitations under high concurrency.
Inside the Cluster — Hyperconverged
Worker nodes contribute their local disks to a distributed storage pool. Storage software runs as pods on the same nodes as your application.
```
┌─────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                 │
│                                                     │
│  Node 1: app pods + storage daemon (NVMe SSD)       │
│  Node 2: app pods + storage daemon (NVMe SSD)       │
│  Node 3: app pods + storage daemon (NVMe SSD)       │
│                                                     │
│  Distributed storage pool across all nodes          │
└─────────────────────────────────────────────────────┘
```
Advantages: data locality — Kubernetes can schedule a pod on the node holding its data, eliminating the network hop. No separate infrastructure. Scales with the cluster.
Disadvantages: storage and compute compete for the same resources. Cluster upgrades touch storage nodes. If you need to rebuild the cluster, storage is entangled. Ceph is operationally complex.
ZFS + NFS — Full Configuration
The most common small-to-medium on-prem setup. A dedicated ZFS storage server exporting NFS shares, with a Kubernetes CSI driver creating one directory per PVC automatically.
On the Storage Server
```bash
# Install ZFS and the NFS server
apt install zfsutils-linux nfs-kernel-server -y

# Create pool — mirror vdevs for database workloads
zpool create k8s-storage \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1

# Enable compression
zfs set compression=lz4 k8s-storage

# Create parent dataset for Kubernetes
zfs create k8s-storage/k8s

# Export via NFS
cat >> /etc/exports << 'EOF'
/k8s-storage/k8s *(rw,sync,no_subtree_check,no_root_squash)
EOF
exportfs -rav
showmount -e localhost
```
- `sync` — writes confirmed only after hitting disk. Safer for databases.
- `no_root_squash` — the CSI driver needs to create directories as root.
- `no_subtree_check` — improves reliability when exporting subdirectories.
For Database Replicas — One Dataset Per Pod
For databases you want exclusive storage per replica, not a shared directory. Create one ZFS dataset per replica with its own dedicated export:
```bash
# One dataset per replica, tuned for the database
for i in 0 1 2; do
  zfs create k8s-storage/k8s/postgres-$i
  zfs set recordsize=8K   k8s-storage/k8s/postgres-$i   # match PostgreSQL's 8KB page size
  zfs set compression=lz4 k8s-storage/k8s/postgres-$i
  # Export exclusively to the node running that replica.
  # Node IPs are illustrative: substitute the IP of the node running postgres-$i.
  zfs set sharenfs="rw=@192.168.1.1$i,sync,no_root_squash" \
    k8s-storage/k8s/postgres-$i
done
```
Three NFS exports, not one. Each pod gets its own dedicated export. Each export only accepts connections from the specific Kubernetes node running that replica. Storage isolation is maintained. The database still has exclusive file access.
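With per-replica exports you typically bypass the dynamic provisioner and pre-create one static PV per export, pre-bound to the StatefulSet's predictable PVC name. A hedged sketch for postgres-0 (server IP follows the post's example; the PV name and namespace are assumptions):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-0-nfs
spec:
  capacity:
    storage: 100Gi
  accessModes: [ "ReadWriteOnce" ]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.1.100
    path: /k8s-storage/k8s/postgres-0
  claimRef:                    # pre-bind to the StatefulSet's PVC
    namespace: default
    name: data-postgres-0
```

Repeat for postgres-1 and postgres-2; the `claimRef` guarantees each export binds only to its intended replica.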
On Each Kubernetes Node
```bash
apt install nfs-common -y

# Test before configuring the CSI driver
mkdir -p /mnt/test
mount -t nfs 192.168.1.100:/k8s-storage/k8s /mnt/test
umount /mnt/test
```
Install NFS CSI Driver in Kubernetes
```bash
helm repo add nfs-subdir-external-provisioner \
  https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
helm install nfs-provisioner \
  nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --set nfs.server=192.168.1.100 \
  --set nfs.path=/k8s-storage/k8s \
  --set storageClass.name=nfs-storage \
  --namespace storage \
  --create-namespace

# Verify
kubectl get storageclass
```
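Once the class exists, any PVC that references it gets a directory carved out on the export automatically. A minimal sketch (for shared, non-database data; databases get per-replica exclusive storage as described above):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config
spec:
  accessModes: [ "ReadWriteMany" ]   # fine for config/static files, not DB data
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 5Gi
```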
Other Storage Backends — What They Are and When to Use Them
iSCSI — Block Storage Over Network
Instead of a filesystem (NFS), you export raw block devices over the network. The Kubernetes node receives what looks like a local disk, formats it with ext4 or XFS, and mounts it.
```
ZFS storage server → iSCSI target (zvol per PVC)
        ↓
Kubernetes node iSCSI initiator
        ↓
Appears as /dev/sdb — raw block device
        ↓
Kubernetes formats with ext4, mounts into pod
```
ZFS zvols are ideal iSCSI targets — thin provisioned, snapshotted instantly, checksummed. The democratic-csi driver automates the entire lifecycle: PVC created → zvol created → iSCSI target configured → node connects → PV bound.
Significantly faster than NFS because it is a block protocol — no filesystem overhead at the network layer, the filesystem runs on the Kubernetes node, full queue depth available.
NVMe-oF — Near-Local Performance Over Network
NVMe protocol running over a network (RDMA or TCP) instead of a PCIe bus. The Kubernetes node sees what looks like a local NVMe drive. Network latency is in microseconds rather than milliseconds.
```
Storage server NVMe SSDs → NVMe-oF target
        ↓  (RDMA or TCP network)
Kubernetes node NVMe-oF initiator
        ↓
Appears as /dev/nvme1n1 — full NVMe queue depth, microsecond latency
```
RDMA-capable NICs give you the best performance but cost more. NVMe-oF over TCP is lower cost and still dramatically better than iSCSI. Mayastor (OpenEBS) uses NVMe-oF internally for its high-performance storage engine.
Ceph / Rook — The Complete On-Prem Solution
Ceph is a distributed storage system. Rook is the Kubernetes operator that runs Ceph inside your cluster. Together they provide block storage (RBD), shared filesystem (CephFS), and S3-compatible object storage — all from one system.
```bash
# Install the Rook operator
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/crds.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/common.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/operator.yaml

# Create a Ceph cluster using node-local SSDs
cat << 'EOF' | kubectl apply -f -
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
  storage:
    useAllNodes: true
    useAllDevices: false
    devices:
    - name: nvme0n1
    - name: nvme1n1
EOF
```
Ceph then provides via CSI:
- RBD → ReadWriteOnce block storage for databases
- CephFS → ReadWriteMany for shared workloads
- RGW → S3-compatible object storage for backups
Best full-featured on-prem solution. Operationally complex — requires expertise and dedicated resources. Minimum 3 nodes, prefers dedicated storage nodes at scale.
Longhorn — Simpler Distributed Block Storage
Purpose-built for Kubernetes. Each node's local disk contributes to a distributed pool. Easier to operate than Ceph, lighter resource footprint.
```
Node 1 (local NVMe) ─┐
Node 2 (local NVMe) ─┼──→ Longhorn pool → CSI → PVCs
Node 3 (local NVMe) ─┘

Each volume automatically replicated across multiple nodes
Node failure → volume still available from other replicas
```
Good for smaller teams where Ceph's operational complexity is too much. Lower peak performance than Ceph under heavy concurrent load. Install with a single Helm command.
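A hedged sketch of a Longhorn StorageClass (parameter names follow Longhorn's documented defaults; verify against your version):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"        # copies kept on distinct nodes
  staleReplicaTimeout: "2880"  # minutes before a dead replica is rebuilt
reclaimPolicy: Retain          # keep data if the PVC is deleted
allowVolumeExpansion: true
```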
Performance Comparison — Honest Numbers
Network speed is the ceiling for everything except local storage. Invest in network before optimizing protocol.
| Solution | Latency | Throughput | IOPS | Notes |
|---|---|---|---|---|
| Local NVMe | 20–100 µs | 5–7 GB/s | 500k–1M | Baseline — no network |
| NVMe-oF over RDMA | 50–150 µs | 4–6 GB/s | 400k–800k | Near-local, expensive NICs |
| NVMe-oF over TCP | 100–300 µs | 3–5 GB/s | 200k–500k | Good balance of cost/perf |
| iSCSI (10GbE) | 200–500 µs | 1–2 GB/s | 100k–300k | Solid, widely supported |
| Ceph RBD (10GbE) | 500 µs–2 ms | 1–3 GB/s | 50k–200k | Higher latency due to 3× replication |
| Longhorn (10GbE) | 1–5 ms | 0.5–1.5 GB/s | 20k–100k | Easy to operate |
| NFS (10GbE) | 500 µs–3 ms | 0.5–1.5 GB/s | 10k–50k | Worst for concurrent random I/O |
Key observations:
NFS is the slowest for concurrent random I/O — filesystem protocol overhead, metadata round trips, no deep queue pipelining. Degrades fastest under concurrent load. Fine for shared config and static files. Not appropriate for databases under real load.
iSCSI is significantly faster than NFS — block protocol, no filesystem overhead at network layer, full queue depth available. Same physical network, much better database performance.
Ceph RBD latency is higher than iSCSI despite similar raw hardware — each write replicates to 3 OSDs before confirming. The replication overhead is the price for Ceph's built-in redundancy.
NVMe-oF approaches local NVMe — that's the whole point of the protocol. 25GbE network with RDMA-capable NICs and your pods see near-local storage performance.
A 10GbE → 25GbE upgrade does more for storage performance than switching from NFS to iSCSI. Always fix the network bottleneck first.
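The "network is the ceiling" claim is just arithmetic. A quick sanity check (line rate only, ignoring protocol overhead, which only lowers these numbers):

```python
# Theoretical line-rate ceiling: link speed in bits/s divided by 8.
def ceiling_gb_per_s(gbit: float) -> float:
    """Max GB/s a GbE link of the given speed can carry."""
    return gbit * 1e9 / 8 / 1e9

for link in (10, 25, 100):
    print(f"{link} GbE -> {ceiling_gb_per_s(link):.2f} GB/s max")
# A local NVMe drive at 5-7 GB/s saturates a 10 GbE link (1.25 GB/s)
# four times over; 25 GbE (~3.1 GB/s) is the first realistic match.
```

This is why the table's network-attached rows can never beat the link they run over, regardless of protocol.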
PVC Naming — It's Not As Random As It Looks
This confuses everyone at first because there are two different objects: the PVC and the PV.
PVC (PersistentVolumeClaim) — your named handle. For StatefulSets this is always predictable:
```
Pattern: <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>

postgres StatefulSet with "data" template:
  data-postgres-0  ← pod postgres-0's storage
  data-postgres-1  ← pod postgres-1's storage
  data-postgres-2  ← pod postgres-2's storage
```
Always human-readable. Always traceable to the pod.
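The pattern is mechanical enough to compute. A small sketch:

```python
def statefulset_pvc_names(template: str, statefulset: str, replicas: int):
    """PVC names a StatefulSet creates: <template>-<statefulset>-<ordinal>."""
    return [f"{template}-{statefulset}-{i}" for i in range(replicas)]

print(statefulset_pvc_names("data", "postgres", 3))
# ['data-postgres-0', 'data-postgres-1', 'data-postgres-2']
```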
PV (PersistentVolume) — the actual storage allocation on the backend. This gets a UUID name because Kubernetes generates it dynamically. This is the "random string" you were seeing.
```bash
# See PVCs — these names are yours, meaningful
kubectl get pvc -A

# Output:
#   NAMESPACE   NAME              STATUS   VOLUME                                     CAPACITY
#   default     data-postgres-0   Bound    pvc-3f8a2b1c-4d5e-6f7a-8b9c-0d1e2f3a4b5c   100Gi
#
# NAME   = your PVC — meaningful
# VOLUME = the PV — UUID, backend implementation detail
```
Tracing the Full Chain
```bash
# Which pod uses which PVC
kubectl get pods -n your-namespace -o json | \
  jq '.items[] | {
    pod: .metadata.name,
    pvcs: [.spec.volumes[]? | select(.persistentVolumeClaim) | .persistentVolumeClaim.claimName]
  }'

# Describe a PVC — shows everything including which pod uses it
kubectl describe pvc data-postgres-0 -n your-namespace
# Shows: Used By: postgres-0

# Describe the PV — shows where on the backend the data physically lives
kubectl describe pv pvc-3f8a2b1c-4d5e-6f7a-8b9c-0d1e2f3a4b5c
# For an NFS backend shows:
#   NFS:
#     Server: 192.168.1.100
#     Path:   /k8s-storage/k8s/default-data-postgres-0-pvc-3f8a2b1c...
# For Ceph RBD shows:
#   RBD:
#     Pool:  kubernetes
#     Image: csi-vol-3f8a2b1c...
```
The chain is always fully traceable:
```
Pod name (postgres-0)
  → PVC name (data-postgres-0)     kubectl get pvc
  → PV name (pvc-3f8a2b1c...)      kubectl describe pvc
  → Backend path or device         kubectl describe pv
  → Physical storage on disk
```
On the NFS storage server, the CSI driver creates directories named <namespace>-<pvc-name>-<pv-uuid> — both the meaningful PVC name and the UUID together:
```
/k8s-storage/k8s/
  default-data-postgres-0-pvc-3f8a2b1c.../   ← postgres-0's data
  default-data-postgres-1-pvc-7a8b9c0d.../   ← postgres-1's data
  default-data-postgres-2-pvc-9e0f1a2b.../   ← postgres-2's data
```
You can look at the directory name and immediately know which pod and namespace it belongs to.
Pod Deleted — Data Is Safe
PVCs survive pod deletion by design. This is not an accident:
```bash
# Delete the postgres-0 pod — deliberately or accidentally
kubectl delete pod postgres-0 -n your-namespace

# PVC still exists
kubectl get pvc -n your-namespace
#   data-postgres-0   Bound   pvc-3f8a2b1c...   100Gi   ← still here

# The StatefulSet controller recreates the pod.
# The new postgres-0 pod looks for the PVC named "data-postgres-0", finds it,
# and mounts it — all data intact, nothing lost.
```
StatefulSets are specifically designed for this. The predictable naming means the pod always finds its PVC on restart. You don't need to know the UUID — the pod finds its storage by name.
Check your reclaim policy — it determines what happens to the PV if the PVC is explicitly deleted:
```bash
# Check the storage class reclaim policy
kubectl get storageclass
# RECLAIMPOLICY column — should be Retain for important data
#   Retain: even if the PVC is deleted, the PV and data survive
#   Delete: PVC deleted → PV deleted → data gone

# Check a specific PV
kubectl get pv pvc-3f8a2b1c... -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'

# Change an existing PV to Retain
kubectl patch pv pvc-3f8a2b1c... \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```
Set Retain for any production database storage class.
Backup Strategy — Three Layers, Each Solving a Different Problem
Layer 1 — Application Dump (Most Important, Most Portable)
A logical export of the actual data. Independent of storage, Kubernetes, or infrastructure. Restore anywhere.
```bash
# PostgreSQL — one database
kubectl exec -n your-namespace postgres-0 -- \
  pg_dump -U postgres -d mydb > mydb-$(date +%Y%m%d).sql

# PostgreSQL — all databases
kubectl exec -n your-namespace postgres-0 -- \
  pg_dumpall -U postgres > all-dbs-$(date +%Y%m%d).sql

# MySQL — quote the command so $MYSQL_ROOT_PASSWORD expands inside
# the pod, not on your workstation
kubectl exec -n your-namespace mysql-0 -- \
  sh -c 'mysqldump -u root -p"$MYSQL_ROOT_PASSWORD" --all-databases' \
  > all-dbs-$(date +%Y%m%d).sql

# MongoDB
kubectl exec -n your-namespace mongo-0 -- \
  mongodump --archive > mongodump-$(date +%Y%m%d).archive
```
Restore to a fresh pod after disaster:
```bash
# New pod, fresh empty PVC — restore from the dump
kubectl exec -i -n your-namespace postgres-0 -- \
  psql -U postgres < mydb-20240301.sql
```
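To make Layer 1 automatic, dumps are usually run from a CronJob inside the cluster. A hedged sketch (the image tag, Secret name, host, and destination PVC are assumptions):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-dump
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: dump
            image: postgres:16     # client tools should match the server major version
            command: ["/bin/sh", "-c"]
            args:
            - pg_dump -h postgres-0.postgres -U postgres -d mydb
              > /backup/mydb-$(date +%Y%m%d).sql
            env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef: { name: postgres-creds, key: password }
            volumeMounts:
            - { name: backup, mountPath: /backup }
          volumes:
          - name: backup
            persistentVolumeClaim: { claimName: backup-storage }
```

Ship the resulting files off the storage server as well; a dump that lives only next to the database is not a disaster-recovery plan.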
Layer 2 — Volume Snapshot (Fast, Same Storage System)
If your CSI driver supports it (Ceph, most cloud providers, ZFS-backed CSI):
```yaml
# Take a snapshot of a specific PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-0-snap-20240301
  namespace: your-namespace
spec:
  volumeSnapshotClassName: csi-ceph-blockpool
  source:
    persistentVolumeClaimName: data-postgres-0
```
Restore — create a new PVC from the snapshot:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0-restored
spec:
  accessModes: [ "ReadWriteOnce" ]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: postgres-0-snap-20240301
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
```
Fast (ZFS COW — instant snapshot, instant clone). But the backup lives on the same storage system. Not protection against storage server failure.
Layer 3 — Velero (Full Cluster Backup)
Velero backs up both Kubernetes objects (StatefulSets, Services, ConfigMaps, Secrets, PVC definitions) and the actual volume data. This is what you need to fully restore a workload after a cluster disaster.
```bash
# Install Velero with an S3-compatible backend (MinIO on-prem or AWS S3)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-storage-location-config \
    region=minio,s3ForcePathStyle=true,s3Url=http://minio.storage:9000

# Back up an entire namespace
velero backup create postgres-backup-20240301 \
  --include-namespaces your-namespace

# Check backup status
velero backup describe postgres-backup-20240301
```
Restore to a new or rebuilt cluster:
```bash
velero restore create --from-backup postgres-backup-20240301

# Velero recreates:
#   PVCs with the correct names
#   the StatefulSet and pods
#   Services, ConfigMaps, Secrets
# and restores data into the PVCs.
# Everything comes back with the same names and configuration.
```
Decision Guide — What to Choose for Your Situation
Storage backend:
| Situation | Recommendation |
|---|---|
| Small team, want simplicity | ZFS + iSCSI via democratic-csi |
| Medium team, want everything in Kubernetes | Rook/Ceph hyperconverged |
| Need best performance, budget available | NVMe-oF over TCP with Mayastor |
| Development/staging | Longhorn |
| Already have NAS/ZFS server | ZFS + NFS (one export per pod for databases) |
Database architecture:
| Need | Solution |
|---|---|
| Survive primary failure | HA with operator + ReadWriteOnce PVCs |
| Survive failure + scale reads | HA + read replica Service + PgBouncer |
| Multiple nodes accepting writes | Galera or MySQL Group Replication |
| Horizontal write scaling | CockroachDB, YugabyteDB |
Backup:
| Backup type | Tool | Protects against |
|---|---|---|
| Logical dump | pg_dump, mysqldump | Data corruption, migration, portability |
| Volume snapshot | VolumeSnapshot API | Fast recovery, same storage system |
| Full cluster backup | Velero | Cluster disaster, rebuilds |
The Common Thread
Your colleague's assumption collapsed two separate concerns into one — storage consistency and database consistency — and assumed the filesystem handles both. It handles neither for concurrent database writers.
The correct architecture separates them cleanly: the storage backend provides durable, exclusive block storage. The database engine provides replication and transaction consistency. The operator manages the database lifecycle. Velero manages backup and disaster recovery. Each layer does its own job. None of them substitute for the others.
Compiled by AI. Proofread by caffeine. ☕