You have services scattered across machines. They need to find each other. You need to deploy new versions without downtime. Something crashes at 3 AM and the system should heal itself.

Kubernetes is the default answer. It's also 6+ moving parts, a certification ecosystem, and enough YAML to wallpaper a house.

There's another way: two binaries. Nomad schedules your workloads. Consul handles discovery and connectivity. Together, they do what Kubernetes does with dramatically less complexity.

This is the complete guide to both.


Nomad: The Workload Orchestrator

Nomad is a single binary that schedules and runs your applications. No etcd. No API server. No controller manager. No kubelet. One binary.

It runs Docker containers, yes. But also raw binaries, Java JARs, batch jobs, QEMU VMs, and Podman containers. That last part matters — if you have legacy applications that can't be containerized, Nomad doesn't care. It runs them anyway.

Roblox runs Nomad at scale. So do eBay, PagerDuty, and Trivago. This isn't a toy.

Architecture: Servers and Clients

Nomad has two roles:

Nomad Server — the brain. Makes scheduling decisions, stores cluster state using Raft consensus. You run 3 or 5 of these for high availability (always odd numbers — Raft needs a majority to elect a leader).

Nomad Client — the worker. Runs on every machine that executes workloads. Reports available resources (CPU, RAM, disk, GPU) back to the servers. Servers look at what's available and decide where to place your jobs.
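
The "always odd numbers" rule is just majority math. A quick sketch of why 3 and 5 are the sweet spots:

```python
# Raft needs a strict majority (a quorum) of servers to elect a leader
# and commit writes. Even cluster sizes add nodes without adding safety.
def quorum(servers: int) -> int:
    """Votes required for a Raft majority."""
    return servers // 2 + 1

def tolerated_failures(servers: int) -> int:
    """Servers you can lose while still holding a quorum."""
    return servers - quorum(servers)

for n in (1, 2, 3, 4, 5):
    print(f"{n} servers: quorum={quorum(n)}, can lose {tolerated_failures(n)}")
# 2 servers tolerate zero failures; 4 tolerate one, same as 3.
```

So a 4th or 6th server buys you nothing over 3 or 5 — it only adds replication overhead.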

                    ┌─────────────┐
                    │ Nomad Server│ (Leader)
                    │  (3 or 5)   │
                    └──────┬──────┘
                           │ schedule jobs
              ┌────────────┼────────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │  Client  │ │  Client  │ │  Client  │
        │ (worker) │ │ (worker) │ │ (worker) │
        │ Docker,  │ │ Docker,  │ │ Docker,  │
        │ exec,etc │ │ exec,etc │ │ exec,etc │
        └──────────┘ └──────────┘ └──────────┘

Servers elect a leader. Clients register with servers. When you submit a job, the leader evaluates which clients have enough resources, picks the best candidates, and creates allocations.

Core Concepts

Job — a specification of what to run. Written in HCL (HashiCorp Configuration Language). Think of it as a Kubernetes Deployment, but more readable.

Task Group — a set of tasks that must run together on the same machine. This is Nomad's equivalent of a Kubernetes Pod. If your API needs a sidecar logging agent, they go in the same task group.

Task — a single unit of work. One Docker container. One binary. One script.

Allocation — an instance of a task group placed on a specific client. If your job says count = 3, you get 3 allocations, potentially on 3 different clients.

Driver — how Nomad actually runs the task. The docker driver pulls and runs containers. The exec driver runs binaries in isolation. The raw_exec driver runs binaries without isolation (use carefully). There's also java, qemu, and podman.
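
To get a feel for the non-container drivers: a task using exec looks almost identical to a Docker task — only the driver and config change. (The binary path and args below are hypothetical.)

```hcl
task "billing" {
  driver = "exec"   # runs the binary with OS-level isolation, no image needed

  config {
    command = "/opt/billing/billingd"          # hypothetical legacy binary
    args    = ["-config", "/etc/billing.conf"]
  }

  resources {
    cpu    = 500
    memory = 256
  }
}
```

This is why legacy workloads fit: same job file shape, same scheduling, no containerization required.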

Namespace — logical isolation for multi-team environments. Team A's jobs don't collide with Team B's.

Job Types

service — long-running processes. Web servers, APIs, databases. If they crash, Nomad restarts them. This is what you'll use most.

batch — run-to-completion jobs. Data processing, database migrations, report generation. Runs once, exits, done.

system — runs one instance on every client node. Log collectors, monitoring agents, anything that needs to be everywhere.

A Real Job File

This deploys 3 instances of a web API behind health checks:

job "web" {
  datacenters = ["dc1"]
  type = "service"

  group "app" {
    count = 3

    network {
      port "http" { to = 8080 }
    }

    task "api" {
      driver = "docker"

      config {
        image = "myapp:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 500   # MHz
        memory = 256   # MB
      }

      service {
        name = "web-api"
        port = "http"

        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Read it top to bottom: one job called "web", targeting datacenter "dc1", running as a long-lived service. Inside, a group called "app" with 3 instances. Each instance runs one task — a Docker container from myapp:latest, exposed on port 8080, with 500 MHz CPU and 256 MB RAM allocated.

The service block registers this with Consul (more on that shortly) and defines a health check hitting /health every 10 seconds.

Deploy it with one command: nomad job run web.nomad


Consul: Service Discovery + Service Mesh

Nomad tells your services where to run. Consul tells your services how to find each other.

Consul is the phone book for your infrastructure. Every service registers itself — name, IP, port. Other services look up the name and get back healthy endpoints. No hardcoded IPs. No config files listing every service address. Services come and go, and Consul tracks all of it.

But it's more than a phone book. Consul also provides:

  • Health checking — monitors every registered service, automatically removes unhealthy ones from discovery
  • DNS interface — query web-api.service.consul and get back healthy IPs, no code changes needed
  • KV store — shared key-value configuration data across your cluster
  • Service mesh — mTLS encryption between services, plus access control rules (which service can talk to which)

Architecture

Like Nomad, Consul has servers and agents:

Consul Server — stores the service catalog and KV data, runs Raft consensus. Run 3 or 5 for HA.

Consul Agent (Client mode) — runs on every node. Registers local services, executes health checks, forwards queries to servers. Lightweight.

        ┌──────────────┐
        │Consul Server │ (3 or 5, Raft consensus)
        │  (catalog,   │
        │   KV store)  │
        └──────┬───────┘
               │
    ┌──────────┼──────────┐
    ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Agent  │ │ Agent  │ │ Agent  │
│(node 1)│ │(node 2)│ │(node 3)│
│        │ │        │ │        │
│ web-api│ │  db    │ │ cache  │
│ ↕ proxy│ │ ↕ proxy│ │ ↕ proxy│
└────────┘ └────────┘ └────────┘

Service Discovery in Practice

A service registers itself with Consul:

{
  "service": {
    "name": "web-api",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}

Now any other service can find it:

Via DNS:

dig web-api.service.consul

Returns the healthy IPs. Point your app at web-api.service.consul and Consul handles the rest.

Via HTTP API:

curl http://localhost:8500/v1/health/service/web-api?passing

Returns JSON with all healthy instances — IPs, ports, metadata.
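
Consuming that response is a few lines of code. A sketch that extracts healthy endpoints from a trimmed sample payload (field shapes follow Consul's health API; the addresses are made up):

```python
import json

# Trimmed sample of /v1/health/service/web-api?passing output.
sample = json.loads("""
[
  {
    "Node":    {"Node": "node1", "Address": "10.0.0.11"},
    "Service": {"Service": "web-api", "Address": "", "Port": 8080}
  },
  {
    "Node":    {"Node": "node2", "Address": "10.0.0.12"},
    "Service": {"Service": "web-api", "Address": "10.0.0.12", "Port": 8080}
  }
]
""")

def endpoints(entries):
    """Resolve each entry to (host, port), preferring the service address
    and falling back to the node address when it is empty."""
    out = []
    for e in entries:
        host = e["Service"]["Address"] or e["Node"]["Address"]
        out.append((host, e["Service"]["Port"]))
    return out

print(endpoints(sample))  # [('10.0.0.11', 8080), ('10.0.0.12', 8080)]
```

Note the fallback: unlike plain DNS A records, the HTTP API gives you the port too.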

No service mesh. No sidecar. Just DNS or HTTP. Your services discover each other the same way the internet works.

Service Mesh (Consul Connect)

DNS-based discovery is great for finding services. But it doesn't encrypt the traffic between them, and it doesn't control who can talk to whom.

Consul Connect solves both.

It deploys sidecar proxies (Envoy, by default) alongside each service. All service-to-service traffic flows through these proxies, which handle:

  • Automatic mTLS — every connection is encrypted. Your app code doesn't change. The proxy handles certificates, rotation, all of it.
  • Intentions — access control rules at the service level. "web-api CAN talk to database." "frontend CANNOT talk to database directly." Zero-trust networking without firewall rules.

This is service mesh without the Kubernetes tax. Two words: consul connect.


The Integration: Nomad + Consul Together

Here's where it gets good.

When Nomad and Consul run on the same nodes, they integrate automatically. No glue code. No adapters. They were built for each other.

What happens:

  1. Automatic clustering — Nomad servers discover each other through Consul. No manual IP configuration. Start both, they find each other.

  2. Service registration — when Nomad deploys a service with a service block, it automatically registers in Consul's catalog. You deploy with Nomad, you discover with Consul.

  3. Health checks flow through — the health checks you define in your Nomad job feed directly into Consul. If a service fails its check, Consul removes it from DNS responses and the service catalog.

  4. Service mesh injection — add a connect block to your Nomad job, and Nomad automatically creates and manages the Envoy sidecar proxy. No manual proxy configuration.

  5. Dynamic configuration — Nomad's template block can pull values from Consul's KV store and render them into config files or environment variables. Change a value in Consul, your services pick it up.

  6. DNS discovery — every service Nomad deploys is queryable via servicename.service.consul. Your frontend finds your API. Your API finds your database. Automatically.

┌─────────────────────────────────────────┐
│              Control Plane              │
│  ┌──────────────┐    ┌──────────────┐   │
│  │ Nomad Server │ ←→ │Consul Server │   │
│  │ (scheduler)  │    │  (catalog)   │   │
│  └──────────────┘    └──────────────┘   │
└─────────────────────────────────────────┘
              │                │
    ┌─────────┼────────────────┼─────────┐
    │         ▼                ▼         │
    │  ┌─────────────────────────────┐   │
    │  │           Node 1            │   │
    │  │ Nomad Client + Consul Agent │   │
    │  │                             │   │
    │  │  ┌─────────┐ ┌───────────┐  │   │
    │  │  │ web-api │→│Envoy Proxy│  │   │
    │  │  └─────────┘ └───────────┘  │   │
    │  └─────────────────────────────┘   │
    │                                    │
    │  ┌─────────────────────────────┐   │
    │  │           Node 2            │   │
    │  │ Nomad Client + Consul Agent │   │
    │  │                             │   │
    │  │  ┌─────────┐ ┌───────────┐  │   │
    │  │  │   db    │→│Envoy Proxy│  │   │
    │  │  └─────────┘ └───────────┘  │   │
    │  └─────────────────────────────┘   │
    └────────────────────────────────────┘

The Flow in Practice

  1. You run: nomad job run web.nomad
  2. Nomad finds a client with enough CPU and RAM, creates an allocation
  3. The Docker container starts on that client
  4. The service block automatically registers web-api with the local Consul agent
  5. Consul starts running the health check every 10 seconds
  6. Other services query web-api.service.consul and get the healthy IP
  7. If the container crashes, Consul drops it from discovery, and Nomad restarts it on the same or a different client
  8. The new instance re-registers, health check passes, traffic flows again

No manual intervention. No scripts. No webhook chains. It just works because both tools speak the same language.
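
Under the hood, this behavior is driven by the consul stanza in Nomad's agent configuration. A minimal sketch (these values mirror the defaults; shown explicitly for clarity):

```hcl
# Nomad agent configuration, e.g. /etc/nomad.d/consul.hcl
consul {
  address          = "127.0.0.1:8500"  # local Consul agent
  auto_advertise   = true              # advertise Nomad's own services
  server_auto_join = true              # servers discover peers via Consul
  client_auto_join = true              # clients find servers via Consul
}
```

Because these are the defaults, running both agents on the same node gives you the integration with zero configuration.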


Hands-On: Running Nomad + Consul Locally

The fastest way to see this in action. Two terminals, two commands.

Option 1: Native Binaries

Download both from developer.hashicorp.com. Each is a single binary. Drop them in your PATH.

Terminal 1 — start Consul in dev mode:

consul agent -dev -client=0.0.0.0

Terminal 2 — start Nomad in dev mode (auto-discovers Consul):

nomad agent -dev -consul-address=127.0.0.1:8500

That's it. Nomad registered itself with Consul. Open http://localhost:4646 for the Nomad UI, http://localhost:8500 for Consul.

Option 2: Docker

docker run -d --name consul -p 8500:8500 hashicorp/consul agent -dev -client=0.0.0.0

docker run -d --name nomad --net=host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  hashicorp/nomad agent -dev

Now deploy something:

nomad job run web.nomad

Check Consul's catalog — your service appears automatically:

consul catalog services

Query it via DNS:

dig @127.0.0.1 -p 8600 web-api.service.consul

This is the feedback loop. Deploy with Nomad, discover with Consul. In under 5 minutes you've got a working orchestration + discovery stack.


Real-World Scenarios

Theory is nice. Here's how this works in practice.

Scenario 1: Microservices Deployment

Three services: a frontend, an API, and a database. Each discovers the others via Consul DNS.

job "platform" {
  datacenters = ["dc1"]
  type = "service"

  group "frontend" {
    count = 2

    network {
      port "http" { to = 3000 }
    }

    task "web" {
      driver = "docker"
      config {
        image = "frontend:latest"
        ports = ["http"]
      }
      env {
        API_URL = "http://api.service.consul:8080"
      }
      resources {
        cpu    = 300
        memory = 256
      }
      service {
        name = "frontend"
        port = "http"
        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }

  group "api" {
    count = 3

    network {
      port "http" { to = 8080 }
    }

    task "server" {
      driver = "docker"
      config {
        image = "api:latest"
        ports = ["http"]
      }
      env {
        DB_HOST = "database.service.consul"
      }
      resources {
        cpu    = 500
        memory = 512
      }
      service {
        name = "api"
        port = "http"
        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }

  group "db" {
    count = 1

    network {
      port "db" { to = 5432 }
    }

    task "postgres" {
      driver = "docker"
      config {
        image = "postgres:16"
        ports = ["db"]
      }
      env {
        POSTGRES_PASSWORD = "secretpassword"
      }
      resources {
        cpu    = 1000
        memory = 1024
      }
      service {
        name = "database"
        port = "db"
        check {
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

One job file. Three service groups. The frontend finds the API via api.service.consul. The API finds Postgres via database.service.consul. No hardcoded IPs anywhere. (One caveat: a plain A-record lookup returns only IPs. If you use dynamic host ports, clients need Consul's SRV records to learn the port too, or you can pin a static port in the network block.)

Deploy: nomad job run platform.nomad

Scale the API: change count = 3 to count = 5, re-run the job. Nomad adds 2 more allocations. Consul DNS automatically includes them. Load distributes across all 5.

Scenario 2: Rolling Updates

You push a new API image. You want zero-downtime deployment. Nomad's update stanza handles this:

group "api" {
  count = 3

  update {
    max_parallel     = 1
    min_healthy_time = "30s"
    healthy_deadline = "5m"
    auto_revert      = true
  }

  # ... task definition ...
}

What happens when you deploy api:v2:

  1. Nomad stops one allocation running v1
  2. Starts one allocation with v2
  3. Waits for the health check to pass for 30 seconds (min_healthy_time)
  4. If healthy — moves to the next allocation
  5. If unhealthy within 5 minutes (healthy_deadline) — triggers auto_revert, rolls everything back to v1
  6. Repeats until all 3 allocations run v2

One at a time. Health-checked. Auto-rollback on failure. You change the image tag and re-run nomad job run. That's the entire deployment process.
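
The sequence above can be sketched as a toy simulation (this is a model of the behavior, not Nomad's actual scheduler code):

```python
# Toy model of the update stanza: replace allocations one at a time
# (max_parallel = 1) and auto-revert everything if a new one is unhealthy.
def rolling_update(allocs, new_version, is_healthy):
    previous = list(allocs)
    for i in range(len(allocs)):
        allocs[i] = new_version            # stop old alloc, start new in place
        if not is_healthy(new_version):
            return previous, "auto-reverted"   # roll back to the old version
    return allocs, "complete"

print(rolling_update(["v1", "v1", "v1"], "v2", lambda v: True))
# (['v2', 'v2', 'v2'], 'complete')
print(rolling_update(["v1", "v1", "v1"], "v2", lambda v: False))
# (['v1', 'v1', 'v1'], 'auto-reverted')
```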

Scenario 3: Dynamic Configuration with Consul KV

Store configuration in Consul's KV store:

consul kv put config/db_host "database.service.consul"
consul kv put config/db_port "5432"
consul kv put config/log_level "info"

Pull it into your Nomad job with the template block:

task "api" {
  driver = "docker"

  template {
    data = <<EOF
DB_HOST={{ key "config/db_host" }}
DB_PORT={{ key "config/db_port" }}
LOG_LEVEL={{ key "config/log_level" }}
EOF
    destination = "local/env.txt"
    env         = true
  }

  # ... rest of task ...
}

Nomad renders the template using live values from Consul KV. The variables become environment variables inside the container. Change config/log_level to "debug" in Consul, and Nomad can automatically restart the task to pick up the change.

Configuration management without baking values into images or maintaining separate config files per environment.
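
Whether that restart happens is controlled by the template block's change_mode setting. A sketch:

```hcl
template {
  data        = "LOG_LEVEL={{ key \"config/log_level\" }}"
  destination = "local/env.txt"
  env         = true

  # What to do when the rendered output changes:
  #   "restart" (the default) restarts the task,
  #   "signal"  sends a signal (set change_signal, e.g. "SIGHUP"),
  #   "noop"    re-renders the file and does nothing else.
  change_mode = "restart"
}
```

Use "signal" for apps that can reload config in place, and "noop" for files the app re-reads on its own.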

Scenario 4: Service Mesh with mTLS

Enable Consul Connect in your Nomad job to get automatic encryption and access control:

job "secure-api" {
  datacenters = ["dc1"]
  type = "service"

  group "api" {
    count = 3

    network {
      mode = "bridge"
      port "http" { to = 8080 }
    }

    service {
      name = "api"
      port = "8080"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "database"
              local_bind_port  = 5432
            }
          }
        }
      }
    }

    task "server" {
      driver = "docker"
      config {
        image = "api:latest"
      }
      env {
        DB_HOST = "127.0.0.1"
        DB_PORT = "5432"
      }
    }
  }
}

The connect block tells Nomad to inject an Envoy sidecar proxy. The upstreams block says "I need to talk to the database service — make it available on localhost:5432."

Your app connects to localhost:5432. The Envoy proxy intercepts that, establishes a mutual TLS connection to the database's Envoy proxy, and forwards the traffic. Encrypted end-to-end. Your application code has no idea.

Then set intentions in Consul:

consul intention create -allow api database
consul intention create -deny frontend database

The API can reach the database. The frontend cannot. Zero-trust networking defined in two commands.

Scenario 5: Blue-Green Deployment

Deploy v2 alongside v1 using tagged services:

job "api-v2" {
  datacenters = ["dc1"]
  type = "service"

  group "api" {
    count = 3

    task "server" {
      driver = "docker"
      config {
        image = "api:v2"
      }

      service {
        name = "api"
        port = "http"
        tags = ["v2", "canary"]

        check {
          type     = "http"
          path     = "/health"
          interval = "5s"
          timeout  = "2s"
        }
      }
    }
  }
}

Both versions register as "api" in Consul. Route canary traffic using tags. Once v2 health checks pass and you've validated it — stop the v1 job:

nomad job stop api-v1

Roll back by stopping v2 instead. Both versions are live, both are health-checked, you control the cutover.
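
The tag-based routing itself happens in whatever sits in front of the services: a load balancer that filters on Consul tags, or client-side logic. A hypothetical client-side sketch of weighted canary selection (addresses made up):

```python
import random

# Healthy "api" instances as discovered from Consul.
instances = [
    {"addr": "10.0.0.11:8080", "tags": ["v1"]},
    {"addr": "10.0.0.12:8080", "tags": ["v1"]},
    {"addr": "10.0.0.21:8080", "tags": ["v2", "canary"]},
]

def pick(instances, canary_weight=0.1):
    """Send roughly canary_weight of requests to canary-tagged instances,
    the rest to stable ones."""
    want_canary = random.random() < canary_weight
    pool = [i for i in instances if ("canary" in i["tags"]) == want_canary]
    return random.choice(pool or instances)["addr"]

print(pick(instances))  # one of the three addresses above
```

Ramp the canary by raising canary_weight; roll back by dropping it to zero and stopping the v2 job.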


Nomad + Consul vs Kubernetes: The Honest Take

Kubernetes won the mindshare war. It has the bigger ecosystem — Helm charts, Operators, CRDs, a certification industry, and a conference circuit. If you're building a platform team at a large company and you need the ecosystem, Kubernetes is the safe bet.

But Kubernetes is also 6+ components minimum: etcd, API server, scheduler, controller manager, kubelet, kube-proxy. Each needs configuration, monitoring, and upgrades. Managed Kubernetes (EKS, GKE, AKS) hides some of this — until it doesn't, and you're debugging kubelet certificate rotation at 2 AM.

Nomad + Consul is 2 binaries. That's not a simplification for the sales pitch — it's the actual architecture. Two binaries, two data directories, two sets of config.

Choose Nomad + Consul when:

  • You have mixed workloads — containers and legacy apps that can't be containerized
  • You want simpler operations and fewer moving parts
  • You're already in the HashiCorp ecosystem (Vault, Terraform)
  • Your team is small and you can't afford a dedicated platform team
  • You want to understand your entire stack, not just the abstraction on top

Choose Kubernetes when:

  • You need the ecosystem — Helm, Operators, CRDs, the whole catalog
  • Your team is large enough to absorb the operational complexity
  • You're mostly running containers and don't need non-container workload support
  • You want the widest hiring pool (more people know Kubernetes)
  • Managed Kubernetes is available and acceptable for your use case

Neither is wrong. But if you've been assuming Kubernetes is the only option, you should know there's a stack that does 80% of what Kubernetes does with 20% of the complexity.


Production Checklist

Before you go live, nail these:

High Availability

  • Run 3 or 5 servers for both Nomad and Consul (odd numbers — Raft consensus needs a majority)
  • Never run 2 or 4 — split-brain risk

Security

  • Enable ACLs on both Nomad and Consul — without ACLs, anyone with network access can deploy jobs or read your service catalog
  • Enable TLS between all components — server-to-server, client-to-server, everything
  • Use Vault for secrets — don't put passwords in job files

Monitoring

  • Both Nomad and Consul expose Prometheus-compatible metrics; enable the telemetry stanza in each agent's config
  • Set up Grafana dashboards for cluster health, allocation status, and Consul service health
  • Alert on leader elections (frequent elections mean something is wrong)

Backups

  • Backup Raft data regularly: consul snapshot save backup.snap and nomad operator snapshot save backup.snap
  • Test restores. Untested backups aren't backups.

Networking

  • Ensure servers can reach each other on their RPC and Serf ports
  • Consul: 8300 (RPC), 8301 (LAN Serf), 8302 (WAN Serf), 8500 (HTTP API), 8600 (DNS)
  • Nomad: 4646 (HTTP API), 4647 (RPC), 4648 (Serf)
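
A quick way to sanity-check reachability of those ports from another node (the hostname "nomad-server-1" is a placeholder for one of your servers):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Consul and Nomad ports from the checklist above.
for port in (8300, 8301, 8302, 8500, 8600, 4646, 4647, 4648):
    status = "open" if can_connect("nomad-server-1", port) else "BLOCKED"
    print(f"port {port}: {status}")
```

Note this only checks TCP; Serf gossip (8301/8302) also uses UDP, which a simple connect test won't cover.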

Go Try It

Download both binaries. Run consul agent -dev in one terminal. Run nomad agent -dev in another. Deploy the example job file from this guide.

Watch it appear in the Consul catalog. Query it via DNS. Change the count and redeploy. Break the health check and watch Consul remove it.

The whole loop — deploy, discover, health check, recover — running on your laptop in under 5 minutes.
