Your data doesn't belong to OpenAI. Here's how to run a ChatGPT-grade assistant on your own hardware — no cloud, no subscriptions, no compromises on privacy.

The Stack

Two pieces. That's it.

Ollama is a model runtime. Think Docker, but for large language models. It pulls pre-quantized models and serves them through a local API. No Python environments, no dependency hell.

Open WebUI is a slick ChatGPT-style web interface that talks to Ollama. Chat history, model switching, admin controls — the whole package, running in a Docker container.

Together, they give you a private AI chatbot that works offline, costs nothing to run, and never phones home.

Why Bother

Cloud AI is great until it isn't. Maybe you're pasting proprietary code into ChatGPT and hoping the terms of service actually mean something. Maybe you're tired of rate limits killing your flow. Maybe you just want to ask questions without an audience.

Local AI flips the model:

            Local (Ollama)                             Cloud (ChatGPT/Claude)
Privacy     100% local — nothing leaves your machine   Your data hits third-party servers
Cost        Free forever                               $20/mo or pay-per-token
Internet    Works offline                              Requires connection
Speed       ~7 tokens/sec (honest)                     Near-instant
Quality     Solid for most tasks                       Better for complex reasoning
Limits      No rate limits, no content filters         Rate limits, content policies

The speed tradeoff is real. 7 tokens per second isn't fast. But for private questions, code help, translations, and experimentation — it's more than enough.

Hardware

Nothing exotic. A Ryzen 7 7840HS with 28GB RAM running Debian 12. No GPU. Pure CPU inference. This is a regular server, not a deep learning rig.

Setting It Up

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

One line. It handles everything.

2. Start the Server

OLLAMA_HOST=0.0.0.0 ollama serve

⚠️ The gotcha that will waste your time: Ollama defaults to 127.0.0.1. If you skip OLLAMA_HOST=0.0.0.0, Docker containers can't reach it. You'll stare at connection errors wondering what you broke. Set it to 0.0.0.0 from the start.
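The problem is just the bind address. A minimal Python sketch (nothing Ollama-specific, just a throwaway socket) shows the behavior: a server bound to 127.0.0.1 only accepts connections over the loopback interface, which is exactly why containers on Docker's bridge network can't reach a default Ollama.

```python
import socket

# Bind a throwaway TCP server to 127.0.0.1 only -- Ollama's default.
# Bound this way, it accepts connections solely via loopback; traffic
# arriving on any other interface (e.g. Docker's bridge) never sees it.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # port 0 = let the kernel pick a free port
srv.listen(5)
port = srv.getsockname()[1]

def reachable(host: str) -> bool:
    try:
        socket.create_connection((host, port), timeout=0.5).close()
        return True
    except OSError:
        return False

print(reachable("127.0.0.1"))  # True: loopback connects fine
```

Binding to 0.0.0.0 instead tells the kernel to accept connections on every interface, which is all OLLAMA_HOST=0.0.0.0 changes.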

3. Pull a Model

ollama pull qwen2.5:14b

This downloads Qwen2.5 14B (Q4_K_M quantization) — about 9GB. It'll use roughly 9GB of RAM when loaded.

Ollama auto-loads the model on first request, keeps it in RAM, and unloads it after roughly five minutes of inactivity. Smart resource management with zero config.
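That idle window is tunable per request via the keep_alive field of Ollama's REST API. A sketch of the request body (endpoint and field names follow Ollama's documented /api/generate API; the prompt is made up):

```python
import json

# Request body for POST http://localhost:11434/api/generate.
# keep_alive controls how long the model stays in RAM after this request:
# a duration like "30m", 0 to unload immediately, or -1 to pin it forever.
payload = {
    "model": "qwen2.5:14b",
    "prompt": "Explain what a reverse proxy does in one sentence.",
    "stream": False,
    "keep_alive": "30m",
}

body = json.dumps(payload)
print(body)
```

POST that body with curl or any HTTP client and the model stays loaded for 30 minutes after answering instead of the default five.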

4. Launch Open WebUI

docker run -d \
  --name open-webui \
  --restart always \
  -p 3100:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Hit http://your-server:3100, sign up (first user becomes admin), and you're in. Open WebUI auto-discovers every model Ollama has downloaded. Chat history lives in the open-webui Docker volume.
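If the UI shows no models, check what Ollama is actually exposing: curl http://localhost:11434/api/tags returns the installed models as JSON. The sketch below parses a sample response of that documented shape (the sizes here are illustrative, not measured):

```python
import json

# A trimmed example of what GET /api/tags returns (shape per Ollama's
# API docs; values are illustrative).
sample = """
{"models": [
  {"name": "qwen2.5:14b", "size": 8988124416},
  {"name": "llama3:8b",   "size": 4661224676}
]}
"""

names = [m["name"] for m in json.loads(sample)["models"]]
print(names)  # ['qwen2.5:14b', 'llama3:8b']
```

Whatever appears in that list is what Open WebUI's model picker will show.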

Useful Ollama Commands

ollama list              # Show downloaded models
ollama ps                # Show currently loaded models
ollama run qwen2.5:14b   # Chat directly in terminal
ollama show qwen2.5:14b  # Model details
ollama rm qwen2.5:14b    # Delete a model

How It Actually Performs

I ran Qwen2.5 14B through real tasks on this exact hardware. No cherry-picking.

  • Logic puzzle (river crossing): ✅ Perfect solution
  • Python coding (IPv4 validator, no imports): ✅ Clean, correct — even caught leading zeros
  • Math (derivative with product rule): ✅ Step-by-step, properly factored
  • Arabic paragraph: ✅ Fluent, natural output
  • Trick question ("I have 3 apples, I eat 2 bananas"): ✅ Didn't fall for it
  • Technical explanation (reverse proxy): ✅ Clear and concise

Speed held steady at 6.5–6.8 tokens/sec. Responses landed in 10–15 seconds. Not instant, but perfectly usable.
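Those two numbers are consistent: at 6.5–6.8 tokens/sec, a typical 80–100-token answer takes roughly 12–15 seconds. Back-of-envelope, using my measured rates:

```python
# Response time ≈ output tokens / generation rate (ignores prompt
# processing, which adds a bit more on long contexts).
for tokens in (80, 100):
    for rate in (6.5, 6.8):
        print(f"{tokens} tokens @ {rate} tok/s -> {tokens / rate:.1f} s")
```

Longer answers scale linearly, so a 300-token essay is closer to 45 seconds.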

Other Models Worth Trying

With 28GB RAM, you have options. Only one model loads at a time, so download several and switch as needed:

Model             Size    Best For
llama3:8b         4.7GB   Fast general use
gemma2:9b         5.4GB   Quick chat
qwen2.5:14b       9GB     All-rounder (my pick)
deepseek-r1:14b   9GB     Math and reasoning
codestral:22b     13GB    Code generation

What You Should Know

Ollama is stateless. It has zero memory between requests. Open WebUI creates the illusion of memory by re-sending the full conversation context with every message. This means longer conversations get slower — more tokens to process each turn.
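Concretely, every turn Open WebUI sends something shaped like Ollama's /api/chat request with the whole transcript prepended. A sketch (message content is made up; the messages format matches Ollama's documented chat API):

```python
history = []  # Open WebUI keeps this; Ollama never does.

def chat_request(user_text: str) -> dict:
    # Append the new user turn, then ship the ENTIRE history every time.
    history.append({"role": "user", "content": user_text})
    return {"model": "qwen2.5:14b", "messages": list(history), "stream": False}

# Simulate two turns (the assistant reply is stubbed in for illustration).
r1 = chat_request("What is a reverse proxy?")
history.append({"role": "assistant", "content": "A server that forwards..."})
r2 = chat_request("Give me an nginx example.")

print(len(r1["messages"]), len(r2["messages"]))  # 1 3 -- the payload grows every turn
```

This is why long chats slow down: the model reprocesses the entire transcript as prompt tokens on every single turn.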

Use local AI for: Private questions, quick code help, translations, Arabic text, experimenting without API costs, offline work.

Use cloud AI when: You need complex multi-step reasoning, very long document analysis, or accuracy-critical output. Local models are good. Cloud models are still better for hard problems.

The Honest Verdict

Running a local AI chatbot in 2025 is surprisingly undramatic. Two installs, one Docker command, and you have a private ChatGPT alternative that works offline and costs nothing.

Is it as good as GPT-4 or Claude? No. Is it good enough for 80% of what most people use AI for? Absolutely.

The real win isn't performance — it's ownership. Your questions stay yours. Your data stays on your hardware. And when the next API price hike hits, you won't even notice.


Compiled by AI. Proofread by caffeine. ☕