Your data doesn't belong to OpenAI. Here's how to run a ChatGPT-grade assistant on your own hardware — no cloud, no subscriptions, no compromises on privacy.

The Stack

Two pieces. That's it.

Ollama is a model runtime. Think Docker, but for large language models. It pulls pre-quantized models and serves them through a local API. No Python environments, no dependency hell.

Open WebUI is a slick ChatGPT-style web interface that talks to Ollama. Chat history, model switching, admin controls — the whole package, running in a Docker container.

Together, they give you a private AI chatbot that works offline, costs nothing to run, and never phones home.

Why Bother

Cloud AI is great until it isn't. Maybe you're pasting proprietary code into ChatGPT and hoping the terms of service actually mean something. Maybe you're tired of rate limits killing your flow. Maybe you just want to ask questions without an audience.

Local AI flips the model:

            Local (Ollama)                             Cloud (ChatGPT/Claude)
Privacy     100% local — nothing leaves your machine   Your data hits third-party servers
Cost        Free forever                               $20/mo or pay-per-token
Internet    Works offline                              Requires connection
Speed       ~7 tokens/sec (honest)                     Near-instant
Quality     Solid for most tasks                       Better for complex reasoning
Limits      No rate limits, no content filters         Rate limits, content policies

The speed tradeoff is real. 7 tokens per second isn't fast. But for private questions, code help, translations, and experimentation — it's more than enough.

Hardware

Nothing exotic. A Ryzen 7 7840HS with 28GB RAM running Debian 12. No GPU. Pure CPU inference. This is a regular server, not a deep learning rig.

Setting It Up

1. Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

One line. It handles everything.

2. Start the Server

OLLAMA_HOST=0.0.0.0 ollama serve

⚠️ The gotcha that will waste your time: Ollama defaults to 127.0.0.1. If you skip OLLAMA_HOST=0.0.0.0, Docker containers can't reach it. You'll stare at connection errors wondering what you broke. Set it to 0.0.0.0 from the start.
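The problem is just the bind address. A minimal Python sketch (nothing Ollama-specific, just a throwaway socket) shows the behavior: a server bound to 127.0.0.1 only accepts connections over the loopback interface, which is exactly why containers on Docker's bridge network can't reach a default Ollama.

```python
import socket

# Bind a throwaway TCP server to 127.0.0.1 only -- Ollama's default.
# Bound this way, it accepts connections solely via loopback; traffic
# arriving on any other interface (e.g. Docker's bridge) never sees it.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # port 0 = let the kernel pick a free port
srv.listen(5)
port = srv.getsockname()[1]

def reachable(host: str) -> bool:
    try:
        socket.create_connection((host, port), timeout=0.5).close()
        return True
    except OSError:
        return False

print(reachable("127.0.0.1"))  # True: loopback connects fine
```

Binding to 0.0.0.0 instead tells the kernel to accept connections on every interface, which is all OLLAMA_HOST=0.0.0.0 changes.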

3. Pull a Model

ollama pull qwen2.5:14b

This downloads Qwen2.5 14B (Q4_K_M quantization) — about 9GB. It'll use roughly 9GB of RAM when loaded.

Ollama auto-loads the model on first request, keeps it in RAM, and unloads it after roughly five minutes of inactivity. Smart resource management with zero config.
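That idle window is tunable per request via the keep_alive field of Ollama's REST API. A sketch of the request body (endpoint and field names follow Ollama's documented /api/generate API; the prompt is made up):

```python
import json

# Request body for POST http://localhost:11434/api/generate.
# keep_alive controls how long the model stays in RAM after this request:
# a duration like "30m", 0 to unload immediately, or -1 to pin it forever.
payload = {
    "model": "qwen2.5:14b",
    "prompt": "Explain what a reverse proxy does in one sentence.",
    "stream": False,
    "keep_alive": "30m",
}

body = json.dumps(payload)
print(body)
```

POST that body with curl or any HTTP client and the model stays loaded for 30 minutes after answering instead of the default five.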

4. Launch Open WebUI

docker run -d \
  --name open-webui \
  --restart always \
  -p 3100:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Hit http://your-server:3100, sign up (first user becomes admin), and you're in. Open WebUI auto-discovers every model Ollama has downloaded. Chat history lives in the open-webui Docker volume.
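If the UI shows no models, check what Ollama is actually exposing: curl http://localhost:11434/api/tags returns the installed models as JSON. The sketch below parses a sample response of that documented shape (the sizes here are illustrative, not measured):

```python
import json

# A trimmed example of what GET /api/tags returns (shape per Ollama's
# API docs; values are illustrative).
sample = """
{"models": [
  {"name": "qwen2.5:14b", "size": 8988124416},
  {"name": "llama3:8b",   "size": 4661224676}
]}
"""

names = [m["name"] for m in json.loads(sample)["models"]]
print(names)  # ['qwen2.5:14b', 'llama3:8b']
```

Whatever appears in that list is what Open WebUI's model picker will show.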

Useful Ollama Commands

ollama list              # Show downloaded models
ollama ps                # Show currently loaded models
ollama run qwen2.5:14b   # Chat directly in terminal
ollama show qwen2.5:14b  # Model details
ollama rm qwen2.5:14b    # Delete a model

How It Actually Performs

I ran Qwen2.5 14B through real tasks on this exact hardware. No cherry-picking.

  • Logic puzzle (river crossing): ✅ Perfect solution
  • Python coding (IPv4 validator, no imports): ✅ Clean, correct — even caught leading zeros
  • Math (derivative with product rule): ✅ Step-by-step, properly factored
  • Arabic paragraph: ✅ Fluent, natural output
  • Trick question ("I have 3 apples, I eat 2 bananas"): ✅ Didn't fall for it
  • Technical explanation (reverse proxy): ✅ Clear and concise

Speed held steady at 6.5–6.8 tokens/sec. Responses landed in 10–15 seconds. Not instant, but perfectly usable.
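Those two numbers are consistent: at 6.5–6.8 tokens/sec, a typical 80–100-token answer takes roughly 12–15 seconds. Back-of-envelope, using my measured rates:

```python
# Response time ≈ output tokens / generation rate (ignores prompt
# processing, which adds a bit more on long contexts).
for tokens in (80, 100):
    for rate in (6.5, 6.8):
        print(f"{tokens} tokens @ {rate} tok/s -> {tokens / rate:.1f} s")
```

Longer answers scale linearly, so a 300-token essay is closer to 45 seconds.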

Other Models Worth Trying

With 28GB RAM, you have options. Only one model loads at a time, so download several and switch as needed:

Model             Size    Best For
llama3:8b         4.7GB   Fast general use
gemma2:9b         5.4GB   Quick chat
qwen2.5:14b       9GB     All-rounder (my pick)
deepseek-r1:14b   9GB     Math and reasoning
codestral:22b     13GB    Code generation

What You Should Know

Ollama is stateless. It has zero memory between requests. Open WebUI creates the illusion of memory by re-sending the full conversation context with every message. This means longer conversations get slower — more tokens to process each turn.
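Concretely, every turn Open WebUI sends something shaped like Ollama's /api/chat request with the whole transcript prepended. A sketch (message content is made up; the messages format matches Ollama's documented chat API):

```python
history = []  # Open WebUI keeps this; Ollama never does.

def chat_request(user_text: str) -> dict:
    # Append the new user turn, then ship the ENTIRE history every time.
    history.append({"role": "user", "content": user_text})
    return {"model": "qwen2.5:14b", "messages": list(history), "stream": False}

# Simulate two turns (the assistant reply is stubbed in for illustration).
r1 = chat_request("What is a reverse proxy?")
history.append({"role": "assistant", "content": "A server that forwards..."})
r2 = chat_request("Give me an nginx example.")

print(len(r1["messages"]), len(r2["messages"]))  # 1 3 -- the payload grows every turn
```

This is why long chats slow down: the model reprocesses the entire transcript as prompt tokens on every single turn.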

Use local AI for: Private questions, quick code help, translations, Arabic text, experimenting without API costs, offline work.

Use cloud AI when: You need complex multi-step reasoning, very long document analysis, or accuracy-critical output. Local models are good. Cloud models are still better for hard problems.

The Honest Verdict

Running a local AI chatbot in 2025 is surprisingly undramatic. Two installs, one Docker command, and you have a private ChatGPT alternative that works offline and costs nothing.

Is it as good as GPT-4 or Claude? No. Is it good enough for 80% of what most people use AI for? Absolutely.

The real win isn't performance — it's ownership. Your questions stay yours. Your data stays on your hardware. And when the next API price hike hits, you won't even notice.


Compiled by AI. Proofread by caffeine. ☕