AI Models Decoded: What Every Self-Hoster Needs to Know

You've probably tried ChatGPT. Maybe you've heard about running "local AI" on your home server. But between the hype and the technical papers lies a gap — you want to understand what's actually happening under the hood before you commit 32GB of RAM to something you don't understand.

This isn't another "AI will change everything" post. This is the technical breakdown you need to make informed decisions about what to run locally and why.

Parameters: The Brain Cells That Actually Matter

A model's parameters are learned numbers — billions of them. Think floating-point weights that the model adjusts during training to get better at predicting the next word, pixel, or token.

When you see "Llama 3.2 3B" or "GPT-4 with a rumored 1.7T parameters," that number is your first clue about memory requirements and capability ceiling.

Here's the simple math:

Parameters (billions) × 2 bytes per weight = Minimum RAM in GB (at FP16)

But reality is messier. You need overhead for the runtime, operating system, and inference engine — rule of thumb: multiply by 1.5x. Quantization cuts the other way: 4-bit quantized models (the typical Ollama default) squeeze each weight into half a byte, which is how a 3B model can run in ~2GB.

Model Size   Parameters    RAM Needed   Good For
3B           3 billion     ~6GB         Classification, simple chat, coding help
7B           7 billion     ~14GB        General conversation, documentation, reasoning
14B          14 billion    ~28GB        Complex reasoning, better coding, analysis
70B          70 billion    ~140GB       Production-quality responses, research
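The table is just the 2-bytes-per-parameter rule applied at each size. A quick sketch, using the 1.5x overhead rule of thumb from above (an estimate, not a measured number):

```python
def weight_ram_gb(params_billion, bytes_per_param=2):
    # FP16 weights: 2 bytes per parameter.
    return params_billion * bytes_per_param

for size in (3, 7, 14, 70):
    weights = weight_ram_gb(size)
    # ~1.5x headroom for runtime, OS, and inference engine
    print(f"{size}B -> ~{weights:.0f}GB weights, ~{weights * 1.5:.0f}GB with overhead")
```

Swap `bytes_per_param` to 0.5 for a 4-bit quantized model and the numbers drop accordingly.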

Think of it like resolution. A 3B model draws with a basic crayon box; a 14B model gets the deluxe set. Same basic colors, but one can render fine details the other can't.

Parameters live in two dimensions: width (how many neurons per layer) and depth (how many layers). Wide models capture more concepts. Deep models do more reasoning steps. Modern architectures balance both.
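To get a rough feel for where the billions come from: in a decoder-only transformer, each layer's attention and MLP blocks contribute roughly 12 × width² parameters (embeddings ignored). The layer/width pairs below are illustrative shapes, not exact published configs:

```python
def approx_params(n_layers, d_model):
    # Rough decoder-only transformer estimate: attention (~4*d^2) plus
    # MLP (~8*d^2) parameters per layer; embedding tables ignored.
    return 12 * n_layers * d_model ** 2

print(f"{approx_params(28, 3072) / 1e9:.1f}B")  # deep-ish, narrow-ish: a "3B-class" model
print(f"{approx_params(32, 4096) / 1e9:.1f}B")  # wider and deeper: a "7B-class" model
```

Notice how the count scales with the *square* of width but only linearly with depth — widening a model is expensive.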

But here's the kicker: training quality beats raw size. A well-trained 3B model will outperform a garbage 7B model every time. Parameters are potential, not performance.

What did we just learn? Parameters = memory footprint + capability ceiling. More isn't always better, but it's usually hungrier.

Tokens: How Models Actually Read Text

Models don't see words or characters. They see tokens — chunks of text that average about 4 characters or 0.75 words in English.

"Hello world" becomes two tokens: ["Hello", " world"]. Note the space. "I'm running Linux" might be ["I", "'m", " running", " Linux"] — four tokens.
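The ~4-characters-per-token average gives you a quick back-of-the-envelope estimator. Real BPE tokenizers will disagree on specific strings, so this is a heuristic only:

```python
def estimate_tokens(text):
    # Heuristic: ~4 characters per token in English prose.
    # Use a real tokenizer (e.g. tiktoken) when you need exact counts.
    return max(1, len(text) // 4)

print(estimate_tokens("Hello world"))        # 2
print(estimate_tokens("I'm running Linux"))  # 4
```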

Why does this matter for self-hosting? Token efficiency varies wildly by language and content type:

  • English: ~0.75 words per token
  • Arabic/Chinese: ~0.3-0.5 words per token (worse tokenization = higher costs)
  • Code: Highly variable. Python keywords are often single tokens, but variable names get chunked

Test it yourself:

# Using OpenAI's tiktoken library (pip install tiktoken)
python3 -c "import tiktoken; print(len(tiktoken.get_encoding('cl100k_base').encode('Your text here')))"

# Or with Ollama
ollama run llama3.2 "Count the tokens in this message" --verbose

The good news: when running local models, tokens are free — you pay in compute time, not per-token fees. You only care about token count for paid APIs or context window limits.

What did we just learn? Tokens are the atomic unit of model input/output. English is well-optimized, other languages pay a tax, code varies wildly.

Context Window: The Shared Whiteboard

Every model has a context window — the total budget for input AND output combined. Think of it as shared RAM between you and the model.

System Prompt + Chat History + Your Message + Response = Must fit in context window

Current typical sizes:

  • Llama 3.1/3.2: 128K tokens (~100,000 words)
  • GPT-4o: 128K tokens
  • Claude: 200K tokens

Here's what burns context:

System prompt:     500 tokens
Chat history:    25,000 tokens  
Your message:    1,000 tokens
Model response:   2,500 tokens
─────────────────────────────
Total used:      29,000 tokens
Remaining:       99,000 tokens
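That ledger is plain arithmetic; the same budget as a sketch, assuming a 128K-token window:

```python
CONTEXT_WINDOW = 128_000  # assumed 128K-token window

# Token counts from the example budget above
used = {
    "system prompt": 500,
    "chat history": 25_000,
    "your message": 1_000,
    "model response": 2_500,
}

total = sum(used.values())
print(f"Total used: {total:,} tokens")                   # 29,000
print(f"Remaining:  {CONTEXT_WINDOW - total:,} tokens")  # 99,000
```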

When the context fills up, the oldest messages get dropped. This is why long conversations seem to get "dumber" — the model isn't getting tired, it's literally forgetting the beginning of your conversation.

Input hogs steal from output space. Paste a 50K-token document, and the model can only respond with whatever tokens remain. Plan accordingly.

For most local tasks (classification, tagging, simple queries), you'll use less than 1% of the available context window. But for RAG applications or long conversations, context management becomes critical.

What did we just learn? Context is shared budget. Longer input = shorter possible output. Models forget old messages when context fills up.

The Model Zoo: Every Type That Matters

Not all AI models do the same thing. Here's the complete map of what exists and what each type is actually for:

Text Models (LLMs)

Base/Foundation Models
These are raw next-word prediction engines trained on the entire internet. You never use these directly — they're the foundation that everything else builds on. Think of them as the assembly language of AI.

Chat/Instruct Models
Base models fine-tuned to follow instructions and have conversations. This is what most people mean when they say "AI" — GPT-4o, Claude, Llama 3.2 Instruct. They're trained to be helpful, harmless, and honest. Mostly.

Reasoning Models
The new generation. Instead of immediately blurting out an answer, these models think step-by-step internally before responding. DeepSeek-R1 and OpenAI's o1 are the current leaders.

Regular model: "What's 23 * 47?"
Response: "1081"

Reasoning model: "What's 23 * 47?"
Internal thoughts: [Let me break this down... 23 * 40 = 920, 23 * 7 = 161, so 920 + 161 = 1081]
Response: "1081"

The thinking happens in hidden tokens you never see. Slower but dramatically better at math, logic, and coding.

Code-Specialized Models
LLMs trained specifically on code repositories. CodeLlama, Qwen2.5-Coder, DeepSeek-Coder. They understand syntax, patterns, and common programming idioms better than general models. If you're self-hosting for development work, start here.

Embedding Models
These don't generate text. They convert text into vectors — lists of numbers that represent semantic meaning. Purpose-built, tiny (usually under 1GB), and fast. Examples: nomic-embed-text, all-MiniLM-L6-v2. Essential for search and RAG applications.

Image Models

Diffusion Models
These generate images from text descriptions. The training process is clever: take real images, add random noise, train the model to remove that noise. During generation, start with pure static and let the model gradually remove noise guided by your text prompt.

Stable Diffusion, FLUX, DALL-E, Midjourney — all diffusion models. They need GPUs for practical use. CPU generation takes 5-10 minutes per image.

Vision Models
These understand and describe existing images. Llama 3.2 Vision, GPT-4o. Point them at a screenshot, they'll tell you what they see. Useful for automation and accessibility.

Audio Models

Speech-to-Text (STT)
Whisper is the king here. Converts audio to text with near-human accuracy. Runs fine on CPU, though GPU is faster.

Text-to-Speech (TTS)
ElevenLabs (cloud), Coqui (local). Converts text to natural-sounding speech. Quality has improved dramatically in the past year.

Video & Multimodal

Video Generation: Sora, Runway. Still expensive and limited.
Multimodal Understanding: Gemini can process text + images + audio + video simultaneously. The future of AI interfaces.

What did we just learn? Different models for different jobs. Chat models for conversation, reasoning models for complex problems, embedding models for search, diffusion for images. Pick the right tool.

Vectors and RAG: Making AI Actually Useful

Here's where theory meets practice. Vectors are numerical fingerprints of meaning. Similar concepts get similar numbers.

The classic example:

King - Man + Woman ≈ Queen

But think practically. "Database optimization" and "query performance tuning" would have similar vectors even though they share no words. That's semantic similarity.

Vector databases store these fingerprints: Weaviate, ChromaDB, pgvector (PostgreSQL extension). You search by meaning, not exact text matches.
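"Similar vectors" almost always means cosine similarity — how closely two embedding vectors point in the same direction. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # 1.0 = same direction (same meaning), near 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors — invented for illustration, not real model output.
database_optimization = [0.9, 0.1, 0.3]
query_performance_tuning = [0.8, 0.2, 0.4]
cat_pictures = [0.1, 0.9, 0.0]

print(cosine_similarity(database_optimization, query_performance_tuning))  # high (~0.98)
print(cosine_similarity(database_optimization, cat_pictures))              # low  (~0.21)
```

A vector database is essentially this comparison, indexed so it stays fast across millions of stored vectors.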

RAG: Retrieval-Augmented Generation

RAG is the killer app for local AI. The workflow:

1. Index: Documents → Embedding Model → Vector Database
2. Query: Question → Embedding Model → Similar Chunks
3. Generate: Chunks + Question → LLM → Answer

Critical insight: RAG is application-side logic. The LLM is just the final step — reading relevant chunks and writing an answer. All the intelligence is in finding the right context.

# Simplified RAG workflow
def rag_query(question, embed_model, vector_db, llm, k=5):
    # Step 1: Convert question to vector
    query_vector = embed_model.encode(question)

    # Step 2: Find the k most similar chunks
    relevant_chunks = vector_db.similarity_search(query_vector, k=k)

    # Step 3: Build prompt with retrieved context
    context = "\n\n".join(relevant_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    # Step 4: Generate response
    return llm.generate(prompt)

This is how you build an AI that knows about your infrastructure, your documentation, your codebase. Feed it your wiki, your logs, your config files. Now you have a chatbot that actually knows your environment.

What did we just learn? Vectors capture meaning as numbers. RAG finds relevant context, then lets the LLM read and respond. It's search + generation, not magic.

What Can You Actually Run Locally?

Time for reality. Here's what fits on a typical home server with 28GB RAM and no dedicated GPU:

Comfortable (under 8GB total)

  • Llama 3.2 3B Instruct (~2GB) + nomic-embed-text (~0.3GB)
  • Good for: Classification, simple Q&A, basic coding help, RAG applications
  • Response time: Near-instant on modern CPUs

Doable (10-15GB, one model at a time)

  • Qwen2.5 14B or DeepSeek-R1 14B (4-bit quantized builds, ~9-10GB)
  • Good for: Complex reasoning, better coding, research assistance
  • Response time: a few tokens per second on CPU — usable, but you'll feel the wait

Too Heavy (20GB+)

  • 70B models: Require ~140GB RAM at FP16 (still ~40GB quantized) or multi-GPU setups
  • Great quality, but impractical for most home servers

CPU vs GPU Reality Check

Text generation: CPU is fine. Modern processors handle 7B models at reasonable speeds. No GPU required.

Image generation: GPU or bust. Stable Diffusion needs VRAM. CPU generation exists but takes 5-10 minutes per image. Not practical.

Audio: Whisper runs great on CPU. TTS varies by model but CPU is usually acceptable.

Tools That Actually Work

Ollama: Dead simple model runner. Exposes an HTTP API on port 11434 that you can hit with curl.

# Install and run a model
curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama3.2:3b

Open WebUI: Web interface for Ollama. ChatGPT-like UI for your local models.

PrivateGPT: RAG application ready to go. Point it at your documents.

AnythingLLM: All-in-one workspace for local AI with RAG, embeddings, and document management.

The Honest Verdict: Local AI in 2025

Local AI is finally practical for text. Not "toy project" practical — production practical.

What works great:

  • Document summarization and Q&A
  • Code generation and debugging
  • Classification and data extraction
  • Personal knowledge base (RAG)

What's getting there:

  • Image generation (need GPU, quality improving)
  • Voice synthesis (CPU-friendly options emerging)

What's still cloud-only:

  • Cutting-edge reasoning (o1, Claude Opus)
  • Video generation
  • Real-time multimodal

The math is simple: If your use case needs the absolute best quality, you'll still use cloud APIs. If you need privacy, control, and "good enough" quality, local models deliver.

For most self-hosted applications — chatbots, document processing, automation — a well-configured 7B or 14B model beats cloud APIs on everything except raw intelligence.

Get Started Today

  1. Install Ollama on your server
  2. Pull Llama 3.2 3B for quick tasks: ollama pull llama3.2:3b
  3. Add nomic-embed-text for RAG: ollama pull nomic-embed-text
  4. Install Open WebUI for a friendly interface
  5. Test with your documentation — this is where local AI shines

Start small. Learn the fundamentals. Scale up as you understand what you actually need.

The AI revolution isn't coming. It's here. The question is whether you're running it on someone else's servers or your own.

Compiled by AI. Proofread by caffeine. ☕