You've tried GitHub Copilot. You've used ChatGPT for code review. You've pasted stack traces into Claude.
Then you looked at the bill. Or your company's data policy. Or you tried to use it on a codebase you can't send to someone else's API.
This post is about replacing all of it — locally, privately, with models that are genuinely good at code.
Not a toy setup. A real engineering workflow running on your own hardware.
The Honest State of Local Code AI in 2026
Two years ago, local models for coding were a novelty. You'd run them, get mediocre completions, and go back to Copilot.
That changed.
The gap between local models and cloud models for coding tasks has closed dramatically — especially for the work most engineers actually do day-to-day: understanding existing code, writing tests, explaining what something does, reviewing a diff, generating boilerplate.
For cutting-edge research or massive cross-file reasoning, frontier models still win. For 80% of real engineering work, a well-configured local stack is competitive and sometimes better — because it can see your entire codebase without token limits, never sends proprietary code anywhere, and never stalls on someone else's rate limits.
Here's what that stack looks like.
The Models Worth Running
Not all models are equal at code. The general-purpose rankings don't translate directly to coding performance. Here's what actually works:
For code completion and generation:
- Qwen2.5-Coder 14B — the best local coding model right now for most hardware. Purpose-built for code, not a general model fine-tuned as an afterthought. Handles Python, Go, Rust, JavaScript, shell scripting, and infrastructure code well. Fits comfortably in 12GB VRAM.
- DeepSeek-Coder-V2 — strong for multi-language projects. Particularly good at reasoning about what code should do, not just completing patterns.
- CodeLlama 34B — if you have the VRAM (24GB+), this is the heavy option. Better on larger context, good for explaining legacy codebases.
For general reasoning about code (architecture, reviews, debugging):
- Qwen2.5 32B — general model that's excellent at reasoning. Use this when you're not looking for completions but for a conversation about design decisions.
- Mistral Small 3 — fast and punchy. Good for quick explanations when you don't want to wait for a 30B model to think.
For terminal and shell work:
Smaller is better here. You want fast responses for shell commands, not a model that takes 8 seconds to suggest a find invocation.
- Qwen2.5-Coder 7B — fast, accurate for shell, fits in 8GB VRAM or runs well on CPU.
- Mistral 7B — good fallback if Qwen isn't available.
Hardware requirements vary by model size. Check the hardware section at the end before pulling anything — especially the 32B.
Run everything through Ollama. It's the runtime that manages models the way Docker manages containers.
# Pull the models you'll use
ollama pull qwen2.5-coder:14b
ollama pull qwen2.5:32b
ollama pull qwen2.5-coder:7b
# Check what's running
ollama list
# Test it directly
ollama run qwen2.5-coder:14b "Write a Python function that parses a JWT without a library"
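That last prompt makes a good smoke test. For comparison, here is a hand-written sketch of what a solid answer contains (not actual model output, and note it decodes without verifying the signature):

```python
# Hand-written reference answer for the JWT prompt (not model output).
# Decodes header and payload; does NOT verify the signature.
import base64
import json

def parse_jwt(token: str) -> tuple[dict, dict]:
    header_b64, payload_b64, _signature = token.split(".")

    def b64url_decode(segment: str) -> bytes:
        # JWT segments are unpadded base64url; restore padding before decoding
        return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

    return (json.loads(b64url_decode(header_b64)),
            json.loads(b64url_decode(payload_b64)))
```

If the model's answer skips the padding fix, or claims to verify the signature without knowing the secret, that tells you something about the model.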
Code Completion in Your Editor
This is where most engineers spend their time. You want completions that feel like Copilot — inline suggestions as you type, context-aware, fast enough not to interrupt your flow.
Continue.dev — the open source Copilot replacement
Continue is a VS Code and JetBrains extension that connects to local models. It handles inline completions, a chat sidebar, and slash commands for common tasks.
Install it from the VS Code marketplace, then configure it to point at your Ollama instance.
Create or edit ~/.continue/config.json:
{
"models": [
{
"title": "Qwen2.5-Coder 14B",
"provider": "ollama",
"model": "qwen2.5-coder:14b",
"apiBase": "http://localhost:11434"
},
{
"title": "Qwen2.5 32B (reasoning)",
"provider": "ollama",
"model": "qwen2.5:32b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen2.5-Coder 7B (fast)",
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"apiBase": "http://localhost:11434"
},
"contextProviders": [
{ "name": "code" },
{ "name": "docs" },
{ "name": "diff" },
{ "name": "terminal" },
{ "name": "problems" },
{ "name": "folder" },
{ "name": "codebase" }
],
"slashCommands": [
{ "name": "edit", "description": "Edit selected code" },
{ "name": "comment", "description": "Add comments to code" },
{ "name": "share", "description": "Export conversation" },
{ "name": "cmd", "description": "Generate terminal command" },
{ "name": "commit", "description": "Generate commit message" }
]
}
The split model approach is intentional. Tab completions use the 7B model — fast, low latency, inline. The chat sidebar uses 14B or 32B — you're asking a question and can wait 2 seconds.
What it can do that Copilot can't:
- @codebase — index your entire repo and ask questions across all files. "Where is the authentication middleware?" works on a 50,000-line codebase.
- @diff — feed it your current git diff and ask "what did I just change and does anything look wrong?"
- @terminal — paste terminal output directly into context. "Here's the error from my last command, what's wrong?"
- @docs — point it at documentation URLs; it reads them and answers questions about them.
If you're not on VS Code
Neovim users: ollama.nvim or gen.nvim give you chat-style interaction without leaving the terminal. For completions, cmp-ai hooks into nvim-cmp and calls local models.
Aider — AI Pair Programming in the Terminal
Aider is a terminal tool that does something different from Continue. Instead of inline completions, it takes natural language instructions and makes actual edits to your files — with git integration.
pip install aider-chat
# Run it pointing at your local Ollama
aider --model ollama/qwen2.5-coder:14b --no-auto-commits
Inside an aider session:
> /add src/auth/middleware.py src/auth/models.py
> The JWT validation is not handling expired tokens correctly. Fix it and add a test.
Aider reads the files, reasons about the change, writes the fix, writes the test, and shows you a diff before applying. You approve or reject.
This is the right tool for:
- Refactoring a specific module
- Adding tests to existing code
- Fixing a bug when you know which files are involved
- Making consistent changes across multiple files ("rename this function everywhere it's used")
The --no-auto-commits flag means it stages changes but doesn't commit — you stay in control of your git history.
For bigger models that need more thinking time, add --thinking-tokens 8000 — on models that expose extended reasoning, it gives the model a budget to reason before writing code, which dramatically improves quality on complex changes.
Shell and Terminal Assistance
This is the underrated use case. You spend a lot of time in a terminal. Having AI available there — without switching context to a browser — changes how you work.
Shell-GPT with Ollama backend
Shell-GPT gives you AI in your terminal. Configure it to use Ollama:
# Install with the litellm extra — that's what talks to Ollama
pip install "shell-gpt[litellm]"
# shell-gpt still expects an OpenAI key to be set — any non-empty value works
export OPENAI_API_KEY=ollama
Edit ~/.config/shell_gpt/.sgptrc:
DEFAULT_MODEL=ollama/qwen2.5-coder:7b
API_BASE_URL=http://localhost:11434
USE_LITELLM=true
Now use it:
# Ask a question
sgpt "how do I find all files modified in the last 24 hours, excluding .git"
# Generate and execute a shell command
sgpt --shell "compress this directory but exclude node_modules and .git"
# Output: tar --exclude='./node_modules' --exclude='./.git' -czf archive.tar.gz .
# Execute? [y/n]
# Pipe output into it
journalctl -u nginx --since "1 hour ago" | sgpt "summarize any errors or warnings"
# Explain a command you found (escape the $s so your shell doesn't expand them)
sgpt "explain: awk 'NR==FNR{a[\$1]=\$2;next} \$1 in a{print \$0,a[\$1]}' file1 file2"
The pipe usage is where this gets powerful. Paste in stack traces, log output, strace results, dmesg output — ask what it means. You never leave the terminal and nothing leaves your machine.
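When the log is huge, it pays to trim it before it hits a small model's context window. A minimal sketch; the pattern list here is my starting point, not anything canonical:

```python
# Trim noisy logs to the error/warning lines before piping them to a small
# local model, so its limited context goes to the lines that matter.
import re

def interesting_lines(text: str,
                      patterns=("error", "warn", "fatal", "traceback")) -> str:
    """Keep lines matching any pattern (case-insensitive); drop the rest."""
    rx = re.compile("|".join(patterns), re.IGNORECASE)
    return "\n".join(ln for ln in text.splitlines() if rx.search(ln))

# Wiring it in (illustrative; trim_logs.py is a hypothetical wrapper script):
#   journalctl -u nginx | python3 trim_logs.py | sgpt "summarize the errors"
```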
A simple shell function
Add this to your .bashrc or .zshrc:
# Ask AI directly from the terminal
ai() {
# Build the request JSON with python3 so quotes in your prompt can't break it
curl -s http://localhost:11434/api/generate \
-d "$(python3 -c 'import json,sys; print(json.dumps({"model": "qwen2.5-coder:7b", "prompt": sys.argv[1], "stream": False}))' "$*")" \
| python3 -c "import sys,json; print(json.load(sys.stdin)['response'])"
}
# Usage
ai "write a one-liner to extract all IP addresses from this nginx log file"
ai "what does SIGPIPE mean and when does it happen"
No extra tools, no configuration. Just curl and the python3 you already have talking to Ollama.
Code Review and Diff Analysis
Before pushing a PR, run your diff through the local model.
# Review your staged changes
git diff --staged | ollama run qwen2.5:32b \
"You are a senior engineer doing a code review. Review this diff for: bugs, security issues, missing error handling, and anything that looks wrong. Be specific and direct."
# Review the last commit
git show HEAD | ollama run qwen2.5:32b \
"Review this commit. Point out any problems. Don't summarize what it does — I can read. Focus on what could go wrong."
The 32B model is worth the wait here. You're not asking for a completion — you're asking for judgment. Give it time to think.
Make it a git alias:
git config --global alias.aireview '!git diff --staged | ollama run qwen2.5:32b "Review this diff as a senior engineer. Be specific about bugs and problems."'
# Now use it before every commit
git add -p
git aireview
git commit -m "..."
Understanding Legacy Code
This is the killer use case that no benchmark captures. You inherit a 5,000 line Python file with no comments, written by someone who left the company. You need to understand it.
# Feed a whole file and ask
cat legacy_monolith.py | ollama run qwen2.5:32b \
"Explain what this code does. Focus on the data flow and any non-obvious behavior. What are the side effects?"
# Ask about a specific function
sed -n '250,320p' legacy_monolith.py | ollama run qwen2.5-coder:14b \
"What does this function do? What are the edge cases? What could go wrong?"
In Continue.dev, highlight the function, hit Cmd+L (or Ctrl+L), and type "explain this." It adds the selected code to context automatically.
For very large files, split them first:
# Split a large file into chunks and analyze each
split -l 200 large_file.py chunk_
for f in chunk_*; do
echo "=== $f ==="
cat "$f" | ollama run qwen2.5-coder:14b "Summarize what this section does in 3 sentences."
done
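A fixed line-count split can cut a function in half. If the file at least parses, you can chunk on top-level statement boundaries instead. A sketch using only the standard library:

```python
# Chunk a Python file on top-level statement boundaries with ast, so no
# function or class gets cut mid-body the way a fixed-size split can.
import ast

def chunk_by_toplevel(source: str) -> list[str]:
    """One chunk per top-level statement (def, class, assignment, ...)."""
    lines = source.splitlines(keepends=True)
    chunks = []
    for node in ast.parse(source).body:
        # decorators sit above the def line, so include them in the chunk
        start = min([node.lineno] +
                    [d.lineno for d in getattr(node, "decorator_list", [])])
        chunks.append("".join(lines[start - 1 : node.end_lineno]))
    return chunks
```

Feed each chunk to the model the same way as the shell loop above.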
Test Generation
Writing tests for existing code is tedious. It's one of the best uses of local AI.
cat src/payment/processor.py | ollama run qwen2.5-coder:14b \
"Write pytest unit tests for this code. Cover: happy path, edge cases, error conditions. Use mocks where the code makes external calls. Don't test implementation details — test behavior."
The instruction "don't test implementation details" matters. Without it, models write brittle tests that break on every refactor.
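The distinction is concrete. Here is the difference on a hypothetical slugify function: the behavior test survives refactors, the implementation test does not.

```python
# Hypothetical function under test
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

# Behavior test: pins the contract, survives any refactor that preserves it
def test_slugify_behavior():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  spaced   out  ") == "spaced-out"

# The brittle style to steer the model away from: asserting that slugify
# calls .split() internally would break the moment it switches to a regex.
```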
In Continue.dev, highlight a function and use /edit:
/edit Write comprehensive pytest tests for this function. Mock external dependencies.
It edits a test file in place — or creates one if it doesn't exist.
Generating Commit Messages
You know what the change does. Writing the message is just friction.
# Generate commit message from staged diff
git diff --staged | ollama run qwen2.5-coder:7b \
"Write a git commit message for this diff. Follow conventional commits format. First line max 72 chars. Add a body if the change is complex. Be specific about what changed and why."
As a shell function:
gcommit() {
msg=$(git diff --staged | ollama run qwen2.5-coder:7b \
"Write a git commit message following conventional commits. Max 72 chars first line. Be specific.")
echo "Proposed: $msg"
read -p "Use this? [y/n/e] " choice
case $choice in
y) git commit -m "$msg" ;;
e) git commit -e -m "$msg" ;; # opens editor with message pre-filled
*) echo "Aborted" ;;
esac
}
The Full Stack: What Runs Where
Here's the complete picture of a working setup:
┌─────────────────────────────────────────────────────┐
│ Your Machine │
│ │
│ VS Code + Continue.dev │
│ ├─ Tab completions → Qwen2.5-Coder 7B (fast) │
│ ├─ Chat sidebar → Qwen2.5-Coder 14B (smart) │
│ └─ Architecture → Qwen2.5 32B (deep) │
│ │
│ Terminal │
│ ├─ sgpt → Qwen2.5-Coder 7B │
│ ├─ aider → Qwen2.5-Coder 14B │
│ └─ git aliases → Qwen2.5 32B │
│ │
│ Ollama (localhost:11434) │
│ ├─ qwen2.5-coder:7b (fast completions) │
│ ├─ qwen2.5-coder:14b (code generation) │
│ └─ qwen2.5:32b (reasoning/review) │
└─────────────────────────────────────────────────────┘
One Ollama instance. Multiple models. Multiple tools all talking to the same backend. Each tool picks the right model for the task.
Hardware Reality Check
What you can run and how well:
8GB VRAM (RTX 3060, RTX 4060):
- Qwen2.5-Coder 7B comfortably — good for completions and shell help
- Qwen2.5-Coder 14B at reduced quality (4-bit quantization)
- This is a usable setup. Completions feel fast, code quality is decent.
12GB VRAM (RTX 3060 12GB, RTX 4070):
- Qwen2.5-Coder 14B at full quality — this is the sweet spot
- Solid completions, good code generation, real code review capability
- Best price/performance for a dedicated coding machine.
24GB VRAM (RTX 3090, RTX 4090, A5000):
- Qwen2.5 32B at decent quality — strong reasoning, architecture discussions
- Multiple models loaded simultaneously
- This is where local AI becomes genuinely hard to distinguish from cloud tools for most tasks.
No GPU / CPU only:
- Qwen2.5-Coder 7B at 4-bit quantization is usable for chat — just slow
- Tab completions will be too slow for inline use
- Treat it as a chat assistant, not a completion engine
RAM matters too. When a model doesn't fully fit in VRAM, Ollama spills layers into system RAM, and your editor, containers, and browser need room of their own. 32GB of system RAM is the practical minimum for running large models alongside a development environment.
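The VRAM figures above follow from a back-of-envelope rule you can check yourself: weights take roughly parameters times bytes-per-weight, plus overhead for the KV cache and activations. The 20% overhead figure is my assumption, not a spec:

```python
# Back-of-envelope VRAM estimate: params * bytes-per-weight, plus ~20%
# overhead (assumed) for KV cache and activations. A rule of thumb only.
def vram_gb(params_billion: float, bits_per_weight: int,
            overhead: float = 0.2) -> float:
    return params_billion * (bits_per_weight / 8) * (1 + overhead)

print(round(vram_gb(7, 4), 1))    # 7B at 4-bit: fits an 8GB card
print(round(vram_gb(14, 4), 1))   # 14B at 4-bit: wants 12GB
print(round(vram_gb(32, 4), 1))   # 32B at 4-bit: wants 24GB
```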
What This Doesn't Replace
Be honest about the limits:
Long context across many files — local models handle 32K-128K context depending on the model, which is fine for most tasks. But if you're asking questions that require understanding an entire large monorepo simultaneously, frontier models with million-token contexts still have an edge.
Latest APIs and frameworks — your local model's training data has a cutoff. If you're working with a library that shipped major changes recently, the model might not know. Check docs directly for anything cutting-edge.
Speculative/creative architecture — for "what's the best way to design this system from scratch" conversations involving trade-offs across many dimensions, the largest frontier models still reason better. Local 32B models are good, but not GPT-4-level good on pure reasoning benchmarks.
Everything else — the day-to-day code that makes up most engineering work — local models handle fine.
Honest Verdict
A year ago I'd have said "use local models for privacy-sensitive work and cloud models for everything important."
I don't say that anymore.
The Qwen2.5-Coder line crossed a threshold. For real engineering tasks — understanding existing code, writing tests, reviewing diffs, helping with shell scripting — the 14B model is good enough that I reach for cloud tools less and less.
The workflow takes an afternoon to set up. Continue.dev, Ollama, a few shell functions, one git alias. After that it runs invisibly.
Code never leaves your machine. No API key rotation. No rate limits at 3am when you're debugging something urgent.
If you've been meaning to try local AI for development and keep putting it off — stop putting it off.
Go Try It
Start with just Ollama and the 7B model:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull the fast coding model
ollama pull qwen2.5-coder:7b
# Test it on something real
cat your_most_confusing_file.py | ollama run qwen2.5-coder:7b "Explain what this does"
Then install Continue.dev, point it at http://localhost:11434, and try the tab completions for a day.
That's all it takes to know if this workflow fits how you work.
Compiled by AI. Proofread by caffeine. ☕