How to Self-Host an LLM at Home in 2026 (Ollama + Open WebUI on a Mini PC)
Run a private, offline ChatGPT alternative on your own hardware for $0/month in ongoing subscription costs. No GPU required. Full setup from hardware selection through first conversation.
This post contains affiliate links. If you buy hardware through these links, I earn a small commission at no extra cost to you.
ChatGPT Plus costs $20/month. Claude Pro costs $20/month. Both send your prompts to someone else’s servers.
A decent mini PC runs 7B models locally, costs $180-320 depending on specs, and draws about as much power as a dim light bulb. If you were going to pay for an AI subscription anyway, you break even on hardware in a few months — and every conversation stays on your network.
This guide covers the whole setup: picking the right hardware, installing Ollama, choosing models that fit your specs, and adding Open WebUI for a proper chat interface. If you’ve never run a local LLM before, this is the place to start.
For hardware context outside the LLM-specific picks below, the best mini PCs for a homelab in 2026 covers the broader landscape. If you’re deciding what else to self-host, see replacing Google One and other subscriptions with self-hosted tools.
Hardware You Need
Ollama runs on CPU without any GPU. What determines which models you can run is RAM — models load entirely into RAM during inference.
| RAM | What you can run | Approx. response speed |
|---|---|---|
| 8GB | 3B models (Phi-3 Mini, Llama 3.2 3B) | Fast on modern CPU |
| 16GB | 7B-8B models (Llama 3.1 8B, Mistral 7B) | 3-8 tokens/sec |
| 32GB | 13B models, or 7B with room for other services | Same speed, bigger models |
| 32GB + GPU (12GB VRAM) | 13B fully GPU-loaded | 40-60+ tokens/sec |
“3-8 tokens per second” means roughly one word every 0.3-0.6 seconds. Slow for real-time conversation, workable for “generate this code block” or “summarize this document.” Most people find it usable and consider the privacy tradeoff worth it.
Mini PC Picks
For CPU inference, you want a fast multi-core processor and as much RAM as the board supports. AMD Ryzen 7000/8000 series mini PCs are the current sweet spot.
Budget tier (~$180-220):
The Beelink EQ12 with Intel N100 runs 3B models adequately. The N100 has fast efficiency cores, but single-core throughput limits it with 7B models. Fine for light use.
Mid-tier (~$250-320, recommended starting point):
The Beelink SER8 (Ryzen 9 8945HS) and MinisForum UM790 Pro (Ryzen 9 7940HS) are the machines I’d recommend for anyone serious about local LLMs. The Ryzen cores are meaningfully faster for inference than Intel N-series machines. Both also carry AMD Radeon 780M integrated graphics, which supports partial GPU offloading — it doesn’t transform CPU performance, but it helps with smaller models.
Both ship with 16GB DDR5. Upgrade to 32GB with a 32GB DDR5 SO-DIMM kit (~$45-60) if you want to run 13B models or keep RAM free for other homelab services.
Storage: Models eat disk space. Llama 3.1 8B alone is about 5GB. If you want several models installed, plan for at least a 1TB NVMe drive. Most mini PCs have a spare M.2 slot so you can add storage without replacing the built-in drive.
GPU upgrade path (~$150-300 used): A used NVIDIA RTX 3060 12GB changes the experience significantly — 13B models run fully GPU-accelerated at 40-60+ tokens per second. A used RTX 3090 24GB handles 70B quantized models, which puts you at the same capability tier as top commercial offerings. This path requires a desktop, not a mini PC, but it’s worth knowing the ceiling.
Install Ollama
Ollama is the engine. It loads models, serves them over a local API, and runs as a background service.
Linux (one-line install):
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama as a systemd service, starts it immediately, and sets it to start on boot. It listens on port 11434.
Verify it’s running:
systemctl status ollama
You should see active (running).
macOS: Download the native app from ollama.com. It runs in the menu bar and sets up the same local API.
Pull Your First Model
Models are separate downloads from Ollama itself. Start with:
ollama pull llama3.1:8b
About 4.7GB download. On a typical home connection, 5-10 minutes.
Test it immediately:
ollama run llama3.1:8b "What's the capital of France?"
If you get a coherent response, Ollama is working. Type /bye to exit interactive mode.
Other models worth trying once the basics are confirmed:
ollama pull mistral:7b # Strong at following instructions
ollama pull qwen2.5-coder:7b # Best code model in this size class
ollama pull gemma2:9b # Google Gemma 2 -- good reasoning
ollama pull phi3:mini # Microsoft Phi-3 Mini -- very fast, capable for its size
Model size reference
| Model | Download size | RAM needed | Best for |
|---|---|---|---|
| phi3:mini | 2.2GB | 4GB | Fast responses, simple tasks |
| llama3.2:3b | 2.0GB | 4GB | Quick answers, low-RAM systems |
| llama3.1:8b | 4.7GB | 8GB | General purpose, strong starting point |
| mistral:7b | 4.1GB | 8GB | Instruction following, Q&A |
| qwen2.5-coder:7b | 4.7GB | 8GB | Code generation and review |
| gemma2:9b | 5.5GB | 10GB | Reasoning tasks |
For a 16GB mini PC running other homelab services alongside it, llama3.1:8b is the practical ceiling. With 32GB, you can load a 13B model and still have RAM left over.
Set Up Open WebUI
The terminal interface works but isn’t what most people want for daily use. Open WebUI provides a ChatGPT-like interface with conversation history, model switching, and document uploads.
It runs in Docker and connects to your Ollama instance. If you don’t have Docker installed, the Docker Compose basics guide covers installation.
Create a directory:
mkdir -p /opt/open-webui
Create /opt/open-webui/docker-compose.yml:
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
volumes:
- /opt/open-webui/data:/app/backend/data
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://YOUR_SERVER_IP:11434
restart: unless-stopped
Replace YOUR_SERVER_IP with your server’s LAN IP (e.g., 192.168.1.50). If Open WebUI and Ollama are on the same machine, you can use the Docker bridge IP 172.17.0.1 instead.
Start it:
cd /opt/open-webui
docker compose up -d
Navigate to http://YOUR_SERVER_IP:3000. Create an account — the first account created gets admin privileges. Select your model from the top dropdown and start a conversation.
What You Can Actually Do With This
Code help: Paste a function and ask what it does, how to improve it, or to write a test. CPU inference speed is fine here because the interaction is write-once, wait-once rather than back-and-forth conversation.
Summarization: Paste a long article, meeting notes, or documentation section. Ask for a bullet summary. 7B models handle this well.
Document Q&A: Open WebUI supports file uploads. Upload a PDF and ask questions about its contents. I use this for technical documentation I need to search without reading everything.
Private use cases: A local LLM you control is valuable for anything you’d never type into a commercial service — personal decisions, health questions, sensitive work topics. It’s not connected to anything outside your network.
API access for other apps: Ollama exposes an OpenAI-compatible API at http://YOUR_IP:11434/v1. Applications that support a custom OpenAI endpoint can often point at your local Ollama instead of the real OpenAI API.
GPU Acceleration
CPU inference is usable. GPU inference is a different tier of experience.
For NVIDIA GPU passthrough in Docker, install the NVIDIA Container Toolkit on the host first, then update your Ollama deployment:
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
volumes:
- /opt/ollama/models:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
Ollama detects and uses the GPU automatically after restart. A 7B model that generates text at 5 tokens/sec on CPU will run at 50+ tokens/sec on a GPU — fast enough to feel like a real conversation.
VRAM is the hard limit for GPU inference. Models must fit in VRAM to run GPU-accelerated; overflow spills to CPU and slows everything down.
| GPU VRAM | Models that fit fully | Typical speed |
|---|---|---|
| 8GB | Up to 7B models | 30-50 tokens/sec |
| 12GB (RTX 3060) | Up to 13B models | 40-60 tokens/sec |
| 24GB (RTX 3090) | 34B models; 70B Q4 quantized | 60-80 tokens/sec |
If you’re already running a homelab desktop with a capable GPU sitting idle, enabling GPU passthrough for Ollama costs nothing.
Remote Access
Open WebUI is accessible from any device on your local network at http://YOUR_IP:3000. For access from outside your home, Tailscale is the right path.
If Tailscale isn’t already on your server, the Tailscale homelab setup guide walks through it. Once installed, Open WebUI is accessible at http://YOUR_TAILSCALE_IP:3000 from any device on your tailnet — no port forwarding, no public exposure, no SSL required for this use case.
Maintenance
Models don’t auto-update. When a new version releases, pull it manually:
ollama pull llama3.1:8b # pulls the latest version
ollama list # see what's installed + storage used
ollama rm gemma2:9b # remove a model you're not using
Update Open WebUI when new versions drop:
cd /opt/open-webui
docker compose pull
docker compose up -d
Conversation history is stored in the data volume and persists through updates.
A self-hosted 7B or 13B model won’t replace frontier AI for complex reasoning tasks. But for code help, summarization, private Q&A, and most everyday AI assistance, a $250-320 mini PC running Ollama competes well with a $20/month subscription — and everything stays on your hardware.