Running a Local AI with Ollama on Your Homelab
Set up Ollama on Docker to run large language models locally (Llama 3, Mistral, Gemma, and more), with a web UI for chatting and an API for integrations.
Ollama lets you run large language models on your own hardware. You get a ChatGPT-like experience — ask questions, get code help, summarize documents — with the model running locally. No API key, no per-token costs, no data leaving your network.
The catch is hardware. Small models (7B parameters) run acceptably on a mini PC with 16GB RAM. Larger models need more RAM, and GPU acceleration makes everything dramatically faster. This guide covers setup on a typical homelab server and is honest about what to expect from CPU-only inference.
Hardware Reality Check
CPU-only (no GPU), 16GB RAM: You can run 7B models (Llama 3.1 8B, Mistral 7B). Response speed is slow. Expect 3-8 tokens per second depending on your CPU. Usable for summarization and code help where you can wait a few seconds per sentence. Not great for interactive conversation.
CPU-only, 32GB RAM: Can load 13B models. Still slow. The RAM lets you load bigger models; the CPU is still the bottleneck.
Dedicated GPU (NVIDIA with 8GB+ VRAM): 7B models run at 40-60+ tokens per second. That’s fast enough to feel like a real conversation. A used RTX 3060 (12GB VRAM) handles 13B models well. If you’re considering adding a GPU specifically for local AI, the GPU buyer’s guide for self-hosted AI covers the used 3060, 3090, P40, and Intel Arc with real VRAM and power numbers.
AMD integrated graphics (Radeon 780M in mini PCs like the Minisforum UM790): Partial GPU offloading works with ROCm, but setup is complex and performance is inconsistent. Easier to just use CPU for these.
If you have an Intel Arc or NVIDIA GPU, set up GPU acceleration. If you’re on a CPU-only setup, Ollama still works. Just don’t expect ChatGPT-level response speed.
Install Ollama
Two options: native install on the host, or Docker.
Native install (recommended for GPU support):
curl -fsSL https://ollama.com/install.sh | sh
This installs the Ollama service, which starts automatically and listens on port 11434. Native install is simpler for GPU passthrough because Docker GPU support requires the NVIDIA Container Toolkit for NVIDIA, or additional setup for AMD.
Docker (CPU-only):
mkdir -p /opt/ollama
Create /opt/ollama/docker-compose.yml:
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
volumes:
- /opt/ollama/models:/root/.ollama
ports:
- "11434:11434"
restart: unless-stopped
cd /opt/ollama
docker compose up -d
Docker with NVIDIA GPU:
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
volumes:
- /opt/ollama/models:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
Requires NVIDIA Container Toolkit installed on the host first.
Pull Your First Model
ollama pull llama3.2
This downloads the Llama 3.2 3B model (~2GB). Good starting point: small enough to load quickly, capable enough to be useful.
Other models worth trying:
ollama pull mistral # Mistral 7B — solid general-purpose model
ollama pull gemma2:9b # Google Gemma 2 9B — good at code
ollama pull phi3:mini # Microsoft Phi-3 Mini — very fast, small
ollama pull llama3.1:8b # Llama 3.1 8B — Meta's current 8B model
Model files are stored in /root/.ollama/models (or the volume mount path). Each 7B model is 4–5GB. Plan your storage accordingly.
Chat From the Terminal
ollama run llama3.2
This opens an interactive chat session. Type your message, press Enter. Type /bye to exit.
Useful for quick checks:
ollama run llama3.2 "Explain Docker networking modes in two paragraphs"
Add a Web Interface (Open WebUI)
The terminal is functional but not great for regular use. Open WebUI is a polished chat interface that connects to Ollama and gives you a proper ChatGPT-like UI.
Add it to your compose file, or create separately at /opt/open-webui/docker-compose.yml:
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
volumes:
- /opt/open-webui/data:/app/backend/data
ports:
- "3001:8080"
environment:
- OLLAMA_BASE_URL=http://YOUR_SERVER_IP:11434
restart: always
Replace YOUR_SERVER_IP with your actual server IP. If Ollama is running in Docker on the same server, use the Docker bridge IP (usually 172.17.0.1) or the container name if they’re on the same network.
cd /opt/open-webui
docker compose up -d
Access Open WebUI at http://YOUR_SERVER_IP:3001. Create an account (the first account created becomes admin). Select your model from the dropdown and start chatting.
Practical Uses
Code explanation and review: paste a function and ask what it does. Works well even on slow hardware since you’re not waiting for a back-and-forth conversation.
Summarization: paste a long article or documentation section, ask for a summary. CPU inference is fine here because you type once and wait once.
Document Q&A: Open WebUI supports document uploads. Upload a PDF and ask questions about it. Useful for technical documentation you need to search through.
Local API for other services: Ollama’s API is compatible with OpenAI’s API format. Services that support an OpenAI API endpoint can often point at your local Ollama instead. Linkwarden’s AI tagging feature works this way.
Model Size vs. Quality vs. Speed
| Model | Size | RAM Needed | Speed (CPU) | Notes |
|---|---|---|---|---|
| Phi-3 Mini | 2.3GB | 4GB | Fast | Good for simple tasks |
| Llama 3.2 3B | 2GB | 4GB | Fast | Good starting point |
| Llama 3.1 8B | 4.7GB | 8GB | Moderate | Strong general purpose |
| Mistral 7B | 4.1GB | 8GB | Moderate | Good at instruction following |
| Gemma 2 9B | 5.5GB | 10GB | Slow | Excellent code model |
For a 16GB RAM server running other services, Llama 3.1 8B is about the largest model you can load while keeping the rest of the stack running.
Keeping Models Updated
Models don’t auto-update. When a new version of a model releases:
ollama pull llama3.1:8b # pulls the latest version
Ollama keeps the old version until you remove it:
ollama rm llama3.1:8b:old-version-tag
# or list all models
ollama list
Access via Tailscale
If you want to use Open WebUI from other devices on your tailnet:
The WebUI is accessible at http://YOUR_TAILSCALE_IP:3001 from any device in your network. No configuration changes needed. Tailscale handles the connectivity.
For the Ollama API itself (port 11434), be aware that the default configuration accepts connections from any IP. If you’re exposing your homelab over Tailscale, that’s fine. If you’re exposing ports publicly, restrict the Ollama port to localhost or your LAN only.