Running a Local AI with Ollama on Your Homelab

Ollama lets you run large language models on your own hardware. You get a ChatGPT-like experience (ask questions, get code help, summarize documents) with the model running locally. No API key, no per-token costs, no data leaving your network.

The catch is hardware. Small models (7B parameters) run acceptably on a mini PC with 16GB RAM. Larger models need more RAM, and GPU acceleration makes everything dramatically faster. This guide covers setup on a typical homelab server and is honest about what to expect from CPU-only inference.

Hardware Reality Check

CPU-only (no GPU), 16GB RAM: You can run 7B models (Llama 3.1 8B, Mistral 7B). Response speed is slow. Expect 3-8 tokens per second depending on your CPU. Usable for summarization and code help where you can wait a few seconds per sentence. Not great for interactive conversation.

CPU-only, 32GB RAM: Can load 13B models. Still slow. The RAM lets you load bigger models; the CPU is still the bottleneck.

Dedicated GPU (NVIDIA with 8GB+ VRAM): 7B models run at 40-60+ tokens per second. That’s fast enough to feel like a real conversation. A used RTX 3060 (12GB VRAM) handles 13B models well. If you’re considering adding a GPU specifically for local AI, the GPU buyer’s guide for self-hosted AI covers the used 3060, 3090, P40, and Intel Arc with real VRAM and power numbers.

AMD integrated graphics (Radeon 780M in mini PCs like the Minisforum UM790): Partial GPU offloading works with ROCm, but setup is complex and performance is inconsistent. Easier to just use CPU for these.

If you have an Intel Arc or NVIDIA GPU, set up GPU acceleration. If you’re on a CPU-only setup, Ollama still works. Just don’t expect ChatGPT-level response speed.

Install Ollama

Two options: native install on the host, or Docker.

Native install (recommended for GPU support):

curl -fsSL https://ollama.com/install.sh | sh

This installs the Ollama service, which starts automatically and listens on port 11434. Native install is simpler for GPU passthrough because Docker GPU support requires the NVIDIA Container Toolkit for NVIDIA, or additional setup for AMD.

Docker (CPU-only):

mkdir -p /opt/ollama

Create /opt/ollama/docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - /opt/ollama/models:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

cd /opt/ollama
docker compose up -d

Docker with NVIDIA GPU:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - /opt/ollama/models:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

Requires NVIDIA Container Toolkit installed on the host first.

Pull Your First Model

ollama pull llama3.2

This downloads the Llama 3.2 3B model (~2GB). Good starting point: small enough to load quickly, capable enough to be useful.

Other models worth trying:

ollama pull mistral          # Mistral 7B (solid general-purpose model)
ollama pull gemma2:9b        # Google Gemma 2 9B (good at code)
ollama pull phi3:mini        # Microsoft Phi-3 Mini (very fast, small)
ollama pull llama3.1:8b      # Llama 3.1 8B (Meta's current 8B model)

Model files are stored in /root/.ollama/models (or the volume mount path). Each 7B model is 4-5GB. Plan your storage accordingly.

Chat From the Terminal

ollama run llama3.2

This opens an interactive chat session. Type your message, press Enter. Type /bye to exit.

Useful for quick checks:

ollama run llama3.2 "Explain Docker networking modes in two paragraphs"

Add a Web Interface (Open WebUI)

The terminal is functional but not great for regular use. Open WebUI is a polished chat interface that connects to Ollama and gives you a proper ChatGPT-like UI.

Add it to your compose file, or create separately at /opt/open-webui/docker-compose.yml:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - /opt/open-webui/data:/app/backend/data
    ports:
      - "3001:8080"
    environment:
      - OLLAMA_BASE_URL=http://YOUR_SERVER_IP:11434
    restart: always

Replace YOUR_SERVER_IP with your actual server IP. If Ollama is running in Docker on the same server, use the Docker bridge IP (usually 172.17.0.1) or the container name if they’re on the same network.

cd /opt/open-webui
docker compose up -d

Access Open WebUI at http://YOUR_SERVER_IP:3001. Create an account (the first account created becomes admin). Select your model from the dropdown and start chatting.

Practical Uses

Code explanation and review: paste a function and ask what it does. Works well even on slow hardware since you’re not waiting for a back-and-forth conversation.

Summarization: paste a long article or documentation section, ask for a summary. CPU inference is fine here because you type once and wait once.

Document Q&A: Open WebUI supports document uploads. Upload a PDF and ask questions about it. Useful for technical documentation you need to search through.

Local API for other services: Ollama’s API is compatible with OpenAI’s API format. Services that support an OpenAI API endpoint can often point at your local Ollama instead. Linkwarden’s AI tagging feature works this way.

Model Size vs. Quality vs. Speed

Model	Size	RAM Needed	Speed (CPU)	Notes
Phi-3 Mini	2.3GB	4GB	Fast	Good for simple tasks
Llama 3.2 3B	2GB	4GB	Fast	Good starting point
Llama 3.1 8B	4.7GB	8GB	Moderate	Strong general purpose
Mistral 7B	4.1GB	8GB	Moderate	Good at instruction following
Gemma 2 9B	5.5GB	10GB	Slow	Excellent code model

For a 16GB RAM server running other services, Llama 3.1 8B is about the largest model you can load while keeping the rest of the stack running.

Keeping Models Updated

Models don’t auto-update. When a new version of a model releases:

ollama pull llama3.1:8b   # pulls the latest version

Ollama keeps the old version until you remove it:

ollama rm llama3.1:8b:old-version-tag
# or list all models
ollama list

Access via Tailscale

If you want to use Open WebUI from other devices on your tailnet:

The WebUI is accessible at http://YOUR_TAILSCALE_IP:3001 from any device in your network. No configuration changes needed. Tailscale handles the connectivity.

For the Ollama API itself (port 11434), be aware that the default configuration accepts connections from any IP. If you’re exposing your homelab over Tailscale, that’s fine. If you’re exposing ports publicly, restrict the Ollama port to localhost or your LAN only.

If you’re still deciding on hardware or want a more complete walkthrough that covers mini PC selection alongside the Ollama and Open WebUI install, the self-hosted LLM guide covers that full arc end to end.