← All Guides
intermediate

Running a Local AI with Ollama on Your Homelab

Set up Ollama on Docker to run large language models locally (Llama 3, Mistral, Gemma, and more), with a web UI for chatting and an API for integrations.

Budget Homelab ·
dockeraihomelab

Ollama lets you run large language models on your own hardware. You get a ChatGPT-like experience — ask questions, get code help, summarize documents — with the model running locally. No API key, no per-token costs, no data leaving your network.

The catch is hardware. Small models (7B parameters) run acceptably on a mini PC with 16GB RAM. Larger models need more RAM, and GPU acceleration makes everything dramatically faster. This guide covers setup on a typical homelab server and is honest about what to expect from CPU-only inference.

Hardware Reality Check

CPU-only (no GPU), 16GB RAM: You can run 7B models (Llama 3.1 8B, Mistral 7B). Response speed is slow. Expect 3-8 tokens per second depending on your CPU. Usable for summarization and code help where you can wait a few seconds per sentence. Not great for interactive conversation.

CPU-only, 32GB RAM: Can load 13B models. Still slow. The RAM lets you load bigger models; the CPU is still the bottleneck.

Dedicated GPU (NVIDIA with 8GB+ VRAM): 7B models run at 40-60+ tokens per second. That’s fast enough to feel like a real conversation. A used RTX 3060 (12GB VRAM) handles 13B models well. If you’re considering adding a GPU specifically for local AI, the GPU buyer’s guide for self-hosted AI covers the used 3060, 3090, P40, and Intel Arc with real VRAM and power numbers.

AMD integrated graphics (Radeon 780M in mini PCs like the Minisforum UM790): Partial GPU offloading works with ROCm, but setup is complex and performance is inconsistent. Easier to just use CPU for these.

If you have an Intel Arc or NVIDIA GPU, set up GPU acceleration. If you’re on a CPU-only setup, Ollama still works. Just don’t expect ChatGPT-level response speed.

Install Ollama

Two options: native install on the host, or Docker.

Native install (recommended for GPU support):

curl -fsSL https://ollama.com/install.sh | sh

This installs the Ollama service, which starts automatically and listens on port 11434. Native install is simpler for GPU passthrough because Docker GPU support requires the NVIDIA Container Toolkit for NVIDIA, or additional setup for AMD.

Docker (CPU-only):

mkdir -p /opt/ollama

Create /opt/ollama/docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - /opt/ollama/models:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
cd /opt/ollama
docker compose up -d

Docker with NVIDIA GPU:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - /opt/ollama/models:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

Requires NVIDIA Container Toolkit installed on the host first.

Pull Your First Model

ollama pull llama3.2

This downloads the Llama 3.2 3B model (~2GB). Good starting point: small enough to load quickly, capable enough to be useful.

Other models worth trying:

ollama pull mistral          # Mistral 7B — solid general-purpose model
ollama pull gemma2:9b        # Google Gemma 2 9B — good at code
ollama pull phi3:mini        # Microsoft Phi-3 Mini — very fast, small
ollama pull llama3.1:8b      # Llama 3.1 8B — Meta's current 8B model

Model files are stored in /root/.ollama/models (or the volume mount path). Each 7B model is 4–5GB. Plan your storage accordingly.

Chat From the Terminal

ollama run llama3.2

This opens an interactive chat session. Type your message, press Enter. Type /bye to exit.

Useful for quick checks:

ollama run llama3.2 "Explain Docker networking modes in two paragraphs"

Add a Web Interface (Open WebUI)

The terminal is functional but not great for regular use. Open WebUI is a polished chat interface that connects to Ollama and gives you a proper ChatGPT-like UI.

Add it to your compose file, or create separately at /opt/open-webui/docker-compose.yml:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - /opt/open-webui/data:/app/backend/data
    ports:
      - "3001:8080"
    environment:
      - OLLAMA_BASE_URL=http://YOUR_SERVER_IP:11434
    restart: always

Replace YOUR_SERVER_IP with your actual server IP. If Ollama is running in Docker on the same server, use the Docker bridge IP (usually 172.17.0.1) or the container name if they’re on the same network.

cd /opt/open-webui
docker compose up -d

Access Open WebUI at http://YOUR_SERVER_IP:3001. Create an account (the first account created becomes admin). Select your model from the dropdown and start chatting.

Practical Uses

Code explanation and review: paste a function and ask what it does. Works well even on slow hardware since you’re not waiting for a back-and-forth conversation.

Summarization: paste a long article or documentation section, ask for a summary. CPU inference is fine here because you type once and wait once.

Document Q&A: Open WebUI supports document uploads. Upload a PDF and ask questions about it. Useful for technical documentation you need to search through.

Local API for other services: Ollama’s API is compatible with OpenAI’s API format. Services that support an OpenAI API endpoint can often point at your local Ollama instead. Linkwarden’s AI tagging feature works this way.

Model Size vs. Quality vs. Speed

ModelSizeRAM NeededSpeed (CPU)Notes
Phi-3 Mini2.3GB4GBFastGood for simple tasks
Llama 3.2 3B2GB4GBFastGood starting point
Llama 3.1 8B4.7GB8GBModerateStrong general purpose
Mistral 7B4.1GB8GBModerateGood at instruction following
Gemma 2 9B5.5GB10GBSlowExcellent code model

For a 16GB RAM server running other services, Llama 3.1 8B is about the largest model you can load while keeping the rest of the stack running.

Keeping Models Updated

Models don’t auto-update. When a new version of a model releases:

ollama pull llama3.1:8b   # pulls the latest version

Ollama keeps the old version until you remove it:

ollama rm llama3.1:8b:old-version-tag
# or list all models
ollama list

Access via Tailscale

If you want to use Open WebUI from other devices on your tailnet:

The WebUI is accessible at http://YOUR_TAILSCALE_IP:3001 from any device in your network. No configuration changes needed. Tailscale handles the connectivity.

For the Ollama API itself (port 11434), be aware that the default configuration accepts connections from any IP. If you’re exposing your homelab over Tailscale, that’s fine. If you’re exposing ports publicly, restrict the Ollama port to localhost or your LAN only.