Do I need a dedicated GPU to run a local LLM?

No. Ollama runs on CPU only, and modern mini PCs with 16-32GB RAM can run 7B and 13B models at usable speeds. A GPU (NVIDIA RTX 3060 12GB or better) dramatically improves response speed, but it's optional. It's a good upgrade path once you decide you want more performance.

How much does it cost per month to self-host an LLM?

The main ongoing cost is electricity. A mini PC running Ollama draws 15-30W at idle, roughly $1-3/month in power. Compare that to ChatGPT Plus or Claude Pro at $20/month each. You break even on hardware within a few months if you were going to subscribe anyway.

What is the best model to start with?

Llama 3.1 8B is a strong starting point. It requires 8GB RAM and runs at acceptable speed on a modern mini PC CPU. If you have 32GB RAM, try a quantized 13B model for noticeably better quality. For code specifically, Qwen2.5-Coder:7b is excellent.

Can I use this as a drop-in replacement for ChatGPT or Claude?

For many everyday tasks, yes. Code help, summarization, document Q&A, and general questions all work well with a 7B or 13B model. Frontier commercial models are still more capable for complex reasoning. But for private, offline, no-subscription use, a self-hosted setup covers most daily AI tasks.

Can I access my self-hosted LLM from my phone or from outside my home?

Yes. Tailscale makes this straightforward. Once your server is on your tailnet, Open WebUI is accessible from any device running Tailscale. No port forwarding, no public exposure required.

How to Self-Host an LLM at Home in 2026 (Ollama + Open WebUI on a Mini PC)

This post contains affiliate links. If you buy hardware through these links, I earn a small commission at no extra cost to you.

ChatGPT Plus costs $20/month. Claude Pro costs $20/month. Both send your prompts to someone else’s servers.

A decent mini PC runs 7B models locally, costs $180-320 depending on specs, and draws about as much power as a dim light bulb. If you were going to pay for an AI subscription anyway, you break even on hardware in a few months, and every conversation stays on your network.

This guide covers the whole setup: picking the right hardware, installing Ollama, choosing models that fit your specs, and adding Open WebUI for a proper chat interface. If you’ve never run a local LLM before, this is the place to start.

For hardware context outside the LLM-specific picks below, the best mini PCs for a homelab in 2026 covers the broader landscape. If you’re deciding what else to self-host, see replacing Google One and other subscriptions with self-hosted tools.

Hardware You Need

Ollama runs on CPU without any GPU. What determines which models you can run is RAM. Models load entirely into RAM during inference.

RAM	What you can run	Approx. response speed
8GB	3B models (Phi-3 Mini, Llama 3.2 3B)	Fast on modern CPU
16GB	7B-8B models (Llama 3.1 8B, Mistral 7B)	3-8 tokens/sec
32GB	13B models, or 7B with room for other services	Same speed, bigger models
32GB + GPU (12GB VRAM)	13B fully GPU-loaded	40-60+ tokens/sec

“3-8 tokens per second” means roughly one word every 0.3-0.6 seconds. Slow for real-time conversation, workable for “generate this code block” or “summarize this document.” Most people find it usable and consider the privacy tradeoff worth it.

Mini PC Picks

For CPU inference, you want a fast multi-core processor and as much RAM as the board supports. AMD Ryzen 7000/8000 series mini PCs are the current sweet spot.

Budget tier (~$180-220):

The Beelink EQ12 with Intel N100 runs 3B models adequately. The N100 has fast efficiency cores, but single-core throughput limits it with 7B models. Fine for light use.

Mid-tier (~$250-320, recommended starting point):

The Beelink SER8 (Ryzen 9 8945HS) and MinisForum UM790 Pro (Ryzen 9 7940HS) are the machines I’d recommend for anyone serious about local LLMs. The Ryzen cores are meaningfully faster for inference than Intel N-series machines. Both also carry AMD Radeon 780M integrated graphics, which supports partial GPU offloading. It doesn’t transform CPU performance, but it helps with smaller models.

Both ship with 16GB DDR5. Upgrade to 32GB with a 32GB DDR5 SO-DIMM kit (~$45-60) if you want to run 13B models or keep RAM free for other homelab services.

Storage: Models eat disk space. Llama 3.1 8B alone is about 5GB. If you want several models installed, plan for at least a 1TB NVMe drive. Most mini PCs have a spare M.2 slot so you can add storage without replacing the built-in drive.

GPU upgrade path (~$150-300 used): A used NVIDIA RTX 3060 12GB changes the experience significantly: 13B models run fully GPU-accelerated at 40-60+ tokens per second. A used RTX 3090 24GB handles 70B quantized models, which puts you at the same capability tier as top commercial offerings. This path requires a desktop, not a mini PC, but it’s worth knowing the ceiling.

Install Ollama

Ollama is the engine. It loads models, serves them over a local API, and runs as a background service.

Linux (one-line install):

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama as a systemd service, starts it immediately, and sets it to start on boot. It listens on port 11434.

Verify it’s running:

systemctl status ollama

You should see active (running).

macOS: Download the native app from ollama.com. It runs in the menu bar and sets up the same local API.

Pull Your First Model

Models are separate downloads from Ollama itself. Start with:

ollama pull llama3.1:8b

About 4.7GB download. On a typical home connection, 5-10 minutes.

Test it immediately:

ollama run llama3.1:8b "What's the capital of France?"

If you get a coherent response, Ollama is working. Type /bye to exit interactive mode.

Other models worth trying once the basics are confirmed:

ollama pull mistral:7b          # Strong at following instructions
ollama pull qwen2.5-coder:7b    # Best code model in this size class
ollama pull gemma2:9b           # Google Gemma 2 (good reasoning)
ollama pull phi3:mini           # Microsoft Phi-3 Mini (very fast, capable for its size)

Model size reference

Model	Download size	RAM needed	Best for
phi3:mini	2.2GB	4GB	Fast responses, simple tasks
llama3.2:3b	2.0GB	4GB	Quick answers, low-RAM systems
llama3.1:8b	4.7GB	8GB	General purpose, strong starting point
mistral:7b	4.1GB	8GB	Instruction following, Q&A
qwen2.5-coder:7b	4.7GB	8GB	Code generation and review
gemma2:9b	5.5GB	10GB	Reasoning tasks

For a 16GB mini PC running other homelab services alongside it, llama3.1:8b is the practical ceiling. With 32GB, you can load a 13B model and still have RAM left over.

Set Up Open WebUI

The terminal interface works but isn’t what most people want for daily use. Open WebUI provides a ChatGPT-like interface with conversation history, model switching, and document uploads.

It runs in Docker and connects to your Ollama instance. If you don’t have Docker installed, the Docker Compose basics guide covers installation.

Create a directory:

mkdir -p /opt/open-webui

Create /opt/open-webui/docker-compose.yml:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - /opt/open-webui/data:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://YOUR_SERVER_IP:11434
    restart: unless-stopped

Replace YOUR_SERVER_IP with your server’s LAN IP (e.g., 192.168.1.50). If Open WebUI and Ollama are on the same machine, you can use the Docker bridge IP 172.17.0.1 instead.

Start it:

cd /opt/open-webui
docker compose up -d

Navigate to http://YOUR_SERVER_IP:3000. Create an account: the first account created gets admin privileges. Select your model from the top dropdown and start a conversation.

What You Can Actually Do With This

Code help: Paste a function and ask what it does, how to improve it, or to write a test. CPU inference speed is fine here because the interaction is write-once, wait-once rather than back-and-forth conversation.

Summarization: Paste a long article, meeting notes, or documentation section. Ask for a bullet summary. 7B models handle this well.

Document Q&A: Open WebUI supports file uploads. Upload a PDF and ask questions about its contents. I use this for technical documentation I need to search without reading everything.

Private use cases: A local LLM you control is valuable for anything you’d never type into a commercial service: personal decisions, health questions, sensitive work topics. It’s not connected to anything outside your network.

API access for other apps: Ollama exposes an OpenAI-compatible API at http://YOUR_IP:11434/v1. Applications that support a custom OpenAI endpoint can often point at your local Ollama instead of the real OpenAI API.

GPU Acceleration

CPU inference is usable. GPU inference is a different tier of experience.

For NVIDIA GPU passthrough in Docker, install the NVIDIA Container Toolkit on the host first, then update your Ollama deployment:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - /opt/ollama/models:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

Ollama detects and uses the GPU automatically after restart. A 7B model that generates text at 5 tokens/sec on CPU will run at 50+ tokens/sec on a GPU, fast enough to feel like a real conversation.

VRAM is the hard limit for GPU inference. Models must fit in VRAM to run GPU-accelerated; overflow spills to CPU and slows everything down.

GPU VRAM	Models that fit fully	Typical speed
8GB	Up to 7B models	30-50 tokens/sec
12GB (RTX 3060)	Up to 13B models	40-60 tokens/sec
24GB (RTX 3090)	34B models; 70B Q4 quantized	60-80 tokens/sec

If you’re already running a homelab desktop with a capable GPU sitting idle, enabling GPU passthrough for Ollama costs nothing. For help picking the right GPU, the GPU buyer’s guide for self-hosted AI compares the used 3060, 3090, P40, and Intel Arc on real VRAM, power draw, and price-per-token.

Remote Access

Open WebUI is accessible from any device on your local network at http://YOUR_IP:3000. For access from outside your home, Tailscale is the right path.

If Tailscale isn’t already on your server, the Tailscale homelab setup guide walks through it. Once installed, Open WebUI is accessible at http://YOUR_TAILSCALE_IP:3000 from any device on your tailnet. No port forwarding, no public exposure, no SSL required for this use case.

Maintenance

Models don’t auto-update. When a new version releases, pull it manually:

ollama pull llama3.1:8b    # pulls the latest version
ollama list                # see what's installed + storage used
ollama rm gemma2:9b        # remove a model you're not using

Update Open WebUI when new versions drop:

cd /opt/open-webui
docker compose pull
docker compose up -d

Conversation history is stored in the data volume and persists through updates.

A self-hosted 7B or 13B model won’t replace frontier AI for complex reasoning tasks. But for code help, summarization, private Q&A, and most everyday AI assistance, a $250-320 mini PC running Ollama competes well with a $20/month subscription, and everything stays on your hardware.