Best GPU for a Self-Hosted AI Homelab (Used 3060 vs 3090 vs P40 vs Arc)
A practical buyer's guide to GPU selection for local LLMs: the used 3060 12GB, 3090, P40, and Intel Arc compared on VRAM, noise, power draw, and real inference speed.
This post contains affiliate links. If you buy through them, I earn a small commission at no extra cost to you.
- NVIDIA RTX 3060 12GB (~$280-320 new / ~$180-230 used): Best GPU for a budget local AI homelab. 12GB GDDR6, full CUDA support, flash attention, runs 13B models fully GPU-loaded at 40-60 tokens/sec. Runs cool, quiet, and pulls ~170W at load.
- Intel Arc A770 16GB (~$320-349): Best new-GPU value if you want 16GB VRAM at under $350. IPEX-LLM gives competitive inference speeds with Ollama. Some setup friction on Linux but well-documented now.
- NVIDIA RTX 3090 24GB (~$400-500 used): Best VRAM-per-dollar in the used market. 24GB GDDR6X, handles 33B quantized models fully loaded. Real trade-off: 350W TDP, runs hot, needs good cooling and an 850W+ PSU.
- Seasonic Focus GX-850 PSU (~$130): Solid 80+ Gold 850W unit. Enough headroom for a 3090 build. Fully modular, quiet fan.
- Crucial P3 Plus 2TB NVMe (~$90): Budget model storage. Fits in the spare M.2 slot most mid-tower boards have. Enough room for 8-10 large models with space left over.
The existing self-host LLM guide on this site deliberately punts on GPU selection. The headline on that guide says “no GPU required,” and that’s true. CPU inference on a modern Ryzen mini PC gets you a working local AI setup with usable response speeds on 7B models.
But once you’ve run a model on CPU for a week, you notice the ceiling. Seven tokens per second is workable. Fifty tokens per second is a different experience. The gap between “usable” and “fast” in local LLM work is almost entirely VRAM.
This is the guide for people who have decided to add a GPU and want to spend their money well. The options in 2026 are better than they’ve ever been, but the trade-offs aren’t obvious from spec sheets.
The One Variable That Controls Everything: VRAM
Before comparing GPUs, it helps to understand why VRAM is the constraint.
When you run a model in Ollama, it tries to load the entire model into VRAM. If your GPU has 12GB VRAM and your model weighs 9GB, the remaining 3GB handles the context window and activations. If your model is larger than your VRAM, Ollama keeps the overflow layers in system RAM, and inference speed drops sharply because the memory bandwidth gap between GDDR6 and system RAM is roughly 10x.
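To see why spilling even a few gigabytes hurts so much, here is a rough back-of-the-envelope sketch. It assumes token generation is limited purely by memory bandwidth (every weight is read once per token), and the 360 GB/s and 36 GB/s figures are illustrative placeholders chosen to match the "roughly 10x" gap, not measurements of any specific machine. Real speeds will be lower once compute and overhead are counted.

```python
def estimated_tokens_per_second(model_gb, vram_available_gb,
                                gpu_bw_gbs=360.0, ram_bw_gbs=36.0):
    """Bandwidth-only upper bound on decode speed.

    Assumption: each generated token reads every weight exactly once,
    so time per token = bytes in VRAM / GPU bandwidth
                      + bytes in system RAM / RAM bandwidth.
    The default bandwidths are illustrative, not benchmarks.
    """
    in_vram_gb = min(model_gb, vram_available_gb)
    spilled_gb = model_gb - in_vram_gb
    seconds_per_token = in_vram_gb / gpu_bw_gbs + spilled_gb / ram_bw_gbs
    return 1.0 / seconds_per_token


# A 9GB model that fits entirely in VRAM...
fully_loaded = estimated_tokens_per_second(9.0, vram_available_gb=12.0)
# ...versus the same model with 3GB of layers spilled to system RAM.
partly_spilled = estimated_tokens_per_second(9.0, vram_available_gb=6.0)

print(f"fully in VRAM: ~{fully_loaded:.0f} t/s")
print(f"3GB spilled:   ~{partly_spilled:.0f} t/s")
```

Under these assumptions, moving one third of the model to system RAM cuts the estimate from about 40 t/s to about 10 t/s, because the slow third dominates the time per token.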
Rule of thumb for quantized GGUF models (what Ollama uses by default):
| Model size | Quantization | VRAM needed | Fits in |
|---|---|---|---|
| 7B-8B | Q4_K_M | ~5-6GB | 8GB+ cards |
| 13B | Q4_K_M | ~8-9GB | 10GB+ cards |
| 33B | Q4_K_M | ~20-22GB | 24GB cards only |
| 70B | Q4_K_M | ~40-45GB | Multi-GPU or CPU fallback |
This table is why a 12GB card is the right starting point. It handles 13B models fully in VRAM, which is the practical ceiling for single-GPU quality on most homelab hardware.
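The table's numbers can be approximated with simple arithmetic. This sketch assumes Q4_K_M averages about 4.85 bits per weight and reserves a flat 1.5GB for context and runtime overhead. Both figures are my working assumptions rather than anything Ollama reports, and real overhead grows with context length.

```python
# Assumption: Q4_K_M averages roughly 4.85 bits per weight.
BITS_PER_WEIGHT_Q4_K_M = 4.85


def weights_gb(params_billion, bits_per_weight=BITS_PER_WEIGHT_Q4_K_M):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, vram_gb, overhead_gb=1.5):
    """True if weights plus an assumed flat overhead fit in the given VRAM."""
    return weights_gb(params_billion) + overhead_gb <= vram_gb


for size in (7, 13, 33, 70):
    print(f"{size}B: ~{weights_gb(size):.1f}GB weights, "
          f"fits 12GB: {fits_in_vram(size, 12)}, "
          f"fits 24GB: {fits_in_vram(size, 24)}")
```

The estimate lands inside each range in the table: 13B fits a 12GB card but not an 8GB one, 33B needs 24GB, and 70B does not fit on any single card covered here.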
The Picks
Best All-Around: NVIDIA RTX 3060 12GB
- VRAM: 12GB GDDR6
- TDP: 170W (typical load)
- Idle draw: 15-20W
- Noise: Quiet to moderate depending on card design
- Flash attention: Yes
- Price: ~$280-320 new, ~$180-230 used
The 3060 12GB is the right GPU for most people building a local AI homelab on a budget. The VRAM sweet spot is the reason. Nvidia also sold an 8GB variant of the 3060, and it is worse for this use case. The 12GB version is the one you want.
At this VRAM size, you can load a 13B Q4_K_M model fully in VRAM with room for a reasonable context window. Inference speed lands around 40-60 tokens per second depending on the model and host CPU — which is fast enough that it feels responsive rather than slow.
The 3060 also supports flash attention, which matters for larger context windows. The P40 (covered below) does not.
For new cards, the RTX 3060 12GB is available through Amazon from AIB partners. Used cards on eBay typically run $180-230, and the 3060 12GB was widely sold, so supply is healthy.
One practical note: the 3060 12GB uses a standard dual-slot PCIe x16 form factor and runs cool enough on typical AIB cooler designs that you can put it in a mid-tower with reasonable airflow and not think about it again.
Best Value on New Hardware: Intel Arc A770 16GB
- VRAM: 16GB GDDR6
- TDP: 225W
- Idle draw: 25-30W
- Noise: Moderate
- Flash attention: Yes (with IPEX-LLM)
- Price: ~$320-349 new
Intel’s Arc A770 deserves more attention than it gets in this conversation. At $320-349 new, it offers 16GB GDDR6 — more VRAM than any Nvidia card at this price point. The extra 4GB over a 3060 12GB means you get more headroom for context length on 13B models, and some quantized 20B models load fully in VRAM when they’d need to spill layers on a 12GB card.
The caveat is Linux tooling. The Arc A770 requires Intel’s IPEX-LLM stack for accelerated inference rather than the standard CUDA path. As of early 2026 this is well-documented and Ollama supports it, but setup is more involved than dropping in an Nvidia card and running ollama serve. Budget an extra hour for driver configuration on a fresh Linux install.
If you’re already on Windows or don’t mind following a setup guide once, the Arc A770 16GB is the best new-hardware value in this category.
Best VRAM in the Used Market: NVIDIA RTX 3090 24GB
- VRAM: 24GB GDDR6X
- TDP: 350W
- Idle draw: ~25-30W
- Noise: Loud under load on most reference designs
- Flash attention: Yes
- Price: ~$400-500 used
The 3090 is the card you want if 13B models aren’t enough and you want to push into 33B territory. A 33B Q4_K_M model needs about 20-22GB of VRAM to load fully, and the 3090’s 24GB covers it with room to spare. At that size, you’re running models that are genuinely competitive with older GPT-3.5 performance on many benchmarks — qualitatively different from 7B or 13B.
The honest trade-off list is long.
Heat. The 3090 dumps roughly 350W of heat at load. In a mid-tower with three case fans, that’s manageable. In a small form factor case or a poorly ventilated office, it will raise room temperature noticeably.
Noise. Most 3090 reference designs have aggressive fan curves under load. The AIB triple-fan versions (ASUS ROG Strix, MSI Gaming X Trio) are quieter but large — many won’t fit in mid-tower cases without measurement checks.
PSU requirements. A 3090 needs at minimum a quality 750W PSU. 850W gives you headroom and is the number I’d recommend. The Seasonic Focus GX-850 is the unit I’d buy.
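Used market reliability. The 3090 was the card that ETH miners ran hard through 2021-2022. Not every used unit comes from gaming use. Buy from sellers with return windows, avoid listings that don’t mention gaming or rendering as the use case, and run a sustained load test inside your return window. Generating continuously for 20+ minutes with ollama run works, but pick a model large enough to fill most of the 24GB (a 33B Q4_K_M model, for example). A small model like llama3.1:8b only exercises a fraction of the VRAM.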
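While that burn-in runs, it helps to log what the card is doing. This is a minimal sketch that polls nvidia-smi using its standard query fields. The function names and the dictionary keys are my own, and it assumes a single Nvidia GPU with the driver installed.

```python
import subprocess

# Standard nvidia-smi query fields: VRAM use, temperature, and power draw.
QUERY_FIELDS = "memory.used,memory.total,temperature.gpu,power.draw"


def parse_smi_line(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    used, total, temp, power = [field.strip() for field in line.split(",")]
    return {
        "vram_used_mib": int(used),
        "vram_total_mib": int(total),
        "temp_c": int(temp),
        "power_w": float(power),
    }


def read_gpu_stats():
    """Run nvidia-smi once and return one stats dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_smi_line(line)
            for line in result.stdout.splitlines() if line.strip()]


def vram_fill_fraction(stats):
    """Fraction of VRAM in use, to confirm the test model fills the card."""
    return stats["vram_used_mib"] / stats["vram_total_mib"]
```

Calling read_gpu_stats() every few seconds during the test and checking that vram_fill_fraction stays high gives a simple record to look back on if the card crashes or throttles partway through.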
For a homelab where the machine is in a basement or a closet and noise doesn’t matter, the 3090 is the best VRAM-per-dollar in the used market as of mid-2026. If it’s in your office or living room, think hard before buying.
The Budget Card That Gets Overlooked: NVIDIA RTX 3060 Ti 8GB
- VRAM: 8GB GDDR6
- TDP: 200W
- Idle draw: ~15W
- Flash attention: Yes
- Price: ~$250-300 new, ~$130-180 used
The 3060 Ti has better raw compute than the 3060 12GB (more CUDA cores, higher memory bandwidth) but only two-thirds of the VRAM: 8GB versus 12GB. On 7B models it’s faster. On 13B models it spills layers to system RAM.
If you know you’ll stay on 7B models (Llama 3.1 8B, Mistral 7B, Qwen2.5:7B), the 3060 Ti at $130-180 used is a reasonable buy. If you think you might want 13B quality at some point, the 3060 12GB’s VRAM advantage matters more than the compute difference.
Don’t buy the 3060 Ti if you’re planning to run 13B. Buy it if you’re budget-constrained and comfortable staying on 7B.
The Misleading Value Play: NVIDIA P40 24GB
- VRAM: 24GB GDDR5
- TDP: 250W
- Idle draw: ~40-60W (it doesn’t downclock like consumer cards)
- Noise: Loud (it’s a blower fan designed for rack mount servers)
- Flash attention: No
- Price: ~$150-250 used
The P40 is the GPU that shows up constantly in “budget 24GB AI build” YouTube videos. The pitch is true: you can get 24GB of VRAM for $150-250, which sounds like a deal compared to a $400 3090.
The reality is messier.
No flash attention. The P40 is a Pascal-generation card. Flash attention requires hardware capabilities that Pascal doesn’t have. For large context windows (filling a 16K+ context), this is a meaningful performance penalty and in some model configurations causes inference to fail entirely.
High idle power. The P40 lacks the aggressive idle power management of consumer GeForce cards. It idles at 40-60W versus 15-20W for a 3060, a gap of roughly 25-40W. Over a year of 24/7 operation, that’s an extra $35-55 in electricity at US average rates. Over two years it erodes most of the purchase price advantage.
Server blower noise. The P40 was designed for rack servers with high-airflow chassis. In a typical desktop case, the blower fan is audible from across the room under load. You can swap to an aftermarket cooler (the P40 community has guides for this), but it’s a project.
No display output. The P40 has no video output. You need a second GPU or integrated graphics to drive a monitor. In a headless homelab server this isn’t a problem. In a desktop you use for other things, it’s inconvenient.
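The idle-power cost above is easy to recompute for your own electricity rate. This sketch assumes $0.16 per kWh as a stand-in for a US average (substitute your own rate) and takes the extra idle draw as the gap between the P40 and a 3060: about 25W at the low end (40W vs 15W) and 40W at the high end (60W vs 20W).

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost_usd(extra_watts, usd_per_kwh=0.16):
    """Yearly cost of a constant extra load running 24/7.

    usd_per_kwh defaults to an assumed US-average-like rate.
    """
    kwh_per_year = extra_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


low = annual_cost_usd(25)   # P40 at 40W idle vs 3060 at 15W
high = annual_cost_usd(40)  # P40 at 60W idle vs 3060 at 20W
print(f"extra idle cost: ${low:.0f}-${high:.0f} per year")
print(f"over two years:  ${2 * low:.0f}-${2 * high:.0f}")
```

At that rate the result is roughly $35-56 per year, so two years of idle draw adds $70-112 on top of a card that cost $150-250.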
The P40 is a reasonable choice for someone building a dedicated headless inference server, comfortable with Linux, who already has a separate display-capable GPU, and who wants to run 33B models on a tight budget. It’s not the right default recommendation for most people starting out.
If you want 24GB and the P40’s limitations are acceptable: buy it from a seller who specifies data center pull (not miner pull), apply aftermarket cooling, and run it headless. If any of those constraints don’t fit your situation, spend the extra money on a used 3090.
Decision Matrix
| GPU | VRAM | Typical speed (13B unless noted) | Best for | Avoid if |
|---|---|---|---|---|
| RTX 3060 12GB | 12GB | 40-60 t/s | Most homelab builds | You need 33B models |
| Arc A770 16GB | 16GB | 35-55 t/s | New-card buyers, Windows | You want simplest Linux setup |
| RTX 3090 24GB | 24GB | 60-80 t/s | 33B models, desktop servers | Noise is a problem, tight PSU budget |
| RTX 3060 Ti 8GB | 8GB | 50-70 t/s (7B) | Budget 7B-only builds | You want 13B quality |
| P40 24GB | 24GB | 20-35 t/s | Dedicated headless server | Consumer desktop, quiet homelab |
The three factors that should drive your decision, in order:
- VRAM first. What model size do you want to run? 13B needs 12GB+ with headroom. 33B needs 24GB. Start here.
- Power budget. A 3090 at 350W is fine in a basement. It’s not fine as a permanent load on a 15A circuit with other machines on it.
- Noise tolerance. The P40 is genuinely loud. The 3090 reference design is loud at load. The 3060 and Arc A770 are quiet to moderate.
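The first and third factors can be written down as a small filter. This is only a sketch of the matrix above, not a recommendation engine: the VRAM thresholds come from the list, the "loud" flags follow the noise notes in this guide, and treating the 3060 Ti as not loud is my assumption since its noise isn't rated here.

```python
# VRAM from the decision matrix; "loud" follows the noise notes in this guide.
CARDS = [
    {"name": "RTX 3060 12GB", "vram_gb": 12, "loud": False},
    {"name": "Arc A770 16GB", "vram_gb": 16, "loud": False},
    {"name": "RTX 3090 24GB", "vram_gb": 24, "loud": True},
    {"name": "RTX 3060 Ti 8GB", "vram_gb": 8, "loud": False},  # assumption
    {"name": "P40 24GB", "vram_gb": 24, "loud": True},
]

# Minimum VRAM per model size, per the "VRAM first" rule.
MIN_VRAM_GB = {"7B": 8, "13B": 12, "33B": 24}


def shortlist(model_size, quiet_required=False):
    """Return the cards with enough VRAM, optionally dropping loud ones."""
    needed = MIN_VRAM_GB[model_size]
    return [
        card["name"]
        for card in CARDS
        if card["vram_gb"] >= needed
        and not (quiet_required and card["loud"])
    ]


print(shortlist("13B", quiet_required=True))
print(shortlist("33B"))
```

Asking for 33B with quiet_required=True returns an empty list, which is the honest answer: none of the 24GB options in this guide is quiet.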
What You Actually Need Beyond the GPU
A GPU alone isn’t a complete build. These are the parts that matter and get underspecified.
PSU: Budget at least 750W for a 3060 or Arc build. For a 3090, step up to 850W. Don’t cheap out on the PSU: the 3090 can spike over 400W for brief moments under load, and an underpowered PSU will hard-reset your system. The Seasonic Focus GX-850 is reliable.
Model storage: Models pile up. Llama 3.1 8B is 4.7GB, Qwen2.5:14B is 9GB, a 33B Q4_K_M model is ~20GB. A 2TB NVMe like the Crucial P3 Plus in a spare M.2 slot gives you room for 8-10 large models without managing storage constantly.
Case airflow: More important with a 3090 than a 3060, but worth checking on any build. Three case fans (front intake, rear exhaust) is the baseline. More if you’re running a 3090 in an enclosed rack.
PCIe x16 slot: Any modern desktop motherboard has one. If you’re adding a GPU to an existing homelab server (ThinkCentre, etc.), verify the PCIe slot is x16 electrical. Some small form factor machines have x16 mechanical slots that are only x4 electrical — fine for a NIC, not ideal for a GPU.
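On Linux you can check the negotiated link width after installing the card. This sketch reads the kernel's sysfs attributes current_link_width and max_link_width for display-class PCI devices. The function name is my own, and it assumes a standard sysfs layout under /sys/bus/pci/devices.

```python
from pathlib import Path


def gpu_link_widths(pci_root="/sys/bus/pci/devices"):
    """Map each display-class PCI device to (current, max) link width.

    PCI class codes starting with 0x03 are display controllers.
    A result like (4, 16) means an x16-capable card negotiated only x4.
    """
    widths = {}
    for device in sorted(Path(pci_root).iterdir()):
        class_file = device / "class"
        if not class_file.is_file():
            continue
        if not class_file.read_text().strip().startswith("0x03"):
            continue
        try:
            current = int((device / "current_link_width").read_text().strip())
            maximum = int((device / "max_link_width").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable on this device
        widths[device.name] = (current, maximum)
    return widths
```

Calling gpu_link_widths() on the host and seeing a current width below the maximum is the sign of an x16 mechanical slot wired for fewer lanes.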
How This Connects to the Software Side
On the software side, little of this changes the setup. Ollama installation from scratch is the same process regardless of GPU. With an Nvidia card, Ollama detects CUDA automatically on first run and uses it. With an Intel Arc card, the IPEX-LLM setup described above has to be in place first.
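The one Ollama setting worth knowing is num_gpu, the number of model layers offloaded to the GPU. It is a model parameter rather than an environment variable: set it with PARAMETER num_gpu in a Modelfile, with /set parameter num_gpu in an interactive ollama run session, or in the options of an API request. By default Ollama offloads as many layers as fit, which is usually correct. If you’re explicitly testing CPU vs GPU inference, this is the knob.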
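For a CPU-versus-GPU comparison, the layer count can be passed per request as the num_gpu option through Ollama's HTTP API. This is a minimal sketch assuming Ollama is listening on its default local port. The helper names are mine, and the speed calculation relies on the eval_count and eval_duration (nanoseconds) fields in the non-streaming response.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_gpu=None):
    """Build a non-streaming generate payload; num_gpu=0 forces CPU-only."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_gpu is not None:
        payload["options"] = {"num_gpu": num_gpu}
    return payload


def measured_tokens_per_second(response):
    """Generation speed from the response's eval_count and eval_duration."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model, prompt, num_gpu=None):
    """Send one request to the local Ollama server and return the JSON reply."""
    body = json.dumps(build_request(model, prompt, num_gpu)).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())
```

Running generate("llama3.1:8b", "Explain PCIe lanes.", num_gpu=0) and then the same call without num_gpu, and passing each reply to measured_tokens_per_second, gives a like-for-like number for your own hardware.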
Once Ollama is running with GPU support, model selection is where the real decisions happen. The Ollama setup guide covers that in detail.
The Short Answer
For most people building a budget local AI homelab in 2026: buy a used RTX 3060 12GB for $180-230. It handles 13B models fully in VRAM, runs quiet, pulls reasonable power, and has the full modern CUDA feature set including flash attention. The used market is healthy and the card is well-understood.
If you want 16GB on new hardware, the Arc A770 16GB at $320-349 is the play. It takes more setup but the VRAM advantage over the 3060 12GB is real.
If you want 33B capability and can handle the noise and power draw: used 3090 for $400-500. Just budget for the PSU and verify the card isn’t a miner pull.
The P40 is a fine card for a dedicated headless server if you’re comfortable with its limitations. It’s not the right starting point for most people.