Running a Local LLM on a Raspberry Pi 5: What Actually Works
Hands-on results running quantized LLMs on a Raspberry Pi 5. Which model sizes are usable, what tokens/sec to expect, and the accessories you actually need.
The Raspberry Pi 5 is the cheapest realistic on-ramp to local AI. It’s not fast, but for the right job (a tiny always-on agent) it’s perfect, and it sips power. The trick is having honest expectations: knowing what it can do means you’ll be delighted instead of disappointed.
What to expect (the real numbers)
With llama.cpp and a 4-bit 1.7B model, the Pi 5 (8 GB) lands in the low single-digit
tokens/sec range. That’s too slow for a chat UI where you wait on every word, but
perfectly fine for background tasks that classify a message, pick a tool, or extract a
field and move on. Push past ~3B and it drops below one token/sec, usable only for
non-interactive batch jobs you don’t sit and watch.
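To make the chat-versus-background distinction concrete, here is a back-of-envelope sketch. The rates passed in are illustrative assumptions, not measurements from this post; substitute whatever your own board reports.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, decode_tps: float) -> float:
    """Rough wall-clock time: read the prompt, then generate the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Assumed rates for illustration only: 20 tok/s reading the prompt, 3 tok/s generating.
# A short classification label (about 4 tokens) from a 60-token prompt:
print(response_seconds(60, 4, prefill_tps=20, decode_tps=3))
# A 300-token chat answer from the same prompt:
print(response_seconds(60, 300, prefill_tps=20, decode_tps=3))
```

Under those assumed rates the label comes back in a few seconds, while the chat-length answer takes well over a minute.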
Power draw under inference is about 5–8 W, low enough to run 24/7 for a few cents a day at typical electricity prices, which is exactly what makes it a great always-on box.
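The running cost is easy to work out from the wattage. A minimal sketch, assuming an electricity price of 0.15 per kWh (an illustrative figure, not from this post; plug in your own tariff):

```python
def yearly_energy_cost(watts: float, price_per_kwh: float) -> float:
    """Cost of running a constant load 24/7 for a 365-day year."""
    kwh_per_year = watts * 24 * 365 / 1000
    return kwh_per_year * price_per_kwh


# 0.15 per kWh is an assumed price; the wattages are the range quoted above.
for watts in (5, 8):
    print(f"{watts} W -> {yearly_energy_cost(watts, 0.15):.2f} per year")
```

At that assumed price, 5 W works out to about 6.57 per year and 8 W to about 10.51.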
What you can actually build with it
This is where the Pi 5 shines. Real, useful projects that fit a 1–2B model:
- A home-automation router that reads a spoken or typed request and decides which smart-home action to trigger.
- An offline text classifier: tag incoming notes, emails, or sensor messages by category without sending anything to the cloud.
- A tiny RAG assistant over a small personal knowledge base for simple Q&A.
- A learning rig to understand quantization, GGUF, and llama.cpp flags hands-on before you invest in a real GPU.
If you need fluent chat or coding help, this isn’t the tier; see the cheapest-way guide for the step up to a GPU.
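As a rough illustration of the router and classifier ideas above, the sketch below keeps the model behind a plain function so the logic can be tried without a Pi. The labels, the prompt wording, and the `fake_model` stand-in are all hypothetical; a real setup would pass in a function that calls your local model.

```python
from typing import Callable

# Hypothetical labels and descriptions; replace with your own smart-home actions.
ACTIONS = {
    "lights_on": "turn on the lights",
    "lights_off": "turn off the lights",
    "thermostat": "adjust the thermostat",
}


def build_prompt(request: str) -> str:
    """Ask for a single label so the model only has to emit a few tokens."""
    labels = ", ".join(ACTIONS)
    return (f"Pick exactly one label from: {labels}, none.\n"
            f"Request: {request}\nLabel:")


def route(request: str, ask_model: Callable[[str], str]) -> str:
    """Return a known label, or 'none' if the model's reply is not one."""
    reply = ask_model(build_prompt(request)).strip().lower()
    first_word = reply.split()[0].strip(".,'\"") if reply else ""
    return first_word if first_word in ACTIONS else "none"


# Stand-in for a real model call, so the sketch runs anywhere.
def fake_model(prompt: str) -> str:
    return " Lights_Off.\n"


print(route("kill the lights in the hall", fake_model))  # prints: lights_off
```

Validating the reply against a fixed set of labels matters with small models: anything unexpected falls through to "none" instead of triggering an action.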
The board
Raspberry Pi 5 (8GB)
- 8 GB LPDDR4X RAM (shared by CPU and GPU; there is no dedicated VRAM)
- Roughly 12 W peak power draw
- Released 2023
- ~$80 street price
Get the 8 GB model: RAM directly caps the model size you can load, and the extra headroom leaves room for the OS and the context. An active cooler is non-negotiable for sustained inference, and a quality 27 W USB-C power supply prevents brown-outs under load.
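A quick way to sanity-check whether a model will fit is to estimate the size of its weights from the parameter count. This is a rough sketch: the 4.5 bits per weight figure is an assumed average for a 4-bit quant including its scale data, and the result covers weights only.

```python
def approx_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


# 4.5 bits/weight is an assumption; real GGUF files vary by quant type.
for size in (1.5, 3, 8):
    print(f"{size}B -> ~{approx_weights_gb(size, 4.5):.1f} GB of weights")
```

The KV cache, the runtime, and the OS all need memory on top of that figure, so leave a comfortable margin.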
What else you’ll need
- Active cooler (official or a heatsink+fan). Without it the Pi throttles within minutes of sustained inference and your tokens/sec quietly halves.
- Official 27 W USB-C PSU. Under-powering causes random resets exactly when the CPU spikes during generation.
- Fast microSD or, better, an NVMe SSD via the PCIe HAT. Model files are GB-sized and load far quicker from SSD.
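To check whether cooling or power is holding the board back, Raspberry Pi OS provides `vcgencmd get_throttled`, which prints a hex bitmask. The sketch below decodes that mask using the bit meanings from the Raspberry Pi documentation; the sample value is made up for illustration.

```python
# Bit positions documented for `vcgencmd get_throttled`.
FLAGS = {
    0: "under-voltage now",
    1: "ARM frequency capped now",
    2: "throttled now",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(line: str) -> list[str]:
    """Decode a line such as 'throttled=0x50005' into readable flags."""
    value = int(line.strip().split("=", 1)[1], 16)
    return [text for bit, text in FLAGS.items() if value & (1 << bit)]


# Sample value for illustration. On the Pi, pass the real command output, e.g.
#   subprocess.run(["vcgencmd", "get_throttled"], capture_output=True, text=True).stdout
print(decode_throttled("throttled=0x50005"))
```

An empty list (from `throttled=0x0`) means neither under-voltage nor throttling has been recorded since boot.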
Setup in three steps
The quickest path is Ollama, which runs on 64-bit Raspberry Pi OS and skips all the compiling. Flash Raspberry Pi OS (64-bit) with Raspberry Pi Imager (Ollama ships 64-bit ARM builds only, and a 32-bit userland also caps each process at roughly 3 GB of memory), boot, then:
# one-line install, then pull and run a tiny model
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2:1b
That downloads a model of roughly 1 GB (the exact size depends on the quantization behind the tag) and drops you into a prompt. Try qwen2.5:1.5b
too, and stay at the 1–2B sizes; anything larger crawls.
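To see what your own board manages, Ollama's HTTP API includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in its non-streaming responses. A minimal sketch of turning those into tokens/sec; the sample values are made up to show the arithmetic and are not a measurement:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a non-streaming Ollama API response."""
    # eval_count: tokens generated; eval_duration: time spent, in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Made-up sample: 48 tokens generated in 16 seconds.
sample = {"eval_count": 48, "eval_duration": 16_000_000_000}
print(tokens_per_second(sample))  # prints: 3.0
```

For a quick look without any code, `ollama run` also accepts a `--verbose` flag that prints timing statistics after each reply.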
Or build llama.cpp from source (more control, more speed)
If you want to tune every flag or squeeze the most tokens/sec, compile
llama.cpp directly:
sudo apt update && sudo apt install -y build-essential cmake git
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build -j4
# then run a 4-bit GGUF you downloaded:
./build/bin/llama-cli -m qwen2-1_5b-q4_k_m.gguf -p "Classify: 'reset the lights'" -n 64
Tips to get the most out of it
- Keep the model at 1–2B and Q4. It’s the only combination that stays usable.
- Use short prompts and short outputs. Every token costs time on the Pi: reading the prompt is limited by CPU compute and generating output mostly by memory bandwidth, so fewer tokens = a snappier feel.
- Run it headless as a service and call it over your network, so the Pi does one job well in the background.
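Putting the last two tips together, here is a sketch of a client on another machine calling the Pi over the network with a short prompt and a capped output. It assumes Ollama is running on the Pi and listening on the network (by default it binds to localhost only; setting `OLLAMA_HOST=0.0.0.0` for the service is one way to change that). The `raspberrypi.local` hostname and the prompt wording are assumptions; adjust them to your setup.

```python
import json
import urllib.request


def build_request(prompt: str, host: str = "http://raspberrypi.local:11434",
                  model: str = "llama3.2:1b",
                  max_tokens: int = 16) -> urllib.request.Request:
    """Build a non-streaming call to Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of chunks
        "options": {"num_predict": max_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"].strip()


# Usage (needs a reachable Ollama server):
#   print(ask("Classify as lights, heating or other: 'reset the lights'\nLabel:"))
```

Capping `num_predict` keeps a misbehaving small model from rambling, which on this hardware is the difference between a few seconds and a minute.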
We test this exact setup on a real board and will publish the measured tokens/sec here once the runs are captured: real numbers, no estimates.
Frequently asked questions
Can a Raspberry Pi 5 run a local LLM?
Yes. With llama.cpp and a 4-bit 1–2B model it runs at a few tokens/sec, fine for routing, classification, and small always-on agents, but too slow for a chat interface.
What size model can a Raspberry Pi 5 run?
Comfortably 1–2B quantized models on the 8 GB board. Anything above ~3B becomes painfully slow, so keep models small.
Do I need a cooler to run an LLM on a Raspberry Pi 5?
Yes. Sustained inference will thermal-throttle the bare board, so an active cooler is effectively non-negotiable for steady performance.
What is the easiest way to run an LLM on a Raspberry Pi 5?
Install Ollama with the one-line script on 64-bit Raspberry Pi OS, then run 'ollama run llama3.2:1b'. It downloads the model and starts a chat prompt with no compiling required.