Running Local LLMs for AI Agents: A Practical Guide
Cloud APIs are great, but they come with costs, latency, and privacy concerns. For many agent tasks, a local LLM is faster, cheaper, and good enough. Here’s how to set one up.
The Stack: Ollama + LiteLLM
The simplest production-ready local setup is:
- Ollama — Runs models locally with GPU acceleration
- LiteLLM — OpenAI-compatible proxy that routes between local and cloud models
This gives you a single API endpoint that can serve local models for cheap tasks and fall back to Claude or GPT-4 for complex ones.
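Before wiring up LiteLLM, it's worth confirming Ollama is reachable. Here's a minimal sketch, assuming Ollama is running on its default port (11434); it lists whatever models you've already pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def list_local_models() -> list[str]:
    """Return the names of models Ollama has available locally."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    models = list_local_models()
    print("Ollama is up. Local models:", ", ".join(models) or "(none pulled yet)")
```

If this fails to connect, start the server with `ollama serve` before going any further.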
Choosing Models
Not all models are equal for agent work. Agents need reliable tool-use and instruction following. Here’s what works in 2026:
| Use Case | Model | Why |
|---|---|---|
| Code + General | Qwen3-30B-A3B | MoE architecture, fast, strong reasoning |
| Complex Reasoning | DeepSeek-R1-32B | Best open-source reasoning model |
| Fast Automation | Qwen3-8B | Small, quick, great for JSON/workflows |
| Code-specific | Qwen2.5-Coder-32B | Purpose-built for code tasks |
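Whichever model you pick, run a quick smoke test of structured output before trusting it in an agent loop. Here's a sketch against Ollama's chat API, assuming you've pulled `qwen3:8b`; the `format: "json"` option constrains the model to valid JSON:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434"

def smoke_test_json(model: str = "qwen3:8b") -> dict:
    """Ask the model for a small JSON object and verify it parses."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "user",
                 "content": "Return a JSON object with keys 'task' and 'priority' "
                            "for: review the open pull requests."}
            ],
            "format": "json",   # constrain output to valid JSON
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    return json.loads(content)  # raises if the model broke the contract

print(smoke_test_json())
```

If a model can't reliably pass this kind of test, it will struggle with tool calls no matter how good its benchmarks look.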
Setting Up LiteLLM Routing
The power of LiteLLM is routing: define your local and cloud models behind one OpenAI-compatible endpoint, then pick the right one per request:
```yaml
model_list:
  - model_name: local-code
    litellm_params:
      model: ollama/qwen2.5-coder:32b
      api_base: http://localhost:11434
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
```
Simple tasks go local (free). Complex tasks go to Claude (paid, but better).
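On the client side, everything goes through the proxy, so switching between the two is just a model name. Here's a minimal sketch, assuming the proxy is running locally on LiteLLM's default port 4000 and that you decide what counts as "complex" yourself (the boolean flag here is a stand-in for whatever heuristic you use):

```python
from openai import OpenAI

# Any string works as the key unless you've configured a LiteLLM master key.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-local")

def complete(prompt: str, complex_task: bool = False) -> str:
    """Route simple prompts to the local model, complex ones to Claude."""
    model = "claude-sonnet" if complex_task else "local-code"
    resp = client.chat.completions.create(
        model=model,  # matches a model_name entry in the LiteLLM config
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("Write a Python one-liner that reverses a string."))
```

Because the proxy speaks the OpenAI API, any agent framework that accepts a custom `base_url` can use this setup unchanged.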
Performance Tips
Three settings that matter for Ollama. Under the hood they're llama.cpp options; Ollama exposes the first two as request options and the third as a server environment variable rather than CLI flags:
- use_mmap: false (llama.cpp --no-mmap) — Required for stable GPU inference
- num_gpu: 999 (llama.cpp -ngl 999) — Offload all layers to GPU (2x speedup)
- OLLAMA_FLASH_ATTENTION=1 (llama.cpp --flash-attn) — Required for long context windows
With these settings, expect 70-85 tokens/sec on modern hardware for 30B parameter models.
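The per-request options go in the `options` field of an Ollama API call; flash attention is enabled server-side before starting `ollama serve`. A sketch, assuming `qwen2.5-coder:32b` is already pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": "qwen2.5-coder:32b",
        "prompt": "Write a function that merges two sorted lists.",
        "stream": False,
        "options": {
            "num_gpu": 999,     # offload all layers to the GPU (-ngl 999)
            "use_mmap": False,  # disable memory mapping (--no-mmap)
        },
        # Flash attention is a server setting, not a request option:
        #   OLLAMA_FLASH_ATTENTION=1 ollama serve
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```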
Next Steps
In the next article, we’ll build an actual agent that uses this local stack to automate a real workflow.