Running Local LLMs for AI Agents: A Practical Guide
Cloud APIs are great, but they come with costs, latency, and privacy concerns. For many agent tasks, a local LLM is faster, cheaper, and good enough. Here’s how to set one up.
The Stack: Ollama + LiteLLM
The simplest production-ready local setup is:
- Ollama — Runs models locally with GPU acceleration
- LiteLLM — OpenAI-compatible proxy that routes between local and cloud models
This gives you a single API endpoint that can serve local models for cheap tasks and fall back to Claude or GPT-4 for complex ones.
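Before wiring up LiteLLM, it's worth confirming Ollama is reachable. Here's a minimal sketch, assuming Ollama is running on its default port (11434); it lists whatever models you've already pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def list_local_models() -> list[str]:
    """Return the names of models Ollama has available locally."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

if __name__ == "__main__":
    models = list_local_models()
    print("Ollama is up. Local models:", ", ".join(models) or "(none pulled yet)")
```

If this fails to connect, start the server with `ollama serve` before going any further.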
Choosing Models
Not all models are equal for agent work. Agents need reliable tool-use and instruction following. Here’s what works in 2026:
| Use Case | Model | Why |
|---|---|---|
| Code + General | Qwen3-30B-A3B | MoE architecture, fast, strong reasoning |
| Complex Reasoning | DeepSeek-R1-32B | Best open-source reasoning model |
| Fast Automation | Qwen3-8B | Small, quick, great for JSON/workflows |
| Code-specific | Qwen2.5-Coder-32B | Purpose-built for code tasks |
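Whichever model you pick, run a quick smoke test of structured output before trusting it in an agent loop. Here's a sketch against Ollama's chat API, assuming you've pulled `qwen3:8b`; the `format: "json"` option constrains the model to valid JSON:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434"

def smoke_test_json(model: str = "qwen3:8b") -> dict:
    """Ask the model for a small JSON object and verify it parses."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "user",
                 "content": "Return a JSON object with keys 'task' and 'priority' "
                            "for: review the open pull requests."}
            ],
            "format": "json",   # constrain output to valid JSON
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    return json.loads(content)  # raises if the model broke the contract

print(smoke_test_json())
```

If a model can't reliably pass this kind of test, it will struggle with tool calls no matter how good its benchmarks look.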
Setting Up LiteLLM Routing
The power of LiteLLM is routing: define your local and cloud models behind one OpenAI-compatible endpoint, then pick the right one per request:
```yaml
model_list:
  - model_name: local-code
    litellm_params:
      model: ollama/qwen2.5-coder:32b
      api_base: http://localhost:11434
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
```
Simple tasks go local (free). Complex tasks go to Claude (paid, but better).
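On the client side, everything goes through the proxy, so switching between the two is just a model name. Here's a minimal sketch, assuming the proxy is running locally on LiteLLM's default port 4000 and that you decide what counts as "complex" yourself (the boolean flag here is a stand-in for whatever heuristic you use):

```python
from openai import OpenAI

# Any string works as the key unless you've configured a LiteLLM master key.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-local")

def complete(prompt: str, complex_task: bool = False) -> str:
    """Route simple prompts to the local model, complex ones to Claude."""
    model = "claude-sonnet" if complex_task else "local-code"
    resp = client.chat.completions.create(
        model=model,  # matches a model_name entry in the LiteLLM config
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(complete("Write a Python one-liner that reverses a string."))
```

Because the proxy speaks the OpenAI API, any agent framework that accepts a custom `base_url` can use this setup unchanged.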
Performance Tips
Three settings that matter for Ollama. Under the hood they're llama.cpp options; Ollama exposes the first two as request options and the third as a server environment variable rather than CLI flags:
- use_mmap: false (llama.cpp --no-mmap) — Required for stable GPU inference
- num_gpu: 999 (llama.cpp -ngl 999) — Offload all layers to GPU (2x speedup)
- OLLAMA_FLASH_ATTENTION=1 (llama.cpp --flash-attn) — Required for long context windows
With these settings, expect 70-85 tokens/sec on modern hardware for 30B parameter models.
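The per-request options go in the `options` field of an Ollama API call; flash attention is enabled server-side before starting `ollama serve`. A sketch, assuming `qwen2.5-coder:32b` is already pulled:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": "qwen2.5-coder:32b",
        "prompt": "Write a function that merges two sorted lists.",
        "stream": False,
        "options": {
            "num_gpu": 999,     # offload all layers to the GPU (-ngl 999)
            "use_mmap": False,  # disable memory mapping (--no-mmap)
        },
        # Flash attention is a server setting, not a request option:
        #   OLLAMA_FLASH_ATTENTION=1 ollama serve
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```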
Next Steps
In the next article, we’ll build an actual agent that uses this local stack to automate a real workflow.