Running Local LLMs for AI Agents: A Practical Guide


Cloud APIs are great, but they come with costs, latency, and privacy concerns. For many agent tasks, a local LLM is faster, cheaper, and good enough. Here’s how to set one up.

The Stack: Ollama + LiteLLM

The simplest production-ready local setup is:

  • Ollama — Runs models locally with GPU acceleration
  • LiteLLM — OpenAI-compatible proxy that routes between local and cloud models

This gives you a single API endpoint that can serve local models for cheap tasks and fall back to Claude or GPT-4 for complex ones.
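Before adding the proxy, it's worth confirming that Ollama is actually serving a model. A minimal sanity check in Python, assuming Ollama is running on its default port 11434 and you have already pulled qwen3:8b (any pulled model works):

import requests

# Quick check that Ollama answers on its default port before putting LiteLLM in front of it.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",  # any model you have already pulled
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])

If this prints a reply, the local half of the stack is working and LiteLLM can sit in front of it.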

Choosing Models

Not all models are equal for agent work. Agents need reliable tool use and strong instruction following. Here’s what works in 2026:

Use Case            Model               Why
Code + General      Qwen3-30B-A3B       MoE architecture, fast, strong reasoning
Complex Reasoning   DeepSeek-R1-32B     Best open-source reasoning model
Fast Automation     Qwen3-8B            Small, quick, great for JSON/workflows
Code-specific       Qwen2.5-Coder-32B   Purpose-built for code tasks

Setting Up LiteLLM Routing

The power of LiteLLM is that it puts local and cloud models behind one OpenAI-compatible endpoint. Define both in the proxy config, then pick the right one per request (or configure fallbacks):

model_list:
  # Local model served by Ollama; free to call, good for routine code tasks
  - model_name: local-code
    litellm_params:
      model: ollama/qwen2.5-coder:32b
      api_base: http://localhost:11434
  # Cloud model for harder tasks; LiteLLM reads ANTHROPIC_API_KEY from the environment
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929

Simple tasks go local (free). Complex tasks go to Claude (paid, but better).
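Because the proxy speaks the OpenAI API, any OpenAI-compatible client can choose between the two entries above by model name. A minimal sketch, assuming the proxy is running on its default port 4000 with the config shown; the length check is purely an illustrative stand-in for your own routing rule, not a LiteLLM feature:

from openai import OpenAI

# Point a standard OpenAI client at the LiteLLM proxy (port 4000 is its default).
client = OpenAI(base_url="http://localhost:4000", api_key="sk-anything")

def run_task(prompt: str) -> str:
    # Illustrative heuristic only: short prompts go to the local model,
    # everything else goes to Claude. Use whatever rule fits your agent.
    model = "local-code" if len(prompt) < 2000 else "claude-sonnet"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(run_task("Write a Python function that reverses a string."))

Switching models is now a one-string change, which is what makes the cheap-local / expensive-cloud split practical inside an agent loop.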

Performance Tips

Three settings that matter for Ollama:

  1. use_mmap: false (request option): load the weights fully into memory instead of memory-mapping them from disk, which keeps GPU inference from stalling on page faults
  2. num_gpu: 999 (request option): offload all layers to the GPU, roughly a 2x speedup when the model fits in VRAM
  3. OLLAMA_FLASH_ATTENTION=1 (server environment variable): enable flash attention, which is effectively required for long context windows

With these settings, expect roughly 70-85 tokens/sec for a 30B-class MoE model such as Qwen3-30B-A3B on a single modern GPU; dense 32B models will be noticeably slower.
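To see what these settings do on your own hardware, you can read throughput straight out of Ollama's response: with streaming off, /api/generate returns eval_count (tokens generated) and eval_duration (in nanoseconds). A rough benchmark sketch, assuming the first two options are passed per request and OLLAMA_FLASH_ATTENTION=1 is already set on the server; the model tag is just an example:

import requests

# Ollama reports eval_count and eval_duration, which is enough to compute decode throughput.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:30b-a3b",  # example tag; substitute any model you have pulled
        "prompt": "Explain memory-mapping in two sentences.",
        "stream": False,
        "options": {
            "num_gpu": 999,     # offload all layers to the GPU
            "use_mmap": False,  # load weights fully instead of memory-mapping them
        },
    },
    timeout=300,
)
resp.raise_for_status()
data = resp.json()

# eval_duration is in nanoseconds.
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/sec")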

Next Steps

In the next article, we’ll build an actual agent that uses this local stack to automate a real workflow.