AI Guides

Smart Model Routing: Cloud When You Need It, Local When You Don't

Smart routing now means local/private first, GPT OAuth for cloud work, Ollama Cloud for delegation, and paid APIs only when needed.

2026-04-12 · 4 min read

The problem

Out of the box, every prompt wants to go to one model. That model is usually the most capable one -- and usually the most expensive.

But most prompts do not need the most capable model.

Consider the difference between:

"What time is it?" — tiny prompt, no reasoning
"Refactor this authentication middleware to use JWT rotation with Redis caching and rate limiting per endpoint" — longer prompt, more context, more risk
"Summarise this private local file" — maybe simple, but privacy-sensitive

Those should not all take the same route.

That is the real point of smart model routing: not just saving money, but choosing the right path for the job.

Updated as of May, this is the routing map I wish I had before I started burning tokens unnecessarily.

The model map — at a glance

Smart Model Routing — Which Model, Which Path

View full-size fixed infographic

The routing order I use now

The current rule is simple:

Short or privacy-sensitive prompts stay local where possible. If the prompt includes private files, local context, keys, logs, personal data, or anything I would not happily paste into a cloud chat box, it should not leave the machine unless I explicitly choose that.
GPT OAuth is the go-to cloud path. If I already pay for subscription access, I should use that before reaching for per-token API billing.
Ollama Cloud handles delegation and fallback work. For agent/subagent runs, specialist passes, and heavier fallback routes, Ollama Cloud gives me a controlled cloud layer without making every task an API spend event.
API spend comes last. Per-token API calls are still useful, but they should be a deliberate decision, not the default route for every prompt.

That order matters more than any one config file. A routing policy is a habit: local first when privacy matters, subscription/OAuth before API, delegation where it makes sense, and paid API only when it is actually justified.

The first version was a threshold router

My first version was much simpler. I treated routing as a size problem:

smart_model_routing:
  enabled: true
  cheap_model: "qwen3.5:9b"
  cheap_model_base_url: "http://localhost:11434/v1"
  thresholds:
    max_chars: 220
    max_words: 40
  primary_model: "gpt-5.3-codex"

The idea was straightforward: if a prompt was under both thresholds, send it to a cheaper local model. Everything else went to the stronger cloud model.

That was useful as a learning step. It taught me that not every prompt deserves the same model.

But it was not the final route.

The problem is that prompt size is only one signal. A short prompt can still be sensitive. A long prompt can be low-value. A coding task might need a stronger model. A background delegation job might be better sent through Ollama Cloud. A one-line private note might need to stay local, even if it is trivial.

So the setup evolved from “short prompts go cheap” into “choose the right path based on privacy, task type, existing subscription access, and cost control”.

The cost tiers

The point is not to avoid cloud models completely. The point is to avoid using the expensive route by accident.

Tier	Monthly Cost	Routing Style	Use Case
Lean	£0–15/month	Local-first with light cloud fallback	Testing, notes, simple build work, low-risk prompts
Balanced	£15–40/month	Local + GPT OAuth + Ollama Cloud delegation	Normal agent workflows without defaulting to API spend
Scale-up	£40–80/month	API escalation after OAuth and delegation routes	Heavy specialist work where paid API calls are justified

The goal is not “never spend”. That is fantasy accounting.

The goal is: route first, spend second.

If a prompt can stay local, keep it local. If cloud is needed and subscription access already covers the work, use that. If a delegated agent needs cloud horsepower, use Ollama Cloud deliberately. Only then should per-token API billing enter the conversation.

The local advantage

A local model is not just cheaper than a cloud model. It changes the risk profile.

Local is better for:

Privacy -- Data stays on the machine.
Latency -- No network round trip for small jobs.
Resilience -- No vendor outage, no rate-limit surprise, no account weirdness.
Control -- I can decide exactly when something leaves the box.

That does not mean local wins every task. It means local should be the default privacy boundary.

When cloud still wins

Cloud still earns its place.

GPT OAuth is useful for general cloud reasoning, coding help, longer explanations, and normal day-to-day tasks where I want a strong model without paying per token every time.

Ollama Cloud is useful for delegation, fallback runs, and heavier agent work where I want cloud capacity but still want cost discipline.

Specialist APIs are useful when a specific provider or model is genuinely needed. But that is the key word: specific. API spend should answer a clear need, not paper over lazy routing.

The config drift problem

The deeper lesson was not just model choice. It was config drift.

Old routing ideas can sit in backups, archives, notes, and half-remembered blog drafts while the active setup has moved on. If an agent writes from memory instead of checking the live config, it can document a system that no longer exists.

That is exactly why this post needed updating.

As of May, the lesson is clearer: smart routing is not a single YAML block. It is an operating rule.

Short and sensitive prompts stay local. GPT OAuth is the go-to cloud path. Ollama Cloud handles delegation and fallback work. API spend comes last.

Found this useful? 👉 Follow @Raf_VRS for more Build Journal updates 👉 Support the work: ko-fi.com/rafvrs #SelfHosting #AIAgents #HardInterference