Benchmarks

Choosing the Right Models (So You Don't Burn Money)

Six local models on an RTX 5070 Ti showed why speed, quality, and routing matter more than benchmark bragging rights.

2026-04-12 ยท 2 min read

The benchmark

Within 24 hours of setting up, I needed data. Which models actually work on this hardware? What are the real speed/quality tradeoffs?

I built a standardized benchmark -- 5 test categories, each scored objectively:

  1. Simple Greeting -- Can it respond coherently and concisely?
  2. Thinking / Reflection -- Can it produce original, structured analysis?
  3. Logical Reasoning -- Can it solve a reasoning puzzle?
  4. Code Generation -- Can it write working Python?
  5. Math -- Can it solve and explain a math problem?

The results

ModelScore (/10)Avg ResponseBest For
gemma4:e4b100.78sFast coding, tool orchestration, strict JSON
gpt-oss:20b100.89sFast coding, tool orchestration, strict JSON
qwen3.5:9b106.31sCoding + structured outputs
gemma3:12b70.89sGeneral chat, quick drafts
glm-4.7-flash778.11sDeep analysis only; NOT for agent loops
devstral-small-231.01sFallback / experimental only

The model map โ€” at a glance

Here's the benchmark logic as a visual routing map: which models were fast, which were useful, and which ones looked good until latency made them painful.

Choosing the Right Models โ€” speed vs quality benchmark map

View full-size infographic

The 78-second problem

Look at that glm-4.7-flash number. 78 seconds. It scored 7/10 -- not bad quality-wise. But in an agent loop where you might make 20 API calls to solve a task? You're looking at 26 minutes per task. That's not an AI assistant, that's a pen pal.

Speed matters more than quality when you're building automated workflows.

The routing strategy

This benchmark directly shaped my model routing:

The API discovery

One more critical finding: the Ollama /api/generate endpoint returns empty response fields for models that use "thinking tokens" (gemma4, qwen3.5). You have to use /api/chat with streaming to get both message.content and message.thinking properly separated.

This bug made my initial benchmarks look terrible -- models scoring 0/5 on categories they were actually handling fine, just through a different API field. Always verify your measurement tools before trusting the measurements.


Found this useful? ๐Ÿ‘‰ Follow @Raf_VRS for more Build Journal updates ๐Ÿ‘‰ Support the work: ko-fi.com/rafvrs #SelfHosting #AIAgents #HardInterference