
How I Built a Local AI Model Benchmark (And Why You Should Too)

I couldn't find a benchmark that tested what matters for real agent work — so I built one. Seven models, eight prompts, judge-based scoring, and an honest leaderboard. Here's the full breakdown of how it works and what the results actually mean.

2026-04-19 · 5 min read

The benchmark problem

Every LLM benchmark has the same flaw: it tests the wrong thing.

MMLU tests trivia. HumanEval tests coding puzzles. MATH tests competition maths. These are useful for academic papers and model cards, but they don't answer the question you actually have:

"Which model should I run locally for my AI agent?"

That question has several dimensions:

No single benchmark answers all of these. So I built one that tries.

What I built

The VRS Model Benchmark is a Python script I use to run a standardised set of prompts against any Ollama model and score the results objectively. Here's how it works:

The prompt suite

Eight prompts across seven categories, each designed to test something specific about agent-quality performance:

| Category | What it tests | Example |
|---|---|---|
| Greeting | Conversational warmth, natural language | "Introduce yourself briefly" |
| Factual | Accuracy, citation awareness, knowledge depth | Multi-part factual questions |
| Reasoning | Multi-step logic, constraint satisfaction | Problems requiring 3+ logical steps |
| Coding | Working code, edge cases, error handling | Real-world programming tasks |
| Instruction following | Precision, adherence to constraints | "Do X but NOT Y" |
| Creative | Originality, coherence, voice | Open-ended generation |
| Vision | Image understanding, description | Screenshot analysis |

The coding and vision prompts are weighted more heavily in the final score because an agent that can't write working code or read a screen isn't much use.
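A weighted average is one simple way to express that. The sketch below is illustrative only: the weight values, the category keys, and the `weighted_quality` function are assumptions, since the actual weights aren't stated here.

```python
from typing import Dict

# Hypothetical weights: coding and vision count double.
# The benchmark's real weights are not given in this post.
CATEGORY_WEIGHTS: Dict[str, float] = {
    "greeting": 1.0,
    "factual": 1.0,
    "reasoning": 1.0,
    "coding": 2.0,
    "instruction_following": 1.0,
    "creative": 1.0,
    "vision": 2.0,
}


def weighted_quality(
    scores: Dict[str, float],
    weights: Dict[str, float] = CATEGORY_WEIGHTS,
) -> float:
    """Weighted mean of per-category judge scores (1-5 scale).

    Categories absent from `scores` (e.g. vision for a text-only model)
    are skipped, which is a design assumption of this sketch.
    """
    weight_sum = sum(weights[category] for category in scores)
    if weight_sum == 0:
        raise ValueError("no scored categories")
    total = sum(weights[category] * score for category, score in scores.items())
    return total / weight_sum
```

With these assumed weights, a model scoring 5 on coding and 3 on greeting averages (2×5 + 1×3) / 3 ≈ 4.33 rather than the unweighted 4.0.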

LLM-as-judge scoring

Instead of keyword matching (which breaks constantly) or multiple choice (which does not test generation quality), I use judge-based scoring. After each model response, a separate LLM call grades the output on a 1-5 scale against the prompt's rubric.

# Judge prompt (simplified)
judge_prompt = f"""
Rate this response on a 1-5 scale.
Prompt: {original_prompt}
Response: {model_response}

Scoring criteria:
5 = Excellent, complete, accurate
4 = Good, minor issues
3 = Adequate, some gaps
2 = Poor, significant problems
1 = Unusable

Output ONLY a number.
"""

This gives me repeatable, calibrated scores that actually reflect quality — not just word count or keyword presence.
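Judges don't always obey "Output ONLY a number", so the reply needs defensive parsing. Here is a minimal sketch of that step, not the benchmark's actual code: `ask_judge` is a hypothetical callable standing in for whatever client sends the prompt to the judge model, and the retry count is an assumption.

```python
import re
from typing import Callable, Optional

# Shortened version of the judge prompt shown above.
JUDGE_TEMPLATE = """Rate this response on a 1-5 scale.
Prompt: {original_prompt}
Response: {model_response}

Output ONLY a number."""


def parse_score(judge_output: str) -> Optional[int]:
    """Accept replies like "4", " 5/5" or "3."; reject anything else."""
    match = re.fullmatch(r"\s*([1-5])(?:\s*/\s*5)?\.?\s*", judge_output)
    return int(match.group(1)) if match else None


def judge(
    original_prompt: str,
    model_response: str,
    ask_judge: Callable[[str], str],  # hypothetical: sends a prompt, returns text
    retries: int = 2,
) -> Optional[int]:
    """Ask the judge for a score, retrying when the reply is not a bare number."""
    prompt = JUDGE_TEMPLATE.format(
        original_prompt=original_prompt,
        model_response=model_response,
    )
    for _ in range(retries + 1):
        score = parse_score(ask_judge(prompt))
        if score is not None:
            return score
    return None  # caller decides how to treat an unscored response
```

Rejecting chatty replies outright, instead of fishing for the first digit, avoids misreading text such as "on a 1-5 scale I'd give it 3" as a score of 1.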

Speed benchmarks

Latency matters. A model that scores 5/5 on everything but takes 90 seconds per response is unusable as an agent backend. I measure:
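One way to capture wall-clock response time, as a minimal sketch and not the benchmark's own code. It assumes a hypothetical `generate` callable that wraps whatever client sends a prompt to the model and returns the full response. The choice of median and worst case over three runs is also an assumption.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_model(
    generate: Callable[[str], str],  # hypothetical wrapper around the model client
    prompt: str,
    runs: int = 3,
) -> Dict[str, float]:
    """Call `generate` several times and summarise latency in seconds."""
    latencies: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        generate(prompt)
        latencies.append(time.perf_counter() - start)
    return {"median_s": median(latencies), "worst_s": max(latencies)}
```

Repeating the call matters for local models, where the first request often includes model load time and would skew a single measurement.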

Cost tracking

Every benchmark run captures:

For local models, cost is effectively zero (electricity aside). For cloud models, this matters enormously — a model that's twice as good but costs 10x more per token isn't automatically the better choice.

The results so far

I tested seven models across local (Ollama) and cloud (OpenRouter) providers. Here's the leaderboard for my setup:

| Model | Speed | Quality (avg) | Cost | Notes |
|---|---|---|---|---|
| glm-5.1:cloud | 4.1s | 4.25 | Free | My daily driver of choice |
| devstral-small-2 | 2.3s | 3.8 | Free | Best speed/quality ratio |
| gemma3:12b | 15.2s | 3.5 | Free | Slow but solid |
| qwen3.5:9b | 2.4s | 3.8 | Free | Great local option |
| gemma4:e4b | 8.1s | 3.25 | Free | Vision support |
| gpt-oss:20b | 45.6s | 3.0 | Free | Too slow for agent work |
| qwen3:32b | 120s+ | - | Free | OOM'd my 64GB machine |

The full detailed results with per-category breakdowns are in my results directory, updated with each benchmark run.

Why you should build your own

You'll notice I said "build" not "use mine." There's a reason for that.

Every agent workload is different. My benchmark tests for a specific pattern: command-execution loops, tool calls, multi-step reasoning, and natural conversation. Your agent might prioritise creative writing, data analysis, or customer support. The prompts should reflect that.

The framework I built — the script, the judge, the scoring — is reusable. The prompts are where you customise. Swap in your own use cases, run the benchmark, and you'll get a leaderboard that answers the question that actually matters: which model works best for YOUR agent?

Getting started

# Run from your local clone of the repo
cd ~/vrs-model-bench
python3 model-bench.py --models glm-5.1:cloud qwen3.5:9b --all

# Run specific categories
python3 model-bench.py --models devstral-small-2 --coding --speed

# Results are saved per-prompt in results/

Results are JSON files with full scoring breakdowns, token counts, latency data, and the raw responses. You can aggregate them however you want.
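As a minimal sketch of that aggregation: the exact JSON schema isn't shown here, so the `"model"` and `"score"` field names below are assumptions to adapt to the real files.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean
from typing import Dict, List


def average_scores(results_dir: str) -> Dict[str, float]:
    """Average the judge score per model across all JSON files in a directory.

    Assumes each file holds one object with "model" and "score" keys
    (hypothetical field names; adjust to the actual result format).
    """
    by_model: Dict[str, List[float]] = defaultdict(list)
    for path in sorted(Path(results_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        by_model[record["model"]].append(record["score"])
    return {model: mean(scores) for model, scores in by_model.items()}
```

Calling `average_scores("results")` would then return a dictionary mapping each model name to its mean score, ready to sort into a leaderboard.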

What's next

Three things I am adding:

  1. OpenRouter cloud model tests: free-tier models from OpenRouter (Nemotron, Qwen, etc.) alongside my local results. Same prompts, same judge, same leaderboard.

  2. Cost comparison tool: given a workload profile (X messages/day, Y tokens average), calculate the actual monthly cost for each model. Local vs cloud vs hybrid.

  3. Continuous testing: when a new model drops (or a rumour says a free tier has opened up), I spin up the benchmark automatically and have results ready to post on X within the hour.
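The cost comparison in item 2 comes down to simple arithmetic. Here is a minimal sketch, assuming a flat per-token price and a 30-day month. The `Workload` class, the `monthly_cost` function, and the example price are all illustrative, not real pricing or the planned tool.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    messages_per_day: int
    avg_tokens_per_message: int  # prompt and completion tokens combined


def monthly_cost(
    workload: Workload,
    price_per_million_tokens: float,  # hypothetical flat rate in dollars
    days: int = 30,
) -> float:
    """Estimated monthly spend for one model under a given workload."""
    tokens = workload.messages_per_day * workload.avg_tokens_per_message * days
    return tokens / 1_000_000 * price_per_million_tokens
```

For example, 500 messages a day at 2,000 tokens each is 30 million tokens a month, which costs $15.00 at an assumed $0.50 per million tokens and nothing on a local model.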

The goal isn't to have the best benchmark. It's to have a benchmark that answers the question you actually care about. I couldn't find one, so I built it. You should too.

The Results — At a Glance

[Infographic: Local AI Benchmark, methodology and results]


Found this useful? 👉 Follow @Raf_VRS on X for more benchmark notes 👉 Support the work: ko-fi.com/rafvrs