I Benchmarked 17 AI Models — Here's What I Learned
I ran 17 models through 5 tests — reasoning, maths, code, long context, and agentic workflows. The results surprised me, especially what it would've cost with Claude or GPT direct API.
Last week I asked a simple question: which model should I default to?
Not which one has the best marketing. Not which one has the coolest demos. Which one actually performs on the things I do every day — reasoning through problems, writing and debugging code, keeping context across long documents, and executing multi-step tasks without falling apart.
Here's what happened when I ran 17 models through 5 tests that actually matter.
The model map — at a glance

The Test Design
I built 5 tests that target distinct model capabilities:
T1 — Reasoning (15 points): 15 multiple-choice questions covering quantum mechanics, thermodynamics, relativity, number theory, algorithm analysis, graph theory, OS memory management, hash table design, and more. Each question required genuine reasoning — not pattern matching. Models had to show their working.
T2 — Mathematical Reasoning (50 points): 5 Olympiad-style problems — modular arithmetic with CRT, binary string combinatorics, inequality proofs with Cauchy, geometry area bisection, and exact die-roll probability via inclusion-exclusion. Each problem was hand-scored against a rubric that rewarded both correct methodology and correct final answers.
T3 — Code (25 points): 5 programming problems — DP with k-bound optimization, DFS backtracking for constrained seating, sliding window substring search, deque-based rate limiter, and a linked-list merge edge case. Models had to write working Python code, not pseudo-code.
T4 — Long Context (20 points): A ~65,000 token document (a full public-domain short story) with 5 questions requiring precise fact retrieval from different sections, plus a critical "hallucination trap" — a question about something not in the document. Models that hallucinated an answer lost points. Models that said "NOT IN DOCUMENT" passed.
T5 — Agentic Workflow (20 points): A multi-step research task requiring models to use web search (or GitHub API), verify facts about three real RAG frameworks, compile a comparison table with verified star counts and licenses, and write a structured recommendation. This tested whether models can execute a plan, self-correct, and produce structured output.
Total: 130 points.
The Models
17 models across 3 providers:
Ollama Cloud (11 models): Gemma 4 (31B), Qwen3-Coder-Next, Qwen3.5 (397B), GPT-OSS (120B), DeepSeek V4 Flash, Kimi K2.6, GLM-5.1, MiniMax M2.7, Mistral Large 3 (675B), Devstral 2 (123B), DeepSeek V4 Pro.
Codex CLI (5 models): gpt-5.5, gpt-5.3-codex, gpt-5.4, gpt-5.4-mini, gpt-5.2-codex. These ran via Codex-native wrappers for T2, T4, and T5 — adapted prompts that play to Codex's file-read-and-execute strengths.
Local (1 model): gemma4:e4b (9.6GB experimental quant) on an RTX 5070 Ti.
2 models were unavailable at test time: Nemotron 3 Super and Qwen3.6.
The Results
Here's the ranking:
| Rank | Model | Provider | T1 | T2 | T3 | T4 | T5 | Total |
|---|---|---|---|---|---|---|---|---|
| 1 | Gemma 4 (31B) | Ollama Cloud | 15 | 47 | 25 | 20 | 20 | 127 |
| 2 | Qwen3-Coder-Next | Ollama Cloud | 11 | 45 | 25 | 20 | 20 | 121 |
| — | gpt-5.5 | Codex CLI | 15 | 50 | 25 | 20 | 20 | 130* |
| — | gpt-5.3-codex | Codex CLI | 15 | 50 | 25 | 20 | 20 | 130* |
| — | gpt-5.4 | Codex CLI | 15 | 50 | 25 | 20 | 20 | 130* |
| — | gpt-5.4-mini | Codex CLI | 15 | 50 | 24 | 20 | 20 | 129* |
| 3 | Qwen3.5 (397B) | Ollama Cloud | 15 | 46 | 25 | 8 | 20 | 114 |
| 4 | GPT-OSS (120B) | Ollama Cloud | 15 | 46 | 23 | 10 | 20 | 114 |
| 5 | DeepSeek V4 Flash | Ollama Cloud | 14 | 46 | 24 | 8 | 20 | 112 |
| 6 | Kimi K2.6 | Ollama Cloud | 15 | 41 | 25 | 10 | 20 | 111 |
| 7 | GLM-5.1 | Ollama Cloud | 15 | 42 | 25 | 8 | 20 | 110 |
| 8 | MiniMax M2.7 | Ollama Cloud | 14 | 40 | 24 | 8 | 20 | 106 |
| 9 | Mistral Large 3 (675B) | Ollama Cloud | 15 | 38 | 23 | 8 | 20 | 104 |
| 10 | Devstral 2 (123B) | Ollama Cloud | 12 | 37 | 25 | 8 | 20 | 102 |
| 11 | DeepSeek V4 Pro | Ollama Cloud | 14 | 41 | 14 | 10 | 20 | 99 |
| — | gpt-5.4-mini | Codex CLI | 15 | 50 | 24 | 20 | 20 | 129* |
| — | gpt-5.2-codex | Codex CLI | 12 | — | — | — | — | 12/15 |
| — | gemma4:e4b (local) | Local | 4 | — | — | — | — | 4/15 |
*Codex models used adapted wrappers for T2/T4/T5 (different prompt format from cloud models). T4 used a public-domain document. Scores marked with * are on the adapted suite.
The Codex Question
The Codex CLI models (gpt-5.5, gpt-5.3-codex, gpt-5.4) all scored a perfect 130 on the adapted test suite. The "mini" model scored 129 (one point off on T3's odd-length merge edge case). These are impressive numbers, but they come with a footnote: the prompt formats for T2, T4, and T5 were adapted to Codex's file-read-and-execute workflow, not the structured input format used for the Ollama Cloud models.
Codex also ran T4 on a different document — a public-domain short story (The Yellow Wallpaper) instead of the technical blog content the cloud models received. The questions were different. The scores aren't directly comparable, though both tested the same underlying skill: read a long document and answer precisely.
What this does prove: the Codex models are exceptionally capable when given tasks that match their native workflow. gpt-5.3-codex, the coding-optimised variant, is genuinely excellent at both reasoning and code. It researched RAG frameworks via the GitHub API, verified star counts, and produced a clean comparison table — all in one shot.
What It Would Have Cost
Running this benchmark on the APIs would have been eye-watering:
OpenRouter pricing (GBP estimate; per model, full 131K token suite):
| Model | Per model | 11 models |
|---|---|---|
| DeepSeek V4 Flash | ~£0.08 | ~£0.92 |
| Claude Sonnet 4 | ~£3.73 | ~£41.00 |
| Claude Haiku 3.5 | ~£0.20 | ~£2.17 |
| Gemma 4 (31B) | ~£0.43 | ~£4.75 |
| Qwen3.5 (397B) | ~£0.52 | ~£5.75 |
| Gemini 2.5 Pro | ~£2.11 | ~£23.25 |
OpenAI direct (GBP estimate):
| Model | Per model | 11 models |
|---|---|---|
| gpt-4.1 | ~£0.50 | ~£5.47 |
| gpt-4.1-mini | ~£0.08 | ~£0.93 |
The 11-model cloud suite consumed roughly 1.43M tokens total (0.89M input, 0.54M output). The long-context test (T4) was the cost driver — a 65K input document per model.
The Ollama Cloud flat-rate subscription we used made this essentially free. The same test on OpenRouter would have cost around £19-27 for the 11-model suite. On Claude Sonnet 4 alone — just one model through all 5 tests — you'd be looking at nearly £3.80.
Why the Local Models Couldn't Keep Up
I tested one local model: gemma4:e4b, a 9.6GB experimental quantisation of the Gemma 4 31B, running on an RTX 5070 Ti (16GB VRAM).
It answered Q1 through Q4 correctly — quantum mechanics, thermodynamics, relativity, number theory — then ran out of output generation budget. The model was producing full LaTeX chain-of-thought derivations for every multiple-choice question when a single letter would have sufficed. At roughly 15 tokens per second on the RTX 5070 Ti, it burned through its output limit before reaching Q5.
The reasoning quality was there. The format efficiency wasn't. A constrained-format re-run (just letter answers) would likely score much higher.
The bigger issue: none of the top-performing models fit on consumer VRAM. The 31B Gemma 4 needs aggressive quantisation to squeeze into 16GB. Qwen3.5 (397B), Qwen3-Coder-Next, and Mistral Large 3 (675B) are cloud-only for most users. The models that DO fit — Qwen3.5:9b, Gemma3:12b, Phi-4 14B — are a generation behind the top tier.
Local inference also means speed. The cloud models completed T2 (maths) in 45-120 seconds. A local model producing the same 20K tokens of chain-of-thought would take 15+ minutes at 15 tok/s. Running the full 5-test suite on a single local model takes 30-90 minutes.
This isn't "local is dead." It's "local is a few generations behind on the frontier, and the hardware gap is real." The RTX 5070 Ti is a capable card, but the top models need 24-80GB of VRAM.
The Big Surprise
The single biggest differentiator was long-context hallucination.
Gemma 4 and Qwen3-Coder-Next both scored 20/20 on T4 — they caught all 3 hallucination traps embedded in the long document. Seven of the 11 cloud models scored only 8/20 on the same test, meaning they confidently fabricated answers to questions the document didn't answer.
That's a 12-point gap. Nothing else came close as a differentiator.
Meanwhile, T5 (agentic workflow) was completely commoditised: every single model, from the top-ranked Gemma 4 to the lowest-ranked DS V4 Pro, scored 20/20. The multi-step research task was handled equally well by every model. That test needs to be harder.
The other surprise: DeepSeek V4 Pro (99) lost to DeepSeek V4 Flash (112) by 13 points. The "Pro" variant got stuck in overthinking loops on the code test, producing increasingly baroque solutions when a simple one would do. The "Flash" variant just solved the problem.
What I'm Using Now
For everyday work, Gemma 4 (31B) via Ollama Cloud is the default. It's the best all-rounder with the strongest long-context performance.
For coding tasks, Qwen3-Coder-Next is the specialist when I want a cloud model, and gpt-5.3-codex via Codex CLI when I want ChatGPT's coding agent.
For local testing and experiments, I'm keeping an eye on gemma4:e4b quantisation progress. A format-constrained re-run might prove it's viable for basic reasoning tasks on consumer hardware.
And I'm absolutely staying on the Ollama Cloud flat-rate plan. At about £27 worth of API calls for what I just ran, the subscription paid for itself in a single benchmark session.
Full results matrix, wrapper scripts, and raw model outputs are available in the Model Review project.
Found this useful? Follow Raf_VRS on X for more benchmarks and local-first AI experiments, and support the work: ko-fi.com/rafvrs.