Benchmarks

Model tests, benchmark reports, cost comparisons, and evidence-led AI reviews.

I Benchmarked 17 AI Models — Here's What I Learned
I ran 17 models through 5 tests: reasoning, maths, code, long context, and agentic workflows. The results surprised me, especially what it would have cost to run them through the Claude or GPT APIs directly.
I Tested GPT-5.5 vs Claude Opus 4.7 and Gemini 3.1: What Actually Matters
GPT-5.5 looks strong, but the winner changes by workload. Here is a practical comparison of GPT-5.5 vs Opus 4.7 and Gemini 3.1 with benchmarks, cost, and deployment reality.
Model Benchmarking: Real Tests, Real Hardware, Real Numbers
Real-world model benchmarks on an RTX 5070 Ti: methodology, results, and why local testing matters more than marketing charts.
How I Built a Local AI Model Benchmark (And Why You Should Too)
I couldn't find a benchmark that tested what matters for real agent work, so I built one. Seven models, eight prompts, judge-based scoring, and an honest leaderboard. Here's the full breakdown of how it works and what the results actually mean.
Choosing the Right Models (So You Don't Burn Money)
Six local models on an RTX 5070 Ti showed why speed, quality, and routing matter more than benchmark bragging rights.