Benchmarks
Model tests, benchmark reports, cost comparisons, and evidence-led AI reviews.
- I Benchmarked 17 AI Models — Here's What I Learned
I ran 17 models through 5 tests: reasoning, maths, code, long context, and agentic workflows. The results surprised me, especially what it would've cost through the Claude or GPT APIs directly.
- I Tested GPT-5.5 vs Claude Opus 4.7 and Gemini 3.1: What Actually Matters
GPT-5.5 looks strong, but the winner changes by workload. Here is a practical comparison of GPT-5.5 vs Opus 4.7 and Gemini 3.1 with benchmarks, cost, and deployment reality.
- Model Benchmarking: Real Tests, Real Hardware, Real Numbers
Real-world model benchmarks on an RTX 5070 Ti: methodology, results, and why local testing matters more than marketing charts.
- How I Built a Local AI Model Benchmark (And Why You Should Too)
I couldn't find a benchmark that tested what matters for real agent work, so I built one. Seven models, eight prompts, judge-based scoring, and an honest leaderboard. Here's the full breakdown of how it works and what the results actually mean.
- Choosing the Right Models (So You Don't Burn Money)
Six local models on an RTX 5070 Ti showed why speed, quality, and routing matter more than benchmark bragging rights.