
Model Benchmarking: Real Tests, Real Hardware, Real Numbers

Real-world model benchmarks on an RTX 5070 Ti: methodology, results, and why local testing matters more than marketing charts.

2026-04-21 · 6 min read

What Model Benchmarking Really Is (Spoiler: It's Not What You Think)

Let's cut through the noise. When a new AI model drops, you'll see headlines claiming it's "the fastest," "most accurate," or "revolutionary." These claims usually come from cherry-picked tests on expensive server hardware or, worse, pure marketing.

That's not helpful if you're trying to run AI locally on a reasonable budget. You need to know how a model actually performs on hardware you can afford – like an RTX 5070 Ti – with real-world prompts, not synthetic benchmarks designed to look good.

That's what I do here. I take every model I'm curious about (or that you ask me to test), install it on my Ubuntu 24 setup with an RTX 5070 Ti, and run it through the same gauntlet of tests. No special optimizations, no trickery – just what you'd see if you downloaded it yourself and gave it a spin.
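A minimal sketch of what "the same gauntlet for every model" can look like in code. This is an illustration, not the actual test script behind the numbers on this page: `benchmark`, `fake_model`, and the two prompts are hypothetical names, and `fake_model` is a stand-in you would replace with whatever client talks to your local model.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(generate: Callable[[str], str], prompts: List[str], runs: int = 3) -> Dict[str, float]:
    """Call `generate` on every prompt `runs` times and summarise wall-clock latency."""
    latencies = []
    for prompt in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            generate(prompt)
            latencies.append(time.perf_counter() - start)
    return {
        "calls": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; swap in your own client here.
    return prompt.upper()


# Hypothetical prompt set; the point is that every model gets the identical list.
PROMPTS = ["Summarise this paragraph.", "Write a haiku about GPUs."]

results = benchmark(fake_model, PROMPTS, runs=2)
print(results)
```

Passing the model in as a plain callable keeps the timing loop identical across models, so any difference in the numbers comes from the model rather than from the harness.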

How I Benchmark: The Nitty-Gritty

My process is straightforward but thorough:

What I Measure (And Why It Matters to You)

When you see my benchmark tables, here's what each column means for your actual use:

Here's Where I Stand Right Now

This table shows my latest benchmark results. All local models were tested on the same RTX 5070 Ti setup. The cloud model (Claude Opus) is shown for comparison – you pay for convenience, but local options can be surprisingly capable.

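| Model | Quality (1-10) | Speed | VRAM | £/M Tokens |
| --- | --- | --- | --- | --- |
| Qwen 3 32B | 9 | 12s | 20GB | £0.08/M (local) |
| GLM-5.1 | 8 | 8s | 20GB | £0.08/M (local) |
| Mistral Small 3.1 24B | 9 | 6s | 16GB | £0.08/M (local) |
| Gemma 3 27B | 8 | 5s | 18GB | £0.08/M (local) |
| Command R 35B | 7 | 16s | 22GB | £0.08/M (local) |
| Phi-4 14B | 8 | 3s | 9GB | £0.08/M (local) |
| Llama 3.1 8B | 6 | 2s | 6GB | £0.08/M (local) |
| Claude Opus 4.6 (cloud) | 10 | ~10s | N/A | £24/M tokens |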

Notice how every local model sits at the same £0.08/M token mark? That's the real cost of running them – eight pence for a million tokens. The cloud model's £24/M isn't just for the tokens; it's for the infrastructure, the support, and the convenience of not managing anything yourself.
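To make the gap between those two rates concrete, here is a small worked example. The two per-million rates come straight from the table; the monthly token volume is a made-up figure for illustration, and `cost_gbp` is a hypothetical helper name.

```python
LOCAL_GBP_PER_M = 0.08   # local rate from the table above
CLOUD_GBP_PER_M = 24.00  # Claude Opus 4.6 rate from the table above


def cost_gbp(tokens: int, gbp_per_million: float) -> float:
    """Cost in pounds for `tokens` tokens at a given price per million tokens."""
    return tokens / 1_000_000 * gbp_per_million


monthly_tokens = 5_000_000  # hypothetical workload, not a measured figure

local = cost_gbp(monthly_tokens, LOCAL_GBP_PER_M)
cloud = cost_gbp(monthly_tokens, CLOUD_GBP_PER_M)
ratio = CLOUD_GBP_PER_M / LOCAL_GBP_PER_M

print(f"local £{local:.2f}, cloud £{cloud:.2f}, ratio {ratio:.0f}x")
```

At these rates the cloud price works out to roughly 300 times the local one, whatever the volume. The sketch only compares the per-token figures as listed; it does not account for the cost of buying the GPU.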

Beyond Text: Image Generation Benchmarks

Text models aren't the whole story. I've started benchmarking image generation too — and the results were surprising enough to change how I think about AI creativity.

When I set out to create the Hard Interference logo, I tested every model I could get my hands on: Flux.1-schnell, SD 1.5, SDXL, and even Claude's native image generation. Then ChatGPT Images 2.0 came along and changed everything.


After 50+ failed attempts across local and cloud models, the winning logo came from 8 iterative prompts through ChatGPT Images 2.0: perfect text rendering, transparent background, and the brain-circuitry-chip design I'd been chasing. But here's the twist: Nemotron (a free, text-only model) wrote a Flux prompt that scored 8/10, higher than Flux with a handcrafted prompt (7.5/10). A model that can't draw pictures wrote a better picture description than the handcrafted one.

That finding became the foundation for my Reverse Prompting Guide, which builds on the idea that text models can be better at describing images than image models are at generating them from short prompts. I'll be expanding image benchmarks as new models drop, because if the local AI landscape moves fast for text, image generation moves even faster.

What You'll Find in This Category

This isn't just a leaderboard. I dive deeper in connected posts:

I will update this table whenever I evaluate a new model that catches my eye or that you recommend. The local AI landscape moves fast, and yesterday's champion might be today's solid option – or vice versa.

Why I Do This (So You Don't Have To)

Honestly? I started benchmarking because I was tired of guessing. Tired of downloading a model based on a hype tweet, only to find it crawled on my 16GB card or couldn't follow a simple instruction. I wanted data, not marketing.

So I built this benchmark for myself – and then realised you might want it too. If you're self-hosting, tinkering with agents, or just trying to get useful AI without breaking the bank, you deserve to know what actually works on real hardware.

I benchmark so you don't have to guess.


Next up: Looking at how those weekly token usage reports shake out – because the cheapest model to run isn't much use if it's too slow or dumb for your tasks.


Found this useful? → Follow @Raf_VRS for more benchmark drops → Support the work: ko-fi.com/rafvrs #HardInterference #Benchmarks #LocalAI