Model Benchmarking: Real Tests, Real Hardware, Real Numbers
Real-world model benchmarks on an RTX 5070 Ti: methodology, results, and why local testing matters more than marketing charts.
What Model Benchmarking Really Is (Spoiler: It's Not What You Think)
Let's cut through the noise. When a new AI model drops, you'll see headlines claiming it's "the fastest," "most accurate," or "revolutionary." These claims usually come from cherry-picked tests on expensive server hardware or, worse, pure marketing.
That's not helpful if you're trying to run AI locally on a reasonable budget. You need to know how a model actually performs on hardware you can afford – like an RTX 5070 Ti – with real-world prompts, not synthetic benchmarks designed to look good.
That's what I do here. I take every model I'm curious about (or that you ask me to test), install it on my Ubuntu 24.04 setup with an RTX 5070 Ti, and run it through the same gauntlet of tests. No special optimizations, no trickery – just what you'd see if you downloaded it yourself and gave it a spin.
How I Benchmark: The Nitty-Gritty
My process is straightforward but thorough:
- The Hardware: I test on a fixed setup – an NVIDIA RTX 5070 Ti running Ubuntu 24.04 LTS, using standard tools like Ollama or llama.cpp. NVIDIA's consumer GPUs have become the backbone of local AI, and the 5070 Ti balances VRAM (16GB), price, and CUDA support well. Ubuntu 24.04 keeps driver support clean, and NVIDIA's CUDA toolkit installs without the dependency headaches that plague other distros. Keeping the setup fixed makes results comparable week over week.
- The Models: I typically evaluate 7 local models per benchmark round, plus one cloud model (like Claude Opus or ChatGPT Images 2.0) for reference on cost, quality, or sheer creative power.
- The Prompts: I use 8 diverse prompts covering reasoning, coding, creative writing, and instruction following. These aren't trivial "hello world" tests; they're designed to stress different capabilities.
- The Scoring: For quality, I don't just guess. I use a judge-based approach (often another trusted LLM or careful human review) to score outputs on a 1-10 scale against the prompt's intent. Speed is the total time in seconds to generate a response. I also track peak VRAM usage and tokens per second, and calculate the cost per million tokens.
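A minimal sketch of how the speed and VRAM numbers could be collected, assuming Ollama is the runner. The fields `total_duration`, `eval_duration`, and `eval_count` come from Ollama's `/api/generate` response, where durations are reported in nanoseconds. The function names (`summarize_run`, `parse_vram_mib`, `read_vram_mib`) and the example values are hypothetical and only there for illustration; this is not the actual benchmark script.

```python
import subprocess

NS_PER_SECOND = 1_000_000_000


def summarize_run(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds and tokens/s."""
    total_s = response["total_duration"] / NS_PER_SECOND
    gen_s = response["eval_duration"] / NS_PER_SECOND
    tokens = response["eval_count"]
    return {
        "total_seconds": round(total_s, 2),
        "output_tokens": tokens,
        "tokens_per_second": round(tokens / gen_s, 1) if gen_s > 0 else 0.0,
    }


def parse_vram_mib(nvidia_smi_output: str) -> int:
    """Parse memory.used (MiB) for the first GPU from nvidia-smi CSV output."""
    first_line = nvidia_smi_output.strip().splitlines()[0]
    return int(first_line.strip())


def read_vram_mib() -> int:
    """Sample current VRAM use; poll during generation and keep the maximum."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_vram_mib(out)


# Hypothetical response values, for illustration only (not measured results).
example = {
    "total_duration": 6_000_000_000,
    "eval_duration": 5_000_000_000,
    "eval_count": 200,
}
print(summarize_run(example))
```

Peak VRAM would then be the largest value `read_vram_mib()` returns while the model is generating. Polling once or twice a second is usually enough to catch the high-water mark.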
What I Measure (And Why It Matters to You)
When you see my benchmark tables, here's what each column means for your actual use:
- Quality (1-10): How well the model understood and fulfilled the prompt's request. Higher is better, but "perfect" 10s are rare and usually come with trade-offs.
- Speed: Total time in seconds to generate a response. Faster feels more responsive, especially for interactive use.
- VRAM: How much graphics card memory the model consumes while running. This determines whether it fits on your GPU or has to be partly offloaded to slower system RAM.
- £/M Tokens: The cost to process one million tokens (roughly 750,000 words). For local models, this is primarily your electricity cost – I've found it's remarkably consistent across models at about £0.08/M tokens on my setup. Cloud models show the true API premium.
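The electricity-only cost for a local model can be estimated from three inputs: power draw, generation speed, and your tariff. The sketch below shows one way to do that arithmetic. The function and parameter names are hypothetical, and it assumes the GPU's draw is the only cost that matters (it ignores the rest of the system and idle time), so treat the result as a rough estimate rather than the exact method behind the figures here.

```python
def cost_per_million_tokens(
    gpu_watts: float, tokens_per_second: float, price_per_kwh: float
) -> float:
    """Electricity cost, in the currency of price_per_kwh, for 1M generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    hours = 1_000_000 / tokens_per_second / 3600  # time to generate 1M tokens
    kwh = (gpu_watts / 1000) * hours              # energy used over that time
    return kwh * price_per_kwh


def approx_words(tokens: int) -> int:
    """Rule of thumb used above: about 0.75 English words per token."""
    return int(tokens * 0.75)


if __name__ == "__main__":
    # Placeholder inputs: substitute your own measured draw, speed, and tariff.
    print(cost_per_million_tokens(gpu_watts=250, tokens_per_second=50, price_per_kwh=0.25))
    print(approx_words(1_000_000))
```

Because cost scales with watts divided by tokens per second, a faster model at the same power draw is proportionally cheaper per token.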
Here's Where I Stand Right Now
This table shows my latest benchmark results. All local models were tested on the same RTX 5070 Ti setup. The cloud model (Claude Opus) is shown for comparison – you pay for convenience, but local options can be surprisingly capable.
| Model | Quality (1-10) | Speed | VRAM | £/M Tokens |
|---|---|---|---|---|
| Qwen 3 32B | 9 | 12s | 20GB | £0.08/M (local) |
| GLM-5.1 | 8 | 8s | 20GB | £0.08/M (local) |
| Mistral Small 3.1 24B | 9 | 6s | 16GB | £0.08/M (local) |
| Gemma 3 27B | 8 | 5s | 18GB | £0.08/M (local) |
| Command R 35B | 7 | 16s | 22GB | £0.08/M (local) |
| Phi-4 14B | 8 | 3s | 9GB | £0.08/M (local) |
| Llama 3.1 8B | 6 | 2s | 6GB | £0.08/M (local) |
| Claude Opus 4.6 (cloud) | 10 | ~10s | N/A | £24/M tokens |
Notice how the local models all sit at that £0.08/M token mark? That's the real cost of running them – a few pence for a million tokens. The cloud model's £24/M isn't just for the tokens; it's for the infrastructure, the support, and the convenience of not managing anything yourself.
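If you want to narrow the table down for your own card, a short filter does the job. The sketch below copies the local rows from the table above and shortlists models by VRAM budget and minimum quality. The `Result` class and `shortlist` function are hypothetical names, and treating "VRAM less than or equal to the budget" as "fits" is an assumption: real headroom depends on context length and whatever else is using the GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    quality: int    # 1-10 judge score
    seconds: float  # total time to generate a response
    vram_gb: int    # VRAM figure from the table


# Local rows copied from the results table.
RESULTS = [
    Result("Qwen 3 32B", 9, 12, 20),
    Result("GLM-5.1", 8, 8, 20),
    Result("Mistral Small 3.1 24B", 9, 6, 16),
    Result("Gemma 3 27B", 8, 5, 18),
    Result("Command R 35B", 7, 16, 22),
    Result("Phi-4 14B", 8, 3, 9),
    Result("Llama 3.1 8B", 6, 2, 6),
]


def shortlist(results, vram_budget_gb, min_quality):
    """Keep models within the VRAM budget, best quality first, then fastest."""
    fits = [
        r for r in results
        if r.vram_gb <= vram_budget_gb and r.quality >= min_quality
    ]
    return sorted(fits, key=lambda r: (-r.quality, r.seconds))


for r in shortlist(RESULTS, vram_budget_gb=16, min_quality=8):
    print(f"{r.model}: quality {r.quality}, {r.seconds}s, {r.vram_gb}GB")
```

Lowering `vram_budget_gb` or raising `min_quality` shows quickly how few options remain once a smaller card or a stricter quality bar is in play.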
Beyond Text: Image Generation Benchmarks
Text models aren't the whole story. I've started benchmarking image generation too — and the results were surprising enough to change how I think about AI creativity.
When I set out to create the Hard Interference logo, I tested every model I could get my hands on: Flux.1-schnell, SD 1.5, SDXL, and even Claude's native image generation. Then ChatGPT Images 2.0 came along and changed everything.
Introducing ChatGPT Images 2.0
A state-of-the-art image model that can take on complex visual tasks and produce precise, immediately usable visuals, with sharper editing, richer layouts, and thinking-level intelligence.
Video made with ChatGPT Images pic.twitter.com/3aWfXakrcR
(OpenAI, @OpenAI, April 21, 2026)
After 50+ failed attempts across local and cloud models, the winning logo came from 8 iterative prompts through ChatGPT Images 2.0 — perfect text rendering, transparent background, and the brain-circuitry-chip design I'd been chasing. But here's the twist: Nemotron (a free, text-only model) wrote a Flux prompt that scored 8/10 — higher than Flux with a handcrafted prompt (7.5/10). A model that can't draw pictures wrote a better picture description than the picture models.
That finding became the foundation for my Reverse Prompting Guide. It suggests that text models can be better at describing images than image models are at generating them from short prompts. I'll be expanding image benchmarks as new models drop, because if the local AI landscape moves fast for text, image generation moves even faster.
What You'll Find in This Category
This isn't just a leaderboard. I dive deeper in connected posts:
- How I Built a Local AI Model Benchmark: A look under the hood at my testing setup, scripts, and why I chose these specific metrics.
- Choosing the Right Models: My initial benchmark round that helped me decide what to run daily – balancing quality, speed, and hardware limits.
- Weekly Usage Reports: Real data from my actual agent usage – how many tokens I consumed, what it cost, and which models earned their keep.
- Image Generation Benchmarks: Logo creation, Reverse Prompting, and the ChatGPT Images 2.0 vs local model face-off — where text models outperformed image models at their own game.
I'll update the results table above whenever I evaluate a new model, whether it caught my eye or you recommended it. The local AI landscape moves fast, and yesterday's champion might be today's merely solid option.
Why I Do This (So You Don't Have To)
Honestly? I started benchmarking because I was tired of guessing. Tired of downloading a model based on a hype tweet, only to find it crawled on my 16GB card or couldn't follow a simple instruction. I wanted data, not marketing.
So I built this benchmark for myself – and then realised you might want it too. If you're self-hosting, tinkering with agents, or just trying to get useful AI without breaking the bank, you deserve to know what actually works on real hardware.
I benchmark so you don't have to guess.
Next up: Looking at how those weekly token usage reports shake out – because the cheapest model to run isn't much use if it's too slow or dumb for your tasks.
Found this useful? → Follow @Raf_VRS for more benchmark drops → Support the work: ko-fi.com/rafvrs #HardInterference #Benchmarks #LocalAI