Listening Backwards: Extracting Lyrics From AI-Generated Music

HeartMuLa generated a song from lyrics. But what did it actually sing? I ran Whisper (faster-whisper turbo) on all three versions — English, German, Greek — to find out. The results are messy, funny, and surprisingly faithful.

2026-04-17 · 8 min read

The Problem With AI Music

You give an AI music model lyrics. It generates a song. You listen. You think you hear the words... but did it actually sing what you wrote?

This isn't academic. HeartMuLa's vocal synthesis is impressive for a 3B local model, but it's not Suno. Words slur. Syllables collide. Phonetic artefacts creep in. And when you generate in a language you don't natively speak — German, Greek — you can't tell if the model nailed the pronunciation or just hallucinated something plausible.

I needed to close the loop. I wrote the lyrics, HeartMuLa sang them, and now I needed to extract them back: to verify fidelity, spot drift, and document what a 3B music model actually produces. When you push local AI to its limits, verification isn't optional.

The Tool: faster-whisper

OpenAI's Whisper is the de facto standard for open speech-to-text. I used faster-whisper, a CTranslate2-backed reimplementation that its maintainers describe as up to 4x faster than the original, with CUDA support.

The model hierarchy for Whisper looks like this:

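| Model  | Parameters | VRAM   | Speed (relative to large) | Quality            |
|--------|------------|--------|---------------------------|--------------------|
| tiny   | 39M        | ~1 GB  | ~32x                      | Rough              |
| base   | 74M        | ~1 GB  | ~16x                      | Decent             |
| small  | 244M       | ~2 GB  | ~6x                       | Good               |
| medium | 769M       | ~5 GB  | ~2x                       | Very good          |
| large  | 1550M      | ~10 GB | 1x                        | Best               |
| turbo  | 809M       | ~6 GB  | ~8x                       | Best speed/quality |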

I went with turbo: 809M parameters, ~6GB VRAM, and roughly 8x the speed of large. On the RTX 5070 Ti with 16GB VRAM, that still fits with Ollama running, though not with much room to spare.

Why not large?

I tried large first (best quality). It needs ~10GB VRAM. With an Ollama model loaded (~8GB) and display compositor (~500MB), that's 18.5GB on a 16GB card. OOM before it even starts. This is the same lesson from the FLUX OOM journey — VRAM is the scarcest resource, and "fits on paper" doesn't mean "runs in practice."
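The arithmetic above is easy to script. This is a minimal sketch that only adds up the approximate figures quoted in this post; none of the numbers are measured, and the function and dictionary names are mine.

```python
# Rough VRAM budget check using the approximate figures from this post (in GB).
# These are estimates, not measurements from the card.
CARD_VRAM_GB = 16.0


def vram_budget(workloads: dict[str, float], capacity_gb: float = CARD_VRAM_GB) -> tuple[float, float]:
    """Return (total_needed_gb, headroom_gb). Negative headroom means OOM risk."""
    total = sum(workloads.values())
    return total, capacity_gb - total


# What is already resident before Whisper loads
background = {"ollama_model": 8.0, "display_compositor": 0.5}

for name, whisper_gb in {"large": 10.0, "turbo": 6.0}.items():
    total, headroom = vram_budget({**background, f"whisper_{name}": whisper_gb})
    verdict = "fits" if headroom >= 0 else "does not fit"
    print(f"{name}: needs {total:.1f} GB, headroom {headroom:+.1f} GB -> {verdict}")
```

With these inputs, large comes out 2.5GB over the card and turbo leaves about 1.5GB free, which is tight enough that a fragmented allocator could still cause trouble.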

Why not medium?

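Medium (769M, ~5GB) would technically fit, but turbo has similar quality with better speed. The parameter counts are close (769M vs 809M), but turbo is a pruned large-v3 with a much smaller decoder, which is where the speed comes from. No reason to choose medium over turbo in 2026.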

Why not small/base/tiny?

I am transcribing AI-generated vocals — already phonetically noisy. Using a small model on top of noisy audio is stacking uncertainties. The transcription would be garbage-in-garbage-out. Turbo is the minimum for meaningful results from synthetic vocals.

The Setup

Installation into my existing HeartMuLa venv (managed by uv):

cd ~/heartlib && uv pip install faster-whisper

The transcription script itself is minimal — load the model on CUDA, run transcribe(), print segments:

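from faster_whisper import WhisperModel

# Load the turbo checkpoint on the GPU in half precision
model = WhisperModel("turbo", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus metadata about the audio
segments, info = model.transcribe("one_more_prompt_output.mp3", language="en")

# Iterating the generator is what actually runs the decoding
for seg in segments:
    print(f"[{seg.start:.1f}s - {seg.end:.1f}s] {seg.text}")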

Each 2-3 minute track transcribed in under 15 seconds. The full pipeline for all three versions (English, German, Greek) finished in under a minute.
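Running all three versions in one go could look roughly like the sketch below. Only the English file name appears in the script above; the German and Greek file names here are placeholders I made up, and the language codes are the ISO codes Whisper expects ("de" and "el").

```python
import time

# Only the English file name comes from the script above.
# The other two are hypothetical placeholders.
TRACKS = {
    "en": "one_more_prompt_output.mp3",
    "de": "one_more_prompt_output_de.mp3",  # assumed name
    "el": "one_more_prompt_output_el.mp3",  # assumed name
}


def format_segment(start: float, end: float, text: str) -> str:
    """Render one segment the same way the script above prints it."""
    return f"[{start:.1f}s - {end:.1f}s] {text.strip()}"


def transcribe_all(tracks: dict[str, str] = TRACKS) -> dict[str, list[str]]:
    """Transcribe every track with one shared model and report timing."""
    # Imported lazily so format_segment() stays usable on a machine without a GPU
    from faster_whisper import WhisperModel

    model = WhisperModel("turbo", device="cuda", compute_type="float16")
    results = {}
    for lang, path in tracks.items():
        started = time.perf_counter()
        segments, info = model.transcribe(path, language=lang)
        # Consuming the generator is what triggers the actual decoding
        lines = [format_segment(s.start, s.end, s.text) for s in segments]
        elapsed = time.perf_counter() - started
        results[lang] = lines
        print(f"{lang}: {info.duration:.0f}s of audio transcribed in {elapsed:.1f}s")
    return results
```

Loading the model once and reusing it across tracks avoids paying the load cost three times, which matters when each transcription only takes a few seconds.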

The Results

English — Surprisingly Faithful

The English version transcribed with high fidelity. Most of HeartMuLa's output matched the input lyrics word for word. The drift I found was phonetic, not semantic:

| Input Lyrics               | What Whisper Heard            |
|----------------------------|-------------------------------|
| "Agentic AI got me moving" | "A genetic guy, got me moving" |
| "Eyeballing love"          | "Eyeball in love"             |
| "Didn't pick up"           | "Calling in pickup"           |
| "Won through the work"     | "One through the work"        |

It is hard to separate singing errors from listening errors here. Whisper transcribes what HeartMuLa actually vocalised, and HeartMuLa doesn't always enunciate cleanly. The word "agentic" is hard in both directions: for text-to-speech (HeartMuLa blurs it) and for speech-to-text (Whisper hears the more common "genetic").
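Spotting this kind of drift by eye gets tedious, so here is a minimal sketch of a word-level comparison using Python's standard difflib. It is not what I ran for this post, and the helper names are mine. It lowercases both sides, strips punctuation, and reports only the word runs that differ.

```python
import difflib
import re


def words(text: str) -> list[str]:
    """Lowercase and split into words, keeping apostrophes, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())


def word_drift(expected: str, heard: str) -> list[tuple[str, str]]:
    """Return (expected_words, heard_words) for every span that does not match."""
    a, b = words(expected), words(heard)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    drift = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            drift.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return drift


print(word_drift("Agentic AI got me moving", "A genetic guy, got me moving"))
# [('agentic ai', 'a genetic guy')]
print(word_drift("Won through the work", "One through the work"))
# [('won', 'one')]
```

Because the regex uses `\w`, the same helper also tokenises German and Greek text, although phonetic misspellings in the Greek transcript would still show up as drift.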

The structure held: intro → verse → pre-chorus → chorus → verse 2 → bridge → final chorus → outro. All present, all recognisable. The meaning survived the round trip.

German — Good Structure, Funny Ghost Lyrics

The German transcription was structurally sound. HeartMuLa clearly performed a translation-adaptation — not a word-for-word translation, but a culturally localised version. "Rabbit hole attack" became "Kaninchenbau verschwunden" (disappeared into the rabbit burrow). The vibe is right.

The funny part: at the end, HeartMuLa got stuck in a loop and repeated "Das ist ein Prompt, der Richtige" (This is a prompt, the right one) three times. Whisper faithfully transcribed every repetition. This is a known HeartMuLa issue — the model sometimes loops on the outro. The transcription didn't skip it or smooth it over. It showed me exactly what happened.
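Outro loops like this one can be flagged automatically. The sketch below assumes each repetition arrives as its own Whisper segment, which is not guaranteed, since Whisper may merge repeats into one segment. The threshold of three repeats is an arbitrary choice.

```python
from itertools import groupby


def normalise(text: str) -> str:
    """Lowercase, collapse whitespace, and trim trailing punctuation."""
    return " ".join(text.lower().split()).strip(" .,!?")


def find_loops(segment_texts: list[str], min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (text, count) for each run of identical consecutive segments."""
    loops = []
    for text, run in groupby(normalise(t) for t in segment_texts):
        count = len(list(run))
        if count >= min_repeats:
            loops.append((text, count))
    return loops


# The first entry is a placeholder, not a real lyric from the track
segments = [
    "(earlier line)",
    "Das ist ein Prompt, der Richtige",
    " Das ist ein Prompt, der Richtige.",
    "das ist ein prompt, der richtige!",
]
print(find_loops(segments))
# [('das ist ein prompt, der richtige', 3)]
```

A deliberate chorus repeat would trigger this too, so the output is a prompt to go and listen, not a verdict.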

Some drift in the bridge: "gefühllos durchziehend" (roughly "pushing through without feeling") isn't quite what the English original meant by "doing it relentless," but it's a creative interpretation, not a failure.

Greek — Most Creative, Most Artefacts

The Greek version was the longest (200s vs 156s English) and showed the most drift. HeartMuLa's Greek pronunciation is phonetically rough — it sounds Greek-ish but doesn't always match standard orthography. Whisper tried its best, producing a transcript that's understandable but spelled phonetically rather than correctly.

Examples:

The final outro had an extra-long tail (196s → 226s) where the model repeated "Πάμε ξανά" (let's go again) with no music, just ghost vocals. Again, Whisper captured it exactly. Without this verification, I would never have known the model added 30 seconds of repetitive filler. When a model doesn't know when to stop, it just keeps going, and you need tools to catch it.

Why This Matters — And Why I Built the Resource Screen

This lyrics extraction exercise crystallised something I had been feeling for weeks: you can't manage what you can't see.

When HeartMuLa generates a song, you hear it once and move on. You don't know if the German version has a repetition bug. You don't know if the Greek outro is 30 seconds too long. You don't know that "agentic" sounds like "genetic." The output seems fine until you look closely.

The same principle applies to the machine itself. I was running AI models 24/7 — Ollama serving local LLMs, HeartMuLa generating music, FLUX rendering images, Whisper transcribing audio — all on a single RTX 5070 Ti with 16GB of VRAM. And for weeks, I had no idea what the GPU was actually doing.

That's why I added a System Resource screen to Mission Control.

The System Stats Tab

Mission Control already tracked token usage, model sessions, and live activity. But it couldn't answer the question that matters most when you're running multiple AI workloads on consumer hardware:

"Can I run this model right now, or is my GPU already full?"

The new System Stats tab shows:

It's powered by a /api/system-stats endpoint that runs Python's psutil for system metrics and nvidia-smi for GPU data — the same pattern as our other API routes. Auto-polls every 10 seconds.
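The endpoint's code isn't shown in this post, so the following is only a sketch of the psutil plus nvidia-smi pattern it describes. The function names and JSON field names are assumptions; the nvidia-smi query flags and psutil calls are the standard ones.

```python
import subprocess

# Ask nvidia-smi for plain numbers: MiB used, MiB total, utilisation percent
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(output: str) -> list[dict]:
    """Turn nvidia-smi CSV output (one line per GPU) into dictionaries."""
    gpus = []
    for line in output.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "vram_used_mib": used,
            "vram_total_mib": total,
            "vram_free_mib": total - used,
            "gpu_util_percent": util,
        })
    return gpus


def read_system_stats() -> dict:
    """Collect CPU, RAM, and GPU stats. Needs psutil and an NVIDIA driver."""
    import psutil  # third-party, imported lazily so the parser works without it

    raw = subprocess.run(NVIDIA_SMI_QUERY, capture_output=True, text=True, check=True).stdout
    return {
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        "ram_used_percent": psutil.virtual_memory().percent,
        "gpus": parse_gpu_csv(raw),
    }


def can_fit(required_mib: int, gpu: dict) -> bool:
    """Answer the question above: is there enough free VRAM right now?"""
    return gpu["vram_free_mib"] >= required_mib
```

For example, `can_fit(6 * 1024, read_system_stats()["gpus"][0])` would check for the roughly 6GB that turbo needs. On some GPUs nvidia-smi reports `[N/A]` for a field, in which case this parser raises a ValueError and would need a fallback.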

The connection

The lyrics extraction and the resource screen are the same idea applied at different scales:

Both solve the same problem: closing the feedback loop. You write → the model generates → you verify. You schedule a task → the machine runs it → you check if it had the resources to succeed. Without the verification step, you're flying blind.

The FLUX OOM journey (three crashes before I found the 8-bit + sequential offload combo) was the wake-up call. I didn't know the GPU was full because I had no way to see it. Now I do. And when I ran Whisper turbo on those HeartMuLa tracks, I could check the resource screen to confirm I had the 6GB VRAM free — instead of guessing and hoping.

The Model Report Card

Here's what I learned about which models worked and which didn't for this workflow:

| Model | Task | Result | Why |
|-------|------|--------|-----|
| HeartMuLa 3B | Music generation | ✅ Works | Fits in ~6GB VRAM, ~2 min generation. Needs 4 patches on Ubuntu. |
| faster-whisper turbo | Lyrics extraction | ✅ Works | 6GB VRAM, 15 sec per track. Best speed/quality trade-off. |
| faster-whisper large | Lyrics extraction | ❌ OOM | 10GB VRAM + 8GB Ollama = 18GB on a 16GB card. No room. |
| faster-whisper medium | Lyrics extraction | ⚠️ Don't bother | Similar param count to turbo, slower architecture. Turbo beats it. |
| faster-whisper small/base/tiny | Lyrics extraction | ❌ Too noisy | Synthetic vocals + small model = garbage transcription. |
| FLUX.1-schnell (bfloat16) | Image generation | ❌ OOM | 14.5GB model alone on a 16GB card. |
| FLUX.1-schnell (8-bit + seq offload) | Image generation | ✅ Works | 6-8GB peak VRAM, 13 sec per image. Slower but functional. |

The pattern is clear: on a 16GB card shared between several workloads, the larger models only run with quantization and offloading. Unquantized models that exceed 12GB are dead on arrival. The models that work are the ones that respect the constraint.

What's Next

The lyrics extraction pipeline is now a repeatable workflow:

  1. Generate music with HeartMuLa
  2. Transcribe with faster-whisper turbo
  3. Compare input lyrics vs. extracted lyrics
  4. Flag drift, loops, and phonetic artefacts

I could automate this — generate, transcribe, diff, and surface a quality report. But that's a future post.
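A first step toward that report might look like the sketch below. It is untested against real tracks: for each input lyric line it finds the most similar transcribed line and flags anything under a similarity threshold. The 0.8 threshold and the report field names are arbitrary assumptions.

```python
import difflib


def best_match(line: str, transcript: list[str]) -> tuple[str, float]:
    """Return the transcribed line most similar to `line`, with a 0..1 score."""
    best, best_score = "", 0.0
    for candidate in transcript:
        score = difflib.SequenceMatcher(None, line.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def fidelity_report(lyrics: list[str], transcript: list[str], threshold: float = 0.8) -> list[dict]:
    """Build one report row per input lyric line, flagging weak matches."""
    report = []
    for line in lyrics:
        heard, score = best_match(line, transcript)
        report.append({
            "expected": line,
            "heard": heard,
            "score": round(score, 2),
            "flagged": score < threshold,
        })
    return report


for row in fidelity_report(
    ["Won through the work", "Didn't pick up"],
    ["One through the work", "Calling in pickup"],
):
    print(row)
```

Character-level similarity is forgiving of near-homophones like "won" and "one", so a line can score well while still containing drift. The flag is best read as "this line went badly wrong", not as proof that the others are perfect.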

For now, the loop is closed. I write the lyrics, the model sings them, and I listen backwards to hear what it actually said. Sometimes it's exactly right. Sometimes it's "a genetic guy" instead of "agentic AI." And sometimes it repeats the last line three times because it doesn't know when to stop.

Sounds like every creative process I have ever known.


Transcribed with faster-whisper 1.2.1 (turbo, CUDA float16) on RTX 5070 Ti 16GB. Music generated with HeartMuLa 3B. Mission Control at localhost:3000.
