Build Journal

Just One More Prompt

I generated a full rap-over-house track on a local RTX 5070 Ti using HeartMuLa — and lived to tell the tale of dependency hell, patching transformers, and the moment the beat finally dropped.

2026-04-16 · 4 min read

You know the feeling. It's 2 AM. Cursor blinking. The feed is calling. "Just one more prompt," you whisper.

This time, the prompt wasn't a distraction; it was the point. After coming across a great AI-generated tune, I thought: why not try it myself? The thing is, I had not taught Dade, my agent, how to do any of it. It went off, researched what would work on my system, and installed HeartMuLa, an open-source music generation model. Getting everything running took around 30 minutes, all from a single prompt: turn that late-night struggle into a song, rap over a house beat, old-school flow, agentic AI references, building from mellow to hype.

And it worked. Dade delivered.

The Song

The track is called "Just One More Prompt" — a 2:36 track about the eternal struggle between focus and distraction, agentic AI and late nights, commitment and the open browser tab.

The lyrics map the familiar arc: mellow introspection → building tension → full energy commitment. The hook hits hard:

ONE MORE PROMPT — that's what I always say
But the clock don't stop and the work won't wait!
Distraction's calling but I'm locking in
Commit to the grind — let the focus begin!

The refrain below is the same journey condensed into the hook.

The Setup (Or: How I Learned to Stop Worrying and Patch Python)

Generating music locally with HeartMuLa on an RTX 5070 Ti (16GB) should be straightforward. Clone, install, download checkpoints, run. Reality had other plans.

Bug 1: RoPE Cache Skips on Meta Device

HeartMuLa uses from_pretrained, which creates the model on the meta device first and then loads the weights. The Llama3ScaledRoPE module's rope_init() quietly skips building caches on meta tensors and never rebuilds them after the model moves to a real GPU. Result: a cryptic runtime crash.

Fix: Patch modeling_heartmula.py to reinitialise RoPE caches after reset_caches():

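# Re-initialise RoPE caches that were skipped during meta-device loading.
# Fragment: runs inside the model method that calls reset_caches();
# `device` is assumed to be that method's target device.
from torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPE
for module in self.modules():
    if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built:
        module.rope_init()   # rebuild the caches now that real tensors exist
        module.to(device)    # move the rebuilt buffers to the target device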
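The failure mode is easier to see in a torch-free toy. The sketch below is not torchtune's code: ToyRoPE, its device strings, and rebuild_skipped_caches are hypothetical stand-ins that only mimic the pattern described above, where a cache build is skipped on the meta device and has to be re-run once the module lands on real hardware.

```python
# Toy illustration only: no torch, no torchtune. All names here are hypothetical.

class ToyRoPE:
    """Mimics a module whose cache build is skipped on the 'meta' device."""

    def __init__(self, device, max_seq_len=8):
        self.device = device
        self.max_seq_len = max_seq_len
        self.cache = None
        self.is_cache_built = False
        self.rope_init()

    def rope_init(self):
        if self.device == "meta":
            return  # no real storage yet, so the cache build is silently skipped
        self.cache = list(range(self.max_seq_len))
        self.is_cache_built = True

    def to(self, device):
        # Moving the module does NOT rebuild the cache; that is the bug.
        self.device = device
        return self


def rebuild_skipped_caches(modules, device):
    """Re-run rope_init() on every module whose cache was skipped."""
    rebuilt = 0
    for module in modules:
        if isinstance(module, ToyRoPE) and not module.is_cache_built:
            module.to(device)
            module.rope_init()
            rebuilt += 1
    return rebuilt


rope = ToyRoPE("meta").to("cuda")   # "loaded" model: on the GPU, cache still None
assert rope.cache is None
rebuild_skipped_caches([rope], "cuda")
assert rope.is_cache_built
```

The toy moves the module before rebuilding because its rope_init() checks the device string; the real patch rebuilds first and then calls module.to(device).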

Bug 2: HeartCodec Shape Mismatch

HeartCodec's VQ codebook has initted buffers saved with shape [1], but the model expects shape [] (a one-element 1-D tensor vs a 0-d scalar). Same data, different shape. from_pretrained throws a size mismatch error and refuses to load.

Fix: Add ignore_mismatched_sizes=True to both HeartCodec.from_pretrained() calls (the eager load in __init__ and the lazy load in the codec property).
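One thing worth verifying: if HeartCodec inherits from_pretrained from transformers, ignore_mismatched_sizes=True skips the mismatched checkpoint entries and leaves those tensors at their freshly initialised values, so the saved initted flags may not actually be loaded. A possible alternative is to reshape those entries before loading. The sketch below only finds the candidates. It works on plain shape tuples, and the function name and key names are hypothetical, not taken from HeartMuLa.

```python
from math import prod

def reshape_only_mismatches(checkpoint_shapes, model_shapes):
    """Return {key: (saved_shape, expected_shape)} for keys whose shapes differ
    but hold the same number of elements, so a plain reshape reconciles them."""
    fixable = {}
    for key, saved in checkpoint_shapes.items():
        expected = model_shapes.get(key)
        if expected is None or tuple(saved) == tuple(expected):
            continue
        if prod(saved) == prod(expected):  # prod(()) == 1, so [] matches [1]
            fixable[key] = (tuple(saved), tuple(expected))
    return fixable

# Hypothetical key names, shaped like the mismatch described above.
saved = {"quantizer.initted": (1,), "quantizer.embed": (8192, 32)}
model = {"quantizer.initted": (), "quantizer.embed": (8192, 32)}
print(reshape_only_mismatches(saved, model))
# {'quantizer.initted': ((1,), ())}
```

With PyTorch, the shape dictionaries can be built as {k: tuple(v.shape) for k, v in state_dict.items()}, and each reported tensor can be fixed with tensor.reshape(expected) before the state dict is loaded.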

Bug 3: torchcodec Needs FFmpeg 7, Ubuntu Ships 6

The new torchaudio.save() requires torchcodec, which requires libavutil.so.59 (FFmpeg 7). Ubuntu 24.04 ships FFmpeg 6 with libavutil.so.58. No sudo, no easy fix.

Fix: Ditch torchcodec. Use soundfile to write WAV, then ffmpeg (system) to convert WAV → MP3. Patched the postprocess method with a try/except fallback:

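import os
import subprocess

try:
    torchaudio.save(save_path, wav_cpu, 48000)
except (ImportError, OSError):
    # torchcodec (or its FFmpeg 7 libraries) is unavailable: write WAV with soundfile
    import soundfile as sf
    is_mp3 = save_path.endswith('.mp3')
    wav_path = os.path.splitext(save_path)[0] + '.wav' if is_mp3 else save_path
    # soundfile expects (frames, channels); torchaudio-style tensors are (channels, frames)
    sf.write(wav_path, wav_cpu.numpy().T, 48000)
    if is_mp3:
        # Convert WAV to MP3 with the system ffmpeg binary, then drop the temporary WAV
        subprocess.run(['ffmpeg', '-y', '-i', wav_path, '-b:a', '128k', save_path], check=True)
        os.remove(wav_path)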
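Before reaching for the fallback, it can help to confirm which FFmpeg the system actually has. The sketch below reads the major version from the ffmpeg binary on PATH. Treat it as a rough proxy only, since torchcodec loads the shared libraries rather than the binary, and note that both function names are hypothetical helpers.

```python
import re
import shutil
import subprocess

def parse_ffmpeg_major(version_output):
    """Extract the major version from the first line of `ffmpeg -version` output.
    Returns None for builds without a numeric release version (e.g. git snapshots)."""
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.match(r"ffmpeg version n?(\d+)\.", first_line)
    return int(match.group(1)) if match else None

def system_ffmpeg_major():
    """Major version of the ffmpeg binary on PATH, or None if it is missing."""
    if shutil.which("ffmpeg") is None:
        return None
    result = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    return parse_ffmpeg_major(result.stdout)

major = system_ffmpeg_major()
if major is None or major < 7:
    print("FFmpeg 7 not found on PATH; the soundfile fallback is the safer route")
else:
    print(f"FFmpeg {major} found on PATH")
```

On the Ubuntu 24.04 setup described above, this would report major version 6, matching the libavutil.so.58 that ships with it.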

Bug 4: Dependency Version Conflicts

The pinned datasets and transformers versions clash with newer pyarrow and huggingface-hub. Standard open-source fun.

Fix: Upgrade both with uv pip install --upgrade datasets transformers. The skill doc already had this documented — good lesson in actually reading the setup instructions.
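To see what is installed before and after the upgrade, a small check with the standard library's importlib.metadata is enough. This is a minimal sketch, and installed_versions is a hypothetical helper rather than part of HeartMuLa or uv.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(
    ["datasets", "transformers", "pyarrow", "huggingface-hub"]
).items():
    print(f"{name:16} {ver or 'not installed'}")
```

Run it inside the same virtual environment that HeartMuLa uses, otherwise it reports the versions of a different interpreter.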

The Hardware

RTX 5070 Ti, 16GB VRAM. The 3B model with --lazy_load true peaks around 6.2GB VRAM — comfortable headroom. Token generation ran at ~24 tokens/sec, producing 3000 tokens in about 2 minutes. HeartCodec decode added another 30 seconds. Total time from command to MP3: under 3 minutes.
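The timing adds up as a quick back-of-the-envelope check. The sketch below uses only the figures quoted above; generation_seconds is a hypothetical helper, and the numbers are approximate measurements, not benchmarks.

```python
def generation_seconds(tokens, tokens_per_sec, decode_sec):
    """Token generation time plus codec decode time, in seconds."""
    return tokens / tokens_per_sec + decode_sec

total = generation_seconds(tokens=3000, tokens_per_sec=24, decode_sec=30)
print(f"{total:.0f} s = {total / 60:.1f} min")  # 155 s = 2.6 min
```

That is 125 seconds of token generation plus 30 seconds of decoding, which lands under the 3-minute mark.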

Not bad for a local model that fits on a consumer GPU.

The Refrain

ONE MORE PROMPT — that's what I always say, but the clock don't stop and the work won't wait! Distraction's calling, but I'm locking in, commit to the grind, let the focus begin! Hard work over shortcuts, that's the only route, late night, bright screen, drown the doubt out!

Possibilities Are Endless

This was a single curious prompt that triggered the agent to install the relevant tools, patch them enough to run on my PC, write the lyrics and the beat, and deliver beyond my wildest dreams. HeartMuLa supports different styles, lyrics, and languages, so of course I kept going and asked for a mid-20s female singer to perform the song in German. It delivered again: the translation did not just say the same thing in a different language, it adapted the lyrics to the beat and still made sense.

And then I asked it to do the same in Greek. According to my wife, it not only delivered but also picked a beat in a style that is more popular in Greece.

Just one more prompt. The right one. Follow @Raf_VRS for more like this.


Generated with HeartMuLa 3B on RTX 5070 Ti. Lyrics by Dade and Raf. Bugs by open-source dependencies. Persistence by choice.
