Build Journal

When Your AI Stack Eats Itself: The Ollama Crash Loop That Took Everything Down

Two Ollama services, 56,000 restart attempts, and one port — how a silent systemd conflict took down my entire local AI stack, why it can happen to you, and how to prevent it.

2026-04-19 · 5 min read

The morning everything stopped

You wake up, grab coffee, sit down at your desk. Your AI agent — the one that runs 24/7, manages your research, watches your services, writes your apps — is gone. Not crashed. Not paused. Gone.

The terminal session? Dead. The model server? Crash-looping. The background tasks? Silent. Even the dev server that was serving your theme demo? Gone.

This happened to me. And the root cause was so mundane it's almost funny: two services fighting over one port, and one of them wouldn't stop trying.

What happened

I run Ollama — the local LLM runtime — as a systemd user service. It starts when I log in, holds port 11434, serves models on demand. Works great.

Except somewhere along the way, Ollama's convenience install script had also created a system-level service. Same name, same port, same binary. Two units, one port.

The user service started first and grabbed 11434. Then systemd tried to start the system service. It couldn't bind the port. It exited with code 1. Systemd, helpfully, restarted it. It failed again. Systemd restarted it again.

56,795 times.

That's not a typo. The restart counter was at fifty-six thousand, seven hundred and ninety-five. The system service had been crash-looping for two days before I noticed. Each restart attempt:

Spawned a process
Failed to bind port 11434
Logged an error
Got restarted 3 seconds later

Multiply that by 56,000+ and you get a steady stream of disk writes from logging, CPU time wasted on process startup, PID churn, and a systemd journal swollen with error messages. None of this was catastrophic on its own, but it was a slow drain on system resources that eventually contributed to instability.
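The numbers line up: at roughly one attempt every 3 seconds, 56,795 restarts comes to about 47 hours, which matches the two days of crash-looping. A minimal sketch of that arithmetic, plus one way to read the counter yourself. It assumes a systemd recent enough to expose the NRestarts property, and both helper names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: convert a restart count into elapsed whole hours,
# assuming a fixed delay between attempts (3 s in the story above).
restarts_to_hours() {
    count=$1
    delay_seconds=${2:-3}
    echo $(( count * delay_seconds / 3600 ))
}

# Hypothetical helper: print the restart counter systemd keeps for a unit.
# Pass --user as the second argument to query the user manager instead.
unit_restarts() {
    systemctl ${2:+"$2"} show "$1" -p NRestarts --value
}

restarts_to_hours 56795 3   # prints 47
```

Running unit_restarts ollama and unit_restarts ollama --user side by side shows which of the two managers is doing the restarting.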

The cascade

Here's the thing about local AI stacks: they're fragile not because any single component is weak, but because they're tightly coupled.

When Ollama's system service hit its 56,794th restart, the cumulative resource pressure finally tipped something over:

The agent session — my Hermes agent was mid-task (running model benchmarks, editing files, managing background processes). The session consumed its entire context window and compressed. Then compressed again. By the fourth compression, the agent had lost track of what it was doing.
The dev server — the Python HTTP server serving my theme demo was killed when process resources became constrained.
The agent process itself — the main agent session terminated. No graceful shutdown, no cleanup.

Everything fell like dominoes. And the first domino was a service that should never have existed.

Why this will happen to you

If you're running local AI, you will hit this class of problem. Here's why:

Local AI stacks have no isolation. Everything runs on one machine. Ollama, your agent, your web server, your database, your monitoring — they share one kernel, one PID table, one set of ports. A problem in one service leaks into all the others.

Convenience scripts leave landmines. The Ollama install script does exactly what it should: set up a system service so Ollama runs on boot. But if you later switch to a user-level service (which you should — it's safer, it doesn't need root, it respects user boundaries), the system service is still there. Still enabled. Still trying to start. You just can't see it unless you look.

Systemd doesn't tell you about conflicts. It logs the failures, sure. But each systemctl call talks to one manager: systemctl status ollama shows only the system unit, and systemctl --user status ollama shows only the user unit. Check just the one you set up yourself, and the other can sit in a crash loop you never see.

Crash loops are silent killers. A service that fails and restarts every 3 seconds doesn't trigger alarms. It doesn't show up as "down" on your dashboard. It's always "trying." Systemd thinks it's being helpful. It's not.
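One possible mitigation is to let systemd give up. With default settings, systemd stops retrying only when a unit is started more than 5 times within 10 seconds, and a 3-second restart delay never reaches that rate. The sketch below writes a drop-in that widens the window so a persistent failure ends in a visible "failed" state. The limits (5 starts per 300 seconds), the file name, and the function name are illustrative assumptions, not values from the Ollama installer:

```shell
#!/bin/sh
# Sketch: write a drop-in that makes systemd stop retrying after repeated
# failures. The limits (5 starts per 300 s) are illustrative assumptions.
write_start_limit_dropin() {
    dropin_dir=$1    # e.g. "$HOME/.config/systemd/user/ollama.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/start-limit.conf" <<'EOF'
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=3
EOF
}

# Usage (user service), followed by a reload so systemd picks it up:
#   write_start_limit_dropin "$HOME/.config/systemd/user/ollama.service.d"
#   systemctl --user daemon-reload
```

Once the limit is hit, the unit shows up in systemctl --failed instead of retrying forever, which is what turns a silent loop into something a dashboard can catch.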

The fix

Here's what I did, and what you should do:

1. Check for duplicate services

# List ALL services matching "ollama" at every level
systemctl list-units --all '*ollama*'
systemctl --user list-units --all '*ollama*'

If you see two, you've got the problem.
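list-units reports only units that systemd currently has loaded, so a complementary check is to look for the unit file on disk. A minimal sketch; the function name is hypothetical, and the directories in the usage note are common locations that may differ on your distribution:

```shell
#!/bin/sh
# Sketch: print every path where a unit file with the given name exists.
# Directories are passed explicitly so the check is easy to adapt.
find_unit_files() {
    unit=$1
    shift
    for dir in "$@"; do
        # -L also catches symlinked units whose target is missing
        if [ -e "$dir/$unit" ] || [ -L "$dir/$unit" ]; then
            echo "$dir/$unit"
        fi
    done
}

# Usage:
#   find_unit_files ollama.service \
#       /etc/systemd/system /usr/lib/systemd/system \
#       "$HOME/.config/systemd/user" /etc/systemd/user /usr/lib/systemd/user
```

More than one line of output means more than one manager could try to start Ollama.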

2. Remove the system-level service

You only need the user service. The system one was created by the install script and is now redundant (and dangerous):

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo systemctl daemon-reload

3. Verify only the user service remains

systemctl --user status ollama

You should see a single service reported as active (running). That is the process that should now be holding port 11434.
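systemctl status does not show ports, so it can help to confirm the listener separately. A small sketch that counts listening sockets on a port from ss output; the function name is made up for illustration, and it assumes the iproute2 ss tool is installed:

```shell
#!/bin/sh
# Sketch: count listening sockets on a port from `ss -H -ltn` output on stdin.
# Column 4 of that output is the local address:port.
count_listeners() {
    port=$1
    awk -v p=":$port" '$4 ~ (p "$") { n++ } END { print n + 0 }'
}

# Usage:
#   ss -H -ltn | count_listeners 11434
#   ss -ltnp 'sport = :11434'    # also shows the owning process where permitted
```

A count of 1 is what you would expect for a single Ollama bound to one address; a higher count can also just mean it is listening on both IPv4 and IPv6.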

4. Clear the restart counter

Even after disabling, the failed state lingers:

sudo systemctl reset-failed ollama 2>/dev/null
systemctl --user reset-failed ollama 2>/dev/null

How to continue after a crash

The crash itself is only half the problem. The other half is: what do you do when you sit down and everything's gone?

Here's the protocol I use:

Preserve before you touch. Copy the last command or line you see on screen before doing anything. Context evaporates fast.
Restart the foundation first. Ollama, then the agent, then the app servers. Bottom of the stack up.
Open a fresh agent session. Don't try to resume the dead one. Paste your preserved context and say: "Investigate what happened here."
Check for cascading damage. Did the crash corrupt any files? Leave any orphan processes? Check systemd services, check ports, check disk space.
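The "bottom of the stack up" step can be sketched as a small wait loop, so each layer starts only once the one below it answers. This assumes curl is installed and that Ollama answers HTTP on its default port; the function name and the retry defaults are hypothetical:

```shell
#!/bin/sh
# Sketch: poll a URL until it answers, so the next layer only starts once
# the layer below is up. Returns non-zero if it never answers.
wait_for_url() {
    url=$1
    tries=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if curl -sf -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$(( i + 1 ))
        sleep "$delay"
    done
    return 1
}

# Usage, bottom of the stack first (the agent and app steps are
# placeholders for whatever you run):
#   systemctl --user start ollama
#   wait_for_url http://127.0.0.1:11434/ && echo "Ollama is up"
#   # ...then start the agent, then the app servers
```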

The last step is the one most people skip. After a crash, you're relieved it's working again and you move on. But the 56,000-restart crash loop was happening silently for a couple of days before it caused visible problems. The earlier you catch these, the less damage they do.
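A rough sketch of that last sweep, covering failed units, journal size, listening ports, and disk space. The function names are made up for illustration, and each step is skipped when the tool is not installed:

```shell
#!/bin/sh
# Sketch: percentage of the root filesystem in use, parsed from POSIX df.
root_disk_used_pct() {
    df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Sketch of a post-crash sweep. Each step is skipped if the tool is missing.
post_crash_report() {
    echo "root filesystem used: $(root_disk_used_pct)%"
    if command -v systemctl >/dev/null 2>&1; then
        systemctl --failed --no-pager           # failed system units
        systemctl --user --failed --no-pager    # failed user units
    fi
    if command -v journalctl >/dev/null 2>&1; then
        journalctl --disk-usage                 # how large the journal has grown
    fi
    if command -v ss >/dev/null 2>&1; then
        ss -ltn                                 # what is listening right now
    fi
}

# Usage:
#   post_crash_report
```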

The deeper lesson

Local AI is powerful because you control the entire stack. But controlling the entire stack means you are also responsible for the entire stack. Every service, every port, every systemd unit file, every cron job — they're all yours. There's no cloud provider abstracting away the plumbing.

This isn't a weakness. It's the tradeoff. You get full control, but you also get full responsibility. The 56,000-restart crash loop is what happens when that responsibility lapses — not because anyone did something wrong, but because the convenience install script did something sensible that became a landmine when circumstances changed.

Check your services. Check your ports. Check your restart counters. Better yet, teach your agent to spot these things for you and schedule regular system checks. Your future self will thank you.
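A regular check could be as simple as flagging any unit whose restart counter is suspiciously high. A minimal sketch, assuming a systemd that exposes the NRestarts property; the function names and the threshold of 10 are arbitrary choices for illustration:

```shell
#!/bin/sh
# Sketch: read "unit restart_count" pairs on stdin and print the units whose
# count exceeds a threshold. The default threshold of 10 is an assumption.
flag_crash_loops() {
    threshold=${1:-10}
    awk -v t="$threshold" '$2 + 0 > t + 0 { print $1 " restarted " $2 " times" }'
}

# Sketch: emit "unit restart_count" for every loaded service of one manager.
# Pass --user to inspect the user manager; no argument means the system one.
list_restart_counts() {
    systemctl ${1:+"$1"} list-units --type=service --plain --no-legend |
    awk '{ print $1 }' |
    while read -r unit; do
        echo "$unit $(systemctl ${1:+"$1"} show "$unit" -p NRestarts --value)"
    done
}

# Usage, e.g. from a cron job or systemd timer:
#   { list_restart_counts; list_restart_counts --user; } | flag_crash_loops 10
```

Fed the counter from this story, the filter would print "ollama.service restarted 56795 times", which is the kind of line an agent or a dashboard can act on.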

The Ollama fix was applied immediately. The lesson took longer.

Found this useful? Follow @Raf_VRS for more from the Hard Interference build journal.
