Build Journal

Are You Still Working? How I Made AI Agent Status Visible

My AI agent got stuck twice in two sessions -- once from context loss, once from a stale process conflict. I couldn't see it happen because the dashboard only knew the agent existed, not what it was doing. Here's how I built 4-state status detection and why it matters.

2026-04-16 · 10 min read

The question you can't answer

You have an AI agent running on your machine. You walk away for ten minutes. You come back.

Is it still working? Did it crash? Is it waiting for you?

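For weeks, the Mission Control dashboard couldn't answer this. It could tell me the agent had an active session. It could show recent messages. But it couldn't distinguish between:

  1. Active -- the agent is working on a task right now
  2. Idle -- the turn finished and the agent is waiting for me
  3. Crashed -- something broke, but the session still looks open
  4. Offline -- no session is running at all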

These are four very different states. Treating them all as "the agent is running" is like treating a stopped heart monitor the same as a healthy pulse -- technically the machine is plugged in, but the patient is dead.

The two sessions that broke trust

I learned this the hard way, two sessions in a row.

Session 1: Context loss coma

The agent was building this blog -- seven posts, a full Next.js app, file after file being written. The task consumed the entire context window. The session compressed four times, each compression stripping more detail from working memory. By the end, the agent had no idea what it had just built.

Anyone watching the dashboard would have seen: "Agent active, last session running." Useless. The session was technically alive but the agent inside it had lost the plot entirely. What looked like a working assistant was actually an amnesiac staring at a screen it no longer understood.

I documented this incident in my post on context loss. The fix was a three-layer memory system: persistent scratchpad notes, daily conversation logs, and a wake-up protocol. But the incident revealed a deeper problem -- nobody could see that something was wrong.

Session 2: The zombie process

Next session, I restarted Mission Control. Or tried to. The restart hung because a stale next-server process from a prior version (v15.5.15) was still holding port 3000. The new process (v16.2.3) couldn't bind. Two servers fighting over the same port.

The dashboard? Still showing green. The health check endpoint returned 200 because something was serving on port 3000 -- just the wrong thing. The problem was invisible until I noticed the UI was stale and the process wasn't responding to changes.

The root cause: no pre-startup cleanup. The restart flow assumed a clean slate. It should have killed existing processes first.
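A pre-startup check does not need to be elaborate. Here is a minimal sketch, assuming the only question is "does anything already accept connections on this port?". The function name port_in_use is mine, not something from Mission Control.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return probe.connect_ex((host, port)) == 0
```

A restart script could call port_in_use(3000) before launching and stop the old process first when it returns True. Note that this only says the port is taken, not who is holding it, so it is a guard and not a diagnosis.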

But the restart command itself was causing crashes too. The sequence was:

fuser -k 3001/tcp 2>/dev/null; sleep 2; cd /home/klb/local-ai-journal && np

Two problems:

fuser -k sends SIGKILL. No graceful shutdown. No cleanup of temp files, WebSocket connections, or build artifacts. Just immediate murder of whatever's on the port. If Mission Control's Hot Module Replacement process happened to be on 3001 during the 2-second race window, it would get killed too. The sleep 2 was supposed to prevent this, but two seconds isn't a guarantee -- it's a coin flip.

np isn't a real command. It doesn't exist on the system. No binary, no alias, no script. Every time the restart ran, the start silently failed. The server never came back. From the dashboard, it looked like the agent was online (the old process was still technically running until fuser -k killed it), then suddenly offline with no explanation.

The fix: a proper systemd user service that handles the full lifecycle -- SIGTERM first (graceful shutdown), wait up to 8 seconds, SIGKILL only as a last resort. Auto-restart on crash with a 5-second delay. Logs via journalctl. And a restart.sh script that uses systemctl when the service is active, falling back to manual process management when it's not.

# Before: silent death
fuser -k 3001/tcp 2>/dev/null; sleep 2; cd ~/local-ai-journal && np

# After: proper lifecycle
systemctl --user restart local-ai-journal
# or: ~/local-ai-journal/restart.sh
# logs: journalctl --user -u local-ai-journal -f

The key lesson: your restart command is part of your reliability story. If restarting causes crashes, you don't have a restart system -- you have a crash system that sometimes starts things.

But again -- the deeper problem was observability. The dashboard said "fine" while the system was broken.

Why status matters

AI agents aren't web servers. A web server either serves requests or it doesn't -- binary. An AI agent has a lifecycle:

  1. Active -- talking to an LLM, running tools, writing files. IO is happening. The thinking is live.
  2. Idle -- the turn completed, the ball is in the user's court. The agent is waiting for input.
  3. Crashed -- something went wrong. A tool hung, a process died, context was lost. The agent is unresponsive but the session still looks "open."
  4. Offline -- no session at all. The agent isn't running.

If you can't see these states, you can't trust the system. You either check in constantly (wasting your time) or you trust blindly (missing failures). Neither scales.

This is especially critical when you have agents running autonomously -- cron jobs, nightly reviews, watchdogs. If a scheduled agent crashes mid-task, you need to know. A dashboard that says "last seen 2 hours ago" is not the same as one that says "crashed -- no response for 120 minutes, last user message unanswered."

The challenge: Messages are batch-written

The obvious way to detect agent activity is to check the messages in the database. If the last message is from the user, the agent should be working on a reply. If the last message is from the assistant, the turn is over and the agent is idle.

Except messages aren't written in real-time.

Hermes (the agent framework) batch-writes all messages for a turn when the turn completes. While the agent is actively thinking -- calling the LLM, running tools, processing results -- the database shows nothing new. The last message is always from the previous completed turn, which is almost always an assistant message.

This means checking "who sent the last message" always returns "assistant," which always maps to "idle." The agent could be deep in a 30-tool-call chain and the database would report it as idle.

This is the fundamental observability gap. The work happens between database writes.
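To make the gap concrete, here is a toy simulation. It is not Hermes code: the Turn class and the in-memory list standing in for the database are hypothetical, and only the batch-at-end-of-turn behavior is taken from the description above.

```python
class Turn:
    """Buffers a turn's messages and writes them only when the turn completes."""

    def __init__(self, db: list):
        self.db = db
        self.pending = []

    def add(self, role: str, text: str) -> None:
        self.pending.append((role, text))   # not visible to readers yet

    def complete(self) -> None:
        self.db.extend(self.pending)        # batch write at end of turn
        self.pending = []


def naive_status(db: list) -> str:
    """Naive rule: the author of the last stored message decides the status."""
    if not db:
        return "offline"
    return "idle" if db[-1][0] == "assistant" else "active"


db = [("user", "hi"), ("assistant", "hello")]   # previous completed turn
turn = Turn(db)
turn.add("user", "build the blog")
turn.add("assistant", "calling tool 1 of 30...")
mid_turn = naive_status(db)     # agent is busy, but the store has not changed
turn.complete()
after_turn = naive_status(db)   # turn ended on an assistant message
```

Both mid_turn and after_turn come out as "idle", which is exactly the problem: the naive rule cannot tell a busy agent from a waiting one.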

My solution: /proc/PID/io delta detection

Since I can't rely on the database for real-time state, I read the process's IO activity instead.

Linux exposes per-process IO counters at /proc/PID/io. The fields rchar and wchar count bytes read and written by the process. When an agent is actively calling an LLM API and processing responses, these counters change in real-time -- even between database writes.
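Reading the counters is a few lines of parsing. This is a sketch, not the code in my route: the function names are mine, and the sample text only mirrors the "key: value" layout that /proc/PID/io uses, with made-up numbers.

```python
from pathlib import Path


def parse_proc_io(text: str) -> dict:
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def read_io_counters(pid: int) -> dict:
    """Return rchar and wchar for a process (Linux only).

    Raises OSError if the process is gone or the file is not readable.
    """
    counters = parse_proc_io(Path(f"/proc/{pid}/io").read_text())
    return {"rchar": counters["rchar"], "wchar": counters["wchar"]}


# Illustrative sample in the /proc/<pid>/io layout (values are invented)
SAMPLE = """rchar: 323934931
wchar: 323929600
syscr: 632687
syscw: 632675
read_bytes: 0
write_bytes: 323932160
cancelled_write_bytes: 0
"""
sample_counters = parse_proc_io(SAMPLE)
```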

The approach:

  1. Find the Hermes process -- pgrep -f 'hermes_cli' to get the main process ID
  2. Read IO counters -- parse rchar and wchar from /proc/PID/io
  3. Compare with previous poll -- a state file at ~/.hermes/mc-status-state.json stores the last counters
  4. If IO changed -- the agent is active (it's talking to the LLM API)
  5. If IO is stable -- check message timestamps and decide between idle, crashed, or offline
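Step 3, the comparison with the previous poll, can be sketched like this. The JSON layout of the state file is an assumption for illustration (I am not reproducing the real contents of mc-status-state.json), and io_changed is a hypothetical helper name.

```python
import json
from pathlib import Path


def io_changed(current: dict, state_path: Path) -> bool:
    """Compare current counters with the previous poll, then store the current ones.

    Returns False on the first poll, because there is no baseline yet.
    """
    previous = None
    if state_path.exists():
        try:
            previous = json.loads(state_path.read_text())
        except (OSError, json.JSONDecodeError):
            previous = None   # unreadable state file: treat as no baseline
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps(current))
    if previous is None:
        return False
    before = (previous.get("rchar"), previous.get("wchar"))
    after = (current["rchar"], current["wchar"])
    return before != after
```

One thing this sketch ignores: if the process restarts and gets a new PID, the counters reset. Storing the PID alongside the counters and treating a PID change as "no baseline" would cover that case.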

The 4-state detection logic:

if end_reason == "compression":
    status = "crashed"    # session ended due to context overflow (record already closed)
elif no open session:
    status = "offline"
elif io_bytes_changed in last 10s:
    status = "active"     # agent is processing right now
elif last_message_from_assistant and stable < 60s:
    status = "active"     # just finished, might still be working
elif last_message_from_assistant and stable >= 60s:
    status = "idle"       # agent is waiting for user input
elif last_message_from_user and silent > 300s:
    status = "crashed"    # user asked something, no response for 5+ min

I also check for end_reason: "compression" -- when a session dies from context overflow, it's effectively crashed even though the session record is closed.
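As runnable Python, the decision could look like the sketch below. Two things here are my assumptions and not part of the rules above: the compression check runs first (a compressed session's record is closed, so an "offline" check ahead of it would swallow the case), and a user message that is still inside the 5-minute window falls through to "active". The parameter names are hypothetical.

```python
from typing import Optional


def classify(session_open: bool, end_reason: Optional[str], io_changed: bool,
             last_role: Optional[str], seconds_quiet: float) -> str:
    """Map the observed signals to one of: offline, active, idle, crashed."""
    if end_reason == "compression":
        return "crashed"      # checked first: this session record is already closed
    if not session_open:
        return "offline"
    if io_changed:
        return "active"       # counters moved since the last poll
    if last_role == "assistant":
        return "active" if seconds_quiet < 60 else "idle"
    if last_role == "user" and seconds_quiet > 300:
        return "crashed"      # user asked something, no response for 5+ minutes
    return "active"           # assumption: still inside the response window
```

Keeping the function pure (signals in, string out) means it can be tested without a running agent or a database.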

The Python syntax bug that broke everything

Of course, building the endpoint introduced its own failure mode. The route embeds a Python script (to query the SQLite database, since sqlite3 CLI isn't available on the machine). The Python list comprehension had this:

int(r[4]or0)

In Python, or is a keyword, and the tokenizer needs whitespace between it and an adjacent name or number. Without the space, r[4]or0 is a syntax error -- Python reads or0 as a single identifier, not as or followed by 0.

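# Broken:
int(r[4]or0)       # SyntaxError: "or0" is read as one name
float(r[8])if r[8]is not None  # Same cramped style (fragment: rest of the expression omitted)

# Fixed:
int(r[4] or 0)     # Works
float(r[8]) if r[8] is not None  # Works (fragment: rest of the expression omitted)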
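This class of bug can be caught before it ships by asking Python to parse the snippet. A minimal sketch using the built-in compile(); the helper name compiles is mine, and the check only parses the expression, it never runs it.

```python
def compiles(expr: str) -> bool:
    """Return True if expr parses as a Python expression."""
    try:
        compile(expr, "<embedded>", "eval")
        return True
    except SyntaxError:
        return False


cramped_ok = compiles("int(r[4]or0)")     # "or0" is tokenized as one name
spaced_ok = compiles("int(r[4] or 0)")
```

For a whole embedded script, as opposed to a single expression, the same idea works with mode "exec" in place of "eval".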

This is the kind of bug that's invisible until runtime. The TypeScript compiled fine. The route loaded fine. Only when the endpoint was called did the embedded Python fail with a syntax error that got caught and wrapped as an HTTP 200 response with an error payload. The dashboard showed "idle" because it got a response -- just not the one it expected.

Debugging required curling the endpoint directly and reading the raw JSON error. The fix was five spaces added to three lines.
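The other half of the lesson is that a 200 response is not proof of a valid answer. Here is a sketch of a stricter reader. The "status" and "error" field names are assumptions for illustration, not the actual shape of my endpoint's payload.

```python
import json

VALID_STATES = {"active", "idle", "crashed", "offline"}


class StatusError(Exception):
    """Raised when the endpoint answered but did not report a usable status."""


def parse_status(http_code: int, body: str) -> str:
    """Return the reported state, or raise StatusError for anything else."""
    if http_code != 200:
        raise StatusError(f"HTTP {http_code}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise StatusError(f"invalid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise StatusError("payload is not a JSON object")
    if "error" in payload:                   # hypothetical error field
        raise StatusError(str(payload["error"]))
    status = payload.get("status")           # hypothetical status field
    if status not in VALID_STATES:
        raise StatusError(f"unexpected status: {status!r}")
    return status
```

With a reader like this, an error payload surfaces as a failure the dashboard can display, instead of quietly falling back to "idle".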

Making it visible: The Live Activity Feed

With the endpoint working, the Mission Control dashboard now shows real-time agent status in two places:

The header badge -- A small dot next to the assistant indicator, colored by the current state.

The Live Activity panel -- A live-updating feed showing the last 10 status events with relative timestamps ("2m ago", "just now"). Crashed sessions get a red alert row. Active sessions get a pulsing green dot.
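The feed logic is small enough to sketch. The dashboard itself is not written in Python, so treat this as an illustration of the idea only: a bounded buffer for the last 10 events plus a relative-time formatter. The thresholds (under a minute counts as "just now") are my assumptions.

```python
from collections import deque


def relative_time(seconds_ago: float) -> str:
    """Format an age in seconds as a short relative label."""
    if seconds_ago < 60:
        return "just now"
    if seconds_ago < 3600:
        return f"{int(seconds_ago // 60)}m ago"
    if seconds_ago < 86400:
        return f"{int(seconds_ago // 3600)}h ago"
    return f"{int(seconds_ago // 86400)}d ago"


# A deque with maxlen drops the oldest event once the limit is reached
feed = deque(maxlen=10)
for i in range(12):
    feed.append(("active", i))
```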

This is the difference between "the system is running" and "the system is running correctly." The first tells you the process exists. The second tells you it's doing useful work.

The restart guard

The zombie process incident and the fuser -k + np crash taught me that restarts need to be defensive. I built three layers of protection:

systemd user service (production):

  1. Manages the full process lifecycle automatically
  2. SIGTERM on stop, SIGKILL only after timeout
  3. Auto-restart on crash with 5-second delay
  4. Logs captured by journalctl
  5. Survives reboots (the service is enabled with systemctl --user enable)

restart.sh (quick manual use):

  1. Detects systemd service -- uses systemctl when available
  2. Falls back to manual: SIGTERM, wait up to 8s, SIGKILL last resort
  3. Only kills the journal's own process (not mission-control if it happens to share the port)
  4. Verifies the server comes back up on port 3001

stop.sh (emergency cleanup):

  1. Kill port 3001
  2. Find and kill remaining next-server PIDs
  3. SIGKILL fallback for anything stubborn

The key principle: never assume a clean slate, and never use SIGKILL when SIGTERM will do.
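The SIGTERM-first fallback is easy to get wrong in shell, so here is the same idea as a Python sketch. It assumes the process was started by the script itself (a subprocess.Popen handle), and the function name stop_gracefully is hypothetical. The real restart.sh is a shell script and is not reproduced here.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout: float = 8.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then SIGKILL as a last resort.

    Returns the process exit code (negative signal number on POSIX).
    """
    if proc.poll() is not None:
        return proc.returncode        # already exited, nothing to do
    proc.terminate()                  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                   # SIGKILL on POSIX
        return proc.wait()
```

For a process the script did not start, the same sequence can be built from os.kill with signal.SIGTERM, a polling loop, and signal.SIGKILL.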

When it gets stuck: the manual recovery playbook

All the automation in the world won't help if you don't know why something got stuck. Before you restart anything, grab the evidence.

1. Copy the last line from your UI. Whatever the agent was doing when it froze -- a command, an error, a tool call -- that's your breadcrumb. Copy it to your clipboard before you touch anything else. Once you restart, that context is gone.

2. Open a new terminal window and run:

hermes gateway stop
hermes gateway start

This cleanly stops and restarts the agent gateway without the fuser -k carpet-bomb approach. The gateway handles graceful shutdown on its own.

3. Open a new agent instance and paste what you copied. Tell the agent:

"Investigate the issue of [paste the last line you copied]"

This gives the fresh agent enough context to trace the failure. It can check logs, inspect processes, read error files -- whatever the original agent couldn't do because it was stuck. The old session is gone, but the evidence you clipboarded lets the new one pick up the trail.

The pattern is simple: preserve, restart, investigate. Not restart, realise you lost the error, shrug, and hope it doesn't happen again.

What this means for your setup

If you're running AI agents -- locally or in the cloud -- ask yourself:

  1. Can you see what the agent is doing right now? Not "is the process running" but "is it actively processing a task?"
  2. Can you tell when it's stuck? A crashed agent and an idle agent look the same if you're only checking process existence.
  3. Do you have restart guards? Never assume the previous instance cleaned up after itself. And never SIGKILL when SIGTERM will do -- your restart command is part of your reliability story.
  4. Are you checking the right signals? Database timestamps are delayed. Log files are delayed. Process IO counters are real-time. Pick the signal that matches the resolution you need.
  5. Can you recover gracefully? When your agent gets stuck, does someone notice? Or does it sit there for hours until you happen to check?

The most dangerous state for an AI agent isn't failure. It's silent failure -- when it looks like it's working but it's actually stuck. Observability is what turns silent failures into visible ones.

The meta-lesson

I now track my agent's status through my agent. The agent runs the cron job that checks whether the agent is alive. The agent maintains the memory logs that help the agent recover from context loss. The agent's dashboard monitors the agent's own processes.

This sounds circular, and it is. But it works because each layer is independent. The cron job reads /proc/PID/io -- it doesn't need the agent to be working. The memory logs are written to disk -- they survive even if the agent's context is destroyed. The dashboard polls an API endpoint -- it doesn't need the agent to be responsive, just the web server.

The system is resilient because each observability layer can function without the others. Remove any single layer and you still get answers, just less detailed ones. That's the goal.

Build your observability like you build your backups: independent, overlapping, and checked by something other than the thing they're monitoring.

Found this useful? 👉 Follow @Raf_VRS for more Build Journal updates 👉 Support the work: ko-fi.com/rafvrs