Build Journal

When Memory Becomes the Problem

My AI agent's memory hit 21.1K chars in a 16K limit. It wasn't a bug — it was a design flaw. Here's how persistent memory bloat creeps up on AI agents, why compression alone can't save you, what I did to fix it, and where external memory providers fit into the architecture.

2026-04-16 · 11 min read

The memory that ate itself

I’ve seen it happen too many times. You’re deep in a debugging session with your AI agent, and it’s saving useful findings to persistent memory — research notes, pipeline data, margin calculations. All valuable in the moment. But when the session ends? That memory doesn’t just fade away. It lingers. It bloats. And suddenly, your agent is hitting limits it was never designed to exceed.

Last session, my agent Dade was debugging a crash. During the investigation, it saved findings to memory — detailed research notes, pipeline data, margin calculations. Useful stuff in the moment. But by the time the debugging session ended, memory had bloated to 21,100 characters… in a limit designed for 2,200.

The agent didn’t crash because of a bug. It crashed because it saved too much to a system designed to stay small.

Two kinds of compression, two kinds of failure

Here’s the thing most people misunderstand: AI agents have two memory systems, and they fail in completely different ways.

1. Conversation context (short-term memory)

This is what fills up during a long chat. The conversation history grows, hits the model’s context window limit (say 128K tokens), and something has to give.

Hermes handles this with the ContextCompressor — a sophisticated system that:

Prunes old tool results — replaces large outputs with 1-line summaries ("[terminal] ran npm test -> exit 0, 47 lines")
Protects the head — keeps the system prompt and first exchange intact
Protects the tail — preserves recent messages by token budget
Summarizes the middle — uses a separate LLM call to create a structured handoff summary
Iterates — on re-compression, updates the previous summary instead of starting fresh

It even has anti-thrashing protection: if two consecutive compressions save less than 10% each, it stops trying and tells you to start a fresh session.
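Here's a rough sketch of what that anti-thrashing check could look like. It assumes the compressor keeps a history of sizes before and after each pass. The function name, the history format, and the `min_saving` parameter are my own illustration, not Hermes's actual API.

```python
def should_stop_compressing(history, min_saving=0.10):
    """Return True when the last two compression passes each saved < min_saving.

    history: list of (size_before, size_after) tuples, oldest first.
    Hypothetical helper; the real ContextCompressor may track this differently.
    """
    if len(history) < 2:
        return False  # not enough passes to judge
    for before, after in history[-2:]:
        saving = (before - after) / before if before else 0.0
        if saving >= min_saving:
            return False  # at least one recent pass was still worthwhile
    return True


# Two passes that each saved roughly 5%: time to start a fresh session.
print(should_stop_compressing([(1000, 950), (950, 900)]))
```

The check looks at each pass separately rather than the combined saving, which matches the "less than 10% each" wording.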

This system works well. It’s the equivalent of a human forgetting the details of a conversation but remembering the key decisions.

2. Persistent memory (long-term notes)

This is the system that survives across sessions. In Hermes, it’s two markdown files:

MEMORY.md (agent’s personal notes, 2,200 char limit)
USER.md (user profile, 1,375 char limit)

These are small on purpose. They get injected into every system prompt. Every token spent on memory is a token not available for the actual task.

And here’s the critical difference: there is no automatic compression for persistent memory.

The conversation compressor handles context overflow elegantly. But when persistent memory overflows, the tool just rejects the write:

Memory at 2,081/2,200 chars. Adding this entry (350 chars) would exceed
the limit. Replace or remove existing entries first.

That’s it. No summarisation. No auto-compression. Just a hard wall.
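To make the hard wall concrete, here is a minimal sketch of a character-bounded store that rejects an oversized write. The class and exception names are hypothetical, and I'm assuming usage is the plain sum of entry lengths; the real tool may count separators or formatting differently.

```python
class MemoryFullError(Exception):
    """Raised when a write would push memory past its character limit."""


class BoundedMemory:
    def __init__(self, limit=2200):
        self.limit = limit
        self.entries = []

    def used(self):
        # Assumption: usage is just the total length of all entries.
        return sum(len(entry) for entry in self.entries)

    def add(self, entry):
        if self.used() + len(entry) > self.limit:
            # No summarisation, no eviction: the write is simply refused.
            raise MemoryFullError(
                f"Memory at {self.used():,}/{self.limit:,} chars. "
                f"Adding this entry ({len(entry)} chars) would exceed the limit. "
                "Replace or remove existing entries first."
            )
        self.entries.append(entry)


memory = BoundedMemory()
memory.add("x" * 2081)
try:
    memory.add("y" * 350)
except MemoryFullError as exc:
    print(exc)
```

Nothing in this path tries to make room. The caller has to decide what to drop, which is exactly where an agent ends up wasting turns.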

How memory bloat actually happens

Memory bloat isn’t usually dramatic. It’s death by a thousand cuts:

The debugging spiral

You’re investigating a crash. You save finding #1. Then finding #2. Then a hypothesis. Then a counter-hypothesis. Then the resolution path. Each one seems essential in the moment. But after the bug is fixed, most of those entries are dead weight — the fact that you debugged something matters, but the specific hypotheses that turned out to be wrong don’t.

The project detail trap

My VRS product pipeline entry was 330 characters of margin calculations, product strategy, and hardware compatibility notes. All useful during research. All completely irrelevant once the research was saved to a dedicated file. The memory entry became a redundant copy of information that already lived elsewhere.

The ad monetization overflow

My follower acquisition plan was another 280-character entry with specific CPM ranges, outreach strategy, provider comparisons, and recommendations. Again, this was detailed research that belonged in a file, not in the agent’s always-loaded working memory.

The pattern

In every case, the pattern is the same: the agent saves findings to memory instead of saving them to files. Memory is easy — one tool call and it’s done. Writing to a file requires thinking about where to put it, what to name it, whether to create a directory. So the agent takes the path of least resistance.

Why compression doesn’t help here

The conversation compressor is brilliant for short-term context. But it’s the wrong tool for persistent memory because:

Memory is already compressed — it’s curated notes, not raw conversation. You can’t summarise a summary without losing signal.
Memory has different value economics — a user preference is worth keeping forever. A debugging finding is worth keeping for the duration of the investigation. These need different retention policies.
Memory is injected fresh each session — it doesn’t accumulate in the same way conversation context does. The problem isn’t that one entry is too big, it’s that too many entries survive past their usefulness.
The flush_memories feature makes it worse — before compression, Hermes gives the model one turn to save memories. This is supposed to preserve important facts. In practice, the model panics and saves everything, including information that’s already been saved to files.

The architecture: why 2,200 chars?

Hermes uses character limits (not token limits) for memory because they’re model-independent. A 2,200-char limit works the same whether you’re running GPT-4, Claude, or a local Qwen model.

That limit gets allocated like this:

~500 chars: identity, values, priorities (fixed overhead)
~400 chars: environment facts (rig specs, paths, versions)
~300 chars: work rules and protocols
~1,000 chars: active project notes and task bullets

That leaves roughly zero room for research findings, debugging notes, or detailed plans. And that’s the point. Those things belong in files.
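The arithmetic is easy to check. This snippet just adds up the approximate allocations listed above; the dictionary keys are my labels for those four lines, and the figures are the rounded ones from the list, not measured values.

```python
LIMIT = 2200  # MEMORY.md character limit

# Approximate allocations from the breakdown above.
budget = {
    "identity, values, priorities": 500,
    "environment facts": 400,
    "work rules and protocols": 300,
    "active project notes and task bullets": 1000,
}

allocated = sum(budget.values())
remaining = LIMIT - allocated

print(f"allocated: {allocated}/{LIMIT} chars")
print(f"left for research notes: {remaining} chars")
```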

What I did about it

1. Compressed existing entries

I replaced bloated entries with file pointers:

Before (330 chars):

ACTIVE: vrscomputing.co.uk + VRS product pipeline saved in ~/local-ai-journal/vrs-product-pipeline.md and vrs-minipc-research.md. Key: cheap laptops + AI setup guide. DGX Spark standalone = hardened competition, only sell as bundle+service. Need to find mini PCs that run Ubuntu natively (no ChromeOS flashing). Chromebox CXI5 = cheapest option but needs MrChromebox firmware = support burden.

After (80 chars):

ACTIVE: vrscomputing.co.uk — details in vrs-product-pipeline.md + vrs-minipc-research.md

Same information access. The actual data is in the files. Memory just needs to know where to look.
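The same move can be scripted. Below is a minimal sketch, assuming the details are written to a markdown file and memory keeps only a one-line pointer. The function name `offload_to_file` and the pointer format are illustrative, and the demo runs in a temporary directory with placeholder text rather than my real notes.

```python
import tempfile
from pathlib import Path


def offload_to_file(project, detail, directory, filename):
    """Write the detail to a file and return a short pointer entry for memory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    path.write_text(detail, encoding="utf-8")
    # Memory only needs to know where to look, not what the file says.
    return f"ACTIVE: {project}, details in {path.name}"


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    pointer = offload_to_file(
        "vrscomputing.co.uk",
        "placeholder research notes\n" * 20,
        tmp,
        "vrs-product-pipeline.md",
    )
    saved_chars = len(
        (Path(tmp) / "vrs-product-pipeline.md").read_text(encoding="utf-8")
    )

print(pointer)
print(f"{saved_chars} chars moved out of memory, {len(pointer)} chars kept")
```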

2. Added a self-enforcing rule

I added a rule to memory itself:

Memory rule: keep entries under 150 chars; details go in files, memory gets
a pointer. Never exceed 80% (1,760/2,200).

Making the rule part of memory means it’s injected into every session. The agent can’t “forget” the rule because it’s always visible.

3. Set a hard ceiling

Never exceed 80% utilisation. This gives ~440 chars of headroom for legitimate new entries. Before, I was running at 94% — essentially zero room for anything new. The agent would try to save something, hit the limit, and either fail or waste turns trying to find something to remove.
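Right now the rule is enforced only by the agent reading it. If I wanted the tool to enforce it, a pre-write check could look something like this. It's a sketch under my own assumptions: `check_write` is a hypothetical helper, and usage is counted as the plain sum of entry lengths.

```python
LIMIT = 2200        # MEMORY.md character limit
CEILING_PCT = 80    # never exceed 80% utilisation
MAX_ENTRY = 150     # per-entry rule: anything longer belongs in a file


def check_write(entries, new_entry, limit=LIMIT,
                ceiling_pct=CEILING_PCT, max_entry=MAX_ENTRY):
    """Return (ok, reason) for a proposed memory write."""
    if len(new_entry) > max_entry:
        return False, (f"entry is {len(new_entry)} chars, over {max_entry}: "
                       "put the details in a file and save a pointer")
    cap = limit * ceiling_pct // 100  # 1,760 for the default values
    used = sum(len(e) for e in entries) + len(new_entry)
    if used > cap:
        return False, f"write would reach {used}/{limit} chars, ceiling is {cap}"
    return True, "ok"


print(check_write(["a" * 1611], "b" * 100))   # fits under the ceiling
print(check_write(["a" * 1700], "b" * 100))   # would cross 1,760
```

The per-entry check runs first on purpose: an oversized entry should go to a file regardless of how much room is left.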

The general principle: memory tiers

This maps to a pattern I have seen in how the best agent frameworks handle long-term knowledge:

| Tier | What it stores | Size | Retention |
|------|----------------|------|-----------|
| Working context | Current conversation | Full until compressed | Per-session |
| Persistent memory | Pointers, preferences, rules | <2K chars | Permanent |
| Files | Research, plans, code | Unlimited | Permanent |
| Session archive | Full conversation history | Unlimited | Searchable |

The key insight: each tier should reference the tier below it, not duplicate it. Persistent memory should say "the VRS pipeline details are in vrs-product-pipeline.md", not repeat the pipeline details themselves.

This is similar to how MemGPT/Letta approaches memory — they use a tiered architecture with core memory (always loaded), archival memory (searchable on demand), and recall memory (conversation history). The difference is that Hermes’s approach is simpler and more explicit: memory is a curated list, not an LLM-managed black box.

Why this keeps happening to agents

This isn’t just a Hermes problem. The fundamental tension in agent memory is:

Agents want to remember everything — they don’t know what will be important later
Memory costs are front-loaded — every byte of memory is loaded into every prompt
Deletion feels risky — what if you remove the wrong thing?
Compression is lossy — summarising research notes loses the specificity that made them valuable

The result is a ratchet: memory only grows, never shrinks. Each session adds a little more. Each crash-recovery adds a lot more. And eventually you’re at 94% with zero room for anything new.

What I am considering next

The current fix — manual compression + self-enforcing rules — works. But it's fragile. I am thinking about:

Auto-pruning by age: Memory entries older than N sessions with no recent access get automatically compressed to pointers. This would require the memory tool to track access patterns.

Structured memory categories: Different char budgets for different entry types. User preferences get a permanent budget. Task bullets get a rotating budget. Research notes get zero budget — they go to files only.

Smarter flush_memories: Instead of letting the model save anything during pre-compression flush, filter by entry size. Entries under 150 chars go to memory. Anything longer gets redirected to a file automatically.
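None of this exists yet, so treat the following as a thinking-out-loud sketch of two of those ideas: routing flush candidates by size, and flagging entries that haven't been touched in a while. The `Entry` class, the `last_used_session` field, and the default `max_age` of 5 sessions are all assumptions of mine; the memory tool doesn't track access today.

```python
from dataclasses import dataclass

MAX_ENTRY = 150  # chars, from the memory rule above


@dataclass
class Entry:
    text: str
    last_used_session: int  # hypothetical access tracking


def route_flush(candidates, max_entry=MAX_ENTRY):
    """Split pre-compression flush candidates into (to_memory, to_file)."""
    to_memory = [c for c in candidates if len(c) < max_entry]
    to_file = [c for c in candidates if len(c) >= max_entry]
    return to_memory, to_file


def stale_entries(entries, current_session, max_age=5):
    """Entries not used in the last max_age sessions: candidates for pointers."""
    return [e for e in entries
            if current_session - e.last_used_session > max_age]


to_memory, to_file = route_flush(["prefers British spelling", "x" * 300])
print(len(to_memory), "to memory,", len(to_file), "to file")

old = stale_entries([Entry("debug notes", 1), Entry("journal port", 9)],
                    current_session=10)
print([e.text for e in old])
```

Size is a crude proxy for "research that belongs in a file", but it's cheap to check and needs no extra LLM call.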

The next layer: external memory providers

Here's where it gets interesting. My AI agent doesn't live in isolation. It's connected to:

VRS Computing — an e-commerce site selling AI-ready laptops and setup services
The Local AI Journal — this blog, running on Next.js
Ollama — local model server running on the same machine
Systemd services — keeping everything alive and restarted

Each of these has its own state, its own configuration, its own context that the agent needs. When Dade works on the journal, it needs to know the port (3001), the systemd service name, the npm commands. When it works on VRS, it needs product margins, distributor pricing, affiliate program details.

If all of that goes in MEMORY.md, I am back to 21K. If it goes in files, the agent has to remember which file to read — which means the pointer still takes up space, and the agent still has to spend a turn reading the file before it can act.

The fundamental problem is that persistent memory and limited context windows are in tension. I want the agent to know everything relevant, but I can't afford to inject everything relevant every turn.

Hermes has a plugin system that supports external memory providers sitting alongside the built-in MEMORY.md. Right now there are eight options:

Hindsight — Knowledge graph with entity resolution and semantic search. Can run locally.
Honcho — Cloud-based AI-native memory with dialectic Q&A.
Mem0 — Cloud or self-hosted, automatic memory extraction.
Holographic — Local-only, simpler setup.
OpenViking — Full bidirectional sync with external databases.
ByteRover, RetainDB, Supermemory — Various cloud options.

The key insight: semantic search changes the game. Instead of loading everything into context every turn, you only recall what's relevant to the current conversation. The agent asks "what do I know about VRS margins?" and gets back just the relevant facts — without loading the entire product pipeline file.

This is the difference between a filing cabinet you have to manually open and an assistant who already knows which documents matter for the conversation you're having right now.
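To show the shape of recall-on-demand, here's a toy sketch. Real providers use embeddings or a knowledge graph; I'm swapping in plain bag-of-words cosine similarity so the example runs with only the standard library. The `recall` function, its parameters, and the sample facts are illustrative, not any provider's API.

```python
import math
import re
from collections import Counter


def _vec(text):
    """Lowercased word counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recall(query, facts, top_k=2, min_score=0.1):
    """Return up to top_k facts that score at least min_score against the query."""
    q = _vec(query)
    scored = sorted(((_cosine(q, _vec(f)), f) for f in facts), reverse=True)
    return [fact for score, fact in scored[:top_k] if score >= min_score]


FACTS = [
    "VRS margins: see vrs-product-pipeline.md",
    "Journal runs on port 3001 under systemd",
    "Ollama serves local models on the same machine",
]

print(recall("what do I know about VRS margins?", FACTS))
```

Only the matching fact comes back. The journal and Ollama entries never enter the prompt, which is the whole appeal.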

Why I haven't turned it on yet

Honestly? The built-in memory works well enough for most sessions. The compression hack keeps things under control. And every external provider adds complexity:

Latency — every memory recall is an API call (or local inference)
Cost — cloud providers charge per query, local providers need compute
Privacy — sending conversations to a cloud service means trusting someone else with your data
Dependencies — more services running means more things that can break

My rig has 64GB of RAM and an RTX 5070 Ti. I could run Hindsight locally, pointing at my own Ollama instance for the LLM backend. No cloud costs, no privacy concerns. But that's another service to maintain, another systemd unit, another thing that can crash at 3am.

The trade-off that matters

Here's what I've learned from running an AI agent with persistent memory for weeks:

Small, focused memory beats large, broad memory. The 150-character rule works because it forces you to decide what matters.
Pointers are your friend. Every project gets a file. Memory gets a one-liner pointing to it.
Accept imperfection. The agent will occasionally save too much or too little. That's fine — you can always compress or expand.
The next step is semantic search, but it's not urgent. The current system works. When it starts hurting — when you're spending more time managing memory than doing work — that's when you upgrade.
One external provider is enough. Hermes enforces this. Don't try to run Honcho AND Hindsight. Pick one, commit, learn its edges.

I am currently setting up Hindsight in local embedded mode — running entirely on my own hardware, using my local Ollama models for entity extraction and indexing. If it works, MEMORY.md shrinks to identity + rules + pointers (~500 chars), and all the detailed context moves to semantic recall.

The beauty of this approach is that it's additive. MEMORY.md never goes away — it's always there as the fallback. Hindsight just adds a search layer on top. If Hindsight crashes, I lose recall, not knowledge. The files are still on disk. The pointers are still in memory.

That's the architecture principle: layers that fail independently, not one monolith that fails together.
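In code, that principle is little more than a try/except around the optional layer. This is a sketch with hypothetical names: `provider_recall` stands in for whatever call the external provider exposes, and `pointers` is the list of one-liners already in MEMORY.md.

```python
def recall_with_fallback(query, provider_recall, pointers):
    """Try the external provider; fall back to the built-in pointers on failure."""
    try:
        return {"source": "provider", "results": provider_recall(query)}
    except Exception:
        # Broad catch on purpose: any provider failure should degrade, not crash.
        # We lose recall, not knowledge. The files are still on disk.
        return {"source": "memory", "results": list(pointers)}


def broken_provider(query):
    raise ConnectionError("provider is down")


POINTERS = ["ACTIVE: vrscomputing.co.uk, details in vrs-product-pipeline.md"]

print(recall_with_fallback("VRS margins", broken_provider, POINTERS))
```

With the provider down, the agent still gets the pointer and can spend a turn reading the file, which is exactly how things work without a provider at all.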

The same principle applies to the websites. The Local AI Journal runs on Next.js with systemd restarting it if it crashes. VRS Computing runs on whatever stack makes sense. Each site owns its own data, its own deployment, its own failure domain. The agent connects them, but it never becomes a single point of failure for them.

The takeaway

If you're building with AI agents, persistent memory management is not optional. It's not enough to have a memory system — you need a memory hygiene system. Without one, your agent will slowly poison its own context with redundant information until it can't function.

The fix isn't bigger memory. It's smarter memory. Keep the pointers. Drop the details. Trust the files. And when you're ready, add semantic search — but only when the current system starts hurting.

My memory went from 94% utilised (2,081/2,200 chars, basically dead) to 73% (1,611/2,200 chars, with room to breathe). The agent works again. And I have a rule to stop it from happening next time.

Until the next time it forgets the rule, of course. But that's what making the rule part of memory is for.

Memory, like infrastructure, should be resilient by default. Not because it never fails, but because when it does, you can still get work done.
