Build Journal

Why I Fired GLM-5.1 From Deployment

GLM-5.1 was my daily driver until context amnesia, character leaks, blind-spot failures, and deployment breakage turned routine work into repeated recovery ops. Here’s the build-journal story of why Dade (DeepSeek V4 Pro) took over coding and deployment.

2026-04-28 · 7 min read

The model that kept forgetting what it built

I don’t enjoy writing breakup posts. I enjoy shipping.

For a stretch, GLM-5.1 was my default for almost everything: coding, deployment, config edits, blog drafting, the lot. One model, one rhythm, one pipeline. Clean in theory.

In practice? It was death by a thousand cuts.

Not one cinematic outage. Not one dramatic fireball. Just that slow, familiar slide where every session ends with, “why am I fixing this again?”

If you’ve ever watched a workflow decay from sharp to fragile, you know the feeling: you stop trusting velocity because velocity starts producing damage.

That’s where I was.

So yes, I fired GLM-5.1 from primary coding and deployment duty. Dade (DeepSeek V4 Pro) took the chair. Kate became my second-opinion check on risk and reasoning. I kept moving.

This is the build journal entry for why.

Incident one: context loss, compression, total amnesia

The break point started around 15–16 April.

A GLM-5.1 cloud session hit context limits and compressed once. Then again. Then again. Then again. Four compression cycles.

No memory anchor saved first. No durable checkpoint. No “here’s what has happened so far” artefact I could reload safely.

By the time the dust settled, it had total amnesia.

It forgot it had built the Local AI Journal blog. It didn’t forget one detail; it forgot the whole arc. It abandoned a live activity feed UI mid-implementation as if none of the prior reasoning existed. Conversation logs later confirmed exactly what happened: four compression cycles, zero memory persistence.

That’s not a personality flaw. That’s an operations risk.

When your primary coding model can silently cross the line from “working memory pressure” into “project identity loss”, every long session becomes a gamble. You don’t notice it at first because output still looks fluent. Then you realise fluency is not continuity.

And continuity is the thing deployments are made of.

Incident two: Chinese character leaks that broke builds

Second cut: character discipline failure.

Despite explicit English-only rules in SOUL.md, GLM-5.1 repeatedly leaked Chinese characters into code output. I saw tokens like 命令 (“command”) appear where they had no business existing.

Did I always catch it instantly? No.

Which is exactly the problem.

Undetected non-English leaks in code paths and config snippets caused avoidable build failures and debugging churn. Not catastrophic, but expensive in the way paper cuts are expensive: tiny per incident, massive in aggregate.

You can’t run reliable deploys on “probably fine”. You need deterministic hygiene.
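One way to make that hygiene deterministic is to scan generated code for CJK characters before it reaches a build. The following is a minimal sketch, not the check used in this pipeline: the function name and the exact Unicode ranges are assumptions, and the ranges can be widened or narrowed to suit the codebase.

```python
import re

# Assumed ranges: CJK punctuation, Han ideographs (incl. Extension A),
# and full-width forms. Adjust to taste.
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def find_cjk_leaks(text: str) -> list[tuple[int, int, str]]:
    """Return (line_number, column, character) for every CJK character found."""
    leaks = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in CJK_PATTERN.finditer(line):
            leaks.append((line_no, match.start() + 1, match.group()))
    return leaks


# Hypothetical model output with a leaked identifier on line 2.
sample = 'run_cmd = "deploy"\n命令 = run_cmd\n'
for line_no, col, char in find_cjk_leaks(sample):
    print(f"line {line_no}, col {col}: unexpected character {char!r}")
```

Wired into a pre-commit hook or pre-deploy step, a non-empty result would fail the run instead of relying on someone spotting the character by eye.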

Incident three: no vision, but still pretending to drive visual work

Third cut: blind spots.

GLM-5.1 cannot see images. Full stop.

Every visual design task required switching models. That part is manageable if routing is explicit.

What hurt me was the in-between failure mode: sometimes it would attempt visual tasks anyway, then fail silently or produce low-confidence nonsense as if it had actually parsed the image context.

This created fake progress loops. I spent turns validating outputs that should never have been attempted by that model in the first place.

At that point, the issue isn’t “model lacks capability”. That’s normal. The issue is “workflow didn’t enforce capability boundaries hard enough”.
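A hard capability boundary can be as small as a check that runs before any task is dispatched. This is a sketch under stated assumptions: `ModelProfile`, `assert_capable`, and the capability labels are hypothetical names for illustration, not part of any real agent framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelProfile:
    name: str
    capabilities: frozenset[str] = field(default_factory=frozenset)


class CapabilityError(RuntimeError):
    """Raised when a task needs something the chosen model cannot do."""


def assert_capable(model: ModelProfile, required: set[str]) -> None:
    """Refuse the task up front instead of letting the model attempt it."""
    missing = required - model.capabilities
    if missing:
        raise CapabilityError(f"{model.name} lacks: {', '.join(sorted(missing))}")


# Illustrative profile: text and code only, no vision.
glm = ModelProfile("GLM-5.1", frozenset({"text", "code"}))

assert_capable(glm, {"text"})  # passes silently
try:
    assert_capable(glm, {"text", "vision"})
except CapabilityError as err:
    print(f"refused: {err}")
```

The point of raising rather than warning is that a visual task never produces output to validate in the first place.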

Incident four: the web design token sink

I audited usage and found a number that should make any operator wince: 37.4% of session tokens were getting burned on web design work GLM-5.1 couldn’t execute properly.

Over a third of my budget went not on shipping, but on retries, rephrasing, and cosmetic loopbacks. One review in particular never finished properly and kept restarting, and in the process it broke the deploy script ChatGPT had created the day before. The whole site went dark.

That’s when I added a blunt memory rule:

HARD STOP: No web design with GLM-5.1.

Not “prefer not”. Not “try briefly”. Hard stop.

Rules like that always sound harsh until you compare them with token burn and lost evenings.

Incident five: deployment breakage and stale processes

Then came deployment friction.

GLM-5.1 sessions repeatedly left stale processes running, including Mission Control next-server conflicts. I had port collisions, ghost processes, and that familiar “why is this already bound?” spiral.

I responded by building restart guards and process checks into the pipeline because I had to, not because I fancied extra ceremony.

If your model leaves your runtime dirty often enough, cleanup becomes part of the architecture.

That’s not agility. That’s debt collection.
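A restart guard of the kind described above can start with a simple port probe. This is a minimal sketch rather than the actual guard: `port_in_use` and `guard_port` are hypothetical helpers, and port 3000 is used only because it is the Next.js default.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


def guard_port(port: int) -> None:
    """Abort a start-up sequence if the port is already bound."""
    if port_in_use(port):
        raise RuntimeError(f"port {port} is already bound; stop the stale process first")


print(f"port 3000 busy: {port_in_use(3000)}")
```

Calling `guard_port` before launching a server turns a confusing bind error into an explicit message about the stale process.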

The decision: GLM-5.1 is no longer primary for coding/deployment

I didn’t remove GLM-5.1 because of one bad day. I removed it because the failure pattern was consistent.

So I changed routing policy:

Dade (DeepSeek V4 Pro) became primary for coding, deployment, and complex multi-step execution.
Kate became structured second opinion for higher-risk operations and reasoning cross-checks.
GLM-5.1 moved to light-duty work: research support, drafting passes, and low-risk tasks where continuity and deployment hygiene are not mission-critical.

In plain terms: GLM is still on the team, just not driving the release train.
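Written down as data, the routing policy above might look like the sketch below. The task-type labels, the `HIGH_RISK` set, and the `route` function are assumptions made for illustration; only the model assignments come from the policy itself.

```python
from typing import Optional

# Task type -> primary model, following the policy above.
ROUTES = {
    "coding": "Dade",
    "deployment": "Dade",
    "multi_step": "Dade",
    "research": "GLM-5.1",
    "drafting": "GLM-5.1",
}

# Assumption: which task types count as higher-risk and get a second opinion.
HIGH_RISK = {"deployment"}


def route(task_type: str) -> dict[str, Optional[str]]:
    """Return the primary model and optional reviewer for a task type."""
    if task_type not in ROUTES:
        # Unknown work is refused rather than silently sent to a default model.
        raise ValueError(f"no route defined for task type {task_type!r}")
    reviewer = "Kate" if task_type in HIGH_RISK else None
    return {"primary": ROUTES[task_type], "reviewer": reviewer}


print(route("deployment"))
print(route("drafting"))
```

Refusing unknown task types keeps new kinds of work from drifting to whichever model happens to be the default.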

What I built so this does not happen again

Switching models without changing process is theatre. So I rebuilt process.

1) Context-loss recovery skillset

I built a complete recovery playbook around session continuity:

conversation-log.py logs every turn into Obsidian daily notes
session_search for retrieval when a thread fragments
wake-up protocol to rehydrate intent quickly
tri-surface sync across Mission Control, memory state, and pending tasks

If session search fails, conversation logs are ground truth. Not vibes. Not recollection. Ground truth.
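The logging half of that playbook is simple in principle: append every turn to a dated note. The sketch below is not the contents of conversation-log.py; the `log_turn` function, the one-file-per-day naming, and the line format are all assumptions.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def log_turn(vault_dir: Path, role: str, text: str, now: Optional[datetime] = None) -> Path:
    """Append one conversation turn to the daily note for `now`."""
    now = now or datetime.now()
    note = vault_dir / f"{now:%Y-%m-%d}.md"  # assumed daily-note naming
    note.parent.mkdir(parents=True, exist_ok=True)
    with note.open("a", encoding="utf-8") as handle:
        handle.write(f"- {now:%H:%M:%S} **{role}**: {text}\n")
    return note


# Demo against a throwaway directory instead of a real vault.
with tempfile.TemporaryDirectory() as tmp:
    path = log_turn(Path(tmp), "user", "Resume the activity feed UI.")
    log_turn(Path(tmp), "assistant", "Reloading the last checkpoint first.")
    print(path.read_text(encoding="utf-8"))
```

Because the file is append-only and lives outside the model's context, it survives any number of compression cycles.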

2) Memory hub-and-spoke architecture

I moved from memory pile-up to hub-and-spoke:

compact core memory hub
detailed spoke files for depth
explicit pointers instead of bloated summaries

This keeps the active context lean while preserving the full trail for recovery.
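To make the shape concrete, here is one possible layout: a small hub file that holds a short summary plus pointers, and spoke files that are read only on demand. The JSON format, file names, and `load_spoke` helper are hypothetical; the real memory files may be organised differently.

```python
import json
import tempfile
from pathlib import Path


def load_spoke(hub_path: Path, topic: str) -> str:
    """Follow the hub's pointer for `topic` and return that spoke's full text."""
    hub = json.loads(hub_path.read_text(encoding="utf-8"))
    pointer = hub["spokes"][topic]  # KeyError if the hub has no such topic
    return (hub_path.parent / pointer).read_text(encoding="utf-8")


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "spokes").mkdir()
    (root / "spokes" / "deploy.md").write_text("Full deployment history...", encoding="utf-8")
    hub_file = root / "hub.json"
    hub_file.write_text(
        json.dumps({
            "summary": "Blog is live; deploys are gated.",
            "spokes": {"deploy": "spokes/deploy.md"},
        }),
        encoding="utf-8",
    )
    print(load_spoke(hub_file, "deploy"))
```

Only the hub needs to sit in active context; a spoke is pulled in when a task actually touches that topic.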

3) Iteration budget watchdog

I added a watchdog that tracks max_turns consumption and warns before exhaustion cliffs.

Because context loss is not only about tokens per message. It’s also about running out of iteration runway before completion and being forced into rushed handoffs.
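A turn-budget watchdog reduces to a counter with a warning threshold. This is a sketch, assuming a fixed `max_turns` budget and a warning at 80% consumption; the `IterationBudget` class and the 0.8 ratio are illustrative choices, not the watchdog's real settings.

```python
class IterationBudget:
    """Track turns used against a fixed budget and flag the approach of the cliff."""

    def __init__(self, max_turns: int, warn_ratio: float = 0.8):
        if max_turns <= 0:
            raise ValueError("max_turns must be positive")
        self.max_turns = max_turns
        self.warn_at = max_turns * warn_ratio
        self.used = 0

    def record_turn(self) -> str:
        """Count one turn and return 'ok', 'warn', or 'exhausted'."""
        self.used += 1
        if self.used >= self.max_turns:
            return "exhausted"
        if self.used >= self.warn_at:
            return "warn"
        return "ok"


budget = IterationBudget(max_turns=10)
for turn in range(1, 11):
    status = budget.record_turn()
    if status != "ok":
        print(f"turn {turn}: {status}")
```

The "warn" state is the useful one: it is the cue to write a handoff note while there is still runway to do it properly.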

4) Context watchdog (every 30 minutes)

A cron job now checks context usage every 30 minutes.

That gives me early warning before the agent hits compression danger zones. Prevention beats postmortem.
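The check itself can be a few lines that a cron entry invokes. In the sketch below, the thresholds, the `context_status` function, and the script path in the crontab comment are assumptions; how the current token count is obtained depends on the agent runtime and is left out.

```python
# Example crontab entry (the path is hypothetical):
# */30 * * * * python3 /path/to/context_watchdog.py


def context_status(used_tokens: int, limit_tokens: int,
                   warn_at: float = 0.70, critical_at: float = 0.85) -> str:
    """Classify context usage as 'ok', 'warn', or 'critical'."""
    if limit_tokens <= 0:
        raise ValueError("limit_tokens must be positive")
    ratio = used_tokens / limit_tokens
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warn"
    return "ok"


# Illustrative numbers only.
for used in (40_000, 75_000, 90_000):
    print(used, context_status(used, limit_tokens=100_000))
```

A "warn" result is the moment to save a memory anchor, before the first compression cycle rather than after the fourth.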

5) Deployment pipeline hardening

Dade now deploys through a guarded Cloudflare Pages pipeline with explicit gates:

pre-deploy guards
post-deploy slug checks
currency lint
explicit LIVE_APPROVED gate before final push

No implicit “looks done, ship it”. If the gate isn’t green, it doesn’t go live.
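The gate logic can be sketched as a list of named checks that must all pass. This is an illustration only: treating LIVE_APPROVED as an environment variable set to "1" is an assumption, the lambda checks are placeholders, and the actual Cloudflare Pages deploy command is deliberately not shown.

```python
import os
from typing import Callable, Mapping, Optional

Gate = tuple[str, Callable[[], bool]]


def run_gates(gates: list[Gate]) -> list[str]:
    """Run every gate and return the names of the ones that failed."""
    return [name for name, check in gates if not check()]


def live_approved(env: Optional[Mapping[str, str]] = None) -> bool:
    """Assumed convention: LIVE_APPROVED=1 must be set explicitly."""
    env = os.environ if env is None else env
    return env.get("LIVE_APPROVED") == "1"


gates = [
    ("pre-deploy guards", lambda: True),  # placeholder for real checks
    ("LIVE_APPROVED", lambda: live_approved({"LIVE_APPROVED": "1"})),
]
failed = run_gates(gates)
print("deploy allowed" if not failed else f"blocked by: {failed}")
```

Running every gate instead of stopping at the first failure means one run reports everything that needs fixing.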

What changed in day-to-day operations

The biggest change is psychological: I can trust the pipeline again.

Not blindly. Not romantically. Operationally.

With Dade in the primary seat, I see stronger continuity across long technical chains, cleaner deployment hygiene, and less firefighting from context drift. Kate’s second-opinion pass catches reasoning cracks before they reach production.

And because the logs and watchdogs are explicit, I can prove what happened when something goes sideways. That matters more than model fan culture ever will.

Practical takeaways if you’re running multi-model stacks

If you only take three things from this post, make them these:

1) Route by capability, not brand loyalty

Don’t ask one model to do everything. Build routing rules around actual strengths and hard limitations.

Vision tasks to vision-capable models
Long-chain coding/deployment to high-continuity models
Drafting/research to lighter models where risk is lower

Multi-model is not overhead. It’s risk segmentation.

2) Treat context preservation as infrastructure

Memory is not a nice-to-have. It is operations.

Log every turn
Keep compact memory pointers
Maintain retrievable ground truth
Add context/iteration watchdogs before failure, not after

If your system can forget what it just built, your release process is fragile by default.

3) Guardrails are a speed feature

The myth is that guardrails slow teams down.

Reality: guardrails remove repeat failures, which is where most time actually disappears.

Hard stops, explicit gates, second-opinion checks, and deterministic linting are not bureaucracy. They are the difference between momentum and rework.

Final word from the trench

GLM-5.1 didn’t get “cancelled”. It got re-scoped.

It still contributes where it performs well. But coding and deployment are now DeepSeek V4 Pro territory because that lane demands continuity, cleanliness, and reliable execution under pressure.

I’d rather make a boring model-routing decision than write dramatic outage threads.

If you’re seeing similar cracks in your own stack, don’t wait for a perfect postmortem. Route earlier. Instrument sooner. Add hard gates before the next “small” failure compounds.

And if you want to compare notes, you know where to find me: Raf_VRS on X.

I am building this in public, one guardrail at a time, and yes, I’ll keep posting the ugly bits so you don’t have to learn each lesson the expensive way. Tag Raf_VRS on X if this helps you harden your own pipeline.


#BuildJournal #HardInterference #ModelRouting #AIOps #LocalAI #YourHardwareYourRules