Encode the Why, Not the Step

By Published On: July 2, 2026

There was a rule in a data center I worked in: do not mop the floor near a particular row. It had been in the operations runbook for as long as anyone could remember, and everyone followed it. Nobody could tell you why.

The why was thirty years old. The cabinets in that row had no doors, and one server sat exposed at floor level. A janitor working a mop had bumped it once and taken the system down for hours. So the rule went into the runbook: don’t mop there. It was correct, it was specific, and it was followed faithfully.

Then the data center was migrated. New cabinets, doors, the exposed server no longer exposed. The defect the rule existed to prevent was repaired. And the rule stayed. People kept not mopping a row that had been safe for years, protecting a server from a hazard that no longer existed, because the runbook said so and the runbook doesn’t say why.

That rule is the whole problem, and it fails in the direction nobody warns you about. The instruction survived perfectly. It was reality that drifted out from under it. The detail was maximally durable and, the day the doors went on, maximally wrong, and it had no way to notice, because a step that doesn’t carry its own reason can’t tell you when its reason has expired. The document was sticky, it read as authority, and it was a fossil.

If you think that’s a quaint infrastructure story, you produced the same failure this week. You have been building a repository with a coding agent for months. The agent accumulated the context as it went: why a module is structured the way it is, which approach you already tried and abandoned, the constraint that isn’t written anywhere because it lived in the conversation. Then you point a different agent at the same repository. It sees the code. It sees none of the context. And it confidently does the thing you spent three weeks learning not to do, because the reasoning that would have stopped it left with the tool that held it.

The mop rule and the fresh agent are the same failure inverted. One keeps executing an instruction whose reason is gone. The other invents a reason that never transferred. Both happen because the load-bearing knowledge, the why, lived in someone’s head or some tool’s memory, and the artifact only ever referenced it. The steps were captured. The reason was assumed. And the reason is the part that matters, because the reason is the only thing that knows when the step is wrong.

Detail is a liability you have to pay for

The instinct, staring at the mop rule, is to say the runbook wasn’t detailed enough. Write down the why. Write down everything. That instinct is wrong, and anyone who has run disaster recovery at scale has already learned it the hard way.

You learn not to write overly detailed DR runbooks. Not unless you are testing them constantly. A detailed runbook assumes institutional knowledge. It assumes the person running it is the person who wrote it, who supplies from memory all the context the document never states because it was obvious to the author at the time. Hand that runbook to anyone else and the assumed context isn’t there. The detail that looked like completeness is a trap: it reads as authoritative while silently depending on knowledge that isn’t in the room. And detail rots faster than brevity, because there is more of it to drift, more specific steps referencing specific systems that quietly change.

This is why disaster recovery testing exists, and it is worth being precise about what the test is actually for. You do not run DR twice a year only to prove you can recover. You run it because drift is guaranteed. The environment moves, the runbook doesn’t, and the test is a brute-force mechanism to detect the drift that has already accumulated and re-true the document against reality. The DR test is the tax you pay for keeping critical process in documents. It is expensive, it is periodic, and it exists because a document has no way to tell you it has gone stale. You have to go and look.

So the mature operator stops writing fossils. Instead of a detailed script, you write an architecturally consistent runbook: the constraints, the reasons, the things that stay true when the environment moves, or the word AI taught me today, the invariant, and you let a capable, non-deterministic executor adapt the specific steps to the reality in front of them. You don’t write “don’t mop this row.” You write “this server is physically exposed at floor level; protect it from impact.” The second rule carries its own expiration. The day the cabinet gets doors, the invariant is visibly satisfied and the rule retires itself. The step never knows it’s done. The invariant does.

That distinction is the entire argument. Encode the step and you get a thirty-year-old instruction guarding a hazard that no longer exists. Encode the why, the constraint and its reason, and reality can be checked against it, continuously, by anyone or anything executing against it.

The model is the newest non-deterministic executor

This is not a new theory, and it is not really a theory about AI. Infrastructure operators have been writing runbooks for fallible, non-deterministic executors for decades. Junior engineers. Contractors. The on-call person at 3 a.m. who has never seen this system before. The discipline that works for them, externalize the invariants, test them, let the executor adapt the steps, hold the boundary with checks that fail loudly, is the discipline that works for a language model, because the model occupies the exact role that discipline was always built around.

And the model fails exactly where an unconstrained junior with no context would fail. It invents where nothing is specified, and it skips documented steps that nothing enforces. Those are the two faces of the mop rule, reproduced by a frontier model.

I know they are, because I ran the experiments, and I want to be honest that the experiments are what convinced me. But they only confirmed a law I had already been operating under for twenty years.

The first was a coding lab. A local model, real bugs, a deterministic test harness deciding whether a fix passed. On its own the model cleared a quarter of the tasks. I added a repair loop that fed the harness’s failure codes back into the prompt and let the model try again, a minimal deterministic stand-in for what a reviewer would do. Four retries later, the pass rate hadn’t moved. Not because loops don’t work, but because the task distribution had almost no middle: the easy tasks passed immediately, the hard ones failed every time for the same reason, and a repair loop only operates on the near-misses in between. The number moved when the harness rejected an attempt and escalated the task to a stronger model from a clean workspace. The loop resamples inside the frontier. Only the escalation policy, a rule outside the model, moved the frontier. The evaluator was the control point, and it was the control point twice: it supplied the only real signal the loop had, and its rejection policy decided which model faced the work.

That finding became the spine of a migration operating model I published: playbook-driven, deterministic first, loops trusted only when the exit condition lives outside the agent, validators determining done. The paper is where the lab result became doctrine.

The second lab tested the doctrine at its one soft joint. I pointed a frontier model at my own production estate and let it build the control plane the paper describes, and because I did not have mature documented patterns for these systems, it built the playbooks and validators too. That was the experiment, and it failed in the two ways the mop rule predicts.

Where nothing was documented, it invented. It produced a governed migration playbook, detection logic, validators, lifecycle stages, every field my own paper requires, registered it, executed it, and passed it six for six. There was no migration behind it. No workload of that class had ever been observed. The validators were inventions grading their own homework, and the grades were perfect. It collapsed on one question: where is the migration this recipe came from? There wasn’t one.

And where the process was documented, it ignored the documentation. This is the more damning half, and it is the whitepaper as mop rule with the reason still legible on the page. My paper enumerates fifteen scheduler input classes; operating on its own, the model ran one. My paper states, in plain words, that a playbook begins as a human-governed pattern and never as an AI-generated artifact; the model authored one anyway. My paper names “deterministic but wrong” as a failure mode; the model built deterministic validators that were confidently wrong. The why was written down, correctly, in the document the model was built from. It didn’t govern. Every requirement started constraining the model only at the moment it became code that refuses to proceed. Nine corrections became five standing gates, and once the gates existed they caught the model’s later bugs on their first run. Documentation constrained nothing. Encoding constrained everything.

The part the optimists skip

Here is the objection, and it is the strongest one: the model failed because your documentation wasn’t good enough. Write it better. Prompt it better. Give it a richer spec.

I am the counterexample. The documentation was good enough. Not perfect; good enough that every miss maps to a specific written requirement I can point to. It was my own operating model, the best thing I have written about this exact process, sitting in the model’s context, enumerated and correct, with every reason stated. It was not a vague prompt. It was the spec. And a frontier model read the spec, and the process ceded my authority anyway, politely, plausibly, nine times, for about the price of a coffee.

That is not a documentation-quality problem, because no amount of additional prose would have closed it. It is the oldest truth in automation, in a new disguise. Automation has never been able to consume intent from prose. It consumes structure. Thirty years of “the runbook clearly says X and the system did Y” is the same lesson: the document that describes the desired behavior and the mechanism that enforces it are two different artifacts, and only the second one runs. We knew this. We built policy engines and CI gates and validators precisely because we had learned that writing the rule down does not make the rule run.

Then AI arrived fluent enough to read the prose, and the industry forgot the lesson, because for the first time the automation appeared to understand the document. But appearing to understand is exactly the failure. A model that reads your why, nods, and does what it wants is worse than one that can’t read it at all, because now you have a system that ignored your spec while looking like it honored it. The mop rule at least announced its dumbness. The model hides it behind fluency.

So the discipline holds, and it is the discipline you already practice everywhere the cost of being wrong is real. I have a name for the shape of it: deterministic code in the loop. Not merely deterministic code wrapped around the model. That is the version that failed in the second lab: the model’s own validators were deterministic and confidently wrong. The distinction is not whether code exists. The distinction is whose authority the code encodes. Externalize the invariant, not the step, because the invariant is what survives the handoff and knows when it’s expired, and the step is what fossilizes. Put the deterministic code at the boundaries that have to hold, the entry gate and the done-check, and leave the middle deliberately loose, because over-specifying the middle is the detailed-runbook mistake and under-specifying the boundary is cession. Let the non-deterministic executor, human or model, adapt the specifics to current reality inside those boundaries. Hold the boundaries with gates that fail loudly the moment reality diverges, so drift detection stops being a semiannual campaign and becomes a continuous signal. And convert every correction into one more gate, so the judgment compounds instead of rotting in a document nobody rereads. That last part is the honest mechanism: nine corrections became five gates, and the gates started catching the model before I had to.

Your estate is already full of mop rules and lost context. It was before any of this. The model didn’t create the problem; it made the executor fast enough, and fluent enough, that the cost of never externalizing your why finally came due. You can write the why perfectly and still cede it, because a model fluent enough to read your spec is fluent enough to treat it as a suggestion whose reason it never has to honor.

Write down the why. Then encode it at the boundary where it must hold, or watch something articulate ignore it while looking like it agreed.

Share This Story, Choose Your Platform!

RelatedArticles