Engineering

When the cluster was ahead of the code

June 28, 2026 · 7 min read

The Terraform plan said our deploy would halve the gateway's memory and roll the live version backward. We had not changed either setting.

TL;DR

When a terraform plan comes back full of reverts, the cluster is ahead of the code: someone applied changes straight to the live environment and never committed them. That is Terraform drift, and a clean apply on top of it is a silent downgrade. The fix is not a command. Stop, capture every live change in the repo, decide per change whether it stays, and only then deploy.

The plan that pointed backward

Before applying anything, I ran a plan against dev, scoped to the one module we had changed. I expected to watch the new change go in. Instead I got three changes, and every one was pointing backward. A version string the live cluster was already serving would drop back a release. The gateway's memory would be cut in half. A stack of web-server settings (TLS, some rate-limiting, a few header rules) would revert to an older shape.

# terraform plan, scoped to the one module we changed
~ resource "app_service" "gateway" {
    ~ image_version = "2.4.0" -> "2.3.0"   # live is AHEAD of the repo
    ~ memory_mb     = 2048    -> 1024       # the plan would halve it
  }
~ resource "web_server" "edge" {
    ~ tls_min_version = "1.3" -> "1.2"
    ~ rate_limit_rps  = 200   -> 100
  }

A plan is supposed to move the cluster toward the code. Terraform's own workflow treats plan as the preview of the changes needed to make remote objects match the configuration, which is exactly why this output mattered. Put another way, a plan is a diff between two claims about reality: what the code says is true and what the cluster says is true. When that diff is all reverts, the code has stopped being the source of truth. The cluster was ahead of the repo: someone had been applying changes straight to the live environment, and those changes never made it back into the code. The plan was the first thing honest enough to say so.
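The "all reverts" signal can be checked mechanically. Below is a minimal sketch, assuming the plan has been saved and exported with terraform show -json, whose output lists each resource under resource_changes with before and after values. The function names and the sample data are hypothetical and mirror the plan above; treating "lower" as "backward" is a heuristic of this sketch, not something Terraform reports.

```python
import re

def parse_version(value):
    """Return a tuple of ints for dotted strings like '2.4.0', else None."""
    if isinstance(value, str) and re.fullmatch(r"\d+(\.\d+)+", value):
        return tuple(int(part) for part in value.split("."))
    return None

def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def backward_changes(plan):
    """List (address, attribute, live, planned) where the plan would lower a value."""
    found = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if "update" not in change.get("actions", []):
            continue
        before = change.get("before") or {}  # what is live now
        after = change.get("after") or {}    # what the code would set
        for attr, live in before.items():
            planned = after.get(attr)
            live_v, planned_v = parse_version(live), parse_version(planned)
            if live_v and planned_v:
                lowered = planned_v < live_v
            elif is_number(live) and is_number(planned):
                lowered = planned < live
            else:
                lowered = False
            if lowered:
                found.append((rc["address"], attr, live, planned))
    return found

# Hypothetical sample shaped like the plan in this post.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "app_service.gateway",
         "change": {"actions": ["update"],
                    "before": {"image_version": "2.4.0", "memory_mb": 2048},
                    "after": {"image_version": "2.3.0", "memory_mb": 1024}}},
        {"address": "web_server.edge",
         "change": {"actions": ["update"],
                    "before": {"tls_min_version": "1.3", "rate_limit_rps": 200},
                    "after": {"tls_min_version": "1.2", "rate_limit_rps": 100}}},
    ]
}

for address, attr, live, planned in backward_changes(SAMPLE_PLAN):
    print(f"{address}.{attr}: live {live!r} would drop to {planned!r}")
```

A lowered value is not always a mistake (someone may have reduced memory on purpose), so this only flags candidates for a human to look at.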

Staging told the same story with a different cast. Its hand-applied tweaks were not dev's tweaks: each environment carried its own private notion of "latest," and neither was written down. The only thing they agreed on was that the codebase was behind both.

Where the Terraform drift came from

Some of it I could trace. A version-bump PR had been merged for two of the three environments and quietly skipped on the third, so the repo claimed one version and the cluster ran another. The rest was not in any branch at all: applied live, by hand, for testing, with an "I'll raise the PR after" that never came. Most of it was ordinary: applying to a cluster is faster than committing to the repo, and without a closed loop the gap compounds quietly.

This is the same operational shape that makes home-server work either calm or brittle: the running box can become the only copy of the truth unless the setup is written down and reproducible. I ran into a smaller version of that in Trusting a home server: a local model, monitoring, and backups, where the boring parts mattered because they were the parts that let the machine be rebuilt.

Why a clean apply is the dangerous case

Here is the part that makes drift dangerous. If I had done the textbook thing, pull latest and apply it the way you are supposed to, I would have rolled the live environments backward. The deploy would have run clean, reported success, and silently undone work that was live and working. A green apply that is secretly a downgrade is far worse than one that fails, because a failure at least tells you something is wrong. A green downgrade tells you nothing.

The honest move was to stop

So I stopped. I was not going to re-apply changes I could not see in the repo. With no visibility into which live tweaks were deliberate and which were stale, reconciling blind is how you ship the wrong one. The move was to push back, get the live changes captured in code, decide per change whether each one stays, and only then deploy. Not on top of drift, and not into production on deploy day.

The harder part is the reconciliation itself. To distinguish intended live changes from stale ones, I compared the plan output against recent deploy history and any tickets or PRs that touched those resources. For each drifted setting, the question is whether the repo should be updated to match the cluster (the change was real and should stay) or whether the cluster should be reverted to match the repo (the change was transient or wrong). Terraform's refresh-only mode (the -refresh-only option on plan and apply) is useful here: the plan shows how live objects differ from recorded state, and the apply records that in state, all without touching the cluster. It does not tell you which side is correct. That judgment is still manual. Scoped plans help narrow the surface area, but they can obscure cross-module dependencies: a setting that looks isolated in one module may be consumed by another, so scope carefully and verify the full dependency graph before deciding anything is safe to keep or drop.
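The per-change decision can be written down as a small ledger so nothing is reconciled blind. This is a rough sketch and not a tool I used: the class names, fields, and the rule that a decision needs evidence are all assumptions made for illustration, and the evidence strings are placeholders.

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    UNDECIDED = "undecided"
    KEEP_LIVE = "update the repo to match the cluster"
    REVERT = "let the apply revert the cluster"

@dataclass
class DriftItem:
    address: str
    attribute: str
    live_value: object
    repo_value: object
    decision: Decision = Decision.UNDECIDED
    evidence: str = ""  # ticket, PR, or deploy-history note backing the decision

def blockers(items):
    """Items that still block a deploy: undecided, or decided without evidence."""
    return [
        item for item in items
        if item.decision is Decision.UNDECIDED or not item.evidence.strip()
    ]

# Hypothetical ledger for the drift shown in this post.
ledger = [
    DriftItem("app_service.gateway", "memory_mb", 2048, 1024,
              Decision.KEEP_LIVE, "placeholder: link to the ticket that raised it"),
    DriftItem("web_server.edge", "rate_limit_rps", 200, 100),
]

for item in blockers(ledger):
    print(f"blocked on {item.address}.{item.attribute}: {item.decision.value}")
```

The only point of the structure is that "decided" and "decided with a reason someone else can check" are different states, and only the second should unblock a deploy.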

What the model did, and what it did not know

I will be honest about how much of this I now hand to an LLM. Running the scoped plan, reading back a thousand-line diff, walking all three environments and laying out which setting drifted where and in which direction: that used to be a careful afternoon. With a model driving it, it was a few minutes, and it surfaced the material drift quickly. The mechanical work has genuinely changed.

But the model would also have cheerfully applied that plan. Ask it to deploy and it deploys. Nothing in the output says "this is a trap." Knowing that a plan full of reverts means the cluster is ahead of the code, that a clean apply here is a silent downgrade, that the move is to stop and reconcile rather than proceed: none of that came from the tool. It came from having been bitten by drift before and from knowing how these environments actually fit together. That is the part people keep assuming the model has. Frontier models will do an enormous amount of the work, but you still need someone who understands the system well enough to know which confident outputs to refuse. I wrote about that boundary from a product angle in One Policy Gate for an Autonomous Agent: the useful part is not that the agent can act, it is knowing where action needs a gate. The model can run the system. It does not yet understand what the system means.

A change is not done until the repo reproduces it

Terraform documents that the default plan mode compares the current configuration with prior state and remote objects, then proposes the changes needed to match the configuration. When that diff is all reverts, the code has quietly stopped being the source of truth without anyone deciding that. The fix is not a flag or a command. It is holding the line that a change is not done when the cluster works. It is done when the repo would reproduce the cluster.

FAQ

What is configuration drift?

Configuration drift is when the running system no longer matches what its code says it should be, because changes were applied directly to the live environment and never committed back. The repo stops being the source of truth.

How do I know my cluster has drifted?

Run a plan scoped to what you think you changed. If the plan proposes reverts (lowering a version, halving memory, undoing settings) instead of just your change, the cluster is ahead of the code. A plan-only or drift-detection run on a schedule catches it before deploy day.
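A scheduled check can lean on terraform plan -detailed-exitcode, which exits 0 when the plan is empty, 2 when it contains changes, and 1 on error. The wrapper below is a minimal sketch: the function names and the workdir argument are hypothetical, and it assumes terraform is on the PATH and the directory is already initialized.

```python
import subprocess

def classify_exit_code(code):
    """Map the -detailed-exitcode result to a drift status."""
    if code == 0:
        return "in_sync"  # plan succeeded and is empty
    if code == 2:
        return "differs"  # plan succeeded and contains changes
    return "error"        # plan itself failed

def check_drift(workdir):
    """Run a plan without applying it and report whether code and cluster agree."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(result.returncode), result.stdout
```

Run on a branch with no pending changes, a "differs" result means the live environment and the repo disagree, which is the thing to find out before deploy day rather than on it.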

Why is a clean terraform apply dangerous when there is drift?

A clean apply succeeds by making the cluster match the code. If the code is behind, that "success" means silently rolling back live, working changes. A failed apply at least tells you something is wrong; a green downgrade tells you nothing.

How do you fix drift safely?

Do not apply over it. Capture every live change in code first, review each one to decide whether it should stay, then reconcile so the repo reproduces the cluster. Only deploy once the plan shows just your intended change.
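"The plan shows just your intended change" can be enforced as a gate. A minimal sketch, again assuming plan JSON from terraform show -json; the function name, the allow-list shape, and the sample values (including the "2.5.0" bump) are hypothetical.

```python
def unexpected_changes(plan, intended):
    """Return (address, attribute) pairs the plan would change that are not in `intended`."""
    surprises = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if change.get("actions", []) in (["no-op"], ["read"]):
            continue
        before = change.get("before") or {}
        after = change.get("after") or {}
        # Compare every attribute present on either side of the change.
        for attr in sorted(set(before) | set(after)):
            if before.get(attr) != after.get(attr) and (rc["address"], attr) not in intended:
                surprises.append((rc["address"], attr))
    return surprises

# Hypothetical: the only change we meant to ship is a version bump.
INTENDED = {("app_service.gateway", "image_version")}

SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "app_service.gateway",
         "change": {"actions": ["update"],
                    "before": {"image_version": "2.4.0", "memory_mb": 2048},
                    "after": {"image_version": "2.5.0", "memory_mb": 1024}}},
    ]
}

surprises = unexpected_changes(SAMPLE_PLAN, INTENDED)
if surprises:
    print("refusing to apply, unexplained changes:", surprises)
```

In this sample the version bump passes and the memory change is reported, so the deploy stops until someone explains it.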