When Bot Mitigation Is Just Observation: Hardening Meridian Without Bot Management
I was staring at a page that should have been boring: a store-locator lookup URL returning 400s for some users while the public browsing path looked clean from my machine. The awkward part was not that Cloudflare was involved. The awkward part was that our Cloudflare bot mitigation posture was mostly observation dressed up like defense. We fixed the immediate risk by moving narrow custom WAF rules from log-only toward scoped blocking, preserving auth and booking carve-outs, blocking one noisy Baidu-Netcom /24, watching a Chrome/133 user-agent pattern as rule C-11 took hold, and running rolling telemetry for block rate, challenges, 504s, and false positives.
The useful argument is simple: when you do not have bot-management signals, broad bot mitigation becomes a safety problem before it becomes a security problem. You can still reduce damage, but only if you treat every rule as a surgical instrument with a recovery plan, not as a moral statement about bad traffic. The goal is not to defeat automation in one dramatic deploy. The goal is to stop the parts you can identify without teaching checkout, login, and store affiliation to hate your customers.
Where bot mitigation starts with admitting what you cannot see
The incident had a familiar shape. Traffic hit Cloudflare first. Some of it was filtered there, then the rest reached the site, where the application and downstream systems took over. China was high in the traffic list, but not because one villainous IP was leaning on the doorbell. We were seeing a broad automation swarm: many IPs, rotating user agents, requests shaped well enough to avoid the obvious rate-limit buckets.
Rate limiting was the tempting knob. It was also the wrong main control. A per-IP limiter works when the attacker has a concentrated source, a sloppy loop, or a predictable endpoint cadence. This swarm had none of those. The pressure was smeared across enough addresses that each single IP could look merely annoying rather than abusive. That is the small tragedy of distributed automation: it makes the dashboard look busy while keeping each cell just polite enough.
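To make that concrete, here is a minimal sketch of why a per-IP threshold goes quiet against a spread-out swarm. The limit, the request counts, and the source labels are all made-up values for illustration, not measurements from this incident.

```javascript
// Hypothetical numbers for illustration only; not data from the incident.
const PER_IP_LIMIT = 30; // assumed per-IP requests allowed in one window

// Count requests per source within one window.
function countByIp(requests) {
  const counts = new Map();
  for (const { ip } of requests) {
    counts.set(ip, (counts.get(ip) ?? 0) + 1);
  }
  return counts;
}

// Return the sources a simple per-IP limiter would flag.
function flaggedIps(requests, limit = PER_IP_LIMIT) {
  return [...countByIp(requests)]
    .filter(([, count]) => count > limit)
    .map(([ip]) => ip);
}

// Concentrated source: one address sends all 3,000 requests.
const concentrated = Array.from({ length: 3000 }, () => ({ ip: "198.51.100.7" }));

// Distributed swarm: the same 3,000 requests spread over 500 sources (6 each).
// "swarm-N" is a synthetic label, not a real address.
const distributed = Array.from({ length: 3000 }, (_, i) => ({ ip: `swarm-${i % 500}` }));

console.log(flaggedIps(concentrated).length); // 1
console.log(flaggedIps(distributed).length); // 0
```

Same total load, zero flags. Lowering the limit far enough to catch six requests per source would also catch ordinary people retrying a page.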
The account also did not have Cloudflare Bot Management. That matters. Without bot score, verified bot categories, JavaScript detections tied to richer behavioral signals, and related fields, you are left with HTTP facts: country, ASN, IP ranges, paths, user agents, headers, request rates, response codes, and whatever your logs can prove. Those facts are useful. They are also blunt. If the proxy chain itself is muddy, the first job is preserving evidence; I wrote about that narrower problem in Preserving the Real Client IP Through a Proxy Chain That Rewrites the Evidence.
The traffic path is plain because the system was plain. The hard part was not architecture. It was deciding where we had enough evidence to block.
The rule that helped was the rule that stayed small
The most important rule in the set was C-1a, not because it was clever, but because it respected blast radius. It blocked a scoped slice of bad traffic while preserving explicit carve-outs for authentication and booking. In particular, the real checkout flow for China stayed protected. That mattered because one of the live reports was not just general browsing. A logged-in learning flow opened the store locator and produced a 400 during store affiliation. If a mitigation breaks identity or booking, it becomes part of the outage.
The working pattern looked like this:
| Control | Action | Why it was acceptable |
|---|---|---|
| C-10 | Log only after rescope | The signal was useful but not clean enough to block broadly |
| C-1a | Block with auth and booking carve-outs | High-confidence enough, with protected customer paths |
| Baidu-Netcom /24 range | Block | Concentrated abusive fleet with low legitimate value for this flow |
| C-11 | Watch then block UA-133 pattern | Needed live validation while the rule propagated |
The practical Cloudflare expression was closer to this shape than to a heroic one-liner:
(
ip.geoip.country eq "CN"
and http.request.uri.path contains "/stores/nearby"
and not http.request.uri.path contains "/booking"
and not http.request.uri.path contains "/auth"
and not http.host contains "auth.meridian.example"
)
That is deliberately incomplete as a production rule because the real version depends on account fields, hostnames, and rule names. The point is the structure: match the abusive surface, subtract the paths where false positives are expensive, then promote only when telemetry agrees.
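One way to keep that structure honest is to generate the expression from a single list of carve-outs and dry-run the same logic locally before touching the dashboard. The sketch below is an assumption about how that could look, not the tooling we ran. `buildScopedExpression` and `wouldMatch` are hypothetical helpers, `www.meridian.example` is an invented hostname, and the local matcher only mirrors the substring checks; it is not Cloudflare's evaluator.

```javascript
// Scope values copied from the example expression in this post.
const scope = {
  country: "CN",
  targetPath: "/stores/nearby",
  excludedPaths: ["/booking", "/auth"],
  excludedHosts: ["auth.meridian.example"],
};

// Build the rule expression: match the abusive surface, then subtract carve-outs.
function buildScopedExpression(s) {
  const clauses = [
    `ip.geoip.country eq "${s.country}"`,
    `http.request.uri.path contains "${s.targetPath}"`,
    ...s.excludedPaths.map((p) => `not http.request.uri.path contains "${p}"`),
    ...s.excludedHosts.map((h) => `not http.host contains "${h}"`),
  ];
  return `(\n  ${clauses.join("\n  and ")}\n)`;
}

// Local approximation for dry runs against sample requests (hypothetical helper).
function wouldMatch(req, s) {
  return (
    req.country === s.country &&
    req.path.includes(s.targetPath) &&
    !s.excludedPaths.some((p) => req.path.includes(p)) &&
    !s.excludedHosts.some((h) => req.host.includes(h))
  );
}

console.log(buildScopedExpression(scope));

// Sample requests; the www hostname is invented for the example.
const samples = [
  { country: "CN", host: "www.meridian.example", path: "/stores/nearby" },
  { country: "CN", host: "www.meridian.example", path: "/booking/stores/nearby" },
  { country: "CN", host: "auth.meridian.example", path: "/stores/nearby" },
];
console.log(samples.map((req) => wouldMatch(req, scope))); // [ true, false, false ]
```

Because the carve-outs live in one list, adding a protected path changes both the generated expression and the dry run, so the two cannot drift apart.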
"the account was watching the bots, not stopping them"
That sentence was uncomfortable because it was accurate. It also stopped the conversation from drifting into fake certainty. Observation is not useless. It is how you earn enforcement. But observation sold as mitigation is how teams end up surprised when the bill, origin load, or user-facing errors keep moving.
The commands mattered because memory lies
We did not rely on a person refreshing dashboards and narrating vibes into chat. We used periodic scripts from the Cloudflare working directory to sample recent windows, append trend logs, and compare against a rolling baseline.
cd <local work directory>
node .codex_tmp/cf_trend_cycle.mjs 30
For the Chrome/133 watch tied to C-11:
cd <local work directory>
node .codex_tmp/_ga133monitor.mjs 30
For an explicit token-based review cycle in environments where the token was not auto-read:
cd <local work directory>
export <environment-based API credential>
node .codex_tmp/cf_review_cycle.mjs 15
The checks were boring by design:
| Watch | Healthy direction | Why it mattered |
|---|---|---|
| Block percentage | Not spiking far above about 4 percent | A sudden jump could mean a broad false positive |
| Challenge percentage | Flat or falling | Challenge growth can hide user pain |
| 504 percentage | Not climbing above about 15 percent | Origin pressure was part of the risk |
| UA-133 served 200 | Trending toward roughly 0 | The C-11 block was taking hold |
| Booking and SSO hits | Staying servable | Real shoppers and logged-in users must survive |
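The table can be written down as a small window review. This is a sketch under assumptions, not the contents of the scripts above: the sample shape (`total`, `blocked`, `challenged`, `status504`, `ua133Served200`, `bookingSsoBlocked`) is hypothetical, the sample numbers are invented, and treating "about 4 percent" and "about 15 percent" as hard cutoffs is a simplification.

```javascript
// Cutoffs taken from the watch table; treating them as hard limits is an assumption.
const THRESHOLDS = { maxBlockPct: 4, max504Pct: 15 };

function pct(part, total) {
  return total === 0 ? 0 : (part / total) * 100;
}

// Compare the current sampled window against the previous one and list warnings.
function reviewWindow(current, previous) {
  const warnings = [];
  if (pct(current.blocked, current.total) > THRESHOLDS.maxBlockPct) {
    warnings.push("block rate above threshold: possible broad false positive");
  }
  if (pct(current.challenged, current.total) > pct(previous.challenged, previous.total)) {
    warnings.push("challenge rate rising");
  }
  if (pct(current.status504, current.total) > THRESHOLDS.max504Pct) {
    warnings.push("504 rate above threshold: origin pressure");
  }
  if (current.ua133Served200 > previous.ua133Served200) {
    warnings.push("UA-133 served-200 count not falling");
  }
  if (current.bookingSsoBlocked > 0) {
    warnings.push("booking or SSO requests were blocked");
  }
  return warnings;
}

// Invented sample windows for illustration.
const previous = { total: 10000, blocked: 300, challenged: 200, status504: 900, ua133Served200: 120, bookingSsoBlocked: 0 };
const current = { total: 10000, blocked: 350, challenged: 180, status504: 800, ua133Served200: 15, bookingSsoBlocked: 0 };

console.log(reviewWindow(current, previous)); // []
```

An empty list only means nothing drifted in the fields that were sampled.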
The dullness was the feature. A rolling script will not notice everything. One teammate correctly pointed out that challenges on another property had been visible in a real browser even though the periodic run did not flag them. That is a limitation, not a footnote. Synthetic telemetry samples what you told it to sample. A browser catches the messier truth: cookies, redirects, interactive challenges, cached assets, and flows that only fail after the second click.
Important
A mitigation loop without live-browser validation is a dashboard exercise. It can still miss the user path that is currently on fire.
So the validation loop had two sides. The script told us whether the system was drifting. The browser told us whether a person could still complete the flow.
Why not block the swarm by country
Country blocking was the obvious bad idea. It would have made the charts calmer and the business worse. The traffic distribution showed China near the top because the swarm was present there, but legitimate flows also existed. The store locator, learning affiliation, checkout, and authentication paths were not decorative. Blocking a geography because it appears in an abuse report is a satisfying way to create a different incident.
The second rejected option was making rate limits more aggressive. That might catch the clumsiest slice of automation, but it would also punish shared networks, VPNs, corporate egress, mobile carriers, and any real user unlucky enough to retry a broken flow. The attack was already shaped to avoid the obvious thresholds. Tightening the thresholds would mostly move pain toward humans.
The third rejected option was immediately promoting every suspicious log rule to block. C-10 is the example I would keep taped to the monitor. It was re-scoped and left in LOG after AWS and Azure were removed from its target set. That was not indecision. It was the correct answer when a rule still carries enough ambiguity to hurt high-volume consumer ISPs, VPNs, or legitimate cloud-hosted paths.
Deep-dive: The promotion checklist
Before moving a rule from LOG to block, I want five things visible in the same place:
- The request surface is narrow enough to describe in one sentence.
- The likely legitimate paths have explicit exclusions.
- The last 15-30 minutes do not show obvious consumer false positives.
- A browser can complete auth, booking, and the affected product flow.
- A rollback is one rule edit, not a meeting.
That checklist is intentionally small. If I need a long essay to justify the rule, I probably do not understand it well enough to block with it.
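The five checks above translate almost directly into a gate. The following is a minimal sketch, assuming the reviewer records the evidence by hand; every field name and the sample evidence are hypothetical, and the one-sentence test is a crude heuristic rather than a real parser.

```javascript
// Evaluate the promotion checklist against evidence recorded by a reviewer.
function readyToPromote(evidence) {
  const sentences = evidence.surfaceDescription.trim().split(/[.!?]\s+/);
  const checks = {
    // 1. The request surface fits in one sentence (crude heuristic).
    narrowSurface: evidence.surfaceDescription.trim().length > 0 && sentences.length === 1,
    // 2. Likely legitimate paths have explicit exclusions.
    explicitExclusions: evidence.exclusions.length > 0,
    // 3. At least 15 minutes observed with no obvious consumer false positives.
    cleanRecentWindow: evidence.windowMinutes >= 15 && evidence.consumerFalsePositives === 0,
    // 4. A browser completed auth, booking, and the affected product flow.
    browserFlowsPass: Object.values(evidence.browserFlows).every(Boolean),
    // 5. Rollback is a single rule edit.
    singleEditRollback: evidence.rollbackEdits === 1,
  };
  const failed = Object.entries(checks)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  return { promote: failed.length === 0, failed };
}

// Hypothetical evidence for a narrow rule.
const evidence = {
  surfaceDescription: "CN requests to /stores/nearby outside auth and booking.",
  exclusions: ["/booking", "/auth", "auth.meridian.example"],
  windowMinutes: 30,
  consumerFalsePositives: 0,
  browserFlows: { auth: true, booking: true, productFlow: true },
  rollbackEdits: 1,
};

console.log(readyToPromote(evidence)); // { promote: true, failed: [] }
console.log(readyToPromote({ ...evidence, rollbackEdits: 3 }).failed); // [ 'singleEditRollback' ]
```

Returning the failed check names instead of a bare boolean is the design choice I would keep: a rule that stays in LOG should say why.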
A small framework for constrained bot defense
The reusable model here is observe, subtract, enforce, verify, narrate.
Observe the abusive pattern with the fields you actually have. Do not pretend missing bot scores exist through force of personality.
Subtract the flows where false positives are unacceptable: auth, checkout, booking, account affiliation, support paths, and known partner traffic. The exclusions are not bureaucratic garnish. They are the difference between mitigation and self-harm.
Enforce only the slice that remains high-confidence. That can be a custom WAF block, an IP range block, or a managed challenge if the user experience can tolerate it.
Verify with both logs and a real browser. Logs tell you what happened at scale. Browsers tell you what happened to the path people complain about.
Narrate the limitation to stakeholders in plain language. In this case: blocking certain high-confidence traffic had a significant impact, but the wider automation swarm remained difficult because it rotated across many IPs and user agents while the account lacked bot-management signals. That is not an excuse. It is the operating envelope. The same discipline shows up in other constrained systems too, including the review gates in Building Press, Part 4: Review is where the human gates the irreversible.
Solution summary
The working fix was to move the account from passive observation toward scoped enforcement in Cloudflare. C-1a was deployed as a block rule with authentication and booking carve-outs. C-10 stayed in LOG after being re-scoped because its signal was not clean enough for broad blocking. The abusive Baidu-Netcom /24 fleet was blocked. UA-133 traffic was monitored while C-11 propagated, with the expectation that served-200 responses would trend toward roughly zero. Rolling scripts tracked block rate, challenge rate, 504s, false-positive watches, and business-critical flows.
That did not turn a constrained account into a bot-management account. It made the defensive posture honest. The swarm could still adapt. Some bad traffic would still reach origin. The scripts could still miss an interactive-browser failure. But the system moved from shrugging at bots to blocking the parts it could prove, while keeping the customer paths intact.
The unglamorous version is the one I trust: narrow rules, explicit carve-outs, periodic telemetry, live-browser checks, and stakeholder messaging that says what is still exposed. A bot mitigation rule is not better because it is aggressive. It is better when the next person on call can explain why it fired, what it spared, and how to turn it off before checkout becomes collateral damage.
