
When Bot Mitigation Is Just Observation: Hardening Meridian Without Bot Management

June 20, 2026 · 12 min read

I was staring at a page that should have been boring: a store-locator lookup URL returning 400s for some users while the public browsing path looked clean from my machine. The awkward part was not that Cloudflare was involved. The awkward part was that our Cloudflare bot mitigation posture was mostly observation dressed up like defense. We fixed the immediate risk by moving narrow custom WAF rules from log-only toward scoped blocking, preserving auth and booking carve-outs, blocking one noisy Baidu-Netcom /24, watching a suspect Chrome user-agent as the UA-pattern rule took hold, and running rolling telemetry for block rate, challenges, 504s, and false positives.

The useful argument is simple: when you do not have bot-management signals, broad bot mitigation becomes a safety problem before it becomes a security problem. You can still reduce damage, but only if you treat every rule as a surgical instrument with a recovery plan, not as a moral statement about bad traffic. The goal is not to defeat automation in one dramatic deploy. The goal is to stop the parts you can identify without teaching checkout, login, and store affiliation to hate your customers.

TL;DR

Without Cloudflare Bot Management signals, broad bot mitigation becomes a safety risk before it becomes a security win. This post covers moving custom rules from log-only to scoped blocking, using explicit auth and booking carve-outs, blocking one high-confidence abusive /24, and running rolling telemetry scripts alongside live-browser validation. The approach is to block only what you can prove, exclude the paths where false positives break customers, and tell stakeholders plainly what is still exposed. The result is a more honest defensive posture, not a complete one.

Where bot mitigation starts with admitting what you cannot see

The incident had a familiar shape. Traffic hit Cloudflare first. Some of it was filtered there, then the rest reached the site, where the application and downstream systems took over. China was high in the traffic list, but not because one villainous IP was leaning on the doorbell. We were seeing a broad automation swarm: many IPs, rotating user agents, requests shaped well enough to avoid the obvious rate-limit buckets.

Rate limiting was the tempting knob. It was also the wrong primary control. A per-IP limiter works when the attacker has a concentrated source, a sloppy loop, or a predictable endpoint cadence. This swarm had none of those. The pressure was smeared across enough addresses that each single IP could look merely annoying rather than abusive. That is the small tragedy of distributed automation: it makes the dashboard look busy while keeping each cell just polite enough.
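To make that concrete, here is a minimal sketch with made-up numbers. The per-IP limit, the IP count, and the request counts are illustrative assumptions, not values from the incident. It shows how a per-IP limiter can find no offender while the aggregate load is still large.

```javascript
// Hypothetical numbers for illustration only; none come from the incident.
const PER_IP_LIMIT = 100; // requests allowed per IP per window (assumed)

// Build a fake window: `ipCount` distinct IPs sending `requestsPerIp` each.
function buildSwarmWindow(ipCount, requestsPerIp) {
  const counts = new Map();
  for (let i = 0; i < ipCount; i++) {
    // Spread addresses over 10.0.0.0/8 purely to get distinct keys.
    const ip = `10.${(i >> 16) & 255}.${(i >> 8) & 255}.${i & 255}`;
    counts.set(ip, requestsPerIp);
  }
  return counts;
}

// Return the IPs a per-IP limiter would act on, plus the aggregate load.
function evaluatePerIpLimiter(counts, limit) {
  let total = 0;
  const offenders = [];
  for (const [ip, n] of counts) {
    total += n;
    if (n > limit) offenders.push(ip);
  }
  return { offenders, total };
}

const swarm = buildSwarmWindow(2000, 40);
const result = evaluatePerIpLimiter(swarm, PER_IP_LIMIT);
console.log(result.offenders.length, result.total); // 0 80000
```

With these assumed inputs the limiter flags nobody, yet the origin still absorbs 80,000 requests in the window. Lowering the limit far enough to catch 40 requests per IP is the move that starts hitting shared networks and retrying humans.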

The account also did not have Cloudflare Bot Management. That matters. Without bot score, verified bot categories, JavaScript detections tied to richer behavioral signals, and related fields, you are left with HTTP facts: country, ASN, IP ranges, paths, user agents, headers, request rates, response codes, and whatever your logs can prove. Those facts are useful. They are also blunt. If the proxy chain itself is muddy, the first job is preserving evidence; I wrote about that narrower problem in "Preserving the Real Client IP Through a Proxy Chain That Rewrites the Evidence."

The request path (client to Cloudflare, Cloudflare to the site, the site to downstream systems) is plain because the system was plain. The hard part was not architecture. It was deciding where we had enough evidence to block.

The rule that helped was the rule that stayed small

The most important rule in the set was the carve-out rule, not because it was clever, but because it respected blast radius. It blocked a scoped slice of bad traffic while preserving explicit carve-outs for authentication and booking. In particular, the real checkout flow for China stayed protected. That mattered because one of the live reports was not just general browsing. A logged-in learning flow opened the store locator and produced a 400 during store affiliation. If a mitigation breaks identity or booking, it becomes part of the outage.

The working pattern looked like this:

Control	Action	Why it was acceptable
The cloud-range rule	Log only after rescope	The signal was useful but not clean enough to block broadly
The carve-out rule	Block with auth and booking carve-outs	High-confidence enough, with protected customer paths
A noisy Baidu-Netcom /24	Block	Concentrated abusive fleet with low legitimate value for this flow
The UA-pattern rule	Watch then block the suspect UA	Needed live validation while the rule propagated

The practical Cloudflare expression was closer to this shape than to a heroic one-liner:

(
  ip.src.country eq "CN"
  and http.request.uri.path contains "/stores/nearby"
  and not http.request.uri.path contains "/booking"
  and not http.request.uri.path contains "/auth"
  and not http.host contains "auth.meridian.example"
)

That is deliberately incomplete as a production rule because the real version depends on account fields, hostnames, and rule names. The point is the structure: match the abusive surface, subtract the paths where false positives are expensive, then promote only when telemetry agrees.
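One way to sanity-check that structure before touching the live rule is to mirror it as a plain predicate and run sample requests through it offline. The sketch below is an assumption-heavy illustration: the `req` field names and the `www.meridian.example` host are hypothetical, and the function is not a Cloudflare API, only a local stand-in for the match-then-subtract logic.

```javascript
// Local mirror of the match-then-subtract structure, for offline testing.
// Field names on `req` and the sample hosts are hypothetical.
const EXCLUDED_PATH_PARTS = ["/booking", "/auth"];
const EXCLUDED_HOST_PARTS = ["auth.meridian.example"];

function matchesScopedRule(req) {
  // Step 1: match the abusive surface.
  const onAbusiveSurface =
    req.country === "CN" && req.path.includes("/stores/nearby");
  // Step 2: subtract the paths and hosts where false positives are expensive.
  const onProtectedPath = EXCLUDED_PATH_PARTS.some((p) => req.path.includes(p));
  const onProtectedHost = EXCLUDED_HOST_PARTS.some((h) => req.host.includes(h));
  return onAbusiveSurface && !onProtectedPath && !onProtectedHost;
}

const samples = [
  { country: "CN", host: "www.meridian.example", path: "/stores/nearby" },
  { country: "CN", host: "www.meridian.example", path: "/booking/stores/nearby" },
  { country: "CN", host: "auth.meridian.example", path: "/stores/nearby" },
  { country: "DE", host: "www.meridian.example", path: "/stores/nearby" },
];
const verdicts = samples.map(matchesScopedRule);
console.log(verdicts); // [ true, false, false, false ]
```

Only the first sample matches. The booking path, the auth host, and the out-of-scope country all fall through, which is the behavior the carve-outs are supposed to guarantee.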

"The account was watching the bots, not stopping them."

That sentence was uncomfortable because it was accurate. It also stopped the conversation from drifting into fake certainty. Observation is not useless. It is how you earn enforcement. But observation sold as mitigation is how teams end up surprised when the bill, origin load, or user-facing errors keep moving.

The commands mattered because memory lies

We did not rely on a person refreshing dashboards and narrating vibes into chat. We used periodic scripts from the Cloudflare working directory to sample recent windows, append trend logs, and compare against a rolling baseline.

cd ~/edge-watch
node trend-cycle.mjs 30

For the suspect-UA watch tied to the UA-pattern rule:

cd ~/edge-watch
node ua-watch.mjs 30

For an explicit token-based review cycle in environments where the token was not auto-read:

cd ~/edge-watch
# Re-export the token so the Node process inherits it from the environment
export CF_API_TOKEN="$CF_API_TOKEN"
node review-cycle.mjs 15
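The contents of those scripts are not shown here. As a rough sketch of what "compare against a rolling baseline" can mean, the following computes per-window percentages and reports which metrics moved more than a tolerance away from the mean of earlier windows. Every name, the window shape, and the tolerance are assumptions for illustration; this is not the code in `trend-cycle.mjs`.

```javascript
// Hypothetical sketch; not the contents of trend-cycle.mjs.
// Each window is { blocked, challenged, gatewayTimeouts, total } request counts.
function toPercentages(w) {
  const pct = (n) => (w.total === 0 ? 0 : (100 * n) / w.total);
  return {
    blockPct: pct(w.blocked),
    challengePct: pct(w.challenged),
    timeoutPct: pct(w.gatewayTimeouts),
  };
}

// Mean of each metric over the previous windows (assumes at least one).
function rollingBaseline(history) {
  const sums = { blockPct: 0, challengePct: 0, timeoutPct: 0 };
  for (const w of history) {
    const p = toPercentages(w);
    for (const k of Object.keys(sums)) sums[k] += p[k];
  }
  for (const k of Object.keys(sums)) sums[k] /= history.length;
  return sums;
}

// Names of metrics that moved more than `tolerance` points off the baseline.
function drift(current, history, tolerance) {
  const base = rollingBaseline(history);
  const now = toPercentages(current);
  return Object.keys(now).filter((k) => Math.abs(now[k] - base[k]) > tolerance);
}

const history = [
  { blocked: 30, challenged: 10, gatewayTimeouts: 100, total: 1000 },
  { blocked: 40, challenged: 10, gatewayTimeouts: 120, total: 1000 },
];
const current = { blocked: 150, challenged: 12, gatewayTimeouts: 110, total: 1000 };
const drifted = drift(current, history, 5);
console.log(drifted); // [ 'blockPct' ]
```

In this made-up window the block rate jumps from a 3.5 percent baseline to 15 percent, so only `blockPct` is reported. The value of writing it down is that the comparison no longer depends on anyone remembering what the dashboard looked like an hour ago.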

The checks were boring by design:

Watch	Healthy direction	Why it mattered
Block percentage	Not spiking far above about 4 percent	A sudden jump could mean a broad false positive
Challenge percentage	Flat or falling	Challenge growth can hide user pain
504 percentage	Not climbing above about 15 percent	Origin pressure was part of the risk
Suspect UA served 200	Trending toward roughly 0	The UA-pattern block was taking hold
Booking and SSO hits	Staying servable	Real shoppers and logged-in users must survive
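The table translates fairly directly into a small check function. The sketch below is one hedged way to encode it: the roughly 4 and 15 percent limits come from the table, while the field names, the warning strings, and the choice to treat "flat or falling" as a comparison with the previous window are assumptions.

```javascript
// Soft limits taken from the table; field names are hypothetical.
const LIMITS = { blockPct: 4, gatewayTimeoutPct: 15 };

// Compare the current window with the previous one and list what looks wrong.
function checkWindow(current, previous) {
  const warnings = [];
  if (current.blockPct > LIMITS.blockPct) {
    warnings.push("block rate above soft limit");
  }
  if (current.challengePct > previous.challengePct) {
    warnings.push("challenges rising");
  }
  if (current.gatewayTimeoutPct > LIMITS.gatewayTimeoutPct) {
    warnings.push("504s above soft limit");
  }
  // The suspect UA should trend toward zero served-200 responses.
  if (
    current.suspectUaServed200 > 0 &&
    current.suspectUaServed200 >= previous.suspectUaServed200
  ) {
    warnings.push("suspect UA not declining");
  }
  if (!current.bookingServable || !current.ssoServable) {
    warnings.push("customer path failing");
  }
  return warnings;
}

const previousWindow = {
  blockPct: 3.1, challengePct: 1.0, gatewayTimeoutPct: 12,
  suspectUaServed200: 40, bookingServable: true, ssoServable: true,
};
const currentWindow = {
  blockPct: 3.4, challengePct: 0.9, gatewayTimeoutPct: 17,
  suspectUaServed200: 6, bookingServable: true, ssoServable: true,
};
const warnings = checkWindow(currentWindow, previousWindow);
console.log(warnings); // [ '504s above soft limit' ]
```

The sample values are invented. A check like this only covers what it was told to sample, which is why it is a complement to a real browser rather than a substitute.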

The dullness was the feature. A rolling script will not notice everything. One teammate correctly pointed out that challenges on another property had been visible in a real browser even though the periodic run did not flag them. That is a limitation, not a footnote. Synthetic telemetry samples what you told it to sample. A browser catches the messier truth: cookies, redirects, interactive challenges, cached assets, and flows that only fail after the second click.

Important

A mitigation loop without live-browser validation is a dashboard exercise. It can still miss the user path that is currently on fire.

So the validation loop had two sides. The script told us whether the system was drifting. The browser told us whether a person could still complete the flow.

Why not block the swarm by country

Country blocking was the obvious bad idea. It would have made the charts calmer and the business worse. The traffic distribution showed China near the top because the swarm was present there, but legitimate flows also existed. The store locator, learning affiliation, checkout, and authentication paths were not decorative. Blocking a geography because it appears in an abuse report is a satisfying way to create a different incident.

The second rejected option was making rate limits more aggressive. That might catch the clumsiest slice of automation, but it would also punish shared networks, VPNs, corporate egress, mobile carriers, and any real user unlucky enough to retry a broken flow. The attack was already shaped to avoid the obvious thresholds. Tightening the thresholds would mostly move pain toward humans.

The third rejected option was immediately promoting every suspicious log rule to block. The cloud-range rule is the example I would keep taped to the monitor. It was re-scoped and left in LOG after AWS and Azure were removed from its target set. That was not indecision. It was the correct answer when a rule still carries enough ambiguity to hurt high-volume consumer ISPs, VPNs, or legitimate cloud-hosted paths.

Deep-dive: The promotion checklist

Before moving a rule from LOG to block, I want five things visible in the same place:

The request surface is narrow enough to describe in one sentence.
The likely legitimate paths have explicit exclusions.
The last 15-30 minutes do not show obvious consumer false positives.
A browser can complete auth, booking, and the affected product flow.
A rollback is one rule edit, not a meeting.

That checklist is intentionally small. If I need a long essay to justify the rule, I probably do not understand it well enough to block with it.
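The checklist is small enough to encode as a gate. This is a minimal sketch, assuming a hypothetical record with one boolean per checklist item; the field names and reason strings are invented for illustration and the two sample rules are generic, not the rules from the incident.

```javascript
// One entry per checklist item: [record field, reason shown when it fails].
// Every field name here is hypothetical.
const CHECKS = [
  ["surfaceInOneSentence", "request surface is not describable in one sentence"],
  ["exclusionsInPlace", "legitimate paths lack explicit exclusions"],
  ["recentWindowClean", "recent window shows consumer false positives"],
  ["browserFlowsPass", "browser could not complete auth, booking, or product flow"],
  ["rollbackIsOneEdit", "rollback needs more than one rule edit"],
];

// List the reasons a rule is not yet ready to be promoted.
function promotionBlockers(record) {
  return CHECKS.filter(([key]) => record[key] !== true).map(([, reason]) => reason);
}

// A rule only moves to block when nothing is outstanding.
function nextAction(record) {
  return promotionBlockers(record).length === 0 ? "block" : "log";
}

const narrowRule = {
  surfaceInOneSentence: true, exclusionsInPlace: true, recentWindowClean: true,
  browserFlowsPass: true, rollbackIsOneEdit: true,
};
const ambiguousRule = { ...narrowRule, recentWindowClean: false };

console.log(nextAction(narrowRule));    // block
console.log(nextAction(ambiguousRule)); // log
console.log(promotionBlockers(ambiguousRule));
```

Treating a missing field as a failure (`!== true`) is a deliberate choice in this sketch: an unanswered checklist item keeps the rule in LOG rather than letting it slip through.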

A small framework for constrained bot defense

The reusable model is five steps, done in order: observe, subtract, enforce, verify, narrate.

Observe the abusive pattern with the fields you actually have. Do not pretend missing bot scores exist through force of personality.

Subtract the flows where false positives are unacceptable: auth, checkout, booking, account affiliation, support paths, and known partner traffic. The exclusions are not bureaucratic garnish. They are the difference between mitigation and self-harm.

Enforce only the slice that remains high-confidence. That can be a custom WAF block, an IP range block, or a managed challenge if the user experience can tolerate it.

Verify with both logs and a real browser. Logs tell you what happened at scale. Browsers tell you what happened to the path people complain about.

Narrate the limitation to stakeholders in plain language. In this case: blocking certain high-confidence traffic had a significant impact, but the wider automation swarm remained difficult because it rotated across many IPs and user agents while the account lacked bot-management signals. That is not an excuse. It is the operating envelope. The same discipline shows up in other constrained systems too, including the review gates in "Building Press, Part 4: Review is where the human gates the irreversible."

Solution summary

The working fix was to move the account from passive observation toward scoped enforcement in Cloudflare. The carve-out rule was deployed as a block rule with authentication and booking carve-outs. The cloud-range rule stayed in LOG after being re-scoped because its signal was not clean enough for broad blocking. One noisy Baidu-Netcom /24 was blocked. The suspect UA was monitored while the UA-pattern rule propagated, with the expectation that served-200 responses would trend toward roughly zero. Rolling scripts tracked block rate, challenge rate, 504s, false-positive watches, and business-critical flows.

That did not turn a constrained account into a bot-management account. It made the defensive posture honest. The swarm could still adapt. Some bad traffic would still reach origin. The scripts could still miss an interactive-browser failure. But the system moved from shrugging at bots to blocking the parts it could prove, while keeping the customer paths intact.

The unglamorous version is the one I trust: narrow rules, explicit carve-outs, periodic telemetry, live-browser checks, and stakeholder messaging that says what is still exposed. A bot mitigation rule is not better because it is aggressive. It is better when the next person on call can explain why it fired, what it spared, and how to turn it off before checkout becomes collateral damage.

FAQ

How do I harden Cloudflare WAF without Bot Management?

Scope custom WAF rules to the specific abusive path and country, subtract high-value paths like auth and booking with explicit NOT conditions, and promote rules from LOG to block only after telemetry shows no consumer false positives. Without bot score signals you are working from HTTP facts alone, so keeping blast radius small is more important than being aggressive.

Why is country blocking a bad idea for bot mitigation in Cloudflare?

Country blocking quiets the dashboard but also kills legitimate traffic from the same geography, including real checkouts, logged-in users, and partner flows. A distributed automation swarm exploits the fact that it shares a country with real customers, so blocking the whole country trades one incident for another.

Why keep a Cloudflare WAF rule in LOG instead of promoting it to block?

A rule stays in LOG when its signal still carries enough ambiguity to hurt high-volume consumer ISPs, VPNs, or legitimate cloud-hosted paths. Promotion to block requires a narrow describable surface, explicit exclusions for legitimate paths, and clean telemetry for the last 15 to 30 minutes.

Why is rate limiting ineffective against distributed bot swarms?

Per-IP rate limiting works when an attacker has a concentrated source or predictable cadence. A distributed swarm spreads pressure across many IPs so each individual address looks merely annoying rather than abusive, staying below the obvious thresholds while the aggregate load still causes damage.

How do you validate a Cloudflare bot mitigation rule without Bot Management?

Run periodic telemetry scripts to track block rate, challenge rate, 504s, and false-positive watches at the log level, and separately validate with a live browser to confirm auth, booking, and the affected product flow still complete successfully. Logs tell you what happened at scale; the browser tells you what happened to the path people are complaining about.