ClaudeBot on the Naughty Step

How ClaudeBot fell in love with The MIRE/C³

The one door robots.txt left open: how ClaudeBot found The MIRE/C³

The MIRE/C³ is a Multi-layer Intrusion Response Engine — not just a honeypot, though one of its layers does exactly the kind of bait-and-log work people associate with that word. Part of its job is to look like the soft, exposed infrastructure an attacker (or a careless crawler) loves to find — fake admin panels, fake .env files, fake backup directories, fake everything. Most of the traffic it sees is the internet’s usual background radiation: vulnerability scanners, SEO crawlers, the occasional curious human. Then, starting on June 10th, one visitor stopped behaving like background radiation and started behaving like it had moved in.

That visitor was ClaudeBot, Anthropic’s web-training crawler. Over nine days it sent more requests to one subdirectory of one subdomain than every other bot on my entire infrastructure had sent in the previous five months combined. This is the timeline of how that happened, a brief and slightly absurd detour into a separate group of IPs that pretended to be ClaudeBot, and what changed when I pulled two different levers in the same week to see what would happen.

Act one: the slow burn, then the cliff

ClaudeBot first shows up in my logs on March 17th, at a perfectly unremarkable volume: a little over a hundred requests, then nothing for two days, then back again. For three months that’s the whole story: sporadic, low-volume, polite. Anywhere from zero to about 150 hits a day, scattered across the usual scanner bait (.env, .git/config, aws-credentials, the greatest hits of automated probing), each one met with a synthetic, authentic-looking response and a correctly-tagged 404, not a blank stock error page.

On May 19th, something changed inside the /uploads/ route specifically — not a sitewide fix, as I first assumed. uploads_trap() opens with its own gate: resp = neutral_404_check(); if resp: return resp. Until May 19th, that gate appears to have been catching /uploads/* outright — my earliest logs show the line NEUTRAL_404_MODE: returning 404 for /uploads/shell.php. From May 19th on, that line stops appearing for /uploads/, replaced by a different one logged a few steps further into the same function: Uploads directory access subpath=X. Past that gate sits code that had probably been sitting dormant: a fake Index of /uploads directory listing, rendered fresh on every hit, returned through a bare Response(html, mimetype='text/html') with no status code set — which Flask defaults to 200. Nobody flipped a trap on. A gate stopped catching one route, and a glitch underneath it did the rest. A single test day, 203 hits, then back to baseline. I didn’t think much of it at the time.
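For anyone who wants to see the failure mode in isolation, here is a minimal sketch of it. It is not the actual MIRE/C³ code: only the names uploads_trap and neutral_404_check and the bare Response(html, mimetype='text/html') call come from what is described above. The config key, the route patterns, and the one-line listing body are assumptions made for illustration.

```python
from flask import Flask, Response

app = Flask(__name__)
# Hypothetical switch standing in for whatever made the gate catch /uploads/.
app.config["NEUTRAL_404_MODE"] = False


def neutral_404_check():
    """Return a plain 404 response when the gate is active, otherwise None."""
    if app.config["NEUTRAL_404_MODE"]:
        return Response("Not Found", status=404, mimetype="text/plain")
    return None


@app.route("/uploads/", defaults={"subpath": ""})
@app.route("/uploads/<path:subpath>")
def uploads_trap(subpath):
    resp = neutral_404_check()
    if resp:
        return resp  # gate caught the request: neutral 404
    # Past the gate: no status argument is given, so Flask sends 200 OK.
    html = "<h1>Index of /uploads</h1>"
    return Response(html, mimetype="text/html")
```

With the switch on, every /uploads/ path gets the neutral 404. With it off, the same paths fall through to the listing and come back as 200, which is all a crawler needs to conclude the directory is real.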

[Figure: ClaudeBot connection trends]

Three weeks later, on June 10th, the bot noticed. Starting around 02:00 and climbing through the morning, request volume went from 568 hits the day before to 14,729 — a jump that held and then kept climbing, day after day, settling into a sustained 18,000–24,000 requests a day that, as of this writing, still hasn’t stopped.

What was it actually doing? Not looping, and not guessing. Every hit to /uploads/ renders an Index of /uploads directory listing with five freshly-randomized filenames baked in as links — a backup_*.zip, an export_*.sql, a users_*.csv, a config_*.json, and a .htaccess — each with a random size and a live timestamp. ClaudeBot reads the page, follows the links, and each of those five requests renders the listing again: five new names, every time. It’s a maze that regenerates itself on every step, which is exactly how it requested more than 50,000 distinct paths under /uploads/ in nine days, against 439 distinct paths in the entire three months before that. For most of that stretch, every one of those fresh links led to the same outcome underneath: a 200, and the same ~680-byte listing page rendering five more names. ClaudeBot had no way to know that “new filename” and “the same page again” were the same thing here.
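The self-regenerating maze is simple enough to simulate. The sketch below is an illustration under assumptions, not the trap's real renderer: the five filename patterns come from the description above, while the 8-character random token, the HTML layout, and the breadth-first crawler are invented for the example (random sizes and timestamps are left out).

```python
import random
import string
from collections import deque
from html.parser import HTMLParser


def random_token(rng, length=8):
    """Hypothetical token format; the real naming scheme is not shown."""
    return "".join(rng.choices(string.ascii_lowercase + string.digits, k=length))


def render_listing(rng):
    """Render a fake 'Index of /uploads' page with five fresh filenames."""
    names = [
        f"backup_{random_token(rng)}.zip",
        f"export_{random_token(rng)}.sql",
        f"users_{random_token(rng)}.csv",
        f"config_{random_token(rng)}.json",
        ".htaccess",
    ]
    links = "\n".join(f'<a href="/uploads/{n}">{n}</a>' for n in names)
    return f"<h1>Index of /uploads</h1>\n{links}"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")


def crawl(max_requests, seed=0):
    """Follow links breadth-first; every path answers with the listing again."""
    rng = random.Random(seed)
    queue, seen = deque(["/uploads/"]), set()
    while queue and len(seen) < max_requests:
        path = queue.popleft()
        if path in seen:
            continue  # only .htaccess (or a rare name collision) repeats
        seen.add(path)
        parser = LinkCollector()
        parser.feed(render_listing(rng))  # same page, five new names
        queue.extend(parser.links)
    return seen
```

The queue never drains: each request adds four unseen paths, so the number of distinct paths is limited only by how long the crawler keeps going.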

And it was disciplined about it: on any given day, one single IP address did effectively all of the work, at a near-metronomic one request every six to seven seconds, twenty-four hours a day, no breaks. The IP doing the work changed from day to day — a different address inside Anthropic’s published 216.73.216.0/22 block took over each time — but the cadence and the singular focus didn’t.

[Figure: ClaudeBot, the open door]

Why /uploads/, specifically, and nothing else?

My robots.txt disallows /backup/, /old/, /archive/, /test/, /tmp/, /config/, /db/, /database/, /dump/, /export/, and /admin/: practically every obvious trap directory name. It says nothing about /uploads/. ClaudeBot is documented to honor robots.txt, and the traffic is entirely consistent with that: every other trap directory sat untouched, while the one legally fair-game path (which also happened to be the one generating infinite, never-repeating, 200-status content) got the entirety of its attention. Robots.txt didn’t fail here. It worked exactly as written; it just happened to leave one door unlocked, and that was the only door that mattered.
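The open door is easy to check mechanically with Python's standard-library robots.txt parser. The file below is an assumed reconstruction: the directory names are the ones listed above, but the single "User-agent: *" group is a guess at how the real file is laid out, and the sample paths are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout; only the directory names come from the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /backup/
Disallow: /old/
Disallow: /archive/
Disallow: /test/
Disallow: /tmp/
Disallow: /config/
Disallow: /db/
Disallow: /database/
Disallow: /dump/
Disallow: /export/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ClaudeBot"):
    """True if a compliant crawler with this user-agent may fetch the path."""
    return parser.can_fetch(agent, path)


for path in ("/backup/site.zip", "/admin/", "/uploads/export_x.sql"):
    print(path, "allowed" if allowed(path) else "disallowed")
```

Under these rules every listed trap directory is off limits, and anything under /uploads/ is fair game for a crawler that follows the file to the letter.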

A short detour: meet the impostor

While digging through six months of logs to confirm all this, I cross-checked every IP that ever sent a user-agent string containing “ClaudeBot” against Anthropic’s actual published crawler IP list (claude.com/crawling/bots.json — worth bookmarking if you run anything internet-facing). 233 distinct IPs claimed to be ClaudeBot over the full window. 203 of them genuinely were. The other 30 — a small, recurring cluster mostly out of two address blocks — were not, and never have been.

That cluster wasn’t a one-time fluke. The same handful of addresses also turned up sending Claude-SearchBot, a different, real Anthropic agent, across five unrelated domains between April and May: eighteen hits in all, none of them from an official Anthropic IP. Whoever’s behind this is rotating through fake Anthropic identities, not just borrowing one.

By volume, this impostor traffic is genuinely small — about 3.45% of everything that ever claimed to be ClaudeBot, and effectively zero overlap with the June ramp itself, which is 100% verified. But it’s a clean, concrete reminder that user-agent strings are a suggestion, not an identity, and that “we filtered to ClaudeBot’s IPs” is only true if you actually checked the IPs against something authoritative.
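Checking the IP rather than trusting the string takes only a few lines. This is a sketch under assumptions: it hard-codes the one range mentioned in this post instead of loading the full published list (whose file format is not assumed here), and the classify function and its three labels are hypothetical names.

```python
import ipaddress

# The only range named in this post; a real check would load every
# prefix from Anthropic's published list instead of hard-coding one.
PUBLISHED_PREFIXES = ["216.73.216.0/22"]
NETWORKS = [ipaddress.ip_network(p) for p in PUBLISHED_PREFIXES]


def classify(user_agent, remote_ip):
    """Label a request 'verified', 'impostor', or 'other'.

    'verified'  -> claims ClaudeBot and comes from a published range
    'impostor'  -> claims ClaudeBot but comes from anywhere else
    'other'     -> does not claim to be ClaudeBot at all
    """
    if "claudebot" not in user_agent.lower():
        return "other"
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in net for net in NETWORKS):
        return "verified"
    return "impostor"
```

Run over a log, anything labeled impostor is traffic borrowing the name without the address to back it up.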

Act two: two levers, one week

Two changes landed almost on top of each other, which made untangling cause and effect more interesting than I expected.

June 16th: I flipped on Cloudflare’s AI bot mode for every host except the mire.cc family, which I wanted to keep open as a live experiment. The honest result is mixed. Three smaller properties with small-but-genuine prior ClaudeBot activity (digger.toce.ch, sentinel.toce.ch, and assets.mire.cc) dropped to exactly zero immediately, about as clean a confirmation as you’d want. But three others with real prior activity (barnettland.net, www.relocation-support.ch, and 4apps.ch) actually went up afterward, which says either the rule isn’t fully applied on those specific zones or three days of post-toggle data just isn’t enough signal yet. cfd.mire.cc, deliberately exempted, didn’t even flinch: average daily volume kept climbing, from roughly 7,500/day in the run-up to over 18,000/day in the three days after.

June 17th: the gap closed. Whatever was causing .sql and .csv requests to fall through to the same directory-listing renderer got fixed, and those two extensions started returning real, differentiated canary content instead of five more random names. The .sql files jumped from a ~680-byte listing page to a real ~4.1MB fake SQL dump.

Worth being precise about what’s actually behind it, because the file types aren’t all doing the same job. The .sql and .csv files are static canary tokens: the exact same fixed payload, byte-for-byte, served again and again under a freshly randomized filename. That’s deliberate, not an oversight. A canary token only works as a tracking mechanism if it’s a fixed, fingerprintable artifact; if the content changed every time, there’d be nothing recognizable to flag if it ever turned up somewhere it shouldn’t. The .zip backups, by contrast, are synthetically produced per request: genuinely unique fake archives each time, there to make the directory look like it’s constantly producing new, real backups rather than to serve as a tracking mechanism in their own right.

ClaudeBot, already mid-binge, didn’t slow down for the upgrade; it just started swallowing bigger meals. In under two days, the .sql file alone got served 5,595 times: roughly 12 gigabytes of fabricated breach data shipped to a single crawler, on top of everything that came before it.
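The difference between the two file types can be shown in miniature. Everything here is a stand-in: the payload bytes, the function names, and the archive contents are invented for the example, and the real canary is obviously not reproduced.

```python
import hashlib
import io
import os
import zipfile

# Stand-in for the fixed canary payload (the real one is not shown in the post).
CANARY_SQL = b"-- fake dump\nINSERT INTO users VALUES (1, 'canary@example.invalid');\n"
CANARY_SHA256 = hashlib.sha256(CANARY_SQL).hexdigest()


def serve_sql(_requested_name):
    """Static canary: identical bytes no matter which filename was requested."""
    return CANARY_SQL


def serve_zip(_requested_name):
    """Synthetic backup: a fresh archive with random contents on every request."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("db.bak", os.urandom(256))
    return buf.getvalue()


def is_canary(blob):
    """Recognise the canary wherever it turns up, regardless of its filename."""
    return hashlib.sha256(blob).hexdigest() == CANARY_SHA256
```

Because serve_sql ignores the requested name, one stored digest is enough to recognise the payload later. Nothing comparable exists for the archives, which is fine, since tracking was never their job.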

So the same week that one lever (Cloudflare) tried to throttle the bot on most of the estate, the other lever (the canary route) made the one place it was still fully welcome dramatically more expensive to keep feeding.

[Figure: ClaudeBot on the Naughty Step]

So — does ClaudeBot behave well? And is robots.txt actually ignored?

It’s worth separating these, because the data answers them differently.

Is robots.txt ignored? Not in this case — if anything, this dataset is a clean counter-example to the common complaint. Every disallowed directory was left alone for the entire six months. The runaway behavior is fully explained by the one path that was never disallowed. If you want ClaudeBot to leave a generative trap alone, the boring, unglamorous fix is the right one: put it in robots.txt.

Does it behave well? Compliance and good behavior aren’t quite the same thing, and this is where I’d push back a little. Respecting robots.txt is necessary but not sufficient. A crawler that’s actually being careful about server load — Anthropic’s own stated principle is to avoid being “intrusive or disruptive” and to be thoughtful about crawl speed on a given domain — might reasonably notice that it has spent nine straight days, with no pause, pulling thousands of identically-patterned, randomly-named files from one directory, and treat that as a signal to back off rather than a reason to keep going. It didn’t. Whatever heuristics govern “is this directory still worth crawling at this rate” either didn’t fire here or aren’t tuned for a directory that looks infinite. That’s a real, fair criticism, and it’s a different criticism than “it broke the rules” — it followed the letter of them all the way into a content-cost sinkhole.

And the part I can’t resist asking: what happens to those canary files now?

I don’t have visibility into Anthropic’s training pipeline, so this is informed guessing, not inside knowledge — but for what it’s worth: large-scale training corpora go through aggressive deduplication and quality filtering before anything gets near a training run, and this is about as easy a case as deduplication gets. The .sql and .csv canary tokens aren’t just similar to each other — they’re static, byte-for-byte identical content served under thousands of different randomized filenames, which is closer to “the same file downloaded 5,595 times” than “5,595 different files.” Even exact-hash deduplication, the bluntest and most universal filtering stage there is, should catch that without needing anything clever. The synthetically-generated .zip backups are a slightly less certain case since each one is genuinely unique, but they’re still a vanishingly small, oddly-patterned slice of a multi-trillion-token corpus. My best guess is that the most likely afterlife for this entire haul is a hash bucket, not a model weight — which is a slightly anticlimactic ending for several gigabytes of fabricated corporate breach data, but probably the right one.
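To make "exact-hash deduplication" concrete, here is a minimal sketch of the general technique. It says nothing about how Anthropic's pipeline actually works; the function name and the (name, bytes) input shape are assumptions for the example.

```python
import hashlib


def exact_dedup(documents):
    """Keep the first document seen for each distinct SHA-256 content hash.

    `documents` is an iterable of (name, content_bytes) pairs. The name is
    ignored when deciding what counts as a duplicate.
    """
    seen = set()
    kept = []
    for name, content in documents:
        digest = hashlib.sha256(content).digest()
        if digest in seen:
            continue  # same bytes under a different filename
        seen.add(digest)
        kept.append((name, content))
    return kept


# 5,595 randomly named copies of one payload collapse to a single entry.
haul = [(f"export_{i:05d}.sql", b"identical canary payload") for i in range(5595)]
print(len(exact_dedup(haul)))  # 1
```

The filename never enters the hash, so randomizing it buys nothing against this stage. Only the per-request archives would get through, one copy each.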

What’s next

Robots.txt got ClaudeBot to leave eleven trap directories alone and chase only the twelfth. The natural follow-up is whether the same file can do more than say no: Anthropic’s own documentation claims ClaudeBot also honors the non-standard Crawl-delay directive. So that’s the next move: not a Disallow, but a number. I’m pointing a Crawl-delay at the one address that’s eaten gigabytes of bandwidth for two weeks straight, and finding out whether “compliant” and “considerate” turn out to be the same claim after all.
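For reference, this is roughly what that experiment looks like from the crawler's side. The fragment and the 10-second value are purely illustrative, not the setting actually deployed, and the helper name is hypothetical. Python's standard-library parser happens to understand Crawl-delay, which makes the arithmetic easy to check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical fragment: the 10-second value is only an example.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Crawl-delay: 10
"""


def max_requests_per_day(robots_txt, agent):
    """Ceiling on daily requests for a crawler that honors Crawl-delay.

    Returns None when no delay applies to this user-agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # seconds between requests, or None
    if not delay:
        return None
    return 86_400 // delay


print(max_requests_per_day(ROBOTS_TXT, "ClaudeBot"))  # 8640
```

At ten seconds between requests, a fully compliant crawler tops out at 8,640 requests a day, well under the 18,000 to 24,000 it has been sending.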