Block the Wrong Bot and You're Deleting Yourself From AI Answers

The most expensive misconception: “blocking AI” isn’t one switch

“I don’t want AI taking my content to train on” is a reasonable instinct, so many people open robots.txt and block every AI-related crawler. The problem: a single AI company often runs more than one crawler, and each does a completely different job. Block them all and you usually also cut off the one that gets you cited.

The result is the most ironic kind of failure: you think you’re only refusing training, but you’re actually deleting yourself from AI’s answers — and because it’s invisible, you have no idea.

Training bots and search bots are different crawlers

Most mainstream AI companies split their crawlers into two purposes: one fetches content to train models, the other fetches in real time when a user asks a question, then cites you. Blocking the former doesn’t hurt visibility; blocking the latter means leaving that engine’s answers.

| AI company | Training (low impact if blocked) | Search / live citation (you disappear if blocked) |
| --- | --- | --- |
| OpenAI | `GPTBot` | `OAI-SearchBot`, `ChatGPT-User` |
| Anthropic | `ClaudeBot`, `anthropic-ai` | `Claude-User`, `Claude-SearchBot` |
| Perplexity | `PerplexityBot` | `Perplexity-User` |
| Google | `Google-Extended` (nominally opts out of Gemini training; the exception, see below) | `Googlebot` (block it and you lose Search too) |
| Apple | `Applebot-Extended` | `Applebot` |

The point isn’t to memorize this table — it’s to understand that “block training” and “keep visibility” can both be true at once, as long as you can tell which bot is which.
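
One way to keep the distinction straight is to store the table as data and look tokens up instead of relying on memory. The sketch below is illustrative only: `CRAWLERS` and `classify` are hypothetical names, and the token lists simply copy the table above, so they go stale whenever a vendor adds or renames a crawler.

```python
# Crawler tokens copied from the table above; not an authoritative list.
# Google-Extended is filed under "training" as the table does, but the
# Google-Extended section explains why blocking it may still cost visibility.
CRAWLERS = {
    "OpenAI": {"training": ["GPTBot"],
               "search": ["OAI-SearchBot", "ChatGPT-User"]},
    "Anthropic": {"training": ["ClaudeBot", "anthropic-ai"],
                  "search": ["Claude-User", "Claude-SearchBot"]},
    "Perplexity": {"training": ["PerplexityBot"],
                   "search": ["Perplexity-User"]},
    "Google": {"training": ["Google-Extended"],
               "search": ["Googlebot"]},
    "Apple": {"training": ["Applebot-Extended"],
              "search": ["Applebot"]},
}


def classify(token):
    """Return (vendor, role) for a robots.txt user-agent token, or None if unknown."""
    wanted = token.strip().lower()
    for vendor, roles in CRAWLERS.items():
        for role, names in roles.items():
            if wanted in {name.lower() for name in names}:
                return vendor, role
    return None


print(classify("GPTBot"))         # ('OpenAI', 'training')
print(classify("OAI-SearchBot"))  # ('OpenAI', 'search')
print(classify("SomeNewBot"))     # None
```

A `None` result is the useful case: it means the token is not in your list, which is a cue to check the vendor's crawler docs before writing a rule for it.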

The exception: Google-Extended is not a clean “training switch”

One cell in that table deserves its own section. Google’s official position is that Google-Extended only controls whether your content trains Gemini, and blocking it does not affect indexed content appearing in AI Overviews. But a large-scale measurement study (SIGIR 2026, arXiv:2604.27790, data from December 2025) compared thousands of queries and found the opposite: sites that block Google-Extended are measurably less likely to be cited by AI Overviews.

Nobody has a confirmed explanation yet, but the takeaway is clear: treating Google-Extended as a “low impact if blocked” pure training switch means gambling with your AIO visibility. If you care about AI visibility at all, leave Google-Extended allowed — and if you genuinely want out of training, start with the other vendors’ training UAs.

The cost of getting it wrong: vanishing from AI answers

In traditional SEO, blocking the wrong crawler at least shows up as a ranking drop you can notice. But AI citation is invisible: a user asks, AI answers, you’re not in it — no notification, no “impression that didn’t happen” in any dashboard.

That’s what makes blocking the wrong AI bot so dangerous: there’s no alarm. By the time you notice “competitors get mentioned in ChatGPT and I don’t,” you’ve usually been missing out for a long time.

`noai` and robots.txt are not the same thing

A common confusion worth clearing up: the noai / noimageai meta tags on a page, and robots.txt crawler rules, are two different mechanisms. The former asks “don’t train on this page”; the latter controls “which crawler may fetch which paths.” Both rely on crawlers honoring them voluntarily, neither is an enforceable standard, and both can hurt your visibility if set too bluntly.

So how should you set it up?

In one line: block training, allow search.

To opt out of training, write Disallow rules for the training user agents (GPTBot, ClaudeBot, Applebot-Extended…). Google-Extended is the exception: sites that block it were measurably less likely to be cited in AI Overviews (see above).
Always allow the search / live-citation user agents (OAI-SearchBot, Claude-User, Perplexity-User…), or you’re actively opting out of AI answers.
After editing, cross-check against each vendor’s crawler docs to confirm you blocked the bot you think you did.
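
As a rough illustration of those three steps, here is a minimal sketch that embeds a sample robots.txt and checks it with Python's standard-library `urllib.robotparser`. The rule set is an assumption built from the user agents named above, not a complete or vendor-approved configuration, and `example.com` is a placeholder. Python's parser only approximates how each real crawler reads the file, so treat the result as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Sample rules (an assumption for illustration): block training crawlers,
# explicitly allow search / live-citation crawlers, leave everything else open.
ROBOTS_TXT = """\
# Opt out of training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Applebot-Extended
Disallow: /

# Keep search / live citation
User-agent: OAI-SearchBot
User-agent: Claude-User
User-agent: Perplexity-User
Allow: /

# Everyone else, including Google-Extended, stays allowed
User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Confirm each bot ends up where you intended.
for ua in ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-User",
           "Perplexity-User", "Google-Extended"]:
    verdict = "allowed" if parser.can_fetch(ua, "https://example.com/") else "blocked"
    print(f"{ua}: {verdict}")
```

The explicit Allow group is redundant while the wildcard group is open, but it documents intent and keeps the citation crawlers allowed if someone later tightens the wildcard rule.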

For each vendor’s full crawler list and rule differences, see the earlier post: The 8 major AI crawlers — rule differences and best settings.

Why this isn’t “set it once and forget”

AI companies add and rename crawlers (the lists have changed several times in the past two years), and one typo in robots.txt, or one default toggle in Cloudflare, can shut your whole site to a given bot. Because lost AI citations come with no alarm, this isn’t a set-once task. It’s ongoing site-health maintenance: cross-check the latest crawler lists and verify your rules regularly. This is exactly the kind of invisible, slow-bleed problem that’s best watched continuously rather than discovered after the damage is done.
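
A recurring check can be as small as the sketch below, which reports any search / live-citation crawler that a given robots.txt would block. `MUST_ALLOW` and `blocked_search_bots` are hypothetical names, the token list is copied from the table earlier in this post and needs updating as vendors change, and the check only sees robots.txt: it cannot detect blocks applied at the CDN or firewall level.

```python
from urllib.robotparser import RobotFileParser

# Search / live-citation tokens from the table in this post (assumed current).
MUST_ALLOW = ["OAI-SearchBot", "ChatGPT-User", "Claude-User",
              "Claude-SearchBot", "Perplexity-User", "Googlebot", "Applebot"]


def blocked_search_bots(robots_txt, url="https://example.com/"):
    """Return the must-allow crawlers that this robots.txt text would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in MUST_ALLOW if not parser.can_fetch(ua, url)]


# Intended rule: only the training crawler is blocked.
GOOD = """\
User-agent: GPTBot
Disallow: /
"""

# One-character-class mistake: a wildcard where a bot name was meant.
BROKEN = """\
User-agent: *
Disallow: /
"""

print(blocked_search_bots(GOOD))    # []
print(blocked_search_bots(BROKEN))  # every entry in MUST_ALLOW
```

Run on a schedule against the live file, a non-empty result is the alarm that AI citation loss otherwise never gives you.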