A robots.txt that actually lets agents in

I deployed a blog this week and checked /robots.txt as a sanity pass. The default was blocking every AI crawler on the internet. ClaudeBot, GPTBot, Google-Extended, PerplexityBot, CCBot, Applebot-Extended, meta-externalagent, Bytespider — all Disallow: /. Plus a Content-Signal: search=yes,ai-train=no at the top.

This is not what I want. If you're publishing in 2026 and your goal is for agents to find your work, starting from "block every agent" is starting at -1. You keep paying the bandwidth bill for the traditional human-facing crawlers while turning away the audience that is actually growing.

So I wrote a plugin.

The problem sits at two layers

The first layer is the template. The EmDash blog template I was using has a reasonable default robots.txt that lets everything in except the admin surface. Fine.

The second layer is the hosting provider. Cloudflare ships a zone-level feature called "AI Scrapers and Crawlers" that prepends its own managed robots.txt content above whatever your Worker returns. The prepended block is the one doing the blocking. It's on by default for a lot of accounts, and most people don't notice because they never diff the served /robots.txt against what their Worker actually emits.

robots.txt group matching means a bot obeys the group that matches its user-agent, and in practice most crawlers take the first matching group they find. (Strictly RFC 9309-compliant parsers merge duplicate groups for the same user-agent, but a blanket Disallow: / dominates the merged rules anyway.) CF's block list sits above my allow list. I lose.
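
Concretely, the served file ends up shaped like this (illustrative, not Cloudflare's exact output):

```text
# Prepended by Cloudflare -- this group wins
User-agent: ClaudeBot
Disallow: /

# Emitted by the Worker -- never reached
User-agent: ClaudeBot
Disallow:
```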

The fix has two steps: toggle the CF setting off in the dashboard under Security → Bots, and make sure your Worker is emitting a robots.txt you actually want served. This post is about the second step.

The bot catalog is data, not code

I published the generator as an open source plugin called emdash-plugin-agent-seo. The first thing in the package is a file called bots.ts. It is a flat list of every well-known AI crawler I could find documentation for as of Q2 2026:

export type BotPurpose = "training" | "grounding" | "search" | "assistant";

export interface AgentBot {
  readonly id: string;
  readonly userAgent: string;
  readonly operator: string;
  readonly purpose: readonly BotPurpose[];
  readonly docsUrl: string;
}

export const AGENT_BOTS: readonly AgentBot[] = [
  { id: "gptbot", userAgent: "GPTBot", operator: "OpenAI", purpose: ["training"], docsUrl: "https://platform.openai.com/docs/gptbot" },
  { id: "claudebot", userAgent: "ClaudeBot", operator: "Anthropic", purpose: ["training"], docsUrl: "https://docs.anthropic.com/claude/docs/crawler" },
  { id: "google-extended", userAgent: "Google-Extended", operator: "Google", purpose: ["training", "grounding"], docsUrl: "https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers" },
  { id: "perplexitybot", userAgent: "PerplexityBot", operator: "Perplexity", purpose: ["search", "grounding"], docsUrl: "https://docs.perplexity.ai/guides/bots" },
  // ... 12 more
];

The important part: this is data. It's versioned separately from the generator, auditable in code review, trivial to fork and override. If Anthropic ships a new bot tomorrow called Claude-SearchBot, you add one line to this file and everything downstream picks it up. You don't touch the generator, you don't touch the route, you don't touch the site that consumes it.

Each bot carries metadata: who operates it, what it's for (training / grounding / search / assistant), and a link to the vendor's published docs so anyone reviewing the list can verify the entry.

The generator is 80 lines of pure TypeScript

import { AGENT_BOTS, type AgentBot } from "./bots";

// BotPolicy is the three-mode policy type shown in the next section.
export interface RobotsTxtOptions {
  siteUrl: string;
  defaultPolicy?: BotPolicy;
  botPolicies?: Record<string, BotPolicy>;
  bots?: readonly AgentBot[];
  globalDisallow?: string[];
}

export const buildRobotsTxt = (opts: RobotsTxtOptions): string => {
  const {
    siteUrl,
    defaultPolicy = { mode: "allow" },
    botPolicies = {},
    bots = AGENT_BOTS,
    globalDisallow = ["/_emdash/"],
  } = opts;

  const sections: string[] = [];
  // One group per known bot; renderGroup (defined alongside in the package)
  // turns a user-agent plus policy into the lines of a robots.txt group.
  for (const bot of bots) {
    const policy = botPolicies[bot.id] ?? { mode: "allow" };
    sections.push(...renderGroup(bot.userAgent, policy));
    sections.push("");
  }
  // Fallback group for everything not in the catalog, then the global
  // disallows and the sitemap pointer.
  sections.push(...renderGroup("*", defaultPolicy));
  for (const path of globalDisallow) sections.push(`Disallow: ${path}`);
  const base = siteUrl.replace(/\/$/, "");
  sections.push(`Sitemap: ${base}/sitemap.xml`);
  return sections.join("\n") + "\n";
};

Same input always produces the same output. No I/O, no environment access, no framework coupling. Trivial to unit-test. The function takes policy as an argument — the default is "allow every bot in the catalog, disallow /_emdash/, advertise the sitemap" — but any caller can override per-bot or change the default without forking the package.

Policy as data

This is the part that mattered most to me. Most robots.txt generators hardcode a policy and give you knobs for individual bots. I wanted the opposite: give the caller a small DSL they can compose.

type BotPolicy =
  | { mode: "allow" }
  | { mode: "disallow" }
  | { mode: "paths"; allow?: string[]; disallow?: string[] };

Three modes. Every bot, plus the default fallback, takes one of these. That means you can compose any reasonable policy in a few lines.
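
The renderGroup helper that buildRobotsTxt calls isn't shown in the excerpt. A minimal sketch of what it plausibly looks like, given these three modes (the BotPolicy type is repeated here so the snippet stands alone; the real helper in the package may differ):

```typescript
type BotPolicy =
  | { mode: "allow" }
  | { mode: "disallow" }
  | { mode: "paths"; allow?: string[]; disallow?: string[] };

// Render one "User-agent:" group for a single bot (or "*").
const renderGroup = (userAgent: string, policy: BotPolicy): string[] => {
  const lines = [`User-agent: ${userAgent}`];
  switch (policy.mode) {
    case "allow":
      // An empty Disallow value means "nothing is disallowed".
      lines.push("Disallow:");
      break;
    case "disallow":
      lines.push("Disallow: /");
      break;
    case "paths":
      for (const p of policy.allow ?? []) lines.push(`Allow: ${p}`);
      for (const p of policy.disallow ?? []) lines.push(`Disallow: ${p}`);
      break;
  }
  return lines;
};
```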

Maximum agent discoverability — the default. Every bot allowed, admin hidden:

buildRobotsTxt({ siteUrl, defaultPolicy: { mode: "allow" } });

Block training bots, allow search and grounding — if you want agents to surface your posts in answers but not train on them:

import { AGENT_BOTS, filterBotsByPurpose } from "emdash-plugin-agent-seo/bots";

const trainingBots = filterBotsByPurpose(AGENT_BOTS, ["training"]);
const botPolicies = Object.fromEntries(
  trainingBots.map((b) => [b.id, { mode: "disallow" } as const]),
);
buildRobotsTxt({ siteUrl, botPolicies, defaultPolicy: { mode: "allow" } });
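
The filterBotsByPurpose helper imported above isn't shown in the excerpt either; given the catalog shape, it's plausibly just a filter over each bot's purpose array. A self-contained sketch with a two-entry stand-in for the real AGENT_BOTS list:

```typescript
type BotPurpose = "training" | "grounding" | "search" | "assistant";

interface AgentBot {
  readonly id: string;
  readonly userAgent: string;
  readonly operator: string;
  readonly purpose: readonly BotPurpose[];
  readonly docsUrl: string;
}

// Keep bots whose declared purposes overlap the requested set.
const filterBotsByPurpose = (
  bots: readonly AgentBot[],
  purposes: readonly BotPurpose[],
): AgentBot[] => bots.filter((b) => b.purpose.some((p) => purposes.includes(p)));

// Stand-in catalog for illustration only:
const SAMPLE_BOTS: readonly AgentBot[] = [
  { id: "gptbot", userAgent: "GPTBot", operator: "OpenAI", purpose: ["training"], docsUrl: "" },
  { id: "perplexitybot", userAgent: "PerplexityBot", operator: "Perplexity", purpose: ["search", "grounding"], docsUrl: "" },
];
```

Note that a bot tagged with several purposes, like Google-Extended's ["training", "grounding"], matches as soon as any one of them is in the requested set, so disallowing "training" bots also takes its grounding role with it.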

Paid-only crawl — every known bot disallowed, humans through, x402 paywall on the pages. Pairs well with pay-per-fetch revenue:

const botPolicies = Object.fromEntries(
  AGENT_BOTS.map((b) => [b.id, { mode: "disallow" } as const]),
);
buildRobotsTxt({ siteUrl, botPolicies, defaultPolicy: { mode: "allow" } });

The policy is yours. The plugin just renders it.
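
For the default allow-everything policy, the rendered file comes out roughly like this (assuming mode "allow" is encoded as an empty Disallow line; example.com is a placeholder):

```text
User-agent: GPTBot
Disallow:

User-agent: ClaudeBot
Disallow:

...

User-agent: *
Disallow:
Disallow: /_emdash/
Sitemap: https://example.com/sitemap.xml
```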

What to pair it with

robots.txt is the signal that tells bots what they're allowed to take. llms.txt is the signal that tells them what's worth taking. You want both.

llms.txt is a discovery manifest per llmstxt.org — a root-level file that lists your canonical posts and pages as link lines, plus a separate llms-full.txt that inlines the full text so an agent can ingest the whole site in one fetch. I published a separate plugin for that called emdash-plugin-llms-txt. Same design: pure functional generator, no framework coupling, caller controls the data shape.
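
For reference, a minimal llms.txt in the llmstxt.org shape: an H1 title, a blockquote summary, then sections of link lines with optional descriptions (the URLs here are placeholders):

```text
# My Blog

> Posts on shipping software for the agent-era web.

## Posts

- [A robots.txt that actually lets agents in](https://example.com/posts/robots-txt-agents): why hosting defaults block AI crawlers, and a plugin that fixes it
```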

The combination of an allow-everything robots.txt, a clean llms.txt, and a fresh sitemap.xml is the current best practice for agent-era SEO. It's also almost nothing to ship once you have the plugins.

Wire it up

Drop this in src/pages/robots.txt.ts in any EmDash project:

import type { APIRoute } from "astro";
import { buildRobotsTxt } from "emdash-plugin-agent-seo";

export const GET: APIRoute = async ({ site, url }) => {
  const siteUrl = site?.toString() ?? url.origin;
  const body = buildRobotsTxt({
    siteUrl,
    defaultPolicy: { mode: "allow" },
    globalDisallow: ["/_emdash/"],
  });
  return new Response(body, {
    headers: {
      "content-type": "text/plain; charset=utf-8",
      "cache-control": "public, max-age=3600",
    },
  });
};

That's the whole integration. Nine lines plus the response envelope.

One last thing about Cloudflare

I spent longer than I should have on this, so flagging it explicitly: if your site is behind Cloudflare and you've turned on "AI Scrapers and Crawlers" in the zone settings, your Worker cannot fully override the served robots.txt. Cloudflare prepends its managed content above your Worker response. The dashboard path is Security → Bots → AI Scrapers and Crawlers. Toggle it off if you want the Worker's response to win.

Cloudflare doesn't flag this interaction anywhere I could find in its docs. It's a setting a lot of accounts have enabled by default. I didn't notice for an embarrassing length of time because I kept curl-ing /robots.txt and seeing my agent-seo allow list at the bottom of the response, so I assumed it was working. But the top of the file always wins.

Source

emdash-plugin-agent-seo is MIT-licensed. The code lives in the plugins/ directory of the blog repo until I extract it to a standalone public repo next week. Pull requests welcome, especially for the bot catalog — if I missed a crawler or have stale docs URLs, open an issue or send a patch.

Same story for emdash-plugin-llms-txt — they're sister packages and most people will want both.

This is the kind of thing that should exist as a boring, small, well-tested utility that any EmDash site can install in one command. The v0.1.0 packages are the first pass. The goal is for robots.txt on any EmDash site, anywhere, to be a solved problem — the same way you don't think about how your site gets SSL certs anymore.