llms.txt vs robots.txt vs ai.txt: The Developer's Cheat Sheet

Every Next.js developer building a public site in the last 18 months has hit the same wall: you Google "how to control what AI crawlers read" and get three different answers pointing at three different files — robots.txt, llms.txt, and ai.txt. They are not the same thing. They do not talk to the same audience. And using the wrong one (or none at all) means AI search engines are either ignoring your content entirely or indexing pages you never intended them to.

This is the one-stop breakdown I wish existed when I was figuring this out.

The three files at a glance

| | `robots.txt` | `llms.txt` | `ai.txt` |
| --- | --- | --- | --- |
| Proposed by | Martijn Koster (1994) | Jeremy Howard / Answer.AI (2024) | AI-txt.com initiative |
| Primary audience | Web crawlers (Googlebot, Bingbot, GPTBot, etc.) | LLMs and AI tools reading your site | AI assistants (ChatGPT, Claude, Gemini) |
| Format | Key-value directives | Markdown / structured text | Key-value + JSON blocks |
| Spec status | RFC 9309 (2022), widely supported | Emerging, growing adoption | Early proposal, limited adoption |
| Honored by | All major search engines and the main AI crawlers | Some AI tools; no major crawler documents support yet | No major AI company yet |
| Location | `yourdomain.com/robots.txt` | `yourdomain.com/llms.txt` | `yourdomain.com/ai.txt` |

The short mental model: robots.txt tells crawlers, including AI crawlers such as GPTBot and ClaudeBot, which paths they may fetch. llms.txt gives language models a curated summary of the site to work from. ai.txt is a newer proposal for declaring how AI assistants may use your content. Use all three if you want complete coverage, but robots.txt plus llms.txt is where most of the value is today.

robots.txt — the original gatekeeper

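robots.txt has been around since 1994 and was standardized as RFC 9309 in 2022. Well-behaved crawlers (Googlebot, Bingbot, DuckDuckBot, GPTBot, ClaudeBot, PerplexityBot) fetch it before crawling, but compliance is voluntary. OpenAI and Anthropic both document that their crawlers honor it, so blocking GPTBot in robots.txt should keep it from crawling your site for training data. Note that ChatGPT search uses a separate agent, OAI-SearchBot, which needs its own rule.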

Basic syntax:

User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Disallow: /

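User-agent: * applies to every crawler that has no group of its own. A named group replaces the * group for that bot rather than adding to it, so repeat any shared Disallow lines inside each named group. Allow and Disallow match path prefixes, and when both match, the longest (most specific) rule wins. The 1994 convention had no Allow directive and no wildcards; RFC 9309 and most modern crawlers support Allow plus the * and $ patterns.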
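To make the matching rules concrete, here is a minimal sketch of how a compliant crawler could evaluate the example above. It assumes plain prefix matching (no * or $ patterns), and the names Rule, Group, selectGroup, and isAllowed are made up for illustration; this is not any crawler's actual code.

```typescript
// Sketch of robots.txt evaluation: pick the most specific user-agent group,
// then let the longest matching path rule win (Allow wins ties).
// Wildcards (* and $) are deliberately not handled here.

type Rule = { type: 'allow' | 'disallow'; path: string }
type Group = { userAgents: string[]; rules: Rule[] }

// A named group replaces the '*' group for that bot; the two are not merged.
function selectGroup(groups: Group[], botName: string): Group | undefined {
  const name = botName.toLowerCase()
  const named = groups.find(g => g.userAgents.some(ua => ua.toLowerCase() === name))
  return named ?? groups.find(g => g.userAgents.includes('*'))
}

function isAllowed(groups: Group[], botName: string, path: string): boolean {
  const group = selectGroup(groups, botName)
  if (!group) return true // no applicable group: everything is allowed

  let best: Rule | undefined
  for (const rule of group.rules) {
    if (rule.path === '' || !path.startsWith(rule.path)) continue
    const longer = !best || rule.path.length > best.path.length
    const tieAllow = best && rule.path.length === best.path.length && rule.type === 'allow'
    if (longer || tieAllow) best = rule
  }
  return best ? best.type === 'allow' : true
}

// The same groups as the robots.txt example above.
const groups: Group[] = [
  { userAgents: ['*'], rules: [
    { type: 'disallow', path: '/admin/' },
    { type: 'disallow', path: '/private/' },
  ] },
  { userAgents: ['GPTBot'], rules: [{ type: 'disallow', path: '/' }] },
  { userAgents: ['ClaudeBot'], rules: [
    { type: 'allow', path: '/blog/' },
    { type: 'disallow', path: '/' },
  ] },
]

console.log(isAllowed(groups, 'ClaudeBot', '/blog/post-one')) // true
console.log(isAllowed(groups, 'ClaudeBot', '/pricing'))       // false
console.log(isAllowed(groups, 'GPTBot', '/blog/post-one'))    // false
console.log(isAllowed(groups, 'Bingbot', '/admin/users'))     // false
```

Bingbot has no group of its own, so it falls back to the * group; ClaudeBot ignores the * group entirely and is governed only by its own two rules.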

In Next.js App Router — generate it dynamically from app/robots.ts:

// app/robots.ts
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/admin/', '/api/private/'],
      },
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/'],
        disallow: ['/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}

Next.js renders this as text/plain at /robots.txt automatically. No separate file needed.

What robots.txt does NOT do: it is not access control. It is a request that compliant crawlers honor, and it only governs future fetches. It does not remove pages a crawler fetched before you added the rule, and a disallowed URL can still show up in a search index if other pages link to it, just without its content. Anything like /admin/ that must stay private needs authentication, not a Disallow line.

llms.txt — built for the AI era

llms.txt was proposed by Jeremy Howard of Answer.AI in 2024 as a structured way to tell LLMs what your site is actually about: not just what they can crawl, but what context they should carry when reasoning about your content. Anthropic, Perplexity, and others have since published llms.txt files for their own documentation.

Unlike robots.txt which is access control, llms.txt is documentation. Think of it as a README for your site aimed at language models.

Basic structure:

# YourSite

> One-line description of what this site is and who it's for.

A few sentences of context. What does this site do? Who is the author?
What should an LLM understand before citing any page from this domain?

## Blog

- [Post title one](https://yourdomain.com/blog/post-one): One-line summary.
- [Post title two](https://yourdomain.com/blog/post-two): One-line summary.

## Services

- [Service name](https://yourdomain.com/services/name): What this service does in one line.

## Contact

- Author: Your Name
- Email: you@yourdomain.com
- LinkedIn: linkedin.com/in/yourhandle

The format is intentionally plain Markdown. No special parser needed — any LLM can read it. The /llms.txt path is the convention; some sites also serve /llms-full.txt with deeper content for models that want more context.
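Because the file is plain Markdown, pulling the structure back out takes very little code. The sketch below is a hypothetical minimal reader (parseLlmsTxt and its types are illustrative names, not part of any published tool) that extracts the title, the one-line summary, and the link lists under each H2 heading. It assumes the simple layout shown above and ignores anything else.

```typescript
type LlmsLink = { title: string; url: string; note?: string }
type LlmsDoc = { title: string; summary: string; sections: Record<string, LlmsLink[]> }

// Minimal reader for the layout shown above: "# Title", "> summary",
// "## Section" headings, and "- [title](url): note" list items.
function parseLlmsTxt(text: string): LlmsDoc {
  const doc: LlmsDoc = { title: '', summary: '', sections: {} }
  let current: string | null = null

  for (const raw of text.split('\n')) {
    const line = raw.trim()
    if (line.startsWith('## ')) {
      current = line.slice(3).trim()
      doc.sections[current] = []
    } else if (line.startsWith('# ') && !doc.title) {
      doc.title = line.slice(2).trim()
    } else if (line.startsWith('> ') && !doc.summary) {
      doc.summary = line.slice(2).trim()
    } else if (current) {
      // The ": note" part is optional.
      const m = /^- \[([^\]]+)\]\(([^)]+)\)(?::\s*(.*))?$/.exec(line)
      if (m) doc.sections[current].push({ title: m[1], url: m[2], note: m[3] })
    }
  }
  return doc
}

const sample = [
  '# YourSite',
  '',
  '> One-line description of what this site is and who it is for.',
  '',
  '## Blog',
  '',
  '- [Post title one](https://yourdomain.com/blog/post-one): One-line summary.',
  '- [Post title two](https://yourdomain.com/blog/post-two)',
].join('\n')

const parsed = parseLlmsTxt(sample)
console.log(parsed.title)                // YourSite
console.log(parsed.sections.Blog.length) // 2
```

Plain list items such as the Contact entries above are not links, so this reader skips them; an LLM reading the raw text would still see them.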

In Next.js — generate dynamically from app/llms.txt/route.ts:

// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'

export async function GET() {
  const lines = [
    '# YourSite',
    '',
    '> AI-search-ready Next.js development and SEO consulting.',
    '',
    'This site covers AI engineering, GEO/AEO, and production Next.js patterns.',
    '',
    '## Blog',
    '',
    ...blogPosts.map(p => `- [${p.title}](https://yourdomain.com/blog/${p.slug}): ${p.excerpt}`),
    '',
    '## Services',
    '',
    '- [AI-Search Consulting](https://yourdomain.com/services): End-to-end GEO and AEO for Next.js sites.',
  ]

  return new Response(lines.join('\n'), {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  })
}

This keeps llms.txt in sync with your actual content automatically — no manual updates.
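One thing to watch when generating the list from real data: an excerpt containing a line break, or a title containing square brackets, will break the one-item-per-line format. A small sketch of a sanitizing helper, assuming titles and excerpts are free-form strings (toLinkLine is a hypothetical name, not something the route above already uses):

```typescript
// Build one llms.txt list item that always stays on a single line.
function toLinkLine(title: string, url: string, summary: string): string {
  // Collapse line breaks and repeated whitespace into single spaces.
  const oneLine = (s: string) => s.replace(/\s+/g, ' ').trim()
  // Square brackets in a title would break the Markdown link syntax.
  const safeTitle = oneLine(title).replace(/[\[\]]/g, '')
  const note = oneLine(summary)
  return note ? `- [${safeTitle}](${url}): ${note}` : `- [${safeTitle}](${url})`
}

console.log(toLinkLine('Hello [draft]\nWorld', 'https://yourdomain.com/blog/hello', '  Line one.\nLine two. '))
// - [Hello draft World](https://yourdomain.com/blog/hello): Line one. Line two.
```

In the route handler, the map callback could call this helper instead of building the template string inline.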

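Who reads llms.txt today: support is reported for ClaudeBot (Anthropic's crawler), PerplexityBot, and some versions of GPTBot, though vendor documentation on this is thin, so check your own server logs for requests to /llms.txt. Adoption is growing. If you are publishing content you want AI search engines to cite accurately, this file is cheap to add and worth having.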

ai.txt — the wildcard

ai.txt is a newer proposal from a different working group. Where llms.txt focuses on what an LLM should know about your site, ai.txt focuses on granting or denying permission for AI assistants to use your content in responses.

Basic syntax:

# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true
format: "Source: {title} ({url})"

[contact]
email: you@yourdomain.com

Honest assessment: ai.txt has minimal enforcer support right now. No major AI company officially reads it. The spec is still evolving. That said, if the initiative gains traction (similar to how robots.txt went from informal convention to de-facto standard), having it early costs nothing and signals intent.

For most developers today: add it, keep it simple, and do not spend more than 10 minutes on it.

How a crawler actually decides what to read

A useful mental model of the flow when a modern AI crawler like GPTBot or ClaudeBot hits your domain (vendors do not publish the exact sequence):

1. Fetch robots.txt: am I allowed to crawl this path?
2. If allowed, fetch the page HTML.
3. Fetch llms.txt (periodically, not per-request): what is this site actually about?
4. Check ai.txt if the implementation supports it.
5. Index the content with the context from steps 2–4 combined.

The key insight: robots.txt gates every page fetch, although the file itself is usually cached for up to a day rather than downloaded again each time. llms.txt is fetched periodically and cached; it shapes how the model understands your whole site over time, not just whether it can read one page.
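The "fetched periodically and cached" part can be sketched as a small time-to-live cache. This is a hypothetical illustration of the idea, not how any particular crawler is implemented; TtlCache, the 60-second TTL, and the simulated clock are all assumptions chosen so the behavior is visible without a network.

```typescript
// Hypothetical crawler-side cache: a file is re-fetched only after its
// time-to-live expires; until then the cached copy is reused.
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>()
  private ttlMs: number
  private now: () => number

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs
    this.now = now
  }

  // Return the cached value, or call load() and cache its result.
  get(key: string, load: () => T): T {
    const hit = this.entries.get(key)
    if (hit && hit.expiresAt > this.now()) return hit.value
    const value = load()
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
    return value
  }
}

// Simulated clock and fetch counter stand in for real time and real HTTP.
let clock = 0
let fetches = 0
const cache = new TtlCache<string>(60_000, () => clock)
const loadLlmsTxt = () => { fetches += 1; return '# YourSite' }

cache.get('/llms.txt', loadLlmsTxt) // first visit: fetched
cache.get('/llms.txt', loadLlmsTxt) // second visit: served from cache
clock += 61_000                     // more than the TTL passes
cache.get('/llms.txt', loadLlmsTxt) // fetched again
console.log(fetches) // 2
```

The practical consequence is that a change to llms.txt or robots.txt may take a while to be noticed, so do not expect an edit to take effect on the next request.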

Putting it all together in Next.js App Router

Here is the complete implementation for a Next.js 15 App Router site:

app/
├── robots.ts          ← generates /robots.txt
├── sitemap.ts         ← generates /sitemap.xml
├── llms.txt/
│   └── route.ts       ← generates /llms.txt (dynamic, always in sync)
public/
└── ai.txt             ← static file, update manually

robots.ts (full version with AI crawler rules):

// app/robots.ts
import type { MetadataRoute } from 'next'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Default: allow everything
      { userAgent: '*', allow: '/' },

      // GPTBot: allow blog, services, and tools; block everything else.
      // A named group replaces the '*' group, so '/' must be disallowed here.
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/'],
      },

      // ClaudeBot: same rules
      {
        userAgent: 'ClaudeBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/'],
      },

      // PerplexityBot: full access (it drives meaningful referral traffic)
      { userAgent: 'PerplexityBot', allow: '/' },

      // Google-Extended (used for Gemini training): restrict to blog only
      {
        userAgent: 'Google-Extended',
        allow: ['/blog/'],
        disallow: ['/'],
      },
    ],
    sitemap: `${BASE_URL}/sitemap.xml`,
  }
}
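Since the GPTBot and ClaudeBot entries are identical, one option is to generate them from a list. A minimal sketch, assuming the intent is an allow-list where everything not listed is blocked, as the comment on the GPTBot rule says. RobotsRule is a local stand-in for the Next.js rule shape so the snippet runs without Next.js, and allowListRules is a hypothetical helper, not a Next.js API.

```typescript
// Local stand-in for the rule objects returned from app/robots.ts.
type RobotsRule = {
  userAgent: string
  allow?: string | string[]
  disallow?: string | string[]
}

// Hypothetical helper: one allow-list rule per AI crawler.
function allowListRules(userAgents: string[], allowedPaths: string[]): RobotsRule[] {
  return userAgents.map(userAgent => ({
    userAgent,
    allow: allowedPaths,
    disallow: ['/'], // everything not explicitly allowed
  }))
}

const rules: RobotsRule[] = [
  { userAgent: '*', allow: '/' },
  ...allowListRules(['GPTBot', 'ClaudeBot'], ['/blog/', '/services/', '/tools/']),
]

console.log(rules.map(r => r.userAgent)) // [ '*', 'GPTBot', 'ClaudeBot' ]
```

Adding another crawler then means adding one string to the list rather than copying a block.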

llms.txt/route.ts (dynamic, pulls from your data layer):

// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'
import { servicePosts } from '@/data/services'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export async function GET() {
  const content = `# YourSite

> [One-sentence description of your site and its purpose.]

[Two or three sentences giving an LLM the context it needs to cite your site accurately.
What topics do you cover? Who is the author? What makes this site's perspective unique?]

## Blog

${blogPosts.map(p => `- [${p.title}](${BASE_URL}/blog/${p.slug}): ${p.excerpt}`).join('\n')}

## Services

${servicePosts.map(s => `- [${s.title}](${BASE_URL}/services/${s.slug}): ${s.summary}`).join('\n')}

## Author

- Name: Your Name
- Site: ${BASE_URL}
- Expertise: [Your primary expertise areas]
`

  return new Response(content, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      'Cache-Control': 'public, max-age=3600, stale-while-revalidate=86400',
    },
  })
}

public/ai.txt (static, update when the spec stabilizes):

# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true

[contact]
site: https://yourdomain.com

Which file do you actually need?

Start here:

Building a public Next.js site? → Add robots.txt first. Always.
Want AI search engines (ChatGPT, Perplexity, Claude) to cite your content accurately? → Add llms.txt. It is the one file of the three written for language models rather than for crawlers.
Want to future-proof against the ai.txt spec gaining enforcement? → Add a simple public/ai.txt. It costs 10 lines.
Want to block AI crawlers from training on your content? → Set Disallow: / for GPTBot, ClaudeBot, and Google-Extended in robots.txt. It is the only one of the three files that the major AI crawlers document honoring today, though it remains a request rather than a technical block.
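For the last case, the resulting robots.txt groups are short enough to generate from a list of user-agent names. A small sketch (blockCrawlers is a hypothetical helper; the three names are the ones mentioned above):

```typescript
// Produce one "Disallow: /" group per crawler, separated by blank lines.
function blockCrawlers(userAgents: string[]): string {
  return userAgents
    .map(ua => `User-agent: ${ua}\nDisallow: /`)
    .join('\n\n') + '\n'
}

const blocked = blockCrawlers(['GPTBot', 'ClaudeBot', 'Google-Extended'])
console.log(blocked)
// User-agent: GPTBot
// Disallow: /
//
// User-agent: ClaudeBot
// Disallow: /
//
// User-agent: Google-Extended
// Disallow: /
```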

The one mistake I see most often: developers add robots.txt but skip llms.txt, then wonder why ChatGPT gives wrong answers about their site even though Googlebot indexes it fine. Googlebot and GPTBot-for-knowledge are completely different crawlers with different purposes.

Three things to verify right now

Open your terminal and check these three URLs on your live site:

curl -i https://yourdomain.com/robots.txt
curl -i https://yourdomain.com/llms.txt
curl -i https://yourdomain.com/ai.txt

If any returns a 404 or an HTML error page, that file is missing. For robots.txt, a missing file means crawlers assume full access, which is usually fine for public sites, but you lose granular control. For llms.txt, missing means LLMs form their understanding of your site from raw page HTML with no structured context, which makes inaccurate citations more likely.
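If you would rather script the check, the status code and Content-Type header of each response are enough to tell the three outcomes apart. A minimal sketch of that classification (FileStatus and classifyResponse are illustrative names; a 200 with text/html is treated as a fallback page on the assumption that these files should be served as plain text):

```typescript
type FileStatus = 'ok' | 'missing' | 'html-fallback' | 'error'

// Classify one response from /robots.txt, /llms.txt, or /ai.txt.
function classifyResponse(status: number, contentType: string): FileStatus {
  if (status === 404 || status === 410) return 'missing'
  if (status < 200 || status >= 300) return 'error'
  // A 200 with an HTML body usually means a catch-all page, not the real file.
  return contentType.toLowerCase().startsWith('text/html') ? 'html-fallback' : 'ok'
}

console.log(classifyResponse(200, 'text/plain; charset=utf-8')) // ok
console.log(classifyResponse(404, 'text/html'))                 // missing
console.log(classifyResponse(200, 'text/html; charset=utf-8'))  // html-fallback
```

The status and header values can come from any HTTP client, for example the response object returned by fetch in a Node script.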

If you want a deeper look at how AI crawlers read Next.js sites specifically — what RSC payloads they fetch, how streaming affects what they see, and which metadata fields they actually use — I have a longer writeup on the AI-search architecture patterns I use in production that goes further than this cheat sheet.

And if you want this wired up on your own site end-to-end, that is exactly the kind of work I take on.

If your own llms.txt or robots.txt setup looks different from what I showed here — especially if you are on an older Next.js version or using the Pages Router — drop it in the comments. Curious what variations people are running in production.
