llms.txt vs robots.txt vs ai.txt: The Developer's Cheat Sheet
Mudassir Kha · 2026-05-23 · via DEV Community

Every Next.js developer building a public site in the last 18 months has hit the same wall: you Google "how to control what AI crawlers read" and get three different answers pointing at three different files — robots.txt, llms.txt, and ai.txt. They are not the same thing. They do not talk to the same audience. And using the wrong one (or none at all) means AI search engines are either ignoring your content entirely or indexing pages you never intended them to.

This is the one-stop breakdown I wish existed when I was figuring this out.


The three files at a glance

| | robots.txt | llms.txt | ai.txt |
|---|---|---|---|
| Proposed by | Martijn Koster (1994) | Jeremy Howard, Answer.AI (2024) | AI-txt.com initiative |
| Primary audience | Web crawlers (Googlebot, Bingbot, etc.) | LLM training & AI search crawlers | AI assistants (ChatGPT, Claude, Gemini) |
| Format | Key-value directives | Markdown / structured text | Key-value + JSON blocks |
| Spec status | IETF standard (RFC 9309), universally supported | Emerging, growing adoption | Early proposal, limited adoption |
| Enforced by | All major search engines | Not formally enforced; adopted by Anthropic, Perplexity, some others | No major enforcer yet |
| Location | yourdomain.com/robots.txt | yourdomain.com/llms.txt | yourdomain.com/ai.txt |

The short mental model: robots.txt is for Googlebot. llms.txt is for ClaudeBot and GPTBot when they are building knowledge, not just indexing. ai.txt is a newer proposal that tries to cover AI assistants reading your site in real time. Use all three if you want complete coverage — but robots.txt + llms.txt is where you get 90% of the value today.


robots.txt — the original gatekeeper

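robots.txt has been around since 1994. Well-behaved crawlers, including Googlebot, Bingbot, DuckDuckBot, GPTBot, ClaudeBot, and PerplexityBot, check it before crawling. Compliance is voluntary, but the major vendors document that their bots honor it: if you block GPTBot in robots.txt, it will not crawl your site for training data. Note that OpenAI uses a separate user agent, OAI-SearchBot, for search indexing, so block or allow that one separately.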

Basic syntax:

User-agent: *
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /blog/
Disallow: /

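User-agent: * applies to every crawler that has no group of its own. A named group replaces * for that bot rather than adding to it, so a crawler that matches User-agent: GPTBot ignores the * rules entirely. Allow and Disallow match path prefixes, and when both match a URL, the longest (most specific) rule wins. The original 1994 spec had no wildcards, though most modern crawlers support * and $.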
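To make the matching rules concrete, here is a minimal sketch of how a compliant crawler might evaluate the example above. It is a simplification of RFC 9309 (exact, case-insensitive user-agent match, plain prefix matching, no wildcards), and every type and function name in it is hypothetical rather than taken from a real crawler.

```typescript
// Simplified robots.txt evaluation for one crawler. Names are illustrative.
type Rule = { type: "allow" | "disallow"; path: string };
type Group = { userAgents: string[]; rules: Rule[] };

// Pick the group that names this crawler; fall back to "*" only if none does.
function selectGroup(groups: Group[], crawler: string): Group | undefined {
  const name = crawler.toLowerCase();
  const named = groups.find(g => g.userAgents.some(ua => ua.toLowerCase() === name));
  return named ?? groups.find(g => g.userAgents.includes("*"));
}

// Longest matching prefix wins; on a tie, allow wins. No match means allowed.
function isAllowed(groups: Group[], crawler: string, path: string): boolean {
  const group = selectGroup(groups, crawler);
  if (!group) return true;
  let best: Rule | undefined;
  for (const rule of group.rules) {
    if (rule.path === "" || !path.startsWith(rule.path)) continue;
    if (
      !best ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.type === "allow")
    ) {
      best = rule;
    }
  }
  return best ? best.type === "allow" : true;
}

// The three groups from the robots.txt example above.
const exampleGroups: Group[] = [
  { userAgents: ["*"], rules: [
    { type: "disallow", path: "/admin/" },
    { type: "disallow", path: "/private/" },
  ] },
  { userAgents: ["GPTBot"], rules: [{ type: "disallow", path: "/" }] },
  { userAgents: ["ClaudeBot"], rules: [
    { type: "allow", path: "/blog/" },
    { type: "disallow", path: "/" },
  ] },
];
```

With these rules, ClaudeBot may fetch /blog/post but nothing else, GPTBot is blocked everywhere, and a crawler with no named group falls back to the * rules.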

In Next.js App Router — generate it dynamically from app/robots.ts:

// app/robots.ts
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        allow: '/',
        disallow: ['/admin/', '/api/private/'],
      },
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/'],
        disallow: ['/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}

Next.js renders this as text/plain at /robots.txt automatically. No separate file needed.

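What robots.txt does NOT do: it is not access control. It asks compliant crawlers not to fetch the listed paths, but it does not remove content a crawler fetched before you added the rule, it does not hide a URL that other pages or sitemaps point to, and it does nothing against bots that ignore it. If GPTBot crawled your /admin/ page before you disallowed it, that copy may already be cached. Protect anything sensitive with authentication instead.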


llms.txt — built for the AI era

llms.txt was proposed by Jeremy Howard of Answer.AI in 2024 and picked up by Anthropic, Perplexity, and others as a structured way to tell LLMs what your site is actually about: not just what they can crawl, but what context they should carry when reasoning about your content.

Unlike robots.txt which is access control, llms.txt is documentation. Think of it as a README for your site aimed at language models.

Basic structure:

# YourSite

> One-line description of what this site is and who it's for.

A few sentences of context. What does this site do? Who is the author?
What should an LLM understand before citing any page from this domain?

## Blog

- [Post title one](https://yourdomain.com/blog/post-one): One-line summary.
- [Post title two](https://yourdomain.com/blog/post-two): One-line summary.

## Services

- [Service name](https://yourdomain.com/services/name): What this service does in one line.

## Contact

- Author: Your Name
- Email: you@yourdomain.com
- LinkedIn: linkedin.com/in/yourhandle

The format is intentionally plain Markdown. No special parser needed — any LLM can read it. The /llms.txt path is the convention; some sites also serve /llms-full.txt with deeper content for models that want more context.
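To illustrate how little parsing the format needs, here is a small sketch that reads the structure shown above: the title, the blockquote summary, and the links under each section. The types and the parseLlmsTxt function are hypothetical, not part of any library, and the sketch handles only the subset of Markdown used in this article.

```typescript
// Minimal llms.txt reader: title, summary, and links grouped by section.
type LlmsLink = { title: string; url: string; note: string };
type LlmsDoc = { title: string; summary: string; sections: Record<string, LlmsLink[]> };

function parseLlmsTxt(text: string): LlmsDoc {
  const doc: LlmsDoc = { title: "", summary: "", sections: {} };
  let current = "";
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("## ")) {
      // A second-level heading opens a new section.
      current = line.slice(3).trim();
      doc.sections[current] = [];
    } else if (line.startsWith("# ")) {
      doc.title = line.slice(2).trim();
    } else if (line.startsWith("> ")) {
      doc.summary = line.slice(2).trim();
    } else if (current !== "") {
      // Matches "- [title](url)" with an optional ": note" suffix.
      const m = /^- \[([^\]]+)\]\(([^)]+)\)(?::\s*(.*))?$/.exec(line);
      if (m) doc.sections[current].push({ title: m[1], url: m[2], note: m[3] ?? "" });
    }
  }
  return doc;
}
```

Plain list items such as "- Author: Your Name" are not links, so this sketch skips them and leaves that section's link list empty.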

In Next.js — generate dynamically from app/llms.txt/route.ts:

// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'

export async function GET() {
  const lines = [
    '# YourSite',
    '',
    '> AI-search-ready Next.js development and SEO consulting.',
    '',
    'This site covers AI engineering, GEO/AEO, and production Next.js patterns.',
    '',
    '## Blog',
    '',
    ...blogPosts.map(p => `- [${p.title}](https://yourdomain.com/blog/${p.slug}): ${p.excerpt}`),
    '',
    '## Services',
    '',
    '- [AI-Search Consulting](https://yourdomain.com/services): End-to-end GEO and AEO for Next.js sites.',
  ]

  return new Response(lines.join('\n'), {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  })
}

This keeps llms.txt in sync with your actual content automatically — no manual updates.
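One caveat with generated output: a title or excerpt that contains line breaks or square brackets can break the one-link-per-line list. A small helper like the sketch below could be used inside the map call. The name toLlmsLine is hypothetical, and the cleanup rules are an assumption about what is safe rather than anything the llms.txt proposal requires.

```typescript
// Hypothetical helper: build one llms.txt link line from untrusted content fields.
function toLlmsLine(title: string, url: string, summary: string): string {
  // Collapse whitespace (including newlines) so each entry stays on one line.
  const oneLine = (s: string) => s.replace(/\s+/g, " ").trim();
  // Square brackets in the title would end the Markdown link text early.
  const safeTitle = oneLine(title).replace(/[\[\]]/g, "");
  const safeSummary = oneLine(summary);
  return safeSummary === ""
    ? `- [${safeTitle}](${url})`
    : `- [${safeTitle}](${url}): ${safeSummary}`;
}
```

Entries with an empty summary come out as a bare link, which is still valid in the list.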

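Who reads llms.txt today: reportedly ClaudeBot (Anthropic's crawler), PerplexityBot, and some versions of GPTBot, though vendor documentation on this is sparse, so treat the list as provisional. Adoption among site owners is growing fast. If you are publishing content you want AI search engines to cite accurately, this file is worth the few minutes it takes to add.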


ai.txt — the wildcard

ai.txt is a newer proposal from a different working group. Where llms.txt focuses on what an LLM should know about your site, ai.txt focuses on granting or denying permission for AI assistants to use your content in responses.

Basic syntax:

# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true
format: "Source: {title} ({url})"

[contact]
email: you@yourdomain.com

Honest assessment: ai.txt has minimal enforcer support right now. No major AI company officially reads it. The spec is still evolving. That said, if the initiative gains traction (similar to how robots.txt went from informal convention to de-facto standard), having it early costs nothing and signals intent.

For most developers today: add it, keep it simple, and do not spend more than 10 minutes on it.


How a crawler actually decides what to read

The flow for a modern AI crawler like GPTBot or ClaudeBot hitting your domain:

  1. Fetch robots.txt — am I allowed to crawl this path?
  2. If allowed, fetch the page HTML
  3. Fetch llms.txt (periodically, not per-request) — what is this site actually about?
  4. Check ai.txt if the implementation supports it
  5. Index the content with the context from steps 2–4 combined

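The key insight: robots.txt rules are applied to every URL the crawler requests, although the file itself is usually cached (RFC 9309 advises against using a cached copy for more than 24 hours). llms.txt is fetched periodically and cached, so it shapes how the model understands your whole site over time, not just whether it can read one page.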
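The caching side of that flow can be sketched as a small time-based cache. This is only an illustration: the class name, the injected clock values, and the weekly llms.txt refresh interval are assumptions, not figures published by any crawler vendor. Only the 24-hour ceiling for robots.txt comes from RFC 9309.

```typescript
// Sketch of per-file caching with a time-to-live. Names and TTLs are illustrative.
type CachedFile = { body: string; fetchedAt: number };

class SiteFileCache {
  private readonly entries = new Map<string, CachedFile>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Return the cached body if it is still fresh, otherwise undefined.
  get(url: string, now: number): string | undefined {
    const hit = this.entries.get(url);
    if (!hit || now - hit.fetchedAt >= this.ttlMs) return undefined;
    return hit.body;
  }

  // Store a freshly fetched body along with the time it was fetched.
  set(url: string, body: string, now: number): void {
    this.entries.set(url, { body, fetchedAt: now });
  }
}

const HOUR = 60 * 60 * 1000;
// RFC 9309 advises against using a cached robots.txt for more than 24 hours.
const robotsCache = new SiteFileCache(24 * HOUR);
// Assumed for illustration: llms.txt is refreshed far less often, here weekly.
const llmsCache = new SiteFileCache(7 * 24 * HOUR);
```

A crawler built this way would re-fetch a file only when get returns undefined, which is why a robots.txt change can take up to a day to be noticed.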


Putting it all together in Next.js App Router

Here is the complete implementation for a Next.js 15 App Router site:

app/
├── robots.ts          ← generates /robots.txt
├── sitemap.ts         ← generates /sitemap.xml
├── llms.txt/
│   └── route.ts       ← generates /llms.txt (dynamic, always in sync)
public/
└── ai.txt             ← static file, update manually

robots.ts (full version with AI crawler rules):

// app/robots.ts
import type { MetadataRoute } from 'next'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Default: allow everything
      { userAgent: '*', allow: '/' },

      // GPTBot: allow blog, services, and tools; block everything else.
      // A named group replaces '*', so the catch-all disallow must live here.
      {
        userAgent: 'GPTBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/'],
      },

      // ClaudeBot: same rules
      {
        userAgent: 'ClaudeBot',
        allow: ['/blog/', '/services/', '/tools/'],
        disallow: ['/'],
      },

      // PerplexityBot: full access (it drives meaningful referral traffic)
      { userAgent: 'PerplexityBot', allow: '/' },

      // Google-Extended (used for Gemini training): restrict to blog only
      {
        userAgent: 'Google-Extended',
        allow: ['/blog/'],
        disallow: ['/'],
      },
    ],
    sitemap: `${BASE_URL}/sitemap.xml`,
  }
}

llms.txt/route.ts (dynamic, pulls from your data layer):

// app/llms.txt/route.ts
import { blogPosts } from '@/data/blog-posts'
import { servicePosts } from '@/data/services'

const BASE_URL = process.env.NEXT_PUBLIC_SITE_URL ?? 'https://yourdomain.com'

export async function GET() {
  const content = `# YourSite

> [One-sentence description of your site and its purpose.]

[Two or three sentences giving an LLM the context it needs to cite your site accurately.
What topics do you cover? Who is the author? What makes this site's perspective unique?]

## Blog

${blogPosts.map(p => `- [${p.title}](${BASE_URL}/blog/${p.slug}): ${p.excerpt}`).join('\n')}

## Services

${servicePosts.map(s => `- [${s.title}](${BASE_URL}/services/${s.slug}): ${s.summary}`).join('\n')}

## Author

- Name: Your Name
- Site: ${BASE_URL}
- Expertise: [Your primary expertise areas]
`

  return new Response(content, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      'Cache-Control': 'public, max-age=3600, stale-while-revalidate=86400',
    },
  })
}

public/ai.txt (static, update when the spec stabilizes):

# ai.txt
Version: 1.0

[permissions]
allow: true
commercial-use: false
training: false
real-time-access: true

[attribution]
require: true

[contact]
site: https://yourdomain.com


Which file do you actually need?

Start here:

  • Building a public Next.js site? → Add robots.txt first. Always.
  • Want AI search engines (ChatGPT, Perplexity, Claude) to cite your content accurately? → Add llms.txt. This is the highest-leverage file for AI-search visibility right now.
  • Want to future-proof against the ai.txt spec gaining enforcement? → Add a simple public/ai.txt. It costs 10 lines.
  • Want to block AI crawlers from training on your content? → Set Disallow: / for GPTBot, ClaudeBot, and Google-Extended in robots.txt. Of the three files, this is the only one those crawlers are documented to honor today, and compliance is still voluntary.
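For sites that serve a static robots.txt rather than app/robots.ts, the blocking option in the last bullet could be generated with a helper like this sketch. The function name renderBlockList is hypothetical, and the output deliberately leaves every other crawler unrestricted.

```typescript
// Hypothetical helper: render a robots.txt that blocks the given crawlers
// everywhere and leaves all other crawlers unrestricted.
function renderBlockList(blockedAgents: string[]): string {
  const groups = blockedAgents.map(agent => `User-agent: ${agent}\nDisallow: /`);
  groups.push("User-agent: *\nAllow: /");
  return groups.join("\n\n") + "\n";
}

// One group per training crawler named in the list above.
const robotsTxt = renderBlockList(["GPTBot", "ClaudeBot", "Google-Extended"]);
```

Each blocked crawler gets its own group, so none of them falls through to the permissive * group at the end.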

The one mistake I see most often: developers add robots.txt but skip llms.txt, then wonder why ChatGPT gives wrong answers about their site even though Googlebot indexes it fine. Googlebot and GPTBot-for-knowledge are completely different crawlers with different purposes.


Three things to verify right now

Open your terminal and check these three URLs on your live site:

curl https://yourdomain.com/robots.txt
curl https://yourdomain.com/llms.txt
curl https://yourdomain.com/ai.txt

If any returns a 404 or an HTML error page, that file is missing. For robots.txt, a missing file means crawlers assume full access, which is usually fine for public sites, but you lose granular control. For llms.txt, a missing file means LLMs form their understanding of your site from raw page HTML with no structured context, which makes inaccurate citations more likely.
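The same check can be scripted. The sketch below is one possible approach, assuming Node 18+ for the global fetch; the names classify and checkSite are hypothetical. It treats a 200 response with an HTML content type as a failure, since that usually means a catch-all page was served instead of the file.

```typescript
// Hypothetical checker for the three files. Requires Node 18+ for global fetch.
type FileStatus = "ok" | "missing" | "wrong-type";

// A 2xx response with an HTML content type usually means a catch-all page.
function classify(status: number, contentType: string | null): FileStatus {
  if (status < 200 || status >= 300) return "missing";
  if ((contentType ?? "").toLowerCase().includes("text/html")) return "wrong-type";
  return "ok";
}

// Fetch each file from the given origin and classify the response.
async function checkSite(origin: string): Promise<Record<string, FileStatus>> {
  const results: Record<string, FileStatus> = {};
  for (const path of ["/robots.txt", "/llms.txt", "/ai.txt"]) {
    const res = await fetch(new URL(path, origin));
    results[path] = classify(res.status, res.headers.get("content-type"));
  }
  return results;
}

// Usage: checkSite("https://yourdomain.com").then(console.log);
```

Keeping classify separate from the network call makes the decision logic easy to test without hitting a live site.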

If you want a deeper look at how AI crawlers read Next.js sites specifically — what RSC payloads they fetch, how streaming affects what they see, and which metadata fields they actually use — I have a longer writeup on the AI-search architecture patterns I use in production that goes further than this cheat sheet.

And if you want this wired up on your own site end-to-end, that is exactly the kind of work I take on.


If your own llms.txt or robots.txt setup looks different from what I showed here — especially if you are on an older Next.js version or using the Pages Router — drop it in the comments. Curious what variations people are running in production.