
Should Websites Allow AI Search Crawlers?
AIvsRank · 2026-05-22 · via DEV Community

Should websites allow AI search crawlers?

Not blindly.

Blocking every AI crawler can protect content from some forms of reuse, but it can also make a site less visible in AI answers, citations, and assistant workflows. Allowing every AI crawler can increase exposure, but it can also let AI systems summarize the content without sending traffic back.

The better question is:

Which crawler should be allowed, for which purpose, on which content?

That matters because "AI crawler" is not one category.

Search, AI input, and training are different

A crawler may be used for:

  1. Search indexing
  2. AI answer grounding
  3. Model training
  4. User-triggered fetching
  5. Agent or enterprise workflows

Those are different use cases.

Cloudflare's managed robots.txt documentation uses a helpful split: search, ai-input, and ai-train.

In that split, search covers building an index and returning links or short excerpts. The ai-input signal covers using content for real-time generative answers, grounding, or retrieval-augmented generation. The ai-train signal covers using content to train or fine-tune models.

That is the right mental model.

Do not treat all crawling as the same act.

OAI-SearchBot and GPTBot are different

OpenAI separates search visibility from training in its crawler documentation.

OAI-SearchBot is used for ChatGPT search features. OpenAI says sites that opt out of OAI-SearchBot will not be shown in ChatGPT search answers, though they may still appear as navigational links.

GPTBot is different. It is used for content that may be used in training OpenAI's generative AI foundation models.

A site might choose:

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /


That means:

  1. Allow ChatGPT search visibility.
  2. Block GPTBot training use.

This is not universal advice. It is a policy pattern. A public documentation site, a SaaS marketing site, a media company, and a paid research database may all choose differently.

The important point is that search and training should be separate decisions.
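One way to sanity-check a split like this before deploying it is to run the rules through a parser. The sketch below uses Python's standard-library urllib.robotparser; the example.com URL and the can_fetch helper are placeholders for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same policy pattern shown above.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and /docs/ are hypothetical; substitute your own pages.
search_allowed = can_fetch(ROBOTS_TXT, "OAI-SearchBot", "https://example.com/docs/")
training_allowed = can_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/docs/")
print(search_allowed, training_allowed)  # prints: True False
```

This only confirms what the file says. Whether a given crawler honors it is a separate question that server logs can help answer.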

Googlebot and Google-Extended are different too

Googlebot is used for normal Google Search discovery and indexing. Blocking Googlebot can hurt Google Search visibility.

Google-Extended is a separate robots.txt product token. Google says it can be used to manage whether content Google crawls may be used for certain Gemini training and grounding uses. Google also says Google-Extended does not affect inclusion in Google Search and is not used as a Search ranking signal.

A basic split might be:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /


But Google-Extended is not a full opt-out from every Google AI feature. Google has said it is exploring more specific controls for Search generative AI features in its website controls update.

So do not block Googlebot if search visibility matters. And do not treat Google-Extended as a universal AI switch.

robots.txt helps, but it is not enough

robots.txt is useful for compliant crawlers.

Google's robots.txt documentation explains that crawlers use the most specific matching user-agent group. If the file is messy, a crawler may follow a different group than the one you expected.
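A small sketch of how that can go wrong, again using Python's urllib.robotparser as a stand-in for a compliant crawler. The file contents, the SomeOtherBot name, and the example.com paths are made up for illustration. Here the wildcard group protects /private/, but the GPTBot group does not repeat that rule, so GPTBot follows only its own group.

```python
from urllib.robotparser import RobotFileParser

# A deliberately messy file: the GPTBot group forgets /private/.
MESSY_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /blog/
"""

parser = RobotFileParser()
parser.parse(MESSY_ROBOTS_TXT.splitlines())

# Check every (crawler, path) combination against the parsed rules.
results = {
    (agent, path): parser.can_fetch(agent, f"https://example.com{path}")
    for agent in ("GPTBot", "SomeOtherBot")
    for path in ("/private/report", "/blog/post")
}

for (agent, path), allowed in sorted(results.items()):
    print(f"{agent:13} {path:16} {'allowed' if allowed else 'blocked'}")
```

In this sketch GPTBot is allowed into /private/report even though the wildcard group disallows it, because a crawler with its own group does not also inherit the wildcard rules. Repeating shared Disallow lines in every named group avoids the surprise.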

A useful robots.txt review should ask:

  1. Which classic search crawlers do we allow?
  2. Which AI search crawlers do we allow?
  3. Which training crawlers do we block?
  4. Which directories should no crawler access?
  5. Which pages need meta robots or X-Robots-Tag controls?
  6. Which content needs authentication instead of robots.txt?

robots.txt is not:

  1. A security mechanism
  2. A paywall
  3. A copyright contract
  4. A complete bot defense system

Private or premium content needs stronger controls such as authentication, paywalls, network rules, and licensing terms.

Why allow AI search crawlers?

If AI search systems cannot access your content, they may not mention it, cite it, or use it in answers.

That matters for:

  1. SaaS sites
  2. Documentation sites
  3. Ecommerce stores
  4. Local businesses
  5. Education sites
  6. Research projects
  7. Product comparison pages

AIvsRank's AI Crawler Access Checker can help diagnose whether important pages are reachable. Its guide on how to optimize for AI search engines explains the broader workflow: access, eligibility, extractability, citation readiness, visibility, and measurement.

Access is only the first step.

A page also needs to be clear, current, credible, internally linked, and easy to cite.

Why block some AI crawlers?

The main risk is summary substitution.

AI systems can use your content to answer the user's question without sending the user to your page.

Pew Research Center found that Google users clicked a traditional result in 8% of visits when an AI summary appeared, compared with 15% without one. Links inside AI summaries were clicked in only 1% of visits to pages with such summaries, according to Pew's analysis.

So the tradeoff is real:

  1. Blocking can reduce visibility.
  2. Allowing can reduce clicks.
  3. Training use may create value far away from the original site.
  4. AI summaries may weaken attribution or misrepresent the source.

AIvsRank's article on how AI search rewrites information is relevant because the issue is not only ranking. It is also attribution, framing, and representation.

Licensing belongs in the crawler policy

For valuable content, crawler rules are not enough.

Cloudflare Content Signals can express preferences such as:

Content-signal: search=yes, ai-input=no, ai-train=no


The RSL specification also defines a machine-readable way to express usage, licensing, payment, and legal terms for digital assets.

Not every crawler will honor every signal. But the direction is clear: websites need to express not only who can crawl, but what the content can be used for.
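To make the idea of a machine-readable usage preference concrete, here is a minimal sketch that reads a Content-signal line like the one above into a dictionary. The parse_content_signal function is hypothetical and simplified; it is not Cloudflare's implementation and does not cover the full syntax of any specification.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse 'Content-signal: search=yes, ai-input=no' into {'search': True, ...}."""
    field, _, value = line.partition(":")
    if field.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-signal line: {line!r}")

    signals: dict[str, bool] = {}
    for pair in value.split(","):
        name, _, setting = pair.partition("=")
        name, setting = name.strip().lower(), setting.strip().lower()
        if setting not in ("yes", "no"):
            continue  # skip empty or unrecognized entries rather than guess
        signals[name] = setting == "yes"
    return signals


policy = parse_content_signal("Content-signal: search=yes, ai-input=no, ai-train=no")
print(policy)  # prints: {'search': True, 'ai-input': False, 'ai-train': False}
```

A signal that is absent from the result has simply not been expressed, which is different from an explicit no.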

robots.txt answers one question:

Who may crawl?

Licensing answers another:

What may the content be used for?

Both questions matter now.

Practical policy by site type

There is no universal robots.txt file for AI crawlers.

The right policy depends on the site.

Broad discovery sites

Examples:

  1. SaaS marketing sites
  2. Public documentation
  3. Ecommerce category pages
  4. Local business pages
  5. Open educational content

Default posture:

  1. Allow major search crawlers.
  2. Allow selected AI search crawlers.
  3. Block training crawlers if training use is not desired.
  4. Monitor AI answer visibility and citation quality.
  5. Keep official facts structured and current.

For these sites, total blocking can make the brand invisible in AI answer surfaces.

Exclusive content sites

Examples:

  1. Paid media
  2. Proprietary research
  3. Subscription databases
  4. Premium newsletters
  5. Specialized datasets

Default posture:

  1. Protect premium content behind authentication.
  2. Allow only crawlers that match the business strategy.
  3. Block training crawlers unless there is a licensing agreement.
  4. Use licensing terms where relevant.
  5. Keep public teaser pages crawlable if discovery still matters.

For these sites, the risk is giving away the answer while losing the subscription, ad impression, lead, or licensing value.

Community and forum sites

Examples:

  1. Support forums
  2. Developer communities
  3. Q&A sites
  4. User-generated content platforms

Default posture:

  1. Protect private or sensitive areas.
  2. Clarify user-generated content terms.
  3. Decide whether public answers should be usable in AI search.
  4. Watch for bot load.
  5. Block crawlers that ignore policy or create operational cost.
  6. Preserve user trust.

Communities have an extra issue: the content comes from users. Crawler policy is therefore a question of user trust and content terms as well as an SEO decision.

Useful robots.txt patterns

These are starting points, not universal rules.

Pattern 1: Allow AI search, block training

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /


This supports ChatGPT search visibility through OAI-SearchBot while blocking GPTBot training use. It also keeps Googlebot open for Search while opting out of Google-Extended uses described by Google.

Pattern 2: Protect premium directories

User-agent: *
Disallow: /members/
Disallow: /premium/
Disallow: /internal/
Allow: /


For truly private content, do not rely only on robots.txt. Use authentication.
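A quick check of Pattern 2 with Python's urllib.robotparser, as a sketch. AnyBot, example.com, and the sample paths are placeholders. One caveat: this parser applies the first matching rule in file order, while Google documents choosing the most specific (longest) matching path. With the Disallow lines placed before Allow: /, both approaches give the same answer for this file.

```python
from urllib.robotparser import RobotFileParser

PREMIUM_ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /premium/
Disallow: /internal/
Allow: /
"""

parser = RobotFileParser()
parser.parse(PREMIUM_ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "AnyBot") -> bool:
    """Return True if a compliant crawler may fetch the given path."""
    return parser.can_fetch(agent, f"https://example.com{path}")


# Hypothetical paths: two public, two inside protected directories.
checks = {
    path: allowed(path)
    for path in ("/", "/blog/post", "/premium/report", "/members/profile")
}
for path, ok in checks.items():
    print(f"{path:18} {'allowed' if ok else 'blocked'}")
```

A blocked result here only means a compliant crawler should stay out. Anyone who ignores robots.txt can still request the URL, which is why authentication matters.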

Pattern 3: Add content-use signals

User-agent: *
Content-signal: search=yes, ai-input=no, ai-train=no
Allow: /


This is an additional policy signal. It is not a replacement for normal allow and disallow rules.

What to monitor after changing crawler rules

Do not update robots.txt and walk away.

Track:

  1. Server logs for relevant crawlers
  2. Search Console indexing and crawl changes
  3. AI answer visibility for important prompts
  4. Whether cited URLs support the claims attached to them
  5. Referral traffic from search and AI tools
  6. Crawl volume and server load
  7. Suspicious bot behavior
  8. Whether premium content is being summarized publicly
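For the server-log item, a minimal sketch of counting hits per crawler token. The log lines below are invented samples in a common access-log shape, and count_crawler_hits is a hypothetical helper. Google-Extended is left out on purpose: it is a robots.txt control token and does not appear as its own user-agent string in requests. User-agent strings can also be spoofed, so treat these counts as a first signal, not proof.

```python
from collections import Counter

# Tokens for crawlers discussed in this article; extend as needed.
CRAWLER_TOKENS = ("OAI-SearchBot", "GPTBot", "Googlebot")

# Invented sample lines using documentation-reserved IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [22/May/2026:10:00:00 +0000] "GET /docs/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.9 - - [22/May/2026:10:00:03 +0000] "GET /pricing HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.7 - - [22/May/2026:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.9 - - [22/May/2026:10:00:09 +0000] "GET /docs/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.44 - - [22/May/2026:10:00:12 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]


def count_crawler_hits(log_lines, tokens=CRAWLER_TOKENS) -> Counter:
    """Count log lines whose text contains each crawler token (case-insensitive)."""
    counts: Counter = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one crawler
    return counts


hits = count_crawler_hits(SAMPLE_LOG)
for token in CRAWLER_TOKENS:
    print(f"{token:14} {hits[token]}")
```

Comparing these counts before and after a robots.txt change shows whether the crawlers you meant to block actually stopped requesting pages.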

The goal is to learn which layer is working.

If the crawler is blocked, the problem is access: a compliant crawler cannot fetch the page, so the system behind it has nothing to work from.

If the crawler can access the page but the page is not cited, the problem may be content structure or authority.

If the page is cited but the user does not click, the problem may be summary substitution.

If the page is cited incorrectly, the problem is representation.
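Those checks can be written down as a simple decision order, which helps when reviewing many pages. The sketch below is only an illustration of the layers described above. The diagnose function, its parameter names, and the returned labels are hypothetical, and the inputs would come from your own logs and manual checks.

```python
def diagnose(crawler_blocked: bool, cited: bool, gets_clicks: bool, citation_accurate: bool) -> str:
    """Return the first layer that looks broken, checked in the order described above."""
    if crawler_blocked:
        return "access"
    if not cited:
        return "content structure or authority"
    if not gets_clicks:
        return "summary substitution"
    if not citation_accurate:
        return "representation"
    return "no issue detected"


# A page that is reachable and cited, but users never click through.
layer = diagnose(crawler_blocked=False, cited=True, gets_clicks=False, citation_accurate=True)
print(layer)  # prints: summary substitution
```

Only the first failing layer is reported, so a page can need another pass after one problem is fixed.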

AIvsRank's AI visibility leaderboard can help with category-level visibility, while the free tools hub can help with specific access and eligibility checks. For recurring monitoring, AIvsRank features and AIvsRank Docs can help turn one-off checks into a workflow.

A sensible default

For many public websites, a reasonable default is:

  1. Allow classic search crawlers if organic discovery matters.
  2. Allow selected AI search crawlers if answer visibility matters.
  3. Block training crawlers unless there is a business reason to allow training use.
  4. Protect private or premium content with authentication.
  5. Use licensing terms for commercial reuse.
  6. Monitor logs, citations, AI answers, referral traffic, and bot load.
  7. Review the policy regularly.

The goal is not to be fully open or fully closed.

The goal is to make crawler access match the value exchange you are willing to accept.

FAQ

Should websites block all AI crawlers?

Usually no. Blocking everything can reduce AI answer visibility. Selective access is often better.

Should websites allow OAI-SearchBot?

If ChatGPT search visibility matters, allowing OAI-SearchBot may make sense.

Should websites block GPTBot?

If you do not want content used for OpenAI foundation model training, blocking GPTBot is a common choice.

Does blocking Google-Extended remove a site from Google Search?

No. Google says Google-Extended does not affect inclusion in Google Search and is not used as a Search ranking signal.

Is robots.txt enough for premium content?

No. Use authentication, paywalls, network rules, and licensing terms for premium or private content.

What is the biggest risk of allowing AI crawlers?

The biggest risk is summary substitution: the AI system may use your content to answer the user without sending the user to your site.

What is the biggest risk of blocking AI crawlers?

The biggest risk is invisibility in AI answer surfaces.