Summary:
Robots.txt determines which bots can access your site and what they can crawl—impacting everything from Google rankings to how your content is used in AI tools like ChatGPT, Claude, and Perplexity. Allowing the right bots supports discoverability, citation, and organic reach. Blocking the wrong ones can erase your presence from key channels. As AI adoption grows, this simple file has become a high-stakes visibility lever.
Most of us know the robots.txt file as a basic SEO housekeeping tool: a simple text file that tells bots what they can and can't crawl on your site. But in the age of AI, it’s become something more. Your robots.txt file now plays a critical role in shaping how your brand appears (or doesn’t) in generative search, AI assistants, and large language models (LLMs).
It’s no longer just about whether Google can index you. Your website is crawled by a growing cast of AI bots from OpenAI, Anthropic, Perplexity, and others, all looking for high-quality content to use for retrieval or model training. Your robots.txt file acts as the gatekeeper: if these bots aren’t allowed to crawl your site, your content won’t be cited, summarized, or included in the datasets powering answers in AI tools. That means your brand may be missing from the conversations your audience is having with AI.
In this guide, we’ll explain what robots.txt really does, why it matters more than ever, and how brands can approach bot visibility strategically. Whether you're a technical SEO or a CMO making strategic decisions about AI, this guide will help you make sense of what to allow, what to block, and why your decision matters.
A quick refresher: what is robots.txt and why does it matter?
The robots.txt file is a simple set of instructions that sits at the root of your website. It’s the first thing most bots check when they visit your site. It doesn’t control what can be indexed—that’s handled with meta tags—but it does determine what can be crawled.
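To make that distinction concrete, here is a minimal illustration (the /drafts/ path and the page shown are hypothetical). A robots.txt rule stops compliant bots from fetching a URL at all:

```
# robots.txt (crawl control): asks compliant bots not to fetch anything under /drafts/
User-agent: *
Disallow: /drafts/
```

Indexing, by contrast, is controlled on the page itself, in the `<head>`:

```html
<!-- index control: the page can be crawled, but engines are asked not to index it -->
<meta name="robots" content="noindex">
```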
Why does this matter? Because crawling powers discovery. If a bot can’t crawl your content, it can’t learn from it, serve it in a result, or attribute information back to your brand. That means you could be invisible to the very tools your buyers are using to make decisions.
Why this now matters more than ever:
- Bots from LLMs are actively crawling the open web to improve their responses
- Generative AI platforms rely on trusted source material to answer questions
- Sites that are crawlable are more likely to be cited, summarized, and picked up in SERP features like AI Overviews and tools like Perplexity
Your robots.txt file is now a gatekeeper to your participation in AI-powered search.
Opening vs. blocking: a strategic decision
While there are a lot of upsides to allowing all bots to access your site, some brands are choosing to block AI bots. Media companies and content publishers concerned with copyright, content reuse, or monetization models often fall into this camp.
Blocking AI bots may protect content, but it also cuts off opportunity.
Unless you're a major publication with a monetized gated community, blocking is likely not the best approach for your brand.
By allowing trusted LLMs and search bots to crawl your site, you increase your chances of:
- Being cited in ChatGPT responses or AI-generated summaries
- Being selected as a source for Perplexity or Gemini
- Having your products or services discovered earlier in the journey
It’s a strategic choice—but one with real implications for your future visibility.
Considerations before you open the gates
Before you throw open the doors to every crawler, there are some very real considerations to work through, and depending on the size of your organization and the number of stakeholders involved, that can take time. Here are the questions every brand should be asking:
- Do we have proprietary content we want to protect? Keep client portals, internal tools, and sensitive product data off-limits (see the example rules after this list).
- Is our messaging up to date and unified? LLMs can only learn what’s available. If your site is inconsistent or outdated, the wrong story may get amplified.
- Are our most important pages easily discoverable and crawlable? A clear structure with optimized content and internal linking maximizes what AI bots can actually see and understand.
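For example, you can keep your public pages open while fencing off private areas with Disallow rules. The paths below are placeholders, not a prescription; substitute the directories that actually exist on your site:

```
# Keep private and sensitive areas off-limits to all compliant crawlers
User-agent: *
Disallow: /client-portal/
Disallow: /internal-tools/
Disallow: /account/
```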
The strategic upside of opening your robots.txt to LLMs
When your site is crawlable by the right bots, you:
- Increase brand surface area across LLMs and AI assistants
- Improve entity recognition (so your brand is understood in the context of your industry)
- Support accurate citations and summaries across tools like Perplexity, ChatGPT, and Gemini
- Lay groundwork for zero-click visibility in AI Overviews, answers, and assistant replies
AI tools are changing how users find, trust, and engage with brands. Openness fuels your inclusion in that future.
Learn more about entity optimization →
Example: open vs. closed robots.txt
Open robots.txt (default crawl-friendly setup):
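```
# Allow every crawler to access the whole site (an empty Disallow means nothing is blocked)
User-agent: *
Disallow:
```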

This allows all bots to crawl all parts of the site.
Closed to LLM bots (selective blocking):
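```
# Block OpenAI's and Anthropic's crawlers; everyone else remains allowed
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow:
```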

This blocks OpenAI and Anthropic from crawling your content while leaving the rest open.
Important note: Blocking Google, Bing, or other major engines with Disallow: / will prevent them from crawling your entire site. If you're not absolutely sure what you're doing, don’t guess. A misconfigured file can effectively erase your site from search results.
How does this fit into our C.L.A.R.I.T.Y. framework?
Within our C.L.A.R.I.T.Y. framework, robots.txt plays a role early, but its impact carries through:
- Crawl: If bots can’t access your content, nothing else matters. We ensure LLMs and search engines can crawl the right parts of your site.
- Learn: What are these models picking up about your brand? Crawlability is step one, but accuracy comes from great content and consistent structure.
- Integrate: We help clients align their SEO and AI visibility strategies—so your brand shows up clearly across both traditional and generative search.
We audit crawl access, run bot access diagnostics, and help brands think proactively about what they want to expose and what they want to protect.
Discover the C.L.A.R.I.T.Y. framework →
Which bots to allow (and block)
| Bot Name | User-Agent | What It Crawls | tiptop Recommendation |
|---|---|---|---|
| Googlebot | Googlebot | Google search index | ✅ Allow |
| Bingbot | bingbot | Bing + Microsoft Copilot | ✅ Allow |
| GPTBot | GPTBot | OpenAI / ChatGPT | ✅ Allow |
| ClaudeBot | ClaudeBot | Anthropic / Claude | ✅ Allow |
| Google-Extended | Google-Extended | Gemini LLM training (a control token honored by Googlebot, not a separate crawler) | ✅ Allow |
| PerplexityBot | PerplexityBot | Perplexity.ai index + citations | ✅ Allow |
| CCBot | CCBot | Common Crawl (used by many LLMs) | ✅ Allow |
| Amazonbot | Amazonbot | Amazon Alexa + other product crawlers | ✅ Allow (case-by-case) |
| Applebot | Applebot | Siri + Apple services | ✅ Allow |
| Meta Agent | Meta-ExternalAgent | Facebook, Instagram, Threads previews | ✅ Allow |
| X / Twitterbot | Twitterbot | Link previews for X (formerly Twitter) | ✅ Allow |
| YouBot | YouBot | You.com assistant + generative search | ✅ Allow |
| ByteSpider | ByteSpider | TikTok / ByteDance data collection | ⚠️ Caution / Evaluate |
| AhrefsBot | AhrefsBot | SEO tool crawler | ⚠️ Optional |
| SemrushBot | SemrushBot | SEO tool crawler | ⚠️ Optional |
| AllenAI Bot | ai-crawler | AI research via Allen Institute | ✅ Allow |
| DuckDuckGo Bot | DuckDuckBot | Privacy-based search engine | ✅ Allow |
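Put together, the robots.txt that implements these recommendations can stay short, because crawlers are allowed by default unless you disallow them. A minimal sketch, assuming you only want to restrict the caution-flagged and optional crawlers from the table (your own evaluation may differ):

```
# Everything not listed below (Googlebot, bingbot, GPTBot, ClaudeBot, PerplexityBot, etc.) stays allowed
User-agent: *
Disallow:

# Example restrictions based on the table above; evaluate each for your own site
User-agent: ByteSpider
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /
```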
What about llms.txt? (Yes, it's a thing—kind of)
You may have seen chatter about llms.txt, a proposed standard intended to give large language models clearer guidance on how to access and use your content. It sounds promising, but as of now, it’s not an industry standard. Major crawlers like OpenAI’s GPTBot and Anthropic’s ClaudeBot still read robots.txt, not llms.txt.
That could change in the future. But for now, your best bet is to keep your robots.txt file tightly dialed in. That’s what the big players are listening to—and it’s where visibility decisions are happening today.
Implementation and monitoring
Here’s how to implement a smart robots.txt strategy:
- Audit your current file: Use Google Search Console or tools like Screaming Frog to verify what’s blocked.
- Test before launch: Validate your syntax with the robots.txt report in Google Search Console (the successor to the retired robots.txt Tester) or another trusted validator before you publish.
- Use the right syntax: A single typo can break your file. Stick to clear User-agent and Disallow/Allow rules.
- Segment by bot: Don’t apply blanket rules. You can allow some LLMs while blocking others.
- Revisit quarterly: New bots are emerging fast. Reevaluate every few months to ensure you’re aligned with your visibility strategy.
- Watch your crawl logs: Use server log data to monitor which bots are hitting your site and what they’re accessing (see the sketch after this list for a starting point).
- Check your impact: Tools like Perplexity’s “Sources” tab or ChatGPT’s web-browsing citations can help confirm if your brand is showing up.
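For the log-monitoring step, here is a minimal Python sketch. It assumes a standard combined-format access log at a hypothetical path (access.log) and simply counts requests whose User-Agent string contains one of the bot tokens you care about; adapt the token list and parsing to your own stack:

```python
from collections import Counter

# User-agent substrings to watch for (extend to match your own allow/block list)
BOT_TOKENS = ["Googlebot", "bingbot", "GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "ByteSpider"]

def count_bot_hits(log_path: str) -> Counter:
    """Count requests per bot, based on User-Agent substrings in a combined-format access log."""
    hits = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            # In the combined log format, the User-Agent is the last quoted field on the line
            user_agent = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
            for token in BOT_TOKENS:
                if token.lower() in user_agent.lower():
                    hits[token] += 1
                    break
    return hits

if __name__ == "__main__":
    for bot, count in count_bot_hits("access.log").most_common():
        print(f"{bot}: {count} requests")
```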
How discoverable is your brand on AI? Let us run an AI audit. Let’s talk →
A note of caution: don’t guess with robots.txt
It’s deceptively simple—but extremely powerful. One wrong line can block your site from Google entirely. If you're not confident editing robots.txt:
- Involve your SEO partner or technical team
- Use verified tools for testing
- Always keep backups of prior versions
Your robots.txt file is a strategic visibility layer
Robots.txt has always mattered. But now, it’s part of a larger strategic play for visibility across AI tools, search engines, and assistants.
Done right, it helps ensure your brand is not only crawlable, but cited, learned from, and trusted by the systems shaping the next generation of search.
Let’s make sure you’re open to the right bots—and closed to the rest.
Need help with implementation or guidance on which bots align with your goals? Let’s talk.