JBAI Insider

AI Crawler Behavior in 2026: GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended

Quick answer

Six AI-relevant crawlers operate in 2026: GPTBot (OpenAI training), OAI-SearchBot (ChatGPT Search index), ChatGPT-User (live ChatGPT response fetches), ClaudeBot (Anthropic training), PerplexityBot (Perplexity index + live), and Google-Extended (Google's training opt-out). The critical distinction is training crawlers (your content trains future models, no traffic in return) versus live-retrieval crawlers (your content drives citations and referral traffic). Most sites should allow OAI-SearchBot, ChatGPT-User, and PerplexityBot at minimum; decide separately about the training crawlers based on whether you want to contribute to model training.

Robots.txt has become the single most consequential file on a content site. The lines you write there in 2026 determine whether your content trains future AI models, whether it appears in AI Overview citations, and whether the AI revolution happens with or without your fingerprints on it.

The good news: the major AI crawlers are well-documented, named consistently in their user-agent strings, and they respect robots.txt. The bad news: there are now enough of them, with different purposes, that 'block AI bots' is no longer a coherent policy. You need a per-crawler decision matrix.

The six crawlers that matter

Crawler          Operator    Purpose                 User-agent token
GPTBot           OpenAI      Training data           GPTBot
OAI-SearchBot    OpenAI      ChatGPT Search index    OAI-SearchBot
ChatGPT-User     OpenAI      Live response fetches   ChatGPT-User
ClaudeBot        Anthropic   Training data           ClaudeBot
PerplexityBot    Perplexity  Index + live retrieval  PerplexityBot
Google-Extended  Google      Training opt-out token  Google-Extended

Training versus live retrieval

The most important distinction is whether a crawler is collecting training data or supporting a live product.

Training crawlers (GPTBot, ClaudeBot, Google-Extended): your content gets baked into future model weights. You don't get direct traffic back. The benefit is indirect (your ideas, framings, and brand may surface in model outputs, often without attribution).

Live-retrieval crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot): your content is fetched live to answer user queries, with citation links back to your URL. You get direct traffic (small but real) and citation visibility in the AI answer.

If you optimize for one decision: allow the live-retrieval crawlers. Blocking them blocks AI citation. Blocking the training crawlers is a values decision that doesn't materially affect your citation rate.
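The training versus live-retrieval split can be written down as a small lookup table. The following is a minimal Python sketch, not a library API: the names `CRAWLER_ROLES` and `blocking_consequences` are made up for illustration, and the role labels simply restate the classification given in this section.

```python
# Role of each crawler, restating the classification in this section.
CRAWLER_ROLES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "OAI-SearchBot": "live",
    "ChatGPT-User": "live",
    "PerplexityBot": "live",
}

def blocking_consequences(blocked):
    """Summarize what a set of blocked crawlers gives up (hypothetical helper)."""
    unknown = sorted(set(blocked) - set(CRAWLER_ROLES))
    if unknown:
        raise ValueError(f"unknown crawler(s): {unknown}")
    return {
        "opted_out_of_training": sorted(
            c for c in blocked if CRAWLER_ROLES[c] == "training"
        ),
        "citation_sources_lost": sorted(
            c for c in blocked if CRAWLER_ROLES[c] == "live"
        ),
    }

result = blocking_consequences({"GPTBot", "ClaudeBot", "Google-Extended"})
print(result["citation_sources_lost"])  # [] : no live-retrieval crawler is blocked
```

Blocking only the three training tokens leaves the citation list empty, which is the "values decision" case described above.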

robots.txt patterns

The cleanest pattern in 2026, suitable for most content sites that want AI citation but no training contribution:

User-agent: *
Allow: /

# Allow live-retrieval crawlers (drives AI citations)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training crawlers (opt out of model training)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml

If you want full participation (training and citation), drop the three Disallow directives; the wildcard Allow then covers those crawlers. If you want zero AI involvement, set the rule in all six AI crawler blocks to Disallow: /.
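Before deploying a robots.txt like the one above, it can help to check mechanically that each crawler gets the decision you intended. The sketch below uses Python's standard `urllib.robotparser`; the helper name `allowed_crawlers` is illustrative, and the result reflects how Python's parser reads the file, which is a reasonable proxy for, but not a guarantee of, how each crawler interprets it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

def allowed_crawlers(robots_txt, crawlers, url="https://yourdomain.com/"):
    """Map each crawler token to True (allowed) or False (blocked) for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, url) for name in crawlers}

policy = allowed_crawlers(
    ROBOTS_TXT,
    ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
     "GPTBot", "ClaudeBot", "Google-Extended", "Googlebot"],
)
for name, ok in policy.items():
    print(f"{name:16} {'allow' if ok else 'block'}")
```

Googlebot is included in the check on purpose: it should fall through to the wildcard block and stay allowed.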

Real log data from our portfolio

We run a daily tracker (see pillar 5 for details) that counts OpenAI bot hits across our three sites. The May 2026 baseline:

The pattern: OAI-SearchBot crawls steadily, GPTBot crawls in bursts (sometimes 40+ hits in a single day, then nothing for a week), ChatGPT-User correlates directly with user activity (someone sharing a URL or asking ChatGPT to read your page).

What happens when you allow versus block

We can't run a clean A/B test on a single domain (you can't simultaneously allow and block the same crawler), but the public record of sites that switched policies in 2024-2025 is consistent:

The clearest result: blocking the live-retrieval crawlers always blocks the citation revenue stream. Blocking the training crawlers has no measurable downside other than the philosophical choice not to contribute to training.

Apache and nginx configuration examples

You can also block crawlers at the web server level (in addition to robots.txt, for crawlers that ignore robots.txt). Apache example:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Bytespider|MJ12bot|AhrefsBot|SemrushBot|DotBot|PetalBot) [NC]
RewriteRule .* - [F,L]

This blocks bots that consume crawl budget without producing citation value. Notice the list does NOT include any AI-engine crawlers; those should be managed via robots.txt only, not blocked at the server level.

User-agent string parsing patterns

The user-agent strings AI crawlers send are stable enough to grep reliably. The patterns we use in our log analyzers:

GPTBot          -> contains "GPTBot"
OAI-SearchBot   -> contains "OAI-SearchBot"
ChatGPT-User    -> contains "ChatGPT-User"
ClaudeBot       -> contains "ClaudeBot"
PerplexityBot   -> contains "PerplexityBot"
Google-Extended -> robots.txt token only (Google documents no separate
                   HTTP user-agent string for it, so expect no log matches)

The full user-agent strings include version numbers and contact URLs (e.g., "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"). Match on the bot name token, not the full string, so version updates don't break your parser.
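Token matching of this kind takes only a few lines. The following Python sketch is one possible shape, not the analyzer used for this article: `AI_CRAWLER_TOKENS` and `classify_user_agent` are illustrative names, and the case-insensitive comparison is an added assumption. Google-Extended is left out because it is a robots.txt token rather than something to look for in access logs.

```python
# Bot name tokens from the list above, checked in order.
AI_CRAWLER_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
)

def classify_user_agent(user_agent):
    """Return the first matching crawler token, or None for other traffic."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None

sample = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
          "compatible; GPTBot/1.3; +https://openai.com/gptbot)")
print(classify_user_agent(sample))                 # GPTBot
print(classify_user_agent("Mozilla/5.0 Firefox"))  # None
```

Because only the name token is compared, a version bump in the user-agent string does not change the result.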

Identifying bot IP ranges

User-agent strings are easy to spoof, so bot operators also publish their IP ranges, which let you verify that a request claiming to be GPTBot actually came from OpenAI's infrastructure. OpenAI publishes its bot IPs at openai.com/gptbot.json (with similar files for OAI-SearchBot and ChatGPT-User). Anthropic and Perplexity publish IP lists for their bots as well.

Practical verification: for high-stakes decisions (blocking, rate-limiting, alerting), cross-check the IP against the published list. For routine log analysis, trusting the user-agent string is usually fine because spoofing happens but doesn't dominate traffic.

Robots.txt mistakes that backfire

Three common errors we see when reviewing robots.txt files:

Blocking all bots with one line. A blanket `User-agent: * / Disallow: /` blocks Googlebot too, which kills the site's traditional SEO entirely. AI crawlers should be addressed by named user-agent, not by wildcard.

Confusing the AI crawlers. Disallowing GPTBot does nothing to OAI-SearchBot or ChatGPT-User (they're different crawlers despite all coming from OpenAI). Sites that thought they were blocking ChatGPT entirely often still allow OAI-SearchBot, which is fine but worth knowing.

Disallowing without a clear policy. Adding a Disallow line during a panic without a policy decision leads to inconsistent state and confused product decisions later. Write the policy first ("we allow live-retrieval crawlers, block training crawlers") and let robots.txt follow.
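One way to make robots.txt follow the policy is to generate the file from the written decision instead of editing it by hand. The sketch below assumes a policy expressed as a plain mapping from crawler token to "allow" or "block"; `render_robots_txt` and `POLICY` are hypothetical names, and the output covers only site-wide rules.

```python
def render_robots_txt(policy, sitemap_url=None):
    """Render robots.txt text from a {crawler_token: "allow" | "block"} mapping."""
    # Wildcard block first, so unnamed crawlers (including Googlebot) stay allowed.
    lines = ["User-agent: *", "Allow: /", ""]
    for token, decision in policy.items():
        if decision not in ("allow", "block"):
            raise ValueError(f"{token}: expected 'allow' or 'block', got {decision!r}")
        rule = "Allow: /" if decision == "allow" else "Disallow: /"
        lines += [f"User-agent: {token}", rule, ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip() + "\n"

# "We allow live-retrieval crawlers, block training crawlers."
POLICY = {
    "OAI-SearchBot": "allow",
    "ChatGPT-User": "allow",
    "PerplexityBot": "allow",
    "GPTBot": "block",
    "ClaudeBot": "block",
    "Google-Extended": "block",
}

print(render_robots_txt(POLICY, "https://yourdomain.com/sitemap.xml"))
```

Keeping the mapping in version control gives the policy a single source of truth, and a panic edit then shows up as a reviewed change to `POLICY`.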

The partial-opt-out pattern

You can opt out per path, not just site-wide. This is useful when you want most content trained on but specific sections excluded (e.g., paid content, members-only content, internal documentation):

User-agent: GPTBot
Disallow: /members/
Disallow: /paid/
Disallow: /api/

User-agent: ClaudeBot
Disallow: /members/
Disallow: /paid/
Disallow: /api/

User-agent: *
Allow: /

The partial-opt-out preserves your AI citation eligibility on public content while keeping gated content out of training data. Worth the few minutes of robots.txt config.

Why crawler activity is the leading indicator

Citation visibility and referrer traffic are lagging indicators: by the time you see them in analytics, the decision the AI engine made (to cite or not cite) happened weeks ago. Crawler activity is the leading indicator: when OAI-SearchBot starts fetching a page weekly instead of monthly, that page is moving up in ChatGPT Search's eligibility pool.

We track per-page OpenAI bot hits daily across our portfolio (see pillar 4 for the methodology). The cost is a 100-line Python script and a cron entry. The signal is real and arrives weeks before any other GEO metric moves.
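A stripped-down version of such a counter might look like the sketch below. It is not the tracker described above: it assumes the Apache/nginx "combined" log format, and `LOG_RE`, `count_bot_hits`, and the sample log lines (built from a documentation-only IP address) are illustrative.

```python
import re
from collections import Counter

# combined format: ip - user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
OPENAI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User")

def count_bot_hits(log_lines):
    """Count hits per (day, bot, path) for OpenAI crawlers."""
    hits = Counter()
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed or non-combined lines
        for bot in OPENAI_BOTS:
            if bot in m["ua"]:
                hits[(m["day"], bot, m["path"])] += 1
                break
    return hits

def _line(day, path, ua):
    # Synthetic log line for the demo; not real traffic.
    return (f'203.0.113.7 - - [{day}:06:25:24 +0000] '
            f'"GET {path} HTTP/1.1" 200 512 "-" "{ua}"')

SAMPLE = [
    _line("12/May/2026", "/pricing", "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"),
    _line("12/May/2026", "/pricing", "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"),
    _line("12/May/2026", "/blog/a", "Mozilla/5.0 (compatible; GPTBot/1.3)"),
    _line("13/May/2026", "/pricing", "Mozilla/5.0 (X11; Linux) Firefox/126.0"),
]
hits = count_bot_hits(SAMPLE)
print(hits[("12/May/2026", "OAI-SearchBot", "/pricing")])  # 2
```

Run from cron against the previous day's log, the resulting counts can be appended to a CSV and compared week over week.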

Verifying AI crawler identity at the application layer

User-agent strings can be spoofed; IP ranges typically cannot. For high-stakes routing decisions (e.g., serving different content to verified AI crawlers, or rate-limiting suspected impostors), verify the source IP against the bot operator's published IP list.

Python pattern for verifying a request claiming to be GPTBot actually came from OpenAI's published IP range:

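import ipaddress
import time

import requests

_OPENAI_RANGES = None  # cached list of ip_network objects
_LAST_FETCH = 0.0      # epoch seconds of the last successful fetch

def is_real_gptbot(ip, user_agent):
    """Return True if the request claims GPTBot and its IP is in OpenAI's published ranges."""
    global _OPENAI_RANGES, _LAST_FETCH
    if 'GPTBot' not in user_agent:
        return False
    # Refresh the published IP list weekly (and on first use)
    if _OPENAI_RANGES is None or time.time() - _LAST_FETCH > 7 * 86400:
        r = requests.get('https://openai.com/gptbot.json', timeout=5)
        r.raise_for_status()  # don't cache an error page as the IP list
        # Skip any entry without an ipv4Prefix key instead of raising KeyError
        _OPENAI_RANGES = [
            ipaddress.ip_network(p['ipv4Prefix'])
            for p in r.json()['prefixes'] if 'ipv4Prefix' in p
        ]
        _LAST_FETCH = time.time()
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in _OPENAI_RANGES)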

The same pattern works for OAI-SearchBot (openai.com/searchbot.json), ChatGPT-User (openai.com/chatgpt-user.json), and equivalent endpoints for Anthropic and Perplexity. For most content sites the trust assumption (user-agent string is honest) is fine; for sites where serving differential content matters, IP verification closes the spoof gap.
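To cover several bots with one function, the endpoint can be looked up by bot name and the HTTP fetch passed in by the caller. The sketch below is one way to arrange that, assuming the three OpenAI files share the `prefixes` / `ipv4Prefix` layout used in the function above; `PUBLISHED_IP_LISTS`, `parse_prefixes`, `verify_bot`, and `fake_fetch` are illustrative names, and the demo data is a documentation-only range, not OpenAI's real prefixes.

```python
import ipaddress

# Endpoints named in this article; other operators' URLs are not filled in here.
PUBLISHED_IP_LISTS = {
    "GPTBot": "https://openai.com/gptbot.json",
    "OAI-SearchBot": "https://openai.com/searchbot.json",
    "ChatGPT-User": "https://openai.com/chatgpt-user.json",
}

def parse_prefixes(doc):
    """Turn a published {"prefixes": [{"ipv4Prefix": ...}]} document into networks."""
    return [ipaddress.ip_network(p["ipv4Prefix"])
            for p in doc.get("prefixes", []) if "ipv4Prefix" in p]

def verify_bot(ip, user_agent, fetch_json):
    """Return the verified bot name, or None if the claim cannot be confirmed.

    fetch_json is a caller-supplied function (URL -> parsed JSON), so caching
    and HTTP details stay outside this sketch.
    """
    addr = ipaddress.ip_address(ip)
    for bot, url in PUBLISHED_IP_LISTS.items():
        if bot in user_agent:
            networks = parse_prefixes(fetch_json(url))
            return bot if any(addr in net for net in networks) else None
    return None

def fake_fetch(url):
    # Stand-in data for the demo: a documentation-only range.
    return {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}]}

print(verify_bot("192.0.2.10", "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", fake_fetch))  # OAI-SearchBot
print(verify_bot("198.51.100.5", "Mozilla/5.0 (compatible; GPTBot/1.3)", fake_fetch))       # None
```

In production, `fetch_json` would wrap a cached HTTP request like the weekly refresh shown earlier; injecting it keeps the verification logic testable without network access.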

Rate-limiting AI crawlers without blocking them

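When an AI crawler is consuming disproportionate server resources, blocking it entirely is the wrong reaction. The better pattern is rate-limiting: serve the crawler, but cap requests per second to a sustainable rate. nginx's limit_req module supports this keyed on user-agent (map the user-agent to a limit key). Apache's mod_ratelimit throttles response bandwidth rather than request rate, so a requests-per-second cap on Apache usually needs a different module or a proxy in front of it.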

Realistic per-crawler rate caps for a modest content site (1 vCPU, 1GB RAM):

These are starting points; tune based on your server capacity and observed crawler patterns. The goal is steady serving without 503 errors. Crawlers handle 429 (rate limited) responses gracefully and back off; they treat 503 as a transient failure and retry, which compounds load.
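If the cap has to live in application code rather than in the web server, a token bucket per crawler is a common way to do it. The sketch below is a minimal in-process version under stated assumptions: `TokenBucket`, `LIMITS`, and `status_for` are illustrative names, the numbers in `LIMITS` are placeholders rather than recommended caps, and a multi-process deployment would need shared state that this example does not provide.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens = float(burst)
        self.updated = now()

    def allow(self):
        current = self.now()
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (current - self.updated) * self.rate)
        self.updated = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Placeholder caps as (requests per second, burst); tune for your own server.
LIMITS = {"GPTBot": (1, 5), "OAI-SearchBot": (2, 10)}
_buckets = {}

def status_for(user_agent):
    """Return 200 to serve the request or 429 to ask the crawler to back off."""
    for bot, (rate, burst) in LIMITS.items():
        if bot in user_agent:
            if bot not in _buckets:
                _buckets[bot] = TokenBucket(rate, burst)
            return 200 if _buckets[bot].allow() else 429
    return 200  # traffic from other clients is not limited here

# Deterministic demo with a fake clock: burst of 2, refill of 1 token per second.
clock = [0.0]
demo = TokenBucket(rate=1, burst=2, now=lambda: clock[0])
print([demo.allow(), demo.allow(), demo.allow()])  # [True, True, False]
clock[0] = 1.0
print(demo.allow())                                # True
```

Returning 429 rather than 503 follows the point above: the crawler is told explicitly that it is being rate limited.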

Logging conventions for AI crawler analysis

Standard Apache combined log format works for AI crawler tracking but loses some signal. Recommended additions:

For sites running multiple domains on one server, log each domain to a separate file. Mixing crawler activity across domains in one file makes per-domain analysis painful.

When a crawler suddenly disappears

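Crawler traffic drops are a leading indicator of trouble. If OAI-SearchBot was visiting a site at 20 hits/day for months and suddenly drops to zero, the cause is almost always one of: a robots.txt change or typo that now blocks the crawler, a site availability problem, a mass change to the site's content, or (rarely) a policy change by the bot operator.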

The fix order: check robots.txt for typos first, check site status next, check content for mass changes third. The bot operator's policy change is rare enough to be the last hypothesis.
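Spotting the drop early is easy to automate once daily counts exist. The following is a minimal sketch that compares a short recent window against a longer baseline; `crawler_dropped` is an illustrative name, and the window sizes and the 10% threshold are assumptions, not tuned values.

```python
from statistics import mean

def crawler_dropped(daily_hits, baseline_days=28, recent_days=3, ratio=0.1):
    """Flag a drop when the recent average falls below `ratio` of the baseline average.

    daily_hits is a list of per-day hit counts, oldest first.
    """
    if len(daily_hits) < baseline_days + recent_days:
        return False  # not enough history to judge
    recent = daily_hits[-recent_days:]
    baseline = daily_hits[-(baseline_days + recent_days):-recent_days]
    base_avg = mean(baseline)
    return base_avg > 0 and mean(recent) < ratio * base_avg

history = [20] * 28 + [0, 0, 0]  # 20 hits/day for four weeks, then silence
print(crawler_dropped(history))  # True
```

A multi-day recent window keeps bursty crawlers such as GPTBot from triggering an alert after a single quiet day.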

Practical action items for getting your crawler policy right

A practical 30-minute audit you can run today on any content site to make sure your AI crawler policy reflects your actual intent:

  1. Pull your current robots.txt and identify every user-agent block that affects an AI crawler. Document what each block actually does (block training, block live retrieval, block both).
  2. Compare what your robots.txt does against what you intend it to do. The most common gap: people who think they blocked ChatGPT have only blocked GPTBot and are still indexed by OAI-SearchBot. Decide which is correct for you.
  3. Pull 28 days of Apache or nginx logs and count user-agent appearances for each of the six AI crawlers. Compare against expected behavior given your robots.txt. Gaps point to either spoofed traffic or robots.txt rules that aren't taking effect.
  4. Check the bot operator IP lists (openai.com/gptbot.json and equivalent) against a sample of bot requests from your logs. Verify the user-agent claims match the source IPs.
  5. Document the resulting policy in a short internal note: which crawlers are allowed, which are blocked, why, and when this was last reviewed. Repeat the review every six months, because the AI crawler landscape moves quickly.

The audit takes about 30 minutes and surfaces most of the operational gaps that quietly compound into citation visibility issues.
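Step 3 of the audit (comparing log activity against what robots.txt says) can be partly automated. The sketch below counts requests from crawlers that robots.txt disallows for the requested path; `audit_gaps` and `AI_CRAWLERS` are illustrative names, the input is assumed to be (path, user-agent) pairs already extracted from access logs, and the robots.txt reading is Python's `urllib.robotparser` interpretation.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

def audit_gaps(robots_txt, requests_seen):
    """Count log requests made by crawlers that robots.txt blocks for that path.

    requests_seen: iterable of (path, user_agent) pairs taken from access logs.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    gaps = Counter()
    for path, ua in requests_seen:
        if path == "/robots.txt":
            continue  # blocked crawlers still fetch robots.txt itself
        for bot in AI_CRAWLERS:
            if bot in ua and not parser.can_fetch(bot, path):
                gaps[bot] += 1
                break
    return dict(gaps)

ROBOTS = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
SEEN = [
    ("/a", "Mozilla/5.0 (compatible; GPTBot/1.3)"),
    ("/robots.txt", "Mozilla/5.0 (compatible; GPTBot/1.3)"),
    ("/a", "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"),
    ("/b", "Mozilla/5.0 (compatible; GPTBot/1.3)"),
]
print(audit_gaps(ROBOTS, SEEN))  # {'GPTBot': 2}
```

Any non-empty result is worth following up with the IP check in step 4, since a blocked crawler that keeps appearing is either spoofed traffic or a rule that is not taking effect.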


FAQ

Should I block GPTBot?

It depends on your view of training. Blocking GPTBot stops OpenAI from training future models on your content but does NOT block ChatGPT Search from retrieving and citing you (that's OAI-SearchBot, a separate crawler). Most publishers who want broader cultural reach allow GPTBot; those who object to AI training generally block it.

Will my content end up training Claude if I allow ClaudeBot?

Yes, that's the explicit purpose of ClaudeBot. Anthropic uses it for training data collection. Block ClaudeBot in robots.txt if you want to opt out of Anthropic training while still allowing Claude's live web access (which uses different infrastructure).

What's the difference between OAI-SearchBot and ChatGPT-User?

OAI-SearchBot crawls the web to build ChatGPT Search's index; it visits pages at scale on its own schedule. ChatGPT-User fetches specific pages in response to user prompts. Both should generally be allowed because both drive citations.

Does blocking AI crawlers hurt my regular Google rankings?

No. The major AI crawlers are operationally separate from Googlebot. Google-Extended is Google's training opt-out token rather than a separate crawler, and blocking it does not affect Google Search ranking.


May 26, 2026