Help AI find your pages: sitemap, canonical, Link headers

No special sitemap for AI crawlers exists. Use sitemap.xml, self-referencing canonicals, and RFC 8288 Link headers, then verify with a free 0–100 scan.

By IMozz, updated 2026-06-12

A dedicated "sitemap for AI crawlers" does not exist — and you don't need one. Nearly everything that helps an AI system find and correctly interpret your pages already lives in the classic web-discovery layer: crawlable internal links, a correct sitemap.xml, one unambiguous canonical per page, and a solid <title> and meta description. Where HTML can't carry those signals, Link headers do. Google's guidance on generative-AI features says it plainly: standard SEO practices stay relevant for AI surfaces, and no special AI-only files or magic markup are required (Google).

That's good news, not bad. In 2026, AI visibility usually breaks not because some new AI protocol is missing, but because old web signals are ambiguous. OpenAI, Anthropic, and Perplexity all document crawler access through robots.txt, user-agents, and IP allowlists — not through an AI-specific sitemap standard. So the real question isn't "how do I please the models?" It's the far more tractable "how do I remove ambiguity from discovery and canonicalization?"

Key takeaways

  • There's no AI-specific sitemap standard. Google says standard SEO covers its generative-AI features; OpenAI, Anthropic, and Perplexity manage access through robots.txt and user-agents.
  • Crawlers find URLs two ways: links and sitemaps. An important page needs both — a real <a href> link and a sitemap entry with its canonical URL.
  • One sitemap holds at most 50,000 URLs / 50 MB uncompressed; reference it from robots.txt with the Sitemap: directive. An honest lastmod helps; Google ignores changefreq and priority.
  • Every indexable page gets one self-referencing canonical with an absolute URL — and appears in the sitemap under that exact URL. Conflicting signals let the engine pick a canonical for you.
  • Link headers (RFC 8288) declare canonical and hreflang at the HTTP level — ideal for PDFs. After you ship fixes, scan your domain 0–100 to confirm what crawlers actually see.

How do AI and search crawlers discover URLs?

The web has no central registry of pages, so discovery always comes down to two paths. A bot either finds a URL through links on pages it already knows, or it receives a list of URLs in a sitemap. Google describes how its search works in exactly those terms (Google), and AI crawlers inherit the same mechanics.

The first practical consequence: a sitemap complements internal linking — it doesn't replace it. For Google, a crawlable link is specifically an HTML <a> element with an href; links exposed only through script events or non-standard elements are extracted unreliably, if at all (Google).
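To make the distinction concrete, here is a minimal Python sketch that separates crawlable anchors from script-only navigation. It uses only the standard library; the LinkAudit class and the sample markup are illustrative assumptions, and real crawlers apply richer extraction rules than this.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Collects <a href> links and flags click handlers that lack one."""

    def __init__(self):
        super().__init__()
        self.crawlable = []    # href values a crawler can follow
        self.script_only = []  # tag names that navigate only via onclick

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href")
        if tag == "a" and href and not href.lower().startswith("javascript:"):
            self.crawlable.append(href)
        elif "onclick" in attr:
            self.script_only.append(tag)


# Hypothetical navigation markup: one real link, two script-driven items.
sample = """
<nav>
  <a href="/docs">Docs</a>
  <span onclick="go('/pricing')">Pricing</span>
  <a onclick="go('/blog')">Blog</a>
</nav>
"""

audit = LinkAudit()
audit.feed(sample)
print(audit.crawlable)    # ['/docs']
print(audit.script_only)  # ['span', 'a']
```

Anything that lands in script_only is a candidate for conversion to a plain <a href> link.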

Figure: Two discovery paths feed the same crawler: crawlable anchor links on known pages, and the URL list in sitemap.xml referenced from robots.txt. An important page should be reachable through both.

The AI vendors' own docs confirm the route. OpenAI recommends allowing OAI-SearchBot in robots.txt so a site can appear in ChatGPT search (OpenAI); Anthropic documents Claude-SearchBot, ClaudeBot, and Claude-User managed through ordinary robots rules (Anthropic). Today's practical route to AI discovery runs through existing crawling infrastructure. Which bots you then allow or block is a governance question — that's our robots.txt guide.
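A quick way to see what your robots rules say to these bots is Python's built-in robots.txt parser. The sketch below is an approximation under stated assumptions: the robots.txt content and the test URL are made up, and Python's user-agent matching may differ in detail from how each vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks ClaudeBot, leaves everything else open
# except an admin area.
robots_txt = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

bots = ["OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "Claude-User"]
url = "https://example.com/blog/post"
access = {bot: parser.can_fetch(bot, url) for bot in bots}

for bot, allowed in access.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
# OAI-SearchBot: allowed
# ClaudeBot: blocked
# Claude-SearchBot: allowed
# Claude-User: allowed
```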

How do you make sitemap.xml work for AI crawlers?

A sitemap is a cheap crawl manifest, and its format is fully specified. Per the Sitemaps protocol, the file must be UTF-8, the root element is <urlset>, every URL needs a <loc>, and all URLs must belong to a single host (sitemaps.org). One file holds at most 50,000 URLs and 50 MB uncompressed — gzip is fine, but the decompressed size still counts. Bigger sites use a sitemap index, which can itself list up to 50,000 child sitemaps (Google).
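The limits above translate directly into a generator. The following is a sketch, not a production exporter: the function names, the example URLs, and the 120,000-page site are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000             # per-file URL limit
MAX_BYTES = 50 * 1024 * 1024  # per-file size limit, measured uncompressed


def chunk(entries, size=MAX_URLS):
    """Split a long entry list into sitemap-sized groups."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def build_urlset(entries):
    """Serialize (loc, lastmod) pairs into one UTF-8 <urlset> document."""
    if len(entries) > MAX_URLS:
        raise ValueError("more than 50,000 URLs; split into several sitemaps")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    data = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    if len(data) > MAX_BYTES:
        raise ValueError("sitemap exceeds 50 MB uncompressed; split it further")
    return data


# Hypothetical site with 120,000 pages on a single host.
pages = [(f"https://example.com/docs/page-{i}", "2026-06-10") for i in range(120_000)]
files = [build_urlset(group) for group in chunk(pages)]
print(len(files))  # 3
```

Each element of files would be written to its own child sitemap and listed in a sitemap index.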

The most underused practice costs one line: reference the sitemap from robots.txt. The Sitemap: directive is independent of any User-agent block, may appear anywhere in the file, and can be repeated. If you publish an index, listing just the index is enough.

# robots.txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap_index.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap_index.xml -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/docs.xml.gz</loc>
    <lastmod>2026-06-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml.gz</loc>
    <lastmod>2026-06-11</lastmod>
  </sitemap>
</sitemapindex>
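To confirm the chain from robots.txt to the index to the child sitemaps, a small Python check is enough. This sketch parses the two example files above from inline strings; a real check would fetch them over HTTP, which is left out here. RobotFileParser.site_maps() requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

robots_txt = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap_index.xml
"""

index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemaps/docs.xml.gz</loc>
    <lastmod>2026-06-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemaps/blog.xml.gz</loc>
    <lastmod>2026-06-11</lastmod>
  </sitemap>
</sitemapindex>
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
sitemaps = parser.site_maps() or []  # site_maps() returns None if no directive exists
print(sitemaps)  # ['https://example.com/sitemap_index.xml']

root = ET.fromstring(index_xml)
children = [
    (node.findtext("sm:loc", namespaces=NS), node.findtext("sm:lastmod", namespaces=NS))
    for node in root.findall("sm:sitemap", NS)
]
for loc, lastmod in children:
    print(loc, lastmod)
```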

What goes into the file matters more than the syntax. Google is direct: list the URLs you'd prefer to see in results — your canonicals — not every variant of the same page. A sitemap is not a dumping ground for parameter duplicates, session IDs, and UTM variants. One entity, one preferred URL.
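One way to enforce "one entity, one preferred URL" is to normalize URLs before they reach the sitemap. The sketch below is one possible approach, not a rule from any spec: the NOISE_PARAMS list and the sample URLs are assumptions, and your site may have parameters that do change content and must be kept.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that never change what the page shows.
NOISE_PARAMS = {"sessionid", "sid", "fbclid", "gclid"}


def preferred_url(url):
    """Drop tracking and session parameters, the fragment, and host casing."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS and not key.lower().startswith("utm_")
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


crawled = [
    "https://example.com/guide?utm_source=news",
    "https://example.com/guide?sessionid=abc123",
    "https://Example.com/guide#intro",
    "https://example.com/guide",
    "https://example.com/docs?page=2&utm_campaign=x",
]

sitemap_urls = sorted({preferred_url(u) for u in crawled})
print(sitemap_urls)
# ['https://example.com/docs?page=2', 'https://example.com/guide']
```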

Two freshness notes. The old HTTP "ping" endpoint was retired in 2023 — submit through robots.txt, Search Console, or its API instead (Google). And Google reaffirmed that an honest lastmod is the element that actually matters, while changefreq and priority are ignored entirely. lastmod works at the index level too: per sitemaps.org, it lets a crawler re-fetch only the child sitemaps that changed since its last visit. Note the wording — a crawler may do this; nothing in the protocol is a guarantee.
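An honest lastmod can be approximated by changing the date only when the content itself changes. The sketch below hashes the page body and compares it with the previous build; the next_lastmod helper and its record shape are hypothetical, and a content hash is only a stand-in for "significant change" (hash the main content, not the whole template, or every footer tweak will bump the date).

```python
import hashlib
from datetime import date


def next_lastmod(previous, body, today):
    """Return the new {'hash', 'lastmod'} record for a page.

    previous is the record from the last build, or None for a new page.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if previous and previous["hash"] == digest:
        # Content unchanged: keep the old date even though we redeployed.
        return {"hash": digest, "lastmod": previous["lastmod"]}
    return {"hash": digest, "lastmod": today.isoformat()}


first = next_lastmod(None, "Original article text", date(2026, 6, 10))
redeploy = next_lastmod(first, "Original article text", date(2026, 6, 12))
edited = next_lastmod(first, "Revised article text", date(2026, 6, 12))

print(first["lastmod"])     # 2026-06-10
print(redeploy["lastmod"])  # 2026-06-10
print(edited["lastmod"])    # 2026-06-12
```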

Why do canonicals, titles, and descriptions matter for AI?

If a sitemap answers "which URLs do you have?", a canonical answers "which URL is the source of truth?" Google ranks its explicit canonicalization signals by strength, and the signals stack (Google):

| Signal | Strength | Typical use |
|---|---|---|
| 301/308 permanent redirect | Strong | Retiring a duplicate URL for good |
| rel="canonical" (HTML or Link header) | Strong | Duplicates that must stay reachable |
| Inclusion in the sitemap | Weaker, still useful | Reinforcing the preferred URL |

In practice that collapses into one rule: every indexable page carries a self-referencing canonical with an absolute URL, and appears in the sitemap under that same URL. RFC 6596 explicitly permits a self-referential canonical (RFC 6596). Google recommends one on indexable pages — with absolute URLs, so a staging domain or dev mirror never leaks into the tag (Google).

Figure: Parameter variants, session IDs, and mirrors all collapse to one canonical URL. The sitemap entry, the rel="canonical" tag, and the Link header must all point at that same URL.

Know what a canonical is not. It shouldn't point at a URL fragment, and it shouldn't stand in for a real redirect when you're actually retiring duplicates. Per RFC 6596, the target must be a duplicate or a content superset of the referring page. Above all, don't send conflicting signals — one URL in the sitemap, another in rel="canonical", a third in an HTTP header. If your signals disagree, Google may pick a different canonical on its own.

Canonicals need a companion layer: plain metadata. Google assembles a result's title link from the <title> element, visible headings, og:title, and anchor text (Google). Snippets come mostly from page content, with the meta description used when it describes the page better. And Google's generative-AI guide adds the gate that matters here: a page must be indexed and snippet-eligible to qualify for AI features at all. So title and description aren't legacy decoration — they're the cheapest machine-readable summary you can ship.

<head>
  <title>How to configure a sitemap for AI crawlers</title>
  <meta
    name="description"
    content="A practical guide to sitemap.xml, canonicals, and Link headers for AI search and classic crawlers."
  />
  <link rel="canonical" href="https://example.com/blog/ai-crawler-sitemap" />
</head>
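The canonical rules above are easy to check mechanically. This Python sketch audits a page's HTML against its own URL and the sitemap; the CanonicalFinder class, the canonical_problems function, and the sample markup are illustrative assumptions, and the checks cover only the rules discussed in this section.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalFinder(HTMLParser):
    """Collects every rel="canonical" href and whether it sits inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # list of (href, inside_head) tuples

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "link" and "canonical" in (attr.get("rel") or "").lower().split():
            self.found.append((attr.get("href") or "", self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def canonical_problems(html, page_url, sitemap_urls):
    """Return a list of human-readable issues; an empty list means clean."""
    finder = CanonicalFinder()
    finder.feed(html)
    if len(finder.found) != 1:
        return [f"expected exactly one canonical, found {len(finder.found)}"]
    href, inside_head = finder.found[0]
    parts = urlsplit(href)
    problems = []
    if not inside_head:
        problems.append("canonical is outside <head>")
    if not (parts.scheme and parts.netloc):
        problems.append("canonical is not an absolute URL")
    if parts.fragment:
        problems.append("canonical points at a fragment")
    if href != page_url:
        problems.append("canonical is not self-referencing")
    if href not in sitemap_urls:
        problems.append("canonical is missing from the sitemap")
    return problems


page = "https://example.com/blog/ai-crawler-sitemap"
good = f'<head><link rel="canonical" href="{page}" /></head>'
bad = '<head></head><body><link rel="canonical" href="/blog/ai-crawler-sitemap"></body>'

print(canonical_problems(good, page, {page}))  # []
print(canonical_problems(bad, page, {page}))
```

The second call reports four issues for the same page: the tag sits in the body, the URL is relative, and so it matches neither the page URL nor the sitemap entry.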

When do Link headers help?

The Link HTTP header is the discovery layer's most underrated tool. RFC 8288 gives it the same semantics as the HTML <link> element: the server declares relationships between resources directly in the response, so a client doesn't have to download and parse HTML to read them (RFC 8288, MDN).

Two scenarios dominate. First, a canonical for non-HTML documents: Google supports rel="canonical" via the Link header for files like PDF or Word, which have no <head> to edit. Second, rel="alternate" with hreflang for localized non-HTML files. One header can declare both (wrapped here for readability; on the wire, send it as a single line or as two separate Link headers):

Link: <https://example.com/guide>; rel="canonical",
      <https://example.com/guide.fr>; rel="alternate"; hreflang="fr"

Two syntax warnings, because these break real implementations. The rel parameter is required in every link-value — a rel-less entry isn't "almost fine", it's a broken signal. And if you declare a canonical in both the HTML and the header, the risk of conflict rises sharply: Google supports both methods but calls combining them error-prone. Pick one home per page, and use absolute URLs there too.
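A rel-less entry is easy to catch with a small parser. The sketch below is a simplified reading of the RFC 8288 syntax, not a full implementation: it assumes quoted values contain no escaped quotes, and the third link in the sample header is a deliberately broken, made-up entry.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+(?:\s*=\s*(?:"[^"]*"|[^",;\s]+))?)*)'
)
PARAM = re.compile(r';\s*([\w*-]+)(?:\s*=\s*(?:"([^"]*)"|([^",;\s]+)))?')


def parse_link_header(value):
    """Return a list of (target, params) tuples from a Link header value."""
    links = []
    for target, raw_params in LINK_VALUE.findall(value):
        params = {}
        for name, quoted, token in PARAM.findall(raw_params):
            # Keep the first occurrence of a parameter; later repeats are ignored.
            params.setdefault(name.lower(), quoted or token)
        links.append((target, params))
    return links


header = (
    '<https://example.com/guide>; rel="canonical", '
    '<https://example.com/guide.fr>; rel="alternate"; hreflang="fr", '
    '<https://example.com/orphan>'
)

links = parse_link_header(header)
missing_rel = [target for target, params in links if "rel" not in params]

print(links[0])     # ('https://example.com/guide', {'rel': 'canonical'})
print(missing_rel)  # ['https://example.com/orphan']
```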

What breaks AI discovery most often?

If you remember one section, make it this one — these failures wreck the discovery layer far more often than any missing "AI protocol":

  • Orphan sitemap. The file exists, but robots.txt never references it and nobody submitted it through Search Console or the API. Crawlers have no reliable way to find the file, so one of your site's two discovery paths is gone.
  • Non-preferred URLs in the sitemap. Parameter variants, session IDs, and temporary URLs instead of canonicals.
  • lastmod that lies. A CMS that bumps it on every deploy teaches crawlers to ignore it. It's useful only when it tracks significant changes.
  • Conflicting canonicals. One URL in the sitemap, another in the HTML, a third in a header — or multiple rel="canonical" tags, or a canonical sitting in the <body>.
  • Sitemap-only pages. URLs listed in the map but stranded outside the link graph, because navigation is built from script events instead of <a href> links — the same trap covered in making a JavaScript site readable to AI.
  • robots.txt passes, the network blocks. The most common hidden blocker: a WAF or CDN rule stops the bots your robots file welcomes. OpenAI publishes IP ranges to allowlist, and Anthropic warns that crude IP blocking can prevent its bots from even reading your robots.txt.
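The last failure in the list can be partly tested from your own machine. The sketch below compares what robots.txt says with the HTTP status a request actually receives; the function names and the BLOCK_STATUSES set are assumptions. Treat it as a first pass only: many WAF rules key on IP address rather than user-agent, so a clean result from your own network does not prove that the real bots get through.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

# Assumption: these statuses usually point to a WAF, CDN, or rate-limit rule.
BLOCK_STATUSES = {401, 403, 429, 503}


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status a request with this User-Agent receives."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def diagnose(robots_txt, user_agent, url, status):
    """Classify a URL by combining the robots.txt verdict with the observed status."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return "disallowed by robots.txt"
    if status in BLOCK_STATUSES:
        return "blocked at the network layer"
    return "reachable"


robots = "User-agent: *\nAllow: /\n"
print(diagnose(robots, "OAI-SearchBot", "https://example.com/", 403))
# blocked at the network layer
```

In a live check you would pass fetch_status(url, user_agent) as the status argument instead of a fixed number.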

How do you check your discovery layer?

You don't need a new AI protocol — you need the existing discovery layer to be unambiguous. That's a finite, checkable list, which is exactly what a scan is for. aiSiteReady's discoverability category maps onto this article one-to-one: a sitemap check (present, reachable, referenced from robots.txt), a canonical & metadata check, and a Link headers check. They sit alongside roughly 15 to 20 checks that combine into an Agent Readiness Score from 0 to 100. The exact checks and weights live on the methodology page; for the bigger picture, see what AI agent readiness means.

This page eats its own cooking, by the way. It ships a self-referencing canonical, sits in our sitemap under that same URL, and renders its FAQ as static JSON-LD — open view-source and check. Building the scanner forced us through every spec cited above; the article is the checklist we wish we'd had.

After you ship the fixes, run a free scan: it fetches your site the way an agent would and shows which discovery signals it actually found. You get the first fixes ranked by impact and a developer-ready task, in English, Ukrainian, or Russian — no sign-up, any public URL. The score is a readiness diagnostic, not a ranking guarantee; Google is explicit that meeting every requirement still guarantees nothing.

The cleanest way to state the whole thing: don't make AI and search bots guess your site's structure. Give them an explicit URL list through the sitemap, one source-of-truth URL per page through the canonical, and machine-readable relationships through HTML and HTTP metadata. In 2026, that's the sanest possible answer to "help AI find my pages."

IMozz has 20 years in software development, with the past year spent building with LLMs. He builds aiSiteReady, a read-only scanner that checks whether AI agents can read a site. It server-renders its own content as a working example.

Frequently asked questions

Do I need a special sitemap for AI crawlers?
No. There is no AI-specific sitemap standard. Google's guidance on generative-AI features says standard SEO practices apply and no special AI-only files are required, and OpenAI, Anthropic, and Perplexity all document access through robots.txt, user-agents, and IP allowlists. A correct sitemap.xml, referenced from robots.txt and listing only canonical URLs, serves classic search engines and AI crawlers alike.

How big can a sitemap be, and what if I have more URLs?
A single sitemap may contain up to 50,000 URLs and must stay under 50 MB uncompressed; you can gzip it, but the limit applies after decompression. Larger sites split URLs across multiple sitemaps and publish a sitemap index, which can itself list up to 50,000 child sitemaps. Referencing just the index from robots.txt is enough; crawlers pick up the children from there.

Do changefreq and priority still matter in sitemaps?
No. Google has confirmed it ignores changefreq and priority entirely. The field that matters is lastmod, and only when it consistently reflects reality. If your CMS bumps lastmod on every deploy or trivial edit, crawlers eventually stop trusting it. Set it to the date of the last significant content change, both per URL and at the sitemap-index level.

Can I declare a canonical without editing the HTML?
Yes. Google supports a canonical declared in the HTTP Link response header (RFC 8288), which is the recommended route for non-HTML documents like PDFs that have no head element to edit. The same header can carry hreflang alternates for localized files. Avoid declaring a canonical in both the HTML and the header, though: Google warns that combining the two methods is error-prone.