The 2026 Technical GEO Stack: llms.txt, Schema, and AI Crawlers Explained

[Image: Developer reviewing a technical GEO stack: an llms.txt file, a schema knowledge graph, and AI crawler bots at a gateway, connected to a server stack.]

Key takeaways

  • Technical GEO has three layers: a crawl layer (robots.txt and AI bots), an entity layer (schema and structured data), and a context layer (llms.txt). Each does a different job, and two of the three are widely misunderstood.
  • llms.txt is real and well-specified, but in 2026 the data says it does almost nothing for AI citations. SE Ranking found no correlation across 300,000 domains, and Google has said twice that it will not use the file.
  • Schema markup will not lift citations on a page that AI already cites. A 2026 Ahrefs study of 1,885 pages found no meaningful citation gain. But schema still feeds the knowledge graph and helps pages get crawled and understood as entities.
  • AI crawlers split into two jobs: training (GPTBot, ClaudeBot, Google-Extended) and retrieval (OAI-SearchBot, PerplexityBot, Claude-SearchBot). Blocking the wrong one quietly removes you from AI answers.

Technical GEO is the set of files and markup that control how AI search engines crawl, understand, and cite your site: robots.txt rules for AI bots, schema or structured data, and the newer llms.txt file. Most guides treat all three as must-do wins. After implementing the full stack on our own site and reading every 2026 study I could find, my honest take is narrower: one of these moves the needle a lot, one helps indirectly, and one does close to nothing right now. This post is the technical reference I wish existed when I started.

I am writing this as a developer, not a marketer. We run BlueJar, a GEO audit platform, and in late May 2026 I rebuilt our own technical GEO stack from scratch. I will show you exactly what we shipped, what the evidence says about each piece, and where I think people are wasting their time. Three of the questions I see most often on r/SEO and r/TechSEO get direct answers below: does llms.txt actually do anything, what to make of Google adding llms.txt to its own docs, and whether piling on schema helps an AI understand your brand.

The three layers of technical GEO

Separate the stack by what each layer talks to. The three layers are not interchangeable, and treating them as one bucket of “generative engine optimization tasks” is where most advice goes wrong.

  • The crawl layer (robots.txt): decides which AI bots may fetch your pages at all. This is access control. Get it wrong and you are invisible no matter how good your content is.
  • The entity layer (schema, structured data): labels who you are and what a page contains in a machine-readable way. This feeds search indexes and knowledge graphs, which some AI engines read from.
  • The context layer (llms.txt): a curated, plain-text map of your site meant for language models. It is the newest idea and the least proven.

Two distinctions run through everything that follows. The first is crawl versus retrieval: AI search happens in two stages, indexing your content and then choosing what to cite at answer time. Whether you get picked at that second stage is mostly a citation readiness problem, not a file-format one. The second distinction is train versus retrieve: some bots collect data to train models, others fetch pages to answer a live query. Keep both in mind and the whole stack gets clearer.

AI crawlers and the robots.txt rules that gate them

The crawl layer is the one piece of technical GEO with no debate about whether it matters. If you block the bot that surfaces you in AI answers, you disappear from those answers. The trap is that the bots are not interchangeable, and blocking one to protect your content can quietly cost you visibility.

OpenAI is the clearest example. Its crawler documentation spells out four user agents, and the key line is that you can treat them independently: “a webmaster can allow OAI-SearchBot in order to appear in search results while disallowing GPTBot to indicate that crawled content should not be used for training.” In other words, GPTBot is for training and OAI-SearchBot is for citations. Block GPTBot if you do not want your content training foundation models. Block OAI-SearchBot and, per OpenAI, your site “will not be shown in ChatGPT search answers.” Those are very different outcomes from two lines in the same file.
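To make the "two lines in the same file" point concrete, here is a minimal sketch that parses a small robots.txt with Python's standard-library `urllib.robotparser` and checks each bot separately. The robots.txt content and the `example.com` URL are illustrative assumptions, not OpenAI's recommended file, and the standard-library parser only approximates how any real crawler interprets the rules.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not an official template):
# opt out of training, stay eligible for ChatGPT search.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"  # hypothetical URL
    for bot in ("GPTBot", "OAI-SearchBot"):
        print(bot, "allowed:", can_fetch(ROBOTS_TXT, bot, page))
```

Running a check like this before deploying a robots.txt change is a cheap way to catch a rule that blocks a retrieval bot by accident.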

The same split repeats across providers. Anthropic runs ClaudeBot for training, Claude-SearchBot for its search index, and Claude-User for user-triggered fetches. Perplexity runs PerplexityBot for search and Perplexity-User for live questions. Google handles it differently again: Google-Extended is a robots.txt control token, not a separate crawler, that governs whether your already-crawled content trains future Gemini models. Google states it “does not impact a site’s inclusion in Google Search nor is it used as a ranking signal.”

[Image: AI crawler bots passing through an open gate while one is blocked, illustrating robots.txt allow and disallow rules for AI crawlers.]

Here is the practical map of the major AI crawlers and what blocking each one actually does.

| User agent | Company | Job | Train or retrieve | What blocking it does |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Collects content for model training | Train | Your content is not used to train OpenAI foundation models. Citations unaffected. |
| OAI-SearchBot | OpenAI | Surfaces sites in ChatGPT search | Retrieve | You will not appear in ChatGPT search answers. |
| ChatGPT-User | OpenAI | User-triggered page fetch | Retrieve | Limited effect; user-initiated, so robots.txt may not apply. |
| ClaudeBot | Anthropic | Collects content for model training | Train | Your future content is excluded from Claude training data. |
| Claude-SearchBot | Anthropic | Indexes content for Claude search | Retrieve | Reduced visibility and accuracy in Claude search results. |
| Claude-User | Anthropic | User-triggered page fetch | Retrieve | Claude cannot retrieve your page to answer a user question. |
| PerplexityBot | Perplexity | Surfaces and links sites in results | Retrieve | You will not be surfaced or linked in Perplexity answers. |
| Perplexity-User | Perplexity | User-triggered page fetch | Retrieve | Limited; this fetcher “generally ignores robots.txt rules.” |
| Google-Extended | Google | Controls Gemini training and grounding use | Train | Your content is not used to train Gemini. Google Search is unaffected. |
| CCBot | Common Crawl | Open web archive many models train on | Train | Your content is excluded from the Common Crawl dataset. |

One caveat on enforcement: blocking is a request, not a wall. In August 2025, Cloudflare reported that Perplexity used undeclared crawlers impersonating Chrome on macOS to fetch pages on brand-new test domains that blocked both of Perplexity’s declared bots in robots.txt and at the firewall. Cloudflare de-listed Perplexity as a verified bot over it. So a robots.txt allow list documents your intent and works for compliant crawlers, but it is not a guarantee, and the user-triggered agents (ChatGPT-User, Perplexity-User) often ignore robots.txt by design because a human asked for the page.

My default for most sites that want AI visibility: allow the retrieval bots (OAI-SearchBot, PerplexityBot, Claude-SearchBot), then decide on the training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) based on how you feel about your content training models. There is no citation penalty for blocking the training bots, which is the part most people get backwards. If your goal is specifically to show up when someone asks ChatGPT a question, the allow rule for OAI-SearchBot is step zero, and the rest of the work is covered in how to get cited by ChatGPT.
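The default above can be written down as a small generator. This is a sketch under assumptions: the function name `build_robots_txt` and its single `allow_training` flag are mine, the bot lists are the ones named in this post, and a real file would also carry your normal rules for other crawlers and private paths.

```python
# Bot names are taken from the table in this post; everything else is illustrative.
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def build_robots_txt(allow_training: bool) -> str:
    """Render one robots.txt group per bot.

    Retrieval bots are always allowed; training bots follow the flag.
    """
    groups = [f"User-agent: {bot}\nAllow: /" for bot in RETRIEVAL_BOTS]
    training_rule = "Allow: /" if allow_training else "Disallow: /"
    groups += [f"User-agent: {bot}\n{training_rule}" for bot in TRAINING_BOTS]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Values call: flip the flag if you are happy to have content used for training.
    print(build_robots_txt(allow_training=False))
```

Keeping the training decision in one flag mirrors the point of the paragraph: it is a separate choice that does not touch the retrieval rules.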

Does llms.txt actually do anything?

This is the question I get most, usually phrased exactly that way on r/SEO. The short answer for 2026: it is a real, well-designed standard that currently produces no measurable citation lift for most sites. Let me explain what it is before I explain why I am skeptical.

llms.txt was proposed by Jeremy Howard of Answer.AI on September 3, 2024. It is a markdown file at your site root that gives language models a curated overview of your site. The reasoning, straight from the proposal, is that LLM context windows are too small to ingest a full site, and converting HTML full of nav, ads, and JavaScript into clean text is “difficult and imprecise.” So site authors, who know their content best, write the map themselves.

The format is specced precisely. The only required element is an H1 with your project name. After that you add a blockquote summary, optional detail paragraphs, then H2 sections that each hold a list of links in the form [name](url): notes. A section titled “Optional” has a special meaning: it holds secondary links a model can skip when it needs a shorter context. The proposal also suggests serving clean markdown twins of pages at the same URL with .md appended.
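As an illustration of that structure, here is a hypothetical llms.txt for a made-up site, plus a small parser that checks the required H1 and collects the H2 link sections. The site name, URLs, and the `parse_llms_txt` helper are all assumptions for this sketch; the parser covers only the elements described above and is not a reference implementation of the proposal.

```python
import re

# Hypothetical llms.txt; every name and URL here is a placeholder.
LLMS_TXT = """\
# Example Project

> One-sentence summary of what the site offers and who it is for.

Optional detail paragraph with context a model should know.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): Install and first run
- [API reference](https://example.com/docs/api.md): Endpoints and parameters

## Optional

- [Changelog](https://example.com/changelog.md): Release history
"""

# Matches "- [name](url)" with an optional ": notes" suffix.
LINK_RE = re.compile(
    r"^- \[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<notes>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Parse title, summary, and H2 link sections; raise if the H1 is missing."""
    lines = [line.rstrip() for line in text.splitlines()]
    non_blank = [line for line in lines if line]
    if not non_blank or not non_blank[0].startswith("# "):
        raise ValueError("llms.txt must start with an H1 title")
    result = {"title": non_blank[0][2:].strip(), "summary": None, "sections": {}}
    current = None  # name of the H2 section being read, if any
    for line in lines:
        if line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is None and line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                result["sections"][current].append(match.groupdict())
    return result


if __name__ == "__main__":
    parsed = parse_llms_txt(LLMS_TXT)
    print(parsed["title"], list(parsed["sections"]))
```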

So the idea is sound. The problem is the evidence. SE Ranking crawled roughly 300,000 domains and found two things. First, only 10.13% had an llms.txt file at all, and adoption was roughly flat across traffic tiers (9.88% for sites with 0 to 100 monthly visits, 8.27% for sites with 100,000-plus). If anything, the highest-traffic sites adopt it slightly less than the smallest ones. Second, and more damning, they found no correlation between having an llms.txt and how often a domain gets cited by LLMs. Removing the llms.txt variable from their XGBoost model actually improved its predictive accuracy, which suggests the variable was adding noise rather than signal.

Server logs tell the same story. A 48-day log study by wislr.com (February to March 2026) recorded 12,099 AI bot requests and found zero requests to /llms.txt from any AI crawler. I will add a fair counterpoint here, because I have seen it firsthand and so have others: some operators do see OpenAI polling their llms.txt. SEO Ray Martinez posted server logs showing OpenAI hitting his file every 15 minutes or so. So “no crawler ever touches it” is too strong. The accurate statement is that the files get fetched inconsistently and have not been shown to move citations.
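You can run the same kind of check on your own logs. The sketch below assumes access logs in the common "combined" format and matches bots by a substring of the user-agent string. The sample lines are synthetic and exist only to show the shape of the input; the user-agent strings in them are simplified stand-ins, not the exact strings real crawlers send. Google-Extended is left out on purpose, since it is a robots.txt token rather than a crawler that shows up in logs.

```python
import re
from collections import Counter

# Substrings to look for in the user-agent field (names from this post).
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot",
]

# Combined log format: request line, status, size, referer, user agent.
LOG_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Synthetic sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "SyntheticAgent (compatible; GPTBot)"',
    '203.0.113.6 - - [01/Mar/2026:10:05:00 +0000] "GET /llms.txt HTTP/1.1" 200 900 "-" "SyntheticAgent (compatible; OAI-SearchBot)"',
    '198.51.100.7 - - [01/Mar/2026:10:06:00 +0000] "GET /llms.txt HTTP/1.1" 200 900 "-" "SyntheticBrowser/1.0"',
]


def count_ai_hits(log_lines):
    """Return (all AI bot hits, AI bot hits on /llms.txt) as Counters keyed by bot."""
    total, llms = Counter(), Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        ua = match.group("ua").lower()
        bot = next((t for t in AI_BOT_TOKENS if t.lower() in ua), None)
        if bot is None:
            continue  # not one of the AI bots we track
        total[bot] += 1
        if match.group("path").split("?")[0] == "/llms.txt":
            llms[bot] += 1
    return total, llms


if __name__ == "__main__":
    total, llms = count_ai_hits(SAMPLE_LOG)
    print("AI bot hits:", dict(total))
    print("AI bot hits on /llms.txt:", dict(llms))
```

Matching on the user-agent string alone counts anything that claims to be a bot, so treat the totals as an upper bound unless you also verify source IPs.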

What to make of Google adding llms.txt to its own docs

This one came straight off r/SEO too, and it confused a lot of people, so it is worth untangling. In December 2025, Google quietly added an llms.txt file to its own Search Central developer docs. Cue a wave of “see, Google does use it” posts. That conclusion does not follow.

Google has said the opposite twice, on the record. John Mueller posted in June 2025 that “no AI system currently uses llms.txt,” and separately compared the file to the long-dead keywords meta tag. Then at Search Central Live Deep Dive on July 23, 2025, Gary Illyes told the room that “Google doesn’t support LLMs.txt and isn’t planning to,” and that normal SEO is all you need to appear in AI Overviews. That session was recapped by Search Engine Land via attendee Kenichi Suzuki.

So why does Google’s own docs site have one? Because adding an llms.txt is a default gesture for developer documentation platforms (Mintlify-style setups generate them automatically). It signals dev-docs hygiene, not a change in how Google’s retrieval pipeline works. Mueller’s own public reaction to the discovery was a dry “hmmn :-/”. A file appearing on one Google property is not a policy. The two clear statements from Google about whether it uses the file for ranking and answers still stand.

Should llms.txt include all pages or just key content?

If you do ship one, this is the right question, and r/seogrowth asks it often. The answer is key content, curated, not a clone of your sitemap. That is the explicit intent of the spec: a concise, expert-level summary in a single place, not a full URL dump. You already have sitemap.xml for completeness.

The three files are complementary, not competing. Here is what each one is actually for.

| File | Audience | Format | Contains content context? | Status |
| --- | --- | --- | --- | --- |
| robots.txt | All crawlers, including AI bots | Plain text directives | No, access rules only | Universal, respected by compliant bots |
| sitemap.xml | Search and AI crawlers | XML | No, URLs and dates only | Near-universal |
| llms.txt | Language models | Markdown | Yes, curated summaries and links | ~10% adoption, no proven citation effect |

Given the evidence, here is the only honest recommendation I can give. Ship an llms.txt if you publish API or SDK docs that developers paste into AI tools, if your CMS generates it for free, or if you want a cheap hedge in case the standard gains traction later. Skip it as a citation tactic if you are a marketing site, an e-commerce brand, a local business, or a publisher, because no study I have seen shows it earning citations in those contexts. If you ship it, write it well: a clear site description, your topics and entities, and links to your strongest pages with one-line descriptions. A stale file that points to dead URLs is worse than none.

Does extensive schema help an AI understand your entity?

This is the r/TechSEO question, and it deserves a careful answer because the popular one is wrong in both directions. The honest version: schema does not work the way most “44% more citations” headlines claim, and it is also not useless. You have to separate two things, what AI engines do at retrieval and what they do at indexing.

Start with retrieval, because this is where the myth breaks. In October 2025, searchVIU built a test page and checked whether ChatGPT, Claude, Perplexity, Gemini, and Google AI Mode actually read structured data when they fetch a page live to answer a question. None of them did. Every system extracted only the visible HTML content. JSON-LD, Microdata, and RDFa were all ignored during direct retrieval. So if your mental model is that an AI reads your JSON-LD at answer time and gets smarter about you, that model is wrong.

Then there is the citation question, and here the best study is brutal and clarifying. Ahrefs tracked 1,885 pages that added JSON-LD between August 2025 and March 2026, matched them against around 4,000 control pages, and measured citation changes with a difference-in-differences design. The result: adding schema produced no major citation uplift on any platform. Google AI Mode came in at plus 2.4% and ChatGPT at plus 2.2%, both statistically indistinguishable from zero, and Google AI Overviews actually dropped 4.6% (small, and both groups were already declining). Four separate tests all pointed the same way. Schema is not a citation shortcut.
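If difference-in-differences is unfamiliar, the arithmetic is simple: subtract the control group's change from the treated group's change, so a trend that hits both groups cancels out. The sketch below uses made-up citation counts, not Ahrefs' data, and it shows only the core subtraction, not the matching or significance testing a real study needs.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    return (treated_after - treated_before) - (control_after - control_before)


if __name__ == "__main__":
    # Hypothetical average citations per page (illustrative numbers only).
    # Both groups decline, but the pages that added schema decline slightly less.
    effect = diff_in_diff(
        treated_before=100, treated_after=90,   # pages that added schema
        control_before=100, control_after=88,   # matched pages that did not
    )
    print("Estimated effect of adding schema:", effect)
```

This is why a raw drop after adding schema, like the AI Overviews figure above, does not by itself mean schema hurt: the comparison that matters is against what the control pages did over the same period.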

[Image: A schema knowledge graph linking an organization, a person, and an article through connected nodes.]

So why bother with schema at all? Two reasons the studies actually support. First, the Ahrefs study has a critical caveat: every page it measured was already cited heavily, with 100-plus AI Overview citations before any schema was added. As Ahrefs put it, if a page is already getting picked up, schema will not push it higher, but “for pages that aren’t being seen by AI systems at all, schema markup might still play a role in helping them get crawled, parsed, or indexed in the first place.” searchVIU makes the same point: schema is likely used in the crawl and index phases, “especially by Google AI Overviews and Bing Copilot, which have access to search indexes,” even though it is ignored at direct fetch.

Second, schema feeds the knowledge graph, and that is the real entity-understanding payoff. As BrightEdge describes it, an AI engine will not “parse your JSON-LD to form its answer word-for-word,” but schema “makes your content more digestible to search crawlers and knowledge graphs,” which “turns your site into a machine-readable knowledge graph.” Google’s own guidance is consistent: AI Overviews need no special markup, just normal SEO, but schema gives “extra clarity.” The single most useful property for entity work is sameAs, which links your organization to its profiles elsewhere so engines can connect the dots into one entity.

So, does extensive schema help an AI understand your entity better? It helps the indexing and knowledge-graph layer that some engines read from, and it disambiguates who you are. It does not get read at inference, and bolting more of it onto already-cited pages does nothing. That nuance is the whole answer. And note one practical limit: Google restricted FAQ rich results in Search in August 2023 (they now show only for well-known, authoritative government and health sites), so FAQPage schema is no longer a rich-result play for most sites, even though the markup itself is still valid and harmless.

Which schema types are worth your time, and in what order

If schema is for entity grounding and crawl-stage clarity rather than citation lift, that changes which types you prioritize. I would focus on the ones that establish identity and structure, and use JSON-LD, which Google recommends and which is easier to maintain than Microdata or RDFa because it lives in a separate script tag instead of tangling your HTML.

  • Organization: establishes your brand as an entity. Add @id, alternateName, knowsAbout, and sameAs links to your real profiles. This is the foundation for everything else.
  • Person: for named authors and founders, with jobTitle, sameAs to LinkedIn and X, and worksFor pointing back to the Organization by @id. This carries your real E-E-A-T signals, which I cover in E-E-A-T for AI search citations.
  • Article or BlogPosting: on content pages, with author, datePublished, and an honest dateModified you actually update.
  • WebSite with SearchAction: describes the site itself and connects the graph.
  • Product or Service, BreadcrumbList, FAQPage: where they genuinely match the page. Do not mark up content that is not there.

The bigger lever than adding more types is connecting the ones you have. Reference your Person schema from your Article via @id, point Person at Organization, point Organization at your WebSite, and you build one interconnected graph an engine can traverse instead of a pile of disconnected blocks. One warning that comes straight from Google’s John Mueller: do not overdo it. He cautions against schema bloat, marking up everything for its own sake. Mark up what explains the content, validate it, and move on. If you want the full page-level rubric, our GEO audit checklist of 25 factors breaks schema down check by check.
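Here is a minimal sketch of what "one interconnected graph" can look like, built as a Python dict and serialized to JSON-LD. Everything in it is a placeholder: the `example.com` URLs, the names, and the fragment-style `@id` values are assumptions, and the WebSite-to-Organization link uses `publisher`, which is one reasonable choice rather than the only one. The `dangling_refs` helper is a hypothetical sanity check that every `@id` reference points at a node defined in the same graph.

```python
import json

BASE = "https://example.com"  # hypothetical site


def build_graph() -> dict:
    """Build a small JSON-LD @graph whose nodes reference each other by @id."""
    org_id = f"{BASE}/#organization"
    site_id = f"{BASE}/#website"
    person_id = f"{BASE}/#jane-doe"
    return {
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "Organization", "@id": org_id,
                "name": "Example Co", "url": BASE,
                "sameAs": ["https://www.linkedin.com/company/example-co"],
                "founder": [{"@id": person_id}],
            },
            {
                "@type": "WebSite", "@id": site_id,
                "name": "Example Co", "url": BASE,
                "publisher": {"@id": org_id},
            },
            {
                "@type": "Person", "@id": person_id,
                "name": "Jane Doe", "jobTitle": "Co-Founder",
                "worksFor": {"@id": org_id},
            },
            {
                "@type": "Article", "@id": f"{BASE}/blog/post#article",
                "headline": "Example post",
                "author": {"@id": person_id},
                "publisher": {"@id": org_id},
                "isPartOf": {"@id": site_id},
                "datePublished": "2026-01-15",
                "dateModified": "2026-02-01",
            },
        ],
    }


def dangling_refs(doc: dict) -> set:
    """Return @id values that are referenced but never defined as a node."""
    defined = {node["@id"] for node in doc["@graph"]}
    referenced = set()

    def walk(value):
        if isinstance(value, dict):
            if set(value) == {"@id"}:  # a bare reference to another node
                referenced.add(value["@id"])
            for child in value.values():
                walk(child)
        elif isinstance(value, list):
            for child in value:
                walk(child)

    walk(doc["@graph"])
    return referenced - defined


if __name__ == "__main__":
    graph = build_graph()
    assert not dangling_refs(graph), "every @id reference should resolve"
    print(json.dumps(graph, indent=2))
```

The printed JSON would go inside a `<script type="application/ld+json">` tag on the page. A check like `dangling_refs` does not replace a schema validator, but it catches the most common way a connected graph quietly becomes a pile of disconnected blocks.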

What we actually shipped on BlueJar, and why

In late May 2026 I rebuilt our own technical GEO stack, so I can show you the real thing rather than a hypothetical. I want to be clear up front: I have no before-and-after ranking data to wave around. This is an implementation account, not a results claim. We did it because the stack is cheap to get right and because we would rather practice what we audit.

On the crawl layer, we replaced our default robots.txt with a physical file that explicitly allows the AI crawlers we want (GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and CCBot among them), allows /llms.txt, and disallows the admin and API paths. We chose to allow training bots too, but that is a values call you can make either way without affecting citations.

On the entity layer, we did the work that the evidence says matters. Our Organization schema now carries an @id, an alternateName array, a knowsAbout list of the topics we cover, and a founder array linking both founders by @id. Each founder has a full Person schema with jobTitle, sameAs to LinkedIn and X, and worksFor. We added a WebSite schema with a SearchAction and wired the whole thing into a single @graph so every node references the others by @id. This is the knowledge-graph grounding work, not a citation hack.

On the context layer, we did rewrite our llms.txt, expanding it from a thin 8.3 KB link directory into a 14.7 KB narrative file with nine sections: intro, an At a Glance table, About, Team, Use Cases per audience, Pricing, a competitor comparison, inline FAQs, and social links. Given everything above, you might ask why we bothered. Honestly, partly as a cheap hedge and partly so I could write about it from experience. I would not tell a client it will earn them citations. I would tell them it took an afternoon and cannot hurt.

Want to see where your own technical GEO stands? Run a free GEO audit at bluejar.ai. It scores your page across schema, technical SEO, E-E-A-T, and citation readiness, rolls that into a single GEO score, and tells you which of these layers is actually holding you back.

Frequently asked questions

Does llms.txt actually do anything for AI visibility?

By the 2026 evidence, not much for most sites. SE Ranking analyzed roughly 300,000 domains and found no correlation between having an llms.txt and AI citation frequency, and removing the file from their predictive model improved its accuracy. A 48-day server-log study by wislr.com recorded zero AI crawler requests to /llms.txt across 12,099 bot hits, though some operators do see OpenAI polling the file inconsistently. It is a sound idea with no proven payoff yet.

Does Google use llms.txt? It added one to its own docs.

No. Google added an llms.txt to its Search Central developer docs in December 2025, but that is a standard developer-docs gesture, not a retrieval-policy change. Google has stated twice that it does not use the file: John Mueller said in June 2025 that “no AI system currently uses llms.txt,” and Gary Illyes said at Search Central Live in July 2025 that Google “doesn’t support LLMs.txt and isn’t planning to.”

Should llms.txt include all my pages or just key content?

Just your key content, curated with short descriptions. The llms.txt spec is explicitly designed as a concise, expert-level summary, not a sitemap clone. Use sitemap.xml for complete URL coverage and reserve llms.txt for your strongest pages, main topics, and the entities you are authoritative about.

Does extensive schema markup help an AI understand my entity better?

It helps the indexing and knowledge-graph layer, not the inference step. A 2025 searchVIU experiment found that ChatGPT, Claude, Perplexity, Gemini, and Google AI Mode all ignored JSON-LD, Microdata, and RDFa when fetching a page live, extracting only visible HTML. Schema still feeds search indexes and knowledge graphs that some engines rely on, and properties like sameAs disambiguate your brand as a single entity. So it aids entity grounding, but adding more of it is not a citation shortcut.

Will adding schema increase my AI citations?

Probably not on pages AI already cites. Ahrefs tracked 1,885 pages that added JSON-LD between August 2025 and March 2026 against matched controls and found no meaningful citation gain on any platform. Schema may still help pages that are not being seen yet get crawled, parsed, and indexed, but it does not push an already-cited page higher.

Which AI crawlers should I allow in robots.txt?

Allow the retrieval bots if you want to appear in AI answers: OAI-SearchBot for ChatGPT, PerplexityBot for Perplexity, and Claude-SearchBot for Claude. Decide separately on the training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) based on whether you want your content training models, because blocking them carries no citation penalty. The mistake to avoid is blocking a retrieval bot like OAI-SearchBot, which removes you from ChatGPT search answers entirely.

What is the difference between training crawlers and retrieval crawlers?

Training crawlers collect content to teach future models (GPTBot, ClaudeBot, Google-Extended, CCBot). Retrieval crawlers fetch pages to answer a live user query and decide what gets cited (OAI-SearchBot, PerplexityBot, Claude-SearchBot, plus user-triggered agents like ChatGPT-User and Perplexity-User). Blocking a training crawler protects your data without hurting visibility. Blocking a retrieval crawler removes you from that engine’s answers.

How do I know if my technical GEO setup is actually working?

Check two things. First, your server logs: look for fetches from the retrieval bots above over a 30-day window to confirm you are being crawled. Second, run a citation panel by asking 30 to 50 real queries across ChatGPT, Perplexity, Gemini, and Copilot and recording where you appear. BlueJar’s GEO audit automates the second part and scores your schema and technical setup, so you can see which layer is the bottleneck rather than guessing.
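If you run the citation panel by hand, a small tally keeps the results honest. This sketch assumes you record each check as an (engine, query, cited) tuple; the function name and the sample rows are hypothetical and only show the shape of the data.

```python
from collections import defaultdict


def citation_rates(results):
    """Share of queries per engine where the site was cited.

    results: iterable of (engine, query, was_cited) tuples.
    """
    asked = defaultdict(int)
    cited = defaultdict(int)
    for engine, _query, was_cited in results:
        asked[engine] += 1
        if was_cited:
            cited[engine] += 1
    return {engine: cited[engine] / asked[engine] for engine in asked}


if __name__ == "__main__":
    # Hypothetical panel rows; a real panel would have 30 to 50 queries.
    panel = [
        ("ChatGPT", "best geo audit tool", True),
        ("ChatGPT", "what is llms.txt", False),
        ("Perplexity", "best geo audit tool", False),
        ("Perplexity", "what is llms.txt", False),
    ]
    for engine, rate in citation_rates(panel).items():
        print(f"{engine}: cited in {rate:.0%} of queries")
```

Re-running the same query set on a schedule gives you a trend line, which is more useful than any single snapshot.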

About the author
Badal Satyarthi, Co-Founder & AI Engineer, BlueJar

Badal Satyarthi is the co-founder of BlueJar, the AI visibility platform for GEO audits and optimization. He writes about generative engine optimization, AI search, and the future of content discovery.