Retrieval: the second pillar of Answerability.
A page that answers the question perfectly is still uncited if the engine cannot reach it, cannot read it without running JavaScript, or cannot tell what it is. Retrieval is the engineering layer — the easiest to fix, and the one companies fail most quietly.
§1 The layer everything sits on
Retrieval is an operational precondition for the other two pillars, but that does not make the three sequential. The pillars still fail independently: a perfectly retrievable page can be unanswerable or untrusted on its own terms. What Retrieval controls is whether the evaluation happens at all, because an engine cannot assess the Content or Trust of a page it could not read in the first place. This is where a citation is silently lost before the other two pillars are ever consulted.
It is also the pillar most continuous with two decades of technical SEO. Where the two practices overlap — crawl access, server-side rendering, structured data, sitemap hygiene — they are saying the same thing, and we do not invent novelty where none exists. A company whose technical SEO is in good order will usually find its Retrieval already strong. The divergence is narrow but real: AI crawlers are a different set of user-agents with their own access rules, most of them do not execute JavaScript, and the machine-readable signals that matter for being understood by a model are a superset of the ones that mattered for ranking in a list. The bar moved; it did not move far.
The scenario below is the one this pillar exists to prevent: a site with strong Content and respectable Trust that scores near zero on Retrieval, and is therefore invisible regardless of how good the other two are.
§2 Crawl access and the AI user-agents
The first question is the bluntest: can the engine's crawler fetch the page at all? AI crawlers are distinct user-agents, documented by their operators, and they are governed by the same Robots Exclusion Protocol as any other bot.1 The agents that matter today include OpenAI's GPTBot and OAI-SearchBot, Anthropic's ClaudeBot, PerplexityBot, Google's Google-Extended (a robots.txt control token rather than a separate crawler), and Common Crawl's CCBot, whose corpus is a training-data substrate for several systems.
Two failure shapes recur. The first is accidental blocking: a robots.txt rule or a CDN bot-mitigation layer that drops the AI crawlers along with the scrapers, frequently without anyone deciding it should. The second is the absence of any decision at all — no AI-crawler directives, so the agents fall under whatever User-agent: * rules exist, which may or may not serve the company's interest. Neither failure is visible in analytics, because a crawler that was blocked leaves no trace in the place anyone looks.
A site that intends to be cited should make the decision explicitly. Answerability's own robots.txt is the shape we recommend — every relevant agent named and allowed, nothing left to a default:
    # Answerability.ai - robots.txt (excerpt)
    # This site exists to be cited by AI search systems. Allow the crawlers that matter.

    # OpenAI
    User-agent: GPTBot
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    # Anthropic
    User-agent: ClaudeBot
    Allow: /

    # Perplexity
    User-agent: PerplexityBot
    Allow: /

    # Google
    User-agent: Google-Extended
    Allow: /

    # Common Crawl
    User-agent: CCBot
    Allow: /
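One way to catch accidental blocking is to run a robots.txt through a parser and ask, agent by agent, what it permits. The sketch below uses Python's standard `urllib.robotparser`. The `SAMPLE` file and the example.com URL are hypothetical, and the check covers robots.txt only: a CDN bot-mitigation layer that drops crawlers has to be tested separately.

```python
from urllib.robotparser import RobotFileParser

# Agents named in this note; extend the list as operators publish new ones.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]


def ai_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Return {agent: allowed?} for `url` under the given robots.txt text.

    Note: Python's parser applies the first matching rule in file order, while
    RFC 9309 specifies longest-match precedence, so treat this as a first pass.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_AGENTS}


# Hypothetical file: a blanket rule written for scrapers that also catches AI crawlers.
SAMPLE = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    for agent, allowed in ai_access(SAMPLE).items():
        print(f"{agent:16} {'allowed' if allowed else 'BLOCKED'}")
```

In this sample only GPTBot is named, so every other agent inherits the blanket `Disallow` from the `*` group, which is the accidental-blocking shape described above.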
How common is that explicitness in practice? We checked the robots.txt of the firms the engines actually cited in the captures from the Content note. The result is a fair characterization of the field as it stands.
| Site (cited by the engines) | Names any AI crawler? | Effective access |
|---|---|---|
| KBKG | No | Default * rules |
| Clio | No | Default * rules |
| CSSI Services | No | Default * rules |
| Seneca Cost Seg | No | Default * rules |
| MyCase | No | Default * rules |
| Engineered Tax Services | No | Default * rules |
| G2 (review aggregator) | Yes | Explicit AI-crawler rules |
| Capterra (review aggregator) | Yes | Explicit AI-crawler rules |
The pattern
None of the six service firms cited for "best provider" questions names a single AI crawler in robots.txt. The two platforms that do manage AI-crawler access explicitly, G2 and Capterra, are the ones whose business model is being a data source others cite.
Each of those firms falls through to its default User-agent: * rules. The firms getting cited are, for the most part, not managing Retrieval at all; they are reached through the third-party listicles documented in the Content note, not through their own crawl hygiene. Reddit and Wikipedia were excluded from this table: Reddit returned a network-policy block to automated fetch, which is itself a Retrieval fact worth its own note.2

The lesson is not that these firms are doing something wrong; many are cited regardless. It is that crawl access is rarely the binding constraint people assume it is, and rarely the advantage they assume it would be. Managing it is necessary hygiene, cheap to do, and worth doing before any expensive work, but on its own it moves nothing. It simply removes a way to lose.
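The two columns of the table can be reproduced with a few lines of parsing. What follows is a sketch of the kind of check the method note describes, not a record of the tooling used; the agent set, the function names, and the reuse of the table's label strings are assumptions made for illustration.

```python
# Lowercased tokens for the agents listed in this note (illustrative, not exhaustive).
AI_AGENTS = {"gptbot", "oai-searchbot", "claudebot", "perplexitybot", "google-extended", "ccbot"}


def named_ai_agents(robots_txt: str) -> set[str]:
    """Return the AI agents that have their own User-agent line in the file."""
    named = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        token = value.strip().lower()
        if field.strip().lower() == "user-agent" and token in AI_AGENTS:
            named.add(token)
    return named


def effective_access(robots_txt: str) -> str:
    """Label a file the way the table does."""
    return "Explicit AI-crawler rules" if named_ai_agents(robots_txt) else "Default * rules"


if __name__ == "__main__":
    print(effective_access("User-agent: *\nDisallow: /admin\n"))
```

A host that never mentions an AI agent is labeled "Default * rules" regardless of what those default rules say, which is why the label describes a missing decision rather than a block.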
§3 Parse: what survives ingestion
Reaching a page is not reading it. The second Retrieval question is whether the substance of the page is present in the raw HTML the crawler receives, or whether it materializes only after a browser executes JavaScript. Most current AI crawlers do not execute JavaScript.3 Content that requires client-side rendering to appear is, to those crawlers, an empty document: the HTML shell arrives, the framework never runs, and the substance never appears.
This is the single most consequential parse failure, and it is invisible to the people who built the site, because their browser executes the JavaScript and shows them a complete page. The test is not "does it look complete in my browser." The test is "view source, disable JavaScript, and read what remains." If the answer-bearing content is gone, so is the citation.
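The view-source test can be approximated in code by reading only what the raw HTML contains and checking for the phrases a citation would depend on. This is a minimal sketch using Python's standard `html.parser`; the two sample pages and the key phrase are hypothetical, and a real check would fetch the page without a browser first.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that appears outside <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def raw_text(html: str) -> str:
    """Text a non-JavaScript reader would get from the raw HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def survives_without_js(html: str, key_phrases: list[str]) -> dict[str, bool]:
    """Report, per phrase, whether it is present without running any script."""
    text = raw_text(html).lower()
    return {phrase: phrase.lower() in text for phrase in key_phrases}


# Hypothetical pages: a client-rendered shell and its server-rendered equivalent.
CSR_PAGE = ('<html><head><title>Pricing</title></head><body><div id="root"></div>'
            '<script>render("Plans start at $49 per month")</script></body></html>')
SSR_PAGE = ('<html><head><title>Pricing</title></head><body><h1>Pricing</h1>'
            '<p>Plans start at $49 per month.</p></body></html>')

if __name__ == "__main__":
    print(survives_without_js(CSR_PAGE, ["Plans start at $49"]))
    print(survives_without_js(SSR_PAGE, ["Plans start at $49"]))
```

The client-rendered page carries the answer only inside a script, so the phrase is reported missing even though a browser would display it.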
The remaining parse criteria are less dramatic but compound:
- Semantic structure over div soup. Real headings (h1–h3), paragraph-level prose, and lists give the parser a document it can segment. A page built entirely from styled divs with visual-only hierarchy reads as undifferentiated text.
- Substance in the body, not the chrome. Content that lives in image captions, hover tooltips, accordion panels that load on click, or scroll-triggered reveals is frequently absent from the parsed document. The more interaction a page requires to reveal its substance, the less of that substance survives ingestion.
- No accidental exclusion. A noindex meta tag inherited from a staging template, a canonical pointing at a different URL, a paywall interstitial: each silently removes a page from the candidate set.
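Several of these criteria are mechanical enough to check automatically. The sketch below flags a noindex directive, a canonical that points somewhere other than the intended URL, and a page with no h1–h3 headings. The `audit` function, its messages, and the sample page are hypothetical, and it does not attempt to detect paywalls or click-to-load content.

```python
from html.parser import HTMLParser


class ParseAudit(HTMLParser):
    """Record the exclusion and structure signals found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None
        self.headings = 0

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            if "noindex" in a.get("content", "").lower():
                self.noindex = True
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href") or None
        elif tag in ("h1", "h2", "h3"):
            self.headings += 1


def audit(html: str, intended_url: str) -> list[str]:
    """Return a list of parse problems; an empty list means none were found."""
    scan = ParseAudit()
    scan.feed(html)
    scan.close()
    problems = []
    if scan.noindex:
        problems.append("noindex meta tag present")
    if scan.canonical and scan.canonical.rstrip("/") != intended_url.rstrip("/"):
        problems.append(f"canonical points at {scan.canonical}")
    if scan.headings == 0:
        problems.append("no h1-h3 headings")
    return problems


# Hypothetical page shipped with leftovers from a staging template.
STAGING_LEFTOVER = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    '<link rel="canonical" href="https://staging.example.com/pricing"></head>'
    '<body><div class="title">Pricing</div></body></html>'
)

if __name__ == "__main__":
    for problem in audit(STAGING_LEFTOVER, "https://example.com/pricing"):
        print(problem)
```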
The test is not whether the page looks complete in your browser. It is what remains when the JavaScript does not run.
§4 Machine-readable scaffolding
Past access and parse, Retrieval asks whether the engine can tell what the page is without inferring it from prose. Two conventions carry that signal: structured data and llms.txt.
Structured data (schema.org)
Schema.org JSON-LD describes a page in a vocabulary the engine can consume directly — that this is an Organization, this a Service with a price, this an Article with an author and a date. Applied cleanly, it removes ambiguity from machine reading. The failure mode is not usually absence; it is markup that is technically valid and semantically empty, or worse, contradictory:
- Schema that names an organization the visible page never mentions, or claims a date that does not match the visible timestamp — contradictions that reduce trust in the markup rather than building it.
- A Service block with no serviceType or provider: valid JSON-LD that says nothing.
- Dates present in schema but absent from the visible HTML, and therefore invisible to engines that parse rendered output rather than JSON-LD.
Valid markup that does not describe the page is not a Retrieval asset. It is overhead.
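Two of these failure modes can be detected by comparing the JSON-LD against the rest of the page. The following is a rough sketch under stated assumptions: it handles only top-level objects or arrays (not `@graph`), it treats any text outside script and style tags as visible, and the page, the company names, and the messages are all hypothetical.

```python
import json
from html.parser import HTMLParser


class PageScan(HTMLParser):
    """Split a page into JSON-LD blocks and the text outside scripts and styles."""

    def __init__(self):
        super().__init__()
        self._mode = None  # None, "jsonld", or "skip"
        self._buf = []
        self.blocks = []
        self.visible = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            kind = (dict(attrs).get("type") or "").lower()
            self._mode = "jsonld" if kind == "application/ld+json" else "skip"
        elif tag == "style":
            self._mode = "skip"

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._mode == "jsonld":
                self.blocks.append("".join(self._buf))
            self._buf, self._mode = [], None

    def handle_data(self, data):
        if self._mode == "jsonld":
            self._buf.append(data)
        elif self._mode is None:
            self.visible.append(data)


def schema_problems(html: str) -> list[str]:
    scan = PageScan()
    scan.feed(html)
    scan.close()
    visible = " ".join(scan.visible).lower()
    problems = []
    for raw in scan.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            problems.append("JSON-LD block is not valid JSON")
            continue
        for node in data if isinstance(data, list) else [data]:
            if not isinstance(node, dict):
                continue
            if node.get("@type") == "Service":
                missing = [k for k in ("serviceType", "provider") if not node.get(k)]
                if missing:
                    problems.append("Service has no " + ", ".join(missing))
            if node.get("@type") == "Organization":
                name = str(node.get("name", ""))
                if name and name.lower() not in visible:
                    problems.append(f"Organization '{name}' is not named in the visible text")
    return problems


# Hypothetical page: an empty Service block and an Organization the page never mentions.
PAGE = """
<h1>Acme Cost Segregation</h1>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Service", "name": "Studies"}
</script>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Globex Holdings"}
</script>
"""

if __name__ == "__main__":
    for problem in schema_problems(PAGE):
        print(problem)
```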
llms.txt
The emerging /llms.txt convention is a plaintext file that summarizes a site's content and structure for AI agents — a sitemap-equivalent written in prose rather than XML.4 Adoption is uneven and the format is still settling, but the cost of maintaining one is small. A good llms.txt leads with one sentence on what the site is, names the key pages and what each contains, summarizes the underlying framework in prose, and is kept current. A bad one lists URLs with no context, restates a marketing tagline, or has not been touched since deployment. The difference is whether a model reading it comes away able to describe the site accurately — which is the only thing the file is for.
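The good-versus-bad distinction can be partly checked by a linter. The sketch below assumes the layout described at llmstxt.org (an H1 naming the site, a blockquote summary, then lists of markdown links that may carry a description after the link). The function name, the messages, and both sample files are hypothetical, and no linter can judge whether the file is current.

```python
import re

# A list item that starts with a markdown link; anything after the link is its description.
LINK_ITEM = re.compile(r"^\s*[-*]\s*\[([^\]]+)\]\(([^)]+)\)(.*)$")


def lint_llms_txt(text: str) -> list[str]:
    """Return context problems found in an llms.txt file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    problems = []
    if not lines or not lines[0].startswith("# "):
        problems.append("first line should be an H1 naming the site")
    if not any(ln.startswith(">") for ln in lines):
        problems.append("no blockquote summary of what the site is")
    for ln in lines:
        match = LINK_ITEM.match(ln)
        if match and not match.group(3).lstrip(" :").strip():
            problems.append(f"link with no description: {match.group(2)}")
    return problems


# Hypothetical files.
GOOD = """\
# Example Co

> Example Co performs cost segregation studies for commercial property owners.

## Key pages

- [Services](https://example.com/services): What a study covers and who it is for
- [Method](https://example.com/method): How the engineering analysis is done
"""

BAD = """\
Best-in-class solutions for modern teams

- [Home](https://example.com/)
- [Blog](https://example.com/blog)
"""

if __name__ == "__main__":
    print(lint_llms_txt(GOOD))
    for problem in lint_llms_txt(BAD):
        print(problem)
```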
§5 Sitemaps, canonicals, and entity resolution
Two pieces of hygiene close the pillar. A current sitemap.xml with accurate lastmod values, no orphaned URLs, and no entries that 404 or redirect, gives the crawler a reliable map. Canonical tags that point at the intended URL — not at a staging host, not at a parameterized duplicate — ensure the engine attributes the content to the page you want cited.
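A sitemap check along these lines is straightforward to script. The sketch below parses a sitemap with Python's standard `xml.etree.ElementTree`, flags missing or future lastmod values, and flags any entry that does not return 200. The HTTP lookup is passed in as a function so the example runs offline; `FAKE_STATUS` and the URLs are hypothetical. It cannot tell whether a lastmod value is accurate or find orphaned URLs, both of which need a crawl of the site itself.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Callable

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_problems(xml_text: str, status_of: Callable[[str], int], today: date) -> list[str]:
    """Return problems per entry. `status_of` maps a URL to its HTTP status code."""
    problems = []
    root = ET.fromstring(xml_text)
    for entry in root.findall("sm:url", NS):
        loc = (entry.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        lastmod = (entry.findtext("sm:lastmod", default="", namespaces=NS) or "").strip()
        if not lastmod:
            problems.append(f"{loc}: missing lastmod")
        else:
            try:
                # lastmod may carry a time component; the date is the first ten characters.
                if date.fromisoformat(lastmod[:10]) > today:
                    problems.append(f"{loc}: lastmod is in the future")
            except ValueError:
                problems.append(f"{loc}: unparseable lastmod {lastmod!r}")
        status = status_of(loc)
        if status != 200:
            problems.append(f"{loc}: returns {status}")
    return problems


# Hypothetical sitemap and canned responses standing in for real HTTP requests.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-05-01</lastmod></url>
  <url><loc>https://example.com/old</loc></url>
</urlset>
"""
FAKE_STATUS = {"https://example.com/": 200, "https://example.com/old": 301}

if __name__ == "__main__":
    for problem in sitemap_problems(SAMPLE, lambda u: FAKE_STATUS.get(u, 404), date(2026, 5, 24)):
        print(problem)
```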
The subtler boundary is entity resolution: whether the engine can tell which entity a page is about. This is where Retrieval hands off to Trust, and the seam is visible in real captures. In the cost-segregation capture from the Content note, one engine, answering a question about the cost-seg firm "Seneca," cited a tuition page from Seneca Polytechnic — a Canadian college that shares the name.5 The retrieval was mechanically successful; the disambiguation was not. The engine reached a real page about the wrong entity.
Name collisions of that kind are not fixable with crawl hygiene. They are fixed by making the right entity unambiguous to the machine — consistent organization schema, a resolvable entity in the knowledge graphs, sameAs links that tie the on-site entity to its off-site identity. That work belongs to Trust, and the Trust note treats it directly. Retrieval's job is to make the page reachable and readable; it cannot, on its own, make the page the answer rather than a page that happened to match a string.
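For orientation, the organization markup with sameAs links might look like the output of the sketch below. The firm name and every URL are hypothetical placeholders, and the helper only assembles the JSON-LD; deciding which off-site profiles actually identify the entity is the Trust work described above.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Build an Organization JSON-LD string with de-duplicated sameAs links."""
    node = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": sorted(set(same_as)),
    }
    return json.dumps(node, indent=2)


if __name__ == "__main__":
    # Hypothetical entity and profile URLs, for illustration only.
    print(organization_jsonld(
        "Example Cost Segregation LLC",
        "https://example.com/",
        [
            "https://www.linkedin.com/company/example-cost-segregation",
            "https://directory.example.org/firms/example-cost-segregation",
        ],
    ))
```

Emitting the same block, with the same name and the same sameAs set, on every page is what makes the entity consistent from the machine's point of view.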
§6 Necessary, not sufficient
Retrieval is the pillar most likely to be mistaken for the whole job, because it is concrete, checkable, and shippable. A team can audit crawl access, fix the JavaScript-rendering gap, add schema, publish an llms.txt, and clean the sitemap in a sprint — and feel, justifiably, that it has done real work. It has. But a page that passes every Retrieval check is exactly as uncited as before if its Content does not answer the question or its source is not trusted. We have audited perfectly engineered sites — clean schema, server-rendered, current sitemap, well-formed llms.txt — that scored near zero on a given engine because the underlying entity was absent from the knowledge graph that engine depends on.
This is why Retrieval is the right place to begin operationally and the wrong place to stop. It removes the ways to lose before the real contest — Content and Trust — is joined. A company that treats Retrieval as the destination will have the most crawlable invisible site in its category. The value of scoring it as one pillar of three is that a strong Retrieval score is read correctly: as a floor cleared, not a race won.
References & method
1. Internet Engineering Task Force (2022). RFC 9309: Robots Exclusion Protocol. rfc-editor.org/rfc/rfc9309.html. Operator crawler documentation is published separately by each engine and changes without notice.
2. Audit method: robots.txt was fetched once per host on 2026-05-24 and parsed for blocks naming the major AI user-agents (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, and others). "Names any AI crawler" is true where a host has a User-agent block for at least one such agent. Sites that returned a non-robots.txt response to automated fetch (Reddit served a network-policy block page) were excluded rather than recorded as a policy.
3. Whether and when each engine executes JavaScript is documented unevenly and changes over time; the conservative and durable assumption for content that must be cited is that the crawler reads raw HTML. Content present only after client-side execution should be treated as at risk.
4. The llms.txt convention was proposed by Jeremy Howard in September 2024 and is documented at llmstxt.org. Schema.org vocabulary is at schema.org.
5. Observed in a five-engine capture for the prompt "What are the best cost segregation study firms for commercial real estate investors in 2026?", recorded 2026-05-24. One engine's cited set included senecapolytechnic.ca, a college unrelated to the cost-segregation firm "Seneca" the answer was discussing. See the Content note for the full capture.
This note extends §3 of the working primer, which defines the Answerability framework and its three pillars.