How the AI Answerability Diagnostic works.

The full method, published and versioned. A method you cannot inspect is not a method, so a partner, a journalist, or an AI engine can read exactly how a finding was produced, and check it.

Most tools in this category keep their method behind the product. The visibility number is generated by a model the buyer cannot see, weighted by factors the vendor does not disclose. We take the opposite position. The protocol below is fixed, versioned, and human-reviewed; the only thing that changes per engagement is the input — your buyers, your category, your URLs. What follows is the whole of it. In the end it measures one thing: your Retrieval Surface — the slice of the web AI systems can actually reach, parse, and trust about you.

01 The pipeline

Five steps, in order. The first two gather evidence; the third turns it into a score; the last two deliver and verify it.

Step	Stage	Summary
STEP 01	Prompt set	~60 prompts from 4–6 buyer archetypes
STEP 02	Capture	5 engines · 21-day window · 300 observations
STEP 03	Scoring	Content · Retrieval · Trust, 0–100 → Answerability
STEP 04	Report	Written intelligence dossier + work orders
STEP 05	Re-audit	Day 90 · same prompts · delta

02 Step 1 — The prompt set

The unit of measurement is the buyer-intent prompt — the actual sentence a buyer types into an engine, not a keyword. We construct a standing set of roughly sixty, built from four to six buyer archetypes (more for broad consumer audiences), spread across the stages of the decision: awareness, comparison, risk, pricing, fit, and post-purchase.

Archetype construction draws on the engagement's stated ideal customer profile, sales-call language where transcripts are available, and the vocabulary buyers actually use in adjacent communities — vertical forums, trade press, Reddit threads. No keyword tools are involved, because keyword tools report what people type into a search box to get a list, not what they ask a model to get an answer. The two phrasings diverge, and the conversational, situated phrasing is what the engines are answering.

The count is held near sixty deliberately. More prompts would add nominal statistical power but worsen the signal-to-noise ratio: engines vary across runs, small changes in phrasing change outcomes, and re-auditing a long-tail set against a moving target reads as noise rather than learning. Sixty is the count at which there is enough surface to characterize a category without drowning the re-audit signal.
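
As a rough illustration of the shape described above, the sketch below checks a candidate prompt set against the stated constraints: about sixty prompts, at least four archetypes, and every decision stage covered. The `Prompt` type, the `check_prompt_set` helper, and the `tolerance` value standing in for "roughly" are assumptions made for this example, not part of the published protocol.

```python
from collections import Counter
from dataclasses import dataclass

# The six decision stages named in the text.
STAGES = ("awareness", "comparison", "risk", "pricing", "fit", "post-purchase")


@dataclass(frozen=True)
class Prompt:
    text: str       # the sentence a buyer would type into an engine
    archetype: str  # which buyer archetype it was written for
    stage: str      # one of STAGES


def check_prompt_set(prompts, target=60, tolerance=6):
    """Return a list of problems; an empty list means the set fits the shape.

    `target` comes from the text; `tolerance` is an assumed reading of "roughly".
    """
    problems = []
    if abs(len(prompts) - target) > tolerance:
        problems.append(f"{len(prompts)} prompts, expected about {target}")
    archetypes = {p.archetype for p in prompts}
    if len(archetypes) < 4:
        problems.append(f"only {len(archetypes)} archetypes, expected at least 4")
    by_stage = Counter(p.stage for p in prompts)
    for stage in STAGES:
        if by_stage[stage] == 0:
            problems.append(f"no prompts for stage '{stage}'")
    for stage in sorted(set(by_stage) - set(STAGES)):
        problems.append(f"unknown stage '{stage}'")
    return problems


# Placeholder set: 5 archetypes x 6 stages x 2 prompts each = 60 prompts.
example_set = [
    Prompt(f"placeholder question {i}", f"archetype-{a}", stage)
    for a in range(5)
    for stage in STAGES
    for i in range(2)
]
print(check_prompt_set(example_set))  # an empty list: no problems found
```

The even two-per-cell allocation in `example_set` is only a convenient placeholder; nothing in the text requires the prompts to be spread uniformly.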

03 Step 2 — The capture

Each prompt is issued to five engines — ChatGPT, Claude, Gemini, Perplexity, and Grok — through their web-grounded interfaces, within a 21-day window. Sixty prompts across five engines yields three hundred captured answers. For each, we record verbatim: which URLs the engine cited, which competitors it named, whether the engagement sponsor was cited or absent, the full answer text, and any model-disclosed reasoning.
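
One way to picture a captured answer is as a record with the fields listed above, one record per prompt and engine. This is a minimal sketch: the `Capture` class, its field names, and the domain-matching rule in `sponsor_cited` are assumptions for illustration, and the actual storage format is not described in the text.

```python
from dataclasses import dataclass, field
from itertools import product
from urllib.parse import urlparse

ENGINES = ("ChatGPT", "Claude", "Gemini", "Perplexity", "Grok")


@dataclass
class Capture:
    prompt: str
    engine: str
    answer_text: str = ""                                   # full answer, verbatim
    cited_urls: list = field(default_factory=list)          # URLs the engine cited
    competitors_named: list = field(default_factory=list)   # firms named in the answer
    disclosed_reasoning: str = ""                           # only if the model shows it

    def sponsor_cited(self, sponsor_domain: str) -> bool:
        """True if any cited URL is on the sponsor's domain or a subdomain of it."""
        for url in self.cited_urls:
            host = (urlparse(url).hostname or "").lower()
            if host == sponsor_domain or host.endswith("." + sponsor_domain):
                return True
        return False


def empty_grid(prompts):
    """One unfilled Capture per (prompt, engine) pair."""
    return [Capture(prompt=p, engine=e) for p, e in product(prompts, ENGINES)]


grid = empty_grid([f"prompt {i}" for i in range(60)])
print(len(grid))  # 300, matching sixty prompts across five engines
```

Deriving "cited or absent" from the recorded URLs, rather than storing it as a separate flag, keeps the record checkable against the verbatim capture.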

The capture is scored per engine, not pooled, because the engines diverge sharply on the same prompt — and that divergence is the most operationally important pattern in the data. A single per-domain score would smooth over exactly what matters. The exhibit below is one real prompt, captured across the engines, recorded as it came back.

One prompt · web-grounded capture · Recorded 2026-05

What are the best cost segregation study firms for commercial real estate investors in 2026?

Engine	Firms named
Perplexity	KBKG · Engineered Tax Services · CSSI · Seneca Cost Seg
ChatGPT	KBKG · CSSI · CohnReznick · Engineered Tax Services
Claude	KBKG · CSSI · RE Cost Seg · Madison SPECS · McGuire Sponsel
Gemini	KBKG · Engineered Tax Services · CSSI · Duffy+Duffy · Capstan

On every list: KBKG and CSSI. The rest diverge engine to engine.

Real capture, May 2026. Same prompt, four web-grounded engines; two firms named by all four, the rest divergent. This is why scoring is per engine.
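
The overlap in the exhibit can be tallied mechanically. The sketch below uses the four lists exactly as printed above and counts how many engines named each firm; the variable names are illustrative and the tally itself is not part of the published protocol.

```python
from collections import Counter

# Firms named per engine, copied from the exhibit.
lists = {
    "Perplexity": ["KBKG", "Engineered Tax Services", "CSSI", "Seneca Cost Seg"],
    "ChatGPT": ["KBKG", "CSSI", "CohnReznick", "Engineered Tax Services"],
    "Claude": ["KBKG", "CSSI", "RE Cost Seg", "Madison SPECS", "McGuire Sponsel"],
    "Gemini": ["KBKG", "Engineered Tax Services", "CSSI", "Duffy+Duffy", "Capstan"],
}

# Count engines per firm; set() guards against a firm repeated within one answer.
counts = Counter(firm for firms in lists.values() for firm in set(firms))

on_every_list = sorted(f for f, n in counts.items() if n == len(lists))
named_once = sorted(f for f, n in counts.items() if n == 1)

print(on_every_list)    # ['CSSI', 'KBKG']
print(len(named_once))  # 7
print(len(counts))      # 10 distinct firms across the four answers
```

On the lists as printed, two firms appear in all four answers and seven of the ten appear in only one, which is the divergence a pooled score would hide.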

04 Step 3 — The scoring

Every cited URL is scored independently across the three pillars of the Answerability framework, each on a 0–100 scale, calibrated against citation patterns observed in the capture window. The three roll up into a composite Answerability score — which is constrained by the weakest pillar, not the average, because a citation requires all three at once.

Pillar	What it scores	Scale
Content	Whether answer-shaped content exists for the questions buyers actually ask	0–100
Retrieval	Whether engines can access, crawl, and parse that content	0–100
Trust	Whether engines treat the source as cite-worthy (internal evidence + external corroboration)	0–100
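
A minimal sketch of the roll-up, assuming the simplest reading of "constrained by the weakest pillar": the composite equals the lowest of the three pillar scores. The `PillarScores` class, its method names, and the example numbers are hypothetical; the text does not publish the exact formula.

```python
from dataclasses import dataclass

PILLARS = ("content", "retrieval", "trust")


@dataclass(frozen=True)
class PillarScores:
    content: int
    retrieval: int
    trust: int

    def __post_init__(self):
        # Each pillar is scored on the 0-100 scale from the table above.
        for name in PILLARS:
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be in 0..100, got {value}")

    def composite(self) -> int:
        """Assumed reading: the composite is capped at the weakest pillar."""
        return min(self.content, self.retrieval, self.trust)

    def binding_pillar(self) -> str:
        """The pillar that sets the cap (first in PILLARS order on a tie)."""
        return min(PILLARS, key=lambda name: getattr(self, name))

    def headroom(self) -> dict:
        """Points each pillar holds above the composite."""
        cap = self.composite()
        return {name: getattr(self, name) - cap for name in PILLARS}


# Hypothetical page: strong content, crawlable, weak corroboration.
page = PillarScores(content=80, retrieval=75, trust=30)
print(page.composite())       # 30
print(page.binding_pillar())  # trust
print(page.headroom())        # {'content': 50, 'retrieval': 45, 'trust': 0}
```

An average of the same three numbers would come to about 62, which overstates a page that engines read but do not cite; taking the minimum keeps the work order pointed at the binding pillar.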

Worked example · illustrative

One firm, two of its pages. Each clears a different bar and fails a different one — so the composite tracks the weakest pillar, and the work order differs page to page.

Trust binds. A strong, answer-shaped page that every engine can crawl — but with no independent corroboration and a thin entity graph, so engines read it and don't cite it. The Content and Retrieval strength past the dashed line is inert until Trust rises.

Retrieval binds. Authoritative and well-corroborated — but the page is JavaScript-rendered and blocks GPTBot, so most engines never parse it. An engine cannot cite what it cannot fetch.

Illustrative scores, not a client engagement. Two-tone bars show the effective score (solid) against the headroom above the ceiling (faint) that stays inert until the binding pillar rises.
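
The Retrieval case above (a page that blocks GPTBot) is the kind of condition that can be checked directly. The sketch below uses Python's standard `urllib.robotparser` against an inline robots.txt, so it runs without network access. The robots.txt content, the `allowed_agents` helper, and the agent names other than GPTBot are assumptions for this example; crawler names should be confirmed against each vendor's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, url, agents):
    """Map each user agent to whether robots.txt permits it to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/guide",
    ["GPTBot", "ClaudeBot", "PerplexityBot"],
)
print(result)  # {'GPTBot': False, 'ClaudeBot': True, 'PerplexityBot': True}
```

This covers only the robots.txt half of the example. Whether a JavaScript-rendered page yields parseable content to a crawler is a separate check that this sketch does not attempt.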

The rubric extends the information-retrieval evaluation tradition (TREC, 1992–) to LLM-mediated answers. It is designed against a documented failure mode: Ding et al. (Citations and Trust in LLM Generated Responses, AAAI 2025) find that citations raise a reader's trust even when the citations are random, and that trust falls only on verification. We score whether corroboration is independent and primary, not merely abundant — because abundance is the part that can be manufactured.

05 Step 4 — The report

The output is a written intelligence dossier, not a dashboard. It carries the executive finding, the per-engine citation landscape, URL-level scores ranked by expected lift against effort, scoped work orders, and a sequenced 30-day roadmap. At its center is a map of the company's citation territory — the buyer-question clusters where retrieval systems repeatedly surface competitors, return fragmented answers, consolidate around incumbents, or leave territory unexpectedly open. The deliverable and sample pages are shown on the diagnostic pages. A dashboard becomes a tab nobody opens; a document is something a partner group reads, marks up, and acts on.

06 Step 5 — The day-90 re-audit

Ninety days after delivery, the same sixty prompts are re-run against the updated site, and the delta is reported per pillar and per engine. The same set, not a fresh one, is the point: re-running a new prompt set against a moving target measures the noise floor of prompt selection, not the impact of the work. The re-audit is how a claim of movement becomes a measurement of movement.
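
The delta described above can be sketched as a subtraction that refuses to run unless both audits used the same prompts. The `Audit` type, the engine-to-pillar score layout, and every number below are assumptions for illustration; how URL-level scores are aggregated per engine is not specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audit:
    prompts: frozenset  # the prompt texts that were run
    scores: dict        # engine -> {pillar: 0-100 score}; aggregation is assumed


def reaudit_delta(baseline: Audit, day90: Audit) -> dict:
    """Per-engine, per-pillar change between the baseline and the re-audit."""
    if baseline.prompts != day90.prompts:
        raise ValueError("day-90 re-audit must reuse the baseline prompt set")
    delta = {}
    for engine, pillars in baseline.scores.items():
        if engine not in day90.scores:
            continue  # engine unavailable at re-audit, so there is no delta to report
        delta[engine] = {
            pillar: day90.scores[engine][pillar] - score
            for pillar, score in pillars.items()
        }
    return delta


# Hypothetical numbers, not a client engagement.
prompt_set = frozenset({"prompt a", "prompt b"})
before = Audit(prompt_set, {"Claude": {"content": 60, "retrieval": 40, "trust": 30}})
after = Audit(prompt_set, {"Claude": {"content": 65, "retrieval": 70, "trust": 35}})
print(reaudit_delta(before, after))
# {'Claude': {'content': 5, 'retrieval': 30, 'trust': 5}}
```

Raising on a changed prompt set makes the "same set, not a fresh one" rule a precondition of the calculation rather than a convention.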

A method you cannot inspect is not a method. We publish ours so it can be checked.

07 What this method is not

Stating the limits is part of the method. The following are deliberate boundaries, not omissions.

We claim	We do not claim
Observed co-occurrence within a bounded sample	Causation, or any engine's declared ranking weights
A point-in-time reading, compared at day 90	A stable measurement — engines are non-stationary and change frequently
Overlap with technical SEO on the Retrieval pillar	That this replaces SEO, or that SEO replaces this
A defensible artifact and a sequenced roadmap	A guarantee of citation — no honest practice can promise one
Per-engine findings, because divergence matters	That the engines will converge on one retrieval-and-citation logic

08 Why it is versioned

AI engines change, sometimes materially, on short timescales. A method that described their behavior in May 2026 will need revision by the next cycle, and pretending otherwise would be the dishonest move. So the protocol carries a version (this is v2.4) and a date, and is revised in the open as engine behavior shifts. The day-90 re-audit exists for the same reason: the method assumes its own findings have a half-life.

Publishing the method in full is, finally, a Trust position in the sense the framework means it: we are asking buyers and engines to treat this practice as cite-worthy, and the way to earn that is to be inspectable. The page you are reading is the evidence.

Evidence standard. Engine captures shown on this site are real: each prompt is issued once to the major engines through their web-grounded interfaces, and cited sources are recorded verbatim. Grok is currently omitted where its live-search API is unavailable. A single-run capture characterizes behavior within a window; it is not a longitudinal measurement. Engine behavior changes frequently, which is why every engagement includes a day-90 re-audit. Illustrative figures are marked as such.

See the method run on your company.

The same instrument, your category — every cited URL scored across five engines, with work orders and monthly Visibility Intelligence.

Methodology v2.4 · hello@answerability.ai · Confidential engagements under MNDA · Corrections welcome
