How the AI Answerability Diagnostic works.
The full method, published and versioned. A method you cannot inspect is not a method, so a partner, a journalist, or an AI engine can read exactly how a finding was produced and check it.
Most tools in this category keep their method behind the product. The visibility number is generated by a model the buyer cannot see, weighted by factors the vendor does not disclose. We take the opposite position. The protocol below is fixed, versioned, and human-reviewed; the only thing that changes per engagement is the input — your buyers, your category, your URLs. What follows is the whole of it.
01 The pipeline
Five steps, in order. The first two gather evidence; the third turns it into a score; the last two deliver and verify it.
- STEP 01 · Prompt set: ~60 prompts from 4–6 buyer archetypes
- STEP 02 · Capture: 5 engines · 21-day window · 300 observations
- STEP 03 · Scoring: Content · Retrieval · Trust, 0–100 → Answerability
- STEP 04 · Report: Written intelligence dossier + work orders
- STEP 05 · Re-audit: Day 90 · same prompts · delta
02 Step 1: The prompt set
The unit of measurement is the buyer-intent prompt — the actual sentence a buyer types into an engine, not a keyword. We construct a standing set of roughly sixty, built from four to six buyer archetypes (more for broad consumer audiences), spread across the stages of the decision: awareness, comparison, risk, pricing, fit, and post-purchase.
Archetype construction draws on the engagement's stated ideal customer profile, sales-call language where transcripts are available, and the vocabulary buyers actually use in adjacent communities — vertical forums, trade press, Reddit threads. No keyword tools are involved, because keyword tools report what people type into a search box to get a list, not what they ask a model to get an answer. The two phrasings diverge, and the conversational, situated phrasing is what the engines are answering.
The count is held near sixty deliberately. More prompts would add nominal statistical power but worsen the signal-to-noise ratio: engines vary across runs, small changes in phrasing change outcomes, and a re-audit of a long-tail set against a moving target reads as noise rather than learning. Sixty gives enough surface to characterize a category without drowning the re-audit signal.
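A minimal sketch of how a standing prompt set could be represented and checked for coverage. The stage names come from the text; the `Prompt` fields, the archetype labels, and the even spread of two prompts per cell are illustrative assumptions, not the published construction.

```python
from collections import Counter
from dataclasses import dataclass

# The six decision stages named in the text.
STAGES = ("awareness", "comparison", "risk", "pricing", "fit", "post-purchase")


@dataclass(frozen=True)
class Prompt:
    archetype: str  # hypothetical label for a buyer archetype
    stage: str      # one of STAGES
    text: str       # the sentence a buyer would type into an engine


def coverage(prompts):
    """Count prompts per (archetype, stage) cell."""
    return Counter((p.archetype, p.stage) for p in prompts)


def uncovered_cells(prompts, archetypes):
    """Return the (archetype, stage) cells that have no prompt yet."""
    seen = coverage(prompts)
    return [(a, s) for a in archetypes for s in STAGES if (a, s) not in seen]


# Placeholder data: five hypothetical archetypes, two prompts per cell.
archetypes = ["archetype_a", "archetype_b", "archetype_c", "archetype_d", "archetype_e"]
prompts = [
    Prompt(a, s, f"placeholder prompt {i} for {a} at {s}")
    for a in archetypes
    for s in STAGES
    for i in range(2)
]

print(len(prompts))                              # 60
print(uncovered_cells(prompts, archetypes))      # []
```

Five archetypes across six stages at two prompts each happens to land on sixty; a real set would be spread unevenly wherever the buyer's decision is densest.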
03 Step 2: The capture
Each prompt is issued to five engines (ChatGPT, Claude, Gemini, Perplexity, and Grok) through their web-grounded interfaces, within a 21-day window. Sixty prompts across five engines yield three hundred captured answers. For each, we record verbatim: which URLs the engine cited, which competitors it named, whether the engagement sponsor was cited or absent, the full answer text, and any model-disclosed reasoning.
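One way to picture a captured answer is as a record with the fields listed above. This is a sketch only: the class name, field names, and types are assumptions for illustration, and the grid is left empty because the capture itself is done through the engines' interfaces.

```python
from dataclasses import dataclass, field
from itertools import product

# The five engines named in the text.
ENGINES = ("ChatGPT", "Claude", "Gemini", "Perplexity", "Grok")


@dataclass
class Observation:
    prompt_id: int
    engine: str
    cited_urls: list = field(default_factory=list)         # URLs the engine cited
    competitors_named: list = field(default_factory=list)  # competitors it named
    sponsor_cited: bool = False                             # sponsor cited or absent
    answer_text: str = ""                                   # full answer, verbatim
    disclosed_reasoning: str = ""                           # model-disclosed reasoning, if any


def empty_capture_grid(n_prompts):
    """One empty record per (prompt, engine) pair, to be filled during capture."""
    return [Observation(prompt_id=p, engine=e) for p, e in product(range(n_prompts), ENGINES)]


grid = empty_capture_grid(60)
print(len(grid))  # 300
```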
The capture is scored per engine, not pooled, because the engines diverge sharply on the same prompt, and that divergence is the most operationally important pattern in the data. A single pooled score per domain would smooth over exactly what matters. The exhibit below is one real prompt, captured across four of the five engines and recorded as it came back.
| Engine | Firms named |
|---|---|
| Perplexity | KBKG · Engineered Tax Services · CSSI · Seneca Cost Seg |
| ChatGPT | KBKG · CSSI · CohnReznick · Engineered Tax Services |
| Claude | KBKG · CSSI · RE Cost Seg · Madison SPECS · McGuire Sponsel |
| Gemini | KBKG · Engineered Tax Services · CSSI · Duffy+Duffy · Capstan |
On every list: KBKG, named first each time, and CSSI. The rest diverge engine to engine.
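The overlap in the exhibit can be checked mechanically. The sketch below uses the four lists exactly as recorded above; the variable names are illustrative.

```python
from collections import Counter

# Firms named per engine, in the order shown in the exhibit.
exhibit = {
    "Perplexity": ["KBKG", "Engineered Tax Services", "CSSI", "Seneca Cost Seg"],
    "ChatGPT": ["KBKG", "CSSI", "CohnReznick", "Engineered Tax Services"],
    "Claude": ["KBKG", "CSSI", "RE Cost Seg", "Madison SPECS", "McGuire Sponsel"],
    "Gemini": ["KBKG", "Engineered Tax Services", "CSSI", "Duffy+Duffy", "Capstan"],
}

# Firms that appear on every engine's list.
on_every_list = set.intersection(*(set(firms) for firms in exhibit.values()))

# How many engines named each firm.
appearances = Counter(firm for firms in exhibit.values() for firm in firms)
named_by_one_engine = sorted(f for f, n in appearances.items() if n == 1)

# Firms that were named first by every engine.
named_first_everywhere = {firms[0] for firms in exhibit.values()}

print(sorted(on_every_list))       # ['CSSI', 'KBKG']
print(named_first_everywhere)      # {'KBKG'}
print(len(named_by_one_engine))    # 7
```

Two firms are common to all four lists, one is named by three engines, and seven are named by a single engine each, which is the divergence a pooled score would hide.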
04 Step 3: The scoring
Every cited URL is scored independently across the three pillars of the Answerability framework, each on a 0–100 scale, calibrated against citation patterns observed in the capture window. The three roll up into a composite Answerability score — which is constrained by the weakest pillar, not the average, because a citation requires all three at once.
| Pillar | What it scores | Scale |
|---|---|---|
| Content | Whether answer-shaped content exists for the questions buyers actually ask | 0–100 |
| Retrieval | Whether engines can access, crawl, and parse that content | 0–100 |
| Trust | Whether engines treat the source as cite-worthy (internal evidence + external corroboration) | 0–100 |
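A minimal sketch of a composite that is constrained by the weakest pillar. Taking the plain minimum is an assumption made here for illustration; the text says only that the composite is bounded by the weakest pillar rather than the average, and the scores below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PillarScores:
    content: float
    retrieval: float
    trust: float

    def __post_init__(self):
        # Each pillar is scored on a 0-100 scale.
        for name, value in vars(self).items():
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100, got {value}")


def answerability(scores):
    """Composite under the min assumption: the weakest pillar sets the ceiling."""
    return min(scores.content, scores.retrieval, scores.trust)


def binding_pillar(scores):
    """Name of the pillar that currently limits the composite."""
    pillars = {"Content": scores.content, "Retrieval": scores.retrieval, "Trust": scores.trust}
    return min(pillars, key=pillars.get)


# Two hypothetical pages with different weak points.
page_a = PillarScores(content=85, retrieval=90, trust=30)
page_b = PillarScores(content=80, retrieval=25, trust=88)

print(answerability(page_a), binding_pillar(page_a))  # 30 Trust
print(answerability(page_b), binding_pillar(page_b))  # 25 Retrieval
```

An average would put both pages in the mid-sixties and suggest they need similar work; the minimum shows that each needs a different fix.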
One firm, two of its pages. Each clears a different bar and fails a different one — so the composite tracks the weakest pillar, and the work order differs page to page.
Trust binds. A strong, answer-shaped page that every engine can crawl — but with no independent corroboration and a thin entity graph, so engines read it and don't cite it. The Content and Retrieval strength past the dashed line is inert until Trust rises.
Retrieval binds. Authoritative and well-corroborated — but the page is JavaScript-rendered and blocks GPTBot, so most engines never parse it. An engine cannot cite what it cannot fetch.
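The crawler-blocking half of that failure can be tested with Python's standard `urllib.robotparser`. This is a sketch, not the diagnostic's own Retrieval check: the robots.txt content and the URL are made up for illustration, and the check says nothing about JavaScript rendering, which needs a separate fetch-and-parse test.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks GPTBot and allows everyone else.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

PAGE = "https://example.com/guide"  # placeholder URL

print(crawler_allowed(ROBOTS, "GPTBot", PAGE))        # False
print(crawler_allowed(ROBOTS, "SomeOtherBot", PAGE))  # True
```

A page that returns `False` for an engine's crawler cannot be cited by that engine through its own index, however strong its Content and Trust scores are.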
The rubric extends the information-retrieval evaluation tradition (TREC, 1992–) to LLM-mediated answers. It is designed against a documented failure mode: Ding et al. (Citations and Trust in LLM Generated Responses, AAAI 2025) find that citations raise a reader's trust even when the citations are random, and that trust falls only on verification. We score whether corroboration is independent and primary, not merely abundant — because abundance is the part that can be manufactured.
05 Step 4: The report
The output is a written intelligence dossier, not a dashboard. It carries the executive finding, the per-engine citation landscape, URL-level scores ranked by expected lift against effort, scoped work orders, and a sequenced 30-day roadmap. The deliverable and sample pages are shown on the diagnostic pages. A dashboard becomes a tab nobody opens; a document is something a partner group reads, marks up, and acts on.
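Ranking by expected lift against effort can be sketched as a sort on a lift-per-effort ratio. The ratio, the field names, the units, and the numbers are all assumptions for illustration; the text does not state how lift and effort are estimated or combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    url: str
    expected_lift: float  # hypothetical: estimated gain in composite points
    effort_days: float    # hypothetical: estimated effort in working days

    def __post_init__(self):
        if self.effort_days <= 0:
            raise ValueError("effort_days must be positive")


def rank_by_lift_per_effort(items):
    """Order work items so the highest lift per unit of effort comes first."""
    return sorted(items, key=lambda item: item.expected_lift / item.effort_days, reverse=True)


# Placeholder work items.
items = [
    WorkItem("/guide", expected_lift=30, effort_days=10),   # ratio 3.0
    WorkItem("/pricing", expected_lift=20, effort_days=2),  # ratio 10.0
    WorkItem("/about", expected_lift=5, effort_days=1),     # ratio 5.0
]

for item in rank_by_lift_per_effort(items):
    print(item.url)
# /pricing
# /about
# /guide
```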
06 Step 5: The day-90 re-audit
Ninety days after delivery, the same sixty prompts are re-run against the updated site, and the delta is reported per pillar and per engine. The same set, not a fresh one, is the point: re-running a new prompt set against a moving target measures the noise floor of prompt selection, not the impact of the work. The re-audit is how a claim of movement becomes a measurement of movement.
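A sketch of the delta computation, with the same-prompt-set rule enforced before any number is reported. The run structure (a dict with `prompts` and per-engine, per-pillar `scores`) and every value in it are hypothetical.

```python
def reaudit_delta(baseline, reaudit):
    """Per-(engine, pillar) score change between two runs of the same prompt set."""
    if set(baseline["prompts"]) != set(reaudit["prompts"]):
        raise ValueError("re-audit must reuse the baseline prompt set")
    if baseline["scores"].keys() != reaudit["scores"].keys():
        raise ValueError("runs must cover the same engines and pillars")
    return {
        key: reaudit["scores"][key] - baseline["scores"][key]
        for key in baseline["scores"]
    }


# Placeholder runs: two prompts, two engines, made-up scores.
day_0 = {
    "prompts": ["p01", "p02"],
    "scores": {("ChatGPT", "Trust"): 30, ("ChatGPT", "Retrieval"): 70,
               ("Gemini", "Trust"): 35, ("Gemini", "Retrieval"): 40},
}
day_90 = {
    "prompts": ["p01", "p02"],
    "scores": {("ChatGPT", "Trust"): 45, ("ChatGPT", "Retrieval"): 70,
               ("Gemini", "Trust"): 33, ("Gemini", "Retrieval"): 65},
}

delta = reaudit_delta(day_0, day_90)
print(delta[("ChatGPT", "Trust")])     # 15
print(delta[("Gemini", "Retrieval")])  # 25
print(delta[("Gemini", "Trust")])      # -2
```

Refusing to compute a delta across different prompt sets keeps prompt-selection noise out of the reported movement.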
07 What this method is not
Stating the limits is part of the method. The following are deliberate boundaries, not omissions.
| We claim | We do not claim |
|---|---|
| Observed co-occurrence within a bounded sample | Causation, or any engine's declared ranking weights |
| A point-in-time reading, compared at day 90 | A stable measurement — engines are non-stationary and change frequently |
| Overlap with technical SEO on the Retrieval pillar | That this replaces SEO, or that SEO replaces this |
| A defensible artifact and a sequenced roadmap | A guarantee of citation — no honest practice can promise one |
| Per-engine findings, because divergence matters | That the engines will converge on one retrieval-and-citation logic |
08 Why it is versioned
AI engines change, sometimes materially, on short timescales. A method that described their behavior in May 2026 will need revision by the next cycle, and pretending otherwise would be the dishonest move. So the protocol carries a version (this is v2.4) and a date, and is revised in the open as engine behavior shifts. The day-90 re-audit exists for the same reason: the method assumes its own findings have a half-life.
Publishing the method in full is, finally, a Trust position in the sense the framework means it: we are asking buyers and engines to treat this practice as cite-worthy, and the way to earn that is to be inspectable. The page you are reading is the evidence.