Methodology · the standing protocol v2.4

How the AI Answerability Diagnostic works.

The full method, published and versioned. A method you cannot inspect is not a method, so a partner, a journalist, or an AI engine can read exactly how a finding was produced, and check it.

Most tools in this category keep their method behind the product. The visibility number is generated by a model the buyer cannot see, weighted by factors the vendor does not disclose. We take the opposite position. The protocol below is fixed, versioned, and human-reviewed; the only thing that changes per engagement is the input — your buyers, your category, your URLs. What follows is the whole of it.

01 The pipeline

Five steps, in order. The first two gather evidence; the third turns it into a score; the last two deliver and verify it.

  1. STEP 01

    Prompt set

    ~60 prompts from 4–6 buyer archetypes

  2. STEP 02

    Capture

    5 engines · 21-day window · 300 observations

  3. STEP 03

    Scoring

    Content · Retrieval · Trust, 0–100 → Answerability

  4. STEP 04

    Report

    Written intelligence dossier + work orders

  5. STEP 05

    Re-audit

    Day 90 · same prompts · delta

02 Step 1: The prompt set

The unit of measurement is the buyer-intent prompt — the actual sentence a buyer types into an engine, not a keyword. We construct a standing set of roughly sixty, built from four to six buyer archetypes (more for broad consumer audiences), spread across the stages of the decision: awareness, comparison, risk, pricing, fit, and post-purchase.

Archetype construction draws on the engagement's stated ideal customer profile, sales-call language where transcripts are available, and the vocabulary buyers actually use in adjacent communities — vertical forums, trade press, Reddit threads. No keyword tools are involved, because keyword tools report what people type into a search box to get a list, not what they ask a model to get an answer. The two phrasings diverge, and the conversational, situated phrasing is what the engines are answering.

The count is held near sixty deliberately. More prompts would add statistical power, but at the cost of a worse signal-to-noise ratio: engines vary across runs, phrasing produces outcome variance, and re-auditing a long-tail set against a moving target reads as noise rather than learning. Sixty is the count at which there is enough surface to characterize a category without drowning the re-audit signal.
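As a rough illustration of how such a set might be organized, the sketch below models each prompt as an (archetype, stage, text) record and checks that every archetype has at least one prompt at every decision stage. The archetype labels, the placeholder prompt text, and the even two-per-cell spread are assumptions made for the example; the protocol fixes only the approximate count and the stages.

```python
from collections import Counter
from dataclasses import dataclass

# Decision stages named in the text above.
STAGES = ("awareness", "comparison", "risk", "pricing", "fit", "post-purchase")


@dataclass(frozen=True)
class Prompt:
    archetype: str  # hypothetical archetype label
    stage: str      # one of STAGES
    text: str       # the sentence a buyer would type into an engine


def coverage_gaps(prompts, archetypes):
    """Return the (archetype, stage) pairs that have no prompt yet."""
    seen = Counter((p.archetype, p.stage) for p in prompts)
    return [(a, s) for a in archetypes for s in STAGES if seen[(a, s)] == 0]


# Placeholder set: 5 archetypes x 6 stages x 2 phrasings = 60 prompts.
archetypes = [f"archetype-{i}" for i in range(1, 6)]
prompts = [
    Prompt(a, s, f"placeholder prompt {n} for {a} at {s}")
    for a in archetypes
    for s in STAGES
    for n in (1, 2)
]

print(len(prompts))                        # 60
print(coverage_gaps(prompts, archetypes))  # []
```

A real set would not need to be this uniform; the check is only meant to surface stages that an archetype never reaches.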

03 Step 2: The capture

Each prompt is issued to five engines — ChatGPT, Claude, Gemini, Perplexity, and Grok — through their web-grounded interfaces, within a 21-day window. Sixty prompts across five engines yields three hundred captured answers. For each, we record verbatim: which URLs the engine cited, which competitors it named, whether the engagement sponsor was cited or absent, the full answer text, and any model-disclosed reasoning.
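One way to picture a single captured answer is as a record with the fields listed above. The sketch below is an assumption about shape only: the class name, field names, and the empty-grid helper are hypothetical and are not taken from the protocol's tooling. It also reproduces the arithmetic of sixty prompts across five engines.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Optional

ENGINES = ("ChatGPT", "Claude", "Gemini", "Perplexity", "Grok")


@dataclass
class Capture:
    """One engine's answer to one prompt (hypothetical field names)."""
    prompt_id: int
    engine: str
    answer_text: str = ""
    cited_urls: List[str] = field(default_factory=list)
    competitors_named: List[str] = field(default_factory=list)
    sponsor_cited: bool = False
    disclosed_reasoning: Optional[str] = None


def empty_capture_grid(n_prompts: int) -> List[Capture]:
    """One empty record per (prompt, engine) pair, to be filled during the window."""
    return [
        Capture(prompt_id=p, engine=e)
        for p, e in product(range(n_prompts), ENGINES)
    ]


grid = empty_capture_grid(60)
print(len(grid))  # 300
```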

The capture is scored per engine, not pooled, because the engines diverge sharply on the same prompt — and that divergence is the most operationally important pattern in the data. A single per-domain score would smooth over exactly what matters. The exhibit below is one real prompt, captured across the engines, recorded as it came back.

One prompt · web-grounded capture · Recorded 2026-05
What are the best cost segregation study firms for commercial real estate investors in 2026?

Perplexity

KBKG · Engineered Tax Services · CSSI · Seneca Cost Seg

ChatGPT

KBKG · CSSI · CohnReznick · Engineered Tax Services

Claude

KBKG · CSSI · RE Cost Seg · Madison SPECS · McGuire Sponsel

Gemini

KBKG · Engineered Tax Services · CSSI · Duffy+Duffy · Capstan

On every list

KBKG and CSSI; the rest diverge engine to engine.

Real capture, May 2026. Same prompt, four web-grounded engines; two firms named by all four (KBKG first on each list), the rest divergent. This is why scoring is per engine.
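The overlap in this exhibit can be tallied directly from the four lists as recorded. The sketch below is only a counting aid, not part of the protocol: it counts how many engines named each firm and separates the firms named by all four from those named by exactly one.

```python
from collections import Counter

# Firm lists exactly as shown in the exhibit above.
captures = {
    "Perplexity": ["KBKG", "Engineered Tax Services", "CSSI", "Seneca Cost Seg"],
    "ChatGPT": ["KBKG", "CSSI", "CohnReznick", "Engineered Tax Services"],
    "Claude": ["KBKG", "CSSI", "RE Cost Seg", "Madison SPECS", "McGuire Sponsel"],
    "Gemini": ["KBKG", "Engineered Tax Services", "CSSI", "Duffy+Duffy", "Capstan"],
}

# Number of engines that named each firm.
mentions = Counter(firm for firms in captures.values() for firm in firms)

on_every_list = sorted(f for f, n in mentions.items() if n == len(captures))
single_engine = sorted(f for f, n in mentions.items() if n == 1)

print(on_every_list)       # ['CSSI', 'KBKG']
print(len(single_engine))  # 7
```

On these lists, two firms appear in every answer, one appears in three of four, and seven appear in exactly one, which is the divergence a pooled score would hide.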

04 Step 3: The scoring

Every cited URL is scored independently across the three pillars of the Answerability framework, each on a 0–100 scale, calibrated against citation patterns observed in the capture window. The three roll up into a composite Answerability score — which is constrained by the weakest pillar, not the average, because a citation requires all three at once.

| Pillar | What it scores | Scale |
| --- | --- | --- |
| Content | Whether answer-shaped content exists for the questions buyers actually ask | 0–100 |
| Retrieval | Whether engines can access, crawl, and parse that content | 0–100 |
| Trust | Whether engines treat the source as cite-worthy (internal evidence + external corroboration) | 0–100 |

Worked example · illustrative

One firm, two of its pages. Each clears a different bar and fails a different one — so the composite tracks the weakest pillar, and the work order differs page to page.

Figure: /services/cost-segregation. An illustrative URL scored 92 Content, 86 Retrieval, 41 Trust; Trust is the binding constraint and the composite Answerability score is 41.

Trust binds. A strong, answer-shaped page that every engine can crawl — but with no independent corroboration and a thin entity graph, so engines read it and don't cite it. The Content and Retrieval strength past the dashed line is inert until Trust rises.

Figure: /insights/bonus-depreciation-2026. An illustrative URL scored 78 Content, 44 Retrieval, 80 Trust; Retrieval is the binding constraint and the composite Answerability score is 44.

Retrieval binds. Authoritative and well-corroborated — but the page is JavaScript-rendered and blocks GPTBot, so most engines never parse it. An engine cannot cite what it cannot fetch.
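One narrow piece of a Retrieval check, whether a site's robots.txt permits a given crawler, can be sketched with Python's standard-library parser. The robots.txt content and the example.com host below are made up for illustration, and the check says nothing about JavaScript rendering, which needs a separate test.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks GPTBot and allows everyone else.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

URL = "https://example.com/insights/bonus-depreciation-2026"

print(crawler_allowed(ROBOTS, "GPTBot", URL))        # False
print(crawler_allowed(ROBOTS, "SomeOtherBot", URL))  # True
```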

Illustrative scores, not a client engagement. Two-tone bars show the effective score (solid) against the headroom above the ceiling (faint) that stays inert until the binding pillar rises.
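The weakest-pillar rule can be written down in a few lines. The sketch below assumes the composite is simply the minimum of the three pillar scores, which matches both illustrative pages above; the class and method names are hypothetical, and the calibration that produces the pillar scores themselves is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PillarScores:
    content: int
    retrieval: int
    trust: int

    def composite(self) -> int:
        """Answerability as the weakest pillar (assumed: a plain minimum)."""
        return min(self.content, self.retrieval, self.trust)

    def binding_pillar(self) -> str:
        """Name of the pillar that sets the ceiling (first one wins on a tie)."""
        scores = {
            "Content": self.content,
            "Retrieval": self.retrieval,
            "Trust": self.trust,
        }
        return min(scores, key=scores.get)


# The two illustrative pages from the worked example.
pages = {
    "/services/cost-segregation": PillarScores(content=92, retrieval=86, trust=41),
    "/insights/bonus-depreciation-2026": PillarScores(content=78, retrieval=44, trust=80),
}

for url, s in pages.items():
    print(url, s.composite(), s.binding_pillar())
# /services/cost-segregation 41 Trust
# /insights/bonus-depreciation-2026 44 Retrieval
```

For comparison, a plain average would put the first page at 73, which is the reading the weakest-pillar rule is designed to avoid.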

The rubric extends the information-retrieval evaluation tradition (TREC, 1992–) to LLM-mediated answers. It is designed against a documented failure mode: Ding et al. (Citations and Trust in LLM Generated Responses, AAAI 2025) find that citations raise a reader's trust even when the citations are random, and that trust falls only on verification. We score whether corroboration is independent and primary, not merely abundant — because abundance is the part that can be manufactured.

05 Step 4: The report

The output is a written intelligence dossier, not a dashboard. It carries the executive finding, the per-engine citation landscape, URL-level scores ranked by expected lift against effort, scoped work orders, and a sequenced 30-day roadmap. The deliverable and sample pages are shown on the diagnostic pages. A dashboard becomes a tab nobody opens; a document is something a partner group reads, marks up, and acts on.
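The phrase "expected lift against effort" suggests a simple ordering, sketched below as lift divided by effort. This is an assumption about the ranking, not the report's actual formula. The lift figures are the headroom between the binding pillar and the next-lowest pillar in the illustrative pages from Step 3; the effort estimates are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkOrder:
    url: str
    expected_lift: float  # hypothetical: composite points gained if the binding pillar is fixed
    effort_days: float    # hypothetical: estimated effort, must be > 0


def rank_work_orders(orders):
    """Highest lift per unit of effort first."""
    return sorted(orders, key=lambda o: o.expected_lift / o.effort_days, reverse=True)


orders = [
    # Trust binds at 41; next pillar is 86. Corroboration work assumed to be slow.
    WorkOrder("/services/cost-segregation", expected_lift=45, effort_days=30),
    # Retrieval binds at 44; next pillar is 78. Crawl fix assumed to be quick.
    WorkOrder("/insights/bonus-depreciation-2026", expected_lift=34, effort_days=3),
]

for o in rank_work_orders(orders):
    print(o.url, round(o.expected_lift / o.effort_days, 2))
# /insights/bonus-depreciation-2026 11.33
# /services/cost-segregation 1.5
```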

06 Step 5: The day-90 re-audit

Ninety days after delivery, the same sixty prompts are re-run against the updated site, and the delta is reported per pillar and per engine. The same set, not a fresh one, is the point: re-running a new prompt set against a moving target measures the noise floor of prompt selection, not the impact of the work. The re-audit is how a claim of movement becomes a measurement of movement.
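A minimal sketch of the per-engine delta follows, assuming the quantity compared is the share of prompts on which the sponsor was cited. That metric, the function name, and the four-prompt sample data are assumptions for illustration; the per-pillar delta would be computed analogously on scores. The function refuses to compare runs whose prompt sets differ, which is the point of re-running the same set.

```python
def delta_by_engine(baseline, reaudit):
    """Change in sponsor-citation rate per engine between two runs.

    Each argument maps engine -> {prompt_id: sponsor_cited}.
    """
    deltas = {}
    for engine, before in baseline.items():
        after = reaudit[engine]
        if before.keys() != after.keys():
            raise ValueError(f"prompt sets differ for {engine}; delta would not be comparable")
        n = len(before)
        deltas[engine] = (sum(after.values()) - sum(before.values())) / n
    return deltas


# Hypothetical four-prompt sample for two engines.
baseline = {
    "ChatGPT": {1: False, 2: False, 3: True, 4: False},
    "Perplexity": {1: True, 2: False, 3: False, 4: False},
}
day_90 = {
    "ChatGPT": {1: True, 2: False, 3: True, 4: False},
    "Perplexity": {1: True, 2: False, 3: False, 4: False},
}

print(delta_by_engine(baseline, day_90))
# {'ChatGPT': 0.25, 'Perplexity': 0.0}
```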

A method you cannot inspect is not a method. We publish ours so it can be checked.

07 What this method is not

Stating the limits is part of the method. The following are deliberate boundaries, not omissions.

| We claim | We do not claim |
| --- | --- |
| Observed co-occurrence within a bounded sample | Causation, or any engine's declared ranking weights |
| A point-in-time reading, compared at day 90 | A stable measurement; engines are non-stationary and change frequently |
| Overlap with technical SEO on the Retrieval pillar | That this replaces SEO, or that SEO replaces this |
| A defensible artifact and a sequenced roadmap | A guarantee of citation; no honest practice can promise one |
| Per-engine findings, because divergence matters | That the engines will converge on one retrieval-and-citation logic |

08 Why it is versioned

AI engines change, sometimes materially, on short timescales. A method that described their behavior in May 2026 will need revision by the next cycle, and pretending otherwise would be the dishonest move. So the protocol carries a version (this is v2.4) and a date, and is revised in the open as engine behavior shifts. The day-90 re-audit exists for the same reason: the method assumes its own findings have a half-life.

Publishing the method in full is, finally, a Trust position in the sense the framework means it: we are asking buyers and engines to treat this practice as cite-worthy, and the way to earn that is to be inspectable. The page you are reading is the evidence.

Evidence standard · Engine captures shown on this site are real: each prompt is issued once to the major engines through their web-grounded interfaces, and cited sources are recorded verbatim. Grok is currently omitted where its live-search API is unavailable. A single-run capture characterizes behavior within a window; it is not a longitudinal measurement. Engine behavior changes frequently, which is why every engagement includes a day-90 re-audit. Illustrative figures are marked as such.

Methodology v2.4 · Confidential engagements under MNDA · Corrections welcome