On this page
- Abstract
- 1. Background
- 2. Methodology
- 3. Findings
- 3.1 Two signals graduate
- 3.2 Effect sizes
- 3.3 Claude vs GPT reverse
- 3.4 Cross-scanner divergence
- 3.5 Ecosystem nascency
- 3.6 Bot-protection
- 3.7 Respectarium score
- 4. Caveats and limitations
- 5. Conflict of interest
- 6. Reproducibility
- 7. Acknowledgments
- 8. Citation and license
Abstract
We present the first cross-sectional study pairing agent-readiness signals with LLM visibility outcomes. Three independent scanners (Respectarium, Cloudflare, and Fern) produced 72 agent-readiness predictors — 66 individual scanner checks plus 5 aggregate metrics and 1 derived signal — measured against a sample of 908 brands across 50 B2B SaaS categories. Outcomes were captured from Respectarium's leaderboards across three large language models (Claude, GPT, and Gemini).
Under analytical thresholds pre-registered before any results were viewed, 2 of 50 evaluated predictors graduate to scored-signal status: cloudflare.level (Cloudflare's aggregate readiness level) and respectarium.markdown-negotiation (HTTP Accept: text/markdown content-negotiation support). Of the 72 predictors fed in, 22 were excluded before evaluation (20 by the variance filter at <5% adoption, plus 2 by additional pre-registered data-quality filters), leaving 50 evaluated against five LLM-visibility outcomes.
All effect sizes are small to medium (Cohen's d ≤ 0.65). No "large" effects (d > 0.8) appear anywhere in the study. The most striking structural finding is that Claude and GPT reverse direction on four signals simultaneously — the same agent-readiness check predicts higher Claude listing and lower GPT listing (or vice versa), at FDR-significant levels in both directions.
Twenty of the 66 per-check signals have <5% real-world adoption in the brand universe, indicating that the agent-readiness specification ecosystem is in its infancy as of 2026-04. Cross-scanner divergence on shared check names (ρ ≈ 0.03 on three pairs of same-named checks) is documented as a primary finding rather than a footnote.
We publish the methodology, code, and dataset for reproducibility. We acknowledge a conflict of interest — Respectarium operates one of the three scanners evaluated — and report findings unfavorable to the Respectarium scanner transparently. Longitudinal re-measurement is needed to test causal hypotheses.
Analysis funnel (pre-registered analytical thresholds, applied mechanically across 11 statistical scripts):
- Input: 66 per-check + 5 aggregate + 1 derived (bot_protected) predictors = 72 total
- Variance filter: 20 checks excluded because <5% of brands deviate from the modal status (ecosystem nascency)
- Further pre-registered data-quality filters before evaluation: 2 predictors excluded
- Evaluated: 50 predictors against 5 outcomes (250 tests)
- DROP: failed univariate AND multivariate AND no subgroup signal
- KEEP_INFORMATIONAL: passed some criteria but not all four for promotion
- PROMOTE_SCORED: passed all four pre-registered criteria: cloudflare.level and respectarium.markdown-negotiation
Source: results/10-verdicts.json + results/00-data-quality.json (study-2026-04).
1. Background
1.1 What is agent-readiness?
"Agent-readiness" describes how accessible a website is to AI agents and automated HTTP clients. It encompasses HTTP-protocol practices (robots.txt with AI-bot rules, content-negotiation for markdown, OAuth discovery metadata, MCP server cards) and content-shape practices (llms.txt files, page-size limits, redirect-behavior cleanliness, link headers).
The space is fragmented as of 2026-04: at least three publicly available scanner systems implement different methodologies for measuring agent-readiness, with check definitions that often share names but diverge substantively in operationalization.
1.2 Why this study?
A common claim in the agent-readiness discourse is that improving these signals will improve LLM visibility — that a site becomes more discoverable and rank-able by AI assistants when it adopts agent-readiness practices. This claim has, until now, lacked cross-sectional empirical evidence at scale.
This study provides the first such evidence. We measure 72 agent-readiness predictors (66 individual scanner checks plus 5 aggregate metrics and 1 derived signal) from three scanners against five LLM-visibility outcomes on a sample large enough to support both univariate and multivariate analyses with proper multiple-testing corrections.
1.3 The three scanners
We use three independent agent-readiness scanners:
- Respectarium scanner — a closed-source implementation of the Agent-Adoption Specification V1, maintained by Respectarium.
- Cloudflare — Cloudflare's public isitagentready.com API.
- Fern — Fern's open-source afdocs scanner.
Each scanner outputs per-check results in its native enum (pass / fail / neutral for Respectarium and Cloudflare; pass / warn / skip / fail / error for Fern). We do not artificially unify the enums; instead, each is encoded numerically and analyzed in its native shape.
1.4 The Agent-Adoption Specification
The Respectarium scanner implements the Agent-Adoption Specification V1, an open methodology document. Anyone may build an additional implementation against the same specification; cross-implementation comparison is itself a research activity.
2. Methodology
2.1 Pre-registered analytical thresholds
The promotion / drop / informational rules below were committed in writing on 2026-04-24, two days before any results were viewed (analysis began 2026-04-26). The thresholds are applied mechanically by the analysis script — they are not retrofitted to data. The verbatim pre-registered thresholds are preserved as methodology.md §1 of the study repository, with the 2026-04-24 commit date verifiable from git history (the commit predates the 2026-04-26 analysis runs).
A signal graduates to PROMOTE_SCORED status when ALL of:
- FDR-adjusted p < 0.05 (Benjamini-Hochberg) in univariate Spearman correlation against the predictor's best outcome
- 95% confidence interval on the multivariate β coefficient excludes zero (OLS with eligible-category fixed effects)
- Direction-consistent in at least 2 of 3 LLMs in binary "listed by this LLM (1) vs not (0)" analysis
- Not in a redundancy cluster with another signal that has higher mean |ρ| (clustering at |ρ| ≥ 0.8)
A signal drops entirely when ALL of:
- FDR-adjusted univariate p > 0.10 AND
- Multivariate β 95% CI spans zero AND
- No subgroup (per-LLM, per-category) shows direction-consistent effect
All other signals are KEEP_INFORMATIONAL.
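As a concrete illustration, the rules above reduce to a small mechanical function. The sketch below is illustrative; the field names and shapes are assumptions, not the actual interface of scripts/10-verdicts.ts:

```ts
type Verdict = "PROMOTE_SCORED" | "DROP" | "KEEP_INFORMATIONAL";

// Hypothetical per-signal statistics; the real pipeline's shapes may differ.
interface SignalStats {
  fdrP: number;                    // FDR-adjusted univariate p, best outcome
  betaCiExcludesZero: boolean;     // multivariate beta 95% CI excludes zero
  directionConsistentLlms: number; // LLMs (of 3) direction-consistent in binary analysis
  inStrongerRedundancyCluster: boolean; // clustered at |rho| >= 0.8 with a stronger signal
  anySubgroupSignal: boolean;      // any per-LLM / per-category direction-consistent effect
}

function verdict(s: SignalStats): Verdict {
  const promote =
    s.fdrP < 0.05 &&
    s.betaCiExcludesZero &&
    s.directionConsistentLlms >= 2 &&
    !s.inStrongerRedundancyCluster;
  if (promote) return "PROMOTE_SCORED";

  const drop =
    s.fdrP > 0.10 &&
    !s.betaCiExcludesZero &&
    !s.anySubgroupSignal;
  if (drop) return "DROP";

  return "KEEP_INFORMATIONAL";
}
```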
2.2 Data sources and acquisition
Brand universe construction
The 908-brand universe was constructed by querying three large language models — Claude, GPT, and Gemini — across 50 B2B SaaS categories. For each category, each LLM was asked, in substance, "What are the top 10–20 brands or products in {category}?" (exact prompt phrasing varied slightly per LLM; the intent in all cases was a ranked list of category-leading brands as the LLM understood them).
The brands appearing in any LLM's response for any category, deduplicated by domain, form the brand universe. Capture window: leaderboard data anchored to 2026-04-19; full corpus frozen for analysis on 2026-04-25.
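A minimal sketch of the domain-keyed deduplication, with illustrative field names (not the study's actual ingestion code):

```ts
// Illustrative shape of one LLM mention; the study's real schema may differ.
interface Mention {
  brand: string;
  domain: string; // primary domain, used as the dedup / join key
  llm: "claude" | "gpt" | "gemini";
  category: string;
}

// Deduplicate mentions by domain to form the brand universe.
function buildBrandUniverse(mentions: Mention[]): Map<string, Mention[]> {
  const byDomain = new Map<string, Mention[]>();
  for (const m of mentions) {
    const key = m.domain.toLowerCase();
    const bucket = byDomain.get(key);
    if (bucket) bucket.push(m);
    else byDomain.set(key, [m]);
  }
  return byDomain;
}
```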
This sampling design has an important consequence — the selection effect documented as a methodological caveat below.
Outcome variables
The leaderboard data yields five outcome variables per brand, spanning per-LLM rank positions (e.g., claudeRank), the average rank across LLMs (avgRank), and claiScore.
Note that not every brand is listed by every LLM. A brand is in the universe if any of the three LLMs listed it; per-LLM coverage varies because the three LLMs' implicit selection functions differ.
Predictor data — three scanners
For predictor data, three independent agent-readiness scanners ran against each of the 908 brands' primary domains.
Scanner sweep methodology
The three scanners ran via a parallel orchestrator over the period 2026-04-25 / 2026-04-26. Each scanner runs in its own concurrency lane with rate-limiting and retry policy tuned to the scanner's transport:
- Cloudflare — concurrency 1, ~6-second gap between requests (compliant with the public API's announced rate limit)
- Fern — concurrency 4 (local CLI, CPU-bound; no upstream rate limit)
- Respectarium — concurrency 2, ~1-second gap (compliant with the scanner's backend capacity)
Per-domain scan results are written atomically as JSON to data/scans/<runId>/<scanner>/<domain>.json (success) or <domain>.error.json (persistent failure). Transient errors (network timeouts, HTTP 5xx, Retry-After responses) are retried up to 2 attempts before being marked as persistent failures. The orchestrator is idempotent on (runId, domain, scanner) — re-running a sweep on the same runId skips already-completed work, supporting partial recovery from interrupted runs.
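A minimal sketch of the atomic write and the idempotency check described above, assuming hypothetical helper names (this is not the orchestrator's actual code):

```ts
import { existsSync } from "node:fs";
import { mkdir, rename, writeFile } from "node:fs/promises";
import path from "node:path";

// Build the per-domain result path: <domain>.json or <domain>.error.json.
function resultPath(runId: string, scanner: string, domain: string, failed: boolean): string {
  const name = failed ? `${domain}.error.json` : `${domain}.json`;
  return path.join("data", "scans", runId, scanner, name);
}

// Write the result to a temp file, then rename into place. The rename is
// atomic on POSIX filesystems, so readers never observe partial JSON.
async function writeScanResult(
  runId: string, scanner: string, domain: string,
  result: unknown, failed: boolean,
): Promise<void> {
  const file = resultPath(runId, scanner, domain, failed);
  await mkdir(path.dirname(file), { recursive: true });
  const tmp = `${file}.tmp`;
  await writeFile(tmp, JSON.stringify(result, null, 2));
  await rename(tmp, file);
}

// Idempotency on (runId, domain, scanner): skip work that already completed.
function alreadyScanned(runId: string, scanner: string, domain: string): boolean {
  return existsSync(resultPath(runId, scanner, domain, false)) ||
         existsSync(resultPath(runId, scanner, domain, true));
}
```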
After all three scanners completed, a merge step joined per-scan results on the domain key into a single canonical dataset (data/merged.json). Per-domain rows include all three scanners' outputs plus the brand's leaderboard data; missing-scanner cases are preserved as null rather than dropped.
Bot-protected brands — those for which one or more scanners returned success: false because the target site blocked the scanner's fingerprint — are kept in the dataset with their non-blocked scanners' data intact. They form the basis of the derived bot_protected predictor (§3.6).
Selection effect (important methodological caveat)
Every brand entered the dataset by being mentioned in at least one LLM's listing for at least one category. The findings therefore characterize relative ranking among already-LLM-discovered brands, NOT LLM-mention probability. We do not have non-mentioned brands in the dataset, so we cannot estimate the effect of agent-readiness on whether an LLM mentions a brand at all — only on relative rank position once mentioned. This limitation is discussed further in §4.
2.3 Sample
- n = 908 brands (target was 1500; sample expansion is planned for the next quarterly study cycle)
- 50 categories of which 24 have ≥ 20 brands and qualify for per-category breakouts
- Single snapshot, captured 2026-04-25 / 2026-04-26 (cross-sectional, not longitudinal)
2.4 Encoding
Native enum statuses are encoded numerically for correlation analysis:
- Cloudflare and Respectarium statuses → pass = +1, fail = -1, neutral = 0, missing = null
- Fern statuses → pass = +1, warn = +0.5, skip = 0, fail = -1, error = -1, missing = null
A derived predictor bot_protected is set to 1 when any scanner reported success: false for the brand (typically due to fingerprint-based bot blocking by the target site). Otherwise bot_protected = 0.
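A minimal sketch of these encodings plus the derived flag, with illustrative type and field names:

```ts
// Native enums per scanner family (§2.4).
type CfOrRespectariumStatus = "pass" | "fail" | "neutral";
type FernStatus = "pass" | "warn" | "skip" | "fail" | "error";

// Cloudflare / Respectarium: pass = +1, fail = -1, neutral = 0, missing = null.
function encodeCfOrRespectarium(s: CfOrRespectariumStatus | null): number | null {
  return s === null ? null : { pass: 1, fail: -1, neutral: 0 }[s];
}

// Fern: pass = +1, warn = +0.5, skip = 0, fail = -1, error = -1, missing = null.
function encodeFern(s: FernStatus | null): number | null {
  return s === null ? null : { pass: 1, warn: 0.5, skip: 0, fail: -1, error: -1 }[s];
}

// bot_protected = 1 when any scanner reported success: false for the brand.
function botProtected(scans: { success: boolean }[]): 0 | 1 {
  return scans.some((scan) => !scan.success) ? 1 : 0;
}
```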
2.5 Variance filter
Per-check predictors with <5% variance (a single modal status covers ≥ 95% of brands) are excluded from correlation analysis. With near-zero variation, no statistical relationship can be detected. The exclusion preserves transparency: the excluded checks are listed separately as a finding in their own right (see §3.5).
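A minimal sketch of the filter, assuming encoded values as produced in §2.4 (illustrative, not the study's exact script):

```ts
// Keep a check only when at least `threshold` (default 5%) of non-null
// brands deviate from the modal encoded status.
function passesVarianceFilter(values: (number | null)[], threshold = 0.05): boolean {
  const present = values.filter((v): v is number => v !== null);
  if (present.length === 0) return false;
  const counts = new Map<number, number>();
  for (const v of present) counts.set(v, (counts.get(v) ?? 0) + 1);
  const modalShare = Math.max(...counts.values()) / present.length;
  return 1 - modalShare >= threshold;
}
```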
2.6 Multiple-testing correction
Benjamini-Hochberg FDR is the primary correction throughout, applied within each script's family of tests (e.g., across all 250 univariate tests in the univariate analysis). Bonferroni is reported alongside as a more conservative reference. Per-LLM analyses apply FDR within each (LLM, strategy) cell separately AND globally across all per-LLM tests.
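For reference, Benjamini-Hochberg adjusted p-values follow the standard step-up procedure; a minimal sketch for one family of tests (not necessarily the study's exact implementation):

```ts
// Benjamini-Hochberg: adjusted p for the k-th smallest p-value is
// min over ranks j >= k of p_(j) * m / j, capped at 1.
function benjaminiHochberg(pValues: number[]): number[] {
  const m = pValues.length;
  const order = pValues.map((_, i) => i).sort((a, b) => pValues[a] - pValues[b]);
  const adjusted = new Array<number>(m);
  let running = 1; // running minimum, taken from the largest rank downward
  for (let k = m - 1; k >= 0; k--) {
    const i = order[k];
    running = Math.min(running, (pValues[i] * m) / (k + 1));
    adjusted[i] = running;
  }
  return adjusted;
}
```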
2.7 Determinism and reproducibility
All analyses are deterministic — no random sampling or seeded randomness. Re-running the analysis scripts on the same dataset produces numerically identical outputs. Code, dataset, and complete reproducibility instructions are at github.com/respectarium/agent-adoption-research.
3. Findings
3.1 Two signals graduate to PROMOTE_SCORED
Of the 50 evaluated predictors, only two pass all four pre-registered thresholds: cloudflare.level and respectarium.markdown-negotiation. Univariate ρ is computed against each predictor's best outcome (claudeRank for cloudflare.level; avgRank for respectarium.markdown-negotiation), and the multivariate β is reported for the best-by-|t| outcome.
Both effects are small to medium in magnitude. Neither is a "silver bullet" predictor. Their value is in being the only signals that survive every check our pre-spec required.
Sites with cleaner basic crawler hygiene (Cloudflare's aggregate level reflects robots.txt quality, AI-bot rules presence, sitemap availability) are associated with modestly better LLM visibility outcomes, after controlling for category. Sites that respond appropriately to Accept: text/markdown content-negotiation requests show similar modest improvement. These two signals capture different layers of agent-readiness — protocol-level clarity vs content-presentation flexibility — and both are confirmed predictively useful.
Multivariate estimates come from OLS regression with category fixed effects; a 95% CI excluding zero is one of the four pre-registration criteria for PROMOTE_SCORED. β coefficients are expressed in rank positions for rank outcomes and in points for claiScore. Source: results/04-multivariate.json (study-2026-04); each row reports the predictor's best-by-|t| outcome.
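For intuition about what the markdown-negotiation check observes, a content-negotiation probe in the spirit of respectarium.markdown-negotiation could look like the sketch below. This is illustrative only; the check's actual pass criteria are defined by the Agent-Adoption Specification V1.

```ts
// Probe whether a site honors Accept: text/markdown content negotiation.
// Illustrative sketch; not the Respectarium scanner's actual check logic.
async function negotiatesMarkdown(url: string): Promise<boolean> {
  const res = await fetch(url, {
    headers: { Accept: "text/markdown" },
    redirect: "follow",
  });
  const contentType = res.headers.get("content-type") ?? "";
  return res.ok && contentType.includes("text/markdown");
}
```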
3.2 Effect sizes are small to medium across the board
No effect sizes reach the conventional "large" threshold (Cohen's d > 0.8) anywhere in the study. The narrative this supports is "agent-readiness is a real but small contributor to LLM visibility" — not the stronger claim that adoption produces dramatic visibility gains.
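For reference, the d values quoted throughout use the textbook pooled-standard-deviation formula; a minimal sketch:

```ts
// Cohen's d for two groups, using the pooled standard deviation.
function cohensD(a: number[], b: number[]): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const mu = mean(xs);
    return xs.reduce((s, x) => s + (x - mu) ** 2, 0) / (xs.length - 1);
  };
  const pooled = Math.sqrt(
    ((a.length - 1) * variance(a) + (b.length - 1) * variance(b)) /
      (a.length + b.length - 2),
  );
  return (mean(a) - mean(b)) / pooled;
}
```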
3.3 Claude vs GPT structurally reverse on 4 signals
This is the most striking structural finding. For four respectarium.* checks, Claude and GPT correlate in opposite directions with whether the LLM lists the brand: in the predictor → being-listed-by-LLM analysis, the same predictor carries opposite signs for the two LLMs, and all four pairs are FDR-significant (adjusted p < 0.05) in both directions. Source: results/02-per-llm.json, binary strategy (study-2026-04).
These are not noise. The directions are statistically significant in both columns simultaneously, on the same predictors against the same outcome family. Claude and GPT select brands for their listings using structurally different criteria.
Note that all four reversal signals are Respectarium-scanner predictors. The same conceptual checks as measured by Cloudflare or Fern do not produce the same reversal pattern, likely because the three scanners operationalize these checks differently (see §3.4 on cross-scanner divergence). The reversal phenomenon may be partially scanner-implementation-specific.
Plausible mechanisms include:
- Training-data recall hypothesis: GPT's training data is comparatively older. Established household-name brands are recognized by GPT from training without active crawling — and these brands disproportionately have basic web hygiene like robots.txt and sitemaps. Claude's training is comparatively more recent, putting more weight on directly-readable site signals.
- Crawler-policy hypothesis: GPT's crawlers may be blocked by sites with aggressive robots.txt configurations. Sites without robots.txt are implicitly permissive. The negative direction in GPT could reflect this.
- Selection bias hypothesis: When asked "give me top brands in category X," each LLM applies its own implicit selection function. The four reversal-signals correlate with category structure differently across LLMs.
We cannot disambiguate these mechanisms with this dataset. What we can say is that a single universal agent-readiness score that optimizes outcomes across all three LLMs is structurally constrained — improvements that benefit Claude listing actively hurt GPT listing on these four checks. Universal optimization is unreachable; weighted compromises remain possible but cannot resolve the underlying reversal.
3.4 Cross-scanner divergence: same names, different things
Of 11 cross-scanner pairs of identically named checks, three produce essentially uncorrelated results (ρ < 0.05): despite the shared names, they measure different things. The remaining shared-name pairs correlate only moderately. Source: results/06-redundancy.json (study-2026-04), where all 11 same-name pairs are ranked.
The three independent scanners agree on which brand is which (they all use domain as join key) but they often disagree on which brands satisfy a given check, even when the check has the same name. The "Respectarium / Cloudflare / Fern" labels are not interchangeable measurement instruments — they are different operationalizations of overlapping concepts.
This is itself a publishable finding for the agent-readiness research community: claims of the form "this site is agent-ready by Specification X" are scanner-implementation-specific. Any cross-scanner comparison must be made with the divergence explicitly acknowledged.
3.5 Twenty of 66 per-check signals have <5% adoption — the ecosystem is in its infancy
The variance filter (§2.5) excludes 20 of the 66 per-check signals because <5% of brands have anything other than the modal status. (These 20 are part of the 22 predictors excluded before evaluation — the remaining 2 exclusions are aggregate or derived predictors filtered by additional pre-registered data-quality criteria.) The excluded set is almost entirely the bleeding-edge agent-protocol family:
- Cloudflare exclusions: mcpServerCard, oauthProtectedResource, oauthDiscovery, agentSkills, contentSignals, webBotAuth, a2aAgentCard
- Respectarium exclusions: llms-txt-exists, llms-txt-valid, llms-txt-size, agents-md-detection, mcp-server-card, agent-skills, web-bot-auth, a2a-agent-card, content-signals, api-catalog, oauth-protected-resource
- Fern exclusions: llms-txt-directive, tabbed-content-serialization
These checks could not be evaluated for predictive power — too few brands have implemented them.
Source: results/00-data-quality.json (study-2026-04).
Practical adoption of these specifications, as of 2026-04, ranges from approximately 0% to 4% in the surveyed brand universe. The specifications exist — agent-protocol families like MCP, A2A, OAuth-discovery, AGENTS.md, and Cloudflare's commerce-protocol stack (x402, mpp, ucp, acp, ap2) are public and documented. The practice has not yet arrived.
This is publishable on its own merit. We cannot measure the predictive power of these checks until adoption rises high enough to produce variance against outcomes. The next study cycle (planned Q2 2026) will re-measure this universe; the comparison of "X% adoption now vs Y% adoption six months ago" becomes its own data point.
3.6 Bot-protection: meaningful covariate, not standalone signal
12% of brands (109 of 908) had at least one scanner blocked by the target site's bot-protection. Welch's t-tests comparing outcomes between bot-protected and unblocked brands detect no significant differences: standalone, bot-protection has zero detectable effect on outcomes.
In multivariate regression with category fixed effects, however, bot_protected emerges as a meaningful covariate: β = -13 claiScore points, p < 0.02. The within-category negative effect is masked at the across-category level because bot-protection is concentrated in enterprise/incumbent categories that have higher baseline claiScore overall.
bot_protected is best treated as a covariate to control for in multivariate models, not as a scored signal in its own right.
A specific per-LLM finding worth flagging: GPT listing is positively correlated with bot-protection (+8.1 percentage point listing rate for bot-protected brands vs unblocked brands). Claude (-2.1pp) and Gemini (-0.4pp) show no such effect. The pattern is consistent with GPT preferentially listing established brands recognized from training data, even when those brands' websites cannot be crawled directly.
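For reference, the standalone comparison above uses Welch's t-test. A minimal sketch of the statistic and the Welch–Satterthwaite degrees of freedom (a p-value additionally requires a t-distribution CDF, omitted here):

```ts
// Welch's t statistic for two groups with unequal variances.
function welchT(a: number[], b: number[]): { t: number; df: number } {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const mu = mean(xs);
    return xs.reduce((s, x) => s + (x - mu) ** 2, 0) / (xs.length - 1);
  };
  const va = variance(a) / a.length; // per-group variance of the mean
  const vb = variance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch–Satterthwaite approximation for degrees of freedom.
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```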
3.7 The Respectarium scanner's score aggregate has zero predictive power
We report this finding transparently as part of our conflict-of-interest commitment (§5).
respectarium.score (the headline 0–100 number the Respectarium Agent-Adoption Check tool produces for each brand) has the lowest predictive power of any aggregate predictor measured:
- Mean |ρ| across 5 outcomes: 0.016
- FDR-adjusted p (vs avgRank, the best-performing outcome): 0.69
- Multivariate t-statistic (best across outcomes): 1.84, raw p = 0.07, CI spans zero
For comparison, cloudflare.level (the analogous Cloudflare aggregate) shows mean |ρ| = 0.109 — almost an order of magnitude stronger. The Respectarium scanner's individual checks include strong predictors (markdown-negotiation graduates to PROMOTE_SCORED), but the v1 weighting scheme that combines them into the score aggregate dilutes the signal.
The intended use of this finding is informative: a v2 scoring scheme rebuilt around the empirically-validated signals should produce a more predictive aggregate. We publish the weak v1 result rather than concealing it.
4. Caveats and limitations
4.1 Selection effect at the brand-recruitment level
The most important methodological caveat. Every brand in the dataset entered by virtue of being LLM-mentioned in at least one category for at least one of the three LLMs. Findings characterize:
- Relative ranking among already-LLM-discovered brands — what predicts higher rank when an LLM does mention you
- NOT LLM-mention probability — we cannot estimate whether agent-readiness causes a previously-unmentioned brand to become LLM-mentioned
This is structural to the dataset construction. Resolving it would require collecting outcome data on a broader, non-LLM-mentioned brand pool — a substantial expansion that future studies may attempt.
4.2 Cross-sectional only
Single snapshot, 2026-04-25 / 26. We cannot make causal inferences about whether adopting agent-readiness signals causes higher LLM rank, only whether the two are correlated at this point in time.
Quarterly re-runs over the next 4–8 quarters will permit panel analysis and substantially stronger inference. The Q2 study tag (study-2026-07) will be the first time-series step.
4.3 Sample size n = 908 (target was 1500)
Adequate for univariate and per-LLM (binary-outcome) analyses; per-category breakouts are more demanding (|ρ| ≥ 0.45 is needed for raw p < 0.05 at n = 20). Sample expansion to ~1500 brands is planned for the next study cycle, primarily via category expansion (50 → 80–100 categories).
4.4 No brand-size proxy
The Respectarium leaderboard data does not include employee count, revenue, domain age, or other size-related metadata. Multivariate regression uses only category fixed effects as a control. This is a meaningful gap — brand size is plausibly a strong confound for LLM visibility (larger brands are more frequently mentioned in training data) and we cannot control for it directly.
Possible Q2 enhancements include WHOIS-based domain-age enrichment as a covariate, and external traffic-tier data where licensing permits.
4.5 Bot-protection asymmetry
The set of brands blocked by Respectarium ≠ the set blocked by Cloudflare ≠ the set blocked by Fern. Different scanner fingerprints trip different bot-defenses, so bot_protected reflects "this scanner's fingerprint was blocked" rather than "this site is universally bot-protected."
The bot_protected covariate is the union of scanner-block events: a brand is flagged as bot-protected if any of the three scanners was blocked, and the 12% rate reflects this union, not universal blocking. A brand could be bot_protected: 1 because Cloudflare blocked it while Fern and Respectarium succeeded — independent implementations would therefore produce different bot-protected sets and could observe different effects.
4.6 The 20 zero-variance checks cannot be evaluated
We cannot say agent-protocol checks (MCP, A2A, OAuth, AGENTS.md, content-signals, commerce protocols) do not predict LLM visibility — only that adoption is currently too sparse to measure their predictive power. Their effect on outcomes becomes measurable only as adoption rises.
4.7 Three-LLM scope
Only Claude, GPT, and Gemini are surveyed. Other AI assistants (Perplexity, Copilot, Gemini-AI-Mode, etc.) may behave differently. The study's findings should not be generalized to LLMs outside the surveyed three without re-measurement.
5. Conflict of interest disclosure
Respectarium operates one of the three scanners evaluated in this study. To mitigate analytical bias:
- Pre-registered analytical thresholds. All promotion / drop / informational rules were committed in writing on 2026-04-24, two days before any results were viewed. The threshold logic is implemented mechanically in scripts/10-verdicts.ts — auditable.
- Findings unfavorable to the Respectarium scanner are reported transparently. §3.7 documents that the Respectarium scanner's score aggregate has zero predictive power for LLM-visibility outcomes. This finding is published rather than concealed; it is a key input to v2 spec design.
- Per-scanner reporting throughout. Readers can examine each scanner's signals independently. The Respectarium scanner does not receive special treatment in any tabulation.
- Cross-scanner divergence on shared check names is documented as a primary finding (§3.4), not a footnote. We do not minimize the implication that scanner outputs diverge.
- The Agent-Adoption Specification is open. The Respectarium scanner implements an open spec at respectarium.com/spec/agent-adoption/v1. Independent implementations are invited and would be welcomed as additional comparison data.
6. Reproducibility
All analysis is deterministic and fully reproducible. The complete analytical pipeline is published at:
github.com/respectarium/agent-adoption-research (study tag: study-2026-04)
Contents:
- data/merged.json — the canonical merged dataset (908 rows × ~80 columns when flattened)
- data/merged.csv — flat CSV view of the same data
- scripts/ — 11 TypeScript analysis scripts and 5 typed helper modules
- results/ — canonical outputs from running the scripts on the dataset (matches the output of any peer's reproduction)
- methodology.md — pre-registered thresholds, encoding rules, and conflict-of-interest disclosure
- REPRODUCIBILITY.md — step-by-step reproduction protocol
Reproduction prerequisites: Node.js 22+, ~150 MB disk space, ~1 minute of compute. Numeric outputs are byte-identical between runs (no random seeds, no sampling, deterministic dependencies).
7. Acknowledgments
This study used data from three publicly-available scanner systems:
- Respectarium Agent-Adoption Check tool — closed-source implementation of the open Agent-Adoption Specification V1, available at respectarium.com/agent-adoption-check. Specification published openly at respectarium.com/spec/agent-adoption/v1.
- Fern afdocs CLI — open-source agent-documentation scanner from Fern, published at github.com/fern-api/fern. Used as published.
- Cloudflare isitagentready.com API — public API from Cloudflare. Used as published.
Outcome data was sourced from Respectarium's tracked leaderboards across 50 B2B SaaS categories, captured 2026-04-25.
We thank the broader agent-readiness research and engineering community whose published specifications (W3C, IETF, individual companies' open work) informed the check definitions evaluated here.
8. Citation and license
Citation
Respectarium Research Team. Agent-Adoption Correlation Study — 2026-04 Baseline. Respectarium, 2026-04-26.
Web: https://respectarium.com/research/correlation-2026-04
Source + data: https://github.com/respectarium/agent-adoption-research/releases/tag/study-2026-04
License
This article and its underlying data are licensed under Creative Commons Attribution 4.0 International (CC-BY 4.0). The analysis code is licensed under MIT. Both licenses permit modification and redistribution with appropriate attribution.
Future studies
Quarterly re-measurement is planned. The Q2 study (study-2026-07) will publish to the same repository as a new immutable tag. Past tags remain accessible as published; the comparative time series across tags is itself a research output.
For updates: open an Issue or watch the repository on GitHub.
Respectarium Research, 2026-04-26