A plain-language read of our 2026-04 correlation study. For the formal preprint, see the research page. For the data and analysis code, see the GitHub repository.
10 min read · Published 2026-04-26 by Respectarium · Version 1.0 · CC-BY 4.0
Imagine you run a B2B SaaS company. You read the latest agent-readiness guidance and decide to invest. You add a clean robots.txt. You publish a sitemap. You expose OAuth discovery metadata at .well-known/openid-configuration. You configure your server to respond appropriately to Accept: text/markdown requests.
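In practice, the first three of those steps might look something like the sketch below (Express routes with placeholder values; example.com, the endpoint paths, and the OAuth URLs are illustrative, not a recommended production configuration). The markdown-negotiation piece is sketched later in the article.

```typescript
// A minimal sketch of the first three steps: robots.txt, sitemap, and
// OAuth discovery metadata. All values are placeholders, not a
// recommended production configuration.
import express from "express";

const app = express();

app.get("/robots.txt", (_req, res) => {
  res
    .type("text/plain")
    .send("User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n");
});

app.get("/sitemap.xml", (_req, res) => {
  res
    .type("application/xml")
    .send(
      '<?xml version="1.0" encoding="UTF-8"?>' +
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
        "<url><loc>https://example.com/</loc></url></urlset>"
    );
});

// OAuth 2.0 / OpenID Connect discovery document at the standard well-known path.
app.get("/.well-known/openid-configuration", (_req, res) => {
  res.json({
    issuer: "https://example.com",
    authorization_endpoint: "https://example.com/oauth/authorize",
    token_endpoint: "https://example.com/oauth/token",
    response_types_supported: ["code"],
  });
});

app.listen(3000);
```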
You ship. Then you check whether AI assistants list your company in response to "top brands in [your category]" queries. According to our 2026-04 study, here's the pattern in the data:
[Chart] Four agent-readiness checks where Claude and GPT correlate in opposite directions with whether they list the brand. (Source: 908-brand correlation study, 2026-04. All four reversals are statistically significant after FDR correction in both LLM directions simultaneously.)

Both correlations (Claude's positive, GPT's negative) are statistically significant after multiple-testing correction. We're not measuring noise. The two LLMs select brands by structurally different criteria — at least on these four specific checks.
This is the most counterintuitive finding of our study, and we want to lead with it because the practical implication matters: a single agent-readiness score that optimizes for visibility across all major LLMs is not achievable — at least not with the signals we measured.
The rest of this article unpacks what we found, what we didn't find, and what it means for anyone trying to think about agent-readiness in 2026.
We took 908 brands across 50 B2B SaaS categories — every brand that at least one of three LLMs (Claude, GPT, Gemini) mentioned when asked for the top 10–20 brands in a category. Then we ran three independent agent-readiness scanners against each brand's website:
- the Respectarium scanner (25 checks)
- Cloudflare's isitagentready.com API (19 checks)
- Fern's afdocs CLI (22 checks)

That gave us 66 per-check measurements per brand. Add the scanners' five aggregate readiness scores and one derived “is the site blocking our scanner?” flag, and the total comes to 72 predictors going into the analysis.
Before testing, we excluded 22 of those 72. Twenty had less than 5% real-world adoption (too little variance for any statistical test — we cover those separately in the ecosystem-nascency section below). Two more were dropped by additional pre-registered data-quality filters. That left 50 predictors evaluated against five LLM-visibility outcomes.
We then merged these predictors with the LLM rank data and ran 11 statistical analyses to test which signals predict whether an LLM lists a brand and where it ranks it.
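To give a flavor of the multiple-testing correction involved: with 50 predictors and 5 outcomes there are 250 tests, so raw p-values have to be adjusted before anything is called significant. Below is a small Benjamini-Hochberg FDR sketch in TypeScript; it is an illustrative re-implementation of the standard procedure, not the study's actual analysis code (which lives in the repository).

```typescript
// Benjamini-Hochberg FDR adjustment. Illustrative re-implementation,
// not the study code.
function benjaminiHochberg(pValues: number[]): number[] {
  const m = pValues.length;
  // Indices sorted by ascending p-value.
  const order = pValues.map((_, i) => i).sort((a, b) => pValues[a] - pValues[b]);
  const adjusted = new Array<number>(m);
  let runningMin = 1;
  // Walk from the largest p-value down, enforcing monotonicity.
  for (let rank = m - 1; rank >= 0; rank--) {
    const i = order[rank];
    runningMin = Math.min(runningMin, (pValues[i] * m) / (rank + 1));
    adjusted[i] = Math.min(runningMin, 1);
  }
  return adjusted;
}

// Only the adjusted p-values decide significance.
console.log(benjaminiHochberg([0.001, 0.012, 0.03, 0.2, 0.8]));
// -> [0.005, 0.03, 0.05, 0.25, 0.8]
```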
Important caveat baked into how we collected the data: every brand in our dataset was already mentioned by at least one LLM. Brands that no LLM mentioned never entered the study. So our findings are about relative ranking among already-mentioned brands, not about whether agent-readiness gets your brand mentioned in the first place. We can't answer the second question without a different sampling design.
Before any analysis began, we wrote down the criteria a signal must meet to be classified as "predictive enough to score." The rules were committed in writing on 2026-04-24 — two days before we ran the analysis. They are applied mechanically by our analysis script.
For a signal to graduate to PROMOTE_SCORED status, all four of these must be true: it survives multiple-testing (FDR) correction, it holds up under multivariate control, its direction is consistent across LLMs, and it passes redundancy filtering against the other predictors.
Why pre-register? Because without pre-registration, statistical analysis becomes a fishing expedition where you can find "significant" results just by trying enough combinations. With pre-registered thresholds applied mechanically, the result you get is the result you get — no shopping for a story.
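In code, applying such a rule mechanically is almost trivially simple, which is the point. The sketch below is illustrative rather than the study's actual script; the field names are hypothetical, and the four booleans mirror the criteria described above.

```typescript
// Hypothetical schema: one row of per-signal statistics. Only the
// PROMOTE_SCORED label comes from the study; the rest is illustrative.
interface SignalStats {
  id: string;
  fdrSignificant: boolean;     // survives multiple-testing (FDR) correction
  multivariateRobust: boolean; // effect holds under multivariate control
  crossLlmConsistent: boolean; // same direction across LLMs
  nonRedundant: boolean;       // passes redundancy filtering
}

function verdict(s: SignalStats): "PROMOTE_SCORED" | "NOT_PROMOTED" {
  // All four pre-registered criteria must hold; there is no discretionary override.
  return s.fdrSignificant && s.multivariateRobust && s.crossLlmConsistent && s.nonRedundant
    ? "PROMOTE_SCORED"
    : "NOT_PROMOTED";
}
```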
The result: 2 of 50 tested signals graduate.
Pre-registered analytical thresholds, applied mechanically across 11 statistical scripts:

- Measured: 66 per-check + 5 aggregates + 1 derived (bot_protected) = 72 predictors
- Excluded (20): <5% of brands deviate from modal status (ecosystem-nascency)
- Excluded (2): further pre-registered data-quality filters before evaluation
- Evaluated (50): against 5 outcomes (250 tests)
- Failed univariate AND multivariate tests, with no subgroup signal: not promoted
- Passed some criteria, but not all four needed for promotion: not promoted
- Passed all four pre-registered criteria (2): cloudflare.level and respectarium.markdown-negotiation

Source: results/10-verdicts.json + results/00-data-quality.json (study-2026-04).
Two predictors passed all four pre-registered criteria:
The first is cloudflare.level. Cloudflare's isitagentready.com produces an aggregate score from 0 to 5 based on a portfolio of basic web-presence signals: clean robots.txt, declared AI-bot rules, sitemap availability, and so on. Sites that score one level higher rank — on average — about 1.6 positions higher in the average LLM listing, controlling for industry category. That's not enormous, but it's robust across multiple analytical lenses.
The second is respectarium.markdown-negotiation. If your server responds appropriately when a client requests text/markdown instead of text/html, you're flagged as passing this check. Sites that pass rank, on average, about 1.8 positions higher in Gemini specifically — and in the same direction (though with a smaller magnitude) for Claude.
These two signals capture different layers of agent-readiness: protocol-level clarity (robots.txt, AI-bot rules) and content-presentation flexibility (markdown alternatives to HTML). Both modestly predict where a brand lands in LLM rankings — once that brand has been included in the LLM's pool of candidates.
- ≈1.6 positions higher rank per +1 level on Cloudflare's readiness scale, on average across LLM listings. cloudflare.level aggregates whether your site has a clean robots.txt, AI-bot rules, a sitemap, and basic crawler hygiene.
- ≈1.8 positions higher rank in Gemini for sites that respond to Accept: text/markdown requests. respectarium.markdown-negotiation checks whether your server serves markdown content to clients that ask for it (as an alternative to HTML).
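For reference, honoring Accept: text/markdown is a small amount of server code. Here is a minimal Express sketch with a hypothetical route and content; the scanner's actual probe may differ.

```typescript
// Markdown content negotiation: serve markdown when the client asks for
// it, HTML otherwise. Route and content are hypothetical.
import express from "express";

const app = express();

app.get("/docs/getting-started", (req, res) => {
  // req.accepts() returns the best match from the client's Accept header.
  if (req.accepts(["text/html", "text/markdown"]) === "text/markdown") {
    res.type("text/markdown").send("# Getting started\n\nInstall the CLI, then run `init`.\n");
  } else {
    res
      .type("text/html")
      .send("<h1>Getting started</h1><p>Install the CLI, then run <code>init</code>.</p>");
  }
});

app.listen(3000);
```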
Two predictors. Both small effects. Both robust to multiple-testing correction, multivariate control, cross-LLM consistency, and redundancy filtering.
Here is every signal that survives the strictest test we ran (Welch's t-test with FDR correction), expressed in the original outcome scale rather than abstract correlation coefficients:
These are real differences. A 10-point improvement on a 0-100 quality score is observable — readers can feel the difference between a brand that scores 65 and one that scores 75. A 3-position rank improvement is noticeable when the listing is only 10–20 items deep.
But these are also clearly not transformative. None of these effects reach the conventional "large effect" threshold (Cohen's d > 0.8). The honest framing is "agent-readiness is a small but real contributor to LLM visibility" — not "implement these checks and watch your brand suddenly rank higher."
The narrative space between those two claims is where careful interpretation lives. Industry voices that promise dramatic LLM visibility wins from agent-readiness adoption are getting ahead of the evidence.
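For readers who want the yardstick behind that threshold: Cohen's d is the difference in group means divided by a pooled standard deviation. A quick sketch follows (an illustrative helper, not the study's code).

```typescript
// Cohen's d with a pooled standard deviation. d > 0.8 is the
// conventional "large effect" threshold referenced above.
function cohensD(a: number[], b: number[]): number {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const variance = (xs: number[]) => {
    const m = mean(xs);
    return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
  };
  const pooledSd = Math.sqrt(
    ((a.length - 1) * variance(a) + (b.length - 1) * variance(b)) /
      (a.length + b.length - 2)
  );
  return (mean(a) - mean(b)) / pooledSd;
}
```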
The four signals where Claude and GPT diverge — sitemap-exists, oauth-discovery, robots-txt-exists, markdown-negotiation — are all checks from the Respectarium scanner. Claude associates them positively with listing a brand. GPT associates them negatively. Three plausible mechanisms:
One of them involves robots.txt configurations — including AI-bot-specific blocking rules. The "negative" direction of the GPT correlation might reflect the fact that sites without a robots.txt are implicitly permissive to all crawlers, including GPT's, while sites that bother having a robots.txt file are more likely to also have rules that block GPT specifically.

We can't disambiguate these mechanisms with cross-sectional data alone. What we can confidently say: the disagreement is real, statistically significant, and has practical implications for anyone designing a "universal agent-readiness score." Such a score is structurally constrained by these reversal effects — any single number will trade Claude visibility for GPT visibility on these specific signals. Universal optimization is unreachable; weighted compromises remain possible.
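To make the robots.txt mechanism concrete, here is a hypothetical file that is present and tidy yet selectively restrictive. The user-agent tokens are the crawlers' published names; the policy itself is invented for illustration.

```text
# Hypothetical robots.txt: present and "clean", but not equally permissive.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```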
Of the 66 per-check signals we measured (the bulk of our 72 predictors), 20 had less than 5% real-world adoption in our 908-brand sample — they were among the 22 we excluded before testing. We couldn't even include them in the correlation analysis — there's no statistical signal to detect when 95% of brands sit in the same status bucket.
The list of un-adopted checks reads like a tour through the bleeding-edge agent-protocol future:
- web-bot-auth (cryptographic proof-of-bot identity)
- content-signals (declared content-usage permissions for AI training)
- x402, mpp, ucp, acp, ap2 (agent-to-payment protocols)

Adoption rates for these range from 0.0% to 3.6%. The specifications are public, documented, and championed by major industry players. The practice has not yet arrived.
These checks could not be evaluated for predictive power — too few brands have implemented them.
Source: results/00-data-quality.json (study-2026-04).
This is itself a publishable finding. The agent-readiness standards landscape in 2026 is comparable to where the AMP standard was in 2016 or where structured data was around 2012-2013: the technology exists, the documentation is available, the major players are advocating for it, and almost nobody has actually implemented it yet. Adoption follows specifications by years, not months — and we now have a baseline measurement to track that adoption against in future quarterly studies.
Respectarium operates one of the three scanners we evaluated. We told you that upfront in §1 of the preprint. Here's the consequence: our scanner produces a 0–100 aggregate score for each brand it scans. We tested whether that score predicts LLM visibility outcomes.
It doesn't.
Specifically: the Respectarium scanner's score aggregate has a mean correlation of 0.016 across all five outcome variables. The FDR-adjusted p-value is 0.69 — far above any reasonable significance threshold. Effectively no predictive power.
For comparison, Cloudflare's level aggregate has a mean correlation of about 0.11 — almost an order of magnitude stronger. Some of the Respectarium scanner's individual checks (like markdown-negotiation) graduate to PROMOTE_SCORED status. But the way the v1 weighting scheme combines those checks into a single score dilutes the signal of the few actually-predictive ones.
We're publishing this finding because methodological transparency is the credentialing mechanism of this whole research program. If we quietly buried the result that our own scanner's aggregate score doesn't predict outcomes, we'd be exactly the kind of vendor-research operation we want to distinguish ourselves from.
Practically, this finding is the empirical foundation for a v2 specification. The next version of the Agent-Adoption Specification will rebuild the scoring formula around the signals that survived our pre-registered tests, rather than the more comprehensive but signal-diluted v1 weighting.
The agent-readiness signals that empirically correlate with LLM visibility — clean robots.txt, AI-bot rules in robots.txt, content-negotiation for markdown, OAuth-discovery metadata — are already well-documented as web-development best practices. They're not exotic. If you're investing in any kind of structured-data, SEO, or developer-experience polish, you probably already implement most of them.
What we can confirm based on our data:

- A brand one level higher on Cloudflare's 0–5 readiness scale ranks, on average, about 1.6 positions higher across LLM listings.
- A brand that negotiates markdown content ranks, on average, about 1.8 positions higher in Gemini, with a smaller same-direction effect in Claude.
- These effects are real but small; none reach the conventional large-effect threshold.
- Claude and GPT correlate in opposite directions on four of the checks, so no single readiness score can optimize visibility for both.
- The bleeding-edge agent-protocol checks have under 5% adoption and could not yet be evaluated for predictive power.
The full preprint, methodology, dataset, and source code are at github.com/respectarium/agent-adoption-research. Reproduction takes about a minute on a laptop with Node.js installed. We welcome critique, replication attempts, and proposed analytical improvements.
Our finding that 20 of 66 per-check signals have <5% adoption in 2026 establishes a baseline. Quarterly re-runs will show how the adoption curve develops. The specs that go from <1% adoption now to >5% adoption in 6 months are the ones to watch — they'll be the next data points testable against LLM-visibility outcomes.
We will repeat this study quarterly; the next release is study-2026-07. Past tags remain accessible forever. The comparative time series across quarterly tags is itself a research output — we expect to be able to make stronger longitudinal claims by Q3 (~6 months out) and meaningful causal-direction claims by Q4 (~9 months).
We're also working on a v2 of the Agent-Adoption Specification that incorporates these findings into the scoring formula. Specifically: rebuild the score around the predictively validated signals, treat bot-protection as a covariate rather than a quality penalty, and demote the bleeding-edge agent-protocol checks to "informational" until adoption produces variance to test against.
For updates on quarterly re-runs and v2 spec development, watch the GitHub repository.
Respectarium Research, 2026-04-26