RESPECTARIUM·Research

What we found when we tested 50 agent-readiness signals against three LLMs

A plain-language read of our 2026-04 correlation study. For the formal preprint, see the research page. For the data and analysis code, the GitHub repository.

10 min read · Published 2026-04-26 by Respectarium · Version 1.0 · CC-BY 4.0

Two of the three major LLMs disagree about what makes a brand worth listing

Imagine you run a B2B SaaS company. You read the latest agent-readiness guidance and decide to invest. You add a clean robots.txt. You publish a sitemap. You expose OAuth discovery metadata at .well-known/openid-configuration. You configure your server to respond appropriately to Accept: text/markdown requests.

You ship. Then you check whether AI assistants list your company in response to "top brands in [your category]" queries. According to our 2026-04 study, here's the pattern in the data:

Claude is more likely to list you if your site has these features
GPT is more likely to list you if it doesn't

Both correlations are statistically significant after multiple-testing correction. We're not measuring noise. The two LLMs select brands by structurally different criteria — at least on these four specific checks.

Same site change. Two LLMs. Opposite verdicts.

Four agent-readiness checks where Claude and GPT correlate in opposite directions with whether they list the brand.

Source: 908-brand correlation study, 2026-04. All four reversals are statistically significant after FDR correction in both LLM directions simultaneously.

This is the most counterintuitive finding of our study, and we want to lead with it because the practical implication matters: a single agent-readiness score that optimizes for visibility across all major LLMs is not achievable — at least not with the signals we measured.

The rest of this article unpacks what we found, what we didn't find, and what it means for anyone trying to think about agent-readiness in 2026.

What we measured

We took 908 brands across 50 B2B SaaS categories — every brand that at least one of three LLMs (Claude, GPT, Gemini) mentioned when asked for the top 10–20 brands in a category. Then we ran three independent agent-readiness scanners against each brand's website:

Respectarium's Agent-Adoption Check (the closed-source tool we built and operate)
Cloudflare's public isitagentready.com API
Fern's open-source afdocs CLI

That gave us 66 per-check measurements per brand — 25 from Respectarium, 19 from Cloudflare, 22 from Fern. Plus each scanner's aggregate readiness score and one derived “is the site blocking our scanner?” flag — bringing the total to 72 predictors going into the analysis.

Before testing, we excluded 22 of those 72. Twenty had less than 5% real-world adoption (no variance to detect anything statistical — we cover those separately below in the ecosystem-nascency section). Two more were dropped by additional pre-registered data-quality filters. That left 50 predictors evaluated against five LLM-visibility outcomes.

Combined with the LLM rank data, we ran 11 statistical analyses on the merged dataset to test which signals predict whether an LLM lists a brand and where it ranks.

Important caveat baked into how we collected the data: every brand in our dataset was already mentioned by at least one LLM. Brands that no LLM mentioned never entered the study. So our findings are about relative ranking among already-mentioned brands, not about whether agent-readiness gets your brand mentioned in the first place. We can't answer the second question without a different sampling design.

We pre-registered the rules before looking at results

Before any analysis began, we wrote down the criteria a signal must meet to be classified as "predictive enough to score." The rules were committed in writing on 2026-04-24 — two days before we ran the analysis. They are applied mechanically by our analysis script.

For a signal to graduate to PROMOTE_SCORED status, all four of these must be true:

The signal correlates with at least one outcome with FDR-adjusted p < 0.05 (a standard multiple-testing correction)
The effect remains statistically significant after we control for category (which is why something looks predictive across the dataset can disappear when you compare apples to apples within a single industry segment)
At least 2 of 3 LLMs reward the signal in the same direction
The signal isn't redundant with another, stronger signal that already qualifies

Why pre-register? Because without pre-registration, statistical analysis becomes a fishing expedition where you can find "significant" results just by trying enough combinations. With pre-registered thresholds applied mechanically, the result you get is the result you get — no shopping for a story.

The result: 2 of 50 tested signals graduate.

From 72 measured signals to 2 promoted predictors

Pre-registered analytical thresholds, applied mechanically across 11 statistical scripts.

Total measured signals72

66 per-check + 5 aggregates + 1 derived (bot_protected)

Excluded — zero variance20

<5% of brands deviate from modal status (ecosystem-nascency)

Excluded — additional filters2

further pre-registered data-quality filters before evaluation

Tested predictors50

evaluated against 5 outcomes (250 tests)

DROP30

failed univariate AND multivariate AND no subgroup signal

KEEP_INFORMATIONAL18

passed some criteria but not all four for promotion

PROMOTE_SCORED2

passed all four pre-registered criteria

cloudflare.levelrespectarium.markdown-negotiation

Source: results/10-verdicts.json + results/00-data-quality.json (study-2026-04).

What graduated, and what it means

Two predictors passed all four pre-registered criteria:

cloudflare.level — Cloudflare's aggregate readiness level

Cloudflare's isitagentready.com produces an aggregate score from 0 to 5 based on a portfolio of basic web-presence signals: clean robots.txt, declared AI-bot rules, sitemap availability, etc. Sites that score one level higher rank — on average — about 1.6 positions higher in the average LLM listing, controlling for industry category. That's not enormous, but it's robust across multiple analytical lenses.

respectarium.markdown-negotiation — content-negotiation for markdown

If your server responds appropriately when a client requests text/markdown instead of text/html, you're flagged as passing this check. Sites that pass rank, on average, about 1.8 positions higher in Gemini specifically — and in the same direction (though smaller magnitude) for Claude.

These two signals capture different layers of agent-readiness: protocol-level clarity (robots.txt, AI-bot rules) and content-presentation flexibility (markdown alternatives to HTML). Both modestly predict where a brand lands in LLM rankings — once that brand has been included in the LLM's pool of candidates.

Two predictors graduated — what they actually measure

PROMOTED

Basic readiness level

+1.6rank positions

per +1 level on Cloudflare's readiness scale, on average across LLM listings.

Aggregates whether your site has clean robots.txt, AI-bot rules, sitemap, and basic crawler hygiene.

Sample: 826 brands · controlling for industry categorycloudflare.level →

PROMOTED

Markdown content-negotiation

+1.8rank positions in Gemini

for sites that respond to Accept: text/markdown requests.

Whether your server serves markdown content to clients that ask for it (alternative to HTML).

Sample: 845 brands · controlling for industry categoryrespectarium.markdown-negotiation →

Two predictors. Both small effects. Both robust to multiple-testing correction, multivariate control, cross-LLM consistency, and redundancy filtering.

The effects are real but small

Here is every signal that survives the strictest test we ran (Welch's t-test with FDR correction), expressed in the original outcome scale rather than abstract correlation coefficients:

If a brand passes this check…	…it ranks (on average)	Effect size
`respectarium.oauth-discovery`	+3.7 positions higher in Claude	medium
`cloudflare.robotsTxt`	+10.2 points higher on Respectarium's quality score (0–100)	small
`cloudflare.robotsTxt`	+3.1 positions higher in Claude rank	medium
`cloudflare.robotsTxtAiRules`	+8.5 points higher on quality score	small
`fern.redirect-behavior`	+7.1 points higher on quality score	small
`cloudflare.robotsTxtAiRules`	+2.6 positions higher in Claude rank	small

These are real differences. A 10-point improvement on a 0-100 quality score is observable — readers can feel the difference between a brand that scores 65 vs 75. A 3-position rank improvement is noticeable when the listing is 1–20 deep.

But these are also clearly not transformative. None of these effects reach the conventional "large effect" threshold (Cohen's d > 0.8). The honest framing is "agent-readiness is a small but real contributor to LLM visibility" — not "implement these checks and watch your brand suddenly rank higher."

The narrative space between those two claims is where careful interpretation lives. Industry voices that promise dramatic LLM visibility wins from agent-readiness adoption are getting ahead of the evidence.

Why Claude and GPT disagree (some hypotheses)

The four signals where Claude and GPT diverge — sitemap-exists, oauth-discovery, robots-txt-exists, markdown-negotiation — are all checks from the Respectarium scanner. Claude associates them positively with listing a brand. GPT associates them negatively. Three plausible mechanisms:

Hypothesis 1 — Training-data recency: GPT's training corpus includes a lot of older web content. Many household-name B2B brands appeared in that older content extensively, regardless of their current agent-readiness posture. GPT may rely more on training-time brand recognition; the absence of agent-readiness signals doesn't penalize a brand it already "knows." Claude, with newer training data, may weight current-state crawler signals more directly.
Hypothesis 2 — Crawler accessibility: GPT's crawler may be blocked by sites with stricter robots.txt configurations — including AI-bot-specific blocking rules. The "negative" direction of the GPT correlation might reflect "sites without robots.txt are implicitly permissive to all crawlers, including GPT's." Sites that bother having a robots.txt file are more likely to also have rules that block GPT specifically.
Hypothesis 3 — Selection by listing process: When asked for "top brands in category X," each LLM applies its own implicit selection function — combining training-data recall, retrieval, and ranking. Different LLMs converge on different brand subsets. The agent-readiness signals correlate with category structure differently for different LLMs because each LLM's category-scope is structurally different.

We can't disambiguate these mechanisms with cross-sectional data alone. What we can confidently say: the disagreement is real, statistically significant, and has practical implications for anyone designing a "universal agent-readiness score." Such a score is structurally constrained by these reversal effects — any single number will trade Claude visibility for GPT visibility on these specific signals. Universal optimization is unreachable; weighted compromises remain possible.

Specs exist; practice has barely arrived

Of the 66 per-check signals we measured (the bulk of our 72 predictors), 20 had less than 5% real-world adoption in our 908-brand sample — they were among the 22 we excluded before testing. We couldn't even include them in the correlation analysis — there's no statistical signal to detect when 95% of brands sit in the same status bucket.

The list of un-adopted checks reads like a tour through the bleeding-edge agent-protocol future:

MCP server cards (Model Context Protocol — for agentic interactions)
A2A agent cards (Agent-to-Agent communication declarations)
OAuth protected-resource metadata (formal OAuth 2.0 endpoint discovery)
AGENTS.md files (proposed convention for agent guidelines per repo)
web-bot-auth (cryptographic proof-of-bot identity)
content-signals (declared content-usage permissions for AI training)
The Cloudflare commerce-protocol stack: x402, mpp, ucp, acp, ap2 (agent-to-payment protocols)

Adoption rates for these range from 0.0% to 3.6%. The specifications are public, documented, and championed by major industry players. The practice has not yet arrived.

20 of 66 per-check signals: adoption rates 0–4%

These checks could not be evaluated for predictive power — too few brands have implemented them.

respectariumcloudflarefern

Source: results/00-data-quality.json (study-2026-04).

This is itself a publishable finding. The agent-readiness standards landscape in 2026 is comparable to where the AMP standard was in 2016 or where structured data was around 2012-2013: the technology exists, the documentation is available, the major players are advocating for it, and almost nobody has actually implemented it yet. Adoption follows specifications by years, not months — and we now have a baseline measurement to track that adoption against in future quarterly studies.

The unflattering finding we published anyway

Respectarium operates one of the three scanners we evaluated. We told you that upfront in §1 of the preprint. Here's the consequence: our scanner produces a 0–100 aggregate score for each brand it scans. We tested whether that score predicts LLM visibility outcomes.

It doesn't.

Specifically: the Respectarium scanner's score aggregate has a mean correlation of 0.016 across all five outcome variables. The FDR-adjusted p-value is 0.69 — far above any reasonable significance threshold. Effectively no predictive power.

For comparison, Cloudflare's level aggregate has a mean correlation of about 0.11 — almost an order of magnitude stronger. Some of the Respectarium scanner's individual checks (like markdown-negotiation) graduate to PROMOTE_SCORED status. But the way the v1 weighting scheme combines those checks into a single score dilutes the signal of the few actually-predictive ones.

We're publishing this finding because methodological transparency is the credentialing mechanism of this whole research program. If we quietly buried the result that our own scanner's aggregate score doesn't predict outcomes, we'd be exactly the kind of vendor-research operation we want to distinguish ourselves from.

Practically, this finding is the empirical foundation for a v2 specification. The next version of the Agent-Adoption Specification will rebuild the scoring formula around the signals that survived our pre-registered tests, rather than the more comprehensive but signal-diluted v1 weighting.

What this means for you

If you run a B2B SaaS site

The agent-readiness signals that empirically correlate with LLM visibility — clean robots.txt, AI-bot rules in robots.txt, content-negotiation for markdown, OAuth-discovery metadata — are already well-documented as web-development best practices. They're not exotic. If you're investing in any kind of structured-data, SEO, or developer-experience polish, you probably already implement most of them.

What we can confirm based on our data:

✅ These practices correlate (modestly) with better LLM visibility
✅ The cost of implementing them is low for most modern web stacks
⚠️ Don't expect dramatic visibility wins — the effects are small
⚠️ Don't optimize aggressively for "agent-readiness" if the cost is real engineering work — the ROI is small at current evidence

If you're a researcher or methodologist

The full preprint, methodology, dataset, and source code are at github.com/respectarium/agent-adoption-research. Reproduction takes about a minute on a laptop with Node.js installed. We welcome critique, replication attempts, and proposed analytical improvements.

If you're tracking the agent-readiness standards landscape

Our finding that 20 of 66 per-check signals have <5% adoption in 2026 establishes a baseline. Quarterly re-runs will show how the adoption curve develops. The specs that go from <1% adoption now to >5% adoption in 6 months are the ones to watch — they'll be the next data points testable against LLM-visibility outcomes.

What's next

We will repeat this study quarterly. The next release (study-2026-07) will:

Re-measure the same 908-brand universe (or expanded to ~1500 if our sample expansion succeeds)
Use the same 11-script analytical pipeline applied unchanged
Compare directly to Q1: which signals strengthened, which weakened, which newly-adopted checks now have measurable variance
Publish to the same GitHub repository as a new immutable tag

Past tags remain accessible forever. The comparative time series across quarterly tags is itself a research output — we expect to be able to make stronger longitudinal claims by Q3 (~6 months out) and meaningful causal-direction claims by Q4 (~9 months).

We're also working on a v2 of the Agent-Adoption Specification that incorporates these findings into the scoring formula. Specifically: rebuild the score around the predictively-validated signals, treat bot-protection as a covariate rather than a quality penalty, and demote the bleeding-edge agent-protocol checks to "informational" until adoption produces variance to test against.

Where to read more

Formal preprint: respectarium.com/research/correlation-2026-04 — full methodology, complete results tables, references, conflict-of-interest disclosure
Source code, data, results: github.com/respectarium/agent-adoption-research — MIT-licensed code, CC-BY 4.0-licensed data, full reproducibility instructions
Agent-Adoption Specification V1: respectarium.com/spec/agent-adoption/v1 — the open specification the Respectarium scanner implements
Run the Agent-Adoption Check on your own site: respectarium.com/agent-adoption-check

For updates on quarterly re-runs and v2 spec development, watch the GitHub repository.

Respectarium Research, 2026-04-26