AI answers aren't fixed: why one ChatGPT check isn't a measurement

Key takeaways

AI search variance is real: ask ChatGPT the same buying query ten times on the same day, with nothing changed, and the brands it recommends shuffle between runs.
That is not a glitch. A 2026 study across Perplexity, OpenAI’s SearchGPT, and Google Gemini found the same query returns a different set of cited sources on repeat, and many gaps between brands sit inside the measurement noise (Sielinski, arXiv 2026).
One screenshot proves you can appear. It says nothing about how often. Named in 8 of 10 runs is a strong position; 1 of 10 is a fluke you happened to catch. They look identical if you only ran it once.
The fix is to measure a hit rate instead of a yes or no: pick your real buying queries, run each several times per engine, and track the rate. The engines diverge, so measure each one.

Here is a test you can run in the next five minutes. Open ChatGPT, ask it “what’s the best [your category] for [your customer],” and screenshot the answer. Now ask the exact same query again. And again. If you are like most brands I have checked, the list of recommended companies will not be the same twice. This is AI search variance, and it is the single biggest reason one ChatGPT check is not a measurement of anything.

That shuffle is not a bug. AI answer engines are non-deterministic, which is a technical way of saying the model writes a fresh answer every time and a bit of randomness is built into how it does that. A 2026 study that submitted the same queries to Perplexity, OpenAI’s SearchGPT, and Google Gemini found that “identical queries submitted at different times can produce different responses and cite different sources,” and that the citation rankings were unstable from one run to the next (Sielinski, arXiv 2026).

So when someone shows you a screenshot of ChatGPT naming you (or not naming you), they have shown you one draw from a deck of cards. This post is about why that matters, what the screenshot actually tells you, and how to measure AI visibility properly: as a rate, not a verdict.

Google gives you a list, AI writes you an answer

Classic SEO trained all of us to expect a stable result. You search a keyword, you get a ranked list of links, and that list barely moves if you refresh. One check is a fair snapshot, because the system underneath is mostly fixed.

AI search broke that habit. The researchers behind a separate 2026 paper put it cleanly: in classic search “a single query often provides a representative snapshot,” but the probabilistic nature of AI search means “answers can vary across runs, prompts, and time, making one-off observations unreliable” (Schulte et al., arXiv 2026). Two different university teams, looking at this independently, landed on the same conclusion: stop measuring once.

Why does the answer move when the query did not? Because the model builds its reply one word at a time, and at each step it samples the next word from a set of probabilities instead of always picking the single likeliest one. Run it again and a few of those coin flips land differently, and the answer drifts. Variation persists even at the lowest randomness setting, where the model is supposed to pick the likeliest option every time. When one team generated 1,000 completions of the same prompt at temperature zero (the setting meant to be deterministic), they got 80 different outputs, with the first divergence as early as the 103rd token, a token being the word-sized piece the model writes in (He, Thinking Machines Lab, 2025). At that setting the cause is not the coin flips but tiny numerical differences in how the server processes each request. The short version: sameness is not the default here. Variation is.
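Here is a toy sketch of that sampling step. It assumes a made-up four-brand vocabulary with invented scores; real models choose among tens of thousands of tokens, and none of these names or numbers come from the studies cited here.

```python
import math
import random

# Hypothetical next-token scores ("logits"); invented for illustration only.
LOGITS = {"Acme": 2.0, "Birch": 1.6, "Cobalt": 1.1, "Dune": 0.3}


def sample_next(logits, temperature, rng):
    """Pick one token: softmax with temperature, then a weighted random draw."""
    if temperature <= 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def tally(logits, temperature, runs, seed):
    """Count how often each token is picked over repeated runs."""
    rng = random.Random(seed)
    counts = {tok: 0 for tok in logits}
    for _ in range(runs):
        counts[sample_next(logits, temperature, rng)] += 1
    return counts


if __name__ == "__main__":
    print("temperature 1.0:", tally(LOGITS, 1.0, 1000, seed=7))
    print("temperature 0.0:", tally(LOGITS, 0.0, 1000, seed=7))
```

At temperature 1.0 the top-scoring brand wins most often but not always. In this toy, temperature zero always returns the same token; the run-to-run differences reported on production systems at temperature zero come from the serving stack, which this sketch does not model.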

What one screenshot actually tells you

A single check has exactly one honest reading: it tells you whether you showed up that one time. It cannot tell you how often, and how often is the whole game.

Picture two brands. Brand A is named in 8 of 10 runs of the same query. Brand B is named in 1 of 10. Brand A is in a genuinely strong position; Brand B got lucky once. But if you only ran the query a single time and happened to catch the run where each appeared, the two screenshots look identical. Same logo, same “yes, ChatGPT recommends us,” wildly different reality.

The bad-news screenshot is just as shaky. “We checked and we’re not there” might really be “we’re named in 3 of 10, and you happened to screenshot one of the 7 misses.” You would walk into a client meeting declaring a problem that is partly a sampling accident, or worse, you would spend budget fixing a gap that is smaller than you think. A miss you caught once is not a diagnosis.
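The arithmetic behind both traps is short. A minimal sketch, assuming each run is an independent draw with a fixed underlying hit rate (a simplification), and using the illustrative rates from the examples above instead of measured values:

```python
def chance_all_runs_miss(hit_rate, runs):
    """Probability that every one of `runs` independent checks misses the brand."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) ** runs


if __name__ == "__main__":
    examples = [
        ("Brand A (8 of 10)", 0.8),
        ("Brand B (1 of 10)", 0.1),
        ("3-of-10 brand", 0.3),
    ]
    for label, rate in examples:
        one_miss = chance_all_runs_miss(rate, 1)
        five_miss = chance_all_runs_miss(rate, 5)
        print(f"{label}: a single check misses {one_miss:.0%} of the time, "
              f"five checks in a row all miss {five_miss:.1%} of the time")
```

Under that assumption, a brand named in 3 of 10 runs is missed by a single check 70% of the time, and missed by five checks in a row about 17% of the time. That is why one "we're not there" screenshot cannot separate a real gap from a caught miss.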

This is why I treat any single AI screenshot, mine or a competitor’s, as a vanity metric. It is a real data point of size one. The honest question is never “am I in the answer,” it is “what share of the time am I in the answer, and for which queries.”

The number that matters is a hit rate

Once you treat an answer as a distribution rather than a fact, the method is obvious. You stop recording a yes or a no. You start recording a rate.

The method I use, and the one BlueJar automates, has four steps:

1. Pick your real buying queries. Not “best CRM software” in the abstract, but the phrasings a ready buyer actually types: “best CRM for a 5-person real estate team,” “is [competitor] worth it for solo agents.” Real demand beats invented prompts.
2. Run each query several times, per engine. One run is a guess. A handful of runs starts to show a rate. The exact count you need depends on the engine (more on that below), but the principle is fixed: repeat before you conclude.
3. Record the hit rate, not the screenshot. “Named in 7 of 10 runs on ChatGPT, 4 of 10 on Perplexity” is a measurement. “Here is a screenshot” is an anecdote.
4. Tag how you appeared, not just whether. Being cited as a linked source is different from being named in the prose without a link, which is different from being absent. Those need different fixes.
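If you log runs by hand, the four steps reduce to a small tally. This is a sketch under my own assumptions, not BlueJar's implementation: the record layout, the tag names ("cited", "mentioned", "absent"), and the sample counts are all hypothetical.

```python
from collections import Counter, defaultdict

QUERY = "best CRM for a 5-person real estate team"

# Hypothetical run log: one (engine, query, tag) record per run.
RUNS = (
    [("chatgpt", QUERY, "cited")] * 3
    + [("chatgpt", QUERY, "mentioned")] * 4
    + [("chatgpt", QUERY, "absent")] * 3
    + [("perplexity", QUERY, "cited")] * 4
    + [("perplexity", QUERY, "absent")] * 6
)


def summarize(runs):
    """Group runs by (engine, query); report the hit rate and the tag counts."""
    tags = defaultdict(Counter)
    for engine, query, tag in runs:
        tags[(engine, query)][tag] += 1
    summary = {}
    for key, counts in tags.items():
        total = sum(counts.values())
        hits = total - counts["absent"]  # cited or mentioned both count as a hit
        summary[key] = {
            "runs": total,
            "hits": hits,
            "hit_rate": hits / total,
            "tags": dict(counts),
        }
    return summary


if __name__ == "__main__":
    for (engine, query), row in summarize(RUNS).items():
        print(f"{engine}: named in {row['hits']} of {row['runs']} runs "
              f"({row['hit_rate']:.0%}) {row['tags']}")
```

Keeping the tag counts next to the rate matters because two engines with similar rates can still need different fixes, for example one where you are mostly cited and one where you are mostly mentioned without a link.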

The reason this is worth the extra effort is that the gaps you are chasing are often inside the noise. When the Sielinski team built proper confidence intervals around their measurements, “many apparent differences between domains” turned out to “fall within the noise floor of the measurement process” (Sielinski, arXiv 2026). A single check cannot see the noise floor at all, because one number has no spread. A hit rate can.
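To put a spread around a hit rate yourself, one common option is the Wilson score interval. It is my choice for illustration and not necessarily the method the study used; the sketch also assumes independent runs with a stable underlying rate.

```python
import math


def wilson_interval(hits, runs, z=1.96):
    """Wilson score interval for a proportion; z=1.96 is roughly 95% coverage."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= hits <= runs:
        raise ValueError("hits must be between 0 and runs")
    p = hits / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for hits, runs in [(8, 10), (1, 10), (80, 100)]:
        low, high = wilson_interval(hits, runs)
        print(f"{hits} of {runs}: plausible rate between {low:.0%} and {high:.0%}")
```

Under those assumptions, 8 of 10 is consistent with a true rate anywhere from roughly 49% to 94%, while 80 of 100 narrows that to roughly 71% to 87%. More runs do not change the headline rate; they shrink the range around it.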

One check versus a hit rate, side by side

Here is the same situation read two ways. The screenshot column is what a single check reports. The hit-rate column is what is actually true once you run the query ten times.

What you observe	One screenshot says	The hit rate says	What to actually do
You appear in the answer	“We’re visible in ChatGPT”	Named in 8 of 10 runs: strong. Named in 1 of 10: a fluke.	Only the 8/10 is a position worth defending. Re-check the 1/10 before celebrating.
You do not appear	“We’re invisible, big problem”	Could be 0 of 10 (a real gap) or 3 of 10 with a caught miss.	Run it again before you scope the fix. 0/10 and 3/10 are different jobs.
A competitor appears	“They beat us”	The gap between you and them may sit inside the noise floor.	Compare rates, not single answers, before claiming they outrank you.
Result changes next week	“Something broke”	Normal variance plus real drift. Both exist.	Track the rate over re-runs so you can tell a wobble from a trend.
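One way to check whether a gap between you and a competitor clears the noise is an exact test on the two counts. This rough sketch uses Fisher's exact test, which is my pick for illustration and not something the cited studies prescribe. It treats the two sets of runs as independent samples, a simplification when both brands are read from the same answers, and the counts are invented.

```python
from math import comb


def fisher_exact_two_sided(hits_a, runs_a, hits_b, runs_b):
    """Two-sided Fisher exact p-value for 'both brands share one hit rate'."""
    total_hits = hits_a + hits_b
    total_runs = runs_a + runs_b
    denom = comb(total_runs, total_hits)

    def prob(k):
        # Hypergeometric probability that brand A holds k of the total hits.
        return comb(runs_a, k) * comb(runs_b, total_hits - k) / denom

    observed = prob(hits_a)
    lowest = max(0, total_hits - runs_b)
    highest = min(runs_a, total_hits)
    # Sum every possible split that is at least as unlikely as the observed one.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    print("6/10 vs 4/10:", round(fisher_exact_two_sided(6, 10, 4, 10), 4))
    print("9/10 vs 1/10:", round(fisher_exact_two_sided(9, 10, 1, 10), 4))
```

With these invented counts, 6 of 10 against 4 of 10 gives a p-value of about 0.66, so that gap is indistinguishable from noise, while 9 of 10 against 1 of 10 gives about 0.001. A small p-value says the gap is unlikely to be a sampling accident; it says nothing about why the gap exists.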

The engines don’t agree, so measure each one

There is a second layer of variance on top of run-to-run randomness: the engines behave differently from each other. A blended “AI visibility score” averaged across all of them hides which engine you are winning and which you are losing.

The reasons are mechanical. Google’s AI surfaces (AI Overviews and AI Mode) are grounded on Google’s search index, so classic SEO carries over fairly directly. ChatGPT leans more on what the model already learned during training, plus a thinner live-retrieval layer, so your broad presence across the web matters more than any single ranking. Perplexity leans hard on live retrieval and community sources. You can read the deeper per-engine playbooks in our guides on ranking in Perplexity and appearing in Google’s AI Overviews.

You can watch this divergence in the data. A Semrush study of 230,000 prompts, with weekly snapshots from July to October 2025, found that ChatGPT cited Reddit in close to 60% of responses in early August, then that share collapsed to around 10% by mid-September, while Reddit stayed the single top source on Perplexity the whole time (Semrush, 2025). Same source, two engines, completely different behavior, and a swing of 50 points within one engine over a few weeks. If you had checked ChatGPT once in August, you would have drawn the opposite conclusion from someone who checked it once in September.

This is also the cleanest rebuttal to “GEO is just rebranded SEO.” It is roughly true for Google’s grounded surfaces and noticeably false for the rest. Same brand, same query, a different fix per engine. We unpack that argument in full in GEO vs SEO in 2026.

How many runs is enough

You do not need a statistics degree, but the honest answer is “more than one, and it depends on the engine.” The Sielinski study actually measured this. To pin a brand’s citation share to a 95% confidence interval roughly five percentage points wide, they found:

Gemini: about 40 to 50 queries.
Perplexity: about 100 queries.
SearchGPT: 150 or more, because it cites fewer sources per answer and is the noisiest to pin down (Sielinski, arXiv 2026).

Those are research-grade numbers for nailing an exact percentage. For a practical agency or in-house read you do not need that precision on every prompt, but the shape of the lesson holds: a single run is nowhere near enough, the engines need different amounts of sampling, and a brand that looks tied with a competitor after one check may not be after a hundred. The paper’s blunt summary is worth keeping on a sticky note: “a single measurement provides no information about measurement uncertainty.”
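For a rough feel of how precision grows with repeats, here is the textbook normal-approximation formula for a yes/no hit rate. It is not the procedure the study used, and it measures a different quantity (whether you were named in a run, not your share of citations), so its outputs are not comparable to the per-engine figures above. The rates and run counts are assumptions for illustration.

```python
import math


def margin_of_error(hit_rate, runs, z=1.96):
    """Approximate 95% half-width (plus or minus) around a measured hit rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return z * math.sqrt(hit_rate * (1 - hit_rate) / runs)


def runs_needed(expected_rate, half_width, z=1.96):
    """Runs required for a target half-width, using the same approximation."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return math.ceil(z * z * expected_rate * (1 - expected_rate) / half_width ** 2)


if __name__ == "__main__":
    for runs in (1, 10, 30, 100):
        print(f"{runs:>3} runs at a 50% rate: +/- {margin_of_error(0.5, runs):.0%}")
    print("runs for +/- 10 points at a 50% rate:", runs_needed(0.5, 0.10))
```

At an assumed 50% hit rate, 10 runs leave a margin of about 31 points either way and 100 runs leave about 10. The approximation is loose at small counts and near 0% or 100%, so treat it as a planning aid for how many repeats to budget, not as a substitute for a proper interval.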

The good news: the rate is a number you can move

None of this means AI visibility is luck you cannot influence. Once you are measuring a hit rate, you have a number that responds to work. The Princeton and IIT Delhi team that coined “generative engine optimization” tested content changes across 10,000 queries and found that adding relevant statistics to a page lifted its visibility in AI answers by up to 41%, with the biggest gains for pages that started lower, around a 115% lift for pages near position five (Aggarwal et al., SIGKDD 2024).

That is the optimistic read of variance. Your appearance rate is not fixed, which is exactly why it is worth measuring and worth improving. Clearer answers, citation-ready stats, structured content, and a spot on the third-party pages the engines keep quoting will all move it. Our guide on citation readiness covers the on-page side. The hit rate is how you prove the work landed, instead of guessing from one lucky screenshot.

And the reason the in-answer mention is worth fighting for at all: the click is leaving. Pew Research tracked 900 US adults across 68,879 Google searches and found that when an AI summary appears, people click a traditional result in just 8% of visits, versus 15% without one, and click a link inside the summary only 1% of the time (Pew Research, 2025). If the user reads your name in the answer and never clicks, being named is the prize, and how often you are named is the scoreboard.

How BlueJar measures the rate, not the screenshot

This is the whole reason BlueJar runs a panel of more than 400 prompts per analysis across ChatGPT, Perplexity, Gemini, and Copilot, rather than a spot check. Each prompt is one draw from the deck. The panel is how you see the distribution instead of a single card. The prompts are not random either: they come from a structured 7-zone by 4-type by 3-funnel matrix, so the rate is broken out by query intent, by buyer stage, and by engine, not flattened into one vanity number.

One honest caveat, because we sell against the hype: BlueJar is a point-in-time analysis you re-run on a cadence you choose. It is not a 24/7 monitor, and we will not pretend it is. What it gives you is a defensible read of your hit rate today, with the gaps tagged as cited-not-mentioned, mentioned-not-cited, or fully invisible, so you know which lever to pull. For the manual companion to this, see how to track brand mentions across AI engines, and for how the rate rolls up into a single figure, what a GEO score is.

Stop guessing from one screenshot. Run your first analysis free at bluejar.ai to see your real hit rate across ChatGPT, Perplexity, Gemini, and Copilot, with every prompt and full AI response behind the number.

Frequently asked questions

Why does ChatGPT give a different answer to the same query?

Because the model writes each answer one word at a time and samples the next word from a set of probabilities rather than always picking the single likeliest one. Run it again and some of those choices land differently. This happens even at the lowest randomness setting: one test got 80 different outputs from 1,000 runs of the same prompt at temperature zero (Thinking Machines Lab, 2025).

How many times should I run a prompt to measure AI visibility?

More than once, and it depends on the engine. To pin citation share to a tight confidence interval, one 2026 study needed about 40 to 50 runs on Gemini, about 100 on Perplexity, and 150 or more on SearchGPT (Sielinski, arXiv 2026). For a practical read you can use fewer, but a single run gives you no information about how stable the result is.

Is one ChatGPT screenshot enough to prove my brand is visible?

No. A screenshot tells you that you appeared that one time, not how often. Named in 8 of 10 runs is a strong position; named in 1 of 10 is a fluke, and both produce an identical-looking screenshot if you only checked once. The same goes for a “we’re not there” screenshot, which might be a 3-of-10 result with a caught miss.

What is a good hit rate for AI search?

There is no universal pass mark, because it varies by engine and by how competitive the query is. The useful comparison is relative: your rate versus named competitors on the same prompts, and your rate over time on re-runs. A gap that looks decisive after one check often sits inside the measurement noise, so compare rates, not single answers (Sielinski, arXiv 2026).

Do all AI engines behave the same way?

No, and that is why a single blended score is misleading. Google’s AI surfaces are grounded on its search index, ChatGPT leans more on what the model learned in training plus light live retrieval, and Perplexity leans on live retrieval and community sources. Semrush watched ChatGPT’s Reddit citation share swing from about 60% to about 10% in weeks while Reddit stayed the top source on Perplexity (Semrush, 2025). Measure each engine on its own.

How does BlueJar handle AI search variance?

BlueJar runs a panel of more than 400 prompts per analysis across ChatGPT, Perplexity, Gemini, and Copilot, using a 7-zone by 4-type by 3-funnel matrix, so you see a hit rate broken out by intent, stage, and engine instead of one screenshot. It is a point-in-time analysis you re-run on a cadence you choose, not a continuous monitor, and it tags each result as cited, mentioned, or invisible so you know which fix to apply.
