Synthetic Testing & Digital Twins for Marketing Research

Domain 7: Customer Intelligence & Synthetic Testing

The newest standalone domain in the OS. The work is pre-launch validation, audience simulation, message testing, and customer behavior modeling, most of which wasn't possible at this scale or speed before generative AI. The domain owns synthetic personas (AI-generated audience profiles for concept testing), digital twins (continuously evolving models of actual customers), synthetic panels (large-scale simulated audiences for survey-style research), message testing pre-launch, packaging and pricing simulation, and the validation infrastructure that decides what you ship versus what dies in a draft folder.

"It's the struggling moment where they can't do something that causes them to take the leap." (Bob Moesta, Inside Intercom podcast, May 17, 2018)

"People buy things to help them make progress." (Bob Moesta, same Intercom interview)

See also: Mahmoud's customer-research-playbook for interview craft and JTBD methodology, Mahmoud's competitor-research-playbook for synthetic competitor-persona patterns, Domain 1 (Sensing) for sales-call mining as Domain 7 fuel, Domain 2 (Strategy) for synthetic and Wynter sequencing on positioning shifts, Domain 3 (Content) for the customer-language vector store, Domain 8 (Measurement) for synthetic-to-live correlation as a continuous KPI, Domain 0 (AgentOps) for governance on synthetic outputs in regulated industries.

Why this matters now

McKinsey estimates agentic AI could support up to two-thirds of current marketing activities, synthetic audience testing included. The category has attracted more than $1B in disclosed venture capital across 2023 to 2026, with confirmed major rounds at Simile ($100M Series A, Feb 2026; Index Ventures, Bain Capital, and A*, plus angels Fei-Fei Li and Andrej Karpathy), Aaru ($50M+ Series A at a $1B headline valuation, Dec 2025, Redpoint, though Aaru's ARR is still under $10M), Listen Labs ($69M in 2025), Outset ($17M Series A from 8VC), and Keplar ($3.4M from Kleiner Perkins in 2025). Listen Labs, Outset, and Keplar are AI-interview platforms that talk to real humans rather than synthetic-respondent vendors, a distinction covered later in this section.

The capability is real. Park et al.'s 2024 paper, "Generative Agent Simulations of 1,000 People", ran two-hour AI interviews with 1,052 participants and found that the resulting agents replicated the participants' General Social Survey responses 85% as accurately as the same participants replicated their own answers two weeks later. The combined interview-plus-survey agents reached 86%, versus 74% for demographic-only agents. The interview-based agents also reduced racial and ideological bias by 36 to 62%. (The "94% accuracy" number you'll see cited everywhere is a miscitation; the actual Park figure is 85%.) Toubia et al.'s 2025 Twin-2K-500 reference dataset covers 2,058 US participants across 500-plus questions in four waves and is the cleanest validation set the field has.
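The 85% figure is a ratio, not a raw hit rate: the agent's agreement with a participant's answers is divided by that participant's own two-week test-retest agreement. A minimal sketch of the arithmetic follows. The function name and the input values are illustrative assumptions, not figures taken from the paper.

```python
def normalized_accuracy(agent_accuracy: float, self_consistency: float) -> float:
    """Agent accuracy as a fraction of the participant's own test-retest
    consistency, which is the ceiling a perfect agent could reach."""
    if not 0 < self_consistency <= 1:
        raise ValueError("self_consistency must be in (0, 1]")
    return agent_accuracy / self_consistency


# Illustrative values only (assumed, not from Park et al.):
# the agent matches 68 of 100 answers; the participant, re-surveyed two
# weeks later, matches 80 of 100 of their own earlier answers.
agent_raw = 0.68
participant_retest = 0.80
score = normalized_accuracy(agent_raw, participant_retest)
print(f"normalized accuracy: {score:.2f}")
```

Read this way, "85%" does not mean the agent got 85% of answers right. Raw agreement is lower, because people are not perfectly consistent with themselves either.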

The cautions are also real. The same Toubia team published the Funhouse Mirrors paper in 2025 showing average twin-to-human correlation of 0.197, which is roughly the correlation between height and intelligence. Twin standard deviation was lower than human in 93.9% of cases. The canonical critique from outside academia is Conjointly's "Synthetic Respondents Are the Homeopathy of Market Research", which documented income variance from $111,348 to $272,014 from prompt rephrasing alone. Nielsen Norman Group's stance is the most-cited UX-research line on this: "synthetic users help with hypothesis generation, not validation."
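Both Funhouse Mirrors statistics can be checked on your own paired data. The sketch below shows one plausible way to compute a mean twin-to-human correlation and the share of questions where the twins' spread is narrower than the humans'. The paper's exact procedure may differ, and the ratings, question names, and function names here are made up for illustration.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def compression_report(questions):
    """questions maps a question id to (human_answers, twin_answers),
    where position i in both lists is the same person and their twin."""
    correlations = []
    twin_sd_lower = 0
    for human, twin in questions.values():
        correlations.append(pearson(human, twin))
        if pstdev(twin) < pstdev(human):
            twin_sd_lower += 1
    return {
        "mean_correlation": mean(correlations),
        "share_twin_sd_lower": twin_sd_lower / len(questions),
    }


# Hypothetical 1-5 ratings from six people and their twins (made-up data).
sample = {
    "price_fairness": ([1, 5, 2, 4, 3, 5], [3, 4, 3, 4, 3, 4]),
    "purchase_intent": ([2, 2, 5, 1, 4, 3], [3, 3, 3, 4, 3, 3]),
    "brand_trust": ([4, 4, 5, 4, 5, 4], [2, 5, 1, 5, 3, 4]),
}
report = compression_report(sample)
print(report)
```

A high share of questions with a lower twin standard deviation is the warning sign: the panel is clustering around the middle and will understate how polarized real buyers are.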

The honest middle is that synthetic testing is a useful pre-test layer for narrowing concepts before live validation. It is not a substitute for talking to actual customers.

Six pieces of work in this domain

Synthetic persona generation is building AI personas from your own first-party data: CRM records, survey responses, customer interview transcripts. You give the model enough real material that it can construct profiles representative of the buyers you actually have, then you ask those personas questions and watch how they react under different conditions. Built on imagination alone (no real data, just an LLM), the personas reflect the LLM's training-data bias, typically skewing younger, more educated, and more liberal than your actual customers.

Concept and message testing is using those personas, or simulated panels, to pre-test things before you spend real money. Does this headline land? Is this value prop sharper than the alternative? Will this price point feel reasonable or insulting? Will this packaging confuse buyers? Cheap to do at scale. Useful as a filter. Not a substitute for live testing.

Digital twin modeling is the more ambitious version: continuously updating models of individual customers, used for personalization at scale, behavioral prediction, and simulating customer experience changes before you make them. CVS Health is the most-cited consumer example, with 100,000+ "agentic twins" built on 2.9 million consented responses across 200+ behavioral scenarios. Their own announcement is explicit that simulations don't replace real-world research.

Synthetic panels and survey-style research are large simulated audiences (n=1,000 to 5,000-plus) that look demographically like a target market segment. Useful when real recruiting is hard or expensive: niche audiences, regulated populations, markets that would take weeks to assemble in person.

Validation infrastructure is the unglamorous but essential layer: calibrating synthetic predictions against real customer behavior, detecting and correcting bias, documenting your methodology so other people can trust it, and routing high-stakes decisions to live research instead of relying on the synthetic output. Without this layer, the rest reads as homeopathy.

Customer voice capture overlaps with Domain 1 but lives here too. Interview transcript synthesis, review mining for verbatim language, sales call analysis, support-ticket pattern recognition. The work of pulling actual customer speech out of the data you already have and feeding it back into your messaging.

What works in 2026

Use synthetic audiences as a front-end filter, not a finish line. They're good at producing testable hypotheses (narrowing twenty concepts to three before you spend on live testing). They're not validated truth. The strongest teams run synthetic and live in sequence, not as substitutes for each other.

Build your personas on real first-party data. Garbage in, garbage out, aggressively. Reliable persona generation starts with customer interviews, CRM patterns, support logs, social listening, and behavioral data you actually own. Personas built only from public data and an LLM's training data drift toward the demographics LLMs over-represent.

Don't rely on synthetic output for consequential decisions in regulated sectors. In healthcare, finance, anything involving minors, anything with regulator exposure, synthetic persona output should never be the sole basis for messaging or product decisions. Human review, legal review, and live research remain non-negotiable.

Calibrate continuously. A synthetic persona is a model, and models drift. Compare your synthetic predictions to actual customer behavior on every campaign and product launch, then adjust. This is the line between rigorous applied research and a confidence-boosting illusion.

Document the inputs. When someone on your team says "the persona predicts strong adoption," they should also be able to explain why, based on what inputs, and with what known limitations. Transparency is what earns the trust to use this work in real decisions.

Use synthetic for breadth, live for depth. Synthetic panels are unmatched for quick volume testing: fifty ad creatives in a day, twenty pricing variations across segments. Live research is unmatched for understanding why people react the way they do: the cultural nuance and tacit knowledge that synthetic simply doesn't have access to.

Tools & Platforms

Pure-Play Synthetic Research Platforms

Simile ($100M from Index Ventures, Feb 2026): founded by Stanford's Joon Sung Park (inventor of generative agents), Michael Bernstein, and Percy Liang. The deepest pedigree in the space.
Aaru: synthetic audiences with a focus on consumer behavior
Ditto (askditto.io): research platform with extensive market mapping
Evidenza: synthetic research for B2B
SYMAR: synthetic market research
Synthetic Users: UX-research-focused
Ask Rally: virtual focus groups; GenPop panel calibrated on real interviews
Delve AI: persona generation + digital twin chat

Embedded in Survey Platforms

Qualtrics Edge Audiences: synthetic respondents in the world's largest survey platform; fine-tuned on 25+ years of Qualtrics research data
Toluna HarmonAIze: synthetic respondents from 79M-member panel data
YouGov (via Yabble acquisition): AI-augmented insights

Hybrid AI + Human Panels

Quantilope: AI for analysis and survey design, human respondents
Remesh: AI moderation, real participants
Conjointly: pricing and feature research

B2B-Specific

Wynter: message testing with verified B2B audience pools (~$299-$1,000+/mo)
PyMC Labs: Bayesian-modeled synthetic consumers; Fortune 500 deployments

Custom Builds

Anthropic Claude / OpenAI GPT: custom system prompts + structured persona docs
Retell AI / Vapi: voice-based synthetic interviews

Notable Practitioners & Frameworks

Joon Sung Park (Stanford, now Simile): pioneer of generative agents. The Park et al. 2024 paper (1,000 People) is the foundational reference.
Michael Bernstein, Percy Liang (Stanford): co-architects of generative agents.
PyMC Labs team: Bayesian methodology. "LLMs Reproduce Human Purchase Intent" (2025). Semantic Similarity Rating method; 57 surveys, 9,300 human responses (Colgate-Palmolive collaborators); 90% correlation on product ranking, 85%+ distributional similarity.
Bob Moesta: JTBD (real-customer research; informs synthetic).
Indi Young: listening as a research practice (the "why" that synthetic can't fully replicate).
Olivier Toubia (Columbia): lead author on Twin-2K-500 + Funhouse Mirrors; the academic counterweight to vendor claims.
Ray Poynter (NewMR): practitioner-side critic on synthetic-data limits.

Named Case Studies

Case	What they did	Result	Notes
CVS Health × Simile	Built generative agents on 2.9M consented responses from 400,000+ participants across 200+ behavioral scenarios. Operates 100,000+ "agentic twins"	Use case: medication adherence drivers; twins surfaced trust/confidence/convenience as primary, barriers as confusion/refill anxiety/prior frustration	Critical caveat from CVS's own announcement: "simulations don't replace real-world research" (they prioritize what to test). Governance monitors tone, fairness, safety. Strongest consumer-healthcare example
EY × Evidenza	C-suite executive research; "Synthetic CMO" feature with Sharp/Ritson/Binet/Field clones	EY CMO Toni Clayton-Hine reported 95% correlation with EY's actual Global Brand Survey of C-suite execs. Confirmed clients: BlackRock, Microsoft, JP Morgan, Salesforce, Dentsu, ServiceNow	Self-reported correlations not externally audited; pricing ~$50K-$100K/yr
Aaru, 2024 NY Democratic primary	~5,000 AI agents predicting the election	Within 400 votes of the actual result at ~1/10 the cost of traditional polls	Strongest prediction-validation in the public record; the weakness is that a one-off election result doesn't prove a repeatable methodology (Nate Silver's team published the critical "AI polls are fake polls" piece)
Park 1,000 People + NN/G three-study evaluation (calibration)	Where synthetic & real agreed: directional preferences, demographic patterns, personality dimensions (Big Five 0.80)	Where they diverged: behavioral data (online courses, where synthetic claimed completion when real users hadn't), drone delivery (synthetic favorable, real users impractical), dog-food purchase intent (synthetic SD lower than human, magnitude off)	Twin-to-human correlation averaged 0.197 in Columbia mega-study
PandaDoc × Wynter (the anti-synthetic control)	50-person verified B2B marketing panel on PandaDoc messaging	Found "on-brand docs" was confusing/generic for the ICP, where live human responses caught what synthetic might have missed	12 to 48 hour turnaround
Regulated-industry limit	No FDA regulation specifically governs synthetic personas. CVS's pattern (never sole-basis decision-making, governance layer) is the de facto regulated-industry playbook	Pharma firms using synthetic for HCP message testing keep a human, IRB-approved validation pass on every consequential decision	FDA's 2024 draft guidance on AI in drug development is silent on synthetic respondents

Tools & Platforms, head-to-head

Pure-play synthetic platforms

Platform	Pedigree	Pricing	Best fit	Watch for
Simile	Stanford founders (Park/Bernstein/Liang); $100M Index	Enterprise (undisclosed)	Fortune 100 longitudinal twins; CVS-style 100K-agent deployments	Newest; bandwidth limited to large accounts
Aaru	Teen-founder team; $1B headline; Redpoint	Enterprise	Election-style large-N consumer simulations; corporate executive simulations (Lumen)	<$10M ARR; valuation ahead of revenue
Ditto / FishDog	300K pre-built population-true personas	$50K-$75K/yr unlimited	Self-serve consumer brand testing; Figma/Canva/Framer integrations	Now FishDog after rename
Synthetic Users	Kwame Ferreira; UX focus	$2-$27 per interview, +$5 RAG	UX hypothesis generation; pre-research scoping	NN/G explicitly cautions against using as research replacement
Evidenza	Lombardo + Weinberg ex-LinkedIn B2B Institute	~$50K-$100K/yr	B2B CMO-level brand strategy; "Synthetic CMO" feature	No self-serve, no API, 72-hour turnaround
Ask Rally	Calibrated GenPop panel via Turing test	Mid-market	Rapid small-to-medium decision testing	Calibration is per-persona; uneven coverage

Embedded-in-platform

Platform	Approach	Source pool
Qualtrics Edge Audiences	Fine-tuned proprietary LLM on 25+ yrs of Qualtrics studies	Booking.com, Dollar Shave Club, Gabb
Toluna HarmonAIze	Each persona = individual synthetic respondent (not segment average)	19.4M-member US panel (now expanding UK/FR)
YouGov (via Yabble)	AI insight layer on existing panel	YouGov panel

B2B-specific

Platform	Method	Numbers
Wynter	Verified human B2B panel (saturation methodology, 12-13 responses)	70K-80K verified B2B professionals; LinkedIn + corporate-email verified; from $798/mo, 12-48hr
PyMC Labs	Bayesian-modeled synthetic + Semantic Similarity Rating	Custom; 90% product-ranking correlation, 85%+ distributional similarity

Decision frame: Wynter when you need real B2B humans on your ICP for high-stakes message, positioning, or pricing decisions. PyMC Labs when you need scientifically grounded synthetic at scale and can invest in custom Bayesian validation. Evidenza when the buying committee is C-suite Fortune 500 and you can absorb $50K+ engagements.

Critical distinction: AI-interview hybrid ≠ synthetic

Listen Labs ($69M Sequoia), Outset ($17M 8VC), and Keplar ($3.4M Kleiner Perkins) all use AI to interview real humans at scale. These are NOT synthetic; they are AI-moderated qualitative research. The distinction is often lost in press coverage.

Tactical Playbooks

Playbook A. Synthetic persona from first-party data (Claude Project build)

Export to Markdown: 30 customer interview transcripts + last 90 days of support tickets + last 200 sales-call transcripts (Gong/Fireflies) + open-ended NPS verbatims.
Cluster by JTBD using Claude (3-5 distinct jobs).
For each job, build a Claude Project with: ICP firmographics, top-3 quotes per pain dimension, top-3 trigger events, observed-language vocabulary list, list of objections actually voiced (verbatim).
System prompt: "You are [Persona]. Answer ONLY using language and frames from the provided transcripts. If asked something outside these transcripts, say 'I don't know' rather than inventing." (Explicitly defends against the sycophancy problem NN/G flagged.)
Validate quarterly: pose the same 5 questions to 3 real customers and compare. If correlation drops, refresh the corpus.

Cross-link: This is the bridge between Mahmoud's customer-research-playbook (real interviews) and synthetic. The playbook produces the inputs; the Project produces the simulator.
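The persona doc and system prompt from the build steps above can be assembled programmatically so every persona carries the same guardrail wording. This is a minimal sketch under stated assumptions: the PersonaDoc fields, the build_system_prompt helper, and the example persona are all hypothetical, and the sketch only builds the prompt string. It does not call any model API.

```python
from dataclasses import dataclass, field

# Guardrail wording taken from the playbook's system prompt step.
GUARDRAIL = (
    "Answer ONLY using language and frames from the provided transcripts. "
    "If asked something outside these transcripts, say 'I don't know' "
    "rather than inventing."
)


@dataclass
class PersonaDoc:
    """Hypothetical container for one JTBD cluster's persona material."""
    name: str
    job_to_be_done: str
    firmographics: str
    quotes: list = field(default_factory=list)
    trigger_events: list = field(default_factory=list)
    vocabulary: list = field(default_factory=list)
    objections: list = field(default_factory=list)


def _bullets(items):
    """Render a list as plain-text bullets, flagging empty sections."""
    if not items:
        return "- (none recorded)"
    return "\n".join(f"- {item}" for item in items)


def build_system_prompt(doc: PersonaDoc) -> str:
    sections = [
        f"You are {doc.name}. {GUARDRAIL}",
        f"Job to be done: {doc.job_to_be_done}",
        f"Firmographics: {doc.firmographics}",
        "Verbatim quotes:\n" + _bullets(doc.quotes),
        "Trigger events:\n" + _bullets(doc.trigger_events),
        "Observed vocabulary:\n" + _bullets(doc.vocabulary),
        "Objections actually voiced:\n" + _bullets(doc.objections),
    ]
    return "\n\n".join(sections)


# Made-up example persona for illustration only.
olivia = PersonaDoc(
    name="Ops Lead Olivia",
    job_to_be_done="Get contracts signed without chasing approvals by email",
    firmographics="200-500 employee B2B SaaS, RevOps team of 3",
    quotes=["I lose a day every time legal wants a redline."],
    trigger_events=["Missed quarter-end close because of a stuck contract"],
    vocabulary=["redline", "stuck in legal", "quarter-end"],
)
prompt = build_system_prompt(olivia)
print(prompt)
```

Empty sections are rendered as "(none recorded)" rather than dropped, so a reviewer can see at a glance which inputs the persona is missing.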

Playbook B. 50 ad creatives in a day, narrowed to 3

Generate 50 ad-creative variations (Claude or Midjourney); store in spreadsheet
Run synthetic panel on Ditto/Ask Rally for each variant: predicted CTR rank, recall rank, brand-fit rank
Filter to top 10 (kill bottom 80% on synthetic alone; "front-end filter, not finish line")
Expert review (PMM + creative director) prunes 10 → 5
Live test 5 on Wynter (B2B) or Meta CBO-test (B2C)
Final 2-3 get media spend

Why this works: synthetic is high-recall low-precision; live is low-recall high-precision. The funnel respects each layer's strengths.
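The first cut of the funnel (50 variants down to the top 10) can be sketched as a rank aggregation over the three synthetic-panel outputs. Averaging the three ranks is an assumption made here for illustration; the playbook does not specify how the ranks are combined, and the field names and generated data are hypothetical.

```python
import random


def mean_rank(variant):
    """Average of the three synthetic ranks (1 = best, lower is better)."""
    return (
        variant["ctr_rank"] + variant["recall_rank"] + variant["brand_fit_rank"]
    ) / 3


def shortlist(variants, keep=10):
    """Keep the `keep` variants with the best mean rank; ties break by id."""
    ordered = sorted(variants, key=lambda v: (mean_rank(v), v["id"]))
    return ordered[:keep]


def shuffled_ranks(rng, n):
    ranks = list(range(1, n + 1))
    rng.shuffle(ranks)
    return ranks


# Hypothetical panel output for 50 creatives (seeded so it is repeatable).
rng = random.Random(7)
ids = [f"ad-{n:02d}" for n in range(1, 51)]
variants = [
    {"id": i, "ctr_rank": c, "recall_rank": r, "brand_fit_rank": f}
    for i, c, r, f in zip(
        ids, shuffled_ranks(rng, 50), shuffled_ranks(rng, 50), shuffled_ranks(rng, 50)
    )
]

top10 = shortlist(variants, keep=10)
print([v["id"] for v in top10])
```

The output of this step is only a shortlist. Expert review and live testing still decide which of the ten earn media spend.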

Playbook C. Synthetic + live in sequence

Discover (live): 12-15 customer interviews. Bob Moesta switch interviews, Indi Young listening sessions
Hypothesize (synthetic): Translate findings into 5-10 testable concepts; run on synthetic panel for directional ranking
Validate (live): Top 2-3 concepts go to Wynter / focus groups / live A/B
Calibrate (continuous): After every launch, compare synthetic prediction vs. live result; track synthetic-to-live correlation as a KPI. If it falls below 0.7, treat the persona corpus as stale and refresh it.
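The calibration step can be sketched as a rank correlation between what the synthetic panel predicted and what live testing showed, checked against the 0.7 threshold. Using Spearman rank correlation is an assumption made here, since the playbook does not say which correlation to track, and the function names and scores below are made up.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


STALE_BELOW = 0.7  # threshold from the playbook


def corpus_is_stale(synthetic, live, threshold=STALE_BELOW):
    return spearman(synthetic, live) < threshold


# Hypothetical scores for five launched concepts (made-up data).
synthetic_scores = [0.9, 0.7, 0.5, 0.3, 0.1]
live_last_quarter = [0.8, 0.9, 0.4, 0.2, 0.3]
live_this_quarter = [0.2, 0.9, 0.1, 0.8, 0.4]

rho_last = spearman(synthetic_scores, live_last_quarter)
rho_now = spearman(synthetic_scores, live_this_quarter)
print(round(rho_last, 2), corpus_is_stale(synthetic_scores, live_last_quarter))
print(round(rho_now, 2), corpus_is_stale(synthetic_scores, live_this_quarter))
```

Rank correlation fits the way the playbooks use synthetic output, as a directional ranking rather than a point forecast. With only a handful of launches the estimate is noisy, so treat a single dip as a prompt to investigate rather than proof of drift.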

Cross-References to Mahmoud's Skills

customer-research-playbook owns interview craft, JTBD, switch interviews, listening, and mental models. Domain 7 does not duplicate any of that. Domain 7 picks up where the playbook ends: once you have 30 real interviews, what synthetic layer should you build, and where does it break?
Domain 1 (Sensing) overlap: sales-call mining is a Domain 1 input that becomes Domain 7 fuel.
Domain 8 (Measurement) overlap: synthetic-to-live correlation IS the measurement metric for Domain 7.

Industry overlay (Q2 2026)

Industry	ICP / motion difference	Tools that win	Biggest pitfall	Compliance overlay
B2B SaaS	Synthetic personas built on Gong + Zendesk + interview corpus; Wynter for B2B panel validation; calibrate quarterly	Wynter ($799+/mo verified B2B); PyMC Labs for Bayesian rigor; Claude Project personas from first-party data; Synthetic Users for UX	Treating synthetic as truth; Park 1,000 People shows 85% replication ceiling, Toubia "Funhouse Mirrors" finds 0.197 average correlation. Front-end filter only	None
Biopharma	Synthetic HCP/patient personas are advisory only, never a substitute for IRB-approved research. CVS × Simile is the published pattern. EY × Evidenza for C-suite	Simile (Stanford pedigree, $100M Index); Evidenza (~$50-100K, "Synthetic CMO" with KOL clones); KOL advisory boards (gold standard); patient panels via Rare Patient Voice/Carenity	Using synthetic patient response to drive a label change, MoA messaging, or clinical claim, which regulators view as inadequate basis; payer tie-in falls apart	FDA Mar 2026 NAM draft guidance allows digital twins in trials; IRB review on patient-facing tests; HIPAA on synthetic patient cohorts built from real PHI; ABPI/PhRMA Code on HCP simulation
DTC	Synthetic ad pre-test, 50 creatives down to 10 down to 5, then live Meta CBO; concept ranking before media spend	Ditto/FishDog (300K personas, $50-75K/yr); Ask Rally; Suzy for human panel; Meta Advantage+ Lift Studies for live	Trusting synthetic CTR predictions and running media on them; synthetic-to-live correlation breaks down at scale for DTC, so always live-test the top 3	FTC truth-in-advertising on testing claims if marketed externally
Dev tools	Synthetic developers are the weakest area, since LLMs can't simulate "I tried it, the SDK threw an error." Real DX testing on Discord/beta lists wins	Real beta programs; UserTesting.com with engineer-screen; Maze for unmoderated dev research; Synthetic Users only for hypothesis generation	Using synthetic devs to validate API ergonomics, which gives you what reads well rather than what compiles. Will mislead DX decisions	None beyond standard

Key insight: Biopharma's synthetic-testing posture is uniquely advisory only. CVS Health's own framing is that 100K-twin simulations prioritize what to test next, never replace IRB-approved research. Any synthetic output that informs a clinical claim, label, or HCP message must pass through human medical review. This is the single sharpest compliance overlay across all 8 domains.

Common Failure Modes

Treating synthetic as truth. Synthetic outputs are plausible-sounding by design; that doesn't make them right.
Bias amplification. LLMs underrepresent older, more conservative, less-educated demographics. Synthetic panels built on default LLM behavior reproduce this bias.
Skipping real-customer validation for high-stakes decisions. Pricing, positioning shifts, and product launches need real-world confirmation, not just synthetic confidence.
Static personas. Audiences change. A persona built in 2024 may be wrong in 2026. Refresh the underlying data.
Conflating synthetic personas with digital twins. Personas are segment-level abstractions; digital twins model individual customers. Different tools for different jobs.

KPIs

Concept hit rate (% of concepts that pass synthetic + advance to live testing)
Synthetic-to-live correlation (how closely do synthetic predictions track live results?)
Time-to-validated-concept (synthetic compresses this dramatically)
Cost-per-validated-concept (synthetic should be 10x+ cheaper than traditional)
Concept abandonment rate before media spend (synthetic should kill more bad ideas, sooner)
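Most of these KPIs fall out of a simple per-concept log. The sketch below is one possible bookkeeping scheme, not a prescribed one. The Concept fields and the exact KPI definitions (for example, counting abandonment as any concept that never reached validated status) are assumptions, and the sample log is made up. Synthetic-to-live correlation needs paired scores and is left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Concept:
    """Hypothetical log entry for one tested concept."""
    name: str
    passed_synthetic: bool
    validated_live: Optional[bool]       # None = never reached live testing
    days_to_validation: Optional[float]  # only set for validated concepts
    cost: float                          # total research spend on this concept


def kpis(concepts):
    total = len(concepts)
    if total == 0:
        raise ValueError("no concepts logged")
    advanced = [c for c in concepts if c.passed_synthetic]
    validated = [c for c in concepts if c.validated_live]
    spend = sum(c.cost for c in concepts)
    days = [c.days_to_validation for c in validated if c.days_to_validation is not None]
    return {
        # share of concepts that passed synthetic and advanced to live testing
        "concept_hit_rate": len(advanced) / total,
        # share of concepts killed before media spend (assumed definition)
        "abandonment_rate": (total - len(validated)) / total,
        "time_to_validated_concept": sum(days) / len(days) if days else None,
        "cost_per_validated_concept": spend / len(validated) if validated else None,
    }


# Made-up log for illustration.
log = [
    Concept("A", True, True, 10, 400.0),
    Concept("B", True, False, None, 400.0),
    Concept("C", False, None, None, 50.0),
    Concept("D", False, None, None, 50.0),
    Concept("E", True, True, 14, 400.0),
]
summary = kpis(log)
print(summary)
```

Cost per validated concept divides all research spend, including spend on concepts that were killed, by the number of survivors. That keeps the metric honest about what the filter costs.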

Resources for Deeper Study

YouTube channels

PyMC Labs: methodological depth
Stanford HAI: academic foundation
NN/G (Nielsen Norman Group): UX research foundations
Marketing Science Institute: academic marketing research

Podcasts

Marketing Today with Alan Hart
The Bob Moesta Show (Jobs-to-be-Done)
Indi Young's Listening podcast

Books

Demand-Side Sales 101 (Bob Moesta)
When Coffee and Kale Compete (Alan Klement)
Practical Empathy (Indi Young)
Interviewing Users (Steve Portigal)

Foundational Papers

Park et al. (2023), "Generative Agents: Interactive Simulacra of Human Behavior". Stanford foundational paper
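Park et al. (2024), "Generative Agent Simulations of 1,000 People". Source of the 85% replication figure
Toubia et al. (2025), Twin-2K-500 reference dataset and the Funhouse Mirrors paper. Calibration data and the 0.197 average correlation
PyMC Labs (2025), "LLMs Reproduce Human Purchase Intent". Semantic Similarity Rating method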

v3 (shipped Apr 2026)

Park 85% accuracy correction (was miscited as 94%)
Toubia 'Funhouse Mirrors' critique (0.197 average twin-to-human correlation, 93.9% twin-SD-lower-than-human)
5 named cases (CVS Health × Simile 100K twins, EY × Evidenza 95% correlation, Aaru NY primary <400 votes, PandaDoc × Wynter live caught what synthetic missed, PyMC Labs 90%/85% validation)
Pure-play vs. embedded vs. B2B-specific tooling comparison (Simile / Aaru / Ditto / Synthetic Users / Evidenza / Qualtrics Edge / Wynter / PyMC)
AI-interview-hybrid distinction explicitly called out (Listen Labs / Outset / Keplar are NOT synthetic)
Moesta verbatim quotes (incl. 'It's the struggling moment where they can't do something that causes them to take the leap')
$1B+ disclosed VC across 2023-2026 reframed (was '$1.5B' without sourcing)
3 tactical playbooks (synthetic persona from first-party data, 50→3 ad creatives, synthetic+live in sequence)
Industry overlay (biopharma 'advisory only' framing especially sharp) + cross-references (5 inter-domain + 2 skills)

v4 deferred

Continuous-calibration methodology (synthetic-to-live correlation tracking as a continuous KPI) with a named-brand case
First regulated-industry public failure case to clarify boundaries (FDA/SEC enforcement action)

See research-plan.md for the master v3 changelog and v4 forward plan.

Frequently asked questions about customer intelligence and synthetic testing

What's the actual accuracy of synthetic personas?

Park et al. (2024) Generative Agent Simulations of 1,000 People: agents replicated participants' GSS responses 85% as accurately as participants replicated their own answers two weeks later. NOT 94%; the 94% figure widely cited online is a miscitation. Combined interview + survey agents reached 86% vs. demographic-only at 74%. Counter-data: Toubia et al. (2025) 'Funhouse Mirrors' found average twin-to-human correlation of 0.197 (≈ height vs. intelligence) and twin standard deviation lower than human in 93.9% of cases. Synthetic narrows what to test; it doesn't decide.

Synthetic personas vs. live B2B panels: which wins?

Sequence them. Synthetic (custom Claude Project, Synthetic Users, Ask Rally) for hypothesis generation and rapid concept screening, such as the first cut when narrowing 50 ad creatives toward 3. Live B2B panels (Wynter, $799+/mo, 70K-80K verified B2B professionals) for validation before consequential decisions. PyMC Labs offers Bayesian-grounded synthetic with documented validation (90% correlation on product ranking, 85%+ distributional similarity across 57 surveys, 9,300 human responses). NN/G's stance: 'synthetic users help with hypothesis generation, not validation.'

How does CVS Health use 100,000 synthetic twins?

CVS built generative agents on 2.9M consented responses from 400,000+ participants across 200+ behavioral scenarios (announced 2025). Operates 100,000+ 'agentic twins' for journey-friction analysis, hard-to-reach population testing, and product testing at speed. Critical caveat from CVS's own announcement: simulations don't replace real-world research. Governance monitors tone, fairness, safety. This is the canonical regulated-industry pattern: synthetic prioritizes what to test next, never the sole basis for clinical or messaging decisions.

Are AI-interview platforms (Listen Labs, Outset) the same as synthetic research?

No. Listen Labs ($69M Sequoia, 2025), Outset ($17M 8VC), and Keplar ($3.4M Kleiner Perkins) use AI to interview real humans at scale. They are AI-moderated qualitative research, not synthetic respondents. The distinction matters and is often lost in press coverage: AI-interview hybrids interview humans; synthetic research generates plausible responses without humans. Different tools for different jobs.