Domain 7: Customer Intelligence & Synthetic Testing
The newest standalone domain in the OS. The work is pre-launch validation, audience simulation, message testing, and customer behavior modeling, most of which wasn't possible at this scale or speed before generative AI. The domain owns synthetic personas (AI-generated audience profiles for concept testing), digital twins (continuously evolving models of actual customers), synthetic panels (large-scale simulated audiences for survey-style research), message testing pre-launch, packaging and pricing simulation, and the validation infrastructure that decides what you ship versus what dies in a draft folder.
"It's the struggling moment where they can't do something that causes them to take the leap." (Bob Moesta, Inside Intercom podcast, May 17, 2018)
"People buy things to help them make progress." (Bob Moesta, same Intercom interview)
See also: Mahmoud's customer-research-playbook for interview craft and JTBD methodology, Mahmoud's competitor-research-playbook for synthetic competitor-persona patterns, Domain 1 (Sensing) for sales-call mining as Domain 7 fuel, Domain 2 (Strategy) for synthetic and Wynter sequencing on positioning shifts, Domain 3 (Content) for the customer-language vector store, Domain 8 (Measurement) for synthetic-to-live correlation as a continuous KPI, Domain 0 (AgentOps) for governance on synthetic outputs in regulated industries.
Why this matters now
McKinsey estimates agentic AI could support up to two-thirds of current marketing activities, synthetic audience testing included. The category has attracted more than $1B in disclosed venture capital across 2023 to 2026, with confirmed major rounds at Simile ($100M Series A, Feb 2026, Index Ventures + Bain Capital + A* + angels Fei-Fei Li and Andrej Karpathy), Aaru (~$50M+ Series A at a $1B headline valuation, Dec 2025, Redpoint, though Aaru's ARR is still under $10M), Listen Labs ($69M in 2025), Outset ($17M Series A from 8VC), and Keplar ($3.4M from Kleiner Perkins in 2025).
The capability is real. Park et al.'s 2024 paper, "Generative Agent Simulations of 1,000 People", ran two-hour AI interviews with 1,052 participants and found the resulting agents replicated the participants' General Social Survey responses 85% as accurately as the same participants replicated their own answers two weeks later. The combined interview-plus-survey agents reached 86%, versus demographic-only agents at 74%. The interview-based agents also reduced racial and ideological bias by 36 to 62%. (The "94% accuracy" number you'll see cited everywhere is a miscitation; the actual Park figure is 85%.) Toubia et al.'s 2025 Twin-2K-500 reference dataset covers 2,058 US participants across 500-plus questions in four waves and is the cleanest validation set the field has.
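The "85% as accurately" figure is a ratio, not a raw hit rate: the agent's agreement with a participant is divided by that participant's agreement with their own answers two weeks later. A minimal sketch of that normalization follows. The function name and the input numbers are illustrative assumptions, not values taken from the paper.

```python
def normalized_accuracy(agent_accuracy: float, self_consistency: float) -> float:
    """Agent accuracy relative to how well people reproduce their own answers.

    agent_accuracy:   share of a participant's answers the agent matched (0..1)
    self_consistency: share of their own answers the participant matched
                      on retest (0..1, must be > 0)
    """
    if not 0 <= agent_accuracy <= 1:
        raise ValueError("agent_accuracy must be between 0 and 1")
    if not 0 < self_consistency <= 1:
        raise ValueError("self_consistency must be in (0, 1]")
    return agent_accuracy / self_consistency


# Hypothetical inputs, chosen only to show the arithmetic.
raw_agent_match = 0.68   # agent matched 68% of answers
retest_match = 0.80      # participant matched 80% of their own answers
score = normalized_accuracy(raw_agent_match, retest_match)
print(f"normalized accuracy: {score:.2f}")
```

The point of the denominator is that humans are noisy too. An agent can't fairly be expected to beat a person's own test-retest consistency, so the raw match rate understates how close the agent is to the practical ceiling.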
The cautions are also real. The same Toubia team published the Funhouse Mirrors paper in 2025 showing average twin-to-human correlation of 0.197, which is roughly the correlation between height and intelligence. Twin standard deviation was lower than human in 93.9% of cases. The canonical critique from outside academia is Conjointly's "Synthetic Respondents Are the Homeopathy of Market Research", which documented income variance from $111,348 to $272,014 from prompt rephrasing alone. Nielsen Norman Group's stance is the most-cited UX-research line on this: "synthetic users help with hypothesis generation, not validation."
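The compressed-variance finding is easy to check on your own data. The sketch below compares the spread of synthetic and human answers question by question and reports the share of questions where the synthetic spread is narrower. The function name, question keys, and 1-to-5 scale responses are hypothetical.

```python
from statistics import stdev


def sd_compression_share(human: dict[str, list[float]],
                         twin: dict[str, list[float]]) -> float:
    """Share of shared questions where twin answers have a smaller SD than human answers."""
    shared = [q for q in human if q in twin]
    if not shared:
        raise ValueError("no questions in common")
    narrower = sum(1 for q in shared if stdev(twin[q]) < stdev(human[q]))
    return narrower / len(shared)


# Hypothetical responses on a 1-5 scale, five respondents per question.
human_answers = {
    "q1": [1, 2, 3, 4, 5],
    "q2": [2, 2, 4, 5, 5],
    "q3": [3, 3, 3, 4, 3],
}
twin_answers = {
    "q1": [3, 3, 3, 4, 3],   # bunched around the middle
    "q2": [3, 4, 4, 4, 3],   # bunched around the middle
    "q3": [1, 3, 5, 3, 4],   # wider than the humans
}
share = sd_compression_share(human_answers, twin_answers)
print(f"twin SD lower than human on {share:.0%} of questions")
```

A high share is a warning that the synthetic panel is averaging away the extremes, which matters most for pricing and purchase-intent questions where the tails drive the decision.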
The honest middle is that synthetic testing is a useful pre-test layer for narrowing concepts before live validation. It is not a substitute for talking to actual customers.
Six pieces of work in this domain
Synthetic persona generation is building AI personas from your own first-party data: CRM records, survey responses, customer interview transcripts. You give the model enough real material that it can construct profiles representative of the buyers you actually have, then you can ask those personas questions and watch how they react under different conditions. Personas built on imagination alone (no real data, just an LLM) reflect the LLM's training-data bias: typically younger, more educated, and more liberal than your actual customers.
Concept and message testing is using those personas, or simulated panels, to pre-test things before you spend real money. Does this headline land? Is this value prop sharper than the alternative? Will this price point feel reasonable or insulting? Will this packaging confuse buyers? Cheap to do at scale. Useful as a filter. Not a substitute for live testing.
Digital twin modeling is the more ambitious version: continuously updating models of individual customers, used for personalization at scale, behavioral prediction, and simulating customer experience changes before you make them. CVS Health is the most-cited consumer example, with 100,000+ "agentic twins" built on 2.9 million consented responses across 200+ behavioral scenarios. Their own announcement is explicit that simulations don't replace real-world research.
Synthetic panels and survey-style research are large simulated audiences (n=1,000 to 5,000-plus) that look demographically like a target market segment. Useful when real recruiting is hard or expensive: niche audiences, regulated populations, markets that would take weeks to assemble in person.
Validation infrastructure is the unglamorous but essential layer: calibrating synthetic predictions against real customer behavior, detecting and correcting bias, documenting your methodology so other people can trust it, and routing high-stakes decisions to live research instead of relying on the synthetic output. Without this layer, the rest reads as homeopathy.
Customer voice capture overlaps with Domain 1 but lives here too. Interview transcript synthesis, review mining for verbatim language, sales call analysis, support-ticket pattern recognition. The work of pulling actual customer speech out of the data you already have and feeding it back into your messaging.
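A panel that "looks demographically like" a target segment needs per-group seat counts that add up exactly to n. One way to sketch that, assuming you already have target shares for each group, is largest-remainder allocation. The segment labels and shares below are made up for illustration, and the allocation method is an assumption rather than anything a specific vendor documents.

```python
def allocate_panel(target_shares: dict[str, float], n: int) -> dict[str, int]:
    """Split n panel seats across segments in proportion to target shares.

    Uses largest-remainder rounding so the seats always sum to n.
    """
    if abs(sum(target_shares.values()) - 1.0) > 1e-6:
        raise ValueError("target shares must sum to 1")
    exact = {seg: share * n for seg, share in target_shares.items()}
    seats = {seg: int(value) for seg, value in exact.items()}  # floor
    leftover = n - sum(seats.values())
    # Hand remaining seats to the segments with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda seg: exact[seg] - seats[seg], reverse=True)
    for seg in by_remainder[:leftover]:
        seats[seg] += 1
    return seats


# Hypothetical age mix for a target market.
shares = {"18-34": 0.22, "35-54": 0.41, "55+": 0.37}
panel = allocate_panel(shares, n=1000)
print(panel)
```

Matching the headcount only fixes composition. Whether the personas inside each group answer like real members of that group is a separate question, and that is what the validation layer is for.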
What works in 2026
Use synthetic audiences as a front-end filter, not a finish line. They're good at producing testable hypotheses (narrowing twenty concepts to three before you spend on live testing). They're not validated truth. The strongest teams run synthetic and live in sequence, not as substitutes for each other.
Build your personas on real first-party data. Garbage in, garbage out, aggressively. Reliable persona generation starts with customer interviews, CRM patterns, support logs, social listening, and behavioral data you actually own. Personas built only from public data and an LLM's training drift toward the demographics LLMs over-represent.
Don't rely on synthetic output for consequential decisions in regulated sectors. In healthcare, finance, anything involving minors, anything with regulator exposure, synthetic persona output should never be the sole basis for messaging or product decisions. Human review, legal review, and live research remain non-negotiable.
Calibrate continuously. A synthetic persona is a model, and models drift. Compare your synthetic predictions to actual customer behavior on every campaign and product launch, then adjust. This is the line between rigorous applied research and a confidence-boosting illusion.
Document the inputs. When someone on your team says "the persona predicts strong adoption," they should also be able to explain why, based on what inputs, and with what known limitations. Transparency is what earns the trust to use this work in real decisions.
Use synthetic for breadth, live for depth. Synthetic panels are unmatched for quick volume testing: fifty ad creatives in a day, twenty pricing variations across segments. Live research is unmatched for understanding why people react the way they do: the cultural nuance and tacit knowledge that synthetic just doesn't have access to.
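Two of the rules above (document the inputs, and don't rely on synthetic alone for consequential or regulated decisions) can be enforced mechanically before a finding reaches a decision meeting. The sketch below is one hypothetical way to do it. The class, field names, sector labels, and route strings are all assumptions for illustration.

```python
from dataclasses import dataclass

# Sector list taken from the guidance above; labels are illustrative.
REGULATED_SECTORS = {"healthcare", "finance", "minors"}


@dataclass
class SyntheticFinding:
    claim: str                    # e.g. "the persona predicts strong adoption"
    inputs: list[str]             # data sources the persona was built on
    known_limitations: list[str]  # what the persona can't speak to
    sector: str
    consequential: bool           # pricing, positioning shift, launch, etc.


def review_route(finding: SyntheticFinding) -> str:
    """Decide what has to happen before this finding can inform a decision."""
    if not finding.inputs or not finding.known_limitations:
        return "reject: inputs or limitations undocumented"
    if finding.sector in REGULATED_SECTORS and finding.consequential:
        return "live research plus human and legal review"
    if finding.consequential:
        return "live validation"
    return "synthetic filter is enough"


finding = SyntheticFinding(
    claim="Refill-reminder message B outperforms A",
    inputs=["2025 support tickets", "18 patient interviews"],
    known_limitations=["no respondents over 75"],
    sector="healthcare",
    consequential=True,
)
print(review_route(finding))
```

The ordering matters: an undocumented finding is rejected before its stakes are even considered, so nobody can route around the documentation rule by calling a decision low-stakes.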
Tools & Platforms
Pure-Play Synthetic Research Platforms
- Simile ($100M from Index Ventures, Feb 2026). Founded by Stanford's Joon Sung Park (inventor of generative agents), Michael Bernstein, Percy Liang. The deepest pedigree in the space.
- Aaru, synthetic audiences with a focus on consumer behavior
- Ditto (askditto.io), research platform with extensive market mapping
- Evidenza, synthetic research for B2B
- SYMAR, synthetic market research
- Synthetic Users, UX-research-focused
- Ask Rally, virtual focus groups, GenPop panel calibrated on real interviews
- Delve AI, persona generation + digital twin chat
Embedded in Survey Platforms
- Qualtrics Edge Audiences, synthetic respondents in the world's largest survey platform; fine-tuned on 25+ years of Qualtrics research data
- Toluna HarmonAIze, synthetic respondents from 79M-member panel data
- YouGov (via Yabble acquisition), AI-augmented insights
Hybrid AI + Human Panels
- Quantilope, AI for analysis and survey design, human respondents
- Remesh, AI moderation, real participants
- Conjointly, pricing and feature research
B2B-Specific
- Wynter, message testing with verified B2B audience pools (~$299-$1,000+/mo)
- PyMC Labs. Bayesian-modeled synthetic consumers; Fortune 500 deployments
Custom Builds
- Anthropic Claude / OpenAI GPT with custom system prompts + structured persona docs
- Retell AI / Vapi, voice-based synthetic interviews
Notable Practitioners & Frameworks
- Joon Sung Park (Stanford, now Simile). Pioneer of generative agents. The Park et al. 2024 paper (1,000 People) is the foundational reference.
- Michael Bernstein, Percy Liang (Stanford). Co-architects of generative agents.
- PyMC Labs team. Bayesian methodology. "LLMs Reproduce Human Purchase Intent" (2025). Semantic Similarity Rating method; 57 surveys, 9,300 human responses (Colgate-Palmolive collaborators); 90% correlation on product ranking, 85%+ distributional similarity.
- Bob Moesta. JTBD (real-customer-research; informs synthetic).
- Indi Young. Listening as a research practice (the "why" that synthetic can't fully replicate).
- Olivier Toubia (Columbia). Lead author on Twin-2K-500 + Funhouse Mirrors; the academic counter-weight to vendor claims.
- Ray Poynter (NewMR), practitioner-side critic on synthetic-data limits.
Named Case Studies
| Case | What they did | Result | Notes |
|---|---|---|---|
| CVS Health × Simile | Built generative agents on 2.9M consented responses from 400,000+ participants across 200+ behavioral scenarios. Operates 100,000+ "agentic twins" | Use case: medication adherence drivers; twins surfaced trust/confidence/convenience as primary, barriers as confusion/refill anxiety/prior frustration | Critical caveat from CVS's own announcement: "simulations don't replace real-world research" (they prioritize what to test). Governance monitors tone, fairness, safety. Strongest consumer-healthcare example |
| EY × Evidenza | C-suite executive research; "Synthetic CMO" feature with Sharp/Ritson/Binet/Field clones | EY CMO Toni Clayton-Hine reported 95% correlation with EY's actual Global Brand Survey of C-suite execs. Confirmed clients: BlackRock, Microsoft, JP Morgan, Salesforce, Dentsu, ServiceNow | Self-reported correlations not externally audited; pricing ~$50K-$100K/yr |
| Aaru, 2024 NY Democratic primary | ~5,000 AI agents predicting election | Within 400 votes of the actual result at ~1/10 the cost of traditional polls | Strongest prediction-validation in public record. The weakness is that a one-off election result doesn't prove a repeatable methodology (Nate Silver's team published the critical "AI polls are fake polls" piece) |
| Park 1,000 People + NN/G three-study evaluation (calibration) | Where synthetic & real agreed: directional preferences, demographic patterns, personality dimensions (Big Five 0.80) | Where they diverged: behavioral data (online courses, where synthetic claimed completion when real users hadn't), drone delivery (synthetic favorable, real users impractical), dog-food purchase intent (synthetic SD lower than human, magnitude off) | Twin-to-human correlation averaged 0.197 in Columbia mega-study |
| PandaDoc × Wynter (the anti-synthetic control) | 50-person verified B2B marketing panel on PandaDoc messaging | Found "on-brand docs" was confusing/generic for the ICP, where live human responses caught what synthetic might have missed | 12 to 48 hour turnaround |
| Regulated-industry limit | No FDA regulation specifically governs synthetic personas. CVS's pattern (never sole-basis decision-making, governance layer) is the de facto regulated-industry playbook | Pharma firms using synthetic for HCP message testing keep human IRB-approved validation pass on every consequential decision | FDA's 2024 draft guidance on AI in drug development is silent on synthetic respondents |
Tools & Platforms, head-to-head
Pure-play synthetic platforms
| Platform | Pedigree | Pricing | Best fit | Watch for |
|---|---|---|---|---|
| Simile | Stanford founders (Park/Bernstein/Liang); $100M Index | Enterprise (undisclosed) | Fortune 100 longitudinal twins; CVS-style 100K-agent deployments | Newest; bandwidth limited to large accounts |
| Aaru | Teen-founder team; $1B headline; Redpoint | Enterprise | Election-style large-N consumer simulations; corporate executive simulations (Lumen) | <$10M ARR; valuation ahead of revenue |
| Ditto / FishDog | 300K pre-built population-true personas | $50K-$75K/yr unlimited | Self-serve consumer brand testing; Figma/Canva/Framer integrations | Now FishDog after rename |
| Synthetic Users | Kwame Ferreira; UX focus | $2-$27 per interview, +$5 RAG | UX hypothesis generation; pre-research scoping | NN/G explicitly cautions against using as research replacement |
| Evidenza | Lombardo + Weinberg ex-LinkedIn B2B Institute | ~$50K-$100K/yr | B2B CMO-level brand strategy; "Synthetic CMO" feature | No self-serve, no API, 72-hour turnaround |
| Ask Rally | Calibrated GenPop panel via Turing test | Mid-market | Rapid small-to-medium decision testing | Calibration is per-persona; uneven coverage |
Embedded-in-platform
| Platform | Approach | Source pool |
|---|---|---|
| Qualtrics Edge Audiences | Fine-tuned proprietary LLM on 25+ yrs of Qualtrics studies | Booking.com, Dollar Shave Club, Gabb |
| Toluna HarmonAIze | Each persona = individual synthetic respondent (not segment average) | 19.4M-member US panel (now expanding UK/FR) |
| YouGov (via Yabble) | AI insight layer on existing panel | YouGov panel |
B2B-specific
| Platform | Method | Numbers |
|---|---|---|
| Wynter | Verified human B2B panel (saturation methodology, 12-13 responses) | 70K-80K verified B2B professionals; LinkedIn + corporate-email verified; from $798/mo, 12-48hr |
| PyMC Labs | Bayesian-modeled synthetic + Semantic Similarity Rating | Custom; 90% product-ranking correlation, 85%+ distributional similarity |
Decision frame: Wynter when you need real B2B humans on your ICP for high-stakes message/positioning/pricing decisions. PyMC Labs when you need scientifically grounded synthetic at scale and can invest in custom Bayesian validation. Evidenza when the buying committee is C-suite Fortune 500 and you can absorb $50K+ engagements.
Critical distinction: AI-interview hybrid ≠ synthetic
Listen Labs ($69M Sequoia), Outset ($17M 8VC), and Keplar ($3.4M Kleiner Perkins) all use AI to interview real humans at scale. These are NOT synthetic; they are AI-moderated qualitative research. The distinction is often lost in press coverage.
Tactical Playbooks
Playbook A. Synthetic persona from first-party data (Claude Project build)
- Export to Markdown: 30 customer interview transcripts + last 90 days of support tickets + last 200 sales-call transcripts (Gong/Fireflies) + open-ended NPS verbatims.
- Cluster by JTBD using Claude (3-5 distinct jobs).
- For each job, build a Claude Project with: ICP firmographics, top-3 quotes per pain dimension, top-3 trigger events, observed-language vocabulary list, list of objections actually voiced (verbatim).
- System prompt: "You are [Persona]. Answer ONLY using language and frames from the provided transcripts. If asked something outside these transcripts, say 'I don't know' rather than inventing." (Explicitly defends against the sycophancy problem NN/G flagged.)
- Validate quarterly: pose the same 5 questions to 3 real customers and compare. If correlation drops, refresh the corpus.
Cross-link: This is the bridge between Mahmoud's customer-research-playbook (real interviews) and synthetic: the playbook produces the inputs; the Project produces the simulator.
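The persona document and system prompt from Playbook A can be assembled programmatically so every persona gets the same guardrail wording. This is a minimal sketch that only builds the prompt text; it makes no API calls. The `PersonaDoc` class, its field names, and the example persona are hypothetical, and the quotes shown are placeholders standing in for real transcript verbatims.

```python
from dataclasses import dataclass


@dataclass
class PersonaDoc:
    """One persona per JTBD cluster; every field should come from real transcripts."""
    name: str
    job: str
    firmographics: str
    pain_quotes: list[str]
    trigger_events: list[str]
    vocabulary: list[str]
    objections: list[str]


def bullets(items: list[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


def build_system_prompt(p: PersonaDoc) -> str:
    # The opening instruction mirrors the playbook's wording.
    sections = [
        f"You are {p.name}. Answer ONLY using language and frames from the "
        "provided transcripts. If asked something outside these transcripts, "
        "say 'I don't know' rather than inventing.",
        f"Job to be done: {p.job}",
        f"Firmographics: {p.firmographics}",
        "Verbatim pain quotes:\n" + bullets(p.pain_quotes),
        "Trigger events:\n" + bullets(p.trigger_events),
        "Observed vocabulary:\n" + bullets(p.vocabulary),
        "Objections actually voiced:\n" + bullets(p.objections),
    ]
    return "\n\n".join(sections)


# Placeholder content; replace with material from your own corpus.
persona = PersonaDoc(
    name="Ops Lead Olivia",
    job="Get contracts signed without chasing people",
    firmographics="200-500 employee B2B services firm",
    pain_quotes=["<verbatim quote 1>", "<verbatim quote 2>", "<verbatim quote 3>"],
    trigger_events=["<trigger 1>", "<trigger 2>", "<trigger 3>"],
    vocabulary=["<term 1>", "<term 2>"],
    objections=["<objection 1>"],
)
prompt = build_system_prompt(persona)
print(prompt)
```

Keeping the persona as structured data rather than a hand-edited prompt makes the quarterly refresh simpler: swap the quotes and objections, regenerate, and the guardrail sentence stays identical across personas.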
Playbook B. 50 ad creatives in a day, narrowed to 3
- Generate 50 ad-creative variations (Claude or Midjourney); store in spreadsheet
- Run synthetic panel on Ditto/Ask Rally for each variant: predicted CTR rank, recall rank, brand-fit rank
- Filter to top 10 (kill bottom 80% on synthetic alone; "front-end filter, not finish line")
- Expert review (PMM + creative director) prunes 10 → 5
- Live test 5 on Wynter (B2B) or Meta CBO-test (B2C)
- Final 2-3 get media spend
Why this works: synthetic is high-recall low-precision; live is low-recall high-precision. The funnel respects each layer's strengths.
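The first cut in this funnel (50 variants down to 10) needs some rule for combining the three synthetic ranks. The playbook doesn't specify one, so the sketch below assumes a simple mean of the three ranks, with ties broken by name. The variant names and rank values are generated placeholders, not real panel output.

```python
def shortlist(ranks_by_variant: dict[str, tuple[int, int, int]], keep: int) -> list[str]:
    """Keep the variants with the best (lowest) mean rank across the three metrics.

    Each tuple is (predicted CTR rank, recall rank, brand-fit rank); 1 is best.
    """
    mean_rank = {name: sum(ranks) / len(ranks) for name, ranks in ranks_by_variant.items()}
    ordered = sorted(mean_rank, key=lambda name: (mean_rank[name], name))
    return ordered[:keep]


# Placeholder ranks for 50 variants. Each metric is a permutation of 1..50.
variants = {
    f"ad_{i:02d}": (i + 1, (i * 7) % 50 + 1, (i * 11) % 50 + 1)
    for i in range(50)
}
top10 = shortlist(variants, keep=10)
print(top10)
```

A mean of ranks treats the three metrics as equally important. If brand fit is a hard requirement, filter on it first and rank on the other two, rather than letting a high CTR prediction rescue an off-brand ad.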
Playbook C. Synthetic + live in sequence
- Discover (live): 12-15 customer interviews. Bob Moesta switch interviews, Indi Young listening sessions
- Hypothesize (synthetic): Translate findings into 5-10 testable concepts; run on synthetic panel for directional ranking
- Validate (live): Top 2-3 concepts go to Wynter / focus groups / live A/B
- Calibrate (continuous): After every launch, compare synthetic prediction vs. live result; track synthetic-to-live correlation as a KPI. If it falls below 0.7, the persona corpus is stale.
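The calibrate step can be sketched as a small check run after each launch. The playbook names the 0.7 threshold but not the correlation statistic, so Pearson correlation is an assumption here, and the paired numbers are hypothetical predicted-versus-observed lifts.

```python
import math

STALE_BELOW = 0.7  # threshold named in the playbook


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between paired synthetic predictions and live results."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one series has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical history: one pair per launch.
synthetic_pred = [0.10, 0.20, 0.30, 0.40, 0.50]
live_result = [0.12, 0.18, 0.33, 0.37, 0.52]

r = pearson(synthetic_pred, live_result)
corpus_is_stale = r < STALE_BELOW
print(f"synthetic-to-live r = {r:.2f}, refresh corpus: {corpus_is_stale}")
```

With only a handful of launches the estimate is noisy, so treat a single dip as a prompt to look closer rather than proof of drift. If you care about ranking concepts rather than predicting magnitudes, a rank correlation may be the better fit.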
Cross-References to Mahmoud's Skills
- customer-research-playbook owns interview craft, JTBD, switch interviews, listening, mental models. Domain 7 does not duplicate any of that. Domain 7 picks up where the playbook ends: when you have 30 real interviews, what synthetic layer should you build, and where does it break?
- Domain 1 (Sensing) overlap, sales-call mining is a Domain 1 input that becomes Domain 7 fuel.
- Domain 8 (Measurement) overlap, synthetic-to-live correlation IS the measurement metric for Domain 7.
Industry overlay (Q2 2026)
| Industry | ICP / motion difference | Tools that win | Biggest pitfall | Compliance overlay |
|---|---|---|---|---|
| B2B SaaS | Synthetic personas built on Gong + Zendesk + interview corpus; Wynter for B2B panel validation; calibrate quarterly | Wynter ($799+/mo verified B2B); PyMC Labs for Bayesian rigor; Claude Project personas from first-party data; Synthetic Users for UX | Treating synthetic as truth; Park 1,000 People shows 85% replication ceiling, Toubia "Funhouse Mirrors" finds 0.197 average correlation. Front-end filter only | None |
| Biopharma | Synthetic HCP/patient personas are advisory only, never a substitute for IRB-approved research. CVS × Simile is the published pattern. EY × Evidenza for C-suite | Simile (Stanford pedigree, $100M Index); Evidenza (~$50-100K, "Synthetic CMO" with KOL clones); KOL advisory boards (gold standard); patient panels via Rare Patient Voice/Carenity | Using synthetic patient response to drive a label change, MoA messaging, or clinical claim, which regulators view as inadequate basis; payer tie-in falls apart | FDA Mar 2026 NAM draft guidance allows digital twins in trials; IRB review on patient-facing tests; HIPAA on synthetic patient cohorts built from real PHI; ABPI/PhRMA Code on HCP simulation |
| DTC | Synthetic ad pre-test, 50 creatives down to 10 down to 5, then live Meta CBO; concept ranking before media spend | Ditto/FishDog (300K personas, $50-75K/yr); Ask Rally; Suzy for human panel; Meta Advantage+ Lift Studies for live | Trusting synthetic CTR predictions and running media; DTC results vs. synthetic correlation breaks at scale; always live-test top-3 | FTC truth-in-advertising on testing claims if marketed externally |
| Dev tools | Synthetic developers are the weakest area, since LLMs can't simulate "I tried it, the SDK threw an error." Real DX testing on Discord/beta lists wins | Real beta programs; UserTesting.com with engineer-screen; Maze for unmoderated dev research; Synthetic Users only for hypothesis generation | Using synthetic devs to validate API ergonomics, which gives you what reads well rather than what compiles. Will mislead DX decisions | None beyond standard |
Key insight: Biopharma's synthetic-testing posture is uniquely advisory only. CVS Health's own framing is that 100K-twin simulations prioritize what to test next, never replace IRB-approved research. Any synthetic output that informs a clinical claim, label, or HCP message must pass through human medical review. This is the single sharpest compliance overlay across all 8 domains.
Common Failure Modes
- Treating synthetic as truth. Synthetic outputs are plausible-sounding by design; that doesn't make them right.
- Bias amplification. LLMs underrepresent older, more conservative, less-educated demographics. Synthetic panels built on default LLM behavior reproduce this bias.
- Skipping real-customer validation for high-stakes decisions. Pricing, positioning shifts, and product launches need real-world confirmation, not just synthetic confidence.
- Static personas. Audiences change. A persona built in 2024 may be wrong in 2026. Refresh the underlying data.
- Conflating synthetic personas with digital twins. Personas are abstract; digital twins are individual. Different tools for different jobs.
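For the bias-amplification failure mode, a first diagnostic is to compare the panel's demographic mix with the target market and compute the reweighting needed to close the gap. Post-stratification weighting is a standard survey technique brought in here as an assumption; the text above does not prescribe it. The segment labels, counts, and shares are hypothetical.

```python
def poststrat_weights(panel_counts: dict[str, int],
                      target_shares: dict[str, float]) -> dict[str, float]:
    """Weight per segment = target share / panel share."""
    n = sum(panel_counts.values())
    weights = {}
    for segment, target in target_shares.items():
        count = panel_counts.get(segment, 0)
        if count == 0:
            # Weighting can't invent respondents; regenerate or recruit instead.
            raise ValueError(f"no panel members in segment {segment!r}")
        weights[segment] = target / (count / n)
    return weights


# Hypothetical panel that skews young, versus the real customer base.
panel_counts = {"18-34": 500, "35-54": 350, "55+": 150}
target_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

weights = poststrat_weights(panel_counts, target_shares)
for segment, w in weights.items():
    print(f"{segment}: weight {w:.2f}")
```

Large weights are themselves a warning sign. If the oldest segment needs a weight above 2, a small number of synthetic personas is standing in for a large share of your market, and weighting corrects only the headcount, not how those personas answer.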
KPIs
- Concept hit rate (% of concepts that pass synthetic + advance to live testing)
- Synthetic-to-live correlation (how closely do synthetic predictions track live results?)
- Time-to-validated-concept (synthetic compresses this dramatically)
- Cost-per-validated-concept (synthetic should be 10x+ cheaper than traditional)
- Concept abandonment rate before media spend (synthetic should kill more bad ideas, sooner)
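Three of these KPIs fall straight out of a per-concept log. The sketch below assumes a hypothetical `Concept` record with one row per concept and made-up costs. Synthetic-to-live correlation and time-to-validated-concept need paired predictions and timestamps, so they are left out.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    name: str
    passed_synthetic: bool   # advanced from synthetic screen to live testing
    validated_live: bool     # confirmed by live research
    got_media_spend: bool
    research_cost: float     # synthetic plus live research spend on this concept


def domain_kpis(concepts: list[Concept]) -> dict[str, float]:
    total = len(concepts)
    validated = sum(c.validated_live for c in concepts)
    if total == 0 or validated == 0:
        raise ValueError("need at least one concept and one validated concept")
    return {
        "concept_hit_rate": sum(c.passed_synthetic for c in concepts) / total,
        "abandonment_before_spend": sum(not c.got_media_spend for c in concepts) / total,
        "cost_per_validated_concept": sum(c.research_cost for c in concepts) / validated,
    }


# Hypothetical batch: 10 concepts, 4 pass synthetic, 2 validate live and get spend.
# Assumed costs: 100 per synthetic screen, 2,000 per live test.
batch = [
    Concept(
        name=f"concept_{i}",
        passed_synthetic=i < 4,
        validated_live=i < 2,
        got_media_spend=i < 2,
        research_cost=100 + (2000 if i < 4 else 0),
    )
    for i in range(10)
]
kpis = domain_kpis(batch)
print(kpis)
```

Cost per validated concept includes the spend on concepts that died along the way. That is deliberate: the metric should reward killing bad ideas at the cheap synthetic stage rather than the expensive live one.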
Resources for Deeper Study
YouTube channels
- PyMC Labs, methodological depth
- Stanford HAI, academic foundation
- NN/g (Nielsen Norman Group). UX research foundations
- Marketing Science Institute, academic marketing research
Podcasts
- Marketing Today with Alan Hart
- The Bob Moesta Show (Jobs-to-be-Done)
- Indi Young's Listening podcast
Books
- Demand-Side Sales 101 (Bob Moesta)
- When Coffee and Kale Compete (Alan Klement)
- Practical Empathy (Indi Young)
- Interviewing Users (Steve Portigal)
Foundational Papers
- Park et al. (2023), "Generative Agents: Interactive Simulacra of Human Behavior". Stanford foundational paper
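- Park et al. (2024), "Generative Agent Simulations of 1,000 People". The 85% replication result
- Toubia et al. (2025), Twin-2K-500 dataset and "Funhouse Mirrors". Validation set and the 0.197 correlation critique
- PyMC Labs (2025), "LLMs Reproduce Human Purchase Intent". Semantic Similarity Rating method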
v3 (shipped Apr 2026)
- Park 85% accuracy correction (was miscited as 94%)
- Toubia 'Funhouse Mirrors' critique (0.197 average twin-to-human correlation, 93.9% twin-SD-lower-than-human)
- 5 named cases (CVS Health × Simile 100K twins, EY × Evidenza 95% correlation, Aaru NY primary <400 votes, PandaDoc × Wynter live caught what synthetic missed, PyMC Labs 90%/85% validation)
- Pure-play vs. embedded vs. B2B-specific tooling comparison (Simile / Aaru / Ditto / Synthetic Users / Evidenza / Qualtrics Edge / Wynter / PyMC)
- AI-interview-hybrid distinction explicitly called out (Listen Labs / Outset / Keplar are NOT synthetic)
- Moesta verbatim quotes (incl. 'It's the struggling moment where they can't do something that causes them to take the leap')
- $1B+ disclosed VC across 2023-2026 reframed (was '$1.5B' without sourcing)
- 3 tactical playbooks (synthetic persona from first-party data, 50→3 ad creatives, synthetic+live in sequence)
- Industry overlay (biopharma 'advisory only' framing especially sharp) + cross-references (5 inter-domain + 2 skills)
v4 deferred
- Continuous-calibration methodology (synthetic-to-live correlation tracking as a continuous KPI) with a named-brand case
- First regulated-industry public failure case to clarify boundaries (FDA/SEC enforcement action)
See research-plan.md for the master v3 changelog and v4 forward plan.
Frequently asked questions about customer intelligence and synthetic testing
What's the actual accuracy of synthetic personas?
Park et al. (2024) Generative Agent Simulations of 1,000 People: agents replicated participants' GSS responses 85% as accurately as participants replicated their own answers two weeks later. NOT 94%; the 94% figure widely cited online is a miscitation. Combined interview + survey agents reached 86% vs. demographic-only at 74%. Counter-data: Toubia et al. (2025) 'Funhouse Mirrors' found average twin-to-human correlation of 0.197 (≈ height vs. intelligence) and twin standard deviation lower than human in 93.9% of cases. Synthetic narrows what to test; it doesn't decide.
Synthetic personas vs. live B2B panels: which wins?
Sequence them. Synthetic (Custom Claude Project, Synthetic Users, Ask Rally) for hypothesis generation and rapid concept screening; narrow 50 ad creatives to 3 in a day. Live B2B panels (Wynter, $799+/mo, 70K-80K verified B2B professionals) for validation before consequential decisions. PyMC Labs offers Bayesian-grounded synthetic with documented validation (90% correlation on product ranking, 85%+ distributional similarity across 57 surveys, 9,300 human responses). NN/G's stance: 'synthetic users help with hypothesis generation, not validation.'
How does CVS Health use 100,000 synthetic twins?
CVS built generative agents on 2.9M consented responses from 400,000+ participants across 200+ behavioral scenarios (announced 2025). Operates 100,000+ 'agentic twins' for journey-friction analysis, hard-to-reach population testing, and product testing at speed. Critical caveat from CVS's own announcement: simulations don't replace real-world research. Governance monitors tone, fairness, safety. This is the canonical regulated-industry pattern: synthetic prioritizes what to test next, never the sole basis for clinical or messaging decisions.
Are AI-interview platforms (Listen Labs, Outset) the same as synthetic research?
No. Listen Labs ($69M Sequoia, 2025), Outset ($17M 8VC), and Keplar ($3.4M Kleiner Perkins) use AI to interview real humans at scale. They are AI-moderated qualitative research, not synthetic respondents. The distinction matters and is often lost in press coverage: AI-interview hybrids interview humans; synthetic research generates plausible responses without humans. Different tools for different jobs.