Hindcasting the LA County COVID Vaccine Rollout: A Pre-Registered Stress Test for Multi-Agent Healthcare Simulation

20 LA County ZIP codes — 6 from Keck Medicine's CBSA + 14 controls across the HPI distribution

The problem worth solving

Health systems in the United States spend billions of dollars each year on interventions that miss the people they were designed to reach. Outreach campaigns underperform. Screening programs fail to close equity gaps. New care-navigation pathways roll out across whole networks before anyone discovers they shift workload onto already-burdened clinics. The post-mortems are honest enough about what went wrong; the deeper question is why we couldn't have known sooner.

The standard tools (predictive analytics on past patient data, A/B-style pilots, focus groups) answer related but different questions. Predictive models project the past forward. Pilots tell you what happened at one site. Focus groups tell you what people say they would do, which is consistently a poor predictor of what they actually do.

What has been missing is a way to stress-test an intervention against a population that behaves like the real one before any real patient is touched.

A different kind of model

Traditional public-health agent-based simulation hardcodes the answer into the model. You write a hazard formula, calibrate its coefficients against historical data, and then "predict" by running the formula forward. The pattern matches because the formula was tuned to make it match. That approach was reasonable in 2019 because researchers had no other option. We do.

Population Lab takes a different path. Each agent is a fully formed character built from real census demographics, given a January-2021 vaccine attitude drawn from real survey priors, and dropped into a simulated social environment where real intervention events arrive on the feed at their actual dates. The LLM agent reasons in natural language about what to do. Vaccination uptake is what emerges from the accumulation of those micro-decisions, not what we computed.

We do not multiply distance by trust by social signal anywhere in our code. Distance is part of the agent's environmental context — "the nearest mass vaccination site is 2.3 miles from your home" — and the agent reasons about it the way a real person would. Mandates are events on the agent's feed, not coefficients on their hazard rate.

The only way to know whether this approach actually works is to test it against reality. So we are.

The case study

We are recreating the 2021 LA County COVID-19 vaccine rollout in twenty strategically chosen ZIP codes. Six of them make up Keck Medicine's published Community Benefit Service Area, which covers Boyle Heights (where Keck Hospital sits), Lincoln Heights, El Sereno, and East Los Angeles. The other fourteen are controls drawn from across the LA County Healthy Places Index (HPI) distribution: two more low-HPI ZIPs to test whether predictions are CBSA-specific or generally driven by neighborhood disadvantage, plus four each from the mid-low, mid-high, and high-HPI quartiles.

The 2021 rollout is the right case to test against because:

It is among the most thoroughly documented public-health events in modern American history. CDPH published weekly vaccination rates by ZIP code from December 2020 through the end of 2021. We know what happened.
The intervention surface is rich. Eligibility tier expansions, mass-vaccination site openings and closures, supply changes (J&J pause and resumption, Pfizer's full FDA approval), the August healthcare-worker mandate, the May 27 Vax for the Win incentive — these are real, dated, sourced events with documented timing.
Equity outcomes were divergent and large. LA County's published end-of-2021 uptake spans roughly thirty percentage points across the HPI distribution. If our methodology can recover that pattern from a synthetic population that has never seen the outcome data, that is a meaningful claim. If it cannot, we want to know.

How we built the synthetic population

Five hundred agents distributed evenly — twenty-five per ZIP code, twenty ZIP codes. Each agent represents roughly two hundred to six hundred real adults in their ZCTA. They are statistical samples, not 1:1 individuals.

Three layers of evidence compose into one LLM persona: census demographics + behavioral evidence + environmental context

The substrate is the American Community Survey 5-year file (2017–2021). For each agent we sample age band, sex, race and ethnicity, employment status, and below-poverty status from the per-ZIP marginals — so the twenty-five agents living in Boyle Heights collectively look like a representative sample of Boyle Heights adults: about ninety percent Hispanic, sixteen percent age sixty-five-plus, twenty-four percent below the poverty line. The twenty-five agents living in Diamond Bar (a high-HPI ZIP in the eastern San Gabriel Valley) look like Diamond Bar: sixty percent Asian, low poverty, older skew.
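As a minimal sketch of the sampling step described above, the snippet below draws twenty-five agents from per-ZIP marginals. The three Boyle Heights shares are the ones quoted in the text; the function name, the dictionary layout, and the choice to draw each attribute independently are illustrative assumptions, not Population Lab's actual code. Independent draws reproduce the marginals on average but ignore correlations between attributes.

```python
import random

# Boyle Heights marginals as quoted in the text. The dictionary layout and
# attribute names are hypothetical.
BOYLE_HEIGHTS = {
    "hispanic": 0.90,
    "age_65_plus": 0.16,
    "below_poverty": 0.24,
}

def sample_agents(marginals, n_agents=25, seed=0):
    """Draw n_agents profiles with one independent Bernoulli draw per attribute.

    Assumption: attributes are sampled independently from the marginals,
    which ignores any joint structure (for example age by poverty).
    """
    rng = random.Random(seed)
    return [
        {attr: rng.random() < p for attr, p in marginals.items()}
        for _ in range(n_agents)
    ]

agents = sample_agents(BOYLE_HEIGHTS)
share_hispanic = sum(a["hispanic"] for a in agents) / len(agents)
print(f"{len(agents)} agents, sampled Hispanic share = {share_hispanic:.2f}")
```

With only twenty-five draws per ZIP, the sampled shares will wander around the target marginals from seed to seed, which is the sampling noise discussed under the limitations.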

On top of the demographic substrate we add behavioral evidence drawn from sources external to LA County 2021 outcomes. Initial vaccine willingness is sampled from the Kaiser Family Foundation's January 2021 Vaccine Monitor priors — stratified by race × age. The four KFF categories (as soon as possible / wait and see / only if required / definitely not) become a starting attitude that the agent can shift over the simulated year as they observe their feed and their community. Trust toward different messengers is informed by Carson et al.'s 2021 JAMA Network Open study of vaccine decision-making in five LA racial and ethnic minority groups — encoded as conditional persona descriptors rather than numeric multipliers, because Carson is a qualitative paper and forcing it into coefficients would over-claim what it found.
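The stratified draw of a starting attitude can be sketched as a weighted categorical sample per race × age cell. The four category labels come from the text. The weights below are placeholders chosen only so the example runs; they are not KFF Vaccine Monitor figures, and the cell keys and function name are hypothetical.

```python
import random

# The four KFF categories named in the text.
CATEGORIES = [
    "as soon as possible",
    "wait and see",
    "only if required",
    "definitely not",
]

# PLACEHOLDER weights per (race, age band) cell. These are NOT KFF numbers;
# they exist only to make the sketch runnable.
PRIORS = {
    ("Hispanic", "18-64"): [0.40, 0.35, 0.10, 0.15],
    ("Hispanic", "65+"): [0.55, 0.25, 0.08, 0.12],
}

def sample_attitude(race, age_band, rng):
    """Draw one starting attitude from the prior for the agent's cell."""
    weights = PRIORS[(race, age_band)]
    return rng.choices(CATEGORIES, weights=weights, k=1)[0]

rng = random.Random(42)
starting_attitude = sample_attitude("Hispanic", "65+", rng)
print(starting_attitude)
```

The sampled label is only a starting point: in the design described above, the agent can move between categories as the simulated year unfolds.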

Environmental context anchors each agent in their actual neighborhood. Distance to the nearest mass vaccination site is computed from the agent's ZIP centroid to each of the seven documented LA County mega-sites (Dodger Stadium, the Forum, Cal State Northridge, Magic Mountain, Downey LACOE, Pomona Fairplex, SoFi Stadium). It enters the agent's persona as lived experience — "the nearest mass vaccination site is X miles from your home" — not as a coefficient.
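A minimal sketch of the distance step, assuming straight-line (great-circle) distance is what is meant. The text does not say whether distance is straight-line or travel distance, so the haversine formula here is an assumption. The site coordinates and the ZIP centroid are approximate illustrative values, only two of the seven sites are listed, and all names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Approximate, illustrative coordinates for two of the seven sites.
SITES = {
    "Dodger Stadium": (34.074, -118.240),
    "The Forum": (33.958, -118.342),
}

def nearest_site(centroid, sites):
    """Return (site name, distance in miles) for the site closest to the centroid."""
    name, coords = min(
        sites.items(), key=lambda kv: haversine_miles(*centroid, *kv[1])
    )
    return name, haversine_miles(*centroid, *coords)

def distance_sentence(centroid, sites):
    """Render the distance as persona text rather than as a numeric coefficient."""
    _, miles = nearest_site(centroid, sites)
    return f"The nearest mass vaccination site is {miles:.1f} miles from your home."

# Hypothetical centroid somewhere east of downtown Los Angeles.
example_centroid = (34.05, -118.21)
print(distance_sentence(example_centroid, SITES))
```

The point of the last function is the design choice described above: the number reaches the agent as a sentence in its persona, and nothing downstream multiplies by it.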

Every persona ingredient traces back to a published source. We do not invent attributes. Where evidence is qualitative we say so. Where the published sources do not give us what we would ideally want — we have no public data on party affiliation, for example, and party became the dominant uptake predictor by mid-2021 — we exclude that mechanism and document the omission as an expected limitation.

What our agents experience

Real 2021 events are pre-encoded as feed posts with their actual sources and arrive on the simulated feed at their actual dates. The CDPH director announces the eligibility-tier opening for adults sixty-five-plus on January 13. LA County DPH announces Dodger Stadium's opening on January 15 and the five mega-sites the following week. The FDA emergency-use authorization for Johnson & Johnson lands on February 27. The J&J pause hits on April 13. Universal eligibility for everyone sixteen and older opens on April 15. Newsom announces Vax for the Win on May 27. The healthcare-worker mandate order drops on August 5; the compliance deadline arrives on September 30. The CDC's all-adult booster recommendation lands on November 29.

Real 2021 events arrive on the simulated feed at their actual dates — eligibility tiers, mass-site openings, supply changes, mandates, incentives

Agents see these events the way Angelenos saw them in 2021 — as posts they may agree with, dismiss, get anxious about, or share with family. They react in character. We do not "force" any agent's behavior through a coefficient; the events are stimuli the agent reasons about.
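One way to picture the pre-encoded feed is as a list of dated records that are released to agents only once the simulated calendar reaches them. The sketch below uses a subset of the events and dates listed above, limited to those whose announcing source is named in the text. The `FeedEvent` class, its field names, and `events_between` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class FeedEvent:
    when: date    # actual 2021 date of the announcement
    source: str   # who announced it, as named in the text
    text: str     # what the agent sees on the feed

# Subset of the events described above; dates are the ones given in the text.
EVENTS = [
    FeedEvent(date(2021, 1, 13), "CDPH", "Eligibility opens for adults 65 and older."),
    FeedEvent(date(2021, 1, 15), "LA County DPH", "Dodger Stadium mass vaccination site opening announced."),
    FeedEvent(date(2021, 2, 27), "FDA", "Emergency-use authorization for the Johnson & Johnson vaccine."),
    FeedEvent(date(2021, 5, 27), "Governor Newsom", "Vax for the Win incentive announced."),
    FeedEvent(date(2021, 11, 29), "CDC", "Booster recommended for all adults."),
]

def events_between(events, start, end):
    """Events with start <= when < end, in chronological order."""
    return sorted(
        (e for e in events if start <= e.when < end),
        key=lambda e: e.when,
    )

january = events_between(EVENTS, date(2021, 1, 1), date(2021, 2, 1))
for event in january:
    print(event.when.isoformat(), event.source, "-", event.text)
```

Filtering by date window keeps the simulation honest about time: an agent in the January round cannot see the May incentive, because it has not been released to the feed yet.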

At the end of every simulated month, every agent answers a single structured-output survey question about their current vaccine status: vaccinated, want it as soon as I can, wait and see, only if required, or definitely not. Vaccinated agents auto-respond deterministically; unvaccinated agents respond per their persona. The survey gives us per-agent monthly observability without forcing every unvaccinated agent to post every round, and it produces the time-series we need to compare against CDPH.
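The monthly survey logic can be sketched as a closed set of five answers plus a short-circuit for vaccinated agents. The five answer strings come from the text. The agent dictionary, the `ask_llm` callable, and the decision to reject off-schema answers with an exception are assumptions made for illustration.

```python
from enum import Enum

class VaccineStatus(str, Enum):
    """The five survey answers listed in the text."""
    VACCINATED = "vaccinated"
    ASAP = "want it as soon as I can"
    WAIT_AND_SEE = "wait and see"
    ONLY_IF_REQUIRED = "only if required"
    DEFINITELY_NOT = "definitely not"

def monthly_survey(agent, ask_llm):
    """Return the agent's status for this month.

    Vaccinated agents answer deterministically with no model call.
    Unvaccinated agents are asked via ask_llm (a hypothetical callable that
    returns one of the five answer strings); anything outside the schema
    raises ValueError instead of being silently coerced.
    """
    if agent["vaccinated"]:
        return VaccineStatus.VACCINATED
    return VaccineStatus(ask_llm(agent))

# Stub standing in for the persona-conditioned model call.
def stub_llm(agent):
    return "wait and see"

print(monthly_survey({"vaccinated": True}, stub_llm).value)
print(monthly_survey({"vaccinated": False}, stub_llm).value)
```

Constraining the answer to an enumeration is what makes the monthly responses directly countable into an uptake time series.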

How we'll validate

The validation is the entire point. We pre-registered every metric, every threshold, and every success criterion in writing before the simulation runs and before we are allowed to look at the CDPH outcome data.

Validation pipeline: pre-register metrics, run blind, unembargo, compare and publish

The primary metric is Spearman rank correlation between predicted and actual end-of-2021 uptake rates across the twenty ZIP codes. Spearman is the right headline at this sample size because it depends only on the rank order of ZIPs, not on individual point predictions. With n=20 ZIPs the 95% confidence interval around any point estimate is still wide, and we report it. The locked thresholds are ρ ≥ 0.65 for "promising V1 evidence" and ρ ≥ 0.80 for "comparable to peer-reviewed neighborhood-level benchmarks."
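To make "wide" concrete, here is a dependency-free sketch of Spearman's ρ with an approximate 95% interval. The interval uses the Fisher z-transform with standard error 1/√(n−3), which is a common approximation borrowed from Pearson correlation and not necessarily the interval method in the pre-registration. All function names are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

def fisher_ci(rho, n, z_crit=1.96):
    """Approximate CI via the Fisher z-transform (assumed method, see lead-in)."""
    z = math.atanh(rho)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

low, high = fisher_ci(0.65, 20)
print(f"rho = 0.65, n = 20 -> approx. 95% CI ({low:.2f}, {high:.2f})")
```

Under this approximation, a point estimate sitting exactly on the 0.65 threshold with twenty ZIPs has an interval of roughly 0.29 to 0.85, which is why the interval is reported alongside the headline number.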

We also report mean absolute error in percentage points (locked threshold: at most 12 pp), the recovered HPI-quartile equity gap (locked threshold: within ±8 pp of observed), and the speed-of-uptake Spearman by month. Per-quartile Spearman is reported transparently as a diagnostic table. It is not a pass/fail gate, because at four-to-seven ZIPs per quartile a single rank swap dominates the metric and the gate would not be statistically meaningful.
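The two secondary checks can be sketched as follows. The 12 pp and ±8 pp thresholds come from the text. Defining the equity gap as mean uptake in the highest HPI quartile minus mean uptake in the lowest is an assumption, since the exact definition is not spelled out here. The ZIP labels and uptake numbers are made up for the example.

```python
def mean_absolute_error_pp(predicted, actual):
    """Mean absolute error across ZIPs, in percentage points."""
    return sum(abs(predicted[z] - actual[z]) for z in actual) / len(actual)

def equity_gap_pp(uptake, quartile):
    """Assumed definition: mean uptake in quartile 4 minus mean uptake in quartile 1."""
    top = [uptake[z] for z in uptake if quartile[z] == 4]
    bottom = [uptake[z] for z in uptake if quartile[z] == 1]
    return sum(top) / len(top) - sum(bottom) / len(bottom)

# Made-up uptake percentages for four hypothetical ZIPs.
predicted = {"A": 50.0, "B": 60.0, "C": 70.0, "D": 80.0}
actual = {"A": 45.0, "B": 66.0, "C": 68.0, "D": 85.0}
quartile = {"A": 1, "B": 1, "C": 4, "D": 4}

mae = mean_absolute_error_pp(predicted, actual)
gap_error = abs(equity_gap_pp(predicted, quartile) - equity_gap_pp(actual, quartile))

print(f"MAE = {mae:.1f} pp, passes: {mae <= 12.0}")
print(f"Equity-gap error = {gap_error:.1f} pp, passes: {gap_error <= 8.0}")
```

Note that the two checks can disagree: a model can rank ZIPs well and recover the gap while still being off in level everywhere, which is why both are reported.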

Three pre-registered outcomes are defined: success, partial, and failure. All three publish. This is unusual because it removes a degree of freedom that researchers normally have for cherry-picking. If the model fails to recover the equity pattern, the failure post-mortem is publishable in its own right: it tells us which mechanisms are missing or mis-weighted, and it is part of the method's credibility that we wrote it down before we ran it.

Honest limitations we know about

Every methodology choice in this case study has caveats. We documented eleven of them at length in the rationale doc; here are the four that matter most for interpreting the outcome.

The sample is small. Twenty ZIP codes × twenty-five agents per ZIP = five hundred agents. Demographic standard error per ZIP is roughly nine percentage points and is fixed across the five stochastic replicates we run. Total per-ZIP prediction uncertainty is on the order of ten percentage points. Headline claims should and will lean on rank-order metrics rather than individual-ZIP point predictions.
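The order of magnitude of the per-ZIP demographic error can be checked with the textbook standard error of a sampled proportion. Treating the quoted figure as a simple binomial standard error at n = 25 is an assumption about how it was derived; the function name is hypothetical.

```python
import math

def proportion_se_pp(p, n):
    """Standard error of a sampled proportion, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Worst case (p = 0.5) and the Boyle Heights poverty share quoted earlier.
for share in (0.50, 0.24):
    print(f"p = {share:.2f}, n = 25 -> SE = {proportion_se_pp(share, 25):.1f} pp")
```

At twenty-five agents per ZIP this gives 10.0 pp in the worst case and about 8.5 pp for a 24% share, consistent with the roughly nine points quoted above. Halving the error would take four times as many agents per ZIP.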

Party affiliation is excluded. ACS does not collect party affiliation, so it cannot be sampled into agent personas. Brookings showed that by mid-2021 party had become the dominant uptake predictor. We expect the model to over-predict uptake (that is, to under-predict hesitancy) in conservative-leaning Q4 ZIPs (Diamond Bar, Walnut), and we are explicit that this is an expected limitation of V1, not a model failure if it materializes.

The validation target is the CDPH twelve-plus archive, but our model is adults eighteen-plus. CDPH did not publish a separate eighteen-plus ZIP-level resource for 2021. We back-correct by subtracting CDC's published California adolescent uptake (state-level, ages twelve to seventeen) from the twelve-plus rate per ZIP, weighted by each ZIP's adolescent population fraction from ACS. The arithmetic is straightforward; the assumption is that statewide adolescent uptake is a reasonable proxy for adolescent uptake in any given LA ZIP. Where the back-correction produces impossible values we will clamp and report.
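One reading of the back-correction arithmetic is sketched below. It assumes the twelve-plus rate is a population-weighted mix of the adolescent and adult rates, r12 = f·r_adol + (1 − f)·r18, where f is adolescents as a share of the ZIP's twelve-plus population, so that r18 = (r12 − f·r_adol) / (1 − f). The division by (1 − f) and the clamp to the 0 to 100 range are assumptions consistent with the description, not a confirmed specification, and the input numbers are made up.

```python
def back_correct_to_adults(rate_12_plus, adolescent_rate, adolescent_share):
    """Estimate 18+ uptake (percent) from a 12+ rate.

    rate_12_plus:     ZIP-level uptake among ages 12+, in percent
    adolescent_rate:  statewide uptake among ages 12-17, in percent
    adolescent_share: ages 12-17 as a fraction of the ZIP's 12+ population
    Returns (estimate, was_clamped).
    """
    if not 0.0 <= adolescent_share < 1.0:
        raise ValueError("adolescent_share must be in [0, 1)")
    raw = (rate_12_plus - adolescent_share * adolescent_rate) / (1.0 - adolescent_share)
    clamped = min(100.0, max(0.0, raw))
    return clamped, clamped != raw

# Made-up inputs for illustration only.
estimate, was_clamped = back_correct_to_adults(70.0, 60.0, 0.08)
print(f"18+ estimate = {estimate:.2f}%, clamped: {was_clamped}")
```

Returning the clamp flag alongside the estimate makes it straightforward to list every ZIP where the correction produced an impossible value.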

The LLM has training-data knowledge of 2021 outcomes. This is the single most important risk. Persona prompts explicitly anchor the agent in early 2021 with instructions that they only know what has been said in their feed and what they have seen in their community. We regression-test this anchoring before any full simulation run with adversarial probes — direct knowledge questions, behavior vignettes that would tempt a hindsight-aware response, and survey-honesty probes that pressure a socially-desirable answer. The gate is a ninety-percent pass rate across diverse sample personas. If we cannot pass it, we tighten the anchoring and re-test before spending a dollar on the full simulation.
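The gate itself is simple arithmetic, sketched below with a per-probe-type breakdown so that a weakness in one probe family is visible even when the overall rate passes. The three probe types and the ninety-percent threshold come from the text; the data layout and function name are hypothetical, and whether the threshold applies overall or per type is not specified here, so this sketch applies it overall.

```python
from collections import defaultdict

def anchoring_gate(results, threshold=0.90):
    """results: list of (probe_type, passed) pairs.

    Returns (overall_pass_rate, gate_ok, per_type_pass_rates).
    Assumption: the threshold is applied to the overall pass rate.
    """
    if not results:
        raise ValueError("no probe results to evaluate")
    by_type = defaultdict(list)
    for probe_type, passed in results:
        by_type[probe_type].append(bool(passed))
    overall = sum(bool(p) for _, p in results) / len(results)
    per_type = {t: sum(v) / len(v) for t, v in by_type.items()}
    return overall, overall >= threshold, per_type

# Made-up probe outcomes: nine passes out of ten.
probe_results = (
    [("direct knowledge", True)] * 4
    + [("behavior vignette", True)] * 3
    + [("survey honesty", True)] * 2
    + [("survey honesty", False)]
)
rate, gate_ok, breakdown = anchoring_gate(probe_results)
print(f"pass rate = {rate:.0%}, gate passed: {gate_ok}")
print(breakdown)
```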

Why it matters if it works

If a synthetic population, run blind, can recover the observed equity pattern of a year-long real-world public-health rollout, that is a substantively new capability for healthcare. It means a health system can stress-test a planned outreach campaign against a digital twin of its actual patient population before launching. It means the effect of an equity intervention can be predicted stratum by stratum before millions of dollars are committed. It means counterfactual exploration ("what if Dodger Stadium had stayed open another two months?" "what if MyTurn had launched in twelve languages on day one?") becomes possible without an IRB-approved field study.

Keck Medicine of USC has a published Community Benefit Service Area and a serious institutional commitment to closing equity gaps in those neighborhoods. We chose Keck's CBSA as the anchor for this case study because it is a real-world setting where the methodology, if it works, has direct purchase. The same approach generalizes to colorectal screening campaigns, statin adherence programs, RSV vaccination outreach, ED throughput interventions — anything where uptake is socially and demographically driven.

If the methodology fails — and it might — we will publish the failure with the same rigor as a success. The pre-registration discipline is what makes that publishable, because the design was not chosen to make the model look good.

What's next

The simulation runs over the next several weeks. The persona-anchor regression test runs first as the gate before any full-simulation budget is spent. Once it passes, the full V1 (five hundred agents, five replicates, twelve monthly rounds, blind to CDPH outcomes throughout) runs on Google Cloud Run. Predictions are hash-locked to a reproducibility manifest before the embargo on CDPH data is lifted. Then we apply the Phase 6 back-correction (the twelve-plus to eighteen-plus adjustment described in the limitations above), compute every locked metric, and write up the result.
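Hash-locking can be as simple as computing a digest over a canonical serialization of the predictions and publishing the digest before the outcome data is opened. The sketch below assumes SHA-256 over sorted-key JSON; the actual manifest format is not described here, and the ZIP labels and values are made up.

```python
import hashlib
import json

def manifest_hash(predictions):
    """SHA-256 hex digest of a canonical JSON encoding of the predictions.

    Sorting keys and fixing separators makes the digest independent of
    dictionary ordering and whitespace.
    """
    canonical = json.dumps(predictions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Made-up predictions keyed by hypothetical ZIP labels.
locked = manifest_hash({"ZIP_A": 61.2, "ZIP_B": 74.8})
print(locked)
```

Anyone holding the published digest can later recompute it from the released predictions; a single changed value produces a different digest, so post-hoc edits are detectable.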

We will publish predicted versus actual side by side for every ZIP, in a follow-up post on this blog. Whether the headline reads "the model recovered the equity pattern within X percentage points" or "the model failed in the following ways and here is what we are changing for V2," the post will be live within four weeks of this one.

If you are at a health system with an equity mandate, an outreach campaign you are unsure about, or a workflow change you want to test before deploying — we are open to pilot conversations. The case-study artifact you can read on this blog is what we mean by "ready to do this for real."

X Research is building Population Lab, a multi-agent simulation platform grounded in real census-and-survey data for testing healthcare interventions before deploying them. The founder is a current USC Keck Medicine staff member; Keck Medicine's CBSA appears in this case study because that is where the founder's day-job exposure to the equity-intervention problem lives, not because Keck endorses this work.