
Add geo-experiment design workflow and sensitivity check plotting for Synthetic Control#819

Draft
drbenvincent wants to merge 17 commits into main from feature/sc-design-workflow


@drbenvincent drbenvincent commented Apr 2, 2026

Collaborator

Summary

Adds prospective experiment-design capabilities and first-class sensitivity check plotting to SyntheticControl.

Design-phase methods — so practitioners can assess whether a geo-experiment will work before committing budget:

  • SyntheticControl.from_pre_period() creates a design-phase instance from pre-period data only
  • validate_design() — dress rehearsal: injects a known effect and checks recovery
  • power_analysis() — simulation-based Bayesian power curve
  • donor_pool_quality() — composite quality score (correlation, convex hull, weight concentration)
  • DressRehearsalCheck wraps dress rehearsal as a Check for pipeline integration
  • Result classes (DressRehearsalResult, PowerCurveResult, DonorPoolQualityResult) with plot() and summary() methods

Sensitivity check plotting — previously, check visualizations lived as ~80 lines of custom matplotlib in the notebook. Now they are part of the library:

  • New causalpy/checks/_plot_helpers.py with shared forest_plot() and null_distribution_plot() helpers
  • plot() staticmethods on PlaceboInSpace, PlaceboInTime, LeaveOneOut, and PriorSensitivity
  • Each check's run() auto-populates CheckResult.figures with matplotlib figures
  • GenerateReport now renders check figures in the HTML report (base64-encoded PNGs)
  • Notebook custom plot cells replaced with single-line library calls (e.g. PlaceboInSpace.plot(result, baseline_stats=stats))

Notebook overhaul (sc_pymc.ipynb):

  • Switched to the California Proposition 99 dataset — the canonical SC example
  • Restructured as a full workflow: design assessment before analysis, robustness checks after
  • Literature-grounded narrative with citations to Abadie (2003, 2010, 2015, 2021), Athey (2017), Brodersen (2015)

Test plan

  • 27 integration tests pass (test_sc_design.py)
  • 13 unit tests for check plotting (test_check_plots.py)
  • All prek checks pass (ruff, mypy, codespell, notebook schema) — interrogate failure is pre-existing (84% vs 85%)
  • Verify notebook renders correctly via make html
  • Run full test suite to check for regressions

Add prospective design capabilities so practitioners can assess whether
a geo-experiment will work before committing budget:

- `SyntheticControl.from_pre_period()`: classmethod that fits SC on
  pre-period data only, enabling prospective design assessment without
  requiring post-period observations
- `validate_design()`: dress rehearsal that injects a known effect and
  checks if the model recovers it
- `power_analysis()`: simulation-based Bayesian power curve across
  candidate effect sizes
- `donor_pool_quality()`: composite quality score aggregating donor
  correlations, convex hull coverage, and weight concentration
- `DressRehearsalCheck`: pipeline-compatible Check wrapper for
  sensitivity analysis integration
- Result classes with `plot()` and `summary()` methods
- 27 integration tests covering both prospective and retrospective
  workflows
- Demo sections in sc_pymc.ipynb showing the real workflow: design
  assessment before analysis

Made-with: Cursor
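
The three ingredients listed for `donor_pool_quality()` (donor correlations, convex hull coverage, weight concentration) can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions, not the CausalPy implementation: the helper name `donor_pool_diagnostics`, the per-period min/max range used as a proxy for convex hull coverage, and the inverse Herfindahl index used as a concentration measure are all illustrative choices that do not come from the PR.

```python
import numpy as np


def donor_pool_diagnostics(treated, donors, weights):
    """Illustrative donor-pool diagnostics (hypothetical helper, not the CausalPy API).

    treated: shape (T,)   pre-period outcome of the treated unit
    donors:  shape (T, J) pre-period outcomes of J donor units
    weights: shape (J,)   non-negative donor weights summing to 1
    """
    treated = np.asarray(treated, dtype=float)
    donors = np.asarray(donors, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Pearson correlation of each donor with the treated unit.
    correlations = np.array(
        [np.corrcoef(treated, donors[:, j])[0, 1] for j in range(donors.shape[1])]
    )

    # Proxy for convex hull coverage (assumption): share of periods in which
    # the treated value lies inside the [min, max] range spanned by the donors.
    inside = (treated >= donors.min(axis=1)) & (treated <= donors.max(axis=1))

    # Weight concentration (assumption): inverse Herfindahl index, i.e. the
    # effective number of donors (1 = one donor dominates, J = equal weights).
    effective_donors = 1.0 / np.sum(weights**2)

    return {
        "mean_correlation": float(correlations.mean()),
        "hull_coverage": float(inside.mean()),
        "effective_donors": float(effective_donors),
    }


# Toy example with made-up numbers: two donors that bracket the treated unit.
toy_treated = np.array([1.0, 2.0, 3.0, 4.0])
toy_donors = np.column_stack([toy_treated - 1.0, toy_treated + 1.0])
toy_report = donor_pool_diagnostics(toy_treated, toy_donors, weights=[0.5, 0.5])
```

How the library combines such quantities into a single composite score is not specified here, so the sketch returns them separately.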
@drbenvincent drbenvincent added the OSS_PRODUCT label (OSS_PRODUCT project priorities. Labs members should get approval before logging hours.) Apr 2, 2026
@drbenvincent drbenvincent marked this pull request as draft April 2, 2026 21:06

codecov Bot commented Apr 2, 2026


Codecov Report

❌ Patch coverage is 92.49448% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.73%. Comparing base (1ee7322) to head (c6a9b21).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| causalpy/experiments/synthetic_control.py | 83.67% | 14 Missing and 10 partials ⚠️ |
| causalpy/experiments/sc_results.py | 92.59% | 3 Missing and 3 partials ⚠️ |
| causalpy/checks/dress_rehearsal.py | 82.60% | 3 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #819      +/-   ##
==========================================
- Coverage   93.77%   93.73%   -0.05%     
==========================================
  Files          77       80       +3     
  Lines       11881    12333     +452     
  Branches      696      732      +36     
==========================================
+ Hits        11142    11560     +418     
- Misses        546      566      +20     
- Partials      193      207      +14     


Reorder cells and rewrite headings so the notebook mirrors a
practitioner's actual workflow: design assessment before the
experiment, causal analysis after. Key changes:

- Move df.head() to the data-loading section
- Move convex hull explanation before the design section
- Rename headings to question-driven titles (educational-narrative)
- Add clear "Before / After the experiment" phase headings
- Add transition prose between design and analysis phases
- Add power curve interpretation cell with go/no-go guidance
- Link donor pool selection forward to donor_pool_quality()
- Demote Effect Summary to subsection of analysis phase

Made-with: Cursor
Donor pool selection and convex hull condition are pre-experiment
checks — they now sit as subsections of "Before the experiment"
rather than floating between Load data and the design section.

Also adds a reminder in the "After" section that the convex hull
check runs automatically when constructing the full SyntheticControl.
Fixes missing nbformat properties across all output cells.

Made-with: Cursor
Summarise the full before/after workflow under the title so readers
can see the notebook's scope at a glance. Each step gets 2-3
sentences explaining what it does and why it matters.

Made-with: Cursor
…pymc notebook

Expand the Synthetic Control notebook with academic references (Abadie 2010/2015/2021,
Athey & Imbens 2017, etc.) and add post-estimation robustness sections: placebo-in-space,
placebo-in-time, leave-one-out, and prior sensitivity — each with result visualisations
and interpretation guidance. Add 13 new BibTeX entries to references.bib.

Made-with: Cursor
Replace the synthetic toy dataset with the canonical Abadie, Diamond &
Hainmueller (2010) Proposition 99 dataset — per-capita cigarette sales
across 39 US states, 1970-2000. This grounds the notebook in real data
from the SC literature, improves connections to cited references, and
gives robustness checks a realistic "good case" to demonstrate.

- Add california_prop99.csv (wide format, 7 KB) and register as "prop99"
- Update all narrative to California/tobacco policy context
- Update all code cells: control_units, treated_unit, treatment_time
- Adjust holdout_periods for the 19-year pre-period

Made-with: Cursor
Enlarge the correlation heatmap for readability with 39 states, add an
explicit donor pool selection step that removes states with negative
pre-treatment correlation (threshold=0.0), and explain the threshold
choice. Excludes Alabama, Arkansas, Georgia, Tennessee — leaving 34
well-correlated donors.

Made-with: Cursor
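
The donor pool curation step described above (drop donors whose pre-treatment correlation with the treated unit is negative, `threshold=0.0`) can be sketched as follows. This is an illustration only: the helper name `select_donors`, the use of `>=` at the threshold, and the toy numbers are assumptions, and the data are not the Prop 99 series.

```python
import numpy as np


def select_donors(treated, donors, names, threshold=0.0):
    """Split donors by pre-treatment correlation with the treated unit.

    Hypothetical helper for illustration; not part of CausalPy.
    Donors with correlation below `threshold` are dropped.
    """
    treated = np.asarray(treated, dtype=float)
    donors = np.asarray(donors, dtype=float)
    corr = np.array(
        [np.corrcoef(treated, donors[:, j])[0, 1] for j in range(donors.shape[1])]
    )
    keep = corr >= threshold
    kept = [name for name, k in zip(names, keep) if k]
    dropped = [name for name, k in zip(names, keep) if not k]
    return kept, dropped


# Toy pre-treatment series (made-up numbers).
treated = np.array([10.0, 9.0, 8.0, 7.0, 6.0])
donors = np.column_stack(
    [
        [11.0, 10.0, 8.5, 7.5, 6.5],  # falls along with the treated unit
        [5.0, 6.0, 7.0, 8.0, 9.0],  # rises while the treated unit falls
    ]
)
kept, dropped = select_donors(treated, donors, ["donor_a", "donor_b"])
```

A threshold of zero is the most permissive cut that still removes donors moving in the opposite direction to the treated unit; a stricter positive threshold would trade donor count for similarity.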
Notebook fully executed with California Proposition 99 data: correlation
heatmap, donor pool curation, design assessment, model fit, effect
summaries, and all four robustness checks (placebo-in-space,
placebo-in-time, leave-one-out, prior sensitivity) with visualisations.

Made-with: Cursor
Extract shared plotting helpers (_plot_helpers.py) and add plot()
staticmethods to PlaceboInSpace, PlaceboInTime, LeaveOneOut, and
PriorSensitivity. Each check now auto-populates CheckResult.figures
in run(). GenerateReport renders check figures in the HTML report.
Replace ~80 lines of custom matplotlib in sc_pymc.ipynb with
single-line library calls.

Made-with: Cursor
@drbenvincent drbenvincent changed the title from "Add geo-experiment design workflow for Synthetic Control" to "Add geo-experiment design workflow and sensitivity check plotting for Synthetic Control" Apr 3, 2026
- Add raw data time-series visualization after data loading
- Add circle tile map showing per-state correlation with California
- Add interpretation text after dress rehearsal plot
- Document power curve Type I error issue as TODO; remove effect_size=0
- Reduce forest plot per-row height (0.45 -> 0.3) in _plot_helpers.py
- Fix pre-existing nbformat validation issues in cell outputs

Made-with: Cursor
Agents cannot detect unsaved IDE state, so prompt the user to confirm
all files (especially notebooks with expensive outputs) are saved
before staging and committing.

Made-with: Cursor
…kflow

Made-with: Cursor

# Conflicts:
#	causalpy/data/datasets.py
#	causalpy/experiments/__init__.py
#	causalpy/experiments/synthetic_control.py
#	docs/source/notebooks/sc_pymc.ipynb
#	docs/source/references.bib
PR #834 consolidated all per-document `:::{bibliography}` blocks
into the global `docs/source/references.rst` page to eliminate
`bibtex.duplicate_citation` warnings. The newly-rewritten
sc_pymc.ipynb still carried a local bibliography cell at the end;
remove it so the notebook conforms to the new convention. Inline
`{cite:p}` / `{cite:t}` references continue to resolve via the
global bibliography.

Made-with: Cursor
- Switch power_analysis criterion from default `hdi_excludes_zero`
  to `prob_gt_zero`. The HDI-based criterion is sign-blind and was
  flagging wrong-sign mis-fit artefacts as detections at small
  positive injected effects, producing a non-monotonic V-shape in
  the power curve.
- Bump n_simulations 10 -> 25 and extend effect_sizes to
  np.linspace(0, 0.25, 6) so the curve includes the null point and
  has tighter Monte Carlo precision.
- Update the surrounding markdown to describe the new criterion and
  reword the existing TODO admonition into a Caveat that records
  remaining sources of pseudo-post mis-fit bias and lists planned
  follow-ups (null-distribution calibration, longer holdout window,
  sign-aware HDI variant).

Outputs intentionally unchanged here; the notebook will be re-run
manually to refresh the power-curve figure and table.

Made-with: Cursor
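
To make the difference between the two detection criteria concrete, here is a small sketch that applies a sign-blind "HDI excludes zero" rule and a sign-aware "probability greater than zero" rule to the same posterior draws. The function bodies, the 94% HDI mass, and the 0.95 probability threshold are assumptions chosen for illustration; they are not taken from the CausalPy source.

```python
import numpy as np


def hdi(samples, mass=0.94):
    """Shortest interval containing `mass` of the samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    k = int(np.ceil(mass * n))  # number of samples inside the interval
    widths = x[k - 1 :] - x[: n - k + 1]  # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


def hdi_excludes_zero(samples, mass=0.94):
    """Fires when the HDI lies entirely on either side of zero (sign-blind)."""
    lo, hi = hdi(samples, mass)
    return bool(lo > 0 or hi < 0)


def prob_gt_zero(samples, threshold=0.95):
    """Fires only when most posterior mass is above zero (sign-aware)."""
    return bool(np.mean(np.asarray(samples) > 0) >= threshold)


rng = np.random.default_rng(0)
wrong_sign = rng.normal(loc=-1.0, scale=0.1, size=4000)  # posterior centred below zero
right_sign = rng.normal(loc=1.0, scale=0.1, size=4000)  # posterior centred above zero

wrong_sign_flags = (hdi_excludes_zero(wrong_sign), prob_gt_zero(wrong_sign))
right_sign_flags = (hdi_excludes_zero(right_sign), prob_gt_zero(right_sign))
```

For the wrong-sign posterior the HDI rule counts a detection while the probability rule does not, which is the mechanism the commit message blames for the V-shaped power curve.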
Reorder the design-phase sub-sections so the power curve comes first,
then a streamlined `validate_design(injected_effect=0)` rehearsal
explicitly framed as a placebo-in-time sanity check. Drop the
`injected_effect=0.15` rehearsal. Add a closing
"Putting it together" section that recaps each check, names the central
tension between the clean power curve and the failed placebo-in-time,
and separates magnitude estimation (likely biased) from existence-of-effect
inference (rescued by placebo-in-space). Trim the power-curve caveat
admonition to point forward to the new sub-section.

Implements the "Right for the Wrong Reasons" narrative: simpler diagnostics
look healthy, but the placebo-in-time check surfaces a structural
identification problem with the donor pool.

Made-with: Cursor
@drbenvincent

Collaborator Author

Question on the noise-injection step in power_analysis()

Flagging this for a closer look before we merge — I want to make sure we're on solid methodological ground here.

Quick recap of what the algorithm does, so the question is concrete. For each candidate effect size, and for each simulation within it, power_analysis() does roughly this: it takes a copy of the pre-period data, draws a fresh vector of Gaussian noise scaled by the residual standard deviation from the real pre-period fit, adds that noise to the treated unit's pre-period values, builds a new design-phase SC on the noisy data, runs validate_design() at the current effect size, and records whether the chosen detection criterion fires. Tally detections, divide by n_simulations, repeat across the effect-size grid, and that's the power curve.
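
The loop in the recap above can be sketched in a few lines of NumPy. This is a simplified stand-in, not the code in this PR: ordinary least squares replaces the Bayesian SC model, a crude one-sided z-score replaces the posterior-based detection criterion, the injected effect is treated as a relative lift applied to the last `holdout` periods, and all function names and the toy data are hypothetical.

```python
import numpy as np


def fit_weights(treated, donors):
    """Stand-in for the SC model: ordinary least-squares donor weights."""
    w, *_ = np.linalg.lstsq(donors, treated, rcond=None)
    return w


def dress_rehearsal(treated, donors, holdout, effect):
    """Inject a relative effect into the last `holdout` periods and test detection."""
    train_y, train_X = treated[:-holdout], donors[:-holdout]
    test_y, test_X = treated[-holdout:] * (1.0 + effect), donors[-holdout:]
    w = fit_weights(train_y, train_X)
    resid_sd = np.std(train_y - train_X @ w, ddof=1)
    gap = test_y - test_X @ w  # estimated effect path in the pseudo-post window
    z = gap.mean() / (resid_sd / np.sqrt(holdout))
    return bool(z > 1.645)  # stand-in for the Bayesian detection criterion


def power_curve(treated, donors, effect_sizes, n_simulations, holdout, seed=0):
    rng = np.random.default_rng(seed)
    # Residual scale from the fit on the real (un-noised) pre-period data.
    w = fit_weights(treated, donors)
    residual_std = np.std(treated - donors @ w, ddof=1)
    power = []
    for effect in effect_sizes:
        hits = 0
        for _ in range(n_simulations):
            # iid Gaussian noise added to the treated unit only; donors stay fixed.
            noise = rng.normal(0.0, residual_std, size=treated.shape)
            hits += dress_rehearsal(treated + noise, donors, holdout, effect)
        power.append(hits / n_simulations)
    return np.array(power)


# Toy pre-period data (made-up): 40 periods, 5 random-walk donors.
rng = np.random.default_rng(1)
T, J = 40, 5
donors = 100 + np.cumsum(rng.normal(size=(T, J)), axis=0)
treated = donors @ np.full(J, 1 / J) + rng.normal(0, 2.0, size=T)
curve = power_curve(treated, donors, np.linspace(0, 0.25, 6), n_simulations=25, holdout=5)
```

In this sketch the only random quantity inside the loop is `noise`, which mirrors the point made below: every statistical property of the resulting curve is inherited from that one draw. The first entry of `curve` (effect size 0) is the false-positive rate of the stand-in criterion.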

The bit I want to sanity-check is the noise injection itself. It's the only source of simulation-to-simulation variability — the injected effect is deterministic, the donor pool is fixed, the model is fixed — so whatever statistical properties we attribute to the power curve are inherited entirely from that step. Two specific worries:

  1. Is the noise model right? We're drawing iid Gaussian noise with sigma = residual_std, where residual_std is the std of the pre-period residuals' posterior-mean trajectory. That's a single scalar applied uniformly across time, treating residuals as homoscedastic and independent. For most real treated time-series — including Prop 99 — pre-period residuals are autocorrelated and often heteroscedastic. An iid Gaussian draw will under-represent the kind of variation we actually expect to see in a fresh realisation of the same data-generating process, which would tend to make the power curve look tighter / more optimistic than it should.

  2. Is sampling-variation-on-the-treated-unit-only the right notion of "a fresh experiment" for SC? In a frequentist power calculation we'd usually resample from the assumed DGP. Here we're perturbing the observed treated trajectory while holding the donor pool fixed at its single realised path. That's a defensible choice (donors are the "design", treated is the "outcome"), but it's not obviously the same thing as sampling-distribution-style power, and I haven't found a clean reference in the synthetic-control literature that endorses exactly this resampling scheme. Abadie-style inference uses placebo-in-space permutations rather than parametric noise injection; the Bayesian-power literature (e.g. Kruschke-style assurance, design prior approaches) typically simulates from the full posterior predictive rather than perturbing one series.

Concrete things I'd like us to check before this lands:

  • Is there a paper / standard reference we can cite for this specific resampling scheme? If yes, let's add the citation in the docstring and the notebook. If no, we should be explicit in the docs that this is a heuristic and describe its limitations.
  • Should the noise model at least respect pre-period autocorrelation (e.g. block bootstrap of residuals, or an AR(1) draw calibrated from the pre-period residuals) rather than iid Gaussian?
  • Should we offer an alternative variability source — e.g. resampling from the posterior predictive of the design-phase fit — as a non-default option, so users can compare?
  • At minimum: does the docstring make it clear what assumption the iid-Gaussian-on-treated step is making, and what kinds of mis-specification will bias the resulting power curve?
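
As a concrete reference point for the second bullet, the sketch below shows two autocorrelation-aware alternatives to iid Gaussian noise: an AR(1) draw calibrated from residuals, and a moving-block bootstrap of the residuals. The function names, the block length, and the toy residual series are assumptions for illustration; none of this is in the PR.

```python
import numpy as np


def fit_ar1(residuals):
    """Estimate lag-1 autocorrelation and innovation std from a residual series."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    phi = float(np.dot(r[1:], r[:-1]) / np.dot(r[:-1], r[:-1]))
    phi = float(np.clip(phi, -0.99, 0.99))  # keep the process stationary
    innov_std = float(np.std(r[1:] - phi * r[:-1], ddof=1))
    return phi, innov_std


def draw_ar1(phi, innov_std, n, rng):
    """Simulate n steps of a zero-mean stationary AR(1) process."""
    out = np.empty(n)
    # Start from the stationary distribution so early values are not damped.
    out[0] = rng.normal(0.0, innov_std / np.sqrt(1.0 - phi**2))
    for t in range(1, n):
        out[t] = phi * out[t - 1] + rng.normal(0.0, innov_std)
    return out


def block_bootstrap(residuals, n, block_len, rng):
    """Moving-block bootstrap: stitch together randomly chosen contiguous blocks."""
    r = np.asarray(residuals, dtype=float)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, len(r) - block_len + 1, size=n_blocks)
    return np.concatenate([r[s : s + block_len] for s in starts])[:n]


rng = np.random.default_rng(0)
residuals = draw_ar1(0.7, 1.0, 2000, rng)  # toy residuals with known autocorrelation
phi_hat, innov_hat = fit_ar1(residuals)
ar1_noise = draw_ar1(phi_hat, innov_hat, 19, rng)  # e.g. one draw for a 19-period window
boot_noise = block_bootstrap(residuals, 19, block_len=4, rng=rng)
```

Either draw could replace the iid noise vector in the simulation loop. With a short pre-period, the AR(1) coefficient is itself estimated noisily, so the parametric option adds its own uncertainty, while the block bootstrap depends on a block length that has to be chosen.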

Happy to take the lead on any of the above if we agree it's worth doing in this PR rather than as a follow-up. My instinct is that the first bullet (find a citation or be honest that there isn't one) is the minimum bar for landing, and the others can become a follow-up issue.

@cetagostini

Collaborator

Hey, amazing work and great digging here. I've had this idea many times before and I kind of like it because it sounds intuitive, but I think this re-implements a capability that already exists in CausalPy.

Placebo-in-time was built with the goal of running a "power analysis" (Bayesian assurance; "power" is more of a frequentist term). The approach addresses many of the concerns you raise, and it can give you an output like this one: Check here ->

image

PS: The output shows detection probability as a continuous function of effect magnitude (the same thing as a power curve), but decomposed into three regions (Correct Detection, Misclassification, Non-Detection) rather than a single binary threshold, and it's derived from the placebo-calibrated null model, not iid noise injection. The plot isn't available yet, but we only need a function to make it.

On the other hand, this power analysis only tells you whether the information holds. You don't need to re-run the model several times to get this: you can fit the model over the most recent pre-period, get a posterior over different trajectories, estimate a CI, and then properly estimate what effect size would be greater than Z, either on average or cumulatively. You could simulate as many trajectories as you want and be creative here, but this only gives you "power" based on the given model uncertainty, and the loop adds no information.

Additionally, injecting an effect has the flaws you already detected, and it increases complexity without solving the previous point.


You are right to mention design prior approaches. This is well documented, and it's the complement of the null model that already comes from placebo estimation. You can say, "Based on prior knowledge, my expected cumulative effect (design prior) is N", then draw a full estimation based on it and see where it lands, which helps you estimate the curve shown above. The "curve" comes out by construction: it's the joint integration of detection probability against the design prior, weighted by plausibility. Bayesian Assurance (O'Hagan, 2005) gives you the operating characteristics by integration, so the "curve" emerges naturally without a brute-force simulation loop. CausalPy already implements this via PlaceboInTime's expected_effect_prior argument (#826).
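
The "integration against the design prior" idea can be sketched as a Monte Carlo average: take a null distribution of effect estimates from placebo runs, derive a detection cut-off from it, and average the detection probability over draws from the design prior. This is a minimal illustration under assumptions, not how `PlaceboInTime` or `expected_effect_prior` work internally: the function name, the additive error model (estimate = true effect + placebo error), the one-sided 5% cut-off, and the toy numbers are all hypothetical.

```python
import numpy as np


def assurance(placebo_effects, design_prior_draws, alpha=0.05):
    """Monte Carlo assurance: detection probability averaged over a design prior.

    placebo_effects:    effect estimates from placebo runs (true effect is zero)
    design_prior_draws: draws of the plausible true effect
    """
    null = np.asarray(placebo_effects, dtype=float)
    prior = np.asarray(design_prior_draws, dtype=float)
    threshold = np.quantile(null, 1.0 - alpha)  # detection cut-off from the null
    # Assumption: the estimate under a true effect d is d plus a placebo error e,
    # so P(detect | d) is the share of placebo errors with d + e > threshold.
    detect_prob = (prior[:, None] + null[None, :] > threshold).mean(axis=1)
    return float(detect_prob.mean()), detect_prob


rng = np.random.default_rng(0)
placebo = rng.normal(0.0, 1.0, size=500)  # toy placebo-calibrated null
design_prior = rng.normal(3.0, 1.0, size=2000)  # toy prior belief about the effect
overall, per_draw = assurance(placebo, design_prior)
null_rate, _ = assurance(placebo, np.zeros(100))  # prior concentrated at zero
```

Sorting `per_draw` by the corresponding prior draw gives detection probability as a function of effect size, and `overall` is its prior-weighted average. A prior concentrated at zero recovers roughly `alpha`, which is a quick sanity check on the cut-off.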

You can loop over the different outcomes of this method to estimate the best combination of donors or other characteristics. Read here and check the references!

My take in short: it's a great job, but it reimplements existing logic. After #826 I can make a PR with new plots only and a notebook to show this full pipeline and how these things are solved. Unless I'm missing something here, which is very probable.



2 participants