Add geo-experiment design workflow and sensitivity check plotting for Synthetic Control by drbenvincent · Pull Request #819 · pymc-labs/CausalPy

drbenvincent · 2026-04-02T21:05:46Z

Summary

Adds prospective experiment-design capabilities and first-class sensitivity check plotting to SyntheticControl.

Design-phase methods — so practitioners can assess whether a geo-experiment will work before committing budget:

- `SyntheticControl.from_pre_period()`: creates a design-phase instance from pre-period data only
- `validate_design()`: dress rehearsal that injects a known effect and checks recovery
- `power_analysis()`: simulation-based Bayesian power curve
- `donor_pool_quality()`: composite quality score (correlation, convex hull, weight concentration)
- `DressRehearsalCheck`: wraps the dress rehearsal as a Check for pipeline integration
- Result classes (`DressRehearsalResult`, `PowerCurveResult`, `DonorPoolQualityResult`) with `plot()` and `summary()` methods

Sensitivity check plotting — previously, check visualizations lived as ~80 lines of custom matplotlib in the notebook. Now they are part of the library:

- New `causalpy/checks/_plot_helpers.py` with shared `forest_plot()` and `null_distribution_plot()` helpers
- `plot()` staticmethods on `PlaceboInSpace`, `PlaceboInTime`, `LeaveOneOut`, and `PriorSensitivity`
- Each check's `run()` auto-populates `CheckResult.figures` with matplotlib figures
- `GenerateReport` now renders check figures in the HTML report (base64-encoded PNGs)
- Notebook custom plot cells replaced with single-line library calls (e.g. `PlaceboInSpace.plot(result, baseline_stats=stats)`)
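
The base64 embedding of figures in the HTML report can be pictured with the standard library alone. This is a minimal sketch, not CausalPy's `GenerateReport` code: `png_to_img_tag` is a hypothetical helper, and the PNG bytes are assumed to come from something like matplotlib's `fig.savefig(buffer, format="png")` on an `io.BytesIO` buffer.

```python
import base64
import html


def png_to_img_tag(png_bytes: bytes, alt: str = "check figure") -> str:
    """Return an <img> tag that embeds PNG bytes as a base64 data URI.

    Hypothetical helper for illustration; not part of CausalPy.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img alt="{html.escape(alt)}" src="data:image/png;base64,{encoded}">'


# Stand-in bytes: only the 8-byte PNG signature, not a complete image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
tag = png_to_img_tag(PNG_SIGNATURE, alt="placebo-in-space")
```

Embedding the image as a data URI keeps the report a single self-contained HTML file, at the cost of roughly a third more bytes than the raw PNG.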

Notebook overhaul (sc_pymc.ipynb):

- Switched to the California Proposition 99 dataset, the canonical SC example
- Restructured as a full workflow: design assessment before analysis, robustness checks after
- Literature-grounded narrative with citations to Abadie (2003, 2010, 2015, 2021), Athey (2017), Brodersen (2015)

Test plan

- 27 integration tests pass (`test_sc_design.py`)
- 13 unit tests for check plotting (`test_check_plots.py`)
- All prek checks pass (ruff, mypy, codespell, notebook schema); the interrogate failure is pre-existing (84% vs 85%)
- Verify notebook renders correctly via `make html`
- Run full test suite to check for regressions

Add prospective design capabilities so practitioners can assess whether a geo-experiment will work before committing budget:

- `SyntheticControl.from_pre_period()`: classmethod that fits SC on pre-period data only, enabling prospective design assessment without requiring post-period observations
- `validate_design()`: dress rehearsal that injects a known effect and checks if the model recovers it
- `power_analysis()`: simulation-based Bayesian power curve across candidate effect sizes
- `donor_pool_quality()`: composite quality score aggregating donor correlations, convex hull coverage, and weight concentration
- `DressRehearsalCheck`: pipeline-compatible Check wrapper for sensitivity analysis integration
- Result classes with `plot()` and `summary()` methods
- 27 integration tests covering both prospective and retrospective workflows
- Demo sections in sc_pymc.ipynb showing the real workflow: design assessment before analysis

Made-with: Cursor

review-notebook-app · 2026-04-02T21:05:53Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

codecov · 2026-04-02T21:11:32Z

Codecov Report

❌ Patch coverage is 92.49448% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.73%. Comparing base (1ee7322) to head (c6a9b21).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| causalpy/experiments/synthetic_control.py | 83.67% | 14 Missing and 10 partials ⚠️ |
| causalpy/experiments/sc_results.py | 92.59% | 3 Missing and 3 partials ⚠️ |
| causalpy/checks/dress_rehearsal.py | 82.60% | 3 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #819      +/-   ##
==========================================
- Coverage   93.77%   93.73%   -0.05%     
==========================================
  Files          77       80       +3     
  Lines       11881    12333     +452     
  Branches      696      732      +36     
==========================================
+ Hits        11142    11560     +418     
- Misses        546      566      +20     
- Partials      193      207      +14

read-the-docs-community · 2026-04-02T21:19:43Z

Documentation build overview

📚 causalpy | 🛠️ Build #32496639 | 📁 Comparing 8d9b164 against latest (2ff6b7b)

🔍 Preview build

47 files changed · + 24 added · ± 23 modified

Reorder cells and rewrite headings so the notebook mirrors a practitioner's actual workflow: design assessment before the experiment, causal analysis after. Key changes:

- Move df.head() to the data-loading section
- Move convex hull explanation before the design section
- Rename headings to question-driven titles (educational-narrative)
- Add clear "Before / After the experiment" phase headings
- Add transition prose between design and analysis phases
- Add power curve interpretation cell with go/no-go guidance
- Link donor pool selection forward to donor_pool_quality()
- Demote Effect Summary to subsection of analysis phase

Made-with: Cursor

Donor pool selection and convex hull condition are pre-experiment checks — they now sit as subsections of "Before the experiment" rather than floating between Load data and the design section. Also adds a reminder in the "After" section that the convex hull check runs automatically when constructing the full SyntheticControl. Fixes missing nbformat properties across all output cells. Made-with: Cursor
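
A rough way to picture the convex hull condition is a pointwise range check: at each pre-period time point, does the treated value lie between the smallest and largest donor values? Passing this check is necessary but not sufficient for the treated series to sit inside the donors' convex hull. The sketch below is a simplified illustration under that assumption, not CausalPy's actual check; `hull_coverage` and the toy series are hypothetical.

```python
def hull_coverage(treated: list[float], donors: dict[str, list[float]]) -> float:
    """Fraction of time points where the treated value lies within the donors' range."""
    inside = 0
    for t, value in enumerate(treated):
        donor_values = [series[t] for series in donors.values()]
        if min(donor_values) <= value <= max(donor_values):
            inside += 1
    return inside / len(treated)


# Toy data (hypothetical): the treated unit leaves the donor range at the last point.
donors = {"donor_a": [1.0, 2.0, 3.0, 4.0], "donor_b": [3.0, 4.0, 5.0, 6.0]}
treated = [2.0, 3.0, 4.0, 7.0]
coverage = hull_coverage(treated, donors)
```

A coverage well below 1 suggests that non-negative weights summing to one cannot reproduce the treated trajectory, whatever the fitting method.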

Summarise the full before/after workflow under the title so readers can see the notebook's scope at a glance. Each step gets 2-3 sentences explaining what it does and why it matters. Made-with: Cursor

…pymc notebook Expand the Synthetic Control notebook with academic references (Abadie 2010/2015/2021, Athey & Imbens 2017, etc.) and add post-estimation robustness sections: placebo-in-space, placebo-in-time, leave-one-out, and prior sensitivity — each with result visualisations and interpretation guidance. Add 13 new BibTeX entries to references.bib. Made-with: Cursor

Replace the synthetic toy dataset with the canonical Abadie, Diamond & Hainmueller (2010) Proposition 99 dataset: per-capita cigarette sales across 39 US states, 1970-2000. This grounds the notebook in real data from the SC literature, improves connections to cited references, and gives robustness checks a realistic "good case" to demonstrate.

- Add california_prop99.csv (wide format, 7 KB) and register as "prop99"
- Update all narrative to California/tobacco policy context
- Update all code cells: control_units, treated_unit, treatment_time
- Adjust holdout_periods for the 19-year pre-period

Made-with: Cursor

Enlarge the correlation heatmap for readability with 39 states, add an explicit donor pool selection step that removes states with negative pre-treatment correlation (threshold=0.0), and explain the threshold choice. Excludes Alabama, Arkansas, Georgia, Tennessee — leaving 34 well-correlated donors. Made-with: Cursor
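
The donor pool selection step described here amounts to a correlation filter. A minimal sketch follows, assuming a plain Pearson correlation over the pre-period and a keep rule of `r >= threshold`; `pearson`, `select_donors`, and the toy series are hypothetical and do not reproduce the notebook's code or the Prop 99 data.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_donors(treated, candidates, threshold=0.0):
    """Keep candidates whose pre-period correlation with the treated unit is >= threshold."""
    correlations = {name: pearson(treated, series) for name, series in candidates.items()}
    kept = [name for name, r in correlations.items() if r >= threshold]
    return kept, correlations


# Toy pre-period series (hypothetical, not the Prop 99 data).
treated_pre = [10.0, 9.0, 8.5, 8.0, 7.0]
candidates = {
    "donor_a": [12.0, 11.0, 10.0, 9.5, 9.0],  # falls with the treated unit
    "donor_b": [5.0, 5.5, 6.0, 7.0, 8.0],  # rises while the treated unit falls
}
kept, correlations = select_donors(treated_pre, candidates)
```

With a threshold of 0.0, only donors that move against the treated unit are dropped, which is a deliberately permissive filter.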

Notebook fully executed with California Proposition 99 data: correlation heatmap, donor pool curation, design assessment, model fit, effect summaries, and all four robustness checks (placebo-in-space, placebo-in-time, leave-one-out, prior sensitivity) with visualisations. Made-with: Cursor

Extract shared plotting helpers (_plot_helpers.py) and add plot() staticmethods to PlaceboInSpace, PlaceboInTime, LeaveOneOut, and PriorSensitivity. Each check now auto-populates CheckResult.figures in run(). GenerateReport renders check figures in the HTML report. Replace ~80 lines of custom matplotlib in sc_pymc.ipynb with single-line library calls. Made-with: Cursor

- Add raw data time-series visualization after data loading
- Add circle tile map showing per-state correlation with California
- Add interpretation text after dress rehearsal plot
- Document power curve Type I error issue as TODO; remove effect_size=0
- Reduce forest plot per-row height (0.45 -> 0.3) in _plot_helpers.py
- Fix pre-existing nbformat validation issues in cell outputs

Made-with: Cursor

Made-with: Cursor

Agents cannot detect unsaved IDE state, so prompt the user to confirm all files (especially notebooks with expensive outputs) are saved before staging and committing. Made-with: Cursor

…kflow Made-with: Cursor # Conflicts: # causalpy/data/datasets.py # causalpy/experiments/__init__.py # causalpy/experiments/synthetic_control.py # docs/source/notebooks/sc_pymc.ipynb # docs/source/references.bib

PR #834 consolidated all per-document `:::{bibliography}` blocks into the global `docs/source/references.rst` page to eliminate `bibtex.duplicate_citation` warnings. The newly-rewritten sc_pymc.ipynb still carried a local bibliography cell at the end; remove it so the notebook conforms to the new convention. Inline `{cite:p}` / `{cite:t}` references continue to resolve via the global bibliography. Made-with: Cursor

- Switch power_analysis criterion from default `hdi_excludes_zero` to `prob_gt_zero`. The HDI-based criterion is sign-blind and was flagging wrong-sign mis-fit artefacts as detections at small positive injected effects, producing a non-monotonic V-shape in the power curve.
- Bump n_simulations 10 -> 25 and extend effect_sizes to np.linspace(0, 0.25, 6) so the curve includes the null point and has tighter Monte Carlo precision.
- Update the surrounding markdown to describe the new criterion and reword the existing TODO admonition into a Caveat that records remaining sources of pseudo-post mis-fit bias and lists planned follow-ups (null-distribution calibration, longer holdout window, sign-aware HDI variant).

Outputs intentionally unchanged here; the notebook will be re-run manually to refresh the power-curve figure and table.

Made-with: Cursor
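
The sign-blindness described above can be illustrated on plain posterior samples. The sketch below reuses the criterion names `hdi_excludes_zero` and `prob_gt_zero` for readability, but the implementations, the 94% HDI mass, and the 0.95 probability threshold are assumptions made for illustration, not CausalPy's code or defaults.

```python
import random


def hdi(samples, mass=0.94):
    """Narrowest interval containing roughly `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(2, int(round(mass * n)))  # number of samples inside the interval
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def hdi_excludes_zero(samples, mass=0.94):
    """Sign-blind: fires when the interval lies entirely on either side of zero."""
    low, high = hdi(samples, mass)
    return low > 0 or high < 0


def prob_gt_zero(samples, threshold=0.95):
    """Directional: fires only when most posterior mass is positive."""
    share_positive = sum(s > 0 for s in samples) / len(samples)
    return share_positive >= threshold


# Toy posteriors (hypothetical): one clearly negative, one clearly positive.
rng = random.Random(0)
wrong_sign = [rng.gauss(-1.0, 0.2) for _ in range(2000)]
right_sign = [rng.gauss(1.0, 0.2) for _ in range(2000)]
```

On `wrong_sign`, the HDI criterion reports a detection while the directional criterion does not, which is the kind of behaviour that can produce a V-shaped power curve when mis-fit pushes the estimate negative.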

Reorder the design-phase sub-sections so the power curve comes first, then a streamlined `validate_design(injected_effect=0)` rehearsal explicitly framed as a placebo-in-time sanity check. Drop the `injected_effect=0.15` rehearsal. Add a closing "Putting it together" section that recaps each check, names the central tension between the clean power curve and the failed placebo-in-time, and separates magnitude estimation (likely biased) from existence-of-effect inference (rescued by placebo-in-space). Trim the power-curve caveat admonition to point forward to the new sub-section. Implements the "Right for the Wrong Reasons" narrative: simpler diagnostics look healthy, but the placebo-in-time check surfaces a structural identification problem with the donor pool. Made-with: Cursor

drbenvincent · 2026-05-12T11:58:25Z

Question on the noise-injection step in `power_analysis()`

Flagging this for a closer look before we merge — I want to make sure we're on solid methodological ground here.

Quick recap of what the algorithm does, so the question is concrete. For each candidate effect size, and for each simulation within it, power_analysis() does roughly this: it takes a copy of the pre-period data, draws a fresh vector of Gaussian noise scaled by the residual standard deviation from the real pre-period fit, adds that noise to the treated unit's pre-period values, builds a new design-phase SC on the noisy data, runs validate_design() at the current effect size, and records whether the chosen detection criterion fires. Tally detections, divide by n_simulations, repeat across the effect-size grid, and that's the power curve.
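
The loop recapped above can be sketched in a few lines. This is a toy stand-in, not the `power_analysis()` implementation: a one-parameter least-squares scale on a donor average replaces the Bayesian SC fit, a one-sided z-style threshold replaces the detection criterion, and the injected effect is assumed to be a relative lift on the held-out tail. All function names and the toy data are hypothetical.

```python
import random
import statistics


def fit_scale(treated, donor_avg):
    """Least-squares scale w minimising sum((treated - w * donor_avg) ** 2)."""
    return sum(t * d for t, d in zip(treated, donor_avg)) / sum(d * d for d in donor_avg)


def residual_std(treated, donor_avg):
    """Standard deviation of the residuals from the full pre-period fit."""
    w = fit_scale(treated, donor_avg)
    return statistics.stdev([t - w * d for t, d in zip(treated, donor_avg)])


def rehearsal_detects(treated, donor_avg, holdout, effect, z_crit=1.645):
    """Toy dress rehearsal: fit on the early pre-period, inject a relative lift
    into the held-out tail, and test whether the mean gap is clearly positive."""
    n_fit = len(treated) - holdout
    w = fit_scale(treated[:n_fit], donor_avg[:n_fit])
    sd = statistics.stdev([t - w * d for t, d in zip(treated[:n_fit], donor_avg[:n_fit])])
    gaps = [t * (1 + effect) - w * d for t, d in zip(treated[n_fit:], donor_avg[n_fit:])]
    z = statistics.fmean(gaps) / (sd / holdout**0.5)
    return z > z_crit


def power_curve(treated, donor_avg, effect_sizes, n_simulations, holdout, seed=0):
    """Detection rate per effect size under iid Gaussian noise on the treated series."""
    rng = random.Random(seed)
    sigma = residual_std(treated, donor_avg)  # one scalar for all time points
    curve = {}
    for effect in effect_sizes:
        detections = 0
        for _ in range(n_simulations):
            # The only random step: iid Gaussian noise added to the treated series.
            noisy = [t + rng.gauss(0.0, sigma) for t in treated]
            detections += rehearsal_detects(noisy, donor_avg, holdout, effect)
        curve[effect] = detections / n_simulations
    return curve


# Toy pre-period data (hypothetical): treated tracks 0.9 x donor average plus noise.
data_rng = random.Random(1)
donor_avg = [100.0 + t for t in range(30)]
treated = [0.9 * d + data_rng.gauss(0.0, 1.0) for d in donor_avg]
curve = power_curve(treated, donor_avg, [0.0, 0.25], n_simulations=25, holdout=6)
```

Note that the donor series and the observed treated trajectory are shared by every simulation, so any mis-fit in the held-out tail is common to all runs rather than averaged out.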

The bit I want to sanity-check is the noise injection itself. It's the only source of simulation-to-simulation variability — the injected effect is deterministic, the donor pool is fixed, the model is fixed — so whatever statistical properties we attribute to the power curve are inherited entirely from that step. Two specific worries:

1. Is the noise model right? We're drawing iid Gaussian noise with sigma = residual_std, where residual_std is the std of the pre-period residuals' posterior-mean trajectory. That's a single scalar applied uniformly across time, treating residuals as homoscedastic and independent. For most real treated time-series, including Prop 99, pre-period residuals are autocorrelated and often heteroscedastic. An iid Gaussian draw will under-represent the kind of variation we actually expect to see in a fresh realisation of the same data-generating process, which would tend to make the power curve look tighter / more optimistic than it should.
2. Is sampling-variation-on-the-treated-unit-only the right notion of "a fresh experiment" for SC? In a frequentist power calculation we'd usually resample from the assumed DGP. Here we're perturbing the observed treated trajectory while holding the donor pool fixed at its single realised path. That's a defensible choice (donors are the "design", treated is the "outcome"), but it's not obviously the same thing as sampling-distribution-style power, and I haven't found a clean reference in the synthetic-control literature that endorses exactly this resampling scheme. Abadie-style inference uses placebo-in-space permutations rather than parametric noise injection; the Bayesian-power literature (e.g. Kruschke-style assurance, design prior approaches) typically simulates from the full posterior predictive rather than perturbing one series.

Concrete things I'd like us to check before this lands:

- Is there a paper / standard reference we can cite for this specific resampling scheme? If yes, let's add the citation in the docstring and the notebook. If no, we should be explicit in the docs that this is a heuristic and describe its limitations.
- Should the noise model at least respect pre-period autocorrelation (e.g. block bootstrap of residuals, or an AR(1) draw calibrated from the pre-period residuals) rather than iid Gaussian?
- Should we offer an alternative variability source, e.g. resampling from the posterior predictive of the design-phase fit, as a non-default option, so users can compare?
- At minimum: does the docstring make it clear what assumption the iid-Gaussian-on-treated step is making, and what kinds of mis-specification will bias the resulting power curve?
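
To make the autocorrelation option concrete, here is a minimal sketch of an AR(1) noise draw calibrated from residuals, next to the iid draw it would replace. It assumes a lag-1 autocorrelation estimate and matches the marginal standard deviation of the residuals; the function names and the toy residuals are hypothetical, and nothing here is part of CausalPy.

```python
import random
import statistics


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a series."""
    mean = statistics.fmean(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x[:-1], x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den


def draw_iid_noise(residuals, n, rng):
    """iid Gaussian noise with the residuals' standard deviation."""
    sigma = statistics.pstdev(residuals)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def draw_ar1_noise(residuals, n, rng):
    """AR(1) noise with the residuals' lag-1 autocorrelation and marginal std."""
    rho = max(-0.99, min(0.99, lag1_autocorr(residuals)))
    sigma = statistics.pstdev(residuals)
    innovation_sd = sigma * (1.0 - rho**2) ** 0.5  # keeps the marginal std at sigma
    value = rng.gauss(0.0, sigma)
    draws = [value]
    for _ in range(n - 1):
        value = rho * value + rng.gauss(0.0, innovation_sd)
        draws.append(value)
    return draws


# Toy autocorrelated "residuals" (hypothetical), generated with rho = 0.8.
rng = random.Random(0)
residuals = [0.0]
for _ in range(499):
    residuals.append(0.8 * residuals[-1] + rng.gauss(0.0, 1.0))

iid_draw = draw_iid_noise(residuals, 2000, rng)
ar1_draw = draw_ar1_noise(residuals, 2000, rng)
```

Both draws share the same marginal scale, but only the AR(1) draw produces the runs of same-signed deviations that shift a holdout-window mean, which is what matters for the detection rate.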

Happy to take the lead on any of the above if we agree it's worth doing in this PR rather than as a follow-up. My instinct is that the first bullet (find a citation or be honest that there isn't one) is the minimum bar for landing, and the others can become a follow-up issue.

cetagostini · 2026-05-12T15:06:30Z

Hey, amazing work and digging here. I've had this idea many times before and kind of like it because it sounds intuitive, but I think this re-implements a capability that already exists in CausalPy.

Placebo-in-time was built with the goal of running a "power analysis" (Bayesian assurance; "power" is more of a frequentist term). That approach addresses many of the concerns you raise, and allows an output like this one: Check here ->

PS: The output shows detection probability as a continuous function of effect magnitude (the same thing as a power curve), but decomposed into three regions (Correct Detection, Misclassification, Non-Detection) rather than a single binary threshold, and it is derived from the placebo-calibrated null model, not iid noise injection. The plot is not available yet, but we only need a function to make it.

On the other hand, this power analysis only tells you whether the information holds. You don't need to re-run several times to get this: you can run the model over the most recent pre-period, get a posterior over different trajectories, estimate a CI, and then properly estimate what effect size would be greater than Z, either on average or cumulatively. You could simulate as many trajectories as you want and be creative here, but this only gives you "power" based on the given model uncertainty, and the loop adds no information.

Additionally, adding an injected effect has the flaws you already detected, increasing complexity without solving the previous point.

You are right to mention design prior approaches. This is well documented, and it is the complement of the null model that already comes from placebo estimation. You can say, "Based on prior knowledge, my expected cumulative effect (design prior) is N", then draw a full estimation based on it and see where it lands, which helps you estimate the curve shown above. The "curve" comes out by construction: it is the joint integration of detection probability against the design prior, weighted by plausibility. Bayesian Assurance (O'Hagan, 2005) gives you the operating characteristics by integration, so the "curve" emerges naturally without a brute-force simulation loop. CausalPy already implements this via PlaceboInTime's expected_effect_prior argument (#826).
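
The idea of assurance as an integral of detection probability against a design prior can be shown with a generic normal-approximation example. This sketch is not PlaceboInTime's implementation and does not use a placebo-calibrated null: it assumes a one-sided test on a normal estimate with known standard error `se` and a normal design prior, and all function names and numbers are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_at(effect, se, alpha=0.05):
    """One-sided detection probability for a normal estimate with standard error `se`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(effect / se - z_crit)


def assurance_mc(prior_mean, prior_sd, se, n_draws=20000, seed=0):
    """Average detection probability over effects drawn from a normal design prior."""
    rng = random.Random(seed)
    draws = (rng.gauss(prior_mean, prior_sd) for _ in range(n_draws))
    return sum(power_at(effect, se) for effect in draws) / n_draws


def assurance_closed_form(prior_mean, prior_sd, se, alpha=0.05):
    """Same integral in closed form: the estimate is marginally N(prior_mean, prior_sd^2 + se^2)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf((prior_mean - z_crit * se) / (prior_sd**2 + se**2) ** 0.5)


mc = assurance_mc(prior_mean=2.0, prior_sd=1.0, se=1.0)
exact = assurance_closed_form(prior_mean=2.0, prior_sd=1.0, se=1.0)
```

In this simplified setting the Monte Carlo average and the closed form agree up to sampling error, which is the sense in which the operating characteristic comes from integration rather than from re-fitting a model in a loop.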

You can loop over the different outcomes of this method to estimate the best combination of donors or other characteristics. Read here and check the references!

My take in short: it's great work, but it reimplements existing logic. After #826 I can make a PR with new plots only and a notebook to show this full pipeline and how these things are solved. Unless I'm missing something here, which is very probable.

drbenvincent added the OSS_PRODUCT label (OSS_PRODUCT project priorities. Labs members should get approval before logging hours.) Apr 2, 2026

drbenvincent marked this pull request as draft April 2, 2026 21:06

drbenvincent added 3 commits April 2, 2026 22:24

Add pipeline overview to sc_pymc notebook introduction

69fddb8

drbenvincent mentioned this pull request Apr 3, 2026

Efficient power curve estimation via sigmoid fitting #820

Open

5 tasks

drbenvincent added 5 commits April 3, 2026 11:54

Merge branch 'main' into feature/sc-design-workflow

c6a9b21

drbenvincent mentioned this pull request Apr 3, 2026

docs: ITS sensitivity walkthrough (#788) #821

Open

6 tasks

drbenvincent changed the title ~~Add geo-experiment design workflow for Synthetic Control~~ Add geo-experiment design workflow and sensitivity check plotting for Synthetic Control Apr 3, 2026

drbenvincent added 3 commits April 3, 2026 15:52

Re-run sc_pymc notebook to refresh all cell outputs

78bd554

Add pre-commit unsaved-notebook check reminder to AGENTS.md

ce959f6

drbenvincent mentioned this pull request Apr 30, 2026

Refresh pipeline_workflow.ipynb and report_demo.ipynb outputs after #819 lands #883

Open

5 tasks

drbenvincent added 2 commits April 30, 2026 15:42

Merge remote-tracking branch 'origin/main' into feature/sc-design-wor…

15fc403

drbenvincent mentioned this pull request Apr 30, 2026

Remove leftover per-document bibliography blocks in panel_fixed_effects.ipynb and sensitivity_checks.md #884

Open

6 tasks

drbenvincent added 2 commits April 30, 2026 16:03

This was referenced May 12, 2026

Pseudo-post mis-fit bias in SyntheticControl.validate_design / power_analysis #913

Open

Methodology review: iid Gaussian noise injection in SyntheticControl.power_analysis() #914

Open

drbenvincent mentioned this pull request May 12, 2026

Promote RMSPE-ratio view to a first-class PlaceboInSpace plot method #915

Open

6 tasks

drbenvincent wants to merge 17 commits into main from feature/sc-design-workflow
