Add ihdp_covariates.csv for IHDP Dataset Simulator #14

Open: Gitanaskhan26 wants to merge 5 commits into pgmpy:main from Gitanaskhan26:main

Gitanaskhan26 commented Jul 1, 2026

Summary

Adds ihdp_covariates.csv (747 rows × 26 columns: treatment + x1-x25) to the pgmpy/example_datasets HuggingFace repo. This is the fixed, real-covariate design matrix that IHDPDataset loads via _get_raw_data("ihdp_covariates.csv") on every instantiation.

Related issue: pgmpy/pgmpy#3420

Provenance

  • Source: ihdp_npci_1.csv from the CEVAE repository (github.com/AMLab-Amsterdam/CEVAE/tree/master/datasets/IHDP/csv), one of the standard NPCI-generated (Dorie, 2016) IHDP replications built on real covariates from the Infant Health and Development Program (Hill, 2011).
  • Only treatment and x1-x25 are kept. y_factual, y_cfactual, mu0, and mu1 are dropped: they are outcome columns specific to that one CEVAE replication. IHDPDataset regenerates outcomes itself from these fixed covariates, so shipping baked-in outcomes would be misleading and would tie the package to one arbitrary replication out of 1000.
  • Generation script: prep.py (included in this PR). It validates its own output before writing (shape, treated/control counts, and per-column value ranges), since this file becomes permanent shared infrastructure once uploaded.
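The extraction step described above can be sketched as follows. This is a minimal illustration, not the contents of prep.py. It assumes the CEVAE CSV has no header row and that its columns are ordered treatment, y_factual, y_cfactual, mu0, mu1, x1..x25; the function name `extract_covariates` is hypothetical, and a synthetic frame stands in for the real file.

```python
import numpy as np
import pandas as pd

OUTCOME_COLS = ["y_factual", "y_cfactual", "mu0", "mu1"]
COVARIATE_COLS = [f"x{i}" for i in range(1, 26)]
# Assumed column order of the headerless CEVAE file (not confirmed by this PR).
RAW_COLS = ["treatment"] + OUTCOME_COLS + COVARIATE_COLS


def extract_covariates(raw: pd.DataFrame) -> pd.DataFrame:
    """Name the raw columns, then keep only treatment and x1-x25."""
    if raw.shape[1] != len(RAW_COLS):
        raise ValueError(f"expected {len(RAW_COLS)} columns, got {raw.shape[1]}")
    named = raw.copy()
    named.columns = RAW_COLS
    return named[["treatment"] + COVARIATE_COLS]


# Synthetic stand-in for pd.read_csv("ihdp_npci_1.csv", header=None).
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.normal(size=(5, 30)))
covariates = extract_covariates(raw)
print(covariates.shape)  # (5, 26)
```

Selecting the kept columns by name, rather than dropping the outcome columns by position, makes the result fail loudly if the input layout ever differs from the assumed one.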

Validated properties (asserted by prep.py)

  • Shape: 747 × 26. 139 treated / 608 control — the standard post-selection-bias IHDP sample size used throughout the literature.
  • x1-x6 (birth weight, head circumference, weeks preterm, birth order, neonatal health index, mother's age) are continuous and already standardized (mean≈0, std=1) by the upstream NPCI/CEVAE pipeline. This extraction does not standardize them itself; it inherits that property.
  • x7-x25 are binary (0/1) site and demographic indicators, with one documented exception: x14 ("first", the firstborn indicator) is {1,2}-coded, not {0,1}. This is not a data error. EconML's own port carries an explicit comment on the equivalent adjustment for this same variable, so {1,2} is the literature-standard coding for it and has been left as-is.
  • Covariates are identical across all CEVAE replications (ihdp_npci_1.csv through _1000.csv) by construction, so this file is replication-agnostic: extracting from replication 1 is representative of all of them.
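The properties listed above could be checked along these lines. This is a hedged sketch rather than the actual assertions in prep.py: the function name `validate_covariates` and the 0.05 tolerance on mean and standard deviation are assumptions, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd


def validate_covariates(df: pd.DataFrame) -> None:
    """Raise AssertionError if df does not match the documented properties."""
    expected_cols = ["treatment"] + [f"x{i}" for i in range(1, 26)]
    assert list(df.columns) == expected_cols, "unexpected columns"
    assert df.shape == (747, 26), f"unexpected shape {df.shape}"

    n_treated = int((df["treatment"] == 1).sum())
    n_control = int((df["treatment"] == 0).sum())
    assert (n_treated, n_control) == (139, 608), "unexpected group sizes"

    # x1-x6: continuous and pre-standardized (tolerance is an assumption).
    for i in range(1, 7):
        col = df[f"x{i}"]
        assert abs(col.mean()) < 0.05, f"x{i} mean is not close to 0"
        assert abs(col.std() - 1.0) < 0.05, f"x{i} std is not close to 1"

    # x7-x25: binary, except x14, which is {1,2}-coded upstream.
    for i in range(7, 26):
        allowed = {1, 2} if i == 14 else {0, 1}
        values = set(df[f"x{i}"].unique())
        assert values <= allowed, f"x{i} has values outside {allowed}"


# Synthetic frame with the documented shape and codings.
rng = np.random.default_rng(0)
data = {"treatment": np.r_[np.ones(139, dtype=int), np.zeros(608, dtype=int)]}
for i in range(1, 7):
    x = rng.normal(size=747)
    data[f"x{i}"] = (x - x.mean()) / x.std(ddof=1)
for i in range(7, 26):
    data[f"x{i}"] = rng.integers(0, 2, size=747) + (1 if i == 14 else 0)
demo = pd.DataFrame(data)
validate_covariates(demo)  # passes silently
```

Treating x14 as a named exception in the check keeps the {1,2} coding from being "fixed" by accident in a later regeneration.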

Why ihdp_covariates.csv instead of the raw CEVAE CSV

IHDPDataset treats IHDP as a real simulator: covariates are fixed, outcomes are generated fresh per instantiation from a parameterized response surface. A file with baked-in y_factual/mu0/mu1 would suggest those are meant to be read directly rather than regenerated.
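To make the "fixed covariates, fresh outcomes" idea concrete, here is a minimal sketch. It is not IHDPDataset's implementation: the function name `simulate_outcomes`, the linear response surface, the constant effect of 4, the coefficient distribution, and the unit noise are all illustrative assumptions, loosely in the spirit of the linear setting in Hill (2011). Synthetic covariates stand in for the real design matrix.

```python
import numpy as np


def simulate_outcomes(X, treatment, seed=None, effect=4.0):
    """Draw one set of outcomes from a linear surface over fixed covariates."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Illustrative sparse coefficients; values and probabilities are assumptions.
    beta = rng.choice([0.0, 1.0, 2.0, 3.0, 4.0], size=p,
                      p=[0.5, 0.2, 0.15, 0.1, 0.05])
    mu0 = X @ beta          # expected outcome under control
    mu1 = mu0 + effect      # expected outcome under treatment
    y0 = mu0 + rng.normal(size=n)
    y1 = mu1 + rng.normal(size=n)
    y_factual = np.where(treatment == 1, y1, y0)
    return {"y_factual": y_factual, "mu0": mu0, "mu1": mu1}


# Fixed synthetic covariates stand in for the 747 x 25 design matrix.
fixed_rng = np.random.default_rng(0)
X = fixed_rng.normal(size=(747, 25))
treatment = np.r_[np.ones(139, dtype=int), np.zeros(608, dtype=int)]

run_a = simulate_outcomes(X, treatment, seed=1)
run_b = simulate_outcomes(X, treatment, seed=2)
```

Here `X` and `treatment` never change between calls, while `run_a` and `run_b` carry different outcomes. A CSV with a single baked-in y_factual column cannot express that separation.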

What x1-x25 actually are

Column names stay x1-x25 (matching every paper/package that reports IHDP numbers), but here's what each one is, for anyone browsing this dataset who wants to know. Verified by exact value reconstruction against EconML's independently-maintained raw covariate file — not assumed from documentation (see prep.py's COVARIATE_INFO for the full verification method).

| Column | Name | Meaning |
|---|---|---|
| x1-x6 | bw, b.head, preterm, birth.o, nnhealth, momage | Birth weight, head circumference, weeks preterm, birth order, neonatal health index, mother's age (continuous) |
| x7-x9 | sex, twin, b.marr | Infant sex, twin birth, mother married (binary) |
| x10-x12 | mom.lths, mom.hs, mom.scoll | Mother's education level: <high school / high school / some college (binary dummies) |
| x13, x15-x18 | cig, booze, drugs, work.dur, prenatal | Smoked / drank / used drugs during pregnancy, worked during pregnancy, received prenatal care (binary) |
| x14 | first | Firstborn; coded {1,2}, not {0,1} (genuine upstream convention, not an error) |
| x19-x25 | site1-site7 | Trial site indicator, 7 sites (binary) |
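For readers who prefer descriptive names while exploring the data, the table above can be turned into a rename mapping. This is a sketch under assumptions: the shipped file keeps x1-x25, the helper names are hypothetical, and the within-group order of names is assumed to follow the order in which the table lists them. The shift helper only moves x14 from {1,2} to {0,1}; it does not assert which level means "firstborn".

```python
import pandas as pd

# Names in column order x1..x25, read off the table (order is an assumption).
DESCRIPTIVE_NAMES = [
    "bw", "b.head", "preterm", "birth.o", "nnhealth", "momage",   # x1-x6
    "sex", "twin", "b.marr",                                       # x7-x9
    "mom.lths", "mom.hs", "mom.scoll",                             # x10-x12
    "cig", "first", "booze", "drugs", "work.dur", "prenatal",      # x13-x18
    "site1", "site2", "site3", "site4", "site5", "site6", "site7", # x19-x25
]
X_TO_NAME = {f"x{i}": name for i, name in enumerate(DESCRIPTIVE_NAMES, start=1)}


def with_descriptive_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with x1-x25 renamed; other columns are left untouched."""
    return df.rename(columns=X_TO_NAME)


def shift_first_to_zero_one(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the {1,2}-coded 'first' column shifted to {0,1}."""
    out = df.copy()
    out["first"] = out["first"] - 1
    return out


# Tiny synthetic frame to exercise the helpers.
demo = pd.DataFrame({f"x{i}": [0, 1] for i in range(1, 26)})
demo["x14"] = [1, 2]
renamed = with_descriptive_names(demo)
shifted = shift_first_to_zero_one(renamed)
```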

Checklist

  • File uploaded to pgmpy/example_datasets on HuggingFace
  • Confirmed accessible via _get_raw_data("ihdp_covariates.csv")
  • prep.py included in this PR for reproducibility
  • Byte-identical regeneration confirmed (re-running prep.py
    against ihdp_npci_1.csv reproduces this file exactly — verified
    via checksum before upload)
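The byte-identical check in the last item can be reproduced with a checksum comparison such as the one below. This is a sketch: the PR does not say which hash was used, so SHA-256 is an assumption, as are the helper names and the temporary file names in the demo.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(path_a, path_b):
    """True when both files hash to the same digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with two temporary files standing in for the uploaded and regenerated CSVs.
with tempfile.TemporaryDirectory() as tmp:
    uploaded = Path(tmp) / "uploaded.csv"
    regenerated = Path(tmp) / "regenerated.csv"
    uploaded.write_bytes(b"treatment,x1\n1,0.5\n")
    regenerated.write_bytes(b"treatment,x1\n1,0.5\n")
    same = is_byte_identical(uploaded, regenerated)
```

Comparing raw bytes rather than parsed frames also catches differences in float formatting or line endings that a DataFrame comparison would hide.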

- ihdp_covariates.csv: 747 rows x 26 cols (treatment + x1-x25)
  Extracted from CEVAE/NPCI replication, outcomes dropped.
  139 treated / 608 control, x1-x6 pre-standardized, x7-x25 binary (x14 coded {1,2}).
- ihdp_npci_1.csv: Source file for provenance
- prep.py: Extraction script with validation assertions

@Gitanaskhan26 (Author)

@ankurankan, can you please review this?
