Add `ihdp_covariates.csv` for IHDP Dataset Simulator by Gitanaskhan26 · Pull Request #14 · pgmpy/example_datasets

Gitanaskhan26 · 2026-07-01T20:36:02Z

Summary

Adds ihdp_covariates.csv (747 rows × 26 columns: treatment + x1-x25) to the pgmpy/example_datasets HuggingFace repo. This is the fixed, real-covariate design matrix that IHDPDataset loads via _get_raw_data("ihdp_covariates.csv") on every instantiation.

Related issue: pgmpy/pgmpy#3420

Provenance

- Source: `ihdp_npci_1.csv` from the CEVAE repository (github.com/AMLab-Amsterdam/CEVAE/tree/master/datasets/IHDP/csv), one of the standard NPCI-generated (Dorie, 2016) IHDP replications, built on real covariates from the Infant Health and Development Program (Hill, 2011).
- Only `treatment` and `x1`-`x25` are kept. `y_factual`, `y_cfactual`, `mu0`, and `mu1` are dropped because they are outcome columns specific to that one CEVAE replication. IHDPDataset regenerates outcomes itself from these fixed covariates, so shipping baked-in outcomes would be misleading and would tie the package to one arbitrary replication out of 1000.
- Generation script: `prep.py` (included in this PR). It validates its own output before writing (shape, treated/control counts, and per-column value ranges), since this file becomes permanent shared infrastructure once uploaded.
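
For readers who want a feel for the extraction step, here is a minimal sketch using only the standard library. It is not the `prep.py` from this PR. It assumes the CEVAE source file is headerless with columns ordered `treatment, y_factual, y_cfactual, mu0, mu1, x1, ..., x25`; that ordering, and the names `SOURCE_COLUMNS`, `KEPT_COLUMNS`, and `extract_covariates`, are assumptions made for illustration.

```python
import csv
import io

# ASSUMPTION: headerless source with this column order (not stated in the PR).
SOURCE_COLUMNS = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1"] + [
    f"x{i}" for i in range(1, 26)
]
# Columns that survive the extraction: treatment plus the 25 covariates.
KEPT_COLUMNS = ["treatment"] + [f"x{i}" for i in range(1, 26)]


def extract_covariates(source_text: str) -> str:
    """Return CSV text with a header row and only treatment + x1-x25."""
    keep_idx = [SOURCE_COLUMNS.index(name) for name in KEPT_COLUMNS]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(KEPT_COLUMNS)
    for row in csv.reader(io.StringIO(source_text)):
        if not row:  # skip blank lines
            continue
        if len(row) != len(SOURCE_COLUMNS):
            raise ValueError(
                f"expected {len(SOURCE_COLUMNS)} fields, got {len(row)}"
            )
        # Copy fields as text so numeric formatting is left untouched.
        writer.writerow([row[i] for i in keep_idx])
    return out.getvalue()
```

Copying fields as strings rather than parsing them as floats avoids any re-formatting of the values, which matters if the output is later compared by checksum.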

Validated properties (asserted by `prep.py`)

- Shape: 747 × 26, with 139 treated and 608 control units. This is the standard post-selection-bias IHDP sample size used throughout the literature.
- x1-x6 (birth weight, head circumference, weeks preterm, birth order, neonatal health index, mother's age) are continuous and already standardized (mean ≈ 0, std = 1) by the upstream NPCI/CEVAE pipeline. This extraction does not standardize them itself; it inherits that property.
- x7-x25 are binary (0/1) site and demographic indicators, with one documented exception: x14 (`first`, the firstborn indicator) is coded {1,2}, not {0,1}. This is not a data error. EconML's own port carries an explicit comment making the equivalent adjustment for this same variable, so {1,2} is taken to be the literature-standard coding and has been left as-is.
- Covariates are identical across all CEVAE replications (`ihdp_npci_1.csv` through `ihdp_npci_1000.csv`) by construction, so this file is replication-agnostic: extracting from replication 1 is representative of all of them.
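
The properties listed above could be checked along the following lines. This is a sketch, not the assertions in `prep.py`. The function name `validate`, the column-oriented input format, and the tolerance `tol` are assumptions, as is the use of the population standard deviation (the PR does not say which estimator the upstream pipeline used).

```python
import statistics

N_ROWS, N_TREATED = 747, 139
CONTINUOUS = [f"x{i}" for i in range(1, 7)]
# x14 is excluded because it is coded {1,2} rather than {0,1}.
BINARY = [f"x{i}" for i in range(7, 26) if i != 14]


def validate(columns: dict, tol: float = 0.05) -> None:
    """Raise AssertionError if the table violates the documented properties.

    `columns` maps each column name to its list of numeric values.
    """
    expected = ["treatment"] + [f"x{i}" for i in range(1, 26)]
    assert list(columns) == expected, "unexpected column names or order"
    assert all(len(v) == N_ROWS for v in columns.values()), "wrong row count"

    assert set(columns["treatment"]) <= {0, 1}, "treatment must be 0/1"
    treated = sum(1 for t in columns["treatment"] if t == 1)
    assert treated == N_TREATED, f"expected {N_TREATED} treated, got {treated}"

    for name in CONTINUOUS:
        # ASSUMPTION: loose tolerance and population std; adjust as needed.
        assert abs(statistics.fmean(columns[name])) < tol, f"{name}: mean not ~0"
        assert abs(statistics.pstdev(columns[name]) - 1.0) < tol, f"{name}: std not ~1"
    for name in BINARY:
        assert set(columns[name]) <= {0, 1}, f"{name}: not binary"
    assert set(columns["x14"]) <= {1, 2}, "x14: expected {1,2} coding"
```

The control count (608) is implied by the row count and the treated count, so it needs no separate check.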

Why a separate `ihdp_covariates.csv` instead of the raw CEVAE CSV

IHDPDataset treats IHDP as a real simulator: covariates are fixed, outcomes are generated fresh per instantiation from a parameterized response surface. A file with baked-in y_factual/mu0/mu1 would suggest those are meant to be read directly rather than regenerated.
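
To make the "fixed covariates, fresh outcomes" idea concrete, here is a toy illustration. It is not the IHDPDataset implementation: the class name `ToySimulator`, the linear response surface, the coefficient choices, and the constant effect of 4.0 are all hypothetical values picked for the sketch.

```python
import random


class ToySimulator:
    """Hypothetical stand-in; NOT the IHDPDataset implementation."""

    def __init__(self, covariates, treatment, seed, effect=4.0):
        rng = random.Random(seed)
        n_features = len(covariates[0])
        # Response-surface parameters are drawn fresh for each instance.
        beta = [rng.choice([0.0, 0.5, 1.0]) for _ in range(n_features)]
        self.covariates = covariates  # fixed and shared across instances
        self.treatment = treatment
        # Noise-free potential outcomes under control and treatment.
        self.mu0 = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
        self.mu1 = [m + effect for m in self.mu0]
        # Observed (factual) outcome: the arm actually received, plus noise.
        self.y = [
            (m1 if t == 1 else m0) + rng.gauss(0.0, 1.0)
            for m0, m1, t in zip(self.mu0, self.mu1, treatment)
        ]


X = [[0.1, 1.0, 0.0], [-0.3, 0.0, 1.0], [1.2, 1.0, 1.0]]
T = [1, 0, 0]
first = ToySimulator(X, T, seed=0)
second = ToySimulator(X, T, seed=1)
# Same covariates object, different simulated outcomes.
print(first.covariates is second.covariates, first.y != second.y)
```

Because the outcomes depend on the seed while the covariates do not, a file that also carried one replication's `y_factual`/`mu0`/`mu1` would sit awkwardly next to this design.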

What x1-x25 actually are

Column names stay x1-x25 (matching every paper and package that reports IHDP numbers), but the table below gives the meaning of each one for anyone browsing this dataset. The mapping was verified by exact value reconstruction against EconML's independently maintained raw covariate file, not assumed from documentation (see `COVARIATE_INFO` in `prep.py` for the full verification method).

| Column | Name | Meaning |
| --- | --- | --- |
| x1-x6 | `bw`, `b.head`, `preterm`, `birth.o`, `nnhealth`, `momage` | Birth weight, head circumference, weeks preterm, birth order, neonatal health index, mother's age (continuous) |
| x7-x9 | `sex`, `twin`, `b.marr` | Infant sex, twin birth, mother married (binary) |
| x10-x12 | `mom.lths`, `mom.hs`, `mom.scoll` | Mother's education level: <high school / high school / some college (binary dummies) |
| x13, x15-x18 | `cig`, `booze`, `drugs`, `work.dur`, `prenatal` | Smoked / drank / used drugs during pregnancy, worked during pregnancy, received prenatal care (binary) |
| x14 | `first` | Firstborn, coded `{1,2}`, not `{0,1}` (genuine upstream convention, not an error) |
| x19-x25 | `site1`-`site7` | Trial site indicator, 7 sites (binary) |
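
The table above can also be written as a lookup for anyone who prefers descriptive names while exploring the file. This is a convenience sketch; `COLUMN_NAMES` and `describe` are hypothetical names, and the `COVARIATE_INFO` structure in `prep.py` may look different.

```python
# Hypothetical lookup transcribed from the table; x13 and x15-x18 share a
# table row, so note that x14 ("first") sits between "cig" and "booze".
COLUMN_NAMES = dict(
    zip(
        [f"x{i}" for i in range(1, 26)],
        ["bw", "b.head", "preterm", "birth.o", "nnhealth", "momage"]  # x1-x6
        + ["sex", "twin", "b.marr"]  # x7-x9
        + ["mom.lths", "mom.hs", "mom.scoll"]  # x10-x12
        + ["cig", "first", "booze", "drugs", "work.dur", "prenatal"]  # x13-x18
        + [f"site{i}" for i in range(1, 8)],  # x19-x25
    )
)


def describe(header):
    """Map x-style column names to descriptive ones; leave others unchanged."""
    return [COLUMN_NAMES.get(name, name) for name in header]
```

The shipped file keeps the x1-x25 names, so a mapping like this would only be applied on the reader's side.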

Checklist

- File uploaded to pgmpy/example_datasets on HuggingFace
- Confirmed accessible via `_get_raw_data("ihdp_covariates.csv")`
- `prep.py` included in this PR for reproducibility
- Byte-identical regeneration confirmed (re-running `prep.py` against `ihdp_npci_1.csv` reproduces this file exactly; verified via checksum before upload)
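
A byte-identical check of this kind can be done with a file digest. The sketch below uses SHA-256, which is an assumption: the PR does not say which checksum was used. The helper names `sha256_of` and `is_byte_identical` are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_byte_identical(path_a, path_b) -> bool:
    """True when both files hash to the same digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Hashing raw bytes rather than parsed values means that differences in line endings or float formatting are caught too.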

- ihdp_covariates.csv: 747 rows x 26 cols (treatment + x1-x25). Extracted from CEVAE/NPCI replication, outcomes dropped. 139 treated / 608 control, x1-x6 pre-standardized, x7-x25 binary.
- ihdp_npci_1.csv: Source file for provenance.
- prep.py: Extraction script with validation assertions.

Gitanaskhan26 · 2026-07-01T21:39:10Z

@ankurankan, can you please review this?

Gitanaskhan26 added 5 commits July 1, 2026 17:59

- Create Idhp.md (462e50e)
- Create prep.py (9daf017)
- Delete prep.py (7c7d034)
- Delete Idhp.md (ddc7c90)

Gitanaskhan26 mentioned this pull request Jul 1, 2026

[ENH] Add IHDP simulator dataset pgmpy/pgmpy#3420

Open

1 task

Gitanaskhan26 wants to merge 5 commits into pgmpy:main from Gitanaskhan26:main
