Add ihdp_covariates.csv for IHDP Dataset Simulator #14
Open
Gitanaskhan26 wants to merge 5 commits into
- ihdp_covariates.csv: 747 rows x 26 cols (treatment + x1-x25). Extracted from CEVAE/NPCI replication, outcomes dropped. 139 treated / 608 control, x1-x6 pre-standardized, x7-x25 binary.
- ihdp_npci_1.csv: Source file for provenance
- prep.py: Extraction script with validation assertions
Author
@ankurankan, can you please review this?
Summary

Adds `ihdp_covariates.csv` (747 rows × 26 columns: `treatment` + `x1`-`x25`) to the `pgmpy/example_datasets` HuggingFace repo. This is the fixed, real-covariate design matrix that `IHDPDataset` loads via `_get_raw_data("ihdp_covariates.csv")` on every instantiation.

Related Issue: pgmpy/pgmpy#3420
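
The idea of a fixed design matrix with outcomes regenerated on each instantiation can be illustrated with a toy sketch. Everything in it is hypothetical: `simulate_outcomes`, the linear response surface, the constant effect `tau`, and the three made-up covariate rows are stand-ins, not `IHDPDataset`'s actual response surface or data.

```python
import random


def simulate_outcomes(covariates, treatment, seed):
    """Toy response surface (an assumption for illustration only):
    y = x . beta + tau * t + noise, with beta redrawn for each seed."""
    rng = random.Random(seed)
    n_features = len(covariates[0])
    beta = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    tau = 4.0  # arbitrary constant treatment effect for this sketch
    outcomes = []
    for x, t in zip(covariates, treatment):
        mean = sum(b * xi for b, xi in zip(beta, x)) + tau * t
        outcomes.append(mean + rng.gauss(0.0, 1.0))
    return outcomes


# Made-up covariates: these stay fixed across every simulated replication.
covariates = [[0.5, 1.0], [-1.2, 0.0], [0.3, 1.0]]
treatment = [1, 0, 0]

# Two "instantiations": same covariates, different outcomes.
y_a = simulate_outcomes(covariates, treatment, seed=0)
y_b = simulate_outcomes(covariates, treatment, seed=1)
```

Under this arrangement only the covariates need to be stored; a file that also carried one replication's outcomes would add nothing the simulator reads.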
Provenance

- Extracted from `ihdp_npci_1.csv` in the CEVAE repository (github.com/AMLab-Amsterdam/CEVAE/tree/master/datasets/IHDP/csv), one of the standard NPCI-generated (Dorie, 2016) IHDP replications built on real covariates from the Infant Health and Development Program (Hill, 2011).
- Only `treatment` and `x1`-`x25` are kept. `y_factual`, `y_cfactual`, `mu0`, and `mu1` are dropped: they are outcome columns specific to that one CEVAE replication. `IHDPDataset` regenerates outcomes itself from these fixed covariates, so shipping baked-in outcomes would be misleading and would tie the package to one arbitrary replication out of 1000.
- Extraction is done by `prep.py` (included in this PR), which validates its own output before writing (shape, treated/control counts, and per-column value ranges), since this file becomes permanent shared infrastructure once uploaded.

Validated properties (asserted by `prep.py`)

- `x1`-`x6` (birth weight, head circumference, weeks preterm, birth order, neonatal health index, mother's age) are continuous and already standardized (mean≈0, std=1) as part of the upstream NPCI/CEVAE pipeline. This extraction doesn't standardize them itself; it inherits that property.
- `x7`-`x25` are binary (0/1) site and demographic indicators, with one documented exception: `x14` ("first", the firstborn indicator) is `{1,2}`-coded, not `{0,1}`. This isn't a data error: EconML's own port carries an explicit comment doing the equivalent adjustment for this same variable, so `{1,2}` is the literature-standard coding for it and has been left as-is.
- The covariates are identical across all 1000 replications (`ihdp_npci_1.csv` through `_1000.csv`) by construction, so this file is replication-agnostic; extracting from replication 1 is representative of all of them.

Why an `ihdp_covariates.csv` file instead of the raw CEVAE CSV

`IHDPDataset` treats IHDP as a real simulator: covariates are fixed, and outcomes are generated fresh per instantiation from a parameterized response surface. A file with baked-in `y_factual`/`mu0`/`mu1` would suggest those are meant to be read directly rather than regenerated.

What `x1`-`x25` actually are

Column names stay `x1`-`x25` (matching every paper/package that reports IHDP numbers), but here's what each one is, for anyone browsing this dataset who wants to know. The mapping was verified by exact value reconstruction against EconML's independently-maintained raw covariate file, not assumed from documentation (see `prep.py`'s `COVARIATE_INFO` for the full verification method).

- `bw`, `b.head`, `preterm`, `birth.o`, `nnhealth`, `momage` (`x1`-`x6`)
- `sex`, `twin`, `b.marr`
- `mom.lths`, `mom.hs`, `mom.scoll`
- `cig`, `booze`, `drugs`, `work.dur`, `prenatal`
- `first` (`x14`): `{1,2}`, not `{0,1}` (genuine upstream convention, not an error)
- `site1`-`site7`

Checklist

- `pgmpy/example_datasets` on HuggingFace
- `_get_raw_data("ihdp_covariates.csv")`
- `prep.py` included in this PR for reproducibility
- Running `prep.py` against `ihdp_npci_1.csv` reproduces this file exactly (verified via checksum before upload)
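
For readers who want a feel for what an extract-validate-checksum step of this kind can look like, here is a minimal stdlib-only sketch. It is not the `prep.py` in this PR. The raw column order in `RAW_COLUMNS` (treatment, four outcome columns, then `x1`-`x25`, no header row) is an assumption, and the function names `extract_covariates`, `validate`, and `sha256_of` are hypothetical.

```python
import csv
import hashlib
import io

# ASSUMPTION: headerless raw rows ordered as treatment, four outcome columns,
# then x1..x25. This PR does not state the raw column order.
RAW_COLUMNS = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1"] + [
    f"x{i}" for i in range(1, 26)
]
OUTCOME_COLUMNS = {"y_factual", "y_cfactual", "mu0", "mu1"}
KEPT_COLUMNS = [c for c in RAW_COLUMNS if c not in OUTCOME_COLUMNS]


def extract_covariates(raw_rows):
    """Drop the outcome columns, keeping treatment + x1..x25 as floats."""
    keep_idx = [i for i, name in enumerate(RAW_COLUMNS) if name not in OUTCOME_COLUMNS]
    out = []
    for row in raw_rows:
        if len(row) != len(RAW_COLUMNS):
            raise ValueError(f"expected {len(RAW_COLUMNS)} fields, got {len(row)}")
        out.append([float(row[i]) for i in keep_idx])
    return out


def validate(rows, n_rows=None, n_treated=None):
    """Assert shape, treated count, and per-column value sets."""
    assert all(len(r) == 26 for r in rows), "every row needs 26 columns"
    if n_rows is not None:
        assert len(rows) == n_rows, "unexpected row count"
    if n_treated is not None:
        assert sum(1 for r in rows if r[0] == 1.0) == n_treated, "unexpected treated count"
    for r in rows:
        assert r[0] in (0.0, 1.0), "treatment must be 0/1"
        # Index 0 is treatment, so index j holds column xj.
        for j in range(7, 26):
            allowed = (1.0, 2.0) if j == 14 else (0.0, 1.0)  # x14 is {1,2}-coded
            assert r[j] in allowed, f"x{j} has unexpected value {r[j]}"


def sha256_of(rows):
    """Checksum of a deterministic CSV serialization (header + rows)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(KEPT_COLUMNS)
    writer.writerows(rows)
    return hashlib.sha256(buf.getvalue().encode("utf-8")).hexdigest()
```

With the real file, a call such as `validate(rows, n_rows=747, n_treated=139)` would check the counts quoted in this PR, and comparing `sha256_of(rows)` across two runs is one way to confirm that re-running the extraction reproduces the same bytes.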