Harden validation and preprocessing edge cases#270
Merged
Conversation
The hardening pass replaced complete.cases() with complete_finite_cases()
in both preprocessing paths, which dropped any row with a non-finite value
in a numeric column. gname == Inf is a documented never-treated code
("group status 0 or Inf"), so that filter silently deleted every
never-treated unit: under control_group = "notyettreated" it warned and
dropped them, and under control_group = "nevertreated" it returned
plausible-looking but wrong ATTs with no error.
Add a finite_exclude argument to complete_finite_cases() and pass gname in
both pre_process_did() and pre_process_did2(). Excluded columns still get
the NA/NaN check via complete.cases(); only legitimate Inf is preserved.
Restores parity with master, where gname = Inf is bit-identical to gname = 0.
Add regression tests: gname = Inf equals gname = 0 (ATT and influence
functions) across both code paths and both control groups, plus a
complete_finite_cases() unit test confirming NA/NaN gname is still dropped.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR performs a deeper defensive hardening pass across the package's public estimation, aggregation, bootstrap, simulation, preprocessing, plotting, and exported helper surfaces. The goal is to make malformed inputs fail early with package-level diagnostics, prevent non-finite data from leaking into estimator internals, and lock down slow/fast path parity under adversarial inputs.
The release version remains
2.5.1.Root Cause
Several user-facing controls and helper entry points were only partially validated before being used in
if (...)branches, formula/model-frame evaluation, bootstrap setup, aggregation logic, data.table indexing, or low-level matrix operations. That allowed malformed inputs to produce raw errors such asthe condition has length > 1,missing value where TRUE/FALSE needed, ambiguous subsetting failures, silent vector recycling, or nonsensical bootstrap results.A second class of failures came from rows whose raw data or evaluated formula terms were non-finite. These rows could previously reach overlap/rank checks or estimator code, producing avoidable NA cells or lower-level errors. Both preprocessing paths now apply the same missing/non-finite filtering discipline before estimation.
Changes
att_gt(),pre_process_did(), andpre_process_did2()against malformed formulas, malformed column-name arguments, missinggnamerows, non-finite outcomes, non-finite weights, and non-finite evaluated covariates.gname = Infnever-treated units in the finite-data filter.complete_finite_cases()gained afinite_excludeargument, and bothpre_process_did()andpre_process_did2()passgnameto it.Infis a documented never-treated code ("group status 0 or Inf"); the new finite filter would otherwise have silently dropped every never-treated unit — warning undercontrol_group = "notyettreated"and, worse, returning plausible-looking but wrong ATTs with no error undercontrol_group = "nevertreated". Excluded columns still get theNA/NaNcheck viacomplete.cases(); only legitimateInfis preserved, restoring bit-identical parity withgname = 0.trimmer(),process_attgt(),mboot(), andtest.mboot()now reject malformed direct inputs before raw indexing, recycling, or matrix errors.build_sim_dataset()so modified simulation parameter lists cannot silently recycle wrong-length vectors or propagateNA/Infsimulation parameters.aggte()object validation,balance_evalidation, and corrupted influence-function dimension checks.att_gt()does not modify caller-owneddata.frameordata.tableinputs in either implementation.gname = Infproduces ATT and influence functions identical togname = 0across both code paths and both control groups, plus acomplete_finite_cases()unit test confirmingNA/NaNgnameis still dropped whileInfis preserved.Validation
Rscript -e "devtools::test(filter='error-handling')": 0 fail, 0 warn, 0 skip, 276 passRscript -e "devtools::test(filter='pretest-vectorization|conditional-did-pretest')": 0 fail, 0 warn, 0 skip, 43 passRscript -e "devtools::test(filter='mboot|cluster|inference|pretest-vectorization')": 0 fail, 0 warn, 7 skip, 137 passRscript -e "devtools::test(filter='faster-mode-consistency|modelmatrix-hoist|robustness-guards|aggte|att_gt|edge-cases|slowpath-precompute|compute-inffunc|mutation-safety')": 0 fail, 12 expected warnings, 0 skip, 1050 passRscript -e "devtools::test()": 0 fail, 0 warn, 8 skip, 1784 passRscript -e "devtools::check(document = FALSE, args = c('--no-manual'), error_on = 'never')": 0 errors, 0 warnings, 0 notesDESCRIPTION:Version: 2.5.1