Harden validation and preprocessing edge cases #270

Merged
pedrohcgs merged 4 commits into master from release/scalar-argument-validation
Jun 19, 2026

@pedrohcgs pedrohcgs commented Jun 19, 2026

Collaborator

Summary

This PR performs a deeper defensive hardening pass across the package's public estimation, aggregation, bootstrap, simulation, preprocessing, plotting, and exported helper surfaces. The goal is to make malformed inputs fail early with package-level diagnostics, prevent non-finite data from leaking into estimator internals, and lock down slow/fast path parity under adversarial inputs.

The release version remains 2.5.1.

Root Cause

Several user-facing controls and helper entry points were only partially validated before being used in if (...) branches, formula/model-frame evaluation, bootstrap setup, aggregation logic, data.table indexing, or low-level matrix operations. As a result, malformed inputs could produce raw R errors such as "the condition has length > 1" or "missing value where TRUE/FALSE needed", ambiguous subsetting failures, silent vector recycling, or nonsensical bootstrap results.
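To make the failure mode concrete, here is a minimal sketch of the difference between an unvalidated branch and an early check. The helper check_scalar_flag() and its message text are hypothetical and only illustrate the pattern; they are not the validators added in this PR.

```r
# Hypothetical helper, for illustration only (not the package's validator).
check_scalar_flag <- function(x, arg) {
  if (!is.logical(x) || length(x) != 1L || is.na(x)) {
    stop(sprintf("`%s` must be a single TRUE or FALSE.", arg), call. = FALSE)
  }
  invisible(x)
}

# An unvalidated branch surfaces a low-level R error.
raw_msg <- tryCatch(
  if (NA) "yes" else "no",
  error = function(e) conditionMessage(e)
)

# The same kind of bad input, caught early, names the offending argument.
checked_msg <- tryCatch(
  check_scalar_flag(c(TRUE, FALSE), "panel"),
  error = function(e) conditionMessage(e)
)

print(raw_msg)
print(checked_msg)
```

Checking type, length, and missingness in one place is what lets the later if (...) branch assume a clean scalar.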

A second class of failures came from rows whose raw data or evaluated formula terms were non-finite. These rows could previously reach overlap/rank checks or estimator code, producing avoidable NA cells or lower-level errors. Both preprocessing paths now apply the same missing/non-finite filtering discipline before estimation.
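The distinction between raw data and evaluated formula terms matters because a transformation can create non-finite values from finite inputs. The following sketch uses base R only, with a made-up toy data frame; it shows one way such rows could be detected and is not the package's actual filtering code.

```r
# Toy data: every raw value is finite.
toy <- data.frame(x = c(1, 0, -1, 4))

# Evaluate the formula terms while keeping all rows (na.pass),
# so that non-finite results stay visible instead of being dropped.
mf <- suppressWarnings(
  stats::model.frame(~ log(x), data = toy, na.action = stats::na.pass)
)
X <- stats::model.matrix(attr(mf, "terms"), data = mf)

# log(0) is -Inf and log(-1) is NaN, so rows 2 and 3 are non-finite
# only after evaluation.
keep <- rowSums(!is.finite(X)) == 0
print(unname(keep))
```

A filter that inspects only the raw columns would keep all four rows here and pass -Inf and NaN on to the estimator.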

Changes

  • Added shared validators for formulas, scalar column names, column-name vectors, finite numeric scalars/vectors, probability scalars, and non-negative whole-number controls.
  • Hardened att_gt(), pre_process_did(), and pre_process_did2() against malformed formulas, malformed column-name arguments, missing gname rows, non-finite outcomes, non-finite weights, and non-finite evaluated covariates.
  • Made slow and fast preprocessing agree on missing/non-finite row filtering and warning text.
  • Preserved gname = Inf never-treated units in the finite-data filter. complete_finite_cases() gained a finite_exclude argument, and both pre_process_did() and pre_process_did2() pass gname to it. Inf is a documented never-treated code ("group status 0 or Inf"), so without this exemption the new finite filter would have dropped every never-treated unit: with a warning under control_group = "notyettreated" and, worse, with no error and plausible-looking but wrong ATTs under control_group = "nevertreated". Excluded columns still get the NA/NaN check via complete.cases(); only legitimate Inf is preserved, which restores bit-identical parity with gname = 0.
  • Hardened exported helpers: trimmer(), process_attgt(), mboot(), and test.mboot() now reject malformed direct inputs before raw indexing, recycling, or matrix errors.
  • Hardened build_sim_dataset() so modified simulation parameter lists cannot silently recycle wrong-length vectors or propagate NA/Inf simulation parameters.
  • Hardened aggte() object validation, balance_e validation, and corrupted influence-function dimension checks.
  • Added a mutation-safety regression to ensure att_gt() does not modify caller-owned data.frame or data.table inputs in either implementation.
  • Added model-matrix regressions for matrix-valued transformed formula terms and non-finite evaluated terms.
  • Updated robustness tests so transformed non-finite covariates are expected to be filtered before 2x2 estimator internals.
  • Added regression tests that gname = Inf produces ATT and influence functions identical to gname = 0 across both code paths and both control groups, plus a complete_finite_cases() unit test confirming NA/NaN gname is still dropped while Inf is preserved.
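The gname = Inf exemption described in the list above can be sketched as follows. The function finite_rows_sketch() is an illustrative stand-in: its name, signature, and body are assumptions, and the real complete_finite_cases() may differ in every detail except the behavior described in this PR.

```r
# Illustrative stand-in for the idea behind complete_finite_cases();
# name, signature, and body are assumptions, not package code.
finite_rows_sketch <- function(data, finite_exclude = character()) {
  keep <- stats::complete.cases(data)  # drops NA and NaN in every column
  for (col in setdiff(names(data), finite_exclude)) {
    x <- data[[col]]
    if (is.numeric(x)) {
      keep <- keep & is.finite(x)      # additionally drops Inf and -Inf
    }
  }
  keep
}

# Toy data: G plays the role of the gname column.
toy <- data.frame(
  y = c(1, 2, Inf, 4, 5),
  G = c(Inf, 2004, 2004, NA, NaN)
)

keep_strict <- finite_rows_sketch(toy)                        # Inf in G dropped
keep_gname  <- finite_rows_sketch(toy, finite_exclude = "G")  # Inf in G kept

print(keep_strict)
print(keep_gname)
```

In this sketch the excluded column skips only the is.finite() step, so a never-treated row coded as Inf survives while NA and NaN group values are still removed.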

Validation

  • Rscript -e "devtools::test(filter='error-handling')": 0 fail, 0 warn, 0 skip, 276 pass
  • Rscript -e "devtools::test(filter='pretest-vectorization|conditional-did-pretest')": 0 fail, 0 warn, 0 skip, 43 pass
  • Rscript -e "devtools::test(filter='mboot|cluster|inference|pretest-vectorization')": 0 fail, 0 warn, 7 skip, 137 pass
  • Rscript -e "devtools::test(filter='faster-mode-consistency|modelmatrix-hoist|robustness-guards|aggte|att_gt|edge-cases|slowpath-precompute|compute-inffunc|mutation-safety')": 0 fail, 12 expected warnings, 0 skip, 1050 pass
  • Randomized fast/slow parity stress with formulas, weights, panel/repeated cross sections, control groups, base periods, and missing/non-finite contamination: 97 successful slow/fast comparisons, 43 rejected by both paths, 0 one-sided errors, 0 mismatches
  • Rscript -e "devtools::test()": 0 fail, 0 warn, 8 skip, 1784 pass
  • Rscript -e "devtools::check(document = FALSE, args = c('--no-manual'), error_on = 'never')": 0 errors, 0 warnings, 0 notes
  • DESCRIPTION: Version: 2.5.1

Copilot AI review requested due to automatic review settings June 19, 2026 14:33

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@pedrohcgs pedrohcgs changed the title from "Harden scalar argument validation" to "Harden validation and preprocessing edge cases" Jun 19, 2026
@pedrohcgs pedrohcgs requested a review from Copilot June 19, 2026 15:36

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

The hardening pass replaced complete.cases() with complete_finite_cases()
in both preprocessing paths, which dropped any row with a non-finite value
in a numeric column. gname == Inf is a documented never-treated code
("group status 0 or Inf"), so that filter silently deleted every
never-treated unit: under control_group = "notyettreated" it warned and
dropped them, and under control_group = "nevertreated" it returned
plausible-looking but wrong ATTs with no error.

Add a finite_exclude argument to complete_finite_cases() and pass gname in
both pre_process_did() and pre_process_did2(). Excluded columns still get
the NA/NaN check via complete.cases(); only legitimate Inf is preserved.
Restores parity with master, where gname = Inf is bit-identical to gname = 0.

Add regression tests: gname = Inf equals gname = 0 (ATT and influence
functions) across both code paths and both control groups, plus a
complete_finite_cases() unit test confirming NA/NaN gname is still dropped.
@pedrohcgs pedrohcgs added the release CRAN release PR: skip the dev-version bump on merge label Jun 19, 2026
@pedrohcgs pedrohcgs merged commit 9aba07d into master Jun 19, 2026
6 checks passed