Adversarial review of /data/local/software/ModelArray (entry points, batch/scalar
runners, stage scripts, ML stage, GUI). Issues are written from the perspective
of a skeptical reviewer trying to break, misuse, or distrust the pipeline. Each
item lists severity, location, the failure mode, and a concrete fix.
Severity legend: P0 = data integrity / silent corruption / security, P1 = breaks in realistic use or produces misleading output, P2 = correctness/usability nit, P3 = polish.
gui/app.py calls serve(app, host="0.0.0.0", ...). Anyone on
the network can:
- Browse the filesystem via
/browse(no path restriction, walks anywhere readable by the host user). - Read arbitrary phenotype files via
/parse_tsv(path traversal blocked only by a substring check for..; absolute paths like/etc/passwdpass straight through). - Launch arbitrary pipeline runs via
/run; arguments such asinput_dir,participants,cohort_fileare inserted into asubprocess.Popenargv list without any allow-listing — they can point at any path the server user can read/write. - Kill any job via
/stopand shut the server down via/shutdownwith no token check.
Fix: bind to 127.0.0.1 by default; require a launch flag (or env var) to expose externally; add a shared-secret token middleware; restrict /browse and /parse_tsv to a configurable root (e.g. BIDS_DIR); reject absolute paths outside that root via Path.resolve().is_relative_to(root).
gui/app.py (save_project), load_project, get_config, save_config all guard with if ".." in name. This misses absolute paths (/etc/...), symlinks, and CRLF-encoded traversal. With write endpoints (save_project, save_config) this is exploitable to overwrite files anywhere the server user can write.
Fix: validate names against a strict regex (^[A-Za-z0-9_-]+(\.json)?$) and build paths with Path(root) / name, then check resolved.is_relative_to(root).
With network exposure (1.1) any client can POST /shutdown and terminate a running pipeline (the monitor thread loses its handle). Combine with 1.1 → trivial denial of service.
1.4 [P2] Hardcoded absolute paths to /data/local/134_AF19, /data/local/templateflow, /data/local/container/...
Defaults are scattered across 0_register_acpc_to_mni.sh, _run_modality_batch.sh, _run_modality_batch.sh, 2_run_convoxel.sh, inspect_modelarray_h5.sh, setup_and_run_2mm.sh, run_2mm_full.sh. Anyone else cloning this repo is blocked, and quietly bumping a container version (0.1.5.sif → 0.1.6.sif) requires editing many files.
Fix: centralize in a single defaults.env sourced by every script; require the dataset/container fields in the config and abort if unset.
3_run_model.sh replaces continuous covariate names with <name>_DM and categorical ones with <name>_F using sed "s/\\b$covariate\\b/${covariate}_DM/g" on the user-provided formula. Failure modes:
- Word boundaries on dot/underscore: a covariate
sexwill match insides(sex_id). With our actual configs,participantis also a categorical name — the substitution rewrites all occurrences, includingparticipant_idxif it appears literally in the formula (it does in configs/pipeline_mapmri_wholebrain_2group_gam_2mm.json) — producingparticipant_F_idx, breaking the random effect. - Order dependence: replacing
groupbeforegroup_Fis fine, but replacingparticipantafter listing it as categorical does not match the literalparticipant_idx, while listing it both ways causes subtle aliasing. - Special chars in covariate names (parentheses, dots) become regex metacharacters.
Fix: build the formula in R (where parsing is structural), not shell. Or maintain explicit <name> -> <renamed> substitution using an R terms()-based walk; never sed over s(...) expressions.
3_run_model.sh sets modelarray@scalars[["$SCALER_TYPE"]][dropped_ids, ] <- NaN. ModelArray opens the H5 read/write by default; this mutation persists to disk, contaminating the cohort H5 used by Stage 4 ML and any rerun. Also the regex test grepl("by\\s*=\\s*group_F", formula_text) runs on the renamed formula — but the rename uses sed (item 2.1) and may already have produced group_F_F if group_F was misconfigured as categorical.
Fix: keep pre-filter results in an in-memory mask and pass it to ModelArray.gam via element.subset. Never mutate the scalar matrix in place; if a write is required, write a copy of the H5 and operate on the copy.
4_run_ml.py builds the splitter with group_column (typically participant) to prevent leakage. Good. But the per-fold metrics call roc_auc_score(y_test, ...) using all sessions for a participant pooled into one fold's test set — that is correct for grouping but the per-row predictions are still emitted to predictions.csv with participant and session columns, encouraging downstream users to compute per-session metrics that double-count subjects. Document this explicitly, or aggregate to one row per participant before computing accuracy/f1.
2.4 [P1] Classification labels are coerced to str after NaN filter, but numeric target (e.g. group=1,2) ends up as "1.0" / "2.0"
4_run_ml.py does y_series = y_raw.astype(str). Pandas turns numeric 1 into "1.0". Downstream predictions.csv then writes "1.0"/"2.0" strings, mis-aligning with any external label list and breaking joins on group. For booleans this is worse ("True" vs True).
Fix: cast via pd.Categorical(y_raw).astype(str) after explicitly normalizing numeric labels (y_raw.astype("Int64").astype(str) when integral).
4_run_ml.py adds a SimpleImputer(strategy="mean") for svm_rbf and elastic_net. Random Forest and XGBoost skip the imputer entirely — XGBoost handles NaNs natively, but RandomForestClassifier will raise on NaN inputs and the run dies mid-CV after producing partial metrics from earlier models.
Fix: check np.isnan(X).any() once at load, log the count, and either error out or impute for all models uniformly. Add a config flag ml.impute_strategy.
_run_scalar_pipeline.sh compares EXPECTED_CSV_PATH vs. CONFIGURED_CSV_PATH after realpath -m. If data_dir is reachable via two different real paths (NFS mount + bind mount, or via a symlink in one and not the other), the equality fails and the runner aborts even though the files would be created correctly.
Fix: compare basenames and parent inode rather than the full string, or accept a --no-strict-paths override.
2.7 [P1] Cohort generators do not validate that participants TSV rows actually correspond to MNI-space scalars at the configured resolution
1_generate_cohort_longitudinal.sh emits a WARNING and continues when a NIfTI is missing — the resulting cohort has fewer rows than the participants file but no exit code, and the user only sees the warning in the captured log. Worse, 1_generate_cohort.sh writes METADATA from ${LINE[*]:1} where IFS=',', discarding embedded commas in any phenotype column (e.g. a free-text site name "Pittsburgh, PA").
Fix: abort by default when any row is missing data (allow --allow-missing to opt in); quote CSV fields properly (use python3 -c 'import csv; ...' or an awk CSV-safe writer).
1_generate_cohort.sh and pick_mask_file. If two subjects share a prefix (sub-100, sub-1001) the wrong mask/scalar is picked. The "single file in folder → take it" fallback (1_generate_cohort.sh) silently misassigns one mask to every subject when the mask folder has exactly one file.
Fix: match on ^${subject}(_|\.) anchored regex; never fall back to "the only file".
prepare_longitudinal_participants.sh: when a subject is missing age@ses-1 the script writes empty age_ses1 and empty delta_age but still emits the row. Downstream GAM with s(delta_age, ...) will see NA for that subject — mgcv will drop the row with a warning that the user never sees. The cohort H5 still contains that subject's voxels, so the N used in fitting differs silently from the N reported by 1_generate_cohort_longitudinal.sh.
Fix: abort (or emit a hard "QC" CSV) when any row has unresolvable delta_age/age_ses1.
_run_modality_batch.sh mrtransform ... -interp nearest. Correct for binary masks, but the function also moves the output into $group_mask via mv -f, overwriting the user-provided source. If the source path was a symlink into TemplateFlow (typical, see pipeline_noddi_wholebrain_2group_gam_2mm.json), and the resampling happens, the copy already made into $OUTPUT_DIR/group_mask.nii.gz is overwritten — OK. But the dimension check uses mrinfo -size which compares spatial extents only, not voxel spacing or orientation; mismatched qform/sform between a res-02 template and a 2mm-but-different-grid scalar will silently pass.
Fix: also compare mrinfo -vox and mrinfo -transform; abort on mismatch instead of resampling.
1_generate_cohort_longitudinal.sh, 0_register_acpc_to_mni.sh. The || true is a workaround for ((x++)) returning non-zero when x=0. Acceptable, but the same pattern is used to count failures — a single failed-counter increment can hide a real failure when combined with set -e -o pipefail removed locally. Audit and replace with : $((x+=1)) which is always exit 0.
3_run_model.sh starts with a bare #!/bin/bash and no set -euo pipefail. The R script generation, the Singularity launch, and the post-run NIfTI export each fail independently. The script does check $MODEL_EXIT for the model run, but the volumestats_write step's exit status is not checked — failures here are reported only via tee output, so a missing voxelwise NIfTI export looks like success.
Fix: add set -uo pipefail; capture ${PIPESTATUS[0]} after the second singularity run | tee.
3_run_model.sh writes the formula, scalar name, analysis name, and MODEL_OPTIONS_R into the heredoc without any quoting. A scalar named foo"); system("rm -rf " (admittedly contrived, but more realistic: any single quote or backslash in a config value) will produce broken R code that fails at parse time. With the GUI exposed (1.1) this becomes a remote code execution surface inside the container.
Fix: pass these as command-line arguments to a checked-in .R file (Rscript model.R --scalar "$SCALER_TYPE" --formula "$FORMULA"), and use commandArgs(trailingOnly=TRUE) inside R; never cat <<EOF user data into source code.
run_analysis.sh uses mktemp --suffix=.json then trap cleanup EXIT. But the runner is exec'd into _run_modality_batch.sh or _run_scalar_pipeline.sh at run_analysis.sh — the parent shell is replaced and the EXIT trap never fires. The temp config file leaks into /tmp on every overridden run.
Fix: unset the trap before exec, and write the override to $OUTPUT_DIR/configs/generated/_override.json (deterministic, easy to clean up).
Same chain as 3.4: --nohup launches bash "$0" with the path to the override temp file at run_analysis.sh, then exits. The original trap cleanup EXIT deletes the override before the detached child reads it. Race condition; the child may or may not see the file depending on FS timing.
Fix: when entering nohup mode, materialize the override config to a non-tmp persistent path before launching.
0_register_acpc_to_mni.sh. On first failure, in-flight jobs finish but no more are launched. PROCESSED=${#JOB_LIST[@]} is then assigned anyway, so the "Processed: N" line lies.
Fix: count actual successful returns via parallel --joblog.
2_run_convoxel.sh computes a longest-common-prefix path across realpaths and binds it into the container. If two symlinks resolve to /data/A/... and /mnt/B/..., the common prefix collapses to /, and the script binds the host root into the container. Severely overscoped on shared systems.
Fix: compute one bind per distinct parent directory of the resolved files, not a common prefix; cap to a configurable allow-list.
setup_and_run_2mm.sh cp blasts the templateflow mask in every run, then copies it again to root. Loses any per-cohort group mask the user may have intentionally regenerated.
Logging relies on stdout being redirected externally. In --nohup mode this works; in foreground it dies with the terminal. Add an explicit LOG_FILE in non-nohup runs too.
run_analysis.sh detects "modality" by presence of dataset+modality+statistics and "scalar" by data_dir+csv_file+h5_file. A user who supplies all five fields gets "modality" silently. Bad configs (missing typed keys, wrong shape for categorical_variables, int vs string for resolution) are caught only when downstream scripts fail.
Fix: add a schemas/ directory with JSON Schema for both shapes; validate with jsonschema (Python) or ajv early in run_analysis.sh. Reject ambiguous configs.
The codebase variously reads .scaler_type (R: 3_run_model.sh; Python: 4_run_ml.py) but _run_modality_batch.sh emits both correctly. New contributors writing configs with scalar_type will silently get an empty value because jq -r '.scaler_type // empty' returns "" and the runner errors with "scaler_type is required".
Fix: accept both keys; warn-then-migrate. Pick scalar_type as canonical (matches BIDS terminology) and deprecate scaler_type.
4_run_ml.py. On a shared host with n_cores: 1 set for the GAM stage out of consideration for other users, the ML stage still spawns one fit per CPU. Random Forest with 500 trees × multi-fold × all cores on a 64-core box will starve everything else.
Fix: default ml.n_jobs to min(4, statistics.n_cores) when unset.
3_run_model.sh: string values are JSON-escaped and passed to R as-is. A value like "REML" becomes the R literal "REML" — fine. A value like "y ~ x" would be passed as a string, but the user might expect it to be parsed as a formula. There is no documentation about which mgcv::gam options accept strings vs. unquoted symbols.
run_analysis.sh exposes --skip-reg, --skip-ml, --force, --scalar. The underlying scripts have their own flags (-f, -d, -s overloaded between resolution and subgroup). Hard to learn.
Each stage hardcodes a container path (modelarray_confixel_0.1.5.sif). The H5 result does carry an analysis_name, but neither the cohort CSV, summary CSV, nor ML metrics record the container SHA, repo commit, or config hash.
Fix: at start of every run, write <output_dir>/_provenance.json with: git commit (git -C "$SCRIPT_DIR" rev-parse HEAD), container path + sha256sum, full effective config (after overrides), hostname, start time, user.
3_run_model.sh. Last run wins; impossible to compare runs after the fact.
Fix: name it generated_script_${ANALYSIS_NAME}_${TIMESTAMP}.R and keep alongside the log.
4_run_ml.py — if min_class_n < n_splits, the code silently reduces n_splits rather than warning + recording the effective fold count in the summary header.
Fix: record requested_n_splits and effective_n_splits in summary.json.
requirements.txt lists flask, jinja2, waitress twice. Use pip-compile/uv and commit a requirements.lock.
Zero test_*.py/*_test.sh files; no CI; no --dry-run parity test. The whole pipeline is verified by manual eyeballing of nohup_*.log (e.g. the user's currently-open log).
Fix: add tests/:
- Schema tests for each config in
configs/. - A tiny synthetic dataset (8 subjects × 2 sessions × 5×5×5 voxels) committed under
tests/fixtures/; a top-levelmake testthat runs full pipeline + ML on it. - Property tests for
prepare_longitudinal_participants.sh(missing age sessions, single-session subjects, etc.). - Regression test: after rerunning a fixed config,
metrics.csvandsummary.jsonare bit-identical forrandom_seed.
ModelArray will happily write an analysis under a duplicate name (overwriting prior results in the H5 group) and there is no --no-clobber guard.
render_nifti_preview.sh exists but is never called by 3_run_model.sh or the batch runner. Users have no automated thumbnail of beta/t/p maps after a run.
gui/app.py allows one job at a time per process, but spawning a second GUI on a different port (the auto-port logic) creates a second jobs surface that can write to the same LOGS_DIR and to the same output trees simultaneously. Also, /run with a long-running step + a user hitting --force regenerate on the same cohort_*.csv invites races.
Fix: advisory file lock (flock) on <output_dir>/.modelarray.lock for any write.
A user can launch any number of --nohup jobs that all write to /tmp/nohup_*.log and to the same OUTPUT_DIR. Stage E uses R_SCRIPT_PATH="$DATA_DIR/generated_script.R" (fixed name → mutual corruption).
8.1 [P2] README (README.md) shows a --scalar icvf_dwimap example but does not warn that the rerun reuses the existing cohort_*.h5 even when --force is not passed
The cohort regen logic in _run_scalar_pipeline.sh and --force semantics in run_analysis.sh require reading three files to understand.
What columns are required (participant_id, session, age@ses-N, group, sex)? What are valid group values? How is participant_id formatted (sub-XXXXXX vs sub-XXXXXX_ses-N)? The shell scripts and prepare_longitudinal_participants.sh each assume something different.
Fix: a docs/participants_schema.md and a validator script.
| Order | Theme | Items | Outcome |
|---|---|---|---|
| 1 | Lock down GUI | 1.1, 1.2, 1.3 | Safe to run on shared host |
| 2 | Stop silent data corruption | 2.1, 2.2, 3.3 | Trustworthy formulas; no in-place H5 mutation |
| 3 | Make pipeline failures loud | 2.7, 2.8, 2.9, 3.2, 3.6 | No more silent row drops / fuzzy file picks |
| 4 | Provenance + tests | 5.1, 6.1, 6.2 | Repeatable, auditable runs |
| 5 | Config schema + canonical key | 4.1, 4.2 | One config shape, validated up-front |
| 6 | ML correctness | 2.3, 2.4, 2.5, 4.3 | Labels round-trip; NaN policy explicit |
| 7 | Reduce hardcoded paths | 1.4, 3.7 | Portable to other datasets/hosts |
| 8 | Polish | 3.1, 3.4, 3.5, 4.4, 5.2, 5.3, 5.4, 6.3, 7.x, 8.x | Operational quality of life |
Generated as a hostile review; treat every "P0/P1" as a real risk to validated science output until proven otherwise on a small synthetic fixture.