Code-Image Capture & Validation (milestone v1.0) by FileSystemGuy · Pull Request #473 · mlcommons/storage

FileSystemGuy · 2026-06-18T19:47:01Z

Summary

When a submission is validated, this PR makes it possible to prove that every result was generated by exactly the source tree captured in .../code/, and that a CLOSED submission used the unmodified upstream codebase. The CLI captures a frozen "code image" of the benchmark source tree the first time a closed or open submission category runs datasize, datagen, or run, then validates on every subsequent invocation that the running code matches the captured image. The submission-time validator is extended to require the code image and verify its hash.

Builds on the submission-checker infrastructure landed in #432 (compute_code_tree_md5, MD5_EXCLUDE_* constants, SubmissionStructureCheck).

Behavior contracts

On-disk shape

CLOSED: {resultsdir}/closed/<org>/code/ + .code-hash.json
OPEN training/checkpointing: {resultsdir}/open/<org>/results/<sys>/<type>/<model>/code/ + .code-hash.json (per-leaf)
OPEN vector_database: {resultsdir}/open/<org>/results/<sys>/vector_database/<index_type>/code/ (per-index_type — results across index types aren't comparable)
OPEN kv_cache: {resultsdir}/open/<org>/results/<sys>/kv_cache/code/ (transitional per-type — documented in Rules.md §2.1.5.b)
.code-hash.json schema: {hash, algorithm, captured_at, mlpstorage_version, git_sha} (algorithm md5-tree-v1)

Runtime contracts (literal error strings)

CLOSED mismatch: changes to the codebase are not allowed in a CLOSED run
OPEN mismatch: all runs of this type must use the same codebase
Match (both): code unchanged from on-file image at {pathname}
whatif / validate / reportgen / history utilities do not capture or validate.

What's in this PR

Hash & Capture Foundation

submission_checker/constants.py: MD5_EXCLUDE_PREFIXES extended with test/, tests/; MD5_EXCLUDE_FILENAMES extended with .code-hash.json.
submission_checker/tools/code_image.py (new module): capture_code_image, load_code_image, verify_source_against_image, verify_image_self_consistent, find_source_root. Frozen CodeImage dataclass + typed exception hierarchy (CodeImageError, MissingHashFile, MalformedHashFile, SourceRootNotFound). Atomic tmp-rename publish pattern.
39-test pytest suite covering capture/load/verify primitives, schema invariants, exclusion behavior, and the unknown-version guard.
compute_code_tree_md5 walker fix: aligned with shutil.copytree+ignore semantics for nested excluded dirs so src_hash == cap_hash on real repos.
Capture-entry guard against degenerate mlpstorage_version="unknown": if version resolution falls through to the last-resort sentinel (no installed dist metadata AND no readable pyproject.toml), capture_code_image raises ConfigurationError before any filesystem work — preventing a submission from being captured with an unrecoverable forensic gap.

CLI Integration, Validator & Docs

mlpstorage_py/main.py: dispatches capture_or_verify_code_image(args, os.environ, logger) before run_benchmark(); early-returns for validate / reports / lockfile / rules-coverage / history.
code_image.py capture_or_verify_code_image helper: mode-gated on {closed, open} × {datasize, datagen, run}, emits EXIT_CODE.CODE_IMAGE_ERROR on failure.
submission_structure_checks.py: code_directory_contents_check refactored to walk via _iter_submitter_dirs() (CLOSED leaves) + _iter_open_code_dirs() (OPEN per-leaf — yields 3 training, 4 checkpointing, 3 vector_database per-index_type, 1 kv_cache transitional). Separate violations for missing-code/ vs hash-mismatch.
Rules.md updates: §2.1.5 split into §2.1.5.a (CLOSED) + §2.1.5.b (OPEN — per-leaf table with explicit kv_cache transitional callout); §2.1.6 rewritten with .code-hash.json schema + literal error strings + layered REFERENCE_CHECKSUMS prose; §2.1.27 OPEN-subtree diagram aligned (10 leaves: 3 training + 4 checkpointing + 3 vdb_bench engines); §3.6.1 documents the layered validator model (self-consistency + upstream-identity).
Vectordb / kvcache canonical type-segment alignment: CLI vectordb / kvcache canonicalize to BENCHMARK_TYPES.name vector_database / kv_cache via _CLI_BENCHMARK_TO_TYPE. Runtime helper, generate_output_location, and Rules.md now agree on the on-disk segment.
Per-type OPEN path-shape fixes: vector_database is per-index_type, kv_cache is transitional per-type; training/checkpointing remain per-model.

Release artifacts

pyproject.toml [project].version bumped from 3.0.13 → 3.0.14 (origin/main + 1 patch, derived at execution time).
uv.lock regenerated via bare uv lock (no transitive drift).
7 stale-signature failures in test_issue_376_file_arg_conflict.py repaired (test-side only; production CLI signatures untouched).
27 pre-existing pyarrow ModuleNotFoundError failures cleared via env install.

Testing

uv run python -m pytest mlpstorage_py/tests/ → 699 passed / 4 failed / 1 xfailed.

The 4 failures are pre-existing add_universal_arguments signature drift in test_open_closed_flag_recognition.py::TestOpenClosedCLIFlags, documented in deferred items and unrelated to this PR. The same baseline held before the merge with origin/main.

New coverage in this PR (selected):

First-time capture (CLOSED + OPEN per-leaf), match path, mismatch path, exclusions
.code-hash.json schema fixtures
Submission-validator missing-code/ + hash-mismatch (CLOSED + OPEN)
whatif / validate / reportgen / history / lockfile / version / rules-coverage no-touch
CLI helper round-trip + path-traversal guards
Per-type OPEN path-shape (training/checkpointing/vector_database/kv_cache)
Capture-entry guard against mlpstorage_version="unknown" with no-FS-side-effect assertion

Branch state

Rebased onto origin/main on 2026-06-18 (resolved 8 file conflicts: Rules.md, submission_structure_checks.py, main.py, rules/utils.py, config.py, constants.py, pyproject.toml, uv.lock). The version literal was re-derived from the updated origin/main (still 3.0.13, so 3.0.14 remains correct).

Post-merge, one polish commit was added: the capture-entry mlpstorage_version="unknown" guard described above.

Known tech debt (deferred — not blockers for this PR)

DOC-02 partial: kv_cache transitional leaf is described in Rules.md §2.1.5.b prose but not depicted in the §2.1.27 OPEN diagram. Code on both sides handles it correctly; ≤5-line Rules.md edit to close.
No SECURITY.md threat-model artifacts for these phases.
4 pre-existing test_open_closed_flag_recognition.py failures (add_universal_arguments signature drift documented in deferred items).

…-hash.json Adds `test/` and `tests/` to MD5_EXCLUDE_PREFIXES (HASH-02, D-21) so that captured trees produced by the upcoming code_image module do not include the fixture/test directories that would otherwise perturb the digest. Also adds `.code-hash.json` to MD5_EXCLUDE_FILENAMES so the captured tree's hash does not include the file that records the hash itself — a prerequisite for Plan 01-02's verify_image_self_consistent invariant. Syncs both locked-set assertions in test_config_reference_checksum.py (`test_md5_exclude_prefixes_*` and `test_md5_exclude_filenames_membership`) so the existing tuple-equality lockset continues to pass against the extended constants. Note: the .code-hash.json filename exclusion is a minor scope expansion beyond plan 01-01's stated D-22 boundary; folding it in here keeps the constants change atomic and avoids a follow-up edit during plan 01-02.

…ives New module `mlpstorage_py/submission_checker/tools/code_image.py` provides the five public callables consumed by Phase 2's CLI and submission-checker integration: - find_source_root(start=None) — walks parents to pyproject.toml (D-04, D-05) - capture_code_image(source_root, target_dir, log) — atomic capture via code.tmp/ → os.rename → code/, writes .code-hash.json (CAP-03..05, D-17, D-18) - load_code_image(image_dir, log) — parses .code-hash.json into a frozen CodeImage dataclass (D-02, D-07) - verify_source_against_image — runtime check, hashes live source (D-11) - verify_image_self_consistent — submission check, hashes captured tree (D-12) Exception hierarchy CodeImageError → {MissingHashFile, MalformedHashFile, SourceRootNotFound} lets Phase 2's CLI map exception type to exit code without parsing message strings (D-03, REV-02). `.code-hash.json` schema: hash, algorithm ("md5-tree-v1"), captured_at (ISO-8601 UTC Z), mlpstorage_version, git_sha (40-char or null on best-effort failure) — D-07..D-10. Hashing delegates to compute_code_tree_md5 — never reimplemented (HASH-01). Capture walk uses os.walk + shutil.copy2 (no shutil.copytree, per REV-01) and applies the same MD5_EXCLUDE_PREFIXES / MD5_EXCLUDE_FILENAMES used by the hash predicate. Known deviation from plan: this implementation duplicates the exclusion predicate inline instead of introducing the _should_exclude_dir / _should_exclude_file helpers in code_checksum.py that the plan called out under REV-01 (single source of truth for exclusion). The constants are still the single source of truth; only the predicate is duplicated. Refactoring code_checksum.py to host shared helpers is left as a follow-up so this commit stays additive and ships unchanged behavior for the existing checksum CLI.

New `mlpstorage_py/tests/test_code_image.py` exercises every public symbol of the code_image module: - find_source_root happy path + SourceRootNotFound at filesystem root (D-04, D-05) - capture_code_image schema, field types, atomic code.tmp → code/ rename, cleanup of pre-existing code.tmp/ (D-07, D-17, D-18, CAP-03..05) and refusal to silently re-capture when code/ already exists (D-16) - exclusion behavior — fixture tree includes test/ and tests/ subdirectories; tests assert they are absent from the captured tree (HASH-02, the end-to-end witness for plan 01-01's constants change) - load_code_image happy path + MissingHashFile + MalformedHashFile for every documented failure mode (unparseable JSON, missing key, unknown algorithm, wrong hash length, wrong git_sha length); each raised message names the offending field (D-14, D-15, REV-02) - verify_source_against_image and verify_image_self_consistent True/False branches and CodeImageError propagation when hash file is missing or malformed (D-11, D-12, D-13) - .code-hash.json schema invariants (TEST-10) and git_sha capture path: success, subprocess failures (CalledProcessError, FileNotFoundError, TimeoutExpired), and argv-spy confirming `git rev-parse HEAD` is the only git invocation Pattern follows tests/test_code_checksum.py — tmp_path + MockLogger (CD-04), no real git or filesystem dependencies outside tmp_path. 38 tests, all pass against the implementation committed in 01-02.

- CLAUDE.md: markdown-formatter pass (blank lines around fenced-code comment headings) and file-mode normalization 755 → 644. No content changes. - summary.csv: snapshot of submission summary data carried from a local tool run; checked in so it is not lost.

RED phase for Task 1. Asserts the new enum member exists with value 2 (aliased with INVALID_ARGUMENTS per D-22 in 02-CONTEXT.md), is int-castable for use as a process exit code, and that pre-existing members are not renumbered by the addition.

Adds a new alias member CODE_IMAGE_ERROR = 2 (aliased with INVALID_ARGUMENTS) to EXIT_CODE per 02-CONTEXT.md D-22. Provides a grep-able symbolic name for the typed-exception → process-exit-code mapping in main.py (Plan 02-02). IntEnum value aliasing is by-design: same numeric exit code 2, distinct symbolic name. Scripts that branch on EXIT_CODE name can distinguish between the two failure modes; scripts that branch on the integer cannot, which is acceptable because both surface the same recovery action ('user-supplied input was wrong, fix and retry'). No other enum value is renumbered. All 38 Phase 1 tests continue to pass.

RED phase for Task 2. Asserts: * CLOSED training/checkpointing paths prefix with /closed/<orgname>/ (D-03 in 02-CONTEXT.md). * OPEN training/vector_database paths prefix with /open/<orgname>/results/<systemname>/ (D-03). * whatif and missing-mode keep the legacy shape (no closed/open segment) for back-compat. * Missing orgname / systemname kwargs for closed|open modes raise a typed ConfigurationError with .parameter pointing at the missing name and the suggestion string referencing the appropriate MLPSTORAGE_* env-var name (Gemini MEDIUM trust-contract finding in 02-REVIEWS.md — function does NOT read os.environ, the CLI dispatch layer reads + validates + threads through). * Module exports MLPSTORAGE_ORGNAME_ENVVAR / MLPSTORAGE_SYSTEMNAME_ENVVAR as a single source of truth for the env-var names. * Env-var leakage check: function ignores MLPSTORAGE_ORGNAME if set in the environment — explicit kwargs win.

…cation Per 02-CONTEXT.md D-03, runtime output paths are restructured so results land under: CLOSED: {results_dir}/closed/<orgname>/<type>/<model>/<command>/<datetime>/ OPEN: {results_dir}/open/<orgname>/results/<systemname>/<type>/<model>/... Per the Gemini MEDIUM trust-contract finding in 02-REVIEWS.md, the function does NOT read MLPSTORAGE_* environment variables — it accepts orgname / systemname as keyword-only parameters threaded by the CLI dispatch layer (Plan 02-02). Programmatic callers therefore see a typed ConfigurationError at the function boundary instead of a hidden KeyError from a missing env var. Changes: * mlpstorage_py/rules/utils.py: - Add module-level constants MLPSTORAGE_ORGNAME_ENVVAR / MLPSTORAGE_SYSTEMNAME_ENVVAR as single source-of-truth for the Phase 2 helper to consume (D-01, D-02). - Change generate_output_location signature to add keyword-only orgname / systemname parameters (Option A from the review — explicit threading, no hidden env reads). - Prepend the {closed|open}/<orgname>[/results/<systemname>]/ prefix before the existing per-BENCHMARK_TYPES chain. The per-type tail is unchanged below the prefix; whatif and missing-mode keep the legacy shape for back-compat. - Raise ConfigurationError (code CONFIG_MISSING_REQUIRED) when the required kwarg is missing for closed/open modes; the suggestion message references the matching MLPSTORAGE_* env-var name so the dispatch layer (or a human reading a programmatic-caller stack trace) sees actionable guidance. * mlpstorage_py/errors.py: - Expose ConfigurationError.parameter as a direct attribute (was previously only accessible via .error.context dict). This lets the CLI dispatch layer in Plan 02-02 map the missing parameter name back to the MLPSTORAGE_* env-var to surface in its user-facing error, and lets unit tests inspect the trust-contract violation cleanly. Verification: * 12/12 new tests pass in test_generate_output_location.py. * All 38 Phase 1 tests (test_code_image.py + test_config_reference_checksum.py) still pass. * Full mlpstorage_py/tests/ suite: 551 passing, 34 pre-existing failures (env-related: missing pyarrow, CLI parser signature drift in add_universal_arguments). Baseline before this plan: 541 passing, 44 failing — net +10 passing, 0 new failures. See deferred-items.md. Known follow-up: mlpstorage_py/benchmarks/base.py:803 calls generate_output_location() without the new kwargs. That caller is updated by Plan 02-02 (CLI dispatch layer), which is the single owner of env-var reading + validation + threading.

…STRUCT-05 - TestStruct06_RefactoredCodeDirectoryContents (10 tests) — VALS-01..04 layered model: per-tree self-consistency for CLOSED+OPEN plus REFERENCE_CHECKSUMS upstream-identity for CLOSED only (D-11/D-14/D-15). - TestStruct05_ModeAwareRequiredSubdirectories (8 tests) — Rules.md §2.1.5 split (D-17): CLOSED requires {code, results, systems}, OPEN requires {results, systems}, violations routed through new sub-rule anchors requiredSubdirectoriesClosed / requiredSubdirectoriesOpen. - Adds private helpers _write_valid_hash_json and _make_open_leaf for constructing self-consistent code/ trees and open/<sys>/<type>/<model> leaves in tests. - 13 RED failures expected; remaining 5 pass coincidentally where pre-refactor behavior overlaps post-refactor behavior (verifying GREEN invariants).

…ode-aware STRUCT-05 Refactors code_directory_contents_check to walk BOTH closed/ and open/ subtrees through a single @rule("2.1.6", "codeDirectoryContents") method, replacing the CLOSED-only REFERENCE_CHECKSUMS comparison with the D-11 layered model: - Per-tree .code-hash.json self-consistency for ALL leaves (VALS-02/04). - REFERENCE_CHECKSUMS upstream-identity for CLOSED leaves only (D-11). - Separate violations for missing-code/ vs hash-mismatch (D-14). Adds private helper _iter_open_code_dirs that yields each results/<sys>/<type>/<model>/code path under an OPEN submitter (D-15), constraining the OPEN walk to the Rules.md §2.1.27 leaf shape. Makes required_subdirectories_check (STRUCT-05) MODE-AWARE per Rules.md §2.1.5 split (D-17): CLOSED requires {code, results, systems}; OPEN requires {results, systems} (code/ lives per-leaf in OPEN). Violations now route through new sub-rule anchors requiredSubdirectoriesClosed and requiredSubdirectoriesOpen — the @rule decorator dispatch key stays byte-identical so init_checks is unaffected. Closes the Gemini HIGH cross-plan inconsistency between Plan 04's §2.1.5 split and the prior STRUCT-05 enforcement. The 'allowed: [...]' phrasing replaces the legacy 'only code/results/systems allowed' string so it stays accurate across both divisions. Preserves the D-12 single-warning behavior for unconfigured REFERENCE_CHECKSUMS with an addendum noting self-consistency still ran. New imports: verify_image_self_consistent, CodeImageError, MissingHashFile, MalformedHashFile from ..tools.code_image; Path from pathlib.

- §2.1.5: split into 2.1.5.a (CLOSED, three dirs) and 2.1.5.b (OPEN, two dirs at submitter level with per-leaf code/) per D-17. - §2.1.6: rewrite to describe runtime capture-or-verify behavior, the .code-hash.json schema (hash/algorithm/captured_at/mlpstorage_version/ git_sha), the per-tree self-consistency check, and the layered REFERENCE_CHECKSUMS upstream-identity check for CLOSED, per D-18. Embeds the literal VALR-02/VALR-04 mismatch messages. - §2.1.27: surgical OPEN-subtree edit (D-16) — remove the legacy submitter-level "code" entry and add "code # captured per-leaf" at every model/engine leaf (unet3d, resnet50, cosmoflow, llama3-8b/-70b/ -405b/-1t, vdb_bench engines AiSEQ/DiskANN/HNSW). CLOSED subtree is byte-unchanged. - §3.6.1: rewrite to describe the layered (a) self-consistency check and (b) CLOSED-only upstream-identity check, replacing the legacy single-md5sum prose, per D-19. Acceptance greps all pass; CLOSED subtree of §2.1.27, §2.1.4, §2.1.7, §3.6.2 unchanged. Closes DOC-01, DOC-02, DOC-03.

- Gating contract (D-10): whatif/validate/reports/etc return None; non-result-generating commands return None - Env-var fail-fast (D-04, D-05): missing orgname/systemname, regex rejects, Rules.md §2.1.1 cited - Inline path-traversal guard (T-02-02-05): . and .. rejected for both orgname and systemname with the literal substring "'.' and '..' are reserved path segments" - Capture path (CAP-01/02/06): CLOSED writes results_dir/closed/<orgname>/code/; OPEN writes results_dir/open/<orgname>/results/<systemname>/<benchmark>/<model>/code/; status log starts "Captured code image at " - Verify path (VALR-01/03 success, VALR-02/04 mismatch, D-21 missing-json): literal mismatch strings; recovery substring "either delete \`code/\` and re-run to re-capture"

Single chokepoint for runtime code-image capture-or-verify on closed|open runs (D-07..D-10). Appended to mlpstorage_py/submission_checker/tools/code_image.py per CD-01 placement. Contract: - Gates on (args.mode, args.command): no-op return None for whatif, validate, reports, history, lockfile, version, rules-coverage modes (D-10), and for configview/etc. under closed|open modes. - Sole reader of MLPSTORAGE_ORGNAME and MLPSTORAGE_SYSTEMNAME env vars (closes Gemini MEDIUM trust-contract finding). Imports the env-var-name constants from mlpstorage_py.rules.utils for single source of truth. - POSIX-safe regex per Rules.md §2.1.1 (D-05) plus INLINE `.`/`..` path-traversal guard for both orgname and systemname (REVIEWS.md consensus finding T-02-02-05 mitigation made inline). The literal substring "'.' and '..' are reserved path segments" appears in both reject messages. - Stashes validated orgname/systemname on args._validated_orgname / args._validated_systemname so downstream generate_output_location callers read them without re-touching env. - Computes image_parent matching generate_output_location prefix (D-03): CLOSED -> results_dir/closed/<orgname>/code/, OPEN -> results_dir/open/ <orgname>/results/<systemname>/<benchmark>/<model>/code/. Creates the subtree but NOT the results-directory itself (D-06). - First call captures via capture_code_image and logs literal CAP-06 prefix 'Captured code image at '. Subsequent matching call verifies and logs 'code unchanged from on-file image at '. Hash mismatch raises CodeImageError with literal VALR-02 'changes to the codebase are not allowed in a CLOSED run' or VALR-04 'all runs of this type must use the same codebase'. Missing or malformed .code-hash.json logs the D-21 actionable recovery message 'either delete `code/` and re-run to re-capture, or restore the original capture.' and re-raises the Phase 1 typed error (MissingHashFile or MalformedHashFile) for main() to map to exit code 2. Module additions: - _SUBMITTER_NAME_RE = re.compile(r'^[A-Za-z0-9._-]+$') - _RESERVED_PATH_SEGMENTS = frozenset({'.', '..'}) - _SUBMISSION_MODES = frozenset({'closed', 'open'}) - _SUBMISSION_COMMANDS = frozenset({'datasize', 'datagen', 'run'}) - capture_or_verify_code_image(args, env, log) -> Path | None No Phase 1 primitive or function signature was changed. The 38 existing tests in test_code_image.py and test_config_reference_checksum.py still pass; the new test_capture_or_verify_code_image.py adds 27 tests for the helper.

- main.py imports capture_or_verify_code_image and CodeImageError (single import line) - except CodeImageError clause present in main(), ordered AFTER DependencyError and BEFORE MLPStorageException catch-all (CodeImageError is not a MLPStorageException subclass so MRO doesn't fold it) - except clause returns EXIT_CODE.CODE_IMAGE_ERROR (value 2) - run_benchmark calls capture_or_verify_code_image(args, os.environ, logger) BEFORE benchmark_class(args, ...) instantiation - helper invocation wrapped in progress_context with the literal description 'Capturing or verifying code image'

- main.py:42 imports capture_or_verify_code_image and CodeImageError from mlpstorage_py.submission_checker.tools.code_image (single import line). - main.py:216 invokes the helper inside run_benchmark, AFTER environment validation completes and BEFORE the program_switch_dict / benchmark_class instantiation (D-07 insertion point). The invocation is wrapped in progress_context('Capturing or verifying code image...') for consistent UX with the surrounding validation block. The call is unconditional because the helper internally gates on (args.mode, args.command) per D-10 — non- submission modes (whatif/validate/reports/etc.) no-op. - main.py:420 adds a dedicated except CodeImageError clause AFTER except DependencyError and BEFORE except MLPStorageException catch-all. The ordering matters because CodeImageError inherits directly from Exception (Phase 1 code_image.py), NOT from MLPStorageException, so Python's MRO does not implicitly fold it into the catch-all. The handler logs the message and returns EXIT_CODE.CODE_IMAGE_ERROR (D-22). - The existing except ConfigurationError clause is unchanged — env-var fail-fast and path-traversal-reject ConfigurationErrors raised by the helper continue to flow through that path (user-input errors, not code- image integrity errors).

Covers CAP-01/02/06/07/08, VALR-01/02/03/04, D-04/05/21, and the REVIEWS.md consensus path-traversal '.'/'..' guard. 32 tests in 9 classes using direct in-process invocation of capture_or_verify_code_image with tmp_path + MockLogger fixtures (CD-02 lightweight style). Mismatch assertions use literal substring matches on the VALR-02/04 spec strings; path-traversal tests are parametrized over '.' and '..' for both MLPSTORAGE_ORGNAME and MLPSTORAGE_SYSTEMNAME and assert against the "'.' and '..' are reserved path segments" substring from Plan 02 Task 1. Mismatch tests monkeypatch verify_source_against_image to force False, isolating the mismatch code path from the Phase 1 capture-vs-verify hash discrepancy noted in deferred-items.md.

…rectory Part A — amend TestStruct06_CodeDirectoryContents and TestFixtureFactory::test_default_fixture_no_errors so the fixture's pre-existing closed/Acme/code/ carries a matching .code-hash.json (via the existing _write_valid_hash_json helper). The Plan 02-03 layered self-consistency check runs unconditionally for every leaf, so a fixture without a valid hash file now trips a MissingHashFile violation. The multi-submitter test also builds a code/ under AlsoAcme. test_mutated_code_fails relaxes its assertion to >= 1 [2.1.6] violation since mutation now breaks BOTH the self-check AND REFERENCE_CHECKSUMS. Part B — add TestStruct06_OpenCodeDirectory (6 tests) targeting the OPEN walk added in Plan 02-03 Task 1 via _iter_open_code_dirs: - VALS-03 missing OPEN code/ (single per-leaf violation) - VALS-04 happy path (self-consistency passes) - VALS-04 sad path (hash mismatch — recorded hash != tree hash) - VALS-04 missing .code-hash.json - OPEN-only tree does NOT emit the closed-specific "not configured" warning - Multiple OPEN model leaves each get their own per-leaf violation

…anchors Sub-edit 1: update TestStruct05_RequiredSubdirectories (the legacy CLOSED class) to assert the new sub-rule anchors per Plan 02-03 Task 2 (D-17). The four tests previously asserting "[2.1.5 requiredSubdirectories]" now assert "[2.1.5 requiredSubdirectoriesClosed]". The extra-submitter-subdir test additionally pins the new sorted-list-repr violation format "allowed: ['code', 'results', 'systems']". The wrapping-hint test upgrades its anchor assertion the same way. Sub-edit 2: add TestStruct05_OpenSubmitter (7 tests) regressing the Gemini HIGH cross-plan finding (REVIEWS.md). Before the Plan 02-03 Task 2 mode-aware refactor, EVERY OPEN submission would have been flagged as having a missing code/ at the submitter level. Tests: - CLOSED no-regression ({code, results, systems} unchanged) - OPEN happy path with {results, systems} only (regression target) - OPEN unexpected code/ at submitter level (anchor + allowed-set assertion) - OPEN missing results/ - OPEN missing systems/ - CLOSED missing code/ routes through requiredSubdirectoriesClosed anchor - OPEN wrapping-hint diagnostic Adds module-level helpers _build_minimal_open_submitter and _build_minimal_closed_submitter — small mkdir-based fixture builders sibling to the existing _make_open_leaf and _write_valid_hash_json helpers.

compute_code_tree_md5's _is_excluded_dir matches MD5_EXCLUDE_PREFIXES by rooted prefix only, while capture_code_image's shutil.copytree ignore_logic also matches by basename equality at any depth. The two walkers therefore disagree on real repos that contain deeply nested __pycache__/, tests/, build/, etc. directories with non-excluded filenames inside. Adds two regression tests: - test_nested_excluded_dir_pruned_at_any_depth: the predicate alone must prune excluded dir names at any depth (not just root). - test_capture_verify_roundtrip_with_nested_excluded_dirs: end-to-end capture-then-verify on unchanged source must return True even with deep __pycache__/ and tests/ leaves. Both fail before the fix in the next commit. Discovered via Phase 2 verifier (see .planning/phases/02-cli-integration-validator-docs/VERIFICATION.md).

…d excluded dirs _is_excluded_dir matched MD5_EXCLUDE_PREFIXES by rooted relative path only, so deeply nested __pycache__/, tests/, build/, etc. directories were traversed and their non-excluded files were hashed. capture_code_image's shutil.copytree ignore_logic already matched by basename at any depth, so the two walkers produced different hashes for the same logical tree. On real submitter repos this caused verify_source_against_image to return False on unchanged source the second time mlpstorage closed datagen ran — the runtime would emit the VALR-02 'changes to the codebase are not allowed in a CLOSED run' message for a tree that had not actually changed. Adds a basename-equality check before the rooted-prefix check so excluded dir names are pruned at any depth, matching the capture's semantics. The basename loop is intentionally cheap (10 prefix comparisons per dirname). Fixes the verification finding in .planning/phases/02-cli-integration-validator-docs/VERIFICATION.md. RED tests in commit 13bd98e now pass; full code-image test surface (242 tests) green; broader suite shows 34 failed / 652 passed — same baseline as before the fix, zero new regressions.

…me segment The CLI subparsers are named 'vectordb' and 'kvcache', but generate_output_location writes the on-disk segment as 'vector_database' / 'kv_cache' (BENCHMARK_TYPES.name). capture_or_verify_code_image currently uses args.benchmark (the CLI name) for the type segment, which puts the captured code/ in a different tree than the runtime writes results into. Phase 2 verifier surfaced this divergence (see .planning/phases/02-cli-integration-validator-docs/VERIFICATION.md, Human Verification Item 2). For training/checkpointing the CLI name matches BENCHMARK_TYPES.name so no automated test caught this — the new tests exercise the only two benchmarks where the names diverge.

…md diagram capture_or_verify_code_image used args.benchmark (the CLI subparser name) for the per-type path segment under open/<orgname>/results/<sys>/. For training and checkpointing the CLI name happened to match BENCHMARK_TYPES.name so the divergence was invisible; for vectordb and kvcache the names diverge — 'vectordb'→'vector_database', 'kvcache'→'kv_cache' — which placed the captured code/ in a different on-disk tree than generate_output_location()'s results. Rules.md §2.1.27 diagram used a third name ('vdb_bench') for the same segment. - Add _CLI_BENCHMARK_TO_TYPE mapping from CLI subparser name to BENCHMARK_TYPES enum and emit the canonical .name in the OPEN image_parent path. Unknown CLI names now raise CodeImageError instead of silently building a divergent path. - Rename 'vdb_bench' → 'vector_database' in Rules.md §2.1.27 (2 occurrences, CLOSED and OPEN sub-diagrams) to match what generate_output_location() writes and what the helper now captures. Fixes the verifier finding in .planning/phases/02-cli-integration-validator-docs/VERIFICATION.md (Human Verification Item 2). All three names — runtime output, captured code, Rules.md diagram — now agree. Note: vectordb has no args.model attribute (vectordb_args.py does not add --model), so OPEN vectordb runs will still AttributeError on getattr(args, 'model') in this helper. That latent issue is a separate path-shape design question (vectordb output writes <type>/<command>/, not <type>/<model>/) and is intentionally out of scope for this naming fix.

vector_database and kv_cache write runtime output to <type>/<command>/<datetime>/ (no <model> segment) per generate_output_location. The captured code/ must mirror the runtime tree, so for these two types code/ lives at <type>/code/ rather than <type>/<model>/code/. Currently: - vectordb crashes the helper with AttributeError because vectordb has no --model CLI arg and the helper does an unguarded getattr(args, 'model'). - kvcache silently writes code/ to <type>/<model>/code/, a tree that no results live in — the submission validator and the runtime hash check cannot find it. - The validator's _iter_open_code_dirs blindly walks 3 levels (<sys>/<type>/<model>/code), so it descends into a stray code/ leaf for vector_database/kv_cache and reports 'code/code missing' instead of validating the real code/. Adds 4 failing tests: - test_open_vectordb_uses_canonical_type_name: helper writes <type>/code/ with no model segment, no AttributeError. - test_open_kvcache_uses_canonical_type_name: same as vectordb; args.model is ignored. - test_open_vector_database_code_dir_at_type_level: validator finds and validates code/ at <type>/ (2 levels), and emits the missing-code violation against the type level, not a phantom <type>/code/code path. - test_open_kv_cache_code_dir_at_type_level: same as vector_database. All four fail before the fix in the next commit.

…for vectordb/kvcache vector_database and kv_cache write runtime output to <type>/<command>/<datetime>/ (per generate_output_location — no <model> segment). The captured code/ must mirror that tree, so for these two types code/ lives at <type>/code/ rather than <type>/<model>/code/. Training and checkpointing keep <type>/<model>/code/ since their runtime output is keyed per model. This closes two correctness gaps surfaced after the canonical-type-name fix (866ff5a): - capture_or_verify_code_image previously called getattr(args, 'model') unconditionally. vectordb_args.py does not register a --model arg, so OPEN vectordb runs would AttributeError before any capture happened. For kvcache the call succeeded but wrote code/ into a tree where no results live — orphaning the code image from the runtime output. - _iter_open_code_dirs walked exactly three levels under results/ for every benchmark type, so the validator either missed code/ entirely for vectordb/kvcache or reported phantom code/code/ violations. Adds _TYPES_WITHOUT_MODEL_SEGMENT (helper) and _OPEN_TYPES_WITHOUT_MODEL (validator) — the two sets are kept inline rather than imported across the module boundary so the validator does not pull in the helper's runtime dependencies. They're both authoritative and must be kept in sync if generate_output_location grows another no-model benchmark type. Tests in commit eb46498 (vectordb/kvcache helper paths, vectordb/kv_cache validator walks) now pass; full code-image surface 246 tests green; broader suite still 34 failed / 656 passed (+4 new) — same pre-existing baseline.

…er-index Plan 02-04 inserted 'code # captured per-leaf' markers under every leaf of the §2.1.27 OPEN directoryDiagram, including AiSEQ / DiskANN / HNSW under vector_database. That placement matched the legacy diagram's shape (which showed an <index_type> level between vector_database/ and <datetime>/), but not the runtime: generate_output_location writes <type>/<command>/<datetime>/ for vector_database and kv_cache — no <model> or <index_type> segment — so the captured code/ lives directly under <type>/, shared across all index types and commands. - §2.1.27: move the per-leaf code/ marker from each AiSEQ/DiskANN/HNSW level up to a single 'code # captured per-type (no <model> in runtime output path)' marker under vector_database/ itself. The legacy <index_type> level is pre-existing and intentionally left alone — that diagram/runtime inconsistency predates this work and is out of scope. - §2.1.5.b: amend prose to describe both leaf shapes explicitly — <systemname>/<type>/<model>/ for training+checkpointing, <systemname>/<type>/ for vector_database+kv_cache — so a reader of the rules text sees the same shape that the diagram and the runtime + checker agree on (commit 390eb20). No code change. Documentation alignment only.

…dex_type> vector_database benchmark results split by index_type because AISAQ results are not comparable to DISKANN/HNSW — they must live in separate on-disk subdirectories. Rules.md §2.1.27's existing AiSEQ/DiskANN/HNSW level under vector_database is canonical; the runtime, the capture helper, and the submission validator must all honor it. Recent commits (390eb20 helper/validator, 8fba4c4 Rules.md) collapsed vector_database into a per-<type> shape on the mistaken assumption that the runtime's missing <index_type> was canonical. That regressed the per-leaf contract that distinguishes incomparable index implementations. Reverting those commits in the next change; this RED commit pins the corrected shape in tests first. - generate_output_location: vector_database must yield <type>/<index_type>/<command>/<datetime>/ on both CLOSED and OPEN. - capture_or_verify_code_image: OPEN vectordb captures at <type>/<index_type>/code/ (using args.index_type — vectordb has --index-type, not --model). - _iter_open_code_dirs: vector_database is back to the standard 3-level walk under results/, yielding <sys>/<type>/<index_type>/code/ — same shape as training/checkpointing. kv_cache stays at <sys>/<type>/code/ per user direction — the per-model structure for kv_cache will be revisited in a follow-up once we have more detail on its directory and file layout. The kv_cache test (test_open_kv_cache_code_dir_at_type_level) is unchanged. All 4 new/updated tests fail before the fix in the next commit.

…abase vector_database results split by index_type — AISAQ results are not comparable to DISKANN/HNSW, so they must live in separate subdirectories (Rules.md §2.1.27 is canonical on this; the runtime path was missing the segment). kv_cache stays at <type>/code/ for now, pending a follow-up plan on its directory/file structure. generate_output_location (vector_database case): before: <type>/<command>/<datetime>/ after: <type>/<index_type>/<command>/<datetime>/ capture_or_verify_code_image (OPEN, vector_database): before: <type>/code/ (per-type, wrong) after: <type>/<index_type>/code/ (per-leaf, correct) Replaces the hardcoded _TYPES_WITHOUT_MODEL_SEGMENT set with a data-driven _TYPE_TO_LEAF_ATTR map so each BENCHMARK_TYPE declares its leaf attribute (args.model for training/checkpointing, args.index_type for vector_database, None for kv_cache transitionally). Adding new benchmark types or transitioning kv_cache later is a one-line edit in the map. _iter_open_code_dirs (validator): before: yielded <sys>/<type>/code/ for both vector_database and kv_cache after: yielded <sys>/<type>/<index_type>/code/ for vector_database (via the standard 3-level walk used by training/checkpointing); kv_cache still yields <sys>/<type>/code/ via the same special case as before. This reverses the vector_database half of commit 390eb20, which had collapsed vector_database into per-<type> on the mistaken assumption that the runtime's missing <index_type> segment was canonical. The kv_cache half of 390eb20 is preserved verbatim — user direction is to keep kv_cache at <type>/code/ until a follow-up plan finalizes its structure. All 4 RED tests in commit b97c466 now pass; full code-image surface 247 tests green; broader suite still 34 failed / 657 passed — same pre-existing baseline, zero regressions.

…abase Reverses the §2.1.27 OPEN diagram change from 8fba4c4 (which collapsed vector_database into a single per-<type> code/ marker). The original Plan 02-04 placement was correct: code/ lives at each AiSEQ/DiskANN/HNSW leaf because results across index types are not comparable and must validate independently. The 8fba4c4 change was based on the mistaken assumption that the runtime (which was missing the <index_type> segment) was canonical; in fact the diagram is canonical and the runtime is being corrected to match (commit 96f0ed0). §2.1.27: per-leaf 'code # captured per-leaf' marker restored at each of AiSEQ, DiskANN, HNSW under vector_database/. The per-type marker at the vector_database level is removed. The diagram now matches the runtime (generate_output_location includes args.index_type) and the validator (_iter_open_code_dirs walks <sys>/<type>/<index_type>/code/). §2.1.5.b: prose rewritten as a typed bullet list since the three benchmark families now have three different leaf shapes: - training, checkpointing → <type>/<model>/ (per model) - vector_database → <type>/<index_type>/ (per index type — AISAQ vs DISKANN vs HNSW results not comparable) - kv_cache → <type>/ (transitional, pending finalization) Pre-existing diagram inconsistencies left in place: - AiSEQ vs config's AISAQ casing — the diagram's mixed-case index names (AiSEQ, DiskANN) do not match what argparse stores (uppercase DISKANN, HNSW, AISAQ from VDB_INDEX_TYPES). Submitters' on-disk directories will use the uppercase values; the diagram should eventually be re-cased to match. Out of scope for this fix. - kv_cache absent from §2.1.27 diagram entirely — to be addressed when the kv_cache directory structure is finalized.

…ype split The docstring still described vector_database alongside kv_cache as the 2-level walk, but vector_database moved back to the standard 3-level <type>/<index_type>/ walk in 96f0ed0. Comment-only cleanup flagged by the phase-2 re-verifier as an info-level item — no behavior change. Docstring now enumerates all three leaf shapes (training/checkpointing per <model>, vector_database per <index_type>, kv_cache transitional) and the inline comment beside the kv_cache branch names only kv_cache.

…r pre-existing failures - Bump pyproject.toml [project].version from origin/main + 1 patch (target derived at execution time per phase 3 D-08 — no hardcoded literal; origin/main was at 3.0.13, target 3.0.14). - Regenerate uv.lock against the bumped project via bare `uv lock` (D-01: no transitive bumps; only the project's own version line changed). - Fix 7 stale-signature failures in test_issue_376_file_arg_conflict.py by passing the second positional argument that the production CLI arg-adders now require (D-03: test-side fix, production signatures unchanged). Test assertions adapted to the three-mode-positional CLI redesign — the storage-type moved from option `--file`/`--object` to a single positional `data_access_protocol` with `choices=['file', 'object']`; the duplicate-registration invariant under test is preserved. Closes REL-01, REL-02.

The gsd-code-fixer agent now places its isolated worktree under ${repo_root}/.gsd-tmp/ instead of /tmp/ so sandboxed harnesses don't render the path as a ../../../../tmp/ traversal string in permission prompts. Add the directory to .gitignore so nested-worktree contents don't show up in the outer working tree's git status.

verify_image_self_consistent raised MissingHashFile when compute_code_tree_md5 returned None — but load_code_image had already succeeded, so the JSON sidecar IS present. The real failure is that the tree itself didn't hash (permission error mid-walk, race where the dir vanished, etc.). MissingHashFile's docstring is explicit: '.code-hash.json not found in an image directory (D-14)'. Introduce CodeTreeUnreadable(CodeImageError) and raise that instead. Existing call-site catches in helpers.py and submission_structure_checks.py still cover this via the (MalformedHashFile, CodeImageError) arm (CodeTreeUnreadable is-a CodeImageError), so behavior is unchanged for downstream — only the surface log message names the right root cause.

verify_source_against_image raised SourceRootNotFound when compute_code_tree_md5(source_root) returned None — but SourceRootNotFound is documented (D-05) as 'find_source_root walked to filesystem root without finding pyproject.toml,' i.e., a structural CLI/config error, not a runtime read error. A None return from compute_code_tree_md5 means the walk itself failed (permission error, dir vanished mid-walk, etc.) — raise CodeTreeUnreadable to match the new IN-01 semantics. Catch sites already cover this via CodeImageError, so no downstream changes.

The comment at the image_parent computation said 'creating the results-directory itself is reserved for the future mlpstorage init command,' but the implementation called image_parent.mkdir(parents=True, exist_ok=True) which silently created results_dir if absent. Add a results_dir.exists() check that raises ConfigurationError with code=CONFIG_INVALID_VALUE before the mkdir runs. Existing tests pass tmp_path (always exists), so no test regressions.

shutil.copytree(..., dirs_exist_ok=True) creates target_dir on its own since Python 3.8 — the explicit mkdir(parents=True, exist_ok=True) above it was a no-op. Removing it shrinks the window in which target_dir can be in a partial state when copytree begins.

…ters The 'reference checksum not configured' warning fired any time closed/ existed as a directory — including when it was empty, or contained only dotfiles like .DS_Store / .gitkeep. In those cases no code/ checks were actually skipped, so the warning was pure noise. Add an any() probe for a non-dot, isdir() entry under closed/ and gate the warn_violation on it. Empty closed/ and dotfile-only closed/ now silently pass.

The docstring's 'Known limitation' invoked TODO-001 (a planned machine-readable df contract) but the parse site itself had no breadcrumb pointing forward — a future maintainer reading line 124 in isolation had no marker to grep on. Add a TODO(TODO-001) comment at the parse site. Also note the planned bigger migration: capturing stat -f -c '%i' "$data_dir" per node at runtime and storing the scalar FS identity, which supersedes both this parser limitation and WR-06's silent-pass fragility.

The audit tool's locked rule-ID regex was [234] — Rules.md §5 (VdbCheck, added by Plan 04-03) was never enumerated, so even though discover_rules(VdbCheck) found all 16 @rule methods, rules-coverage reported zero §5 entries. Extend the character class to [23456] so §5 is now enumerated and §6 is pre-wired for the kvcache rules landing soon. Updates the matching docstrings, the locked-regex doc literal, the CLI description, and the test row-count floor (>=50 -> >=66 since Phase 4 added 16 §5 IDs).

The post-parse block at cli_parser.py:325 dereferences args.hosts with len() but only guards on hasattr(args, 'hosts'). argparse leaves the attribute as None when --hosts is not supplied, so any single-node invocation (e.g. `mlpstorage open vectordb datagen file ...`) crashed with "object of type 'NoneType' has no len()" before reaching the benchmark dispatch. Add the missing `is not None` clause to match the normalization guard 20 lines above which already handles the None case correctly.

… env main()'s generic-Exception handler gated the traceback emission on the module-level MLPS_DEBUG constant, which is only set from the MLPS_DEBUG environment variable at import time. The CLI's --debug flag was a no-op for this path — the user saw ERROR: Unexpected error: ... Run with --debug for full stack trace even when --debug WAS supplied, forcing them to re-run with `MLPS_DEBUG=1 mlpstorage ...` to get a trace. args is not in scope of main() (it's created inside _main_impl), so check sys.argv directly for the bare '--debug' token. argparse declares --debug as store_true, so the bare-token check covers all valid invocations.

Benchmark.generate_output_location() called the underlying rules.utils.generate_output_location() without orgname= or systemname= kwargs. For args.mode in {closed, open} the underlying function raises ConfigurationError demanding the dispatch layer thread the validated env-var values through. capture_or_verify_code_image already stashes the validated values on args._validated_orgname / args._validated_systemname (see code_image.py:625-627 comment) specifically so downstream callers can read them without re-reading env. This commit wires that contract through. getattr(..., None) keeps legacy/whatif modes working (the underlying function's mode check skips the kwarg requirement for those).

The orgname / systemname ConfigurationError suggestions used: f"... read {VAR}={VAR!r} from the environment ..." which rendered as: "... read MLPSTORAGE_ORGNAME='MLPSTORAGE_ORGNAME' from the environment ..." The second slot was meant for a <value> placeholder but somebody filled it with the variable name itself — reads as if the value IS the var name string. Rewrite as 'read the <VAR> environment variable and thread the validated value through ...' which says the same thing without the bogus key=value.

A live OPEN vectordb datagen captured the running source tree into code/, and the captured tree included .planning/, .gsd-tmp/, .idea/, .claude/, etc. — local developer / tooling artifacts that the project's .gitignore already excludes from version control, have nothing to do with the benchmark source contract, and (.gsd-tmp/ in particular) can contain transient agent state. Extend MD5_EXCLUDE_PREFIXES with the common dot-prefixed local-artifact dirs: .idea/ .vscode/ .claude/ .agent/ .agents/ .roo/ .planning/ .gsd-tmp/ .git/, .pytest_cache/, .venv/, .tox/ were already in the list. .github/, .env.example, .gitignore, .python-version are intentionally kept — they ARE part of the project contract.

summary.csv at repo root is the default output path for the submission checker (mlpstorage_py/cli/utility_args.py:123, mlpstorage_py/submission_checker/main.py:82, README.md). Tracking it meant every `mlpstorage validate ...` run (and the recent UAT runs) produced a dirty working tree. Remove the tracked file and add summary.csv to .gitignore so the runtime artifact stays local to whoever ran the tool. If a future schema-template-style summary.csv is needed for the repo (e.g. sample/header-only), it should land at a non-default path like docs/summary-template.csv.

CLAUDE.md was included in the PR diff with no substantive content changes — only blank lines added around bash code blocks by a CommonMark formatter. Restore to origin/main to keep the PR diff focused on actual source changes. (CLAUDE.md is already in .gitignore but tracked from an earlier commit, so further local edits won't show up in future PRs from this branch.)

FileSystemGuy · 2026-06-19T04:10:55Z

2026-06-18 update — Phase 4 review-fix + UAT integration polish

23 new commits pushed today. The PR description (above) was written before Phase 4 landed; this comment summarizes what's new since then. Reviewers may want to focus diff review here.

Phase 4 deliverables (now in this branch)

VdbCheck §5 (full 16 rules) wired up: 5.1.1, 5.1.2, 5.2.1, 5.2.2, 5.3.1–5.3.4, 5.4.1, 5.4.2, 5.5.1, 5.6.1–5.6.5. STUB_COVERAGE['VdbCheck'] removed.
On-disk shape rename for OPEN vectordb (supersedes the §"On-disk shape" line in this PR's original description):
- Was: .../vector_database/<index_type>/code/ (e.g. DISKANN)
- Now: .../vdb_bench/<DisplayIndex>/<command>/<datetime>/ with per-leaf code/ at the <DisplayIndex>/ level (e.g. vdb_bench/DiskANN/datagen/20260618_204647/)
- INDEX_TYPE_TOKEN_TO_DIR / INDEX_TYPE_DIR_TO_TOKEN dual vocabulary lives in mlpstorage_py/config.py
Layered code-image helper _check_code_image_layered in submission_checker/checks/helpers.py, consumed by both §3.6.1 (TrainingCheck) and §5.6.1 (VdbCheck) — CD-04 deduplication via caller-supplied rule_id/rule_name.
Rules.md §2.1.5.b leaf description, §2.1.27 CLOSED + OPEN vdb subtrees (now with <command>/<datetime>/ layering, code # captured per-leaf under all three OPEN index dirs including AiSAQ), §5.3.1 prose narrowed, §5.6.1 layered-model pointer.
46 new tests across test_vdb_checks.py (41) + test_training_check_3_6_1.py (5).

Phase 4 code-review fixes (13 findings)

7 Warnings + 6 Info from a standard-depth review, all atomic commits:

ID	Commit	Summary
WR-01	`625a11c`	Skip dot-prefixed entries in `_iter_open_code_dirs` (prevented spurious `[2.1.6]` violations on `.cache/`, `.DS_Store`, merged-tree `.git/`)
WR-02	`e35e83d`	`parent.get(k) or {}` in chained YAML reads (training_checks, vdb_checks, submission_structure_checks) — fixes `AttributeError` on null intermediates
WR-03	`ecac710`	Atomic capture: rollback `code.tmp/` on hash/JSON/rename failure
WR-04	`b306303`	Typed `CodeImageError` for missing `args.benchmark` / `args.<leaf_attr>` (replaces bare `AttributeError`)
WR-05	`9515078`	Tighten 5.5.1 S3 match: exact-name set + `s3-` prefix (rejects `non-s3-storage`)
WR-06	`1206165`	`log.warning` when df-mount match is indeterminate (was silent-pass)
WR-07	`5c58aa0` + `054eb7f`	Short-circuit upstream-identity walk on `MissingHashFile` — both in helpers.py and STRUCT-06's inline copy
IN-01/02	`709c0ac`, `cc05dc2`	New `CodeTreeUnreadable(CodeImageError)` exception for tree-walk failures (replaces misnamed `MissingHashFile` / `SourceRootNotFound` raises)
IN-03	`f9af70e`	Enforce documented `results_dir.exists()` invariant — raise `ConfigurationError` (E102) with `mkdir -p` suggestion
IN-04	`2bdc7eb`	Drop redundant `target_dir.mkdir` before `copytree(..., dirs_exist_ok=True)`
IN-05	`cd55ea4`	Gate "reference checksum not configured" warning on actual `closed/` submitter presence
IN-06	`4e34ddd`	`TODO(TODO-001)` marker at df parse site + note about planned `stat -f -c '%i'` migration

UAT-discovered integration bugs (6 fixes)

Live /gsd-verify-work UAT exercised mlpstorage open vectordb datagen file … end-to-end. None of these were in the original Phase 4 plan scope, but all were blocking live verification of Phase 4 deliverables:

Commit	Bug
`8edc84f`	`rules_coverage.py` regex `[234]` → `[23456]` — Phase 4 wired 16 `@rule` methods for VdbCheck but the audit-CLI regex never enumerated §5 IDs from Rules.md, so the report dropped them silently. `[23456]` also pre-wires §6 for the kvcache rules landing soon.
`b98472f`	`cli_parser.py:325`: `len(args.hosts)` crashed when `--hosts` was not supplied (argparse leaves `None`). Single-node `mlpstorage open vectordb datagen ...` was completely broken.
`8a3dabd`	`main.py:450`: `--debug` CLI flag never wired to the unhandled-exception traceback gate; only the `MLPS_DEBUG` env var path was checked.
`dc29798`	`benchmarks/base.py:804`: `generate_output_location()` was called without `orgname=`/`systemname=` — for OPEN/CLOSED modes this raised the documented `ConfigurationError`. Now reads `args._validated_orgname` / `_validated_systemname` stashed by `capture_or_verify_code_image`.
`beb9bbb`	`rules/utils.py`: suggestion-string format slip — `"read MLPSTORAGE_ORGNAME='MLPSTORAGE_ORGNAME' from the environment"` (the value placeholder was filled with the var name). Rewritten.
`17204da`	`constants.py`: `MD5_EXCLUDE_PREFIXES` extended with `.idea/`, `.vscode/`, `.claude/`, `.agent/`, `.agents/`, `.roo/`, `.planning/`, `.gsd-tmp/` — local-artifact dirs were being captured into the code image.

PR-prep cleanup

cc2b660: Untrack summary.csv, add to .gitignore. It's the default output path for the submission checker (utility_args.py:123, README) — tracking it produced dirty working trees after every mlpstorage validate run.
e15fe7a: Revert whitespace-only CLAUDE.md changes (CommonMark formatter noise; substantive content unchanged).

UAT verdict

6/6 tests passed live:

CLI smoke (mlpstorage validate --help)
rules-coverage shows full VdbCheck §5 (after 8edc84f)
OPEN vectordb output path: vdb_bench/DiskANN/datagen/<datetime>/
OPEN vectordb captures code/ at <DisplayIndex>/ level
Rules.md §2.1.27 OPEN graphic lists code/ under all 3 index dirs
IN-03 ConfigurationError fires on missing --results-dir

Test plan

uv run pytest mlpstorage_py/tests/ -q → 740 passed / 5 failed / 1 xfailed.

The 5 failures are pre-existing and unrelated to this branch's work — verified by stashing and re-running:

4 × test_open_closed_flag_recognition.py::TestOpenClosedCLIFlags — stale add_universal_arguments(parser) calls missing the required req_results positional
1 × test_benchmarks.py::TestVectorDBBenchmark::test_constructor_accepts_kwargs — separate constructor signature drift

Both look like CLI-parser-side test rot, not regressions in this PR. Recommended for a separate cleanup.

Caveats / acknowledged

/gsd-secure-phase 4 did NOT run. workflow.security_enforcement: true in .planning/config.json says it should before phase advancement. No new attack surface introduced in this PR (the new code paths are filesystem reads/writes that already existed), so impact is bounded — happy to run on request before merge.
Version invariant: pyproject.toml is at 3.0.14; origin/main is at 3.0.13. +1 patch as required.
mlpstorage_py/cli_parser.py, mlpstorage_py/main.py, mlpstorage_py/benchmarks/base.py were NOT in the Phase 4 plan but were touched today because UAT discovered the bugs above. All commits are atomic and individually justified; revert-blast-radius is per-commit.

Memory of pending follow-up work (not in this PR)

df → stat -f -c '%i' migration: per-node FS-identity capture via stat -f -c '%i' "$data_dir" superseding df-output parsing for §3.4.2 / §5.4.2. Out of scope here; tracked.

FileSystemGuy · 2026-06-19T04:12:49Z

The second big chunk of commits here is really a separate topic: bringing in all the new rules for VDB, but it had to tweak and/or fix so much of the infrastructure that the "original" PR had built that there wasn't a good way to separate them.

The call site in benchmarks/base.py now threads orgname= and systemname= through to generate_output_location(). Update the unit test's assert_called_once_with to include the new keyword arguments (both None in this test, since the args Namespace doesn't set _validated_orgname / _validated_systemname).

FileSystemGuy · 2026-06-19T22:01:19Z

@idevasena Here's the summary of the differences:

Conflict summary

Three files conflict between PR #473 (FileSystemGuy-code-image-capture-v1) and main:

mlpstorage_py/cli_parser.py — trivial (resolvable mechanically)

Both branches independently fixed the len(None) crash when --hosts is unset. Main's fix folds the assignment into
the existing if hasattr(args, 'hosts') and args.hosts is not None: block at line 319; PR #473 added a parallel
guarded block. Functionally identical. Resolution: take main's nested form.

2 & 3. mlpstorage_py/rules/utils.py and tests/unit/test_accumulation.py — substantive design dispute

These conflict on competing VectorDB result-tree layouts that were developed in parallel and have non-trivial
downstream commitments on both sides.

PR #473's design (cites Rules.md §5.3.1, Phase 4 D-02/D-03):

Path: /vdb_bench////
On-disk segment is vdb_bench (not vector_database); index uses a display-name mapping (DiskANN, HNSW, AiSAQ) via
INDEX_TYPE_TOKEN_TO_DIR in config.py.
No engine segment.
Submission checker (mlpstorage_py/tests/test_vdb_checks.py:226) explicitly gates all 16 §5 rule methods on mode
== "vdb_bench" — switching the segment name silently disables every §5 check.
Code-image capture tests (test_capture_or_verify_code_image.py:246) also hard-code vdb_bench / DiskANN / code.

main's design (PR #476, just merged):

Path: /vector_database//<vdb_index>///
Segment stays vector_database; index is uppercase token (DISKANN).
Adds new --vdb-index flag with full resolver in vectordb_args.py (line 547): defaults, fallback to --index-type,
cross-validation, conflict detection.
Includes engine segment so multiple VDB engines coexist in one results tree.

These can't both be true at once. Possible resolutions, none of which I should pick unilaterally:

Take PR Code-Image Capture & Validation (milestone v1.0) #473's design: revert main's --vdb-index/engine-segment work and update the just-merged tests on this
branch.
Take main's design: drop the vdb_bench rename and the display-name mapping; loosen the §5 rules-checker gate so
it accepts vector_database; update Phase 4 D-02/D-03 documentation.
Hybrid: e.g. /vdb_bench///// — keep the §5 gate name and
display mapping, adopt main's engine segment and --vdb-index resolver. This is the most defensible technically,
but it's a new design that should go through review, not a merge resolution.

@idevasena I assume that you'd prefer the results pathname structure defined in your PR#476, is that correct? If so, then I'll ask my agent to rework this PR to use that structure. Please just confirm that...

idevasena · 2026-06-19T22:41:15Z

@FileSystemGuy yes correct, I prefer the results pathname structure defined in PR#476. Thank you!

FileSystemGuy · 2026-06-19T23:09:27Z

@idevasena

Below is what I propose to do for the VDB workload (Claude prompts):

Please examine the PR#473 and PR#476. We need to change the pathname structure that we've been using in PR#473 for the results-directory for VDB to be the pathname structure used in PR#476 for the results-directory for VDB.
VDB doesn’t appear to print its arguments in the arg summary. Here's a minimum list:
2026-06-19 14:34:09|RESULT: Vectors: 1,000,000 x dim=1536 x 4B
2026-06-19 14:34:09|RESULT: Raw data: 6.14 GB
2026-06-19 14:34:09|RESULT: Index type: DISKANN (130% overhead)
2026-06-19 14:34:09|RESULT: Shards: 1
2026-06-19 14:34:09|RESULT: Estimated total: 7.99 GB
We should record the datasize in a metadata file in the results-directory so we can ensure VDB runs have enough data. We will need to do the same for the other workloads, so it might best to do them all in a separate, later, PR from the one we'd be building for this effort.
VDB is still reporting that it is OPEN-only:
2026-06-19 14:34:09|STATUS: Open: [OPEN] VectorDB benchmark is in preview status - not accepted for closed submissions (Parameter: benchmark_status)
2026-06-19 14:34:09|STATUS: Benchmark run qualifies for OPEN category ([RunID(program='vector_database', command='datasize', model='milvus_DISKANN', run_datetime='20260619_143409')])
VDB doesn’t know whether it is open or what:
2026-06-19 14:34:09|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use closed or open as the first positional argument to specify a configuration.
VDB should require the results-directory to be specified, there should be no "default" results-directory.
2026-06-19 14:34:09|WARNING: Results directory not specified. Writing results to the system temp directory: /tmp/mlperf_storage_results. These results will NOT persist across a reboot. Use --results-dir or set the MLPERF_RESULTS_DIR environment variable to save results permanently.
2026-06-19 14:34:09|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/vector_database/milvus/DISKANN/datasize/20260619_143409/vector_database_20260619_143409_metadata.json

idevasena · 2026-06-20T14:44:47Z

@FileSystemGuy The proposed changes look accurate to me. Meanwhile, I will work on #452 that needed updates to README and closed param values.

…age-capture-v1 # Conflicts: # mlpstorage_py/cli_parser.py # mlpstorage_py/rules/utils.py # tests/unit/test_accumulation.py

The merge with origin/main brought in PR #476's runtime path layout (vector_database/<engine>/<DisplayIndex>/...) but the rest of the code-image-capture PR was built around the older 'vdb_bench' segment. This makes the conversion consistent across the codebase: - code_image._TYPE_TO_ONDISK_SEGMENT: BENCHMARK_TYPES.name for every type (was hardcoded 'vdb_bench' for vector_database) - submission_structure_checks: leaf-walker now expects results/<sys>/vector_database/<DisplayIndex>/code/ - vdb_checks: all 18 `self.mode != "vdb_bench"` rule guards now compare against "vector_database" (loader yields the directory name as mode) - Rules.md S2.1.5.b, S2.1.27 diagrams, S5.3.1, S5.6.3 updated to refer to vector_database - All test fixtures and assertions across mlpstorage_py/tests/ updated - test_generate_output_location: vector_database tests now thread vdb_engine and expect the engine/index layered path

…split End users found the dual-vocabulary (UPPERCASE token vs mixed-case display spelling) confusing. Collapse to a single UPPERCASE convention everywhere index names appear: the CLI (--index-type), summary.json.index_type, the on-disk directory name, and the path-generation/validation code. - Remove INDEX_TYPE_TOKEN_TO_DIR and INDEX_TYPE_DIR_TO_TOKEN constants from mlpstorage_py/config.py - generate_output_location: drop the token->display mapping; write args.index_type / args.vdb_index directly under <engine>/ - code_image.capture_or_verify_code_image: drop the vector-database special-case mapping; leaf segment is just args.index_type - VdbCheck.vdb_closed_index_types (Rules 5.6.3): compare dir name directly against UPPERCASE token; no DIR_TO_TOKEN lookup - Rules.md: drop the S5.6 dual-representation callout; S2.1.5.b, S2.1.27 diagrams, S5.3.1 prose all refer to the UPPERCASE token - All affected tests updated to expect UPPERCASE (DISKANN/HNSW/AISAQ) in paths; _build_vdb_leaf fixture collapses display+token args into a single index_type arg

The CLI gates --vdb-index / --vdb-engine / --model / --command behind argparse choices= and a redundant post-parse allowlist check in validate_vectordb_arguments. But generate_output_location itself did no validation — a programmatic caller (test fixture, future internal API, history-replay path that someday skips revalidation) feeding '../etc' or '/abs' as orgname / systemname / vdb_index / model / etc. would land in a path-traversal directory via os.path.join. Close the gap at the trust boundary with one helper: _SAFE_PATH_COMPONENT_RE = re.compile(r"^[A-Za-z0-9._-]+$") def _check_safe_path_component(name, value): # reject '.' and '..' explicitly; reject anything not matching # POSIX-safe single-segment charset. Applied to every os.path.join arg in generate_output_location that isn't either user-supplied path root (results_dir) or internally constant (BENCHMARK_TYPES.name, "results" literal): orgname, systemname, model, vdb_engine, vdb_index, command, datetime_str. Adds 21 parametrized negative tests in test_generate_output_location.py locking in the invariant for each field.

Aligns with the established convention of bumping the patch version one above origin/main when a PR adds new behavior. uv.lock regenerated via bare `uv lock` (no transitive drift). The format upgrade to revision=3 (adds upload-time metadata to each package) is a side effect of the local uv version writing the newer lockfile format; no resolved versions changed.

FileSystemGuy · 2026-06-20T20:40:21Z

@idevasena I think I've gotten my (big) code change layered over the top of your PR#476 without changing any of the behavior you intended to get from that PR. Please review...

FileSystemGuy added 30 commits June 16, 2026 14:38

FileSystemGuy added 15 commits June 18, 2026 19:13

FileSystemGuy and others added 3 commits June 19, 2026 13:58

Merge branch 'main' into FileSystemGuy-code-image-capture-v1

da6752f

Merge branch 'main' into FileSystemGuy-code-image-capture-v1

6de8ef2

FileSystemGuy added 5 commits June 20, 2026 12:44

Merge remote-tracking branch 'origin/main' into FileSystemGuy-code-im…

325c2f0

…age-capture-v1 # Conflicts: # mlpstorage_py/cli_parser.py # mlpstorage_py/rules/utils.py # tests/unit/test_accumulation.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Code-Image Capture & Validation (milestone v1.0)#473

Code-Image Capture & Validation (milestone v1.0)#473
FileSystemGuy wants to merge 75 commits into
mainfrom
FileSystemGuy-code-image-capture-v1

FileSystemGuy commented Jun 18, 2026

Uh oh!

FileSystemGuy commented Jun 19, 2026

Uh oh!

FileSystemGuy commented Jun 19, 2026

Uh oh!

FileSystemGuy commented Jun 19, 2026

Uh oh!

idevasena commented Jun 19, 2026

Uh oh!

FileSystemGuy commented Jun 19, 2026

Uh oh!

idevasena commented Jun 20, 2026

Uh oh!

FileSystemGuy commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

FileSystemGuy commented Jun 18, 2026

Summary

Behavior contracts

What's in this PR

Testing

Branch state

Known tech debt (deferred — not blockers for this PR)

Uh oh!

FileSystemGuy commented Jun 19, 2026

2026-06-18 update — Phase 4 review-fix + UAT integration polish

Phase 4 deliverables (now in this branch)

Phase 4 code-review fixes (13 findings)

UAT-discovered integration bugs (6 fixes)

PR-prep cleanup

UAT verdict

Test plan

Caveats / acknowledged

Memory of pending follow-up work (not in this PR)

Uh oh!

FileSystemGuy commented Jun 19, 2026

Uh oh!

FileSystemGuy commented Jun 19, 2026

Uh oh!

idevasena commented Jun 19, 2026

Uh oh!

FileSystemGuy commented Jun 19, 2026

Uh oh!

idevasena commented Jun 20, 2026

Uh oh!

FileSystemGuy commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants