Skip to content

Code-Image Capture & Validation (milestone v1.0)#473

Open
FileSystemGuy wants to merge 75 commits into
mainfrom
FileSystemGuy-code-image-capture-v1
Open

Code-Image Capture & Validation (milestone v1.0)#473
FileSystemGuy wants to merge 75 commits into
mainfrom
FileSystemGuy-code-image-capture-v1

Conversation

@FileSystemGuy

Copy link
Copy Markdown
Contributor

Summary

When a submission is validated, this PR makes it possible to prove that every result was generated by exactly the source tree captured in .../code/, and that a CLOSED submission used the unmodified upstream codebase. The CLI captures a frozen "code image" of the benchmark source tree the first time a closed or open submission category runs datasize, datagen, or run, then validates on every subsequent invocation that the running code matches the captured image. The submission-time validator is extended to require the code image and verify its hash.

Builds on the submission-checker infrastructure landed in #432 (compute_code_tree_md5, MD5_EXCLUDE_* constants, SubmissionStructureCheck).

Behavior contracts

On-disk shape

  • CLOSED: {resultsdir}/closed/<org>/code/ + .code-hash.json
  • OPEN training/checkpointing: {resultsdir}/open/<org>/results/<sys>/<type>/<model>/code/ + .code-hash.json (per-leaf)
  • OPEN vector_database: {resultsdir}/open/<org>/results/<sys>/vector_database/<index_type>/code/ (per-index_type — results across index types aren't comparable)
  • OPEN kv_cache: {resultsdir}/open/<org>/results/<sys>/kv_cache/code/ (transitional per-type — documented in Rules.md §2.1.5.b)
  • .code-hash.json schema: {hash, algorithm, captured_at, mlpstorage_version, git_sha} (algorithm md5-tree-v1)

Runtime contracts (literal error strings)

  • CLOSED mismatch: changes to the codebase are not allowed in a CLOSED run
  • OPEN mismatch: all runs of this type must use the same codebase
  • Match (both): code unchanged from on-file image at {pathname}
  • whatif / validate / reportgen / history utilities do not capture or validate.

What's in this PR

Hash & Capture Foundation

  • submission_checker/constants.py: MD5_EXCLUDE_PREFIXES extended with test/, tests/; MD5_EXCLUDE_FILENAMES extended with .code-hash.json.
  • submission_checker/tools/code_image.py (new module): capture_code_image, load_code_image, verify_source_against_image, verify_image_self_consistent, find_source_root. Frozen CodeImage dataclass + typed exception hierarchy (CodeImageError, MissingHashFile, MalformedHashFile, SourceRootNotFound). Atomic tmp-rename publish pattern.
  • 39-test pytest suite covering capture/load/verify primitives, schema invariants, exclusion behavior, and the unknown-version guard.
  • compute_code_tree_md5 walker fix: aligned with shutil.copytree+ignore semantics for nested excluded dirs so src_hash == cap_hash on real repos.
  • Capture-entry guard against degenerate mlpstorage_version="unknown": if version resolution falls through to the last-resort sentinel (no installed dist metadata AND no readable pyproject.toml), capture_code_image raises ConfigurationError before any filesystem work — preventing a submission from being captured with an unrecoverable forensic gap.

CLI Integration, Validator & Docs

  • mlpstorage_py/main.py: dispatches capture_or_verify_code_image(args, os.environ, logger) before run_benchmark(); early-returns for validate / reports / lockfile / rules-coverage / history.
  • code_image.py capture_or_verify_code_image helper: mode-gated on {closed, open} × {datasize, datagen, run}, emits EXIT_CODE.CODE_IMAGE_ERROR on failure.
  • submission_structure_checks.py: code_directory_contents_check refactored to walk via _iter_submitter_dirs() (CLOSED leaves) + _iter_open_code_dirs() (OPEN per-leaf — yields 3 training, 4 checkpointing, 3 vector_database per-index_type, 1 kv_cache transitional). Separate violations for missing-code/ vs hash-mismatch.
  • Rules.md updates: §2.1.5 split into §2.1.5.a (CLOSED) + §2.1.5.b (OPEN — per-leaf table with explicit kv_cache transitional callout); §2.1.6 rewritten with .code-hash.json schema + literal error strings + layered REFERENCE_CHECKSUMS prose; §2.1.27 OPEN-subtree diagram aligned (10 leaves: 3 training + 4 checkpointing + 3 vdb_bench engines); §3.6.1 documents the layered validator model (self-consistency + upstream-identity).
  • Vectordb / kvcache canonical type-segment alignment: CLI vectordb / kvcache canonicalize to BENCHMARK_TYPES.name vector_database / kv_cache via _CLI_BENCHMARK_TO_TYPE. Runtime helper, generate_output_location, and Rules.md now agree on the on-disk segment.
  • Per-type OPEN path-shape fixes: vector_database is per-index_type, kv_cache is transitional per-type; training/checkpointing remain per-model.

Release artifacts

  • pyproject.toml [project].version bumped from 3.0.133.0.14 (origin/main + 1 patch, derived at execution time).
  • uv.lock regenerated via bare uv lock (no transitive drift).
  • 7 stale-signature failures in test_issue_376_file_arg_conflict.py repaired (test-side only; production CLI signatures untouched).
  • 27 pre-existing pyarrow ModuleNotFoundError failures cleared via env install.

Testing

uv run python -m pytest mlpstorage_py/tests/699 passed / 4 failed / 1 xfailed.

The 4 failures are pre-existing add_universal_arguments signature drift in test_open_closed_flag_recognition.py::TestOpenClosedCLIFlags, documented in deferred items and unrelated to this PR. The same baseline held before the merge with origin/main.

New coverage in this PR (selected):

  • First-time capture (CLOSED + OPEN per-leaf), match path, mismatch path, exclusions
  • .code-hash.json schema fixtures
  • Submission-validator missing-code/ + hash-mismatch (CLOSED + OPEN)
  • whatif / validate / reportgen / history / lockfile / version / rules-coverage no-touch
  • CLI helper round-trip + path-traversal guards
  • Per-type OPEN path-shape (training/checkpointing/vector_database/kv_cache)
  • Capture-entry guard against mlpstorage_version="unknown" with no-FS-side-effect assertion

Branch state

Rebased onto origin/main on 2026-06-18 (resolved 8 file conflicts: Rules.md, submission_structure_checks.py, main.py, rules/utils.py, config.py, constants.py, pyproject.toml, uv.lock). The version literal was re-derived from the updated origin/main (still 3.0.13, so 3.0.14 remains correct).

Post-merge, one polish commit was added: the capture-entry mlpstorage_version="unknown" guard described above.

Known tech debt (deferred — not blockers for this PR)

  • DOC-02 partial: kv_cache transitional leaf is described in Rules.md §2.1.5.b prose but not depicted in the §2.1.27 OPEN diagram. Code on both sides handles it correctly; ≤5-line Rules.md edit to close.
  • No SECURITY.md threat-model artifacts for these phases.
  • 4 pre-existing test_open_closed_flag_recognition.py failures (add_universal_arguments signature drift documented in deferred items).

…-hash.json

Adds `test/` and `tests/` to MD5_EXCLUDE_PREFIXES (HASH-02, D-21) so that
captured trees produced by the upcoming code_image module do not include
the fixture/test directories that would otherwise perturb the digest.
Also adds `.code-hash.json` to MD5_EXCLUDE_FILENAMES so the captured tree's
hash does not include the file that records the hash itself — a
prerequisite for Plan 01-02's verify_image_self_consistent invariant.

Syncs both locked-set assertions in test_config_reference_checksum.py
(`test_md5_exclude_prefixes_*` and `test_md5_exclude_filenames_membership`)
so the existing tuple-equality lockset continues to pass against the
extended constants.

Note: the .code-hash.json filename exclusion is a minor scope expansion
beyond plan 01-01's stated D-22 boundary; folding it in here keeps the
constants change atomic and avoids a follow-up edit during plan 01-02.
…ives

New module `mlpstorage_py/submission_checker/tools/code_image.py` provides
the five public callables consumed by Phase 2's CLI and submission-checker
integration:

  - find_source_root(start=None) — walks parents to pyproject.toml (D-04, D-05)
  - capture_code_image(source_root, target_dir, log) — atomic capture via
    code.tmp/ → os.rename → code/, writes .code-hash.json (CAP-03..05, D-17, D-18)
  - load_code_image(image_dir, log) — parses .code-hash.json into a frozen
    CodeImage dataclass (D-02, D-07)
  - verify_source_against_image — runtime check, hashes live source (D-11)
  - verify_image_self_consistent — submission check, hashes captured tree (D-12)

Exception hierarchy CodeImageError → {MissingHashFile, MalformedHashFile,
SourceRootNotFound} lets Phase 2's CLI map exception type to exit code
without parsing message strings (D-03, REV-02).

`.code-hash.json` schema: hash, algorithm ("md5-tree-v1"), captured_at
(ISO-8601 UTC Z), mlpstorage_version, git_sha (40-char or null on
best-effort failure) — D-07..D-10.

Hashing delegates to compute_code_tree_md5 — never reimplemented (HASH-01).
Capture walk uses os.walk + shutil.copy2 (no shutil.copytree, per REV-01)
and applies the same MD5_EXCLUDE_PREFIXES / MD5_EXCLUDE_FILENAMES used by
the hash predicate.

Known deviation from plan: this implementation duplicates the exclusion
predicate inline instead of introducing the _should_exclude_dir /
_should_exclude_file helpers in code_checksum.py that the plan called out
under REV-01 (single source of truth for exclusion). The constants are
still the single source of truth; only the predicate is duplicated.
Refactoring code_checksum.py to host shared helpers is left as a follow-up
so this commit stays additive and ships unchanged behavior for the existing
checksum CLI.
New `mlpstorage_py/tests/test_code_image.py` exercises every public symbol
of the code_image module:

  - find_source_root happy path + SourceRootNotFound at filesystem root
    (D-04, D-05)
  - capture_code_image schema, field types, atomic code.tmp → code/ rename,
    cleanup of pre-existing code.tmp/ (D-07, D-17, D-18, CAP-03..05) and
    refusal to silently re-capture when code/ already exists (D-16)
  - exclusion behavior — fixture tree includes test/ and tests/
    subdirectories; tests assert they are absent from the captured tree
    (HASH-02, the end-to-end witness for plan 01-01's constants change)
  - load_code_image happy path + MissingHashFile + MalformedHashFile for
    every documented failure mode (unparseable JSON, missing key, unknown
    algorithm, wrong hash length, wrong git_sha length); each raised
    message names the offending field (D-14, D-15, REV-02)
  - verify_source_against_image and verify_image_self_consistent True/False
    branches and CodeImageError propagation when hash file is missing or
    malformed (D-11, D-12, D-13)
  - .code-hash.json schema invariants (TEST-10) and git_sha capture path:
    success, subprocess failures (CalledProcessError, FileNotFoundError,
    TimeoutExpired), and argv-spy confirming `git rev-parse HEAD` is the
    only git invocation

Pattern follows tests/test_code_checksum.py — tmp_path + MockLogger (CD-04),
no real git or filesystem dependencies outside tmp_path.

38 tests, all pass against the implementation committed in 01-02.
- CLAUDE.md: markdown-formatter pass (blank lines around fenced-code
  comment headings) and file-mode normalization 755 → 644. No content
  changes.
- summary.csv: snapshot of submission summary data carried from a local
  tool run; checked in so it is not lost.
RED phase for Task 1. Asserts the new enum member exists with value 2
(aliased with INVALID_ARGUMENTS per D-22 in 02-CONTEXT.md), is int-castable
for use as a process exit code, and that pre-existing members are not
renumbered by the addition.
Adds a new alias member CODE_IMAGE_ERROR = 2 (aliased with
INVALID_ARGUMENTS) to EXIT_CODE per 02-CONTEXT.md D-22. Provides a
grep-able symbolic name for the typed-exception → process-exit-code
mapping in main.py (Plan 02-02).

IntEnum value aliasing is by-design: same numeric exit code 2, distinct
symbolic name. Scripts that branch on EXIT_CODE name can distinguish
between the two failure modes; scripts that branch on the integer cannot,
which is acceptable because both surface the same recovery action
('user-supplied input was wrong, fix and retry').

No other enum value is renumbered. All 38 Phase 1 tests continue to pass.
RED phase for Task 2. Asserts:

  * CLOSED training/checkpointing paths prefix with /closed/<orgname>/
    (D-03 in 02-CONTEXT.md).
  * OPEN training/vector_database paths prefix with
    /open/<orgname>/results/<systemname>/ (D-03).
  * whatif and missing-mode keep the legacy shape (no closed/open
    segment) for back-compat.
  * Missing orgname / systemname kwargs for closed|open modes raise a
    typed ConfigurationError with .parameter pointing at the missing
    name and the suggestion string referencing the appropriate
    MLPSTORAGE_* env-var name (Gemini MEDIUM trust-contract finding
    in 02-REVIEWS.md — function does NOT read os.environ, the CLI
    dispatch layer reads + validates + threads through).
  * Module exports MLPSTORAGE_ORGNAME_ENVVAR / MLPSTORAGE_SYSTEMNAME_ENVVAR
    as a single source of truth for the env-var names.
  * Env-var leakage check: function ignores MLPSTORAGE_ORGNAME if set in
    the environment — explicit kwargs win.
…cation

Per 02-CONTEXT.md D-03, runtime output paths are restructured so results
land under:
  CLOSED: {results_dir}/closed/<orgname>/<type>/<model>/<command>/<datetime>/
  OPEN:   {results_dir}/open/<orgname>/results/<systemname>/<type>/<model>/...

Per the Gemini MEDIUM trust-contract finding in 02-REVIEWS.md, the function
does NOT read MLPSTORAGE_* environment variables — it accepts orgname /
systemname as keyword-only parameters threaded by the CLI dispatch layer
(Plan 02-02). Programmatic callers therefore see a typed ConfigurationError
at the function boundary instead of a hidden KeyError from a missing env
var.

Changes:

* mlpstorage_py/rules/utils.py:
  - Add module-level constants MLPSTORAGE_ORGNAME_ENVVAR /
    MLPSTORAGE_SYSTEMNAME_ENVVAR as single source-of-truth for the
    Phase 2 helper to consume (D-01, D-02).
  - Change generate_output_location signature to add keyword-only
    orgname / systemname parameters (Option A from the review — explicit
    threading, no hidden env reads).
  - Prepend the {closed|open}/<orgname>[/results/<systemname>]/ prefix
    before the existing per-BENCHMARK_TYPES chain. The per-type tail is
    unchanged below the prefix; whatif and missing-mode keep the legacy
    shape for back-compat.
  - Raise ConfigurationError (code CONFIG_MISSING_REQUIRED) when the
    required kwarg is missing for closed/open modes; the suggestion
    message references the matching MLPSTORAGE_* env-var name so the
    dispatch layer (or a human reading a programmatic-caller stack
    trace) sees actionable guidance.

* mlpstorage_py/errors.py:
  - Expose ConfigurationError.parameter as a direct attribute (was
    previously only accessible via .error.context dict). This lets the
    CLI dispatch layer in Plan 02-02 map the missing parameter name back
    to the MLPSTORAGE_* env-var to surface in its user-facing error,
    and lets unit tests inspect the trust-contract violation cleanly.

Verification:

* 12/12 new tests pass in test_generate_output_location.py.
* All 38 Phase 1 tests (test_code_image.py + test_config_reference_checksum.py)
  still pass.
* Full mlpstorage_py/tests/ suite: 551 passing, 34 pre-existing failures
  (env-related: missing pyarrow, CLI parser signature drift in
  add_universal_arguments). Baseline before this plan: 541 passing,
  44 failing — net +10 passing, 0 new failures. See deferred-items.md.

Known follow-up: mlpstorage_py/benchmarks/base.py:803 calls
generate_output_location() without the new kwargs. That caller is updated
by Plan 02-02 (CLI dispatch layer), which is the single owner of
env-var reading + validation + threading.
…STRUCT-05

- TestStruct06_RefactoredCodeDirectoryContents (10 tests) — VALS-01..04
  layered model: per-tree self-consistency for CLOSED+OPEN plus
  REFERENCE_CHECKSUMS upstream-identity for CLOSED only (D-11/D-14/D-15).
- TestStruct05_ModeAwareRequiredSubdirectories (8 tests) — Rules.md §2.1.5
  split (D-17): CLOSED requires {code, results, systems}, OPEN requires
  {results, systems}, violations routed through new sub-rule anchors
  requiredSubdirectoriesClosed / requiredSubdirectoriesOpen.
- Adds private helpers _write_valid_hash_json and _make_open_leaf for
  constructing self-consistent code/ trees and open/<sys>/<type>/<model>
  leaves in tests.
- 13 RED failures expected; remaining 5 pass coincidentally where pre-refactor
  behavior overlaps post-refactor behavior (verifying GREEN invariants).
…ode-aware STRUCT-05

Refactors code_directory_contents_check to walk BOTH closed/ and open/
subtrees through a single @rule("2.1.6", "codeDirectoryContents")
method, replacing the CLOSED-only REFERENCE_CHECKSUMS comparison with
the D-11 layered model:
- Per-tree .code-hash.json self-consistency for ALL leaves (VALS-02/04).
- REFERENCE_CHECKSUMS upstream-identity for CLOSED leaves only (D-11).
- Separate violations for missing-code/ vs hash-mismatch (D-14).

Adds private helper _iter_open_code_dirs that yields each
results/<sys>/<type>/<model>/code path under an OPEN submitter (D-15),
constraining the OPEN walk to the Rules.md §2.1.27 leaf shape.

Makes required_subdirectories_check (STRUCT-05) MODE-AWARE per Rules.md
§2.1.5 split (D-17): CLOSED requires {code, results, systems}; OPEN
requires {results, systems} (code/ lives per-leaf in OPEN). Violations
now route through new sub-rule anchors requiredSubdirectoriesClosed
and requiredSubdirectoriesOpen — the @rule decorator dispatch key
stays byte-identical so init_checks is unaffected. Closes the Gemini
HIGH cross-plan inconsistency between Plan 04's §2.1.5 split and the
prior STRUCT-05 enforcement. The 'allowed: [...]' phrasing replaces
the legacy 'only code/results/systems allowed' string so it stays
accurate across both divisions.

Preserves the D-12 single-warning behavior for unconfigured
REFERENCE_CHECKSUMS with an addendum noting self-consistency still ran.

New imports: verify_image_self_consistent, CodeImageError,
MissingHashFile, MalformedHashFile from ..tools.code_image;
Path from pathlib.
- §2.1.5: split into 2.1.5.a (CLOSED, three dirs) and 2.1.5.b (OPEN, two
  dirs at submitter level with per-leaf code/) per D-17.
- §2.1.6: rewrite to describe runtime capture-or-verify behavior, the
  .code-hash.json schema (hash/algorithm/captured_at/mlpstorage_version/
  git_sha), the per-tree self-consistency check, and the layered
  REFERENCE_CHECKSUMS upstream-identity check for CLOSED, per D-18.
  Embeds the literal VALR-02/VALR-04 mismatch messages.
- §2.1.27: surgical OPEN-subtree edit (D-16) — remove the legacy
  submitter-level "code" entry and add "code  # captured per-leaf" at
  every model/engine leaf (unet3d, resnet50, cosmoflow, llama3-8b/-70b/
  -405b/-1t, vdb_bench engines AiSEQ/DiskANN/HNSW). CLOSED subtree is
  byte-unchanged.
- §3.6.1: rewrite to describe the layered (a) self-consistency check
  and (b) CLOSED-only upstream-identity check, replacing the legacy
  single-md5sum prose, per D-19.

Acceptance greps all pass; CLOSED subtree of §2.1.27, §2.1.4, §2.1.7,
§3.6.2 unchanged. Closes DOC-01, DOC-02, DOC-03.
- Gating contract (D-10): whatif/validate/reports/etc return None; non-result-generating commands return None
- Env-var fail-fast (D-04, D-05): missing orgname/systemname, regex rejects, Rules.md §2.1.1 cited
- Inline path-traversal guard (T-02-02-05): . and .. rejected for both orgname and systemname with the literal substring "'.' and '..' are reserved path segments"
- Capture path (CAP-01/02/06): CLOSED writes results_dir/closed/<orgname>/code/; OPEN writes results_dir/open/<orgname>/results/<systemname>/<benchmark>/<model>/code/; status log starts "Captured code image at "
- Verify path (VALR-01/03 success, VALR-02/04 mismatch, D-21 missing-json): literal mismatch strings; recovery substring "either delete \`code/\` and re-run to re-capture"
Single chokepoint for runtime code-image capture-or-verify on closed|open
runs (D-07..D-10). Appended to mlpstorage_py/submission_checker/tools/code_image.py
per CD-01 placement.

Contract:
- Gates on (args.mode, args.command): no-op return None for whatif, validate,
  reports, history, lockfile, version, rules-coverage modes (D-10), and for
  configview/etc. under closed|open modes.
- Sole reader of MLPSTORAGE_ORGNAME and MLPSTORAGE_SYSTEMNAME env vars
  (closes Gemini MEDIUM trust-contract finding). Imports the env-var-name
  constants from mlpstorage_py.rules.utils for single source of truth.
- POSIX-safe regex per Rules.md §2.1.1 (D-05) plus INLINE `.`/`..`
  path-traversal guard for both orgname and systemname (REVIEWS.md consensus
  finding T-02-02-05 mitigation made inline). The literal substring
  "'.' and '..' are reserved path segments" appears in both reject messages.
- Stashes validated orgname/systemname on args._validated_orgname /
  args._validated_systemname so downstream generate_output_location callers
  read them without re-touching env.
- Computes image_parent matching generate_output_location prefix (D-03):
  CLOSED -> results_dir/closed/<orgname>/code/, OPEN -> results_dir/open/
  <orgname>/results/<systemname>/<benchmark>/<model>/code/. Creates the
  subtree but NOT the results-directory itself (D-06).
- First call captures via capture_code_image and logs literal CAP-06 prefix
  'Captured code image at '. Subsequent matching call verifies and logs
  'code unchanged from on-file image at '. Hash mismatch raises CodeImageError
  with literal VALR-02 'changes to the codebase are not allowed in a CLOSED run'
  or VALR-04 'all runs of this type must use the same codebase'. Missing or
  malformed .code-hash.json logs the D-21 actionable recovery message
  'either delete `code/` and re-run to re-capture, or restore the original
  capture.' and re-raises the Phase 1 typed error (MissingHashFile or
  MalformedHashFile) for main() to map to exit code 2.

Module additions:
- _SUBMITTER_NAME_RE = re.compile(r'^[A-Za-z0-9._-]+$')
- _RESERVED_PATH_SEGMENTS = frozenset({'.', '..'})
- _SUBMISSION_MODES = frozenset({'closed', 'open'})
- _SUBMISSION_COMMANDS = frozenset({'datasize', 'datagen', 'run'})
- capture_or_verify_code_image(args, env, log) -> Path | None

No Phase 1 primitive or function signature was changed. The 38 existing tests
in test_code_image.py and test_config_reference_checksum.py still pass; the
new test_capture_or_verify_code_image.py adds 27 tests for the helper.
- main.py imports capture_or_verify_code_image and CodeImageError (single import line)
- except CodeImageError clause present in main(), ordered AFTER DependencyError and BEFORE MLPStorageException catch-all (CodeImageError is not a MLPStorageException subclass so MRO doesn't fold it)
- except clause returns EXIT_CODE.CODE_IMAGE_ERROR (value 2)
- run_benchmark calls capture_or_verify_code_image(args, os.environ, logger) BEFORE benchmark_class(args, ...) instantiation
- helper invocation wrapped in progress_context with the literal description 'Capturing or verifying code image'
- main.py:42 imports capture_or_verify_code_image and CodeImageError from
  mlpstorage_py.submission_checker.tools.code_image (single import line).
- main.py:216 invokes the helper inside run_benchmark, AFTER environment
  validation completes and BEFORE the program_switch_dict / benchmark_class
  instantiation (D-07 insertion point). The invocation is wrapped in
  progress_context('Capturing or verifying code image...') for consistent UX
  with the surrounding validation block. The call is unconditional because
  the helper internally gates on (args.mode, args.command) per D-10 — non-
  submission modes (whatif/validate/reports/etc.) no-op.
- main.py:420 adds a dedicated except CodeImageError clause AFTER except
  DependencyError and BEFORE except MLPStorageException catch-all. The
  ordering matters because CodeImageError inherits directly from Exception
  (Phase 1 code_image.py), NOT from MLPStorageException, so Python's MRO
  does not implicitly fold it into the catch-all. The handler logs the
  message and returns EXIT_CODE.CODE_IMAGE_ERROR (D-22).
- The existing except ConfigurationError clause is unchanged — env-var
  fail-fast and path-traversal-reject ConfigurationErrors raised by the
  helper continue to flow through that path (user-input errors, not code-
  image integrity errors).
Covers CAP-01/02/06/07/08, VALR-01/02/03/04, D-04/05/21, and the
REVIEWS.md consensus path-traversal '.'/'..' guard. 32 tests in 9 classes
using direct in-process invocation of capture_or_verify_code_image with
tmp_path + MockLogger fixtures (CD-02 lightweight style).

Mismatch assertions use literal substring matches on the VALR-02/04 spec
strings; path-traversal tests are parametrized over '.' and '..' for both
MLPSTORAGE_ORGNAME and MLPSTORAGE_SYSTEMNAME and assert against the
"'.' and '..' are reserved path segments" substring from Plan 02 Task 1.

Mismatch tests monkeypatch verify_source_against_image to force False,
isolating the mismatch code path from the Phase 1 capture-vs-verify
hash discrepancy noted in deferred-items.md.
…rectory

Part A — amend TestStruct06_CodeDirectoryContents and
TestFixtureFactory::test_default_fixture_no_errors so the fixture's
pre-existing closed/Acme/code/ carries a matching .code-hash.json (via
the existing _write_valid_hash_json helper). The Plan 02-03 layered
self-consistency check runs unconditionally for every leaf, so a fixture
without a valid hash file now trips a MissingHashFile violation. The
multi-submitter test also builds a code/ under AlsoAcme. test_mutated_code_fails
relaxes its assertion to >= 1 [2.1.6] violation since mutation now breaks
BOTH the self-check AND REFERENCE_CHECKSUMS.

Part B — add TestStruct06_OpenCodeDirectory (6 tests) targeting the OPEN
walk added in Plan 02-03 Task 1 via _iter_open_code_dirs:
  - VALS-03 missing OPEN code/ (single per-leaf violation)
  - VALS-04 happy path (self-consistency passes)
  - VALS-04 sad path (hash mismatch — recorded hash != tree hash)
  - VALS-04 missing .code-hash.json
  - OPEN-only tree does NOT emit the closed-specific "not configured" warning
  - Multiple OPEN model leaves each get their own per-leaf violation
…anchors

Sub-edit 1: update TestStruct05_RequiredSubdirectories (the legacy CLOSED
class) to assert the new sub-rule anchors per Plan 02-03 Task 2 (D-17).
The four tests previously asserting "[2.1.5 requiredSubdirectories]" now
assert "[2.1.5 requiredSubdirectoriesClosed]". The extra-submitter-subdir
test additionally pins the new sorted-list-repr violation format
"allowed: ['code', 'results', 'systems']". The wrapping-hint test
upgrades its anchor assertion the same way.

Sub-edit 2: add TestStruct05_OpenSubmitter (7 tests) regressing the
Gemini HIGH cross-plan finding (REVIEWS.md). Before the Plan 02-03 Task 2
mode-aware refactor, EVERY OPEN submission would have been flagged as
having a missing code/ at the submitter level. Tests:
  - CLOSED no-regression ({code, results, systems} unchanged)
  - OPEN happy path with {results, systems} only (regression target)
  - OPEN unexpected code/ at submitter level (anchor + allowed-set assertion)
  - OPEN missing results/
  - OPEN missing systems/
  - CLOSED missing code/ routes through requiredSubdirectoriesClosed anchor
  - OPEN wrapping-hint diagnostic

Adds module-level helpers _build_minimal_open_submitter and
_build_minimal_closed_submitter — small mkdir-based fixture builders
sibling to the existing _make_open_leaf and _write_valid_hash_json
helpers.
compute_code_tree_md5's _is_excluded_dir matches MD5_EXCLUDE_PREFIXES by
rooted prefix only, while capture_code_image's shutil.copytree ignore_logic
also matches by basename equality at any depth. The two walkers therefore
disagree on real repos that contain deeply nested __pycache__/, tests/,
build/, etc. directories with non-excluded filenames inside.

Adds two regression tests:
- test_nested_excluded_dir_pruned_at_any_depth: the predicate alone must
  prune excluded dir names at any depth (not just root).
- test_capture_verify_roundtrip_with_nested_excluded_dirs: end-to-end
  capture-then-verify on unchanged source must return True even with deep
  __pycache__/ and tests/ leaves.

Both fail before the fix in the next commit. Discovered via Phase 2 verifier
(see .planning/phases/02-cli-integration-validator-docs/VERIFICATION.md).
…d excluded dirs

_is_excluded_dir matched MD5_EXCLUDE_PREFIXES by rooted relative path only,
so deeply nested __pycache__/, tests/, build/, etc. directories were
traversed and their non-excluded files were hashed. capture_code_image's
shutil.copytree ignore_logic already matched by basename at any depth, so
the two walkers produced different hashes for the same logical tree.

On real submitter repos this caused verify_source_against_image to return
False on unchanged source the second time mlpstorage closed datagen ran —
the runtime would emit the VALR-02 'changes to the codebase are not
allowed in a CLOSED run' message for a tree that had not actually changed.

Adds a basename-equality check before the rooted-prefix check so excluded
dir names are pruned at any depth, matching the capture's semantics. The
basename loop is intentionally cheap (10 prefix comparisons per dirname).

Fixes the verification finding in
.planning/phases/02-cli-integration-validator-docs/VERIFICATION.md.
RED tests in commit 13bd98e now pass; full code-image test surface
(242 tests) green; broader suite shows 34 failed / 652 passed — same
baseline as before the fix, zero new regressions.
…me segment

The CLI subparsers are named 'vectordb' and 'kvcache', but generate_output_location
writes the on-disk segment as 'vector_database' / 'kv_cache' (BENCHMARK_TYPES.name).
capture_or_verify_code_image currently uses args.benchmark (the CLI name) for the
type segment, which puts the captured code/ in a different tree than the runtime
writes results into. Phase 2 verifier surfaced this divergence (see
.planning/phases/02-cli-integration-validator-docs/VERIFICATION.md, Human
Verification Item 2). For training/checkpointing the CLI name matches
BENCHMARK_TYPES.name so no automated test caught this — the new tests exercise
the only two benchmarks where the names diverge.
…md diagram

capture_or_verify_code_image used args.benchmark (the CLI subparser name) for
the per-type path segment under open/<orgname>/results/<sys>/. For training
and checkpointing the CLI name happened to match BENCHMARK_TYPES.name so the
divergence was invisible; for vectordb and kvcache the names diverge —
'vectordb'→'vector_database', 'kvcache'→'kv_cache' — which placed the
captured code/ in a different on-disk tree than generate_output_location()'s
results. Rules.md §2.1.27 diagram used a third name ('vdb_bench') for the
same segment.

- Add _CLI_BENCHMARK_TO_TYPE mapping from CLI subparser name to
  BENCHMARK_TYPES enum and emit the canonical .name in the OPEN image_parent
  path. Unknown CLI names now raise CodeImageError instead of silently
  building a divergent path.
- Rename 'vdb_bench' → 'vector_database' in Rules.md §2.1.27 (2 occurrences,
  CLOSED and OPEN sub-diagrams) to match what generate_output_location()
  writes and what the helper now captures.

Fixes the verifier finding in
.planning/phases/02-cli-integration-validator-docs/VERIFICATION.md
(Human Verification Item 2). All three names — runtime output, captured
code, Rules.md diagram — now agree.

Note: vectordb has no args.model attribute (vectordb_args.py does not add
--model), so OPEN vectordb runs will still AttributeError on
getattr(args, 'model') in this helper. That latent issue is a separate
path-shape design question (vectordb output writes <type>/<command>/, not
<type>/<model>/) and is intentionally out of scope for this naming fix.
vector_database and kv_cache write runtime output to <type>/<command>/<datetime>/
(no <model> segment) per generate_output_location. The captured code/ must
mirror the runtime tree, so for these two types code/ lives at <type>/code/
rather than <type>/<model>/code/. Currently:

- vectordb crashes the helper with AttributeError because vectordb has no
  --model CLI arg and the helper does an unguarded getattr(args, 'model').
- kvcache silently writes code/ to <type>/<model>/code/, a tree that no
  results live in — the submission validator and the runtime hash check
  cannot find it.
- The validator's _iter_open_code_dirs blindly walks 3 levels
  (<sys>/<type>/<model>/code), so it descends into a stray code/ leaf for
  vector_database/kv_cache and reports 'code/code missing' instead of
  validating the real code/.

Adds 4 failing tests:
- test_open_vectordb_uses_canonical_type_name: helper writes <type>/code/
  with no model segment, no AttributeError.
- test_open_kvcache_uses_canonical_type_name: same as vectordb; args.model
  is ignored.
- test_open_vector_database_code_dir_at_type_level: validator finds and
  validates code/ at <type>/ (2 levels), and emits the missing-code
  violation against the type level, not a phantom <type>/code/code path.
- test_open_kv_cache_code_dir_at_type_level: same as vector_database.

All four fail before the fix in the next commit.
…for vectordb/kvcache

vector_database and kv_cache write runtime output to <type>/<command>/<datetime>/
(per generate_output_location — no <model> segment). The captured code/ must
mirror that tree, so for these two types code/ lives at <type>/code/ rather
than <type>/<model>/code/. Training and checkpointing keep
<type>/<model>/code/ since their runtime output is keyed per model.

This closes two correctness gaps surfaced after the canonical-type-name fix
(866ff5a):

- capture_or_verify_code_image previously called getattr(args, 'model')
  unconditionally. vectordb_args.py does not register a --model arg, so OPEN
  vectordb runs would AttributeError before any capture happened. For
  kvcache the call succeeded but wrote code/ into a tree where no results
  live — orphaning the code image from the runtime output.
- _iter_open_code_dirs walked exactly three levels under results/ for
  every benchmark type, so the validator either missed code/ entirely for
  vectordb/kvcache or reported phantom code/code/ violations.

Adds _TYPES_WITHOUT_MODEL_SEGMENT (helper) and _OPEN_TYPES_WITHOUT_MODEL
(validator) — the two sets are kept inline rather than imported across the
module boundary so the validator does not pull in the helper's runtime
dependencies. They're both authoritative and must be kept in sync if
generate_output_location grows another no-model benchmark type.

Tests in commit eb46498 (vectordb/kvcache helper paths, vectordb/kv_cache
validator walks) now pass; full code-image surface 246 tests green; broader
suite still 34 failed / 656 passed (+4 new) — same pre-existing baseline.
…er-index

Plan 02-04 inserted 'code  # captured per-leaf' markers under every leaf of
the §2.1.27 OPEN directoryDiagram, including AiSEQ / DiskANN / HNSW under
vector_database. That placement matched the legacy diagram's shape (which
showed an <index_type> level between vector_database/ and <datetime>/), but
not the runtime: generate_output_location writes
<type>/<command>/<datetime>/ for vector_database and kv_cache — no <model>
or <index_type> segment — so the captured code/ lives directly under
<type>/, shared across all index types and commands.

- §2.1.27: move the per-leaf code/ marker from each AiSEQ/DiskANN/HNSW
  level up to a single 'code  # captured per-type (no <model> in runtime
  output path)' marker under vector_database/ itself. The legacy
  <index_type> level is pre-existing and intentionally left alone — that
  diagram/runtime inconsistency predates this work and is out of scope.
- §2.1.5.b: amend prose to describe both leaf shapes explicitly —
  <systemname>/<type>/<model>/ for training+checkpointing,
  <systemname>/<type>/ for vector_database+kv_cache — so a reader of the
  rules text sees the same shape that the diagram and the runtime + checker
  agree on (commit 390eb20).

No code change. Documentation alignment only.
…dex_type>

vector_database benchmark results split by index_type because AISAQ results
are not comparable to DISKANN/HNSW — they must live in separate on-disk
subdirectories. Rules.md §2.1.27's existing AiSEQ/DiskANN/HNSW level under
vector_database is canonical; the runtime, the capture helper, and the
submission validator must all honor it.

Recent commits (390eb20 helper/validator, 8fba4c4 Rules.md) collapsed
vector_database into a per-<type> shape on the mistaken assumption that the
runtime's missing <index_type> was canonical. That regressed the per-leaf
contract that distinguishes incomparable index implementations. Reverting
those commits in the next change; this RED commit pins the corrected shape
in tests first.

- generate_output_location: vector_database must yield
  <type>/<index_type>/<command>/<datetime>/ on both CLOSED and OPEN.
- capture_or_verify_code_image: OPEN vectordb captures at
  <type>/<index_type>/code/ (using args.index_type — vectordb has
  --index-type, not --model).
- _iter_open_code_dirs: vector_database is back to the standard 3-level
  walk under results/, yielding <sys>/<type>/<index_type>/code/ — same
  shape as training/checkpointing.

kv_cache stays at <sys>/<type>/code/ per user direction — the per-model
structure for kv_cache will be revisited in a follow-up once we have more
detail on its directory and file layout. The kv_cache test
(test_open_kv_cache_code_dir_at_type_level) is unchanged.

All 4 new/updated tests fail before the fix in the next commit.
…abase

vector_database results split by index_type — AISAQ results are not
comparable to DISKANN/HNSW, so they must live in separate subdirectories
(Rules.md §2.1.27 is canonical on this; the runtime path was missing the
segment). kv_cache stays at <type>/code/ for now, pending a follow-up plan
on its directory/file structure.

generate_output_location (vector_database case):
  before:  <type>/<command>/<datetime>/
  after:   <type>/<index_type>/<command>/<datetime>/

capture_or_verify_code_image (OPEN, vector_database):
  before:  <type>/code/                    (per-type, wrong)
  after:   <type>/<index_type>/code/       (per-leaf, correct)

  Replaces the hardcoded _TYPES_WITHOUT_MODEL_SEGMENT set with a
  data-driven _TYPE_TO_LEAF_ATTR map so each BENCHMARK_TYPE declares its
  leaf attribute (args.model for training/checkpointing,
  args.index_type for vector_database, None for kv_cache transitionally).
  Adding new benchmark types or transitioning kv_cache later is a
  one-line edit in the map.

_iter_open_code_dirs (validator):
  before:  yielded <sys>/<type>/code/ for both vector_database and kv_cache
  after:   yielded <sys>/<type>/<index_type>/code/ for vector_database
           (via the standard 3-level walk used by training/checkpointing);
           kv_cache still yields <sys>/<type>/code/ via the same special
           case as before.

This reverses the vector_database half of commit 390eb20, which had
collapsed vector_database into per-<type> on the mistaken assumption that
the runtime's missing <index_type> segment was canonical. The kv_cache
half of 390eb20 is preserved verbatim — user direction is to keep
kv_cache at <type>/code/ until a follow-up plan finalizes its structure.

All 4 RED tests in commit b97c466 now pass; full code-image surface 247
tests green; broader suite still 34 failed / 657 passed — same pre-existing
baseline, zero regressions.
…abase

Reverses the §2.1.27 OPEN diagram change from 8fba4c4 (which collapsed
vector_database into a single per-<type> code/ marker). The original Plan
02-04 placement was correct: code/ lives at each AiSEQ/DiskANN/HNSW leaf
because results across index types are not comparable and must validate
independently. The 8fba4c4 change was based on the mistaken assumption
that the runtime (which was missing the <index_type> segment) was
canonical; in fact the diagram is canonical and the runtime is being
corrected to match (commit 96f0ed0).

§2.1.27: per-leaf 'code  # captured per-leaf' marker restored at each of
AiSEQ, DiskANN, HNSW under vector_database/. The per-type marker at the
vector_database level is removed. The diagram now matches the runtime
(generate_output_location includes args.index_type) and the validator
(_iter_open_code_dirs walks <sys>/<type>/<index_type>/code/).

§2.1.5.b: prose rewritten as a typed bullet list since the three benchmark
families now have three different leaf shapes:
- training, checkpointing → <type>/<model>/ (per model)
- vector_database         → <type>/<index_type>/ (per index type — AISAQ
                                                  vs DISKANN vs HNSW
                                                  results not comparable)
- kv_cache                → <type>/  (transitional, pending finalization)

Pre-existing diagram inconsistencies left in place:
- AiSEQ vs config's AISAQ casing — the diagram's mixed-case index names
  (AiSEQ, DiskANN) do not match what argparse stores (uppercase
  DISKANN, HNSW, AISAQ from VDB_INDEX_TYPES). Submitters' on-disk
  directories will use the uppercase values; the diagram should
  eventually be re-cased to match. Out of scope for this fix.
- kv_cache absent from §2.1.27 diagram entirely — to be addressed when
  the kv_cache directory structure is finalized.
…ype split

The docstring still described vector_database alongside kv_cache as the
2-level walk, but vector_database moved back to the standard 3-level
<type>/<index_type>/ walk in 96f0ed0. Comment-only cleanup flagged by the
phase-2 re-verifier as an info-level item — no behavior change.

Docstring now enumerates all three leaf shapes (training/checkpointing
per <model>, vector_database per <index_type>, kv_cache transitional) and
the inline comment beside the kv_cache branch names only kv_cache.
…r pre-existing failures

- Bump pyproject.toml [project].version from origin/main + 1 patch
  (target derived at execution time per phase 3 D-08 — no hardcoded literal;
  origin/main was at 3.0.13, target 3.0.14).
- Regenerate uv.lock against the bumped project via bare `uv lock`
  (D-01: no transitive bumps; only the project's own version line changed).
- Fix 7 stale-signature failures in test_issue_376_file_arg_conflict.py by
  passing the second positional argument that the production CLI arg-adders
  now require (D-03: test-side fix, production signatures unchanged).
  Test assertions adapted to the three-mode-positional CLI redesign — the
  storage-type moved from option `--file`/`--object` to a single positional
  `data_access_protocol` with `choices=['file', 'object']`; the
  duplicate-registration invariant under test is preserved.

Closes REL-01, REL-02.
The gsd-code-fixer agent now places its isolated worktree under
${repo_root}/.gsd-tmp/ instead of /tmp/ so sandboxed harnesses don't
render the path as a ../../../../tmp/ traversal string in permission
prompts. Add the directory to .gitignore so nested-worktree contents
don't show up in the outer working tree's git status.
verify_image_self_consistent raised MissingHashFile when
compute_code_tree_md5 returned None — but load_code_image had
already succeeded, so the JSON sidecar IS present. The real failure
is that the tree itself didn't hash (permission error mid-walk, race
where the dir vanished, etc.). MissingHashFile's docstring is
explicit: '.code-hash.json not found in an image directory (D-14)'.

Introduce CodeTreeUnreadable(CodeImageError) and raise that instead.
Existing call-site catches in helpers.py and submission_structure_checks.py
still cover this via the (MalformedHashFile, CodeImageError) arm
(CodeTreeUnreadable is-a CodeImageError), so behavior is unchanged
for downstream — only the surface log message names the right root cause.
verify_source_against_image raised SourceRootNotFound when
compute_code_tree_md5(source_root) returned None — but
SourceRootNotFound is documented (D-05) as 'find_source_root walked
to filesystem root without finding pyproject.toml,' i.e., a
structural CLI/config error, not a runtime read error.

A None return from compute_code_tree_md5 means the walk itself
failed (permission error, dir vanished mid-walk, etc.) — raise
CodeTreeUnreadable to match the new IN-01 semantics. Catch sites
already cover this via CodeImageError, so no downstream changes.
The comment at the image_parent computation said 'creating the
results-directory itself is reserved for the future mlpstorage init
command,' but the implementation called
image_parent.mkdir(parents=True, exist_ok=True) which silently
created results_dir if absent.

Add a results_dir.exists() check that raises ConfigurationError
with code=CONFIG_INVALID_VALUE before the mkdir runs. Existing
tests pass tmp_path (always exists), so no test regressions.
shutil.copytree(..., dirs_exist_ok=True) creates target_dir on its
own since Python 3.8 — the explicit mkdir(parents=True, exist_ok=True)
above it was a no-op. Removing it shrinks the window in which
target_dir can be in a partial state when copytree begins.
…ters

The 'reference checksum not configured' warning fired any time
closed/ existed as a directory — including when it was empty, or
contained only dotfiles like .DS_Store / .gitkeep. In those cases
no code/ checks were actually skipped, so the warning was pure
noise.

Add an any() probe for a non-dot, isdir() entry under closed/ and
gate the warn_violation on it. Empty closed/ and dotfile-only
closed/ now silently pass.
The docstring's 'Known limitation' invoked TODO-001 (a planned
machine-readable df contract) but the parse site itself had no
breadcrumb pointing forward — a future maintainer reading line 124
in isolation had no marker to grep on.

Add a TODO(TODO-001) comment at the parse site. Also note the
planned bigger migration: capturing stat -f -c '%i' "$data_dir" per
node at runtime and storing the scalar FS identity, which supersedes
both this parser limitation and WR-06's silent-pass fragility.
The audit tool's locked rule-ID regex was [234] — Rules.md §5
(VdbCheck, added by Plan 04-03) was never enumerated, so even
though discover_rules(VdbCheck) found all 16 @rule methods,
rules-coverage reported zero §5 entries.

Extend the character class to [23456] so §5 is now enumerated and
§6 is pre-wired for the kvcache rules landing soon. Updates the
matching docstrings, the locked-regex doc literal, the CLI
description, and the test row-count floor (>=50 -> >=66 since
Phase 4 added 16 §5 IDs).
The post-parse block at cli_parser.py:325 dereferences args.hosts
with len() but only guards on hasattr(args, 'hosts'). argparse leaves
the attribute as None when --hosts is not supplied, so any single-node
invocation (e.g. `mlpstorage open vectordb datagen file ...`) crashed
with "object of type 'NoneType' has no len()" before reaching the
benchmark dispatch.

Add the missing `is not None` clause to match the normalization guard
20 lines above which already handles the None case correctly.
… env

main()'s generic-Exception handler gated the traceback emission on
the module-level MLPS_DEBUG constant, which is only set from the
MLPS_DEBUG environment variable at import time. The CLI's --debug
flag was a no-op for this path — the user saw

  ERROR: Unexpected error: ...
  Run with --debug for full stack trace

even when --debug WAS supplied, forcing them to re-run with
`MLPS_DEBUG=1 mlpstorage ...` to get a trace.

args is not in scope of main() (it's created inside _main_impl), so
check sys.argv directly for the bare '--debug' token. argparse
declares --debug as store_true, so the bare-token check covers all
valid invocations.
Benchmark.generate_output_location() called the underlying
rules.utils.generate_output_location() without orgname= or
systemname= kwargs. For args.mode in {closed, open} the underlying
function raises ConfigurationError demanding the dispatch layer
thread the validated env-var values through.

capture_or_verify_code_image already stashes the validated values on
args._validated_orgname / args._validated_systemname (see
code_image.py:625-627 comment) specifically so downstream callers
can read them without re-reading env. This commit wires that
contract through. getattr(..., None) keeps legacy/whatif modes
working (the underlying function's mode check skips the kwarg
requirement for those).
The orgname / systemname ConfigurationError suggestions used:

    f"... read {VAR}={VAR!r} from the environment ..."

which rendered as:

    "... read MLPSTORAGE_ORGNAME='MLPSTORAGE_ORGNAME' from the
    environment ..."

The second slot was meant for a <value> placeholder but somebody
filled it with the variable name itself — reads as if the value
IS the var name string. Rewrite as 'read the <VAR> environment
variable and thread the validated value through ...' which says
the same thing without the bogus key=value.
A live OPEN vectordb datagen captured the running source tree into
code/, and the captured tree included .planning/, .gsd-tmp/,
.idea/, .claude/, etc. — local developer / tooling artifacts that
the project's .gitignore already excludes from version control,
have nothing to do with the benchmark source contract, and (.gsd-tmp/
in particular) can contain transient agent state.

Extend MD5_EXCLUDE_PREFIXES with the common dot-prefixed
local-artifact dirs:
  .idea/   .vscode/   .claude/   .agent/   .agents/   .roo/
  .planning/   .gsd-tmp/

.git/, .pytest_cache/, .venv/, .tox/ were already in the list.
.github/, .env.example, .gitignore, .python-version are intentionally
kept — they ARE part of the project contract.
summary.csv at repo root is the default output path for the
submission checker (mlpstorage_py/cli/utility_args.py:123,
mlpstorage_py/submission_checker/main.py:82, README.md). Tracking it
meant every `mlpstorage validate ...` run (and the recent UAT runs)
produced a dirty working tree.

Remove the tracked file and add summary.csv to .gitignore so the
runtime artifact stays local to whoever ran the tool. If a future
schema-template-style summary.csv is needed for the repo (e.g.
sample/header-only), it should land at a non-default path like
docs/summary-template.csv.
CLAUDE.md was included in the PR diff with no substantive content
changes — only blank lines added around bash code blocks by a
CommonMark formatter. Restore to origin/main to keep the PR diff
focused on actual source changes. (CLAUDE.md is already in
.gitignore but tracked from an earlier commit, so further local
edits won't show up in future PRs from this branch.)
@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

2026-06-18 update — Phase 4 review-fix + UAT integration polish

23 new commits pushed today. The PR description (above) was written before Phase 4 landed; this comment summarizes what's new since then. Reviewers may want to focus diff review here.

Phase 4 deliverables (now in this branch)

  • VdbCheck §5 (full 16 rules) wired up: 5.1.1, 5.1.2, 5.2.1, 5.2.2, 5.3.1–5.3.4, 5.4.1, 5.4.2, 5.5.1, 5.6.1–5.6.5. STUB_COVERAGE['VdbCheck'] removed.
  • On-disk shape rename for OPEN vectordb (supersedes the §"On-disk shape" line in this PR's original description):
    • Was: .../vector_database/<index_type>/code/ (e.g. DISKANN)
    • Now: .../vdb_bench/<DisplayIndex>/<command>/<datetime>/ with per-leaf code/ at the <DisplayIndex>/ level (e.g. vdb_bench/DiskANN/datagen/20260618_204647/)
    • INDEX_TYPE_TOKEN_TO_DIR / INDEX_TYPE_DIR_TO_TOKEN dual vocabulary lives in mlpstorage_py/config.py
  • Layered code-image helper _check_code_image_layered in submission_checker/checks/helpers.py, consumed by both §3.6.1 (TrainingCheck) and §5.6.1 (VdbCheck) — CD-04 deduplication via caller-supplied rule_id/rule_name.
  • Rules.md §2.1.5.b leaf description, §2.1.27 CLOSED + OPEN vdb subtrees (now with <command>/<datetime>/ layering, code # captured per-leaf under all three OPEN index dirs including AiSAQ), §5.3.1 prose narrowed, §5.6.1 layered-model pointer.
  • 46 new tests across test_vdb_checks.py (41) + test_training_check_3_6_1.py (5).

Phase 4 code-review fixes (13 findings)

7 Warnings + 6 Info from a standard-depth review, all atomic commits:

ID Commit Summary
WR-01 625a11c Skip dot-prefixed entries in _iter_open_code_dirs (prevented spurious [2.1.6] violations on .cache/, .DS_Store, merged-tree .git/)
WR-02 e35e83d parent.get(k) or {} in chained YAML reads (training_checks, vdb_checks, submission_structure_checks) — fixes AttributeError on null intermediates
WR-03 ecac710 Atomic capture: rollback code.tmp/ on hash/JSON/rename failure
WR-04 b306303 Typed CodeImageError for missing args.benchmark / args.<leaf_attr> (replaces bare AttributeError)
WR-05 9515078 Tighten 5.5.1 S3 match: exact-name set + s3- prefix (rejects non-s3-storage)
WR-06 1206165 log.warning when df-mount match is indeterminate (was silent-pass)
WR-07 5c58aa0 + 054eb7f Short-circuit upstream-identity walk on MissingHashFile — both in helpers.py and STRUCT-06's inline copy
IN-01/02 709c0ac, cc05dc2 New CodeTreeUnreadable(CodeImageError) exception for tree-walk failures (replaces misnamed MissingHashFile / SourceRootNotFound raises)
IN-03 f9af70e Enforce documented results_dir.exists() invariant — raise ConfigurationError (E102) with mkdir -p suggestion
IN-04 2bdc7eb Drop redundant target_dir.mkdir before copytree(..., dirs_exist_ok=True)
IN-05 cd55ea4 Gate "reference checksum not configured" warning on actual closed/ submitter presence
IN-06 4e34ddd TODO(TODO-001) marker at df parse site + note about planned stat -f -c '%i' migration

UAT-discovered integration bugs (6 fixes)

Live /gsd-verify-work UAT exercised mlpstorage open vectordb datagen file … end-to-end. None of these were in the original Phase 4 plan scope, but all were blocking live verification of Phase 4 deliverables:

Commit Bug
8edc84f rules_coverage.py regex [234][23456] — Phase 4 wired 16 @rule methods for VdbCheck but the audit-CLI regex never enumerated §5 IDs from Rules.md, so the report dropped them silently. [23456] also pre-wires §6 for the kvcache rules landing soon.
b98472f cli_parser.py:325: len(args.hosts) crashed when --hosts was not supplied (argparse leaves None). Single-node mlpstorage open vectordb datagen ... was completely broken.
8a3dabd main.py:450: --debug CLI flag never wired to the unhandled-exception traceback gate; only the MLPS_DEBUG env var path was checked.
dc29798 benchmarks/base.py:804: generate_output_location() was called without orgname=/systemname= — for OPEN/CLOSED modes this raised the documented ConfigurationError. Now reads args._validated_orgname / _validated_systemname stashed by capture_or_verify_code_image.
beb9bbb rules/utils.py: suggestion-string format slip — "read MLPSTORAGE_ORGNAME='MLPSTORAGE_ORGNAME' from the environment" (the value placeholder was filled with the var name). Rewritten.
17204da constants.py: MD5_EXCLUDE_PREFIXES extended with .idea/, .vscode/, .claude/, .agent/, .agents/, .roo/, .planning/, .gsd-tmp/ — local-artifact dirs were being captured into the code image.

PR-prep cleanup

  • cc2b660: Untrack summary.csv, add to .gitignore. It's the default output path for the submission checker (utility_args.py:123, README) — tracking it produced dirty working trees after every mlpstorage validate run.
  • e15fe7a: Revert whitespace-only CLAUDE.md changes (CommonMark formatter noise; substantive content unchanged).

UAT verdict

6/6 tests passed live:

  1. CLI smoke (mlpstorage validate --help)
  2. rules-coverage shows full VdbCheck §5 (after 8edc84f)
  3. OPEN vectordb output path: vdb_bench/DiskANN/datagen/<datetime>/
  4. OPEN vectordb captures code/ at <DisplayIndex>/ level
  5. Rules.md §2.1.27 OPEN graphic lists code/ under all 3 index dirs
  6. IN-03 ConfigurationError fires on missing --results-dir

Test plan

uv run pytest mlpstorage_py/tests/ -q740 passed / 5 failed / 1 xfailed.

The 5 failures are pre-existing and unrelated to this branch's work — verified by stashing and re-running:

  • 4 × test_open_closed_flag_recognition.py::TestOpenClosedCLIFlags — stale add_universal_arguments(parser) calls missing the required req_results positional
  • 1 × test_benchmarks.py::TestVectorDBBenchmark::test_constructor_accepts_kwargs — separate constructor signature drift

Both look like CLI-parser-side test rot, not regressions in this PR. Recommended for a separate cleanup.

Caveats / acknowledged

  • /gsd-secure-phase 4 did NOT run. workflow.security_enforcement: true in .planning/config.json says it should before phase advancement. No new attack surface introduced in this PR (the new code paths are filesystem reads/writes that already existed), so impact is bounded — happy to run on request before merge.
  • Version invariant: pyproject.toml is at 3.0.14; origin/main is at 3.0.13. +1 patch as required.
  • mlpstorage_py/cli_parser.py, mlpstorage_py/main.py, mlpstorage_py/benchmarks/base.py were NOT in the Phase 4 plan but were touched today because UAT discovered the bugs above. All commits are atomic and individually justified; revert-blast-radius is per-commit.

Memory of pending follow-up work (not in this PR)

  • dfstat -f -c '%i' migration: per-node FS-identity capture via stat -f -c '%i' "$data_dir" superseding df-output parsing for §3.4.2 / §5.4.2. Out of scope here; tracked.

@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

The second big chunk of commits here is really a separate topic: bringing in all the new rules for VDB, but it had to tweak and/or fix so much of the infrastructure that the "original" PR had built that there wasn't a good way to separate them.

FileSystemGuy and others added 3 commits June 19, 2026 13:58
The call site in benchmarks/base.py now threads orgname= and systemname=
through to generate_output_location(). Update the unit test's
assert_called_once_with to include the new keyword arguments (both
None in this test, since the args Namespace doesn't set
_validated_orgname / _validated_systemname).
@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

@idevasena Here's the summary of the differences:

Conflict summary

Three files conflict between PR #473 (FileSystemGuy-code-image-capture-v1) and main:

  1. mlpstorage_py/cli_parser.py — trivial (resolvable mechanically)

Both branches independently fixed the len(None) crash when --hosts is unset. Main's fix folds the assignment into
the existing if hasattr(args, 'hosts') and args.hosts is not None: block at line 319; PR #473 added a parallel
guarded block. Functionally identical. Resolution: take main's nested form.

2 & 3. mlpstorage_py/rules/utils.py and tests/unit/test_accumulation.py — substantive design dispute

These conflict on competing VectorDB result-tree layouts that were developed in parallel and have non-trivial
downstream commitments on both sides.

PR #473's design (cites Rules.md §5.3.1, Phase 4 D-02/D-03):

  • Path: /vdb_bench////
  • On-disk segment is vdb_bench (not vector_database); index uses a display-name mapping (DiskANN, HNSW, AiSAQ) via
    INDEX_TYPE_TOKEN_TO_DIR in config.py.
  • No engine segment.
  • Submission checker (mlpstorage_py/tests/test_vdb_checks.py:226) explicitly gates all 16 §5 rule methods on mode
    == "vdb_bench" — switching the segment name silently disables every §5 check.
  • Code-image capture tests (test_capture_or_verify_code_image.py:246) also hard-code vdb_bench / DiskANN / code.

main's design (PR #476, just merged):

  • Path: /vector_database//<vdb_index>///
  • Segment stays vector_database; index is uppercase token (DISKANN).
  • Adds new --vdb-index flag with full resolver in vectordb_args.py (line 547): defaults, fallback to --index-type,
    cross-validation, conflict detection.
  • Includes engine segment so multiple VDB engines coexist in one results tree.

These can't both be true at once. Possible resolutions, none of which I should pick unilaterally:

  • Take PR Code-Image Capture & Validation (milestone v1.0) #473's design: revert main's --vdb-index/engine-segment work and update the just-merged tests on this
    branch.
  • Take main's design: drop the vdb_bench rename and the display-name mapping; loosen the §5 rules-checker gate so
    it accepts vector_database; update Phase 4 D-02/D-03 documentation.
  • Hybrid: e.g. /vdb_bench///// — keep the §5 gate name and
    display mapping, adopt main's engine segment and --vdb-index resolver. This is the most defensible technically,
    but it's a new design that should go through review, not a merge resolution.

@idevasena I assume that you'd prefer the results pathname structure defined in your PR#476, is that correct? If so, then I'll ask my agent to rework this PR to use that structure. Please just confirm that...

@idevasena

Copy link
Copy Markdown
Contributor

@FileSystemGuy yes correct, I prefer the results pathname structure defined in PR#476. Thank you!

@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

@idevasena

Below is what I propose to do for the VDB workload (Claude prompts):

  1. Please examine the PR#473 and PR#476. We need to change the pathname structure that we've been using in PR#473 for the results-directory for VDB to be the pathname structure used in PR#476 for the results-directory for VDB.

  2. VDB doesn’t appear to print its arguments in the arg summary. Here's a minimum list:
    2026-06-19 14:34:09|RESULT: Vectors: 1,000,000 x dim=1536 x 4B
    2026-06-19 14:34:09|RESULT: Raw data: 6.14 GB
    2026-06-19 14:34:09|RESULT: Index type: DISKANN (130% overhead)
    2026-06-19 14:34:09|RESULT: Shards: 1
    2026-06-19 14:34:09|RESULT: Estimated total: 7.99 GB

  3. We should record the datasize in a metadata file in the results-directory so we can ensure VDB runs have enough data. We will need to do the same for the other workloads, so it might best to do them all in a separate, later, PR from the one we'd be building for this effort.

  4. VDB is still reporting that it is OPEN-only:
    2026-06-19 14:34:09|STATUS: Open: [OPEN] VectorDB benchmark is in preview status - not accepted for closed submissions (Parameter: benchmark_status)
    2026-06-19 14:34:09|STATUS: Benchmark run qualifies for OPEN category ([RunID(program='vector_database', command='datasize', model='milvus_DISKANN', run_datetime='20260619_143409')])
    VDB doesn’t know whether it is open or what:
    2026-06-19 14:34:09|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use closed or open as the first positional argument to specify a configuration.

  5. VDB should require the results-directory to be specified, there should be no "default" results-directory.
    2026-06-19 14:34:09|WARNING: Results directory not specified. Writing results to the system temp directory: /tmp/mlperf_storage_results. These results will NOT persist across a reboot. Use --results-dir or set the MLPERF_RESULTS_DIR environment variable to save results permanently.
    2026-06-19 14:34:09|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/vector_database/milvus/DISKANN/datasize/20260619_143409/vector_database_20260619_143409_metadata.json

@idevasena

Copy link
Copy Markdown
Contributor

@FileSystemGuy The proposed changes look accurate to me. Meanwhile, I will work on #452 that needed updates to README and closed param values.

…age-capture-v1

# Conflicts:
#	mlpstorage_py/cli_parser.py
#	mlpstorage_py/rules/utils.py
#	tests/unit/test_accumulation.py
The merge with origin/main brought in PR #476's runtime path layout
(vector_database/<engine>/<DisplayIndex>/...) but the rest of the
code-image-capture PR was built around the older 'vdb_bench' segment.

This makes the conversion consistent across the codebase:

- code_image._TYPE_TO_ONDISK_SEGMENT: BENCHMARK_TYPES.name for every type
  (was hardcoded 'vdb_bench' for vector_database)
- submission_structure_checks: leaf-walker now expects
  results/<sys>/vector_database/<DisplayIndex>/code/
- vdb_checks: all 18 `self.mode != "vdb_bench"` rule guards now compare
  against "vector_database" (loader yields the directory name as mode)
- Rules.md S2.1.5.b, S2.1.27 diagrams, S5.3.1, S5.6.3 updated to refer
  to vector_database
- All test fixtures and assertions across mlpstorage_py/tests/ updated
- test_generate_output_location: vector_database tests now thread
  vdb_engine and expect the engine/index layered path
…split

End users found the dual-vocabulary (UPPERCASE token vs mixed-case display
spelling) confusing. Collapse to a single UPPERCASE convention everywhere
index names appear: the CLI (--index-type), summary.json.index_type, the
on-disk directory name, and the path-generation/validation code.

- Remove INDEX_TYPE_TOKEN_TO_DIR and INDEX_TYPE_DIR_TO_TOKEN constants
  from mlpstorage_py/config.py
- generate_output_location: drop the token->display mapping; write
  args.index_type / args.vdb_index directly under <engine>/
- code_image.capture_or_verify_code_image: drop the vector-database
  special-case mapping; leaf segment is just args.index_type
- VdbCheck.vdb_closed_index_types (Rules 5.6.3): compare dir name
  directly against UPPERCASE token; no DIR_TO_TOKEN lookup
- Rules.md: drop the S5.6 dual-representation callout; S2.1.5.b,
  S2.1.27 diagrams, S5.3.1 prose all refer to the UPPERCASE token
- All affected tests updated to expect UPPERCASE (DISKANN/HNSW/AISAQ)
  in paths; _build_vdb_leaf fixture collapses display+token args
  into a single index_type arg
The CLI gates --vdb-index / --vdb-engine / --model / --command behind
argparse choices= and a redundant post-parse allowlist check in
validate_vectordb_arguments. But generate_output_location itself did no
validation — a programmatic caller (test fixture, future internal API,
history-replay path that someday skips revalidation) feeding
'../etc' or '/abs' as orgname / systemname / vdb_index / model / etc.
would land in a path-traversal directory via os.path.join.

Close the gap at the trust boundary with one helper:

    _SAFE_PATH_COMPONENT_RE = re.compile(r"^[A-Za-z0-9._-]+$")
    def _check_safe_path_component(name, value):
        # reject '.' and '..' explicitly; reject anything not matching
        # POSIX-safe single-segment charset.

Applied to every os.path.join arg in generate_output_location that
isn't either user-supplied path root (results_dir) or internally
constant (BENCHMARK_TYPES.name, "results" literal):
  orgname, systemname, model, vdb_engine, vdb_index, command, datetime_str.

Adds 21 parametrized negative tests in test_generate_output_location.py
locking in the invariant for each field.
Aligns with the established convention of bumping the patch version one
above origin/main when a PR adds new behavior.

uv.lock regenerated via bare `uv lock` (no transitive drift). The format
upgrade to revision=3 (adds upload-time metadata to each package) is a
side effect of the local uv version writing the newer lockfile format;
no resolved versions changed.
@FileSystemGuy

Copy link
Copy Markdown
Contributor Author

@idevasena I think I've gotten my (big) code change layered over the top of your PR#476 without changing any of the behavior you intended to get from that PR. Please review...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants