Stage 2 (ADR 0004): repr-align data-collection contract layer by FluffyAIcode · Pull Request #23 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-05-30T08:36:56Z

Summary

First PR of the v0.3.x representation-alignment development line that opens
right after v0.3.0-rc1. Ships the data-collection contract layer —
the schema, prompt pool, and atomic versioned-Parquet writer that downstream
rollout-worker / trainer / eval code will plug into.

This PR is intentionally narrow. It is the foundation, not the implementation
of training itself. Subsequent PRs in this line:

- PR 2: rollout_worker.py (loads real verifier, drives generation, captures hidden states) plus post_filter.py (low-confidence / repetition / EOS-fail rules per ADR 0004 §2.2).
- PR 3: 7 per-domain config YAMLs (chat-en / chat-zh / code / math / long-context / multi-turn / tool-call) per ADR 0004 §2.1 quotas.
- PR 4 (Stage 3): trainer.py consuming the schema this PR locks in.
- PR 5 (Stage 4): eval.py reporting the §2.7 / §2.8 metrics.

What's new

`training/repr_align/data_collection/` (4 modules, 763 lines)

| Module | Lines | Purpose |
| --- | --- | --- |
| `schema.py` | 310 | `RolloutMeta`, `RolloutRow`, pyarrow `Schema` builder, `SCHEMA_VERSION = 1` |
| `prompt_pool.py` | 369 | Multi-domain composition with quotas, length filter, language tagging, dedup. Pluggable `LanguageDetector` / `Deduper` Protocols with dependency-free reference impls (`CharRatioLanguageDetector`, `ShingleJaccardDeduper`) |
| `parquet_writer.py` | 289 | Atomic versioned shard writer at `data/alignment/<verifier_id>/<verifier_dtype>/<schema_version>/shard_NNNNN.parquet` per ADR 0004 §2.5. Tmp+rename for parquet AND meta sidecar |
| `__init__.py` | 95 | Package re-exports + Stage 2 docstring |
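The tmp+rename pattern named in the table can be pictured with a minimal, standard-library-only sketch. This is not the code in `parquet_writer.py`: the helper names `shard_path` and `atomic_write_bytes` are hypothetical, the payload is raw bytes standing in for a serialized Parquet shard, and rendering `<schema_version>` as a bare integer is an assumption.

```python
import os
import tempfile
from pathlib import Path


def shard_path(root: Path, verifier_id: str, verifier_dtype: str,
               schema_version: int, shard_index: int) -> Path:
    """Build a shard path following the layout quoted in the table."""
    # Assumption: an org/name verifier id maps onto nested directories as-is,
    # and the schema version directory is the plain integer.
    return (root / "data" / "alignment" / verifier_id / verifier_dtype
            / str(schema_version) / f"shard_{shard_index:05d}.parquet")


def atomic_write_bytes(final_path: Path, payload: bytes) -> None:
    """Write payload so a reader sees either no file or the complete file."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file is created in the destination directory so that the
    # final rename never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the finished temp file into place in one step.
        os.replace(tmp_name, final_path)
    except BaseException:
        # Abort path: drop the partial temp file, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A writer built this way would apply the same helper twice per shard, once for the Parquet file and once for the meta sidecar, so neither file is ever visible half-written.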

The capture row schema mirrors ADR 0004 §2.2 exactly:

prompt_id, domain, language, system_prompt_hash,
sequence_index, position_in_sequence, position_in_block, block_index,
cache_logical_size, token_id,
top_token_ids[K], top_probs[K], hidden_state[H],
verifier_top1_prob   (derived)

top_token_ids, top_probs and hidden_state are fixed-size pyarrow
lists (sizes from per-shard meta) so the trainer can vector-load batches
without per-row shape checks.
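The fixed-size invariant can be illustrated without pyarrow. The sketch below is hypothetical: `ShardShape` and `RowSketch` are illustrative names, only a subset of the columns is modelled, and deriving `verifier_top1_prob` as the largest top-K probability is an assumption about what "derived" means here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ShardShape:
    """Fixed list sizes for one shard (stand-in for the per-shard meta)."""
    top_k: int        # K: length of top_token_ids and top_probs
    hidden_size: int  # H: length of hidden_state


@dataclass(frozen=True)
class RowSketch:
    """A subset of the capture row; the real RolloutRow carries more columns."""
    token_id: int
    top_token_ids: Sequence[int]
    top_probs: Sequence[float]
    hidden_state: Sequence[float]

    def check_shape(self, shape: ShardShape) -> None:
        """Raise if any list column disagrees with the shard's fixed sizes."""
        expected = {
            "top_token_ids": shape.top_k,
            "top_probs": shape.top_k,
            "hidden_state": shape.hidden_size,
        }
        for name, size in expected.items():
            actual = len(getattr(self, name))
            if actual != size:
                raise ValueError(f"{name}: expected {size} items, got {actual}")

    @property
    def verifier_top1_prob(self) -> float:
        # Assumption: the derived column is the largest top-K probability.
        return max(self.top_probs)
```

Checking sizes once at write time is what lets a reader treat every row in a shard as having the same shape.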

`tests/training/repr_align/data_collection/` (3 test files, 1050 lines)

| Test file | Tests | Coverage target |
| --- | --- | --- |
| `test_schema.py` | 41 | every dataclass validator branch + pyarrow round-trip |
| `test_prompt_pool.py` | 33 | every filter / quota / dedup / RNG-seed / underflow path |
| `test_parquet_writer.py` | 23 | atomic-rename, abort-on-exception, idempotent-close, missing-tempfile path, list/next-shard helpers |
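The quota, RNG-seed and underflow paths listed for `test_prompt_pool.py` can be pictured with a small sketch. It is an assumption-laden illustration, not the `prompt_pool.py` implementation: `compose_pool` and its arguments are hypothetical names, and raising on underflow is inferred from the PR's "no fallback" policy.

```python
import random
from typing import Dict, List


def compose_pool(prompts_by_domain: Dict[str, List[str]],
                 quotas: Dict[str, int],
                 seed: int) -> Dict[str, List[str]]:
    """Draw exactly quotas[domain] prompts per domain, reproducibly."""
    rng = random.Random(seed)  # private RNG, so the global state is untouched
    pool: Dict[str, List[str]] = {}
    # Sorted iteration keeps the draw order independent of dict ordering.
    for domain in sorted(quotas):
        candidates = prompts_by_domain.get(domain, [])
        quota = quotas[domain]
        if len(candidates) < quota:
            # Underflow: refuse rather than silently return a short pool.
            raise ValueError(
                f"domain {domain!r}: need {quota} prompts, have {len(candidates)}"
            )
        pool[domain] = rng.sample(candidates, quota)
    return pool
```

Two calls with the same inputs and seed return the same pool, which is the kind of structural property a test can assert without pinning specific prompts.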

`training/repr_align/__init__.py` (PEP 562 lazy import)

Refactored to expose ReprAlignedSurgery / SurgeryConfig via __getattr__
instead of an eager import. Reasons:

- proposer_surgery pulls in torch + transformers (heavy).
- The new data_collection subpackage is intentionally torch-free. Without lazy imports, importing training.repr_align.data_collection.schema would still pay the full torch import cost via the parent package.
- On this VM, torch + coverage segfaults during pytest collection (SystemError: bad call flags from torch's PyMethodDef under coverage's tracer). Lazy import lets the new tests run with coverage cleanly.

Public API unchanged: training.repr_align.ReprAlignedSurgery still works.
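For readers unfamiliar with PEP 562, the mechanism can be sketched in a self-contained way. This is not the project's `__init__.py`: the helper `make_lazy_package` is hypothetical, and `fractions.Fraction` stands in for the heavy `ReprAlignedSurgery` import. In a real package the two hook functions would simply be defined at the top level of `__init__.py`.

```python
import importlib
import types
from typing import Dict


def make_lazy_package(name: str, lazy: Dict[str, str]) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access.

    `lazy` maps an exported attribute name to the module that defines it.
    """
    pkg = types.ModuleType(name)

    def __getattr__(attr: str):
        # Only called when normal attribute lookup on the module fails.
        try:
            module_name = lazy[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), attr)
        setattr(pkg, attr, value)  # cache so the hook runs once per name
        return value

    def __dir__():
        # Advertise lazy names alongside the ordinary module attributes.
        return sorted(set(vars(pkg)) | set(lazy))

    # PEP 562 looks these hooks up in the module's own namespace.
    pkg.__getattr__ = __getattr__
    pkg.__dir__ = __dir__
    return pkg


demo = make_lazy_package("demo_pkg", {"Fraction": "fractions"})
half = demo.Fraction(1, 2)  # the import of `fractions` happens here
```

Because the resolved object is cached on the module, later accesses bypass the hook entirely.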

`requirements.txt`

Added pyarrow>=15,<25 for the fixed-size-list schema enforcement.

Verification

$ pytest tests/training/repr_align/data_collection/ \
    --cov=training.repr_align.data_collection \
    --cov-report=term-missing --confcutdir=tests/training -q
.....................................................................................
97 passed in 0.23s
training/repr_align/data_collection/__init__.py             4    100%
training/repr_align/data_collection/parquet_writer.py     129    100%
training/repr_align/data_collection/prompt_pool.py        155    100%
training/repr_align/data_collection/schema.py             115    100%
TOTAL                                                     403    100%
Required test coverage of 100.0% reached. Total coverage: 100.00%

$ pytest tests/training/repr_align/test_proposer_surgery.py -q
48 passed   (no regression from the lazy-import refactor)

$ pytest tests/inference_engine/ tests/training/ -q
529 passed

Coverage runs use --confcutdir=tests/training to skip the torch-importing
top-level tests/conftest.py (this is a local-VM workaround for a torch +
coverage Python-3.12 METH-flags interaction; it does not affect CI on
the standard pytest invocation).

Quality bars

- 100% line coverage on the new module.
- No mocks: every test uses real concrete classes (AlternatingRejectDeduper, ShingleJaccardDeduper, CharRatioLanguageDetector, real tmp_path filesystem). The few "test doubles" implement the published Protocol interfaces structurally, exactly as ADR 0001 §3 specifies for the proposer/verifier doubles.
- No fallback: every validation path raises rather than silently defaulting (verifier_id must match org/name, dtype must be a literal, schema_version mismatch refuses to load, etc.).
- No overfit: tests assert structural invariants (fixed-size list sizes, atomic file appearance, quota counts, RNG reproducibility), not specific token/prob values that would couple to upstream changes.
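The "no fallback" rule can be made concrete with a short sketch of raise-on-invalid metadata. It is illustrative only: `MetaSketch` is a hypothetical stand-in for `RolloutMeta`, the set of allowed dtype literals is an assumption, and the exact org/name pattern is a guess at what the real validator accepts.

```python
import re
from dataclasses import dataclass

SCHEMA_VERSION = 1  # matches the value stated in this PR

# Assumption: "org/name" means two non-empty segments split by one slash.
_VERIFIER_ID = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$")
# Assumption: the accepted dtype literals; the real list may differ.
_ALLOWED_DTYPES = frozenset({"bf16", "fp16", "fp32"})


@dataclass(frozen=True)
class MetaSketch:
    verifier_id: str
    verifier_dtype: str
    schema_version: int

    def __post_init__(self) -> None:
        # Every check raises; nothing is coerced or defaulted.
        if not _VERIFIER_ID.match(self.verifier_id):
            raise ValueError(f"verifier_id must be org/name, got {self.verifier_id!r}")
        if self.verifier_dtype not in _ALLOWED_DTYPES:
            raise ValueError(f"unsupported verifier_dtype {self.verifier_dtype!r}")
        if self.schema_version != SCHEMA_VERSION:
            raise ValueError(
                f"schema_version {self.schema_version} != {SCHEMA_VERSION}; refusing to load"
            )
```

Validating in `__post_init__` means an invalid meta object can never exist, so downstream code does not need to re-check it.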

Out of scope explicitly

- No real verifier rollout; that's the next PR.
- No HF dataset adapters; those land alongside the per-domain configs.
- No torch / transformers dependency in this layer.
- No personal / per-user data path; covered by the planned ADR 0005 personal layer.

References

- ADR 0001: docs/adr/0001-proposer-sizing-and-alignment.md §4 (Stage roadmap)
- ADR 0004: docs/adr/0004-alignment-training-data-preparation-policy.md §2.1 (prompt pool), §2.2 (capture spec), §2.5 (path layout)
- v0.3.0-rc1 release notes: git show v0.3.0-rc1

Make training/repr_align/__init__.py expose ReprAlignedSurgery and SurgeryConfig via __getattr__ instead of an eager top-of-module import.

The eager import paid a torch + transformers import cost on every submodule access, which:

- blocked importing torch-free siblings (incoming training.repr_align.data_collection package)
- made unit-test collection segfault on this VM under coverage instrumentation (torch's PyMethodDef + coverage tracer disagree on Python 3.12 METH flags)

Lazy-import via PEP 562 keeps the public API stable (training.repr_align.ReprAlignedSurgery still works) while letting new torch-free subpackages be imported without paying the heavy import cost.

Verified:

- python3 -c "import training.repr_align.data_collection.schema; import sys; assert 'torch' not in sys.modules" → torch not loaded
- python3 -c "import training.repr_align as ra; ra.ReprAlignedSurgery" → lazy attribute resolves correctly
- pytest tests/training/repr_align/test_proposer_surgery.py → 48 passed (no regression)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Ship the contract layer for representation-alignment training data collection. This is the foundation that downstream PRs (rollout worker + per-domain configs) plug into.

Scope intentionally narrow:

- schema: single source of truth for the per-token row + per-shard meta + pyarrow Schema. SCHEMA_VERSION = 1.
- prompt_pool: multi-domain composition with quotas, length filter, language tagging, dedup. Pluggable protocols (LanguageDetector, Deduper) with dependency-free reference impls.
- parquet_writer: atomic versioned shard writer at 'data/alignment/<verifier_id>/<verifier_dtype>/<schema_version>/shard_NNNNN.parquet' per ADR 0004 §2.5. Writes meta sidecar + parquet atomically (tmp+rename) so readers never see a half-written shard.

Out of scope (next PR in this work line):

- rollout_worker.py (loads real verifier, drives generation, captures hidden states); needs torch + transformers and is where the heavy testing surface lives.
- configs/*.yaml (7 per-domain quota / source configs).
- post_filter.py (low-confidence + repetition + EOS-fail filters).

Quality bars (per project policy):

- 100% line coverage on the new module:
  - schema.py 115/115 lines
  - prompt_pool.py 155/155 lines
  - parquet_writer.py 129/129 lines
  - __init__.py 4/4 lines
- 97 unit tests, all real concrete classes (no mocks). Filter, dedup and pool flows are tested with deterministic test doubles that implement the public Protocol interfaces.
- No torch dependency: data_collection imports cleanly without pulling in training.repr_align.proposer_surgery (relies on the PEP 562 lazy-import shipped in the previous commit).

Deps:

- pyarrow>=15,<25 added to requirements.txt for the fixed-size-list schema enforcement.

References:

- docs/adr/0004-alignment-training-data-preparation-policy.md §2.1 (prompt pool composition), §2.2 (capture spec), §2.5 (verifier-specific data isolation + path layout)
- training/repr_align/__init__.py (Stage roadmap)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
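The pluggable Deduper protocol mentioned in this commit can be pictured with a structural-typing sketch. The names below are hypothetical: the method `is_duplicate`, the default shingle size and the threshold are assumptions for illustration, and `ShingleJaccardSketch` is not the shipped `ShingleJaccardDeduper`.

```python
from typing import List, Protocol, Set


class Deduper(Protocol):
    """Structural interface: any class with this method qualifies."""

    def is_duplicate(self, text: str) -> bool: ...


class ShingleJaccardSketch:
    """Flags a text whose word-shingle set overlaps an earlier one too much."""

    def __init__(self, shingle_size: int = 3, threshold: float = 0.8) -> None:
        self._n = shingle_size
        self._threshold = threshold
        self._seen: List[Set[str]] = []

    def _shingles(self, text: str) -> Set[str]:
        tokens = text.lower().split()
        if len(tokens) < self._n:
            # Short texts collapse to a single shingle.
            return {" ".join(tokens)}
        return {" ".join(tokens[i:i + self._n])
                for i in range(len(tokens) - self._n + 1)}

    def is_duplicate(self, text: str) -> bool:
        current = self._shingles(text)
        for previous in self._seen:
            # Jaccard similarity: |intersection| / |union|.
            similarity = len(current & previous) / len(current | previous)
            if similarity >= self._threshold:
                return True
        self._seen.append(current)  # only novel texts are remembered
        return False


def keep_unique(prompts: List[str], deduper: Deduper) -> List[str]:
    """Filter prompts through any object satisfying the Deduper protocol."""
    return [p for p in prompts if not deduper.is_duplicate(p)]
```

Because the protocol is structural, a deterministic test double only needs to provide the same method; it does not have to inherit from anything.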

CI on PR #23 reported 99.68% coverage with 5 statements missing on training/repr_align/__init__.py. The lazy __getattr__ + __dir__ hooks are unreachable from the existing test_proposer_surgery.py because that file imports proposer_surgery directly, never going through the package-level lazy attribute machinery.

Added 4 tests that exercise the contract:

- lazy attr resolves to ReprAlignedSurgery (and is the same object as a direct import from proposer_surgery)
- lazy attr resolves to SurgeryConfig
- unknown attr raises AttributeError with the documented message
- dir() lists both public symbols and ordinary module attrs

Verified locally: tests/training/ runs 149 tests, all passing.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 3 commits May 30, 2026 08:35

FluffyAIcode wants to merge 3 commits into main from AgentMemory/repr-align-data-prep-8e7f

2 participants
