
Stage 2 (ADR 0004): repr-align data-collection contract layer#23

Draft
FluffyAIcode wants to merge 3 commits into main from AgentMemory/repr-align-data-prep-8e7f

Conversation

@FluffyAIcode
Owner

Summary

First PR of the v0.3.x representation-alignment development line that opens
right after v0.3.0-rc1. Ships the data-collection contract layer: the schema,
prompt pool, and atomic versioned-Parquet writer that downstream
rollout-worker / trainer / eval code will plug into.

This PR is intentionally narrow. It is the foundation, not the implementation
of training itself. Subsequent PRs in this line:

  • PR 2: rollout_worker.py (loads real verifier, drives generation,
    captures hidden states) + post_filter.py (low-confidence / repetition /
    EOS-fail rules per ADR 0004 §2.2).
  • PR 3: 7 per-domain config YAMLs (chat-en / chat-zh / code / math /
    long-context / multi-turn / tool-call) per ADR 0004 §2.1 quotas.
  • PR 4 (Stage 3): trainer.py consuming the schema this PR locks in.
  • PR 5 (Stage 4): eval.py reporting the §2.7 / §2.8 metrics.

What's new

training/repr_align/data_collection/ (4 modules, 763 lines)

| Module | Lines | Purpose |
| --- | --- | --- |
| schema.py | 310 | RolloutMeta, RolloutRow, pyarrow Schema builder, SCHEMA_VERSION = 1 |
| prompt_pool.py | 369 | Multi-domain composition with quotas, length filter, language tagging, dedup. Pluggable LanguageDetector / Deduper Protocols with dependency-free reference impls (CharRatioLanguageDetector, ShingleJaccardDeduper) |
| parquet_writer.py | 289 | Atomic versioned shard writer at data/alignment/<verifier_id>/<verifier_dtype>/<schema_version>/shard_NNNNN.parquet per ADR 0004 §2.5. Tmp+rename for parquet AND meta sidecar |
| __init__.py | 95 | Package re-exports + Stage 2 docstring |
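
The tmp+rename pattern named for parquet_writer.py can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's writer: `shard_path`, `write_shard_atomically`, the `.meta.json` sidecar name, and the sidecar-before-shard ordering are all hypothetical, and the payload is treated as opaque bytes rather than a real Parquet file.

```python
import json
import os
import tempfile
from pathlib import Path


def shard_path(root: Path, verifier_id: str, verifier_dtype: str,
               schema_version: int, shard_index: int) -> Path:
    """Build a path in the layout quoted above (ADR 0004 §2.5)."""
    return (root / "data" / "alignment" / verifier_id / verifier_dtype
            / str(schema_version) / f"shard_{shard_index:05d}.parquet")


def _atomic_write(target: Path, payload: bytes) -> None:
    """Write to a temp file in the target directory, then rename over target."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # which is why the temp file lives next to the target.
        os.replace(tmp_name, target)
    except BaseException:
        # Abort path: remove the temp file so no partial shard is left behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_shard_atomically(target: Path, parquet_bytes: bytes, meta: dict) -> None:
    """Assumed ordering: sidecar first, so a visible shard always has its meta."""
    sidecar = target.with_suffix(".meta.json")  # hypothetical sidecar name
    _atomic_write(sidecar, json.dumps(meta, sort_keys=True).encode("utf-8"))
    _atomic_write(target, parquet_bytes)
```

Because the final name only appears through the rename, a reader listing the shard directory sees either the complete file or nothing.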

The capture row schema mirrors ADR 0004 §2.2 exactly:

prompt_id, domain, language, system_prompt_hash,
sequence_index, position_in_sequence, position_in_block, block_index,
cache_logical_size, token_id,
top_token_ids[K], top_probs[K], hidden_state[H],
verifier_top1_prob   (derived)

top_token_ids, top_probs and hidden_state are fixed-size pyarrow lists
(sizes from per-shard meta) so the trainer can vector-load batches
without per-row shape checks.

tests/training/repr_align/data_collection/ (3 test files, 1050 lines)

| Test file | Tests | Coverage target |
| --- | --- | --- |
| test_schema.py | 41 | every dataclass validator branch + pyarrow round-trip |
| test_prompt_pool.py | 33 | every filter / quota / dedup / RNG-seed / underflow path |
| test_parquet_writer.py | 23 | atomic-rename, abort-on-exception, idempotent-close, missing-tempfile path, list/next-shard helpers |
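
To make the quota / RNG-seed / underflow paths concrete, here is a small sketch of seeded per-domain sampling. `compose_pool` and its argument shapes are hypothetical and only illustrate the behaviour the tests are described as covering; they are not the prompt_pool.py API.

```python
import random


def compose_pool(candidates, quotas, seed):
    """Sample exactly quotas[domain] prompts per domain, reproducibly.

    candidates: dict mapping domain -> list of prompt strings
    quotas:     dict mapping domain -> required prompt count
    """
    rng = random.Random(seed)  # local RNG, global random state is untouched
    pool = []
    for domain in sorted(quotas):  # fixed iteration order keeps runs identical
        available = candidates.get(domain, [])
        need = quotas[domain]
        if len(available) < need:
            # Underflow raises instead of quietly returning fewer prompts.
            raise ValueError(
                f"domain {domain!r}: need {need}, only {len(available)} available"
            )
        pool.extend((domain, prompt) for prompt in rng.sample(available, need))
    return pool
```

Calling it twice with the same seed and inputs returns the same list, which is the kind of structural invariant a test can assert without depending on specific prompt contents.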

training/repr_align/__init__.py (PEP 562 lazy import)

Refactored to expose ReprAlignedSurgery / SurgeryConfig via __getattr__
instead of an eager import. Reasons:

  1. proposer_surgery pulls in torch + transformers (heavy).
  2. The new data_collection subpackage is intentionally torch-free.
    Without lazy imports, importing training.repr_align.data_collection.schema
    would still pay the full torch import cost via the parent package.
  3. On this VM, torch+coverage segfault during pytest collection
    (SystemError: bad call flags from torch's PyMethodDef under
    coverage's tracer). Lazy import lets the new tests run with
    coverage cleanly.

Public API unchanged: training.repr_align.ReprAlignedSurgery still works.
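
The PEP 562 mechanism can be sketched as follows. To keep the example runnable anywhere, it builds a throwaway module object and lazily resolves standard-library stand-ins (`fractions.Fraction`, `decimal.Decimal`) instead of the real proposer_surgery symbols; `make_lazy_package` and `_LAZY_ATTRS` are hypothetical names. In a real `__init__.py`, `__getattr__` and `__dir__` would simply be defined at module level.

```python
import importlib
import types

# Public name -> (module path, attribute). Stand-ins for the heavy imports.
_LAZY_ATTRS = {
    "Fraction": ("fractions", "Fraction"),
    "Decimal": ("decimal", "Decimal"),
}


def make_lazy_package(name):
    """Return a module whose listed attributes are imported on first access."""
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        try:
            module_path, target = _LAZY_ATTRS[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(module_path), target)
        setattr(pkg, attr, value)  # cache so later lookups skip this hook
        return value

    def __dir__():
        # Advertise lazy names alongside ordinary module attributes.
        return sorted(set(vars(pkg)) | set(_LAZY_ATTRS))

    pkg.__getattr__ = __getattr__
    pkg.__dir__ = __dir__
    return pkg


pkg = make_lazy_package("demo_pkg")
lazy_fraction = pkg.Fraction  # the import of `fractions` is triggered here
```

The underlying module is imported only when one of the lazy names is first touched, so importing a sibling subpackage does not pay for it.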

requirements.txt

Added pyarrow>=15,<25 for the fixed-size-list schema enforcement.

Verification

$ pytest tests/training/repr_align/data_collection/ \
    --cov=training.repr_align.data_collection \
    --cov-report=term-missing --confcutdir=tests/training -q
.....................................................................................
97 passed in 0.23s
training/repr_align/data_collection/__init__.py             4    100%
training/repr_align/data_collection/parquet_writer.py     129    100%
training/repr_align/data_collection/prompt_pool.py        155    100%
training/repr_align/data_collection/schema.py             115    100%
TOTAL                                                     403    100%
Required test coverage of 100.0% reached. Total coverage: 100.00%

$ pytest tests/training/repr_align/test_proposer_surgery.py -q
48 passed   (no regression from the lazy-import refactor)

$ pytest tests/inference_engine/ tests/training/ -q
529 passed

Coverage runs use --confcutdir=tests/training to skip the torch-importing
top-level tests/conftest.py (this is a local-VM workaround for a torch +
coverage Python-3.12 METH-flags interaction; it does not affect CI on
the standard pytest invocation).

Quality bars

  • 100% line coverage on the new module.
  • No mocks: every test uses real concrete classes (AlternatingRejectDeduper,
    ShingleJaccardDeduper, CharRatioLanguageDetector, real tmp_path
    filesystem). The few "test doubles" implement the published Protocol
    interfaces structurally, exactly as ADR 0001 §3 specifies for the
    proposer/verifier doubles.
  • No fallback: every validation path raises rather than silently
    defaulting (verifier_id must match org/name, dtype must be a
    literal, schema_version mismatch refuses to load, etc.).
  • No overfit: tests assert structural invariants (fixed-size list
    sizes, atomic file appearance, quota counts, RNG reproducibility),
    not specific token/prob values that would couple to upstream changes.
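
The "no mocks" bar relies on structural typing: a test double is any class with the right method shapes. The sketch below assumes a one-method `Deduper` protocol (`is_duplicate`), a word-shingle Jaccard reference implementation, and an alternating-reject double whose behaviour is guessed from its name. The class names come from this PR; the method name, signatures, and thresholds are assumptions.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Deduper(Protocol):
    def is_duplicate(self, text: str) -> bool: ...


class ShingleJaccardDeduper:
    """Flags text whose word-shingle Jaccard similarity to a kept text is high."""

    def __init__(self, shingle_size: int = 3, threshold: float = 0.8):
        if shingle_size < 1:
            raise ValueError("shingle_size must be >= 1")
        if not 0.0 < threshold <= 1.0:
            raise ValueError("threshold must be in (0, 1]")
        self._k = shingle_size
        self._threshold = threshold
        self._seen = []

    def _shingles(self, text: str) -> frozenset:
        tokens = text.split()
        if len(tokens) < self._k:
            return frozenset([tuple(tokens)])  # short text: one shingle
        return frozenset(
            tuple(tokens[i:i + self._k])
            for i in range(len(tokens) - self._k + 1)
        )

    def is_duplicate(self, text: str) -> bool:
        shingles = self._shingles(text)
        for prior in self._seen:
            union = len(shingles | prior)
            if union and len(shingles & prior) / union >= self._threshold:
                return True
        self._seen.append(shingles)  # only non-duplicates are remembered
        return False


class AlternatingRejectDeduper:
    """Deterministic double: reports every second call as a duplicate."""

    def __init__(self):
        self._calls = 0

    def is_duplicate(self, text: str) -> bool:
        self._calls += 1
        return self._calls % 2 == 0
```

Neither class inherits from `Deduper`; both satisfy it by shape alone, so the pool code can be exercised with a deterministic double without patching anything.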

Out of scope explicitly

  • No real verifier rollout — that's the next PR.
  • No HF dataset adapters — those land alongside the per-domain configs.
  • No torch / transformers dependency in this layer.
  • No personal / per-user data path — covered by the planned ADR 0005
    personal layer.

References

  • ADR 0001 — docs/adr/0001-proposer-sizing-and-alignment.md §4 (Stage roadmap)
  • ADR 0004 — docs/adr/0004-alignment-training-data-preparation-policy.md
    §2.1 (prompt pool), §2.2 (capture spec), §2.5 (path layout)
  • v0.3.0-rc1 release notes — git show v0.3.0-rc1

cursoragent and others added 3 commits May 30, 2026 08:35
Make training/repr_align/__init__.py expose ReprAlignedSurgery and
SurgeryConfig via __getattr__ instead of an eager top-of-module
import. The eager import paid a torch + transformers import cost
on every submodule access, which:

  - blocked importing torch-free siblings (incoming
    training.repr_align.data_collection package)
  - made unit-test collection segfault on this VM under
    coverage instrumentation (torch's PyMethodDef + coverage
    tracer disagree on Python 3.12 METH flags)

Lazy-import via PEP 562 keeps the public API stable
(training.repr_align.ReprAlignedSurgery still works) while letting
new torch-free subpackages be imported without paying the heavy
import cost.

Verified:
  python3 -c "import training.repr_align.data_collection.schema; import sys; assert 'torch' not in sys.modules"
  → torch not loaded
  python3 -c "import training.repr_align as ra; ra.ReprAlignedSurgery"
  → lazy attribute resolves correctly

  pytest tests/training/repr_align/test_proposer_surgery.py
  → 48 passed (no regression)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Ship the contract layer for representation-alignment training data
collection. This is the foundation that downstream PRs (rollout
worker + per-domain configs) plug into.

Scope intentionally narrow:
  - schema   — single source of truth for the per-token row +
               per-shard meta + pyarrow Schema. SCHEMA_VERSION = 1.
  - prompt_pool — multi-domain composition with quotas, length
                  filter, language tagging, dedup. Pluggable
                  protocols (LanguageDetector, Deduper) with
                  dependency-free reference impls.
  - parquet_writer — atomic versioned shard writer at
                     'data/alignment/<verifier_id>/<verifier_dtype>/
                     <schema_version>/shard_NNNNN.parquet' per ADR
                     0004 §2.5. Writes meta sidecar + parquet
                     atomically (tmp+rename) so readers never see
                     a half-written shard.

Out of scope (next PR in this work line):
  - rollout_worker.py (loads real verifier, drives generation,
    captures hidden states) — needs torch + transformers and is
    where the heavy testing surface lives.
  - configs/*.yaml (7 per-domain quota / source configs).
  - post_filter.py (low-confidence + repetition + EOS-fail filters).

Quality bars (per project policy):
  - 100% line coverage on the new module:
      schema.py            115/115 lines
      prompt_pool.py       155/155 lines
      parquet_writer.py    129/129 lines
      __init__.py            4/4 lines
  - 97 unit tests, all real concrete classes (no mocks). Filter,
    dedup and pool flows are tested with deterministic test
    doubles that implement the public Protocol interfaces.
  - No torch dependency: data_collection imports cleanly without
    pulling in training.repr_align.proposer_surgery (relies on
    the PEP 562 lazy-import shipped in the previous commit).

Deps:
  - pyarrow>=15,<25 added to requirements.txt for the fixed-size-
    list schema enforcement.

References:
  - docs/adr/0004-alignment-training-data-preparation-policy.md
    §2.1 (prompt pool composition), §2.2 (capture spec),
    §2.5 (verifier-specific data isolation + path layout)
  - training/repr_align/__init__.py (Stage roadmap)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

CI on PR #23 reported 99.68% coverage with 5 statements missing on
training/repr_align/__init__.py — the lazy __getattr__ + __dir__
hooks are unreachable from the existing test_proposer_surgery.py
because that file imports proposer_surgery directly, never going
through the package-level lazy attribute machinery.

Added 4 tests that exercise the contract:
  - lazy attr resolves to ReprAlignedSurgery (and is the same
    object as a direct import from proposer_surgery)
  - lazy attr resolves to SurgeryConfig
  - unknown attr raises AttributeError with the documented message
  - dir() lists both public symbols and ordinary module attrs

Verified locally: all 149 tests under tests/training/ pass.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
2 participants