feat(training): model zoo — declarative variant specs + train driver (L4488c)#231
Merged
…(L4488c)
Item 3 of the model-rotation scaffolding arc (L4488); the experiment/model-spec
layer (SOTA pillar 2). A "model spec" is a config OVERLAY over the existing
training knobs; running a spec trains ONE variant and registers it as a
CHALLENGER (capture-gap; challenger-first never overwrites the champion), which
the shadow runner shadows and the net-of-cost scorer (L4488b) ranks. Lets a
variety of models be rotated in/out as experiments are run.
Deliberately a THIN spec-overlay, NOT a generic ML platform: reuses
train_handler.main() unchanged, applying a spec's overrides around the call via
a save/restore context (the knobs are module-level cfg constants read via cfg.X
at call time — verified no bare imports). No config-object refactor of the
trainer.
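A minimal sketch of the save/restore idea described above, not the actual training/model_zoo.py code. The cfg stand-in, the _MISSING sentinel, and the ALLOWED_OVERRIDES name are assumptions made for illustration; only the four allowlisted keys and the stated behaviours (fail loud on disallowed keys, restore on exception, remove previously-absent attrs) come from this PR.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for the real config module (assumption: the real cfg is a module
# whose attributes are read as cfg.X at call time).
cfg = SimpleNamespace(FORWARD_DAYS=20, RESIDUAL_MOMENTUM_ENABLED=False)

# Allowlist as stated in this PR.
ALLOWED_OVERRIDES = frozenset({
    "FORWARD_DAYS",
    "RESIDUAL_MOMENTUM_ENABLED",
    "XSEC_DEMEAN_ALPHA_ENABLED",
    "MODEL_VERSION_LABEL",
})

_MISSING = object()  # sentinel: the attribute did not exist before the override


@contextmanager
def spec_overrides(config, overrides):
    """Temporarily set allowlisted attributes on config, then restore them."""
    disallowed = sorted(set(overrides) - ALLOWED_OVERRIDES)
    if disallowed:
        # Fail loud before touching any attribute.
        raise ValueError(f"override keys not allowlisted: {disallowed}")
    # Snapshot current values; absent attributes are recorded as _MISSING.
    saved = {key: getattr(config, key, _MISSING) for key in overrides}
    try:
        for key, value in overrides.items():
            setattr(config, key, value)
        yield config
    finally:
        # Runs on normal exit and when the body raises.
        for key, previous in saved.items():
            if previous is _MISSING:
                if hasattr(config, key):
                    delattr(config, key)  # absent before: remove it again
            else:
                setattr(config, key, previous)
```

Under these assumptions, a driver would wrap the unchanged trainer call in `with spec_overrides(cfg, spec_overrides_dict): ...`, so the trainer sees the variant's values only for the duration of that call.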
- training/model_zoo.py: resolve_spec / spec_overrides (allowlisted save+restore,
restores on exception, removes previously-absent attrs) / train_spec /
train_all_active (sequential; one spec's failure never aborts the rest) + CLI
(--spec / --all-active / --list). Override allowlist = {FORWARD_DAYS,
RESIDUAL_MOMENTUM_ENABLED, XSEC_DEMEAN_ALPHA_ENABLED, MODEL_VERSION_LABEL};
disallowed keys fail loud.
- meta_trainer: manifest/feature_list/summary version -> MODEL_VERSION_LABEL
(default v3.0-meta) so each spec registers under its own version_id
({label}-{date}-{fp}) — distinct challengers on the leaderboard.
- config: MODEL_SPECS (list, default []) + MODEL_VERSION_LABEL; sample.yaml
documents the model_specs schema.
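The "sequential; one spec's failure never aborts the rest" behaviour listed above could look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: the spec field names ("name", "status"), the train_one callable, and the shape of the returned results are all hypothetical.

```python
def train_all_active(specs, train_one):
    """Train every active spec in order, recording failures instead of raising.

    specs     -- list of dicts; "name" and "status" keys are hypothetical.
    train_one -- callable that trains a single spec (stand-in for train_spec).
    """
    results = {}
    for spec in specs:
        name = spec["name"]
        if spec.get("status", "active") != "active":
            continue  # retired specs are skipped, not trained
        try:
            results[name] = ("ok", train_one(spec))
        except Exception as exc:
            # Isolate the failure so later specs still get their turn.
            results[name] = ("failed", repr(exc))
    return results
```

Catching Exception rather than BaseException keeps Ctrl-C and interpreter shutdown working, which matters when each spec costs roughly one full train.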
Limitation documented: a horizon (FORWARD_DAYS) override only takes effect if
every constant derived from it reads cfg.X at call time rather than at import
time; this is checked when the 60d variant lands (L4488d). The CLI runs one spec
at a time (each ~one full train), so the operator paces experiments rather than
looping all in one SF.
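To illustrate why import-time-derived constants are a concern for a horizon override, here is a small self-contained sketch. The names EMBARGO_DAYS_FROZEN and embargo_days, and the "+ 5" derivation, are hypothetical and do not come from the trainer; only FORWARD_DAYS and the cfg.X access pattern are from this PR.

```python
from types import SimpleNamespace

# Stand-in for the config module.
cfg = SimpleNamespace(FORWARD_DAYS=20)

# Derived once, when this line runs (import time in a real module).
# It keeps whatever value cfg held at that moment.
EMBARGO_DAYS_FROZEN = cfg.FORWARD_DAYS + 5


def embargo_days():
    # Derived at call time: re-reads cfg on every call, so overrides are seen.
    return cfg.FORWARD_DAYS + 5


# What a horizon override would do while a spec is being trained.
cfg.FORWARD_DAYS = 60

print(EMBARGO_DAYS_FROZEN)  # still based on the old horizon
print(embargo_days())       # reflects the overridden horizon
```

Any constant of the first kind would silently keep the default horizon under a FORWARD_DAYS override, which is the case the limitation says must be verified.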
Tests: +8 (resolve active/retired/missing; allowlist; save/restore incl.
exception + absent-attr; train_spec applies overrides + defaults label;
all-active skips retired + continues on failure). Updated the feature_list
source-pin for the spec-driven version. Suite 1363 -> 1371.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ining:, broke YAML)
The L4488c model-zoo doc block + top-level model_version_label were inserted BETWEEN batch_size and learning_rate inside the training: mapping, so CI (which copies predictor.sample.yaml -> predictor.yaml) failed to parse it (block-end expected at learning_rate). Local suite passed because the real gitignored predictor.yaml was used, not the sample. Moved the block to a proper top-level location after shadow_versions:. Sample now parses.
L4488c — item 3 of the model-rotation scaffolding arc (L4488); the experiment/model-spec layer (SOTA pillar 2).
A model spec is a config OVERLAY over the existing training knobs. Running a spec trains ONE variant and registers it as a challenger (capture-gap; challenger-first never overwrites the champion) — which the shadow runner (#228) shadows and the net-of-cost scorer (L4488b) ranks. This is the layer that lets a variety of models be rotated in/out as experiments are run.
Thin by design (not a generic ML platform)
Reuses train_handler.main() unchanged, applying a spec's overrides around the call via a save/restore context: the knobs are module-level cfg constants read via cfg.X at call time (verified: no bare from config import …). No config-object refactor of the trainer.
What
- training/model_zoo.py: resolve_spec / spec_overrides (allowlisted, restores on exception + removes previously-absent attrs) / train_spec / train_all_active (sequential; one spec's failure never aborts the rest) + CLI (--spec / --all-active / --list). Override allowlist = {FORWARD_DAYS, RESIDUAL_MOMENTUM_ENABLED, XSEC_DEMEAN_ALPHA_ENABLED, MODEL_VERSION_LABEL}; disallowed keys fail loud.
- meta_trainer: manifest/feature_list/summary version → MODEL_VERSION_LABEL (default v3.0-meta), so each spec registers under its own version_id ({label}-{date}-{fp}), giving distinct challengers on the leaderboard.
- MODEL_SPECS (list, default []) + MODEL_VERSION_LABEL; sample.yaml documents the schema.
Documented limitation
A horizon (FORWARD_DAYS) override additionally needs any import-time-derived constant verified to read cfg.X at call time; this is checked when the 60d variant lands (L4488d). The CLI runs one spec at a time (~one full train each), so the operator paces experiments rather than looping all in one Saturday SF.
Tests
+8 (resolve active/retired/missing; allowlist; save/restore incl. exception + absent-attr; train_spec applies overrides + defaults label; all-active skips retired + continues on failure). Updated the feature_list source-pin for the spec-driven version. Suite 1363→1371.
Next (L4488d): seed the zoo (residual-momentum, 60d-target, nonlinear-blender) → populates the challenger track and settles the horizon call via L4488b's net-of-cost leaderboard.