Summary
CLI ergonomics for iterative exploration — running many cycles, reseeding from the best, chasing a target metric, and comparing configs. These come from sustained hands-on use during a multi-cycle benchmark/record-chasing push, where the same manual workarounds kept recurring. The top 3 are the ones that bit repeatedly; #4–#5 are smaller wins.
Related: #26 (more detailed evolution stats) — #1 below is the reader/consumer side of that data, complementary rather than overlapping.
Proposed additions
1. kaievolve status <run_dir> / --watch (priority)
Today, checking on a live run means hand-writing grep -hoE "raw_C=..." loops against logs/*.log. The web viewer already has live streaming (live.json / progress feed), but there is no CLI equivalent.
A one-shot status and a follow-mode watch that print:
- iteration N / total, throughput (iters/min), elapsed
- current best metric + which program id / island produced it
- last few improvements (metric trajectory)
- per-island best (island spread)
- whether the optional research director has fired and its latest directive
kaievolve status bench_results/<run>/run_0
kaievolve watch bench_results/<run>/run_0 # tails the live feed
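As a rough illustration of the "last few improvements" line, here is a minimal sketch that extracts a metric trajectory from log text. The log line format (`raw_C=<number>`), the helper name `improvements`, and the assumption that lower is better are all hypothetical; the real logs or the `live.json` feed may look different.

```python
import re


def improvements(lines, metric="raw_C", lower_is_better=True):
    """Return (line_index, value) for each log line that sets a new best.

    Assumes the metric appears in the log as `<metric>=<number>`; this
    format is a guess for illustration, not the tool's documented output.
    """
    number = r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    pattern = re.compile(re.escape(metric) + "=" + number)
    best = None
    found = []
    for index, line in enumerate(lines):
        for match in pattern.finditer(line):
            value = float(match.group(1))
            is_better = best is None or (
                value < best if lower_is_better else value > best
            )
            if is_better:
                best = value
                found.append((index, value))
    return found


# Hypothetical log lines, only to exercise the helper.
sample = [
    "iter 1 raw_C=1.60",
    "iter 2 raw_C=1.58",
    "iter 3 raw_C=1.59",
    "iter 4 raw_C=1.51",
]
trajectory = improvements(sample)
```

A `status` command could print the tail of such a list, while `watch` would recompute it as new lines arrive.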
2. --init-from <run_dir> (priority)
Chaining cycles (evolve → reseed from best → evolve) currently requires manually locating run_0/best/best_program.py and copying it over initial_program.py. A flag that pulls a previous run's best program as the new seed makes this a one-liner:
kaievolve-run.py --init-from bench_results/<prev_run>/run_0 evaluator.py --config cfg.yaml
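The manual step this flag would replace can be sketched as follows. The `best/best_program.py` layout comes from the description above; the function name `seed_from_run` and the destination handling are assumptions for illustration, not existing kaievolve code.

```python
import shutil
import tempfile
from pathlib import Path


def seed_from_run(run_dir, dest):
    """Copy <run_dir>/best/best_program.py to dest and return the new path."""
    src = Path(run_dir) / "best" / "best_program.py"
    if not src.is_file():
        raise FileNotFoundError(f"no best program found at {src}")
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    return dest


# Self-contained demo using a throwaway directory as a fake run.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "run_0"
    (run / "best").mkdir(parents=True)
    (run / "best" / "best_program.py").write_text("x = 1\n")
    seeded = seed_from_run(run, Path(tmp) / "next" / "initial_program.py")
    seeded_text = seeded.read_text()
```

Failing loudly when the best program is missing seems preferable to silently falling back to the default seed, since a chained cycle that restarts from scratch is easy to miss.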
3. --target <metric><op><value> + stop-on-reach (priority)
When chasing a known target, there is no way to exit early and clearly flag success; runs simply continue, and breakthroughs have to be detected by grep. A target predicate plus an unambiguous TARGET REACHED marker in the log and exit status closes the loop:
kaievolve-run.py ... --target "raw_C<=1.5029" --stop-on-target
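One possible way to parse and check a `<metric><op><value>` predicate is sketched below. The supported operator set, the function names, and the shape of the `metrics` dict are assumptions; the proposal itself only fixes the overall `<metric><op><value>` form.

```python
import operator
import re

# Comparison operators accepted by this sketch (an assumed set).
_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}

# Two-character operators are listed first so "<=" is not read as "<".
_TARGET_RE = re.compile(
    r"^\s*([A-Za-z_]\w*)\s*(<=|>=|==|<|>)\s*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_target(spec):
    """Split a spec like 'raw_C<=1.5029' into (name, comparator, value)."""
    match = _TARGET_RE.match(spec)
    if match is None:
        raise ValueError(f"invalid target spec: {spec!r}")
    name, op, value = match.groups()
    return name, _OPS[op], float(value)


def target_reached(spec, metrics):
    """Return True if the named metric is present and satisfies the spec."""
    name, compare, value = parse_target(spec)
    return name in metrics and compare(metrics[name], value)


hit = target_reached("raw_C<=1.5029", {"raw_C": 1.5028})
miss = target_reached("raw_C<=1.5029", {"raw_C": 1.51})
```

With `--stop-on-target`, the run loop could call a check like this after each evaluation, then log TARGET REACHED and exit as soon as it returns true.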
4. kaievolve compare <dir1> <dir2> ...
A/B'ing configs currently means digging through best_program_info.json across run dirs. A side-by-side of best metric / config diff / iterations / wall-time would make config comparison trivial.
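The "config diff" column could be built from a small recursive comparison of two loaded configs. This is only a sketch: the function name `diff_configs` and the config keys in the example (`max_iterations`, `database.num_islands`, `seed`) are made up for illustration and may not match real kaievolve configs.

```python
def diff_configs(a, b, prefix=""):
    """Return {dotted_key: (a_value, b_value)} for every leaf that differs.

    A key missing on one side is reported with None for that side.
    """
    missing = object()  # sentinel, so an explicit None value still compares
    changed = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        va, vb = a.get(key, missing), b.get(key, missing)
        if isinstance(va, dict) and isinstance(vb, dict):
            # Recurse into nested sections and keep dotted paths.
            changed.update(diff_configs(va, vb, prefix=path + "."))
        elif va != vb:
            changed[path] = (
                None if va is missing else va,
                None if vb is missing else vb,
            )
    return changed


# Hypothetical configs, as they might look after loading two YAML files.
cfg_a = {"max_iterations": 100, "database": {"num_islands": 4}}
cfg_b = {"max_iterations": 100, "database": {"num_islands": 8}, "seed": 1}
delta = diff_configs(cfg_a, cfg_b)
```

Printing only the differing keys next to each run's best metric keeps the side-by-side view short even when the configs are large.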
5. kaievolve eval <program.py> <evaluator.py> [--config cfg.yaml]
Scoring a single candidate program against the canonical evaluator currently requires throwaway python -c snippets. A one-shot eval command standardizes it (useful for sanity-checking seeds and inspecting individual programs).
Why
All five surfaced during real iterative use. #1 and #2 especially map to common workflows — "how's the run doing" and "reseed from the best program" — that today devolve into log-grepping and manual file-copying.