CLI ergonomics for iterative exploration (status/watch, --init-from, stop-on-target, compare, eval) #28

@aktasbatuhan

Summary

CLI ergonomics for iterative exploration: running many cycles, reseeding from the best, chasing a target metric, and comparing configs. These come from sustained hands-on use during a multi-cycle benchmark/record-chasing push, where the same manual workarounds kept recurring. The top 3 are the ones that bit repeatedly; #4 and #5 are smaller wins.

Related: #26 (more detailed evolution stats) — #1 below is the reader/consumer side of that data, complementary rather than overlapping.

Proposed additions

1. kaievolve status <run_dir> / --watch (priority)

Today, checking on a live run means hand-writing grep -hoE "raw_C=..." loops against logs/*.log. The web viewer already has live streaming (live.json / progress feed), but there is no CLI equivalent.

A one-shot status and a follow-mode watch that print:

  • iteration N / total, throughput (iters/min), elapsed
  • current best metric + which program id / island produced it
  • last few improvements (metric trajectory)
  • per-island best (island spread)
  • whether the optional research director has fired and its latest directive

kaievolve status  bench_results/<run>/run_0
kaievolve watch   bench_results/<run>/run_0     # tails the live feed
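
As a rough illustration of the "current best" and "last few improvements" fields, here is a minimal sketch that scans logs for the metric. It assumes the logs live under `<run_dir>/logs/` and contain tokens of the form `raw_C=<number>`, and that lower is better; `summarize_logs` and `improvements` are hypothetical helper names, not existing kaievolve functions. A real implementation would more likely read the live feed the web viewer already uses.

```python
import re
from pathlib import Path

# Assumed log token, e.g. "raw_C=1.5123"; the real log format may differ.
METRIC_RE = re.compile(r"raw_C=([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def improvements(values, lower_is_better=True):
    """Return the values that strictly improved on the best seen so far."""
    best, out = None, []
    for v in values:
        if best is None or (v < best if lower_is_better else v > best):
            best = v
            out.append(v)
    return out


def summarize_logs(run_dir, last_n=3):
    """Scan <run_dir>/logs/*.log (assumed layout) for metric samples."""
    values = []
    for log in sorted(Path(run_dir, "logs").glob("*.log")):
        for line in log.read_text(errors="replace").splitlines():
            values.extend(float(m) for m in METRIC_RE.findall(line))
    steps = improvements(values)
    return {
        "samples": len(values),
        "best": steps[-1] if steps else None,
        "recent_improvements": steps[-last_n:],
    }
```

A `watch` mode could call the same summary in a loop with a short sleep. Per-island spread and director state are not covered here, since they depend on data this sketch does not assume.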

2. --init-from <run_dir> (priority)

Chaining cycles (evolve → reseed from best → evolve) currently requires manually locating run_0/best/best_program.py and copying it over initial_program.py. A flag that pulls a previous run's best program as the new seed makes this a one-liner:

kaievolve-run.py --init-from bench_results/<prev_run>/run_0  evaluator.py --config cfg.yaml
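
The manual step this flag would replace can be sketched as below. It assumes the best program sits at `<run_dir>/best/best_program.py`, as in the workaround described above; `seed_from_run` is a hypothetical helper name, and how the real flag would hand the seed to the runner is left open.

```python
import shutil
from pathlib import Path


def seed_from_run(run_dir, dest="initial_program.py"):
    """Copy <run_dir>/best/best_program.py to dest and return the new path."""
    best = Path(run_dir) / "best" / "best_program.py"
    if not best.is_file():
        # Fail loudly rather than silently starting from a stale seed.
        raise FileNotFoundError(f"no best program found at {best}")
    dest = Path(dest)
    shutil.copyfile(best, dest)
    return dest
```

Failing with a clear error when the previous run has no best program seems preferable to falling back to the existing `initial_program.py`, since a silent fallback would hide a broken chain.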

3. --target <metric><op><value> + stop-on-reach (priority)

When chasing a known target, there is no way to exit early and clearly flag success; runs simply continue, and breakthroughs have to be detected by grep. A target predicate plus an unambiguous TARGET REACHED marker in the log and on exit closes the loop:

kaievolve-run.py ... --target "raw_C<=1.5029" --stop-on-target
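
One possible way to parse the `<metric><op><value>` string is sketched below. The set of supported operators and the name `parse_target` are assumptions for illustration; only the `raw_C<=1.5029` form comes from the example above.

```python
import operator
import re

# Assumed operator set; two-character operators are listed first so they win.
_OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}
_TARGET_RE = re.compile(
    r"^\s*([A-Za-z_]\w*)\s*(<=|>=|==|<|>)\s*"
    r"([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def parse_target(spec):
    """Parse e.g. 'raw_C<=1.5029' into (metric_name, predicate)."""
    m = _TARGET_RE.match(spec)
    if not m:
        raise ValueError(f"invalid target spec: {spec!r}")
    name, op, value = m.groups()
    threshold = float(value)
    compare = _OPS[op]

    def reached(metrics):
        # A missing metric never counts as reaching the target.
        return name in metrics and compare(metrics[name], threshold)

    return name, reached
```

The run loop could evaluate `reached(metrics)` after each iteration and, when `--stop-on-target` is set, log the TARGET REACHED marker and exit.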

4. kaievolve compare <dir1> <dir2> ...

A/B'ing configs currently means digging through best_program_info.json across run dirs. A side-by-side of best metric / config diff / iterations / wall-time would make config comparison trivial.
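
The config-diff part of such a comparison could look roughly like this. It takes already-loaded configs as nested dicts, so it assumes nothing about the YAML schema or the contents of best_program_info.json; `flatten` and `config_diff` are hypothetical helper names.

```python
_MISSING = "<unset>"


def flatten(d, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out


def config_diff(a, b):
    """Return {dotted_key: (value_in_a, value_in_b)} for keys that differ."""
    fa, fb = flatten(a), flatten(b)
    return {
        k: (fa.get(k, _MISSING), fb.get(k, _MISSING))
        for k in sorted(fa.keys() | fb.keys())
        if fa.get(k, _MISSING) != fb.get(k, _MISSING)
    }
```

Showing only the keys that differ keeps the side-by-side short, which matters when two configs share almost everything.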

5. kaievolve eval <program.py> <evaluator.py> [--config cfg.yaml]

Scoring a single candidate program against the canonical evaluator currently requires throwaway python -c snippets. A one-shot eval command standardizes it (useful for sanity-checking seeds and inspecting individual programs).
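
A minimal sketch of what such a command might do internally, replacing the throwaway snippets. It assumes the evaluator module exposes a callable named `evaluate` that takes the program path; that entry-point name and signature are assumptions here, as are the helper names `load_module` and `eval_program`.

```python
import importlib.util
from pathlib import Path


def load_module(path):
    """Import a Python file by path and return the module object."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def eval_program(program_path, evaluator_path, entry="evaluate"):
    """Score one program with the evaluator's entry point (assumed name)."""
    evaluator = load_module(evaluator_path)
    fn = getattr(evaluator, entry, None)
    if not callable(fn):
        raise AttributeError(f"{evaluator_path} has no callable {entry!r}")
    return fn(str(program_path))
```

Handling of `--config` is omitted because it depends on how the evaluator consumes configuration.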

Why

All five surfaced during real iterative use. #1 and #2 especially map to common workflows — "how's the run doing" and "reseed from the best program" — that today devolve into log-grepping and manual file-copying.
