7 changes: 1 addition & 6 deletions .codex/skills/agentopt/SKILL.md
Original file line number Diff line number Diff line change
@@ -218,16 +218,11 @@ Candidates may be model-name strings or framework-specific LLM instances, as lon
| `auto` | **Default.** Aliases `arm_elimination`. Start here. | — |
| `arm_elimination` | Best-arm identification with statistical elimination. | `n_initial`, `growth_factor`, `confidence` |
| `brute_force` | Small combo space; exhaustive baseline. | — |
| `random` | Cheap exploratory scan over a large space. | `sample_fraction`, `seed` |
| `hill_climbing` | Local search; combos have smooth structure. | `max_iterations`, `num_restarts`, `patience`, `seed`, `batch_size` |
| `epsilon_lucb` | Find any combo within ε of the best. | `epsilon`, `n_initial`, `confidence` |
| `threshold` | Classify combos above/below a quality bar. | `threshold`, `n_initial`, `confidence` |
| `matrix_ucb` | UCB over a node × model matrix; structured spaces. | see `docs/api/selectors.md` |
| `matrix_ucb_lrf` | Low-rank-factorization variant of matrix UCB. | see `docs/api/selectors.md` |
| `lm_proposal` | Ask a proposer LLM to pick a combo, then evaluate it. | `proposer_model`, `proposer_client`, `objective`, `dataset_preview_size`, `node_descriptions` |
| `bayesian` | Surrogate-model-guided search over medium/large spaces. | `batch_size`, `sample_fraction` |

Other `ModelSelector` kwargs: `model_prices` (custom token pricing), `tracker` (e.g. `LLMTracker(cache_dir=...)` for disk-cache reuse), `node_descriptions` (used by `lm_proposal`), `lambda_cost` / `lambda_latency` (optional; default `0.0` — scalar combined objective `score - λ_cost·norm(cost) - λ_latency·norm(latency)`; see `docs/api/selectors.md#combined-objective-optional-costlatency-weights`).
Other `ModelSelector` kwargs: `model_prices` (custom token pricing), `tracker` (e.g. `LLMTracker(cache_dir=...)` for disk-cache reuse), `lambda_cost` / `lambda_latency` (optional; default `0.0` — scalar combined objective `score - λ_cost·norm(cost) - λ_latency·norm(latency)`; see `docs/api/selectors.md#combined-objective-optional-costlatency-weights`).

### 5. Concurrency

11 changes: 2 additions & 9 deletions README.md
@@ -288,25 +288,18 @@ Working examples for the frameworks and CLI agents named above. Examples are org

AgentOpt includes a rich set of selection algorithms. Advanced users may get significant speedups by choosing the right method for their use case. See the [documentation](https://agentoptimizer.github.io/agentopt/) and [advanced_algorithms.py](examples/selection/local/advanced_algorithms.py) for details.

If you do not need the strict best model combination and want **lower search cost**, `epsilon_lucb` is often a good choice: it stops once an **ε-optimal** arm is found (tune `epsilon` to trade off how close to optimal you need to be versus how many runs you spend).

| `method=` | Best for | How it works |
|-----------|----------|-------------|
| `"auto"` (default) | General use | Automatically finds the best combination (wired to `arm_elimination` — strong best-arm identification with lower search cost than `brute_force`) |
| `"brute_force"` | Small search spaces | Evaluates all combinations |
| `"random"` | Quick exploration | Samples a random fraction |
| `"hill_climbing"` | Topology-aware search | Greedy search using model quality/speed rankings |
| `"arm_elimination"` | Best-arm identification | Bandit; eliminates statistically dominated combinations |
| `"epsilon_lucb"` | Extra search cost savings when ε-optimal is enough | Bandit; stops when an epsilon-optimal best arm is identified |
| `"threshold"` | Thresholding objectives | Bandit; determines whether each combination is above/below a user-defined `threshold` on the performance metric (e.g., mean accuracy) |
| `"lm_proposal"` | LLM-guided search | Uses a proposer LLM to shortlist promising combinations |
| `"matrix_ucb"` / `"matrix_ucb_lrf"` | Large combo × datapoint grids | UCB exploration over the evaluation matrix |
| `"bayesian"` | Expensive evaluations | GP-based Bayesian optimization over categorical model choices; uses correlation between combinations (requires `pip install "agentopt-py[bayesian]"`) |

```python
selector = ModelSelector(
    agent=MyAgent, models=models, eval_fn=eval_fn, dataset=dataset,
    # Before this change: method="epsilon_lucb", epsilon=0.01
    method="auto",
)
results = selector.select_best(parallel=True)
```
2 changes: 1 addition & 1 deletion docs/api/results.md
@@ -44,7 +44,7 @@ One per evaluated combination.
| `latency_seconds` | `float` | Mean latency per datapoint. |
| `input_tokens` | `Dict[str, int]` | Input tokens by model. |
| `output_tokens` | `Dict[str, int]` | Output tokens by model. |
| `attribute` | `str` | Metric track the result was scored under (algorithms like `threshold` produce multiple). |
| `attribute` | `str` | Metric track the result was scored under. |
| `is_best` | `bool` | Whether this is the top-ranked combination. |
| `datapoint_results` | `List[DatapointResult]` | Per-datapoint breakdown. |

41 changes: 3 additions & 38 deletions docs/api/selectors.md
@@ -27,7 +27,6 @@ results.print_summary()
| `model_prices` | `Dict`, optional | Custom pricing overrides: `{"model": {"input_price": x, "output_price": y}}` in $/MTok. Required for cost terms when `lambda_cost > 0`. |
| `lambda_cost` | `float`, optional | Weight on **normalized** per-sample cost in the combined objective. Default `0.0` (disabled). See [Combined objective](#combined-objective-optional-costlatency-weights) below. |
| `lambda_latency` | `float`, optional | Weight on **normalized** per-sample latency in the combined objective. Default `0.0` (disabled). |
| `node_descriptions` | `Dict[str, str]`, optional | Human-readable descriptions per node — surfaced in `LMProposalModelSelector`. |
| `tracker` | `LLMTracker`, optional | Bring your own. Defaults to a fresh `LLMTracker()` started in the constructor. Pass one in to share a cache across runs, route via a daemon (`AGENTOPT_GATEWAY_URL`), or post-process records after `select_best()` returns. |

The selector calls `tracker.start()` in the constructor and `tracker.stop()` when `select_best()` returns or raises. Record queries on the tracker remain valid after `stop()`, so post-run analysis works:
@@ -93,16 +92,12 @@ results.print_summary() # ranks by combined_objective when lambdas are set
| Methods | During search | Final `is_best` |
|:---|:---|:---|
| `matrix_ucb`, `matrix_ucb_lrf` | UCB rewards use per-cell combined objective | `_find_best` on `combined_objective` |
| `arm_elimination`, `epsilon_lucb`, `threshold` | Elimination / LUCB stats on combined per-sample objectives | same |
| `hill_climbing`, `bayesian` | Move / surrogate target uses combined objective | same |
| `brute_force`, `random` | Does not steer *which* combos to try | same |
| `lm_proposal` | Proposer uses `objective=` **text**, not these lambdas | `combined_objective` on the one evaluated combo only |
| `arm_elimination` | Elimination stats on combined per-sample objectives | same |
| `bayesian` | Surrogate target uses combined objective | same |
| `brute_force` | Does not steer *which* combos to try | same |

After `select_best()`, a final pass recomputes every result’s `combined_objective` against the **full-run** normalizer so rankings are comparable.
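
As a rough illustration of that final pass, the sketch below recomputes the combined objective for every result against a single normalizer built from the whole run. It assumes min-max normalization and plain dicts with `score`, `cost`, and `latency` keys. Both are assumptions made for illustration, and `min_max_normalize` and `combined_objectives` are hypothetical helpers, not AgentOpt functions.

```python
from typing import Dict, List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_objectives(
    results: List[Dict[str, float]],
    lambda_cost: float = 0.0,
    lambda_latency: float = 0.0,
) -> List[float]:
    """score - lambda_cost * norm(cost) - lambda_latency * norm(latency)."""
    # Normalizers are built from every result in the run, so the values
    # returned here are comparable across combinations.
    norm_cost = min_max_normalize([r["cost"] for r in results])
    norm_latency = min_max_normalize([r["latency"] for r in results])
    return [
        r["score"] - lambda_cost * c - lambda_latency * l
        for r, c, l in zip(results, norm_cost, norm_latency)
    ]


# Made-up results for two combinations.
runs = [
    {"score": 0.80, "cost": 4.0, "latency": 2.0},
    {"score": 0.78, "cost": 1.0, "latency": 1.0},
]
objectives = combined_objectives(runs, lambda_cost=0.1)
best_index = max(range(len(runs)), key=objectives.__getitem__)
```

With both lambdas left at `0.0` the ranking reduces to the raw score. In this toy data a small cost weight is enough to move the cheaper combination to the top.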

!!! note "`lm_proposal` vs lambdas"
`LMProposalModelSelector(objective="...")` is a natural-language hint to the **proposer LLM**. It is separate from `lambda_cost` / `lambda_latency`, which only affect the scalar reward used for ranking and bandit methods.

## `select_best()`

```python
```
@@ -123,13 +118,8 @@ Returns a [`SelectionResults`](results.md). `parallel=True` requires `agent.run`
|:---|:---|:---|
| `"auto"` (default) | Arm elimination | Strong best-arm identification at lower search cost than brute force. Same impl as `"arm_elimination"`. |
| `"brute_force"` | Evaluate every combo on the full dataset | Small search space; ground-truth comparison. |
| `"random"` | Random search | Cheap baseline. |
| `"hill_climbing"` | Greedy per-node | Large combinatorial spaces with weak coupling between nodes. |
| `"arm_elimination"` | Successive elimination | Best-arm identification with PAC-style guarantees. |
| `"epsilon_lucb"` | LUCB with tolerance | Stop once a combo is within ε of the best. |
| `"matrix_ucb"` / `"matrix_ucb_lrf"` | UCB exploiting cross-combo structure | Large model x datapoint matrices; `lrf` adds low-rank factorization. |
| `"threshold"` | Threshold bandit successive elimination | "Find all combos above accuracy θ" rather than the single best. |
| `"lm_proposal"` | LM-guided | Uses `node_descriptions` to propose combinations. |
| `"bayesian"` | Bayesian optimization | Optional extra: `pip install "agentopt-py[bayesian]"`. |

---
@@ -141,26 +131,11 @@ Returns a [`SelectionResults`](results.md). `parallel=True` requires `agent.run`
members: false
show_bases: false

::: agentopt.model_selection.random_search.RandomSearchModelSelector
options:
members: false
show_bases: false

::: agentopt.model_selection.hill_climbing.HillClimbingModelSelector
options:
members: false
show_bases: false

::: agentopt.model_selection.arm_elimination.ArmEliminationModelSelector
options:
members: false
show_bases: false

::: agentopt.model_selection.epsilon_lucb.EpsilonLUCBModelSelector
options:
members: false
show_bases: false

::: agentopt.model_selection.matrix_ucb.MatrixUCBModelSelector
options:
members: false
@@ -171,16 +146,6 @@ Returns a [`SelectionResults`](results.md). `parallel=True` requires `agent.run`
members: false
show_bases: false

::: agentopt.model_selection.threshold_successive_elimination.ThresholdBanditSEModelSelector
options:
members: false
show_bases: false

::: agentopt.model_selection.lm_proposal.LMProposalModelSelector
options:
members: false
show_bases: false

::: agentopt.model_selection.bayesian_optimization.BayesianOptimizationModelSelector
options:
members: false
20 changes: 0 additions & 20 deletions docs/benchmark-results/index.md
@@ -52,13 +52,8 @@ All models accessed via AWS Bedrock Application Inference Profiles (on-demand pr
| Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
|:---------|:----------|:--------------|:------------|:-----|:--------|
| Brute Force | 100% | 74.75% | 1,782 | $4.71 | -- |
| LM Proposal | 100% | 74.75% | 198 | $2.47 | 48% |
| Hill Climbing | 90% | 74.55% | 1,501 | $4.03 | 14% |
| Arm Elimination | 94% | 74.10% | 666 | $3.57 | **24%** |
| Epsilon LUCB | 72% | 73.14% | 380 | $2.51 | 47% |
| Bayesian Opt | 56% | 72.43% | 990 | $2.59 | 45% |
| Random Search | 36% | 68.57% | 594 | $1.73 | 63% |
| Threshold SE | 16% | 57.48% | 252 | $1.80 | 62% |
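
The Savings column appears to be the share of brute-force cost avoided, rounded to a whole percent. That reading is inferred from the numbers in the table rather than stated in it, and `savings_vs_brute_force` is a hypothetical helper used only to check the arithmetic:

```python
def savings_vs_brute_force(cost: float, brute_force_cost: float) -> int:
    """Percent of search cost saved relative to brute force, rounded."""
    return round(100 * (1 - cost / brute_force_cost))


# Costs taken from the table above (brute force: $4.71).
arm_elimination_savings = savings_vs_brute_force(3.57, 4.71)
lm_proposal_savings = savings_vs_brute_force(2.47, 4.71)
```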

### Thinking Effort Ablation

@@ -103,13 +98,8 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort)
| Selector | Find Rate | Mean Accuracy | Evaluations | Cost | Savings |
|:---------|:----------|:--------------|:------------|:-----|:--------|
| Brute Force | 100% | 70.00% | 1,800 | $84.80 | -- |
| Hill Climbing | 100% | 70.00% | 1,664 | $72.12 | 15% |
| Epsilon LUCB | 28% | 69.90% | 399 | $40.03 | 53% |
| Arm Elimination | 88% | 69.37% | 912 | $74.39 | **12%** |
| Bayesian Opt | 44% | 69.27% | 1,000 | $50.64 | 40% |
| Random Search | 36% | 67.13% | 600 | $31.39 | 63% |
| Threshold SE | 10% | 58.19% | 186 | $18.82 | 78% |
| LM Proposal | 0% | 44.03% | 200 | $3.39 | 96% |

---

@@ -253,11 +243,6 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort)
| Brute Force | 100% | 74.27% | 16,168 | $51.90 | -- |
| Bayesian Opt | 8% | 73.33% | 3,996 | $12.29 | 76% |
| Arm Elimination | 86% | 73.19% | 4,283 | $16.92 | **67%** |
| Hill Climbing | 52% | 73.13% | 4,635 | $19.39 | 63% |
| Random Search | 30% | 72.25% | 4,192 | $13.37 | 74% |
| Epsilon LUCB | 10% | 69.71% | 478 | $1.75 | 97% |
| Threshold SE | 4% | 65.42% | 1,642 | $6.45 | 88% |
| LM Proposal | 0% | 34.13% | 200 | $1.84 | 96% |

---

@@ -397,9 +382,4 @@ Impact of thinking/reasoning budget on GPQA accuracy for Opus (adaptive effort)
|:---------|:----------|:--------------|:------------|:-----|:--------|
| Brute Force | 100% | 98.84% | 14,961 | $123.87 | -- |
| Arm Elimination | 86% | 98.83% | 3,356 | $51.86 | **58%** |
| Hill Climbing | 80% | 98.76% | 3,926 | $54.22 | 56% |
| Random Search | 28% | 98.17% | 3,880 | $31.77 | 74% |
| Epsilon LUCB | 4% | 96.99% | 447 | $6.10 | 95% |
| LM Proposal | 0% | 95.82% | 158 | $5.61 | 95% |
| Bayesian Opt | 4% | 95.41% | 3,666 | $35.56 | 71% |
| Threshold SE | 0% | 74.52% | 1,355 | $6.90 | 94% |
23 changes: 1 addition & 22 deletions docs/blog/posts/technical-deep-dive.md
@@ -109,24 +109,10 @@ Arm Elimination works in rounds:

Bad combos get eliminated early and cheaply. Good combos earn more search budget. The search cost is far less than brute force.
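
A minimal sketch of round-based elimination, assuming rewards in [0, 1] and Hoeffding confidence intervals. It evaluates each surviving combo once per round, so it is a textbook successive-elimination loop rather than AgentOpt's implementation, whose schedule and bounds may differ. All names below are hypothetical.

```python
import math
from typing import Callable, Dict, List


def hoeffding_radius(n: int, delta: float) -> float:
    """Half-width of a Hoeffding confidence interval for rewards in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def successive_elimination(
    arms: List[str],
    pull: Callable[[str], float],
    max_rounds: int = 200,
    delta: float = 0.05,
) -> List[str]:
    """Return the arms that are still active when the search stops."""
    totals: Dict[str, float] = {a: 0.0 for a in arms}
    counts: Dict[str, int] = {a: 0 for a in arms}
    active = list(arms)
    for _ in range(max_rounds):
        if len(active) == 1:
            break
        # Only surviving arms receive more evaluations.
        for a in active:
            totals[a] += pull(a)
            counts[a] += 1
        means = {a: totals[a] / counts[a] for a in active}
        radius = {a: hoeffding_radius(counts[a], delta) for a in active}
        best_lower = max(means[a] - radius[a] for a in active)
        # Drop arms whose upper bound falls below the best lower bound.
        active = [a for a in active if means[a] + radius[a] >= best_lower]
    return active


# Noise-free toy rewards so the run is deterministic; real evaluations are noisy.
true_accuracy = {"combo_a": 0.9, "combo_b": 0.1, "combo_c": 0.5}
survivors = successive_elimination(list(true_accuracy), pull=true_accuracy.__getitem__)
```

In this toy run the weakest combo is dropped after a dozen rounds, while the closer competitor needs roughly four times as many evaluations before it can be ruled out.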

### Epsilon-LUCB

When you just need to find *the single best* combo, epsilon-LUCB (Lower/Upper Confidence Bound) is extremely sample-efficient. Each round, it compares the current leader's lower confidence bound against the best challenger's upper bound. When the gap closes below a threshold epsilon, you've found your winner with statistical confidence.
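
The stopping check described above can be sketched as follows, assuming Hoeffding-style confidence radii for rewards in [0, 1] and at least two combos. `lucb_should_stop` is a hypothetical helper written for illustration; it covers only the stopping test, not the sampling loop around it.

```python
import math
from typing import Dict, Tuple


def lucb_should_stop(
    means: Dict[str, float],
    counts: Dict[str, int],
    epsilon: float,
    delta: float = 0.05,
) -> Tuple[bool, str, str]:
    """Return (stop, leader, challenger) for one round."""

    def radius(arm: str) -> float:
        return math.sqrt(math.log(2.0 / delta) / (2.0 * counts[arm]))

    # Leader: highest empirical mean so far.
    leader = max(means, key=lambda a: means[a])
    # Challenger: the other arm with the highest upper confidence bound.
    challenger = max(
        (a for a in means if a != leader),
        key=lambda a: means[a] + radius(a),
    )
    gap = (means[challenger] + radius(challenger)) - (means[leader] - radius(leader))
    return gap < epsilon, leader, challenger


means = {"combo_a": 0.90, "combo_b": 0.50}
early = lucb_should_stop(means, {"combo_a": 5, "combo_b": 5}, epsilon=0.01)
late = lucb_should_stop(means, {"combo_a": 1000, "combo_b": 1000}, epsilon=0.01)
```

With five evaluations per combo the intervals still overlap and the check says to keep sampling. With a thousand each, the challenger's upper bound sits well below the leader's lower bound and the search can stop.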

### Threshold Successive Elimination

When you have a minimum acceptable accuracy in mind (e.g., "I need at least 70%"), Threshold SE takes a different approach. Instead of finding the single best combo, it classifies each combo as above or below your threshold. Each round, it evaluates all surviving combos on one more datapoint and checks their confidence intervals. Once a combo's interval no longer straddles the threshold (entirely above or entirely below), it's classified and removed from the active set. This is useful when you care about filtering rather than ranking.
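
A minimal sketch of the per-combo classification step, assuming accuracies in [0, 1] and a Hoeffding confidence interval. The function name and the interval choice are assumptions for illustration, not the selector's actual code.

```python
import math


def classify_against_threshold(
    mean: float, n: int, threshold: float, delta: float = 0.05
) -> str:
    """Classify a combo once its confidence interval clears the threshold."""
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    if mean - radius > threshold:
        return "above"
    if mean + radius < threshold:
        return "below"
    return "undecided"  # interval still straddles the threshold


# Made-up running means after 200 datapoints, with a 70% quality bar.
strong = classify_against_threshold(mean=0.90, n=200, threshold=0.70)
weak = classify_against_threshold(mean=0.40, n=200, threshold=0.70)
borderline = classify_against_threshold(mean=0.72, n=200, threshold=0.70)
```

Combos far from the bar are classified quickly. A combo sitting near the threshold stays undecided and keeps consuming evaluations, which is where most of the budget goes.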

### Bayesian Optimization

For expensive evaluations, Bayesian Optimization builds a Gaussian Process surrogate model that predicts accuracy as a function of the model combination. It uses Expected Improvement to pick the most informative next evaluation, spending budget where uncertainty is highest and potential is greatest.
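
For reference, the standard closed form of Expected Improvement for a Gaussian prediction is sketched below. It assumes the surrogate returns a mean `mu` and standard deviation `sigma` for an untested combo; how AgentOpt's surrogate produces those values is not shown here, and the numbers are made up.

```python
import math


def expected_improvement(mu: float, sigma: float, best_so_far: float) -> float:
    """Expected improvement over best_so_far for a prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best_so_far) * cdf + sigma * pdf


# Two hypothetical combos with the same predicted mean but different uncertainty.
confident = expected_improvement(mu=0.72, sigma=0.01, best_so_far=0.74)
uncertain = expected_improvement(mu=0.72, sigma=0.10, best_so_far=0.74)
```

Both combos are predicted to be slightly worse than the current best, yet the uncertain one has the larger expected improvement, so it would be evaluated first.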

### Hill Climbing

Hill Climbing takes a different approach: greedy local search. Start with a random model combination, then swap one model at a time, keeping each swap only if it improves accuracy. Use random restarts to avoid getting stuck in local optima.

The catch: Hill Climbing requires **topology information**. It needs a notion of which models are "neighbors" of each other, typically an ordering by capability or cost. This lets it search intelligently (try the next-best model, not a random one), but it also means you're injecting assumptions about model quality that may not hold. As the HotpotQA results show, model capability rankings don't always predict combo performance. When the topology is misleading, Hill Climbing can get stuck exploring the wrong region of the search space.
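
A minimal sketch of greedy search with random restarts, assuming the topology is a single ranked list of models in which each model's neighbors are the adjacent entries. The function, its parameters, and the toy scoring function are hypothetical and only illustrate the idea.

```python
import random
from typing import Callable, List, Optional, Tuple

Combo = Tuple[str, ...]


def hill_climb(
    ranked_models: List[str],
    n_nodes: int,
    evaluate: Callable[[Combo], float],
    num_restarts: int = 3,
    seed: int = 0,
) -> Tuple[Combo, float]:
    """Greedy one-swap-at-a-time search with random restarts."""
    rng = random.Random(seed)
    best_combo: Optional[Combo] = None
    best_score = float("-inf")
    for _ in range(num_restarts):
        combo = tuple(rng.choice(ranked_models) for _ in range(n_nodes))
        score = evaluate(combo)
        improved = True
        while improved:
            improved = False
            for node in range(n_nodes):
                rank = ranked_models.index(combo[node])
                # Neighbors are the adjacent models in the assumed ranking.
                for neighbor in (rank - 1, rank + 1):
                    if not 0 <= neighbor < len(ranked_models):
                        continue
                    candidate = combo[:node] + (ranked_models[neighbor],) + combo[node + 1:]
                    candidate_score = evaluate(candidate)
                    if candidate_score > score:  # keep only improving swaps
                        combo, score, improved = candidate, candidate_score, True
                        break
        if score > best_score:
            best_combo, best_score = combo, score
    assert best_combo is not None  # requires num_restarts >= 1
    return best_combo, best_score


ranking = ["small", "medium", "large"]
target = ("large", "small")  # made-up best combo for a two-node agent


def toy_accuracy(combo: Combo) -> float:
    """Negated rank distance from the made-up best combo."""
    return -float(sum(abs(ranking.index(m) - ranking.index(t)) for m, t in zip(combo, target)))


best_combo, best_score = hill_climb(ranking, n_nodes=2, evaluate=toy_accuracy)
```

The toy objective improves smoothly along the ranking, so every start climbs to the same optimum. When the ranking does not track real combo performance, the same loop can stall at a local optimum, which is why restarts are included.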

### How Much Do These Save?

Across our four benchmarks, Arm Elimination consistently achieves near-optimal accuracy while using up to 67% less budget than brute force:
@@ -169,20 +155,13 @@ We validated AgentOpt across four diverse benchmarks using 9 models on Amazon Be
<tbody>
<tr><td><strong>Brute Force</strong></td><td>74.75% / 0%</td><td>70.00% / 0%</td><td>74.27% / 0%</td><td>98.84% / 0%</td></tr>
<tr><td><strong>Arm Elimination</strong></td><td>74.10% / 24%</td><td>69.37% / 12%</td><td><span class="ao-efficient">73.19% / 67%</span></td><td><span class="ao-efficient">98.83% / 58%</span></td></tr>
<tr><td><strong>Hill Climbing</strong></td><td>74.55% / 14%</td><td>70.00% / 15%</td><td><span class="ao-efficient">73.13% / 63%</span></td><td><span class="ao-efficient">98.76% / 56%</span></td></tr>
<tr><td><strong>Bayesian Opt</strong></td><td>72.43% / 45%</td><td>69.27% / 40%</td><td><span class="ao-efficient">73.33% / 76%</span></td><td><span class="ao-efficient">95.41% / 71%</span></td></tr>
<tr><td><strong>Random Search</strong></td><td><span class="ao-efficient">68.57% / 63%</span></td><td><span class="ao-efficient">67.13% / 63%</span></td><td><span class="ao-efficient">72.25% / 74%</span></td><td><span class="ao-efficient">98.17% / 74%</span></td></tr>
<tr><td><strong>Epsilon-LUCB</strong></td><td>73.14% / 47%</td><td><span class="ao-efficient">69.90% / 53%</span></td><td>69.71% / 97%</td><td><span class="ao-efficient">96.99% / 95%</span></td></tr>
<tr><td><strong>Threshold SE</strong></td><td>57.83% / 62%</td><td>58.19% / 78%</td><td>65.42% / 88%</td><td>74.52% / 94%</td></tr>
<tr><td><strong>LM Proposal</strong></td><td>74.75% / 48%</td><td>44.03% / 96%</td><td>34.13% / 97%</td><td><span class="ao-efficient">95.82% / 96%</span></td></tr>
</tbody>
</table>

*Format: obtained accuracy / search cost savings vs brute force. Averaged over 50 seeds. <span class="ao-efficient">Green</span> = within 5% of brute force accuracy AND >50% savings on that metric.*

Arm Elimination and Hill Climbing achieve comparable mean accuracy (within 1 percentage point of brute force across all four benchmarks), with Arm Elimination offering modestly higher cost savings on average (40% vs 37%). No single selector dominates all benchmarks. Hill Climbing excels when top models are tightly clustered (BFCL), while Arm Elimination performs best when there is clear separation between the best combo and the rest (HotpotQA, MathQA). However, Hill Climbing requires a hand-crafted topology ranking of the models upfront — you need prior knowledge about model quality and speed ordering for it to search effectively. Arm Elimination is fully assumption-free: it uses only the observed evaluation data to eliminate dominated combos, making it more practical when you don't have reliable priors about model capabilities.

LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 44% on BFCL. It cannot predict that Ministral outperforms Opus as a planner.
Arm Elimination consistently achieves near-optimal accuracy while using significantly less budget than brute force across our four benchmarks. No single selector dominates all benchmarks — Arm Elimination performs best when there is clear separation between the best combo and the rest (HotpotQA, MathQA), while Bayesian Optimization can achieve high savings on large search spaces at the cost of lower find rates.

## Get Started
