microsoft/SkillOpt

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Train agent skills like you train neural networks: with epochs, (mini-)batch sizes, learning rates, and validation gates, but without touching model weights.

Project Page | Paper | Project Video | Python 3.10+ | License: MIT

🎬 SkillOpt Demo Video

▶ Watch the full demo on YouTube


Install

Requirements: Python 3.10+

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

# For ALFWorld benchmark (optional):
pip install -e ".[alfworld]"
alfworld-download

Configure API Credentials

cp .env.example .env
# Edit .env with your API credentials, then:
source .env

Azure OpenAI (recommended):

export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
# Option 1: API key auth
export AZURE_OPENAI_API_KEY="your-key"
# Option 2: Azure CLI auth (no API key needed)
export AZURE_OPENAI_AUTH_MODE="azure_cli"

Note: AZURE_OPENAI_ENDPOINT is always required. Without it, all LLM calls will fail.

OpenAI directly:

export OPENAI_API_KEY="sk-..."

Anthropic Claude:

export ANTHROPIC_API_KEY="sk-ant-..."

Qwen (local vLLM):

export QWEN_CHAT_BASE_URL="http://localhost:8000/v1"
export QWEN_CHAT_MODEL="Qwen/Qwen3.5-4B"
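
Before launching a run, it can help to confirm which of the credential sets above are actually visible to Python. The sketch below is a standalone check and not part of SkillOpt: `configured_providers` is a hypothetical helper, and the provider labels it returns are made up for illustration. It only reads the environment variable names listed in this section.

```python
import os
from typing import Mapping


def configured_providers(env: Mapping[str, str]) -> list[str]:
    """Return labels for the providers whose variables are set (hypothetical helper)."""
    providers = []
    # Azure needs the endpoint plus either an API key or Azure CLI auth mode.
    has_endpoint = bool(env.get("AZURE_OPENAI_ENDPOINT"))
    has_azure_auth = bool(env.get("AZURE_OPENAI_API_KEY")) or (
        env.get("AZURE_OPENAI_AUTH_MODE") == "azure_cli"
    )
    if has_endpoint and has_azure_auth:
        providers.append("azure_openai")
    if env.get("OPENAI_API_KEY"):
        providers.append("openai")
    if env.get("ANTHROPIC_API_KEY"):
        providers.append("anthropic")
    # The local vLLM setup needs both the base URL and the model name.
    if env.get("QWEN_CHAT_BASE_URL") and env.get("QWEN_CHAT_MODEL"):
        providers.append("qwen")
    return providers


if __name__ == "__main__":
    print(configured_providers(os.environ))
```

An empty list after `source .env` usually means the variables were assigned without `export`, so child processes cannot see them.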

Data Preparation

SkillOpt expects data in a split directory with train/, val/, test/ subdirectories, each containing a JSON file (e.g., items.json).

data/my_split/
├── train/items.json
├── val/items.json
└── test/items.json
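
A quick way to check a split directory against the layout above is sketched here. `missing_split_parts` is a hypothetical helper, not a SkillOpt function. Since the JSON file name is only given as an example (`items.json`), the check accepts any `.json` file in each subdirectory.

```python
from pathlib import Path


def missing_split_parts(split_dir: str) -> list[str]:
    """List the train/val/test subdirectories that are absent or hold no .json file."""
    root = Path(split_dir)
    missing = []
    for name in ("train", "val", "test"):
        sub = root / name
        # A subdirectory counts as present only if it contains at least one JSON file.
        if not sub.is_dir() or not any(sub.glob("*.json")):
            missing.append(name)
    return missing
```

For example, `missing_split_parts("data/my_split")` returns an empty list when all three subdirectories are populated.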

Each JSON file is an array of task items. The required fields depend on the benchmark. For example, SearchQA items look like:

[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage text ...",
    "answers": ["expected answer"]
  }
]

See skillopt/envs/<benchmark>/dataloader.py for the exact format each benchmark expects.

Note: Benchmark datasets are not included in this repository. Prepare your own data following the format above.
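
As one way to prepare your own data, the sketch below validates SearchQA-style items and writes them into the directory layout described above. It is an illustration under assumptions: `write_split`, the 80/10/10 default split, and the fixed seed are hypothetical choices. The required field names are taken from the SearchQA example only, so other benchmarks will need a different field list.

```python
import json
import random
from pathlib import Path

# Field names taken from the SearchQA example item; other benchmarks differ.
REQUIRED_FIELDS = ("id", "question", "context", "answers")


def write_split(items, out_dir, val_frac=0.1, test_frac=0.1, seed=0):
    """Validate items, shuffle them, and write train/val/test items.json files."""
    for item in items:
        absent = [f for f in REQUIRED_FIELDS if f not in item]
        if absent:
            raise ValueError(f"item {item.get('id', '?')} is missing fields: {absent}")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    parts = {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }
    for name, part in parts.items():
        path = Path(out_dir) / name / "items.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(part, indent=2, ensure_ascii=False), encoding="utf-8")
    # Report how many items landed in each split.
    return {name: len(part) for name, part in parts.items()}
```

The resulting directory can then be passed as `--split_dir`.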

Supported Benchmarks

| Benchmark | Type | Config |
| --- | --- | --- |
| SearchQA | QA | configs/searchqa/default.yaml |
| ALFWorld | Embodied agent | configs/alfworld/default.yaml |
| DocVQA | Document QA | configs/docvqa/default.yaml |
| LiveMathematicianBench | Math | configs/livemathematicianbench/default.yaml |
| SpreadsheetBench | Code generation | configs/spreadsheetbench/default.yaml |
| OfficeQA | Tool-augmented QA | configs/officeqa/default.yaml |

Quick Start

Training

# Minimal example: train on SearchQA
python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/your/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on LiveMathematicianBench:
python scripts/train.py \
    --config configs/livemathematicianbench/default.yaml \
    --split_dir /path/to/your/livemath_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

# Train on ALFWorld:
python scripts/train.py \
    --config configs/alfworld/default.yaml \
    --split_dir /path/to/your/alfworld_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

Key CLI arguments:

| Argument | Description | Example |
| --- | --- | --- |
| --config | Benchmark config YAML | configs/searchqa/default.yaml |
| --split_dir | Path to data split directory | /path/to/split |
| --azure_openai_endpoint | Azure OpenAI endpoint URL | https://your-resource.openai.azure.com/ |
| --optimizer_model | Optimizer model deployment name | gpt-5.5 |
| --target_model | Target model deployment name | gpt-5.5 |
| --num_epochs | Number of training epochs | 4 |
| --batch_size | Batch size per step | 40 |
| --workers | Parallel rollout workers | 8 |
| --out_root | Output directory | outputs/my_run |
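
When launching several runs from a script, the arguments above can be assembled programmatically. The sketch below builds the argument list for `scripts/train.py` using only the flags in the table. `build_train_command` is a hypothetical wrapper, not part of SkillOpt, and it assumes the command is run from the repository root.

```python
import shlex
import sys


def build_train_command(config, split_dir, endpoint, optimizer_model, target_model,
                        num_epochs=None, batch_size=None, workers=None, out_root=None):
    """Assemble the scripts/train.py invocation as an argument list (hypothetical helper)."""
    cmd = [
        sys.executable, "scripts/train.py",
        "--config", config,
        "--split_dir", split_dir,
        "--azure_openai_endpoint", endpoint,
        "--optimizer_model", optimizer_model,
        "--target_model", target_model,
    ]
    optional = {
        "--num_epochs": num_epochs,
        "--batch_size": batch_size,
        "--workers": workers,
        "--out_root": out_root,
    }
    # Only pass optional flags that were set, so the config defaults apply otherwise.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_train_command(
    config="configs/searchqa/default.yaml",
    split_dir="/path/to/your/searchqa_split",
    endpoint="https://your-resource.openai.azure.com/",
    optimizer_model="gpt-5.5",
    target_model="gpt-5.5",
    num_epochs=4, batch_size=40, workers=8, out_root="outputs/my_run",
)
print(shlex.join(cmd))
# To launch from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.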

Eval Only

Evaluate a trained skill on specific data splits without training:

# Evaluate on test set only:
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

# Evaluate on all splits (train + val + test):
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split all \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

| Split | Description |
| --- | --- |
| valid_unseen | Test set |
| valid_seen | Validation set |
| train | Training set |
| all | All splits combined (default) |
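
The split names do not match the directory names from Data Preparation, so a small lookup can make scripts less error-prone. The mapping below is an assumption made by reading the table alongside the `train/`, `val/`, `test/` layout. It is not taken from `eval_only.py`, and `subdirs_for_split` is a hypothetical helper.

```python
# Assumed correspondence between --split values and split subdirectories.
SPLIT_TO_SUBDIRS = {
    "valid_unseen": ["test"],
    "valid_seen": ["val"],
    "train": ["train"],
    "all": ["train", "val", "test"],
}


def subdirs_for_split(split: str = "all") -> list[str]:
    """Return the data subdirectories a --split value is assumed to cover."""
    try:
        return list(SPLIT_TO_SUBDIRS[split])  # copy so callers cannot mutate the table
    except KeyError:
        valid = ", ".join(SPLIT_TO_SUBDIRS)
        raise ValueError(f"unknown split {split!r}; expected one of: {valid}") from None
```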

Output Structure

Each run writes to a structured output directory:

outputs/<run_name>/
├── config.json              # Flattened runtime config
├── history.json             # Per-step training history
├── runtime_state.json       # Resume checkpoint
├── best_skill.md            # Best validated skill document
├── skills/skill_vXXXX.md   # Skill snapshot per step
├── steps/step_XXXX/        # Per-step artifacts (patches, evals)
├── slow_update/epoch_XX/   # Slow update logs
└── meta_skill/epoch_XX/    # Meta skill logs

Re-running the same command auto-resumes from the last completed step.
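
To check on a run without opening the WebUI, the output directory can be inspected directly. The sketch below relies only on the file names in the tree above. `summarize_run` is a hypothetical helper, the internal schema of `history.json` is not documented here, so the code only counts its top-level entries, and it assumes snapshot numbers are zero-padded so that sorting by name matches step order.

```python
import json
from pathlib import Path


def summarize_run(run_dir):
    """Summarize a SkillOpt output directory using only documented file names."""
    root = Path(run_dir)
    # Assumes zero-padded version numbers, so name order equals step order.
    snapshots = sorted(p.name for p in (root / "skills").glob("skill_v*.md"))
    best = root / "best_skill.md"
    history_path = root / "history.json"
    history = None
    if history_path.is_file():
        history = json.loads(history_path.read_text(encoding="utf-8"))
    return {
        "has_best_skill": best.is_file(),
        "best_skill_chars": len(best.read_text(encoding="utf-8")) if best.is_file() else 0,
        "num_snapshots": len(snapshots),
        "latest_snapshot": snapshots[-1] if snapshots else None,
        # The schema is unknown, so only the number of top-level entries is reported.
        "history_entries": len(history) if isinstance(history, (list, dict)) else None,
    }
```

For example, `summarize_run("outputs/my_run")` returns a dictionary that can be printed or logged between runs.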


WebUI

Launch the monitoring dashboard (optional):

pip install -e ".[webui]"
python -m skillopt_webui.app

| Flag | Default | Description |
| --- | --- | --- |
| --port | 7860 | Server port |
| --host | 0.0.0.0 | Bind address |
| --share | off | Create a public Gradio share link |

# With public share link (useful for remote servers)
python -m skillopt_webui.app --share
Citation

@misc{yang2026skilloptexecutivestrategyselfevolving,
      title={SkillOpt: Executive Strategy for Self-Evolving Agent Skills}, 
      author={Yifan Yang and Ziyang Gong and Weiquan Huang and Qihao Yang and Ziwei Zhou and Zisu Huang and Yan Li and Xuemei Gao and Qi Dai and Bei Liu and Kai Qiu and Yuqing Yang and Dongdong Chen and Xue Yang and Chong Luo},
      year={2026},
      eprint={2605.23904},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.23904}
}

About

SkillOpt is a text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits, validation-gated updates, and deployable best_skill.md artifacts.
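
To make "validation-gated updates" concrete, here is a toy sketch of the general idea: propose an edit to the skill text from a mini-batch, and keep it only if the validation score improves. This is not SkillOpt's implementation. `train_skill`, `propose_edit`, and `score` are hypothetical stand-ins, and the real system uses LLM rollouts and an optimizer model rather than the string operations shown in the usage example.

```python
import random


def train_skill(skill, train_items, val_items, propose_edit, score,
                num_epochs=2, batch_size=2, seed=0):
    """Toy epoch/mini-batch loop with a validation gate over a text skill."""
    rng = random.Random(seed)
    best_skill, best_val = skill, score(skill, val_items)
    for _ in range(num_epochs):
        order = list(train_items)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            candidate = propose_edit(best_skill, batch)
            candidate_val = score(candidate, val_items)
            # Validation gate: accept the edit only on a strict improvement.
            if candidate_val > best_val:
                best_skill, best_val = candidate, candidate_val
    return best_skill, best_val


# Toy usage: the "skill" is a string of letters, and an edit appends unseen letters.
def propose_edit(skill, batch):
    return skill + "".join(x for x in batch if x not in skill)


def score(skill, items):
    return sum(x in skill for x in items) / len(items)


best, val = train_skill("", ["a", "b", "c", "d"], ["a", "c"], propose_edit, score)
print(best, val)
```

The model weights never change in this loop. Only the skill text does, which is the sense in which the optimizer works in text space.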
