CLI-first framework for evaluating AI agent skills. SkilProbe analyzes SKILL.md specifications through static validation, LLM-as-judge scoring, multi-model evaluation, regression tracking, and CI/CD integration.
SkilProbe provides a complete pipeline for measuring how well AI agents perform on defined skills:
- Static Analysis -- Lint and validate SKILL.md files against structural rules, check token budgets, and run LLM-as-judge quality assessments
- Dataset Engine -- Manage golden evaluation datasets with versioning, validation, synthetic generation, and promotion workflows
- MCP Introspection -- Discover tools from MCP servers, map them to skill references, analyze schemas, and generate evaluation scenarios
- Eval Engine -- Execute multi-model evaluations with sandbox isolation and structured trace collection
- Scoring Pipeline -- Score traces using deterministic, threshold, rubric, and comparative judges with weighted dimension aggregation
- Reporting -- Render reports in 5 formats (terminal, JSON, Markdown, HTML, SARIF), track regressions with adaptive thresholds, and enforce CI/CD quality gates
# Clone the repository
git clone https://github.com/mimran-khan/SkilProbe.git
cd SkilProbe
# Install with uv (recommended)
uv sync
# Or with pip
pip install -e .

- Python 3.12+
- uv (recommended) or pip
| Variable | Required | Description |
|---|---|---|
| ANTHROPIC_API_KEY | For LLM judges | Anthropic API key for Claude models |
| OPENAI_API_KEY | For LLM judges | OpenAI API key for GPT models |
| GITHUB_TOKEN | For CI/CD | GitHub token for PR comments and status checks |
| GITLAB_TOKEN | For CI/CD | GitLab token for MR notes |
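Before running LLM-backed or CI/CD commands, it can help to check which of these variables are set. The following is a minimal sketch and not part of SkilProbe: the `missing_env_vars` helper and the grouping keys `llm_judges` and `cicd` are hypothetical names chosen for illustration.

```python
import os

# Variables from the table above, grouped by the feature that needs them.
# The grouping keys are hypothetical labels, not SkilProbe identifiers.
ENV_VARS = {
    "llm_judges": ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"],
    "cicd": ["GITHUB_TOKEN", "GITLAB_TOKEN"],
}


def missing_env_vars(feature, environ=None):
    """Return the variables for `feature` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in ENV_VARS[feature] if not env.get(name)]
```

Calling `missing_env_vars("llm_judges")` returns the names that still need a value. Depending on which provider you use, only one of the two API keys may actually be needed.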
# Lint a skill spec (no LLM required)
skilprobe static lint path/to/SKILL.md
# Full analysis with LLM scoring
skilprobe static analyze path/to/SKILL.md
# Token budget analysis
skilprobe static tokens path/to/SKILL.md
# Validate a dataset
skilprobe dataset validate path/to/dataset/
# Discover MCP tools
skilprobe mcp discover --config skilprobe.yaml
# Run an evaluation
skilprobe exec run --skills path/to/skill --dataset data.yaml --models gpt-4o
# Score evaluation traces
skilprobe score run traces.jsonl
# Render a report
skilprobe report render --input report.json --format html --output ./reports/
# Run CI/CD gate check
skilprobe gate check --report report.json

skilprobe/
├── static_analysis/ # Phase 1: Linting, LLM judges, token analysis
│ ├── spec_validator/ # Structural validation (frontmatter, markdown, secrets, licenses)
│ ├── activation_judge/ # LLM-based activation quality judge
│ ├── content_judge/ # LLM-based content quality judge
│ ├── token_analyzer/ # Token counting and budget analysis
│ └── rubrics/ # Scoring rubric definitions
├── datasets/ # Phase 2: Dataset engine
│ ├── loader.py # Multi-format loader (YAML, JSON, JSONL)
│ ├── storage.py # Versioned storage with changelog tracking
│ ├── validator.py # Schema validation and coverage analysis
│ └── generator.py # Synthetic dataset generation (template + LLM)
├── mcp/ # Phase 3a: MCP introspection
│ ├── connector.py # MCP server connector
│ ├── discovery.py # Tool catalog builder
│ ├── cache.py # Catalog caching with TTL
│ ├── tool_mapper.py # Skill-to-tool mapping with fuzzy matching
│ ├── schema_analyzer.py # JSON Schema analysis and complexity scoring
│ ├── param_synthesizer.py # Synthetic parameter generation
│ └── scenario_generator.py # Evaluation scenario generation
├── eval_engine/ # Phase 3b: Evaluation execution
│ ├── runner.py # Agent runner with step-by-step execution
│ ├── sandbox.py # Filesystem and network sandboxing
│ ├── trace.py # Structured trace collection
│ └── models.py # Execution plans, tasks, and traces
├── scoring/ # Phase 3c: Scoring pipeline
│ ├── judges.py # 10 built-in judges (deterministic + threshold + rubric)
│ ├── dispatcher.py # Judge orchestration and score aggregation
│ └── models.py # Score reports, rubrics, calibration models
├── report/ # Phase 4: Reporting and CI/CD
│ ├── builder.py # ScoreReport[] → ReportData transformation
│ ├── storage.py # Atomic file storage with latest symlink
│ ├── models.py # Report data models (23+ Pydantic models)
│ ├── renderers/ # Multi-format report rendering
│ │ ├── terminal.py # Rich terminal output with color
│ │ ├── json_renderer.py # Structured JSON
│ │ ├── markdown.py # Markdown with PR mode
│ │ ├── html.py # Self-contained HTML with Canvas charts
│ │ └── sarif.py # SARIF 2.1.0 for IDEs and security tools
│ ├── regression/ # Regression tracking
│ │ ├── detector.py # Adaptive threshold detection (volatility-based)
│ │ ├── baseline_manager.py # Baseline CRUD with archive rotation
│ │ ├── score_history.py # Score history tracking
│ │ ├── drift.py # Score drift analysis
│ │ └── hints.py # Diagnostic hint generation
│ ├── gate/ # CI/CD quality gates
│ │ ├── evaluator.py # Gate logic with composite/dimension/regression checks
│ │ ├── config.py # Gate configuration validation
│ │ └── exit_codes.py # Standard exit codes (0/1/2/78)
│ └── cicd/ # CI/CD platform integrations
│   ├── github.py # GitHub PR comments and status checks (PyGitHub)
│   ├── gitlab.py # GitLab MR notes (python-gitlab)
│   └── detect.py # CI environment auto-detection
├── llm/ # LLM abstraction layer
│ ├── router.py # Model routing via LiteLLM
│ ├── cost_tracker.py # Token and cost tracking
│ └── config.py # Provider configuration
├── models/ # Shared data models
│ ├── skill.py # SKILL.md parsed representation
│ ├── dataset.py # Dataset items, versions, changelog
│ ├── scoring.py # Scoring enums and types
│ ├── config.py # Configuration models
│ └── enums.py # Shared enumerations
└── cli.py # Typer CLI with 7 command groups
# Lint against structural rules (no LLM required)
skilprobe static lint SKILL.md
skilprobe static lint SKILL.md --format json --severity warning
skilprobe static lint SKILL.md --suppress FRONT001,MD001
# Full analysis with LLM judges
skilprobe static analyze SKILL.md --model claude-haiku-4-20250414
skilprobe static analyze SKILL.md --no-llm --format markdown --output report.md
# Token budget analysis
skilprobe static tokens SKILL.md --format json
skilprobe static tokens SKILL.md --providers openai,anthropic

# Validate dataset items
skilprobe dataset validate ./datasets/ --format json
skilprobe dataset validate data.yaml --no-duplicates
# Import items from a file
skilprobe dataset import source.yaml --output ./datasets/
# List and inspect items
skilprobe dataset list ./datasets/ --target golden
skilprobe dataset show ./datasets/ --id item-001 --format json
# Promote validated drafts to golden
skilprobe dataset promote ./datasets/ --approve-all
# Dataset statistics
skilprobe dataset stats ./datasets/ --format json

# Discover tools from configured MCP servers
skilprobe mcp discover --config skilprobe.yaml
skilprobe mcp discover --no-cache --format json
# Check server health
skilprobe mcp health --config skilprobe.yaml
# Map skill references to MCP tools
skilprobe mcp map SKILL.md --config skilprobe.yaml
# Generate evaluation scenarios
skilprobe mcp generate --config skilprobe.yaml --max-scenarios 50
# Clear catalog cache
skilprobe mcp cache-clear --expired-only

# Run evaluation
skilprobe exec run --skills ./skills/coding --dataset data.yaml \
--models gpt-4o,claude-sonnet-4-20250514 --output ./traces/
# Preview execution plan and cost estimate
skilprobe exec plan --skills ./skills/coding --dataset data.yaml
# Execution modes
skilprobe exec run --skills ./skill --dataset data.yaml --mode with_skill
skilprobe exec run --skills ./skill --dataset data.yaml --mode without_skill
skilprobe exec run --skills ./skill --dataset data.yaml --mode comparison

# Score evaluation traces
skilprobe score run traces.jsonl --output ./scores/
skilprobe score run traces.jsonl --format json
# View scoring configuration
skilprobe score config

# Render in various formats
skilprobe report render --input report.json --format terminal
skilprobe report render --input report.json --format html --output ./reports/
skilprobe report render --input report.json --format markdown
skilprobe report render --input report.json --format sarif --output ./reports/
# Post to GitHub PR
skilprobe report pr-comment --report report.json --pr 42 --repo owner/repo
# Post to GitLab MR
skilprobe report mr-comment --report report.json --mr 42 --project 12345
# Set GitHub commit status
skilprobe report status-check --report report.json --sha abc123 --repo owner/repo

# Create a baseline from report data
skilprobe baseline create --input report.json --skill-id coding
# View baseline
skilprobe baseline show --skill-id coding --format json
# Reset (archive) a baseline
skilprobe baseline reset --skill-id coding --confirm

# Run gate check (exits 0=pass, 1=fail, 2=error, 78=skip)
skilprobe gate check --report report.json
skilprobe gate check --report report.json --config gate.yaml
# Validate gate configuration
skilprobe gate validate gate.yaml

| Judge | Type | Dimension | Description |
|---|---|---|---|
| tool_call_success | Deterministic | tool_use_quality | Whether tool calls succeeded |
| expected_tool_used | Deterministic | task_completion | Whether expected tools were used |
| no_error_steps | Deterministic | tool_use_quality | No error steps in trace |
| output_format | Deterministic | output_quality | Output matches expected format |
| safety_command | Deterministic | tool_use_quality | No unsafe commands executed |
| token_budget | Threshold | process_quality | Token usage within budget |
| step_count | Threshold | process_quality | Step count within limits |
| execution_time | Threshold | process_quality | Execution time within limits |
| total_tokens | Threshold | process_quality | Total tokens within limits |
| tool_call_count | Threshold | process_quality | Tool call count within limits |
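To make the two judge types in the table concrete, here is a rough sketch of one deterministic judge and one threshold judge. It is illustrative only: the `Step` structure, the function signatures, and the scoring rules (a success ratio for `tool_call_success`, all-or-nothing for `step_count`) are assumptions, not SkilProbe's actual trace schema or judge logic.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical trace step; the real trace schema may differ."""
    kind: str        # e.g. "tool_call" or "message"
    ok: bool = True  # whether the step completed without error


def tool_call_success(steps):
    """Deterministic judge: share of tool calls that succeeded, on a 0-100 scale."""
    calls = [s for s in steps if s.kind == "tool_call"]
    if not calls:
        return 100.0  # assumption: no tool calls means nothing failed
    return 100.0 * sum(s.ok for s in calls) / len(calls)


def step_count(steps, max_steps):
    """Threshold judge: 100 if the trace stays within the limit, otherwise 0."""
    return 100.0 if len(steps) <= max_steps else 0.0


trace = [Step("tool_call"), Step("tool_call", ok=False), Step("message")]
print(tool_call_success(trace))        # 50.0
print(step_count(trace, max_steps=5))  # 100.0
```

A deterministic judge depends only on what is in the trace, while a threshold judge also needs a configured limit.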
| Dimension | Default Weight |
|---|---|
| Task Completion | 0.30 |
| Tool Use Quality | 0.25 |
| Output Quality | 0.20 |
| Process Quality | 0.15 |
| Instruction Following | 0.10 |
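One plausible way to combine dimension scores with these weights is a weighted mean. The sketch below assumes that aggregation rule and also assumes that unscored dimensions are skipped with the remaining weights renormalized; `composite_score` is a hypothetical helper, not SkilProbe's dispatcher.

```python
DEFAULT_WEIGHTS = {
    "task_completion": 0.30,
    "tool_use_quality": 0.25,
    "output_quality": 0.20,
    "process_quality": 0.15,
    "instruction_following": 0.10,
}


def composite_score(dimension_scores, weights=DEFAULT_WEIGHTS):
    """Weighted mean of 0-100 dimension scores.

    Assumption: dimensions without a score are skipped and the
    remaining weights are renormalized to sum to 1.
    """
    present = {d: w for d, w in weights.items() if d in dimension_scores}
    total = sum(present.values())
    if total == 0:
        raise ValueError("no scored dimensions")
    return sum(dimension_scores[d] * w for d, w in present.items()) / total


scores = {
    "task_completion": 90,
    "tool_use_quality": 80,
    "output_quality": 70,
    "process_quality": 60,
    "instruction_following": 100,
}
print(round(composite_score(scores), 2))  # 80.0
```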
All scores are on a 0-100 scale. Rubric judges (1-5 scale) are automatically normalized: (raw - 1) / 4 * 100.
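The normalization formula can be written as a small function. This is a sketch: the name `normalize_rubric` and the range check are additions for illustration.

```python
def normalize_rubric(raw):
    """Map a 1-5 rubric score onto the 0-100 scale: (raw - 1) / 4 * 100."""
    if not 1 <= raw <= 5:
        # Assumption: out-of-range rubric scores are rejected, not clamped.
        raise ValueError(f"rubric score must be between 1 and 5, got {raw}")
    return (raw - 1) / 4 * 100


print(normalize_rubric(1))  # 0.0
print(normalize_rubric(3))  # 50.0
print(normalize_rubric(5))  # 100.0
```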
| Grade | Score Range |
|---|---|
| A+ | 97-100 |
| A | 93-96 |
| A- | 90-92 |
| B+ | 87-89 |
| B | 83-86 |
| B- | 80-82 |
| C+ | 77-79 |
| C | 73-76 |
| C- | 70-72 |
| D | 60-69 |
| F | 0-59 |
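The grade table can be read as a lookup on lower bounds. The sketch below assumes that fractional scores fall into the band whose lower bound they meet (so 96.5 counts as an A); the table itself only lists integer ranges, and `letter_grade` is a hypothetical helper.

```python
# (lower bound, grade) pairs taken from the table above, highest first.
GRADE_FLOORS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (60, "D"), (0, "F"),
]


def letter_grade(score):
    """Return the letter grade for a 0-100 score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"  # not reached: the final floor is 0


print(letter_grade(100))   # A+
print(letter_grade(96.5))  # A
print(letter_grade(59.9))  # F
```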
Create a gate.yaml file:
min_composite_score: 60.0
min_dimension_score: 50.0
max_regression_pct: 10.0
required_dimensions:
  - task_completion
  - tool_use_quality
block_on_failure: true
new_skill_policy: allow # allow | warn | block

| Code | Meaning |
|---|---|
| 0 | PASS -- all checks passed |
| 1 | FAIL -- quality gate violated |
| 2 | ERROR -- misconfiguration or runtime error |
| 78 | SKIP -- gate skipped (e.g., missing data) |
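As a rough illustration of how the gate.yaml fields and exit codes above might fit together, here is a sketch of a gate check. It is not SkilProbe's evaluator: `GateExit`, `GateConfig`, and `check_gate` are hypothetical names, `block_on_failure` and `new_skill_policy` are left out, and the choices to apply `min_dimension_score` only to required dimensions and to fail on a missing required dimension are assumptions.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class GateExit(IntEnum):
    PASS = 0
    FAIL = 1
    ERROR = 2   # misconfiguration or runtime error; not produced by this sketch
    SKIP = 78


@dataclass
class GateConfig:
    min_composite_score: float = 60.0
    min_dimension_score: float = 50.0
    max_regression_pct: float = 10.0
    required_dimensions: list = field(
        default_factory=lambda: ["task_completion", "tool_use_quality"]
    )


def check_gate(cfg, composite, dimensions, regression_pct=None):
    """Return a GateExit for one report (assumed semantics, see lead-in)."""
    if composite is None:
        return GateExit.SKIP  # no data to evaluate
    if composite < cfg.min_composite_score:
        return GateExit.FAIL
    for name in cfg.required_dimensions:
        # Assumption: a missing required dimension fails the gate.
        if name not in dimensions or dimensions[name] < cfg.min_dimension_score:
            return GateExit.FAIL
    if regression_pct is not None and regression_pct > cfg.max_regression_pct:
        return GateExit.FAIL
    return GateExit.PASS


result = check_gate(
    GateConfig(), 72.0, {"task_completion": 80, "tool_use_quality": 55}, regression_pct=4.0
)
print(result.name, int(result))  # PASS 0
```

Because `GateExit` is an `IntEnum`, a wrapper script could pass the result straight to `sys.exit`.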
name: SkilProbe Evaluation
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install SkilProbe
        run: pip install -e .
      - name: Run evaluation
        run: |
          skilprobe exec run \
            --skills ./skills/coding \
            --dataset golden.yaml \
            --output ./traces/ \
            --yes
      - name: Score traces
        run: skilprobe score run ./traces/traces.jsonl --output ./scores/
      - name: Gate check
        run: skilprobe gate check --report ./scores/report.json --config gate.yaml
      - name: Post PR comment
        if: always()
        run: |
          skilprobe report pr-comment \
            --report ./scores/report.json \
            --pr ${{ github.event.pull_request.number }} \
            --repo ${{ github.repository }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

SkilProbe detects score regressions by comparing against baselines with adaptive thresholds:
- Static thresholds: Hard (10pt drop = CRITICAL), Warn (5pt drop = WARNING)
- Adaptive thresholds: When 5+ historical runs exist, thresholds adjust based on observed volatility (stdev * 2.5 for hard, stdev * 1.5 for warn)
- Baseline management: Create, archive, and auto-update baselines
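The threshold rules above can be sketched as follows. This is an illustration under stated assumptions, not the detector's actual code: it uses the sample standard deviation, treats a drop equal to the threshold as a hit, and returns `"OK"` (a hypothetical label) when no threshold is crossed.

```python
from statistics import stdev

STATIC_HARD = 10.0   # points dropped -> CRITICAL
STATIC_WARN = 5.0    # points dropped -> WARNING
MIN_HISTORY = 5      # runs needed before thresholds adapt


def thresholds(history):
    """Return (hard, warn) drop thresholds for a list of past scores."""
    if len(history) >= MIN_HISTORY:
        sd = stdev(history)  # assumption: sample standard deviation
        return sd * 2.5, sd * 1.5
    return STATIC_HARD, STATIC_WARN


def classify(baseline, current, history):
    """Classify the drop from baseline to current score."""
    drop = baseline - current
    if drop <= 0:
        return "OK"  # score held or improved
    hard, warn = thresholds(history)
    if drop >= hard:
        return "CRITICAL"
    if drop >= warn:
        return "WARNING"
    return "OK"


stable = [80, 82, 78, 81, 79]        # stdev is about 1.58
print(classify(80, 77, stable))      # WARNING
print(classify(80, 75, stable))      # CRITICAL
print(classify(80, 75, [80, 82]))    # WARNING
```

With a stable history the adaptive thresholds are tighter than the static ones, so the same 5-point drop is CRITICAL with five runs of history but only a WARNING with two.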
# Create a baseline
skilprobe baseline create --input report.json --skill-id coding
# Gate will automatically compare against baseline
skilprobe gate check --report report.json

SkilProbe is configured via skilprobe.yaml:

# Model configuration
models:
  - id: gpt-4o
    provider: openai
    enabled: true
  - id: claude-sonnet-4-20250514
    provider: anthropic
    enabled: true
# MCP server configuration
mcp_servers:
  - name: filesystem
    command: npx
    args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
# Scoring configuration
scoring:
  judges:
    - name: tool_call_success
      judge_type: deterministic
      dimension: tool_use_quality
    - name: token_budget
      judge_type: threshold
      dimension: process_quality
  weights:
    task_completion: 0.30
    tool_use_quality: 0.25
    output_quality: 0.20
    process_quality: 0.15
    instruction_following: 0.10

| Format | Use Case | Command Flag |
|---|---|---|
| Terminal | Interactive development | --format terminal |
| JSON | Programmatic consumption | --format json |
| Markdown | GitHub/GitLab PR comments | --format markdown |
| HTML | Self-contained shareable reports with charts | --format html |
| SARIF | IDE integration and security tool compatibility | --format sarif |
# Install dev dependencies
uv sync
# Run tests (991 tests)
uv run pytest
# Run with coverage
uv run pytest --cov=skilprobe
# Lint
uv run ruff check src/
# Type check
uv run mypy src/

- ~10,800 lines of Python source
- 991 tests across 36 test files
- 80+ Pydantic data models
- 7 CLI command groups, 30+ commands
| Component | Technology |
|---|---|
| Language | Python 3.12+ |
| CLI Framework | Typer + Rich |
| Data Validation | Pydantic v2 |
| LLM Integration | LiteLLM |
| Tokenization | tiktoken |
| Templating | Jinja2 |
| GitHub API | PyGitHub |
| GitLab API | python-gitlab |
| Report Format | SARIF 2.1.0 |
MIT