SkilProbe

CLI-first framework for evaluating AI agent skills. SkilProbe analyzes SKILL.md specifications through static validation, LLM-as-judge scoring, multi-model evaluation, regression tracking, and CI/CD integration.


Overview

SkilProbe provides a complete pipeline for measuring how well AI agents perform on defined skills:

  1. Static Analysis -- Lint and validate SKILL.md files against structural rules, check token budgets, and run LLM-as-judge quality assessments
  2. Dataset Engine -- Manage golden evaluation datasets with versioning, validation, synthetic generation, and promotion workflows
  3. MCP Introspection -- Discover tools from MCP servers, map them to skill references, analyze schemas, and generate evaluation scenarios
  4. Eval Engine -- Execute multi-model evaluations with sandbox isolation and structured trace collection
  5. Scoring Pipeline -- Score traces using deterministic, threshold, rubric, and comparative judges with weighted dimension aggregation
  6. Reporting -- Render reports in 5 formats (terminal, JSON, Markdown, HTML, SARIF), track regressions with adaptive thresholds, and enforce CI/CD quality gates

Installation

# Clone the repository
git clone https://github.com/mimran-khan/SkilProbe.git
cd SkilProbe

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e .

Requirements

  • Python 3.12+
  • uv (recommended) or pip

Environment Variables

| Variable | Required | Description |
|---|---|---|
| ANTHROPIC_API_KEY | For LLM judges | Anthropic API key for Claude models |
| OPENAI_API_KEY | For LLM judges | OpenAI API key for GPT models |
| GITHUB_TOKEN | For CI/CD | GitHub token for PR comments and status checks |
| GITLAB_TOKEN | For CI/CD | GitLab token for MR notes |
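
A pre-flight check can report which of these variables are unset before a run starts. The sketch below is illustrative only: `PURPOSES` and `missing_variables` are hypothetical names, not part of SkilProbe, and the purposes are copied from the table above.

```python
import os
from collections.abc import Mapping

# Variable names and purposes copied from the table above.
PURPOSES = {
    "ANTHROPIC_API_KEY": "LLM judges (Claude models)",
    "OPENAI_API_KEY": "LLM judges (GPT models)",
    "GITHUB_TOKEN": "CI/CD (GitHub PR comments and status checks)",
    "GITLAB_TOKEN": "CI/CD (GitLab MR notes)",
}


def missing_variables(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return {variable: purpose} for every variable that is unset or empty."""
    return {name: purpose for name, purpose in PURPOSES.items() if not env.get(name)}


for name, purpose in missing_variables().items():
    print(f"{name} is not set; needed for {purpose}")
```

None of the variables is needed for `skilprobe static lint`, which the Quick Start notes runs without an LLM.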

Quick Start

# Lint a skill spec (no LLM required)
skilprobe static lint path/to/SKILL.md

# Full analysis with LLM scoring
skilprobe static analyze path/to/SKILL.md

# Token budget analysis
skilprobe static tokens path/to/SKILL.md

# Validate a dataset
skilprobe dataset validate path/to/dataset/

# Discover MCP tools
skilprobe mcp discover --config skilprobe.yaml

# Run an evaluation
skilprobe exec run --skills path/to/skill --dataset data.yaml --models gpt-4o

# Score evaluation traces
skilprobe score run traces.jsonl

# Render a report
skilprobe report render --input report.json --format html --output ./reports/

# Run CI/CD gate check
skilprobe gate check --report report.json

Architecture

skilprobe/
├── static_analysis/        # Phase 1: Linting, LLM judges, token analysis
│   ├── spec_validator/     #   Structural validation (frontmatter, markdown, secrets, licenses)
│   ├── activation_judge/   #   LLM-based activation quality judge
│   ├── content_judge/      #   LLM-based content quality judge
│   ├── token_analyzer/     #   Token counting and budget analysis
│   └── rubrics/            #   Scoring rubric definitions
├── datasets/               # Phase 2: Dataset engine
│   ├── loader.py           #   Multi-format loader (YAML, JSON, JSONL)
│   ├── storage.py          #   Versioned storage with changelog tracking
│   ├── validator.py        #   Schema validation and coverage analysis
│   └── generator.py        #   Synthetic dataset generation (template + LLM)
├── mcp/                    # Phase 3a: MCP introspection
│   ├── connector.py        #   MCP server connector
│   ├── discovery.py        #   Tool catalog builder
│   ├── cache.py            #   Catalog caching with TTL
│   ├── tool_mapper.py      #   Skill-to-tool mapping with fuzzy matching
│   ├── schema_analyzer.py  #   JSON Schema analysis and complexity scoring
│   ├── param_synthesizer.py#   Synthetic parameter generation
│   └── scenario_generator.py#  Evaluation scenario generation
├── eval_engine/            # Phase 3b: Evaluation execution
│   ├── runner.py           #   Agent runner with step-by-step execution
│   ├── sandbox.py          #   Filesystem and network sandboxing
│   ├── trace.py            #   Structured trace collection
│   └── models.py           #   Execution plans, tasks, and traces
├── scoring/                # Phase 3c: Scoring pipeline
│   ├── judges.py           #   10 built-in judges (deterministic + threshold + rubric)
│   ├── dispatcher.py       #   Judge orchestration and score aggregation
│   └── models.py           #   Score reports, rubrics, calibration models
├── report/                 # Phase 4: Reporting and CI/CD
│   ├── builder.py          #   ScoreReport[] → ReportData transformation
│   ├── storage.py          #   Atomic file storage with latest symlink
│   ├── models.py           #   Report data models (23+ Pydantic models)
│   ├── renderers/          #   Multi-format report rendering
│   │   ├── terminal.py     #     Rich terminal output with color
│   │   ├── json_renderer.py#     Structured JSON
│   │   ├── markdown.py     #     Markdown with PR mode
│   │   ├── html.py         #     Self-contained HTML with Canvas charts
│   │   └── sarif.py        #     SARIF 2.1.0 for IDEs and security tools
│   ├── regression/         #   Regression tracking
│   │   ├── detector.py     #     Adaptive threshold detection (volatility-based)
│   │   ├── baseline_manager.py#  Baseline CRUD with archive rotation
│   │   ├── score_history.py#     Score history tracking
│   │   ├── drift.py        #       Score drift analysis
│   │   └── hints.py        #       Diagnostic hint generation
│   ├── gate/               #   CI/CD quality gates
│   │   ├── evaluator.py    #     Gate logic with composite/dimension/regression checks
│   │   ├── config.py       #     Gate configuration validation
│   │   └── exit_codes.py   #     Standard exit codes (0/1/2/78)
│   └── cicd/               #   CI/CD platform integrations
│       ├── github.py       #     GitHub PR comments and status checks (PyGitHub)
│       ├── gitlab.py       #     GitLab MR notes (python-gitlab)
│       └── detect.py       #     CI environment auto-detection
├── llm/                    # LLM abstraction layer
│   ├── router.py           #   Model routing via LiteLLM
│   ├── cost_tracker.py     #   Token and cost tracking
│   └── config.py           #   Provider configuration
├── models/                 # Shared data models
│   ├── skill.py            #   SKILL.md parsed representation
│   ├── dataset.py          #   Dataset items, versions, changelog
│   ├── scoring.py          #   Scoring enums and types
│   ├── config.py           #   Configuration models
│   └── enums.py            #   Shared enumerations
└── cli.py                  # Typer CLI with 8 command groups

CLI Reference

skilprobe static -- Static Analysis

# Lint against structural rules (no LLM required)
skilprobe static lint SKILL.md
skilprobe static lint SKILL.md --format json --severity warning
skilprobe static lint SKILL.md --suppress FRONT001,MD001

# Full analysis with LLM judges
skilprobe static analyze SKILL.md --model claude-haiku-4-20250414
skilprobe static analyze SKILL.md --no-llm --format markdown --output report.md

# Token budget analysis
skilprobe static tokens SKILL.md --format json
skilprobe static tokens SKILL.md --providers openai,anthropic

skilprobe dataset -- Dataset Management

# Validate dataset items
skilprobe dataset validate ./datasets/ --format json
skilprobe dataset validate data.yaml --no-duplicates

# Import items from a file
skilprobe dataset import source.yaml --output ./datasets/

# List and inspect items
skilprobe dataset list ./datasets/ --target golden
skilprobe dataset show ./datasets/ --id item-001 --format json

# Promote validated drafts to golden
skilprobe dataset promote ./datasets/ --approve-all

# Dataset statistics
skilprobe dataset stats ./datasets/ --format json

skilprobe mcp -- MCP Introspection

# Discover tools from configured MCP servers
skilprobe mcp discover --config skilprobe.yaml
skilprobe mcp discover --no-cache --format json

# Check server health
skilprobe mcp health --config skilprobe.yaml

# Map skill references to MCP tools
skilprobe mcp map SKILL.md --config skilprobe.yaml

# Generate evaluation scenarios
skilprobe mcp generate --config skilprobe.yaml --max-scenarios 50

# Clear catalog cache
skilprobe mcp cache-clear --expired-only

skilprobe exec -- Evaluation Execution

# Run evaluation
skilprobe exec run --skills ./skills/coding --dataset data.yaml \
    --models gpt-4o,claude-sonnet-4-20250514 --output ./traces/

# Preview execution plan and cost estimate
skilprobe exec plan --skills ./skills/coding --dataset data.yaml

# Execution modes
skilprobe exec run --skills ./skill --dataset data.yaml --mode with_skill
skilprobe exec run --skills ./skill --dataset data.yaml --mode without_skill
skilprobe exec run --skills ./skill --dataset data.yaml --mode comparison

skilprobe score -- Scoring

# Score evaluation traces
skilprobe score run traces.jsonl --output ./scores/
skilprobe score run traces.jsonl --format json

# View scoring configuration
skilprobe score config

skilprobe report -- Report Rendering

# Render in various formats
skilprobe report render --input report.json --format terminal
skilprobe report render --input report.json --format html --output ./reports/
skilprobe report render --input report.json --format markdown
skilprobe report render --input report.json --format sarif --output ./reports/

# Post to GitHub PR
skilprobe report pr-comment --report report.json --pr 42 --repo owner/repo

# Post to GitLab MR
skilprobe report mr-comment --report report.json --mr 42 --project 12345

# Set GitHub commit status
skilprobe report status-check --report report.json --sha abc123 --repo owner/repo

skilprobe baseline -- Regression Baselines

# Create a baseline from report data
skilprobe baseline create --input report.json --skill-id coding

# View baseline
skilprobe baseline show --skill-id coding --format json

# Reset (archive) a baseline
skilprobe baseline reset --skill-id coding --confirm

skilprobe gate -- CI/CD Gates

# Run gate check (exits 0=pass, 1=fail, 2=error, 78=skip)
skilprobe gate check --report report.json
skilprobe gate check --report report.json --config gate.yaml

# Validate gate configuration
skilprobe gate validate gate.yaml

Scoring System

Built-in Judges

| Judge | Type | Dimension | Description |
|---|---|---|---|
| tool_call_success | Deterministic | tool_use_quality | Whether tool calls succeeded |
| expected_tool_used | Deterministic | task_completion | Whether expected tools were used |
| no_error_steps | Deterministic | tool_use_quality | No error steps in trace |
| output_format | Deterministic | output_quality | Output matches expected format |
| safety_command | Deterministic | tool_use_quality | No unsafe commands executed |
| token_budget | Threshold | process_quality | Token usage within budget |
| step_count | Threshold | process_quality | Step count within limits |
| execution_time | Threshold | process_quality | Execution time within limits |
| total_tokens | Threshold | process_quality | Total tokens within limits |
| tool_call_count | Threshold | process_quality | Tool call count within limits |

Dimension Weights

| Dimension | Default Weight |
|---|---|
| Task Completion | 0.30 |
| Tool Use Quality | 0.25 |
| Output Quality | 0.20 |
| Process Quality | 0.15 |
| Instruction Following | 0.10 |
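
As a rough illustration of how these weights could combine dimension scores into a composite, here is a minimal weighted-mean sketch. The function name `composite_score` and the handling of missing dimensions (skip them and renormalize the remaining weights) are assumptions; SkilProbe's dispatcher may aggregate differently.

```python
# Default dimension weights from the table above.
DEFAULT_WEIGHTS = {
    "task_completion": 0.30,
    "tool_use_quality": 0.25,
    "output_quality": 0.20,
    "process_quality": 0.15,
    "instruction_following": 0.10,
}


def composite_score(
    dimension_scores: dict[str, float],
    weights: dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted mean of 0-100 dimension scores.

    Assumption: dimensions without a score are skipped and the remaining
    weights are renormalized so they still sum to 1.
    """
    scored = {d: s for d, s in dimension_scores.items() if d in weights}
    total_weight = sum(weights[d] for d in scored)
    if total_weight == 0:
        raise ValueError("no scored dimension has a known weight")
    return sum(weights[d] * s for d, s in scored.items()) / total_weight


scores = {
    "task_completion": 90.0,
    "tool_use_quality": 80.0,
    "output_quality": 70.0,
    "process_quality": 60.0,
    "instruction_following": 100.0,
}
print(round(composite_score(scores), 2))  # 27 + 20 + 14 + 9 + 10 = 80.0
```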

Score Scale

All scores are on a 0-100 scale. Rubric judges (1-5 scale) are automatically normalized: (raw - 1) / 4 * 100.
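
The normalization formula can be checked with a few lines of Python. The helper name `normalize_rubric` and the range check are illustrative assumptions, not SkilProbe's API.

```python
def normalize_rubric(raw: float) -> float:
    """Map a 1-5 rubric score onto the 0-100 scale: (raw - 1) / 4 * 100."""
    if not 1 <= raw <= 5:
        # Assumption: out-of-range rubric scores are rejected rather than clamped.
        raise ValueError(f"rubric score must be between 1 and 5, got {raw}")
    return (raw - 1) / 4 * 100


for raw in (1, 2, 3, 4, 5):
    print(raw, normalize_rubric(raw))  # 0.0, 25.0, 50.0, 75.0, 100.0
```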

Grading

| Grade | Score Range |
|---|---|
| A+ | 97-100 |
| A | 93-96 |
| A- | 90-92 |
| B+ | 87-89 |
| B | 83-86 |
| B- | 80-82 |
| C+ | 77-79 |
| C | 73-76 |
| C- | 70-72 |
| D | 60-69 |
| F | 0-59 |
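
The table lists whole-number ranges and does not say how a fractional score such as 96.5 is graded. The sketch below assumes each band starts at its lower bound and runs up to the next one, so 96.5 is an A. `grade_for` is a hypothetical helper, not SkilProbe's implementation.

```python
from bisect import bisect_right

# Lower bound of each grade band, ascending, taken from the table above.
_LOWER_BOUNDS = [0, 60, 70, 73, 77, 80, 83, 87, 90, 93, 97]
_GRADES = ["F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def grade_for(score: float) -> str:
    """Return the letter grade whose band contains a 0-100 score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # bisect_right counts the lower bounds that are <= score; the last such
    # bound identifies the band.
    return _GRADES[bisect_right(_LOWER_BOUNDS, score) - 1]


for value in (100, 96.5, 90, 69.9, 59):
    print(value, grade_for(value))
```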

CI/CD Integration

Gate Configuration

Create a gate.yaml file:

min_composite_score: 60.0
min_dimension_score: 50.0
max_regression_pct: 10.0
required_dimensions:
  - task_completion
  - tool_use_quality
block_on_failure: true
new_skill_policy: allow   # allow | warn | block

Exit Codes

| Code | Meaning |
|---|---|
| 0 | PASS -- all checks passed |
| 1 | FAIL -- quality gate violated |
| 2 | ERROR -- misconfiguration or runtime error |
| 78 | SKIP -- gate skipped (e.g., missing data) |
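
A wrapper script can map these codes to names before deciding what to do. In the sketch below, `GateExit`, `should_block_merge`, and `run_gate` are hypothetical helpers, and the policy of treating SKIP as non-blocking is an assumption rather than documented behavior. Only the `skilprobe gate check --report` invocation comes from this README.

```python
import subprocess
from enum import IntEnum


class GateExit(IntEnum):
    """Exit codes from the table above."""
    PASS = 0
    FAIL = 1
    ERROR = 2
    SKIP = 78


def should_block_merge(code: int) -> bool:
    """Hypothetical policy: block on FAIL and ERROR, let PASS and SKIP through.

    Raises ValueError for codes that are not in the table.
    """
    return GateExit(code) in (GateExit.FAIL, GateExit.ERROR)


def run_gate(report_path: str) -> GateExit:
    """Run the gate check and return its exit code as a GateExit member."""
    result = subprocess.run(
        ["skilprobe", "gate", "check", "--report", report_path],
        check=False,  # a non-zero exit is an expected outcome, not an exception
    )
    return GateExit(result.returncode)


# Usage (requires skilprobe on PATH):
#   outcome = run_gate("report.json")
#   print(outcome.name, should_block_merge(outcome))
```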

GitHub Actions Example

name: SkilProbe Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install SkilProbe
        run: pip install -e .

      - name: Run evaluation
        run: |
          skilprobe exec run \
            --skills ./skills/coding \
            --dataset golden.yaml \
            --output ./traces/ \
            --yes

      - name: Score traces
        run: skilprobe score run ./traces/traces.jsonl --output ./scores/

      - name: Gate check
        run: skilprobe gate check --report ./scores/report.json --config gate.yaml

      - name: Post PR comment
        if: always()
        run: |
          skilprobe report pr-comment \
            --report ./scores/report.json \
            --pr ${{ github.event.pull_request.number }} \
            --repo ${{ github.repository }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Regression Tracking

SkilProbe detects score regressions by comparing against baselines with adaptive thresholds:

  • Static thresholds: a drop of 10 points or more is CRITICAL (hard threshold); a drop of 5 points or more is WARNING (warn threshold)
  • Adaptive thresholds: when 5+ historical runs exist, thresholds adjust to the observed volatility (stdev * 2.5 for hard, stdev * 1.5 for warn)
  • Baseline management: create, archive, and auto-update baselines

# Create a baseline
skilprobe baseline create --input report.json --skill-id coding

# Gate will automatically compare against baseline
skilprobe gate check --report report.json
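
The threshold rules listed above can be sketched as follows. This is not the code in detector.py: the function names, the use of the sample standard deviation, the inclusive comparisons, and the "OK" label are all assumptions made for illustration.

```python
from statistics import stdev

STATIC_HARD = 10.0  # points dropped before a run is CRITICAL
STATIC_WARN = 5.0   # points dropped before a run is WARNING
MIN_HISTORY = 5     # runs needed before adaptive thresholds apply


def thresholds(history: list[float]) -> tuple[float, float]:
    """Return (hard, warn) drop thresholds in score points."""
    if len(history) >= MIN_HISTORY:
        sigma = stdev(history)  # assumption: sample standard deviation
        return sigma * 2.5, sigma * 1.5
    return STATIC_HARD, STATIC_WARN


def classify(baseline: float, current: float, history: list[float]) -> str:
    """Classify the drop from baseline to current as CRITICAL, WARNING, or OK."""
    drop = baseline - current
    if drop <= 0:
        return "OK"  # the score held or improved
    hard, warn = thresholds(history)
    if drop >= hard:
        return "CRITICAL"
    if drop >= warn:
        return "WARNING"
    return "OK"


history = [80, 82, 78, 81, 79]     # stdev is about 1.58
print(thresholds(history))         # roughly (3.95, 2.37)
print(classify(80, 77, history))   # a 3-point drop under adaptive thresholds
print(classify(80, 77, [80, 82]))  # the same drop under static thresholds
```

With a stable history, the adaptive thresholds are tighter than the static ones, so the same 3-point drop is flagged in the first case and ignored in the second.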

Configuration

SkilProbe is configured via skilprobe.yaml:

# Model configuration
models:
  - id: gpt-4o
    provider: openai
    enabled: true
  - id: claude-sonnet-4-20250514
    provider: anthropic
    enabled: true

# MCP server configuration
mcp_servers:
  - name: filesystem
    command: npx
    args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]

# Scoring configuration
scoring:
  judges:
    - name: tool_call_success
      judge_type: deterministic
      dimension: tool_use_quality
    - name: token_budget
      judge_type: threshold
      dimension: process_quality
  weights:
    task_completion: 0.30
    tool_use_quality: 0.25
    output_quality: 0.20
    process_quality: 0.15
    instruction_following: 0.10

Report Formats

| Format | Use Case | Command Flag |
|---|---|---|
| Terminal | Interactive development | --format terminal |
| JSON | Programmatic consumption | --format json |
| Markdown | GitHub/GitLab PR comments | --format markdown |
| HTML | Self-contained shareable reports with charts | --format html |
| SARIF | IDE integration and security tool compatibility | --format sarif |

Development

# Install dev dependencies
uv sync

# Run tests (991 tests)
uv run pytest

# Run with coverage
uv run pytest --cov=skilprobe

# Lint
uv run ruff check src/

# Type check
uv run mypy src/

Project Stats

  • ~10,800 lines of Python source
  • 991 tests across 36 test files
  • 80+ Pydantic data models
  • 8 CLI command groups, 30+ commands

Tech Stack

| Component | Technology |
|---|---|
| Language | Python 3.12+ |
| CLI Framework | Typer + Rich |
| Data Validation | Pydantic v2 |
| LLM Integration | LiteLLM |
| Tokenization | tiktoken |
| Templating | Jinja2 |
| GitHub API | PyGitHub |
| GitLab API | python-gitlab |
| Report Format | SARIF 2.1.0 |

License

MIT
