SkilProbe

CLI-first framework for evaluating AI agent skills. SkilProbe analyzes SKILL.md specifications through static validation, LLM-as-judge scoring, multi-model evaluation, regression tracking, and CI/CD integration.

Overview

SkilProbe provides a complete pipeline for measuring how well AI agents perform on defined skills:

Static Analysis -- Lint and validate SKILL.md files against structural rules, check token budgets, and run LLM-as-judge quality assessments
Dataset Engine -- Manage golden evaluation datasets with versioning, validation, synthetic generation, and promotion workflows
MCP Introspection -- Discover tools from MCP servers, map them to skill references, analyze schemas, and generate evaluation scenarios
Eval Engine -- Execute multi-model evaluations with sandbox isolation and structured trace collection
Scoring Pipeline -- Score traces using deterministic, threshold, rubric, and comparative judges with weighted dimension aggregation
Reporting -- Render reports in 5 formats (terminal, JSON, Markdown, HTML, SARIF), track regressions with adaptive thresholds, and enforce CI/CD quality gates

Installation

# Clone the repository
git clone https://github.com/mimran-khan/SkilProbe.git
cd SkilProbe

# Install with uv (recommended)
uv sync

# Or with pip
pip install -e .

Requirements

Python 3.12+
uv (recommended) or pip

Environment Variables

Variable	Required	Description
`ANTHROPIC_API_KEY`	For LLM judges	Anthropic API key for Claude models
`OPENAI_API_KEY`	For LLM judges	OpenAI API key for GPT models
`GITHUB_TOKEN`	For CI/CD	GitHub token for PR comments and status checks
`GITLAB_TOKEN`	For CI/CD	GitLab token for MR notes

Quick Start

# Lint a skill spec (no LLM required)
skilprobe static lint path/to/SKILL.md

# Full analysis with LLM scoring
skilprobe static analyze path/to/SKILL.md

# Token budget analysis
skilprobe static tokens path/to/SKILL.md

# Validate a dataset
skilprobe dataset validate path/to/dataset/

# Discover MCP tools
skilprobe mcp discover --config skilprobe.yaml

# Run an evaluation
skilprobe exec run --skills path/to/skill --dataset data.yaml --models gpt-4o

# Score evaluation traces
skilprobe score run traces.jsonl

# Render a report
skilprobe report render --input report.json --format html --output ./reports/

# Run CI/CD gate check
skilprobe gate check --report report.json

Architecture

skilprobe/
├── static_analysis/        # Phase 1: Linting, LLM judges, token analysis
│   ├── spec_validator/     #   Structural validation (frontmatter, markdown, secrets, licenses)
│   ├── activation_judge/   #   LLM-based activation quality judge
│   ├── content_judge/      #   LLM-based content quality judge
│   ├── token_analyzer/     #   Token counting and budget analysis
│   └── rubrics/            #   Scoring rubric definitions
├── datasets/               # Phase 2: Dataset engine
│   ├── loader.py           #   Multi-format loader (YAML, JSON, JSONL)
│   ├── storage.py          #   Versioned storage with changelog tracking
│   ├── validator.py        #   Schema validation and coverage analysis
│   └── generator.py        #   Synthetic dataset generation (template + LLM)
├── mcp/                    # Phase 3a: MCP introspection
│   ├── connector.py        #   MCP server connector
│   ├── discovery.py        #   Tool catalog builder
│   ├── cache.py            #   Catalog caching with TTL
│   ├── tool_mapper.py      #   Skill-to-tool mapping with fuzzy matching
│   ├── schema_analyzer.py  #   JSON Schema analysis and complexity scoring
│   ├── param_synthesizer.py#   Synthetic parameter generation
│   └── scenario_generator.py#  Evaluation scenario generation
├── eval_engine/            # Phase 3b: Evaluation execution
│   ├── runner.py           #   Agent runner with step-by-step execution
│   ├── sandbox.py          #   Filesystem and network sandboxing
│   ├── trace.py            #   Structured trace collection
│   └── models.py           #   Execution plans, tasks, and traces
├── scoring/                # Phase 3c: Scoring pipeline
│   ├── judges.py           #   10 built-in judges (deterministic + threshold + rubric)
│   ├── dispatcher.py       #   Judge orchestration and score aggregation
│   └── models.py           #   Score reports, rubrics, calibration models
├── report/                 # Phase 4: Reporting and CI/CD
│   ├── builder.py          #   ScoreReport[] → ReportData transformation
│   ├── storage.py          #   Atomic file storage with latest symlink
│   ├── models.py           #   Report data models (23+ Pydantic models)
│   ├── renderers/          #   Multi-format report rendering
│   │   ├── terminal.py     #     Rich terminal output with color
│   │   ├── json_renderer.py#     Structured JSON
│   │   ├── markdown.py     #     Markdown with PR mode
│   │   ├── html.py         #     Self-contained HTML with Canvas charts
│   │   └── sarif.py        #     SARIF 2.1.0 for IDEs and security tools
│   ├── regression/         #   Regression tracking
│   │   ├── detector.py     #     Adaptive threshold detection (volatility-based)
│   │   ├── baseline_manager.py#  Baseline CRUD with archive rotation
│   │   ├── score_history.py#     Score history tracking
│   │   ├── drift.py        #     Score drift analysis
│   │   └── hints.py        #     Diagnostic hint generation
│   ├── gate/               #   CI/CD quality gates
│   │   ├── evaluator.py    #     Gate logic with composite/dimension/regression checks
│   │   ├── config.py       #     Gate configuration validation
│   │   └── exit_codes.py   #     Standard exit codes (0/1/2/78)
│   └── cicd/               #   CI/CD platform integrations
│       ├── github.py       #     GitHub PR comments and status checks (PyGitHub)
│       ├── gitlab.py       #     GitLab MR notes (python-gitlab)
│       └── detect.py       #     CI environment auto-detection
├── llm/                    # LLM abstraction layer
│   ├── router.py           #   Model routing via LiteLLM
│   ├── cost_tracker.py     #   Token and cost tracking
│   └── config.py           #   Provider configuration
├── models/                 # Shared data models
│   ├── skill.py            #   SKILL.md parsed representation
│   ├── dataset.py          #   Dataset items, versions, changelog
│   ├── scoring.py          #   Scoring enums and types
│   ├── config.py           #   Configuration models
│   └── enums.py            #   Shared enumerations
└── cli.py                  # Typer CLI with 8 command groups

CLI Reference

`skilprobe static` -- Static Analysis

# Lint against structural rules (no LLM required)
skilprobe static lint SKILL.md
skilprobe static lint SKILL.md --format json --severity warning
skilprobe static lint SKILL.md --suppress FRONT001,MD001

# Full analysis with LLM judges
skilprobe static analyze SKILL.md --model claude-haiku-4-20250414
skilprobe static analyze SKILL.md --no-llm --format markdown --output report.md

# Token budget analysis
skilprobe static tokens SKILL.md --format json
skilprobe static tokens SKILL.md --providers openai,anthropic

`skilprobe dataset` -- Dataset Management

# Validate dataset items
skilprobe dataset validate ./datasets/ --format json
skilprobe dataset validate data.yaml --no-duplicates

# Import items from a file
skilprobe dataset import source.yaml --output ./datasets/

# List and inspect items
skilprobe dataset list ./datasets/ --target golden
skilprobe dataset show ./datasets/ --id item-001 --format json

# Promote validated drafts to golden
skilprobe dataset promote ./datasets/ --approve-all

# Dataset statistics
skilprobe dataset stats ./datasets/ --format json

`skilprobe mcp` -- MCP Introspection

# Discover tools from configured MCP servers
skilprobe mcp discover --config skilprobe.yaml
skilprobe mcp discover --no-cache --format json

# Check server health
skilprobe mcp health --config skilprobe.yaml

# Map skill references to MCP tools
skilprobe mcp map SKILL.md --config skilprobe.yaml

# Generate evaluation scenarios
skilprobe mcp generate --config skilprobe.yaml --max-scenarios 50

# Clear catalog cache
skilprobe mcp cache-clear --expired-only

`skilprobe exec` -- Evaluation Execution

# Run evaluation
skilprobe exec run --skills ./skills/coding --dataset data.yaml \
    --models gpt-4o,claude-sonnet-4-20250514 --output ./traces/

# Preview execution plan and cost estimate
skilprobe exec plan --skills ./skills/coding --dataset data.yaml

# Execution modes
skilprobe exec run --skills ./skill --dataset data.yaml --mode with_skill
skilprobe exec run --skills ./skill --dataset data.yaml --mode without_skill
skilprobe exec run --skills ./skill --dataset data.yaml --mode comparison

`skilprobe score` -- Scoring

# Score evaluation traces
skilprobe score run traces.jsonl --output ./scores/
skilprobe score run traces.jsonl --format json

# View scoring configuration
skilprobe score config

`skilprobe report` -- Report Rendering

# Render in various formats
skilprobe report render --input report.json --format terminal
skilprobe report render --input report.json --format html --output ./reports/
skilprobe report render --input report.json --format markdown
skilprobe report render --input report.json --format sarif --output ./reports/

# Post to GitHub PR
skilprobe report pr-comment --report report.json --pr 42 --repo owner/repo

# Post to GitLab MR
skilprobe report mr-comment --report report.json --mr 42 --project 12345

# Set GitHub commit status
skilprobe report status-check --report report.json --sha abc123 --repo owner/repo

`skilprobe baseline` -- Regression Baselines

# Create a baseline from report data
skilprobe baseline create --input report.json --skill-id coding

# View baseline
skilprobe baseline show --skill-id coding --format json

# Reset (archive) a baseline
skilprobe baseline reset --skill-id coding --confirm

`skilprobe gate` -- CI/CD Gates

# Run gate check (exits 0=pass, 1=fail, 2=error, 78=skip)
skilprobe gate check --report report.json
skilprobe gate check --report report.json --config gate.yaml

# Validate gate configuration
skilprobe gate validate gate.yaml

Scoring System

Built-in Judges

Judge	Type	Dimension	Description
`tool_call_success`	Deterministic	tool_use_quality	Whether tool calls succeeded
`expected_tool_used`	Deterministic	task_completion	Whether expected tools were used
`no_error_steps`	Deterministic	tool_use_quality	No error steps in trace
`output_format`	Deterministic	output_quality	Output matches expected format
`safety_command`	Deterministic	tool_use_quality	No unsafe commands executed
`token_budget`	Threshold	process_quality	Token usage within budget
`step_count`	Threshold	process_quality	Step count within limits
`execution_time`	Threshold	process_quality	Execution time within limits
`total_tokens`	Threshold	process_quality	Total tokens within limits
`tool_call_count`	Threshold	process_quality	Tool call count within limits

Dimension Weights

Dimension	Default Weight
Task Completion	0.30
Tool Use Quality	0.25
Output Quality	0.20
Process Quality	0.15
Instruction Following	0.10
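
To make the weighting concrete, here is a minimal sketch of how dimension scores could be combined into a composite score. The function name `composite_score` and the handling of missing dimensions (skip them and renormalize the remaining weights) are assumptions for illustration, not the documented behavior of the dispatcher.

```python
# Default weights copied from the table above.
DEFAULT_WEIGHTS = {
    "task_completion": 0.30,
    "tool_use_quality": 0.25,
    "output_quality": 0.20,
    "process_quality": 0.15,
    "instruction_following": 0.10,
}


def composite_score(dimension_scores, weights=DEFAULT_WEIGHTS):
    """Weighted mean of 0-100 dimension scores.

    Assumption: dimensions without a score are skipped and the remaining
    weights are renormalized; the real pipeline may treat them differently.
    """
    scored = {d: s for d, s in dimension_scores.items() if d in weights}
    total_weight = sum(weights[d] for d in scored)
    if total_weight == 0:
        raise ValueError("no scored dimension has a weight")
    return sum(weights[d] * s for d, s in scored.items()) / total_weight


example = {
    "task_completion": 100,
    "tool_use_quality": 50,
    "output_quality": 100,
    "process_quality": 100,
    "instruction_following": 100,
}
print(round(composite_score(example), 2))
```

With the default weights, a skill that is perfect everywhere except a 50 in Tool Use Quality loses 0.25 * 50 = 12.5 points from its composite.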

Score Scale

All scores are on a 0-100 scale. Rubric judges (1-5 scale) are automatically normalized: (raw - 1) / 4 * 100.
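
The normalization formula above can be written as a small helper. The name `normalize_rubric` and the range check are illustrative assumptions; only the formula itself comes from this README.

```python
def normalize_rubric(raw):
    """Map a 1-5 rubric score onto the 0-100 scale: (raw - 1) / 4 * 100."""
    if not 1 <= raw <= 5:
        # Assumption: out-of-range rubric scores are rejected rather than clamped.
        raise ValueError(f"rubric score must be between 1 and 5, got {raw}")
    return (raw - 1) / 4 * 100


for raw in (1, 3, 4, 5):
    print(raw, "->", normalize_rubric(raw))
```

A rubric score of 1 maps to 0, 3 maps to 50, and 5 maps to 100, so the midpoint of the rubric lands at the midpoint of the 0-100 scale.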

Grading

Grade	Score Range
A+	97-100
A	93-96
A-	90-92
B+	87-89
B	83-86
B-	80-82
C+	77-79
C	73-76
C-	70-72
D	60-69
F	0-59
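
The grade table can be read as a list of lower bounds. The sketch below assumes that fractional scores fall into the band whose lower bound they meet (so 96.5 is an A, not an A+); the README does not say how fractional scores are rounded, and `letter_grade` is a hypothetical name.

```python
# (lower bound, grade) pairs from the table above, highest first.
GRADE_FLOORS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (60, "D"), (0, "F"),
]


def letter_grade(score):
    """Return the letter grade for a 0-100 score."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    raise AssertionError("unreachable: the last floor is 0")


print(letter_grade(87.5))
```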

CI/CD Integration

Gate Configuration

Create a gate.yaml file:

min_composite_score: 60.0
min_dimension_score: 50.0
max_regression_pct: 10.0
required_dimensions:
  - task_completion
  - tool_use_quality
block_on_failure: true
new_skill_policy: allow   # allow | warn | block
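
One plausible reading of these settings, sketched in Python. The semantics here are assumptions: that `min_dimension_score` applies to each entry in `required_dimensions`, that a missing required dimension counts as a violation, and that `max_regression_pct` is a percentage drop relative to the baseline composite. The function `gate_violations` is hypothetical and is not part of the SkilProbe API.

```python
# Mirrors the numeric settings of gate.yaml above.
GATE = {
    "min_composite_score": 60.0,
    "min_dimension_score": 50.0,
    "max_regression_pct": 10.0,
    "required_dimensions": ["task_completion", "tool_use_quality"],
}


def gate_violations(composite, dimensions, baseline_composite=None, gate=GATE):
    """Return a list of human-readable violations; an empty list means pass."""
    violations = []
    if composite < gate["min_composite_score"]:
        violations.append(f"composite {composite} < {gate['min_composite_score']}")
    for name in gate["required_dimensions"]:
        score = dimensions.get(name)
        if score is None:
            violations.append(f"{name}: no score")  # assumed to be a violation
        elif score < gate["min_dimension_score"]:
            violations.append(f"{name} {score} < {gate['min_dimension_score']}")
    if baseline_composite:  # skip when there is no (or a zero) baseline
        drop_pct = (baseline_composite - composite) / baseline_composite * 100
        if drop_pct > gate["max_regression_pct"]:
            violations.append(f"regression {drop_pct:.1f}% > {gate['max_regression_pct']}%")
    return violations


print(gate_violations(55, {"task_completion": 80, "tool_use_quality": 40}, 70))
```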

Exit Codes

Code	Meaning
`0`	PASS -- all checks passed
`1`	FAIL -- quality gate violated
`2`	ERROR -- misconfiguration or runtime error
`78`	SKIP -- gate skipped (e.g., missing data)
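
A wrapper script can branch on these codes instead of treating every non-zero exit as a failure. The sketch below is an illustration only: `GateExit`, `should_block`, and `run_gate` are hypothetical names, and treating unknown codes as blocking is an assumption. The `skilprobe gate check --report` invocation is the one shown in the CLI Reference.

```python
import subprocess
from enum import IntEnum


class GateExit(IntEnum):
    """Exit codes from the table above."""
    PASS = 0
    FAIL = 1
    ERROR = 2
    SKIP = 78


def should_block(code, block_on_skip=False):
    """Decide whether a pipeline should stop for a given gate exit code."""
    try:
        outcome = GateExit(code)
    except ValueError:
        return True  # assumption: fail closed on codes the table does not list
    if outcome is GateExit.SKIP:
        return block_on_skip
    return outcome is not GateExit.PASS


def run_gate(report_path):
    """Run the gate check and return True if the pipeline should stop."""
    result = subprocess.run(["skilprobe", "gate", "check", "--report", report_path])
    return should_block(result.returncode)
```

Keeping SKIP distinct from FAIL lets a pipeline stay green when there is nothing to evaluate, while still stopping on misconfiguration (ERROR).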

GitHub Actions Example

name: SkilProbe Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install SkilProbe
        run: pip install -e .

      - name: Run evaluation
        run: |
          skilprobe exec run \
            --skills ./skills/coding \
            --dataset golden.yaml \
            --output ./traces/ \
            --yes

      - name: Score traces
        run: skilprobe score run ./traces/traces.jsonl --output ./scores/

      - name: Gate check
        run: skilprobe gate check --report ./scores/report.json --config gate.yaml

      - name: Post PR comment
        if: always()
        run: |
          skilprobe report pr-comment \
            --report ./scores/report.json \
            --pr ${{ github.event.pull_request.number }} \
            --repo ${{ github.repository }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Regression Tracking

SkilProbe detects score regressions by comparing against baselines with adaptive thresholds:

Static thresholds: Hard (10-point drop = CRITICAL), Warn (5-point drop = WARNING)
Adaptive thresholds: When 5+ historical runs exist, thresholds adjust based on observed volatility (stdev * 2.5 for hard, stdev * 1.5 for warn)
Baseline management: Create, archive, and auto-update baselines
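
The threshold rules above can be sketched as follows. This is a minimal illustration, not the code in `detector.py`: the use of the sample standard deviation, the `>=` comparisons, the `"OK"` label, and the function names are all assumptions.

```python
from statistics import stdev

STATIC_HARD = 10.0  # points
STATIC_WARN = 5.0   # points
MIN_HISTORY = 5     # runs needed before thresholds adapt


def thresholds(history):
    """Return (hard, warn) thresholds in score points."""
    if len(history) < MIN_HISTORY:
        return STATIC_HARD, STATIC_WARN
    volatility = stdev(history)  # assumption: sample standard deviation
    return volatility * 2.5, volatility * 1.5


def classify(baseline, current, history):
    """Classify the drop from baseline to current score."""
    drop = baseline - current
    if drop <= 0:
        return "OK"  # no drop, so no regression regardless of thresholds
    hard, warn = thresholds(history)
    if drop >= hard:
        return "CRITICAL"
    if drop >= warn:
        return "WARNING"
    return "OK"


stable_history = [80, 82, 78, 81, 79]
print(classify(80, 77, stable_history))  # 3-point drop against a stable history
print(classify(80, 77, [80, 82]))        # same drop with too little history
```

With a stable history, a 3-point drop is already flagged, while the same drop stays below the static 5-point warn threshold when fewer than 5 runs exist.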

# Create a baseline
skilprobe baseline create --input report.json --skill-id coding

# Gate will automatically compare against baseline
skilprobe gate check --report report.json

Configuration

SkilProbe is configured via skilprobe.yaml:

# Model configuration
models:
  - id: gpt-4o
    provider: openai
    enabled: true
  - id: claude-sonnet-4-20250514
    provider: anthropic
    enabled: true

# MCP server configuration
mcp_servers:
  - name: filesystem
    command: npx
    args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]

# Scoring configuration
scoring:
  judges:
    - name: tool_call_success
      judge_type: deterministic
      dimension: tool_use_quality
    - name: token_budget
      judge_type: threshold
      dimension: process_quality
  weights:
    task_completion: 0.30
    tool_use_quality: 0.25
    output_quality: 0.20
    process_quality: 0.15
    instruction_following: 0.10

Report Formats

Format	Use Case	Command Flag
Terminal	Interactive development	`--format terminal`
JSON	Programmatic consumption	`--format json`
Markdown	GitHub/GitLab PR comments	`--format markdown`
HTML	Self-contained shareable reports with charts	`--format html`
SARIF	IDE integration and security tool compatibility	`--format sarif`

Development

# Install dev dependencies
uv sync

# Run tests (991 tests)
uv run pytest

# Run with coverage
uv run pytest --cov=skilprobe

# Lint
uv run ruff check src/

# Type check
uv run mypy src/

Project Stats

~10,800 lines of Python source
991 tests across 36 test files
80+ Pydantic data models
8 CLI command groups, 30+ commands

Tech Stack

Component	Technology
Language	Python 3.12+
CLI Framework	Typer + Rich
Data Validation	Pydantic v2
LLM Integration	LiteLLM
Tokenization	tiktoken
Templating	Jinja2
GitHub API	PyGitHub
GitLab API	python-gitlab
Report Format	SARIF 2.1.0

License

MIT
