⚒️ SkillForge

Self-Evolving Skill Intelligence Platform for AI Agents

Turn static agent skills into living, data-driven assets that learn from every interaction,
self-diagnose failures, and continuously evolve to stay optimal.

Overview

SkillForge is a skill intelligence layer for AI agents. Rather than treating skills as static prompt files, SkillForge makes them living assets — tracked, ranked, diagnosed, and evolved through real-world usage data.

Most agents ship with a fixed set of skills. When a skill fails, nobody notices. When a better version exists, nobody knows. SkillForge solves this by wrapping every skill in a feedback loop:

Track every execution outcome (success, latency, tokens, feedback)
Rank skills dynamically using reinforcement-learning-inspired Q-values
Diagnose failure patterns automatically (rule-based or LLM-assisted)
Evolve underperforming skills — patch prompts, adjust metadata, prune dead skills
Load progressively at 3 detail tiers to minimize context window waste

Built on research from Memento-Skills, AEL, SKILLREDUCER, SEA-Eval, and MemQ, SkillForge is designed as a drop-in skill intelligence layer for any agent framework.

Key Stats

Metric	Value
Components	35+
Python files	58
Lines of code	22,593
Tests	373 passing
Test time	1.30s
External deps	0 (stdlib-only)

Features

Core (6 components)

Component	Description
Skill Registry	SQLite-backed registry with 3-tier progressive loading (metadata → core prompt → full resources). Full CRUD, versioning, lifecycle management (draft → active → deprecated → archived).
Effectiveness Tracker	Tracks execution outcomes (success, latency, tokens, feedback). Maintains rolling Q-values via TD(λ) temporal-difference learning — skills that consistently succeed get higher priority.
Self-Diagnosis Engine	Analyzes failure patterns across recent outcomes. Supports rule-based heuristics out of the box, with optional LLM-assisted analysis for deeper insights. Generates patch suggestions with confidence scores.
Skill Dependency Graph	Directed acyclic graph of skill relationships. Supports topological sorting, downstream impact analysis, and Q-value propagation through dependency chains.
Progressive Loader	Loads skills at the right detail level for the task. Tier 1 = metadata (~30 tokens), Tier 2 = core prompt, Tier 3 = full resources. Supports multiple routing strategies (Q-value, success rate, relevance, usage count).
Evolution Loop	Continuous lifecycle management. Evaluates skill health (healthy/warning/critical), triggers diagnosis and patching for underperformers, and prunes dead skills that are low-Q, low-usage, and stale.

Intelligence (5 components)

Component	Description
Conflict Detector	Detects skill overlaps, contradictions, and dependency cycles. Uses Jaccard similarity for overlap detection (threshold > 0.6).
Health Monitor	4-weight health scoring: Q-value (35%), success rate (30%), recency decay (20%, half-life 14 days), usage frequency (15%, log-scaled saturation).
Skill Creator	Auto-creates skills from successful execution trajectories. Pattern frequency threshold (>60%), optional LLM-assisted extraction.
Skill Analyzer	Clustering, pattern detection, and recommendations. Identifies underperforming skill clusters and suggests improvements.
Skill Optimizer	Compression (deduplicate, reorder), splitting (decompose monolithic skills), and merging (consolidate similar skills).

Advanced Intelligence (6 components)

Component	Description
Elastic Memory	Adaptive memory store with importance tracking. Consolidation via Jaccard similarity > 0.7. Auto-compact with composite retention score (importance 0.45, recency 0.30, access frequency 0.25).
Alert Manager	Threshold, trend, and anomaly detection alerts. Alert lifecycle (active → acknowledged → resolved). Cooldown support, rule targeting, callback registration for webhook/email/Slack notifications.
Skill Generator	Zero-shot skill generation from natural language descriptions. Template-driven (no LLM required), TF-IDF-lite keyword extraction, domain-aware tag inference.
Enhanced RL Optimizer	Contextual bandits (epsilon-greedy with decay), replay buffer (fixed-capacity deque), reward model (linear SGD), curriculum scheduler (difficulty-based ordering).
Performance Predictor	OLS linear regression for performance trend prediction. Predicts skill improvement/decline from last 20 outcomes. Population variance for consistency measurement.
Skill Transfer Engine	Cross-agent skill export/import. Formats skills for Hermes, OpenClaw, LangChain, CrewAI. Includes metadata, core content, and dependencies.

Platform (4 components)

Component	Description
REST API Server	13 endpoints on port 8742 (stdlib http.server). CRUD for skills, outcomes, evolution, health, graph, metrics. Zero external dependencies.
MCP Server	8 tools via stdin/stdout JSON-RPC. Compatible with Claude Desktop, Cursor, and other MCP clients.
CLI Tool	8 commands: serve, skills list/health/search, evolve, dashboard, import, export.
Web Dashboard	React 18 + Vite + Tailwind CSS 4 + shadcn/ui. 5 pages (Dashboard, Skills, Evolution, Graph, Settings). Recharts visualization. Dark theme. Mock data fallback.

Marketplace & Observability (4 components)

Component	Description
Skill Marketplace	Publish, install, rate, and deprecate skills. SQLite-backed catalogue with search, filtering, and install tracking.
Observability — Tracing	OpenTelemetry-inspired span-based distributed tracing. SQLite persistence, nested spans, parent-child relationships.
Observability — Metrics	Counters, gauges, timings with summarization (min/max/mean/p50/p95/p99), histograms, label-based filtering.
Observability — Logging	JSON-structured log entries with 5 severity levels, trace/span correlation, component/skill filtering, full-text search.

Async & Versioning (4 components)

Component	Description
Async Skill Registry	Async wrapper via `asyncio.to_thread()` for all registry CRUD operations.
Async Q-Value Tracker	Async outcome recording, Q-value queries, and TD(λ) updates.
Async Progressive Loader	Async skill loading and sticky skill retrieval.
Version Manager	Semantic versioning (semver 2.0.0). Full version history with snapshots, rollback to any version, content diff, auto-generated changelogs.

Scale (3 components)

Component	Description
A/B Testing	Full experiment lifecycle. 4 assignment strategies (RANDOM, ROUND_ROBIN, WEIGHTED_RANDOM, HASH_BASED). Two-proportion z-test (2 variants) + chi-squared (3+ variants). Normal CDF via Abramowitz & Stegun (~1.5e-7 accuracy).
DB Abstraction	`DatabaseBackend` Protocol interface. `SQLiteBackend` (WAL mode, threading lock) + `MemoryBackend` (ephemeral for tests). `DatabaseConnection` context manager with auto-commit/rollback.
Integration Adapters	`GitHubAdapter` (REST API with SHA tracking), `SlackAdapter` (incoming webhooks), `WebhookAdapter` (HTTP POST + HMAC-SHA256), `FileAdapter` (JSON/markdown export). `AdapterRegistry` for dispatch and health checks.
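
The `WebhookAdapter` row above mentions HTTP POST with HMAC-SHA256. As a rough illustration of what such signing usually involves, here is a minimal stdlib sketch. The function names `sign_payload` and `verify_signature` and the canonical JSON encoding are assumptions for this example, not SkillForge's actual API or wire format.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, payload: dict) -> tuple[bytes, str]:
    """Serialize the payload deterministically and return (body, hex signature)."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the HMAC on the receiver side and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


body, sig = sign_payload(b"shared-secret", {"event": "skill.evolved", "skill": "code-reviewer"})
print(verify_signature(b"shared-secret", body, sig))
```

The receiver should verify the signature over the raw request bytes before parsing them, which is why the sketch returns the serialized body alongside the signature.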

Production (6 components)

Component	Description
Circuit Breaker	CLOSED → OPEN → HALF_OPEN state machine. Configurable failure threshold, recovery timeout, half-open probes. Thread-safe via `threading.Lock`.
Retry Policy	Exponential backoff with jitter. Configurable max attempts, base delay, max delay, backoff factor, retriable exceptions.
Bulkhead	Concurrency limiter with semaphore. Configurable max concurrent executions and max wait time.
Graceful Degradation	Cascading fallback chains. Primary function + ordered fallback list. `FallbackChainExhaustedError` on total failure.
Resilient Executor	Composable wrapper: Bulkhead → Circuit Breaker → Retry → Callable. Single API for all resilience layers.
Caching	`TTLCache` (time-to-live with lazy expiry + max_size eviction), `LRUCache` (Least Recently Used), `CachedStore` (composable over data source), `@cached` decorator. `CacheStats` for hit/miss tracking.

Installation

# From Git
git clone https://github.com/dwickyfp/skillforge.git
cd skillforge
pip install -e .

# Or install directly
pip install git+https://github.com/dwickyfp/skillforge.git

Requirements: Python 3.10+ and SQLite 3.35+ (provided through Python's built-in sqlite3 module; the exact SQLite version depends on your Python build)

Zero external dependencies — SkillForge uses only Python stdlib.
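
To confirm that a given interpreter meets the stated requirements, a quick check along these lines may help. It is a small sketch using only the stdlib; `check_environment` is a hypothetical helper written for this example, not part of SkillForge.

```python
import sqlite3
import sys

MIN_PYTHON = (3, 10)
MIN_SQLITE = (3, 35, 0)


def check_environment() -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found} is older than 3.10")
    # sqlite_version_info describes the SQLite library linked into this Python build.
    if sqlite3.sqlite_version_info < MIN_SQLITE:
        problems.append(f"SQLite {sqlite3.sqlite_version} is older than 3.35")
    return problems


print(check_environment() or "environment OK")
```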

Quick Start

from skillforge.forge import SkillForge

# Initialize — creates ~/.skillforge/skillforge.db automatically
forge = SkillForge()

# Register a skill with 3-tier content
skill = forge.register_skill(
    name="code-reviewer",
    tier1_metadata="Review code for bugs, style, and performance issues",
    tier2_core="You are an expert code reviewer. Analyze the given code for...",
    tier3_resources=["examples/review_template.md", "rules/style_guide.md"],
    tags=["code", "review", "quality"],
)

# Load skills by natural-language query (ranked by Q-value)
skills = forge.load_skill("review my code", tier=2, limit=3)
print(f"Best match: {skills[0].name} (Q={skills[0].q_value:.2f})")

# Record execution outcomes
forge.record_outcome(
    skill_id=skill.id,
    success=True,
    latency_ms=340,
    tokens_used=1200,
    user_feedback=4.5,
)

# Run evolution cycle — diagnose failures, patch skills, prune dead ones
report = forge.run_evolution_loop()
print(report.summary())

# Inspect skill stats
stats = forge.get_skill_stats(skill.id)
print(f"Success rate: {stats['success_rate']:.0%}")

forge.close()

Or use the context manager:

with SkillForge() as forge:
    forge.register_skill("greeter", "Greet users warmly")
    # ... all operations ...
# automatically closed

Architecture

┌───────────────────────────────────────────────────────────────────────────┐
│                         SkillForge Orchestrator                           │
│                             (forge.py)                                    │
├───────────────┬───────────────┬──────────────────┬────────────────────────┤
│               │               │                  │                        │
│  ┌────────────▼──────────┐    │  ┌───────────────▼────────────┐          │
│  │   Skill Registry      │    │  │  Effectiveness Tracker     │          │
│  │   ─────────────────   │    │  │  ──────────────────────    │          │
│  │   • 3-tier loading    │    │  │  • Outcome recording       │          │
│  │   • CRUD + version    │    │  │  • Q-values (TD(λ))       │          │
│  │   • Lifecycle mgmt    │    │  │  • Rolling statistics      │          │
│  │   • SQLite backend    │    │  │  • SQLite backend          │          │
│  └────────────┬──────────┘    │  └───────────────┬────────────┘          │
│               │               │                  │                        │
│  ┌────────────▼───────────────▼──────────────────▼──────────────┐        │
│  │                  Progressive Loader                           │        │
│  │                  ──────────────────                           │        │
│  │   • Tier-by-tier retrieval                                    │        │
│  │   • Q-value routing / relevance / success rate               │        │
│  │   • Sticky skills (recently used get priority)               │        │
│  └──────────────────────────────┬────────────────────────────────┘        │
│                                  │                                        │
│  ┌───────────────────────────────▼────────────────────────────────┐      │
│  │                    Evolution Loop                               │      │
│  │                    ──────────────                               │      │
│  │   • Health assessment (healthy/warning/critical)                │      │
│  │   • Triggers diagnosis → patch → version bump                   │      │
│  │   • Prunes dead skills (low-Q, low-usage, stale)               │      │
│  └───────┬────────────────────────────────────────┬───────────────┘      │
│          │                                        │                       │
│  ┌───────▼──────────┐              ┌──────────────▼──────────────┐       │
│  │  Self-Diagnosis  │              │  Skill Dependency Graph     │       │
│  │  ─────────────── │              │  ──────────────────────    │       │
│  │  • Failure       │              │  • DAG of dependencies      │       │
│  │    pattern       │              │  • Topological sort         │       │
│  │    analysis      │              │  • Impact analysis          │       │
│  │  • Rule-based +  │              │  • Q-value propagation      │       │
│  │    LLM insights  │              │  • In-memory graph          │       │
│  │  • Auto-patching │              │                             │       │
│  └──────────────────┘              └─────────────────────────────┘       │
│                                                                           │
├───────────────────────────────────────────────────────────────────────────┤
│  Intelligence Layer                                                       │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  Conflict Detector  │  Health Monitor  │  Skill Creator            │  │
│  │  Skill Analyzer     │  Skill Optimizer │  Elastic Memory           │  │
│  │  Alert Manager      │  Skill Generator │  Enhanced RL Optimizer    │  │
│  │  Performance Predictor │  Skill Transfer Engine                    │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                           │
├───────────────────────────────────────────────────────────────────────────┤
│  Platform Layer                                                           │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  REST API (port 8742)  │  MCP Server  │  CLI  │  Web Dashboard    │  │
│  │  Skill Marketplace     │  Observability (Tracing/Metrics/Logging) │  │
│  │  Async Support         │  Versioning  │  A/B Testing              │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                           │
├───────────────────────────────────────────────────────────────────────────┤
│  Production Layer                                                         │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │  Circuit Breaker │ Retry Policy │ Bulkhead │ Graceful Degradation │  │
│  │  Resilient Executor │ TTL/LRU Caching │ DB Abstraction            │  │
│  │  Integration Adapters (GitHub/Slack/Webhook/File)                  │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                           │
├───────────────────────────────────────────────────────────────────────────┤
│  Integrations                                                             │
│  ┌──────────────────────────────────────────┐                            │
│  │  Hermes SkillForge Adapter               │                            │
│  │  • Import SKILL.md → SkillForge          │                            │
│  │  • Export SkillForge → SKILL.md          │                            │
│  │  • Bidirectional sync                    │                            │
│  └──────────────────────────────────────────┘                            │
│  ┌──────────────────────────────────────────┐                            │
│  │  Benchmark Runner                        │                            │
│  │  • A/B comparison (vanilla vs +SF)       │                            │
│  │  • Correctness, cost, latency metrics    │                            │
│  │  • Statistical significance testing      │                            │
│  └──────────────────────────────────────────┘                            │
└───────────────────────────────────────────────────────────────────────────┘

Core Components

Skill Registry

The central store for all skills, backed by SQLite with WAL mode for concurrent access.

from skillforge.core.registry import SkillRegistry, SkillLifecycle

registry = SkillRegistry(db_path="./my_skills.db")

# Register
skill = registry.register_skill(
    name="summarizer",
    tier1_metadata="Summarize long documents into key points",
    tier2_core="You are an expert summarizer. Given the following document...",
    tier3_resources=["examples/summary_format.md"],
    tags=["nlp", "summarization"],
)

# Progressive retrieval — control detail level
s1 = registry.get_skill(skill.id, tier=1)   # metadata only
s2 = registry.get_skill(skill.id, tier=2)   # + core prompt
s3 = registry.get_skill(skill.id, tier=3)   # + resources

# Search and filter
results = registry.search_skills("summarize")
active = registry.list_skills(lifecycle=SkillLifecycle.ACTIVE, tags=["nlp"])

# Update and version
registry.update_skill(skill.id, {"tier2_core": "Updated prompt..."})
registry.version_skill(skill.id)  # bumps version number

registry.close()

Effectiveness Tracker

Records execution outcomes and maintains Q-values using TD(λ) temporal-difference learning.

from skillforge.core.tracker import QValueTracker, Outcome

tracker = QValueTracker(db_path="./tracker.db")

# Record outcomes
tracker.record_outcome(Outcome(
    skill_id="summarizer",
    success=True,
    latency_ms=250.0,
    tokens_used=800,
    user_feedback=4.0,
))

# Get rolling statistics
stats = tracker.get_stats("summarizer")
# → e.g. {'q_value': 0.72, 'success_rate': 0.85, 'avg_latency_ms': 250.0, ...} (illustrative values)

# TD(λ) update — called automatically or manually
new_q = tracker.td_lambda_update("summarizer", reward=0.85)

tracker.close()
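
For intuition about how a TD(λ)-style update can credit recently used skills, here is a toy sketch with accumulating eligibility traces. It assumes single-step episodes (no bootstrapped next-state value), and the class name `TraceQTable` and the defaults for `alpha`, `gamma`, `lam`, and `initial_q` are made up for this example; the real `QValueTracker` may work differently.

```python
from collections import defaultdict


class TraceQTable:
    """Toy tracker: one Q-value and one eligibility trace per skill."""

    def __init__(self, alpha=0.1, gamma=0.9, lam=0.8, initial_q=0.5):
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.q = defaultdict(lambda: initial_q)
        self.trace = defaultdict(float)

    def update(self, skill_id: str, reward: float) -> float:
        # Decay every existing trace, then mark the skill that just ran.
        for key in self.trace:
            self.trace[key] *= self.gamma * self.lam
        self.trace[skill_id] += 1.0
        # Error between the observed reward and the current estimate.
        delta = reward - self.q[skill_id]
        # Spread the error over recently used skills in proportion to their trace.
        for key, eligibility in self.trace.items():
            self.q[key] += self.alpha * delta * eligibility
        return self.q[skill_id]


table = TraceQTable()
print(round(table.update("summarizer", reward=0.85), 3))
```

With `lam=0` the traces vanish after one step and the update reduces to a plain running estimate per skill; larger values let a failure also pull down skills that ran shortly before it.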

Self-Diagnosis Engine

Analyzes failure patterns and generates patch suggestions.

from skillforge.core.diagnosis import SelfDiagnosisEngine

# Rule-based analysis (no LLM needed)
diagnosis = SelfDiagnosisEngine(registry=registry, tracker=tracker)

insights = diagnosis.analyze_failures("summarizer", window=10)
for insight in insights:
    print(f"Issue: {insight.failure_type}")
    print(f"Patch: {insight.patch_suggestion}")
    print(f"Confidence: {insight.confidence:.0%}")

# With LLM-assisted analysis
# call_my_model is a placeholder for your own model client
def my_llm(prompt: str) -> str:
    return call_my_model(prompt)

diagnosis_llm = SelfDiagnosisEngine(
    registry=registry,
    tracker=tracker,
    llm_fn=my_llm,
)

# Auto-patch (only when the analysis produced at least one insight)
if insights:
    result = diagnosis.auto_patch_skill("summarizer", insights[0])
    print(f"Patched: {result['success']}, new version: {result.get('new_version')}")

Skill Dependency Graph

DAG for modeling relationships between skills.

from skillforge.core.graph import SkillDependencyGraph

graph = SkillDependencyGraph()

graph.add_skill("http-client")
graph.add_skill("api-caller")
graph.add_skill("data-pipeline")

# api-caller depends on http-client (weight 0.8)
graph.add_dependency("api-caller", "http-client", weight=0.8)
graph.add_dependency("data-pipeline", "api-caller", weight=0.5)

# Topological sort
order = graph.topological_sort()
# → ["http-client", "api-caller", "data-pipeline"]

# Impact analysis — what breaks if http-client degrades?
impact = graph.downstream_impact("http-client")
# → ["api-caller", "data-pipeline"]

# Q-value propagation
propagated = graph.propagate_q_update("http-client", delta=-0.2, gamma=0.8)
# → {"api-caller": -0.16, "data-pipeline": -0.128}
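
The numbers in the example above are consistent with a per-hop discount: each level of dependents receives the previous delta multiplied by `gamma`. The following standalone sketch reproduces that behaviour with the stdlib `graphlib` module. It ignores edge weights, and the helper names `topological_order` and `propagate` are assumptions for this illustration rather than the internals of `SkillDependencyGraph`.

```python
from collections import deque
from graphlib import TopologicalSorter

# skill -> set of skills it depends on (same chain as the example above)
depends_on = {
    "http-client": set(),
    "api-caller": {"http-client"},
    "data-pipeline": {"api-caller"},
}


def topological_order(deps):
    """Dependencies come before the skills that need them."""
    return list(TopologicalSorter(deps).static_order())


def propagate(deps, source, delta, gamma):
    """Breadth-first walk over dependents, discounting delta by gamma per hop."""
    dependents = {name: set() for name in deps}
    for skill, requirements in deps.items():
        for requirement in requirements:
            dependents[requirement].add(skill)
    updates, queue = {}, deque([(source, delta)])
    while queue:
        node, value = queue.popleft()
        for child in sorted(dependents[node]):
            if child not in updates:  # first (shortest) path wins
                updates[child] = value * gamma
                queue.append((child, updates[child]))
    return updates


print(topological_order(depends_on))
print({k: round(v, 3) for k, v in propagate(depends_on, "http-client", -0.2, 0.8).items()})
```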

Progressive Loader

Loads skills at the right detail level with configurable routing strategies.

from skillforge.core.loader import ProgressiveLoader

loader = ProgressiveLoader(registry, tracker)

# Load by query — tier 1 (fast, metadata-only)
quick = loader.load_skill("email management", tier=1, limit=5)

# Load by query — tier 2 (core prompt for execution)
ready = loader.load_skill("email management", tier=2, routing="q_value")

# Different routing strategies
by_quality  = loader.load_skill("code", routing="q_value")
by_success  = loader.load_skill("code", routing="success_rate")
by_usage    = loader.load_skill("code", routing="usage_count")
by_relevance = loader.load_skill("code", routing="relevance")

# Sticky skills — recently and frequently used
sticky = loader.get_sticky_skills(limit=5)

Evolution Loop

Continuous lifecycle management for all skills.

from skillforge.core.evolution import EvolutionLoop

evolution = EvolutionLoop(
    registry=registry,
    tracker=tracker,
    graph=graph,
    diagnosis=diagnosis,
)

# Full evolution cycle
report = evolution.run_evolution_loop()

print(f"Evaluated: {report.total_skills_evaluated}")
print(f"Healthy:   {report.skills_healthy}")
print(f"Warning:   {report.skills_warning}")
print(f"Critical:  {report.skills_critical}")
print(f"Evolved:   {report.skills_evolved}")
print(f"Pruned:    {report.skills_pruned}")

# Custom thresholds
report = evolution.run_evolution_loop(thresholds={
    "q_warning": 0.6,
    "q_critical": 0.35,
    "prune_max_age_days": 60,
})
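
One plausible reading of these thresholds is sketched below: a skill is critical below `q_critical`, in warning below `q_warning`, and healthy otherwise, while pruning requires low Q, low usage, and staleness together. The strict comparisons and the `min_usage` parameter are assumptions for this example; the source names only `q_warning`, `q_critical`, and `prune_max_age_days`.

```python
from datetime import datetime, timedelta, timezone


def classify(q_value, q_warning=0.6, q_critical=0.35):
    """Map a Q-value onto the three health states."""
    if q_value < q_critical:
        return "critical"
    if q_value < q_warning:
        return "warning"
    return "healthy"


def should_prune(q_value, usage_count, last_used, now, *,
                 q_critical=0.35, min_usage=3, prune_max_age_days=60):
    """Prune only when a skill is low-Q, rarely used, and stale at the same time."""
    stale = now - last_used > timedelta(days=prune_max_age_days)
    return q_value < q_critical and usage_count < min_usage and stale


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(classify(0.5), should_prune(0.2, 1, now - timedelta(days=90), now))
```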

Intelligence Layer

Elastic Memory

Adaptive memory store with consolidation and auto-compact.

from skillforge.advanced.elastic_memory import ElasticMemory, MemoryEntry

memory = ElasticMemory(db_path="./memory.db")

# Remember a skill execution
entry = memory.remember(
    skill_id="code-reviewer",
    content="Found critical bug in auth module",
    importance=0.9,
    metadata={"severity": "high", "module": "auth"},
)

# Recall relevant memories
results = memory.recall("auth bug critical")
for result in results:
    print(f"[{result.importance:.2f}] {result.content}")

# Consolidate similar memories (Jaccard > 0.7)
memory.consolidate()

# Auto-compact: evict entries with the lowest composite retention score
# (importance 0.45, recency 0.30, access frequency 0.25)
memory.auto_compact(max_memories=1000)

memory.close()

Alert Manager

Threshold, trend, and anomaly detection alerts.

from skillforge.intelligence.alert_manager import AlertManager, AlertRule, AlertRuleType

# health_monitor is a HealthMonitor instance (see the Health Monitor section)
alerts = AlertManager(registry, tracker, health_monitor)

# Add alert rules
alerts.add_rule(AlertRule(
    name="critical-skill-degradation",
    rule_type=AlertRuleType.THRESHOLD,
    metric="health_score",
    threshold=0.4,
    operator="less_than",
    cooldown_seconds=3600,
))

# Check alerts (evaluates all rules)
fired = alerts.check_alerts()
for alert in fired:
    print(f"🚨 {alert.severity}: {alert.message}")

# Get active alerts
active = alerts.get_active_alerts()

# Acknowledge and resolve each active alert
for alert in active:
    alerts.acknowledge_alert(alert.id)
    alerts.resolve_alert(alert.id, resolution="Fixed by updating prompt")

Skill Generator

Zero-shot skill generation from natural language.

from skillforge.advanced.skill_generator import SkillGenerator, GenerationRequest

generator = SkillGenerator(registry)

# Generate a skill from description
skill = generator.generate(GenerationRequest(
    description="A skill for reviewing pull requests on GitHub, checking for bugs, style issues, and performance problems",
    domain="code-review",
    complexity="medium",
    include_examples=True,
))

print(f"Generated: {skill.name}")
print(f"Tags: {skill.tags}")
print(f"Confidence: {skill.confidence:.2f}")

# Generate and register in one step
skill_id = generator.generate_and_register(
    description="Summarize long documents into bullet points",
    domain="nlp",
)

# Batch generation
skills = generator.generate_batch([
    GenerationRequest(description="Debug Python code", domain="debugging"),
    GenerationRequest(description="Write unit tests", domain="testing"),
])

Enhanced RL Optimizer

Contextual bandits, replay buffer, reward model, curriculum scheduling.

from skillforge.advanced.rl_optimizer import RLOptimizer

rl = RLOptimizer(registry, tracker, evolution)

# Contextual bandit: recommend next action
action = rl.recommend_action("code-reviewer")
# → "evolve" | "patch" | "prune" | "keep"

# Predict reward for an action
reward = rl.predict_reward("code-reviewer", action="evolve")

# Curriculum optimization (difficulty-based ordering)
rl.curriculum_optimize(skill_ids=["skill-1", "skill-2", "skill-3"])

# Batch optimize all underperforming skills
rl.batch_optimize(threshold=0.5)

# Get diagnostics
diag = rl.get_diagnostics()
print(f"Bandit epsilon: {diag['epsilon']:.3f}")
print(f"Replay buffer size: {diag['buffer_size']}")
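
To make "epsilon-greedy with decay" concrete, here is a deliberately simplified bandit. It ignores context entirely (so it is not a contextual bandit), and the class name `EpsilonGreedy` plus the `decay` and `min_epsilon` defaults are assumptions for this sketch, not the `RLOptimizer` internals.

```python
import random


class EpsilonGreedy:
    """Non-contextual bandit: explore with probability epsilon, else exploit."""

    def __init__(self, actions, epsilon=1.0, decay=0.99, min_epsilon=0.05, seed=None):
        self.actions = list(actions)
        self.epsilon, self.decay, self.min_epsilon = epsilon, decay, min_epsilon
        self.rng = random.Random(seed)
        self.value = {a: 0.0 for a in self.actions}
        self.count = {a: 0 for a in self.actions}

    def select(self):
        if self.rng.random() < self.epsilon:
            action = self.rng.choice(self.actions)  # explore
        else:
            action = max(self.actions, key=self.value.__getitem__)  # exploit
        # Shrink epsilon after every decision, but never below the floor.
        self.epsilon = max(self.min_epsilon, self.epsilon * self.decay)
        return action

    def update(self, action, reward):
        # Incremental mean of the rewards observed for this action.
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]


bandit = EpsilonGreedy(["evolve", "patch", "prune", "keep"], seed=0)
chosen = bandit.select()
bandit.update(chosen, reward=1.0)
print(chosen, round(bandit.epsilon, 3))
```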

Performance Predictor

OLS linear regression for performance trend prediction.

from skillforge.advanced.predictor import SkillPredictor

predictor = SkillPredictor(registry, tracker, graph)

# Predict performance trend
trend = predictor.predict_performance("code-reviewer")
print(f"Predicted Q in 7 days: {trend.predicted_q:.2f}")
print(f"Confidence: {trend.confidence:.2f}")
print(f"Trend: {trend.direction}")  # "improving" | "stable" | "declining"

# Get recommendations for a task
recommendations = predictor.predict_for_task("review github pr")
for rec in recommendations:
    print(f"  {rec.skill_id}: Q={rec.predicted_q:.2f} (σ²={rec.variance:.3f})")

Skill Transfer Engine

Cross-agent skill export/import.

from skillforge.advanced.transfer import SkillTransferEngine

transfer = SkillTransferEngine(registry)

# Export to Hermes format
hermes_path = transfer.export_to_hermes(
    skill_id="code-reviewer",
    output_dir="~/.hermes/skills/productivity/",
)

# Export to OpenClaw format
openclaw_path = transfer.export_to_openclaw(
    skill_id="code-reviewer",
    output_dir="./openclaw-skills/",
)

# Export to LangChain format
langchain_path = transfer.export_to_langchain(
    skill_id="code-reviewer",
    output_dir="./langchain-skills/",
)

# Import from another agent
imported = transfer.import_skill(
    format="hermes",
    path="~/.hermes/skills/productivity/code-reviewer/SKILL.md",
)

Advanced Intelligence

Conflict Detector

Detects skill overlaps, contradictions, and dependency cycles.

from skillforge.intelligence.conflict_detector import ConflictDetector

detector = ConflictDetector(registry, tracker)

# Detect overlapping skills (Jaccard similarity > 0.6)
overlaps = detector.detect_overlaps()
for overlap in overlaps:
    print(f"Overlap: {overlap.skill_a} ↔ {overlap.skill_b} ({overlap.similarity:.2f})")

# Detect contradictory skills
contradictions = detector.detect_contradictions()

# Detect dependency cycles
cycles = detector.detect_cycles()
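
Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch of overlap detection over skill descriptions follows, assuming simple lowercase word tokens; the tokenizer and the `find_overlaps` helper are illustrative choices, and `ConflictDetector` may compare different fields.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_overlaps(skills: dict[str, str], threshold: float = 0.6):
    """Return (skill_a, skill_b, similarity) for every pair above the threshold."""
    token_sets = {name: tokens(text) for name, text in skills.items()}
    overlaps = []
    for a, b in combinations(sorted(token_sets), 2):
        similarity = jaccard(token_sets[a], token_sets[b])
        if similarity > threshold:
            overlaps.append((a, b, similarity))
    return overlaps


skills = {
    "code-reviewer": "review code for bugs and style issues",
    "pr-reviewer": "review code for bugs and style problems",
    "summarizer": "summarize long documents into key points",
}
print(find_overlaps(skills))
```

In this sample the two reviewer descriptions share six of eight distinct words, giving a similarity of 0.75, which is above the 0.6 threshold.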

Health Monitor

4-weight health scoring with exponential decay.

from skillforge.intelligence.health_monitor import HealthMonitor

health = HealthMonitor(registry, tracker, graph)

# Check all skills
results = health.check_all()
for result in results:
    print(f"{result.skill_id}: {result.status} (score={result.score:.2f})")

# Health formula:
# 0.35 × Q_value
# + 0.30 × Success_Rate
# + 0.20 × Recency_Decay (half-life 14 days)
# + 0.15 × Usage_Frequency (log-scaled, saturates at ~100 uses)
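
The formula above can be written out directly. In this sketch the recency term is assumed to be `0.5 ** (days / 14)` and the usage term `log1p(n) / log1p(100)` capped at 1; both are one reasonable interpretation of "half-life 14 days" and "log-scaled, saturates at ~100 uses", not necessarily the exact expressions `HealthMonitor` uses.

```python
import math


def health_score(q_value, success_rate, days_since_use, usage_count,
                 half_life_days=14.0, saturation_uses=100):
    """Weighted health score in [0, 1] for inputs that are themselves in [0, 1]."""
    recency = 0.5 ** (days_since_use / half_life_days)  # halves every 14 days
    usage = min(1.0, math.log1p(usage_count) / math.log1p(saturation_uses))
    return 0.35 * q_value + 0.30 * success_rate + 0.20 * recency + 0.15 * usage


# A skill last used two weeks ago with no recorded usage count.
print(round(health_score(q_value=0.8, success_rate=0.9, days_since_use=14, usage_count=0), 3))
```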

Platform Layer

REST API Server

# Start the API server
skillforge serve --port 8742

# Endpoints (a selection; the server exposes 13 in total):
# GET  /api/skills              → List all skills
# GET  /api/skills/:id          → Skill detail
# POST /api/skills              → Create skill
# POST /api/skills/:id/outcomes → Record outcome
# POST /api/evolution/run       → Trigger evolution
# GET  /api/health              → Health dashboard
# GET  /api/metrics             → Aggregated metrics
# GET  /api/graph               → Dependency graph
# GET  /api/evolution/history   → Evolution timeline

Python API Client

from skillforge.api.client import SkillForgeClient

client = SkillForgeClient(base_url="http://localhost:8742")

# List skills
skills = client.list_skills()

# Record outcome
client.record_outcome("skill-id", success=True, latency_ms=300, tokens_used=1000)

# Trigger evolution
client.run_evolution()

# Get health dashboard
health = client.get_health()

CLI Tool

# List all skills
skillforge skills list

# Search skills
skillforge skills search "code review"

# Check health
skillforge health

# Run evolution
skillforge evolve

# Start API server
skillforge serve --port 8742

# Open dashboard
skillforge dashboard

# Import from Hermes
skillforge import --source ~/.hermes/skills

# Export to Hermes
skillforge export --skill code-reviewer --target ~/.hermes/skills/

MCP Server

# Start MCP server (stdin/stdout JSON-RPC)
skillforge-mcp

# 8 tools available:
# - list_skills
# - get_skill
# - search_skills
# - record_outcome
# - run_evolution
# - get_health
# - get_graph
# - get_metrics

A/B Testing

from skillforge.advanced.ab_testing import ABTestRunner, ExperimentConfig, Variant

runner = ABTestRunner(db_path="./ab_tests.db")

# Create experiment
config = ExperimentConfig(
    name="prompt-optimization",
    variants=[
        Variant(id="control", skill_id="code-reviewer-v1", weight=0.5),
        Variant(id="treatment", skill_id="code-reviewer-v2", weight=0.5),
    ],
    significance_level=0.05,
    min_sample_size=50,
)

exp_id = runner.create_experiment(config)
runner.start_experiment(exp_id)

# Assign users to variants (4 strategies: RANDOM, ROUND_ROBIN, WEIGHTED_RANDOM, HASH_BASED)
variant = runner.assign_variant(exp_id, user_id="user-123")

# Record outcomes
runner.record_outcome(exp_id, variant_id="control", success=True, latency_ms=300)
runner.record_outcome(exp_id, variant_id="treatment", success=True, latency_ms=250)

# Evaluate statistical significance
result = runner.evaluate(exp_id)
print(f"Significant: {result.is_significant}")
print(f"p-value: {result.p_value:.4f}")
print(f"Winner: {result.winner}")
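
For the two-variant case, the significance check is a two-proportion z-test. The sketch below pairs it with the Abramowitz and Stegun erf approximation (formula 7.1.26) for the normal CDF. The function names are chosen for this example and the pooled-variance form is an assumption about how `evaluate` computes its p-value.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the Abramowitz and Stegun 7.1.26 erf approximation."""
    x = abs(z) / math.sqrt(2.0)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
               + t * (-1.453152027 + t * 1.061405429))))
    erf = 1.0 - poly * math.exp(-x * x)
    return 0.5 * (1.0 + erf) if z >= 0 else 0.5 * (1.0 - erf)


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:  # all successes or all failures: no evidence of a difference
        return 0.0, 1.0
    z = (p_b - p_a) / se
    return z, 2.0 * (1.0 - normal_cdf(abs(z)))


z, p = two_proportion_z_test(success_a=30, n_a=100, success_b=50, n_b=100)
print(f"z={z:.3f}, p={p:.4f}, significant={p < 0.05}")
```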

Skill Marketplace

from skillforge.marketplace.registry import MarketplaceRegistry
from skillforge.marketplace.publisher import SkillPublisher
from skillforge.marketplace.installer import SkillInstaller

marketplace = MarketplaceRegistry(db_path="./marketplace.db")
publisher = SkillPublisher(marketplace, registry)
installer = SkillInstaller(marketplace, registry)

# Publish a skill
publisher.publish(
    skill_id="code-reviewer",
    version="1.2.0",
    description="Expert code review for Python projects",
    tags=["code", "review", "python"],
    visibility="public",
)

# Search marketplace
results = marketplace.search("code review", tags=["python"])

# Install a skill
installed = installer.install(
    skill_id="code-reviewer",
    version="1.2.0",
    target_dir="~/.hermes/skills/productivity/",
)

# Rate a skill
marketplace.rate("code-reviewer", rating=5, review="Excellent skill!")

Observability

from skillforge.observability.tracer import SkillTracer
from skillforge.observability.metrics import MetricsCollector
from skillforge.observability.logger import StructuredLogger

# Tracing
tracer = SkillTracer(db_path="./traces.db")
with tracer.start_span("skill-execution", skill_id="code-reviewer") as span:
    # ... execute skill ...
    span.set_attribute("tokens_used", 1200)
    span.set_status("ok")

# Metrics
metrics = MetricsCollector(db_path="./metrics.db")
metrics.counter("skill_loads", labels={"skill": "code-reviewer"}).inc()
metrics.gauge("q_value", labels={"skill": "code-reviewer"}).set(0.85)
metrics.timing("execution_time", labels={"skill": "code-reviewer"}).observe(340.0)

# Get summary
summary = metrics.summarize("execution_time")
print(f"p50: {summary['p50']:.1f}ms, p95: {summary['p95']:.1f}ms")

# Structured logging
logger = StructuredLogger(db_path="./logs.db")
logger.info("Skill loaded", component="loader", skill="code-reviewer")
logger.warning("Low Q-value", component="evolution", skill="old-skill", q_value=0.3)
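
A timing summary like the one returned above can be computed from raw observations in a few lines. This sketch uses the nearest-rank percentile definition, which is an assumption; `MetricsCollector` may interpolate instead, and the `summarize` helper here is a standalone illustration with the same key names.

```python
def percentile(sorted_values, p: int):
    """Nearest-rank percentile for an integer p in 1..100 over a sorted list."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # ceil(p * n / 100)
    return sorted_values[rank - 1]


def summarize(values):
    """Return min/max/mean and selected percentiles for a non-empty sample."""
    data = sorted(values)
    return {
        "min": data[0],
        "max": data[-1],
        "mean": sum(data) / len(data),
        "p50": percentile(data, 50),
        "p95": percentile(data, 95),
        "p99": percentile(data, 99),
    }


print(summarize([340.0, 120.0, 95.0, 410.0, 220.0]))
```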

Versioning

from skillforge.versioning.version_manager import VersionManager, SemanticVersion

vm = VersionManager(db_path="./versions.db")

# Save a version snapshot
vm.save_version("code-reviewer", version="1.2.0", data={
    "tier1_metadata": "Review code for bugs...",
    "tier2_core": "You are an expert code reviewer...",
})

# List version history
history = vm.list_versions("code-reviewer")
for v in history:
    print(f"  v{v.version} — {v.created_at}")

# Diff between versions
diff = vm.diff("code-reviewer", "1.1.0", "1.2.0")
print(f"Added: {diff['added']}")
print(f"Modified: {diff['modified']}")
print(f"Removed: {diff['removed']}")

# Rollback to a previous version
vm.rollback("code-reviewer", version="1.1.0")

Production Layer

Circuit Breaker

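# CircuitOpenError is assumed to be exported by the same module as CircuitBreaker
from skillforge.core.resilience import CircuitBreaker, CircuitOpenError, CircuitState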

cb = CircuitBreaker(
    name="llm-api",
    failure_threshold=5,        # Open after 5 consecutive failures
    recovery_timeout=30.0,      # Wait 30s before half-open
    half_open_max=2,            # Allow 2 probe calls
    excluded_exceptions=(ValueError,),  # Don't count validation errors
)

# Use the circuit breaker
try:
    result = cb.call(lambda: call_llm_api(prompt))
except CircuitOpenError as e:
    print(f"Circuit open, retry in {e.remaining_seconds:.1f}s")

# Check state
stats = cb.get_stats()
print(f"State: {stats.state}")  # CLOSED / OPEN / HALF_OPEN
print(f"Failures: {stats.consecutive_failures}")
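
The CLOSED / OPEN / HALF_OPEN states follow the usual circuit breaker pattern. The sketch below shows one way the transitions could work. `MiniBreaker` is a hypothetical, simplified class and not SkillForge's implementation: it omits `half_open_max` and `excluded_exceptions`, raises a plain `RuntimeError` when open, and takes an injectable clock so the example runs without sleeping.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class MiniBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open")  # fail fast without calling fn
            self.state = State.HALF_OPEN  # timeout elapsed: let one probe through
        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the circuit
            if self.state is State.HALF_OPEN or self.consecutive_failures >= self.failure_threshold:
                self.state = State.OPEN
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count and closes the circuit
        self.consecutive_failures = 0
        self.state = State.CLOSED
        return result


def failing():
    raise ConnectionError("upstream down")


now = [0.0]  # fake clock, in seconds
breaker = MiniBreaker(failure_threshold=2, recovery_timeout=30.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
print(breaker.state)  # State.OPEN

now[0] = 31.0  # recovery timeout has elapsed
print(breaker.call(lambda: "ok"))  # ok
print(breaker.state)  # State.CLOSED
```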

Retry Policy

from skillforge.core.resilience import RetryPolicy

rp = RetryPolicy(
    max_attempts=3,
    base_delay=1.0,
    max_delay=60.0,
    backoff_factor=2.0,
    jitter=True,
    retriable_exceptions=(ConnectionError, TimeoutError),
)

# Execute with retries
result = rp.execute(lambda: fetch_remote_data(url))
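
To make the parameters concrete, the sketch below computes the wait between attempts for an exponential backoff schedule capped at `max_delay`. The `backoff_delays` helper is hypothetical, and the "full jitter" strategy (a uniform draw between 0 and the computed delay) is an assumption, since the jitter strategy RetryPolicy uses is not stated here.

```python
import random


def backoff_delays(max_attempts=3, base_delay=1.0, max_delay=60.0,
                   backoff_factor=2.0, jitter=False, rng=None):
    """Return the delay (seconds) slept before each retry."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts - 1):  # no sleep after the final attempt
        delay = min(max_delay, base_delay * backoff_factor ** attempt)
        if jitter:
            delay = rng.uniform(0, delay)  # assumed "full jitter" strategy
        delays.append(delay)
    return delays


print(backoff_delays(max_attempts=5))
# [1.0, 2.0, 4.0, 8.0]
print(backoff_delays(max_attempts=8))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

With the settings from the snippet above (`max_attempts=3`, `base_delay=1.0`, `backoff_factor=2.0`), this schedule would wait at most 1s and then 2s before giving up.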

Resilient Executor

Composable wrapper combining circuit breaker, retry, and bulkhead.

from skillforge.core.resilience import ResilientExecutor, CircuitBreaker, RetryPolicy, Bulkhead

executor = ResilientExecutor(
    name="llm-call",
    circuit_breaker=CircuitBreaker("llm", failure_threshold=5),
    retry_policy=RetryPolicy(max_attempts=3, base_delay=1.0),
    bulkhead=Bulkhead("llm-pool", max_concurrent=10),
)

# Execute through all resilience layers
result = executor.execute(lambda: call_llm(prompt))

# Get aggregated stats
stats = executor.get_stats()
print(f"Total calls: {stats.total_attempts}")
print(f"Successes: {stats.total_successes}")
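
The bulkhead is the least familiar of the three layers: it caps how many calls may run at once so one slow dependency cannot exhaust the whole process. A minimal sketch using a semaphore is shown below. `MiniBulkhead` is hypothetical, and rejecting immediately when full (rather than queueing) is an assumption about the behavior.

```python
import threading


class MiniBulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def execute(self, fn):
        # Non-blocking acquire: reject right away when every slot is taken
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn()
        finally:
            self._slots.release()  # always free the slot, even on error


bulkhead = MiniBulkhead(max_concurrent=1)


def nested():
    # Runs while the only slot is held, so the inner call is rejected
    try:
        return bulkhead.execute(lambda: "inner ran")
    except RuntimeError as exc:
        return f"rejected: {exc}"


print(bulkhead.execute(nested))        # rejected: bulkhead full
print(bulkhead.execute(lambda: "ok"))  # ok (slot was released)
```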

Caching

from skillforge.core.cache import TTLCache, LRUCache, CachedStore, cached

# TTL Cache
cache = TTLCache(ttl_seconds=300, max_size=1024)
cache.put("skill-stats:code-reviewer", {"q_value": 0.85, "usage": 42})
stats = cache.get("skill-stats:code-reviewer")

# LRU Cache
lru = LRUCache(max_size=256)
lru.put("key", "value")

# CachedStore — wraps a data source with TTL
store = CachedStore(
    fetch_fn=lambda key: expensive_query(key),
    ttl_seconds=60,
)
result = store.get("query-key")  # cached after first call

# @cached decorator
# (forge is an existing SkillForge instance, e.g. forge = SkillForge())
@cached(ttl_seconds=120)
def get_skill_stats(skill_id: str) -> dict:
    return forge.get_skill_stats(skill_id)

# Cache stats
print(f"Hit rate: {cache.stats.hit_rate:.1%}")
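
For readers unfamiliar with TTL caching, the sketch below shows the core idea: each entry stores an expiry time, expired entries count as misses, and the hit rate is hits divided by total lookups. `MiniTTLCache` is a hypothetical illustration, not SkillForge's TTLCache; it has no size limit and uses an injectable clock so expiry can be shown without waiting.

```python
import time


class MiniTTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is not None and self.clock() < entry[0]:
            self.hits += 1
            return entry[1]
        self._data.pop(key, None)  # drop the expired entry, if any
        self.misses += 1
        return default

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


now = [0.0]  # fake clock, in seconds
cache = MiniTTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.put("skill-stats:code-reviewer", {"q_value": 0.85, "usage": 42})

print(cache.get("skill-stats:code-reviewer"))  # {'q_value': 0.85, 'usage': 42}
now[0] = 301.0                                 # entry has expired
print(cache.get("skill-stats:code-reviewer"))  # None
print(f"Hit rate: {cache.hit_rate:.1%}")       # Hit rate: 50.0%
```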

DB Abstraction

from skillforge.core.db import create_backend, SQLiteBackend, MemoryBackend

# SQLite backend (production)
db = create_backend("sqlite", path="./skillforge.db")

# Memory backend (testing)
db = create_backend("memory")

# Use the connection
with db.connect() as conn:
    result = conn.execute("SELECT * FROM skills WHERE q_value > ?", (0.7,))
    for row in result:
        print(row)
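
The same parameterized-query pattern can be tried with nothing but the standard library. The sketch below uses `sqlite3` with an in-memory database; the two-column `skills` schema and the sample rows are assumptions for illustration and are not SkillForge's actual schema.

```python
import sqlite3

# In-memory database: nothing is written to disk, useful for tests
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skills (id TEXT PRIMARY KEY, q_value REAL)")
conn.executemany(
    "INSERT INTO skills VALUES (?, ?)",
    [("code-reviewer", 0.85), ("old-skill", 0.3)],
)

# The ? placeholder keeps the threshold out of the SQL string itself
rows = conn.execute(
    "SELECT id, q_value FROM skills WHERE q_value > ? ORDER BY id",
    (0.7,),
).fetchall()
print(rows)  # [('code-reviewer', 0.85)]
conn.close()
```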

Integration Adapters

from skillforge.integrations.adapters import (
    GitHubAdapter, SlackAdapter, WebhookAdapter, FileAdapter, AdapterRegistry
)

registry = AdapterRegistry()

# Register adapters
registry.register("github", GitHubAdapter(
    repo="user/skills-repo",
    token="ghp_xxx",
))

registry.register("slack", SlackAdapter(
    webhook_url="https://hooks.slack.com/services/xxx",
))

registry.register("webhook", WebhookAdapter(
    url="https://api.example.com/skills",
    secret="hmac-secret",  # HMAC-SHA256 signature
))

# Dispatch events to all adapters
registry.dispatch(event_type="skill_evolved", payload={
    "skill_id": "code-reviewer",
    "old_q": 0.45,
    "new_q": 0.72,
})

# Health check all adapters
health = registry.health_check()
for name, status in health.items():
    print(f"  {name}: {'✅' if status else '❌'}")
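
The webhook secret is used for an HMAC-SHA256 signature, which lets the receiver check that a payload was sent by a holder of the shared secret and was not altered in transit. A minimal sketch of signing and verifying with the standard library follows. The `sign_payload` and `verify_signature` helpers and the JSON serialization are hypothetical; how WebhookAdapter serializes the body or transmits the signature is not specified here.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, payload: dict) -> tuple[bytes, str]:
    """Serialize the payload and return (body, hex HMAC-SHA256 signature)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """Recompute the signature on the receiving side and compare in constant time."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


body, signature = sign_payload("hmac-secret", {"skill_id": "code-reviewer", "new_q": 0.72})
print(verify_signature("hmac-secret", body, signature))         # True
print(verify_signature("hmac-secret", body + b" ", signature))  # False (body was altered)
print(verify_signature("wrong-secret", body, signature))        # False
```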

Web Dashboard

React 18 + Vite + Tailwind CSS 4 + shadcn/ui dashboard with 5 pages.

Setup

cd dashboard
npm install
npm run dev    # Dev server at http://localhost:5173
npm run build  # Production build in dist/

Pages

Page	Features
Dashboard	KPI cards (Total Skills, Avg Q-Value, Success Rate, Outcomes), health pie chart, token usage trend, top 5 skills, recent evolution events
Skills	Searchable skill list, expandable rows (tier1/2/3), health badges, Q-value progress bars, per-skill evolution
Evolution	Timeline visualization, stat cards, global evolution trigger, event cards with success/failure
Graph	SVG dependency visualization, Q-value color-coded nodes, hover highlighting, click-to-select
Settings	API endpoint config, auto-evolution toggle, alert thresholds, import/export

Connect to API

# Terminal 1: Start SkillForge API server
python -m skillforge.api --port 8742

# Terminal 2: Start dashboard dev server
cd dashboard && npm run dev
# Dashboard fetches from localhost:8742 (Vite proxy)

The dashboard automatically falls back to mock data when the API server is offline.

Benchmarking

SkillForge ships with a benchmark runner for A/B comparison: vanilla agent vs. agent + SkillForge.

from skillforge.benchmark.runner import BenchmarkRunner
from skillforge.benchmark.tasks import Task, TaskSuite, TaskCategory

# Define a task suite
tasks = TaskSuite(
    name="coding-benchmark",
    tasks=[
        Task(
            id="task-001",
            description="Write a Python function to parse CSV files",
            category=TaskCategory.SKILL_INTENSIVE,
            expected_output="Working CSV parser",
            optimal_steps=3,
        ),
        Task(
            id="task-002",
            description="Debug a segmentation fault in C code",
            category=TaskCategory.REASONING,
            expected_output="Root cause identified",
            optimal_steps=5,
        ),
    ],
)

# Run the benchmark
# (forge is an existing SkillForge instance, e.g. forge = SkillForge())
runner = BenchmarkRunner(skillforge=forge)
results = runner.run_full_benchmark(tasks, n_retries=3)

# View summary
summary = results["summary"]
print(f"Correctness lift: {summary['avg_correctness_lift']:+.1%}")
print(f"Cost ratio:       {summary['avg_cost_ratio']:.2f}x")
print(f"Token ratio:      {summary['avg_token_ratio']:.2f}x")
print(f"Stat. significant: {summary['significant']} (p={summary['avg_p_value']:.4f})")
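
The exact definitions behind the summary fields are not spelled out here. One plausible reading, sketched below, is that the lift is the mean per-task difference in correctness (SkillForge minus vanilla) and each ratio is the mean per-task quotient (SkillForge divided by vanilla). The `summarize_pairs` helper, the field names in the input dicts, and the numbers are all assumptions for illustration, and the significance test is left out.

```python
from statistics import mean


def summarize_pairs(pairs):
    """pairs: list of (vanilla, with_skillforge) result dicts, one pair per task."""
    return {
        # Positive lift means the SkillForge arm was more often correct
        "avg_correctness_lift": mean(t["correctness"] - b["correctness"] for b, t in pairs),
        # Ratio below 1.0 means the SkillForge arm used fewer tokens
        "avg_token_ratio": mean(t["tokens"] / b["tokens"] for b, t in pairs),
    }


# Made-up results for two tasks
pairs = [
    ({"correctness": 0.5, "tokens": 1000}, {"correctness": 0.75, "tokens": 800}),
    ({"correctness": 1.0, "tokens": 2000}, {"correctness": 1.0, "tokens": 1000}),
]
demo = summarize_pairs(pairs)
print(f"Correctness lift: {demo['avg_correctness_lift']:+.1%}")  # +12.5%
print(f"Token ratio:      {demo['avg_token_ratio']:.2f}x")       # 0.65x
```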

Hermes Integration

SkillForge provides a bidirectional adapter for Hermes Agent skills.

Import Hermes Skills → SkillForge

from skillforge.forge import SkillForge
from skillforge.integrations.hermes.adapter import HermesSkillForgeAdapter

forge = SkillForge()
adapter = HermesSkillForgeAdapter(
    skillforge=forge._registry,
    hermes_skills_dir="~/.hermes/skills",
)

# Import all SKILL.md files
imported = adapter.import_hermes_skills()
for skill in imported:
    print(f"Imported: {skill.name} (v{skill.version}, Q={skill.q_value})")

Export SkillForge Skills → Hermes

# Export a SkillForge skill as a SKILL.md file
path = adapter.export_skill_to_hermes(
    skill_id="code-reviewer",
    category="productivity",
)
print(f"Written to: {path}")
# → ~/.hermes/skills/productivity/code-reviewer/SKILL.md

Bidirectional Sync

# Sync in both directions
summary = adapter.sync()
print(f"Imported: {summary['imported']}")
print(f"Exported: {summary['exported']}")
print(f"Errors:   {len(summary['errors'])}")

Project Structure

skillforge/
├── skillforge/
│   ├── __init__.py
│   ├── forge.py                  # Main SkillForge orchestrator
│   ├── core/
│   │   ├── registry.py           # SkillRegistry (SQLite)
│   │   ├── tracker.py            # EffectivenessTracker + QValueTracker
│   │   ├── loader.py             # ProgressiveLoader
│   │   ├── graph.py              # SkillDependencyGraph
│   │   ├── diagnosis.py          # SelfDiagnosisEngine
│   │   ├── evolution.py          # EvolutionLoop
│   │   ├── db.py                 # Database abstraction (SQLite/Memory)
│   │   ├── resilience.py         # CircuitBreaker, Retry, Bulkhead, ResilientExecutor
│   │   └── cache.py              # TTLCache, LRUCache, @cached
│   ├── intelligence/
│   │   ├── conflict_detector.py  # Overlap/contradiction/cycle detection
│   │   ├── health_monitor.py     # 4-weight health scoring
│   │   ├── skill_creator.py      # Auto-create from trajectories
│   │   ├── analyzer.py           # Clustering + recommendations
│   │   ├── optimizer.py          # Compression, splitting, merging
│   │   └── alert_manager.py      # Threshold/trend/anomaly alerts
│   ├── advanced/
│   │   ├── elastic_memory.py     # Adaptive memory store + consolidation
│   │   ├── skill_generator.py    # Zero-shot skill generation
│   │   ├── rl_optimizer.py       # Bandits + replay + curriculum
│   │   ├── predictor.py          # OLS performance prediction
│   │   ├── transfer.py           # Cross-agent skill export/import
│   │   ├── multi_agent.py        # Shared skill pool + access control
│   │   └── ab_testing.py         # A/B testing with statistical tests
│   ├── marketplace/
│   │   ├── registry.py           # Marketplace catalogue
│   │   ├── publisher.py          # Skill validation + packaging
│   │   └── installer.py          # Install + update tracking
│   ├── observability/
│   │   ├── tracer.py             # Distributed tracing (spans)
│   │   ├── metrics.py            # Counters, gauges, timings
│   │   └── logger.py             # Structured JSON logging
│   ├── async_support/
│   │   ├── async_registry.py     # Async SkillRegistry wrapper
│   │   ├── async_tracker.py      # Async QValueTracker wrapper
│   │   └── async_loader.py       # Async ProgressiveLoader wrapper
│   ├── versioning/
│   │   └── version_manager.py    # Semver + history + rollback + diff
│   ├── integrations/
│   │   ├── hermes/
│   │   │   └── adapter.py        # HermesSkillForgeAdapter
│   │   └── adapters.py           # GitHub/Slack/Webhook/File adapters
│   ├── api/
│   │   ├── server.py             # REST API server (stdlib http.server)
│   │   ├── client.py             # Python API client
│   │   └── cli.py                # CLI tool
│   ├── mcp/
│   │   └── server.py             # MCP server (stdin/stdout JSON-RPC)
│   └── benchmark/
│       ├── runner.py             # BenchmarkRunner
│       ├── tasks.py              # Task + TaskSuite definitions
│       ├── metrics.py            # MetricCollector
│       └── report.py             # Report generation
├── dashboard/                    # React + Vite + Tailwind + shadcn/ui
│   ├── src/
│   │   ├── pages/                # Dashboard, Skills, Evolution, Graph, Settings
│   │   ├── components/ui/        # shadcn components (Radix UI)
│   │   ├── components/layout/    # Sidebar, Header, ThemeToggle
│   │   └── lib/                  # API client, utils, mock data
│   ├── package.json
│   └── vite.config.ts
├── tests/                        # 373 tests
│   ├── test_registry.py
│   ├── test_tracker.py
│   ├── test_graph.py
│   ├── test_forge.py
│   ├── test_phase5a.py
│   ├── test_phase6a.py
│   └── test_phase8.py
├── docs/
│   ├── ARTICLE.md                # Medium article draft
│   ├── INTEGRATION_GUIDE.md
│   └── BENCHMARK_GUIDE.md
├── examples/
│   └── hermes_integration/
│       └── example.py
└── README.md

Running Tests

# Run all 373 tests
pytest

# With coverage
pytest --cov=skillforge --cov-report=term-missing

# Run specific test module
pytest tests/test_registry.py -v

Research References

SkillForge builds on a survey of 2026 research on agent skill systems. The papers below directly inspired its architecture:

Self-Evolution & Skill Learning

Paper	Key Finding	SkillForge Usage
Memento-Skills	Self-evolving skill library: +13.7pp GAIA, +20.8pp HLE	EvolutionLoop, SkillCreator
Skill-Pro	Non-parametric PPO for skill evolution	Q-value update mechanism
AutoSkill	Version-controlled skill lifecycle	SkillRegistry lifecycle management
AgentFactory	Skills as executable Python subagent code	SkillTransferEngine
MUSE-Autoskill	5-stage lifecycle with per-skill memory	Skill lifecycle stages
XSkill	Dual-stream: skills (task-level) + experiences (action-level)	Tier architecture (L1/L2/L3)
MemSkill	Memory operations as learnable meta-skills	SelfDiagnosisEngine
SkillFlow	Opus 4.6 improves from 62.65% to 71.08% (+8.43pp)	Benchmark targets

Learning to Self-Evolve

Paper	Key Finding	SkillForge Usage
LSE — Learning to Self-Evolve	4B model beats GPT-5 via learned self-evolution	Core philosophy
Evolving-RL	98.7% improvement on ALFWorld unseen tasks	RLOptimizer
AEL — Agent Evolving Learning	"Less is more" — self-diagnosis > more mechanisms (Sharpe 2.13)	SelfDiagnosisEngine (minimal mechanism)
Native Evolution	14B model outperforms unassisted Gemini-2.5-Flash	Evolution philosophy
AutoAgent	Evolving cognition + elastic memory, closed-loop	Architecture design

Context Efficiency

Paper	Key Finding	SkillForge Usage
Anthropic Progressive Disclosure	98.7% token reduction (150K → 2K)	ProgressiveLoader 3-tier
SKILLREDUCER	Compression improves quality (48% desc + 39% body)	SkillOptimizer
GenericAgent	Context density maximization, 4-tier memory	Architecture design
Cloudflare Code Mode	99.9% token reduction (1.17M → 1K)	Token efficiency targets

Experience-Based Learning & Tool Memory

Paper	Key Finding	SkillForge Usage
SEARL — Tool Graph Memory	23% higher completion, 68% tool reuse rate	SkillDependencyGraph
MemQ — Provenance DAG	Q-learning on provenance DAG, TD(λ) traces	EffectivenessTracker Q-values
ERL — Experiential Reflective Learning	Heuristics > raw trajectories, +7.8% over ReAct	SelfDiagnosisEngine insights
DeepAgent	Autonomous memory folding, brain-inspired	Architecture design

Evaluation & Benchmarking

Paper	Key Finding	SkillForge Usage
SEA-Eval	SR + T convergence detects genuine vs pseudo-evolution	BenchmarkRunner evolution metrics
SWE-Bench	Standard for coding task evaluation	Benchmark task suite
GAIA Benchmark	General AI assistant benchmark	Benchmark correctness metric

License

This project is licensed under the MIT License. See LICENSE for details.

SkillForge — Making every agent interaction a training signal.
Built with research. Shaped by usage. Evolved by intelligence.
