Self-Evolving Skill Intelligence Platform for AI Agents
Turn static agent skills into living, data-driven assets that learn from every interaction,
self-diagnose failures, and continuously evolve to stay optimal.
- Overview
- Features
- Installation
- Quick Start
- Architecture
- Core Components
- Intelligence Layer
- Advanced Intelligence
- Platform Layer
- Production Layer
- Web Dashboard
- Benchmarking
- Hermes Integration
- Project Structure
- Research Foundations
- License
SkillForge is a skill intelligence layer for AI agents. Rather than treating skills as static prompt files, SkillForge makes them living assets — tracked, ranked, diagnosed, and evolved through real-world usage data.
Most agents ship with a fixed set of skills. When a skill fails, nobody notices. When a better version exists, nobody knows. SkillForge solves this by wrapping every skill in a feedback loop:
- Track every execution outcome (success, latency, tokens, feedback)
- Rank skills dynamically using reinforcement-learning-inspired Q-values
- Diagnose failure patterns automatically (rule-based or LLM-assisted)
- Evolve underperforming skills — patch prompts, adjust metadata, prune dead skills
- Load progressively at 3 detail tiers to minimize context window waste
Built on research from Memento-Skills, AEL, SKILLREDUCER, SEA-Eval, and MemQ, SkillForge is designed as a drop-in skill intelligence layer for any agent framework.
| Metric | Value |
|---|---|
| Components | 35+ |
| Python files | 58 |
| Lines of code | 22,593 |
| Tests | 373 passing |
| Test time | 1.30s |
| External deps | 0 (stdlib-only) |
| Component | Description |
|---|---|
| Skill Registry | SQLite-backed registry with 3-tier progressive loading (metadata → core prompt → full resources). Full CRUD, versioning, lifecycle management (draft → active → deprecated → archived). |
| Effectiveness Tracker | Tracks execution outcomes (success, latency, tokens, feedback). Maintains rolling Q-values via TD(λ) temporal-difference learning — skills that consistently succeed get higher priority. |
| Self-Diagnosis Engine | Analyzes failure patterns across recent outcomes. Supports rule-based heuristics out of the box, with optional LLM-assisted analysis for deeper insights. Generates patch suggestions with confidence scores. |
| Skill Dependency Graph | Directed acyclic graph of skill relationships. Supports topological sorting, downstream impact analysis, and Q-value propagation through dependency chains. |
| Progressive Loader | Loads skills at the right detail level for the task. Tier 1 = metadata (~30 tokens), Tier 2 = core prompt, Tier 3 = full resources. Supports multiple routing strategies (Q-value, success rate, relevance, usage count). |
| Evolution Loop | Continuous lifecycle management. Evaluates skill health (healthy/warning/critical), triggers diagnosis and patching for underperformers, and prunes dead skills that are low-Q, low-usage, and stale. |
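
The Q-value propagation listed for the Skill Dependency Graph can be pictured as a discounted walk over downstream skills. The sketch below is illustrative only and is not SkillForge's implementation: the function name `propagate_delta`, the `dependents` mapping, and the choice to discount once per hop by `gamma` while ignoring edge weights are assumptions, picked because they reproduce the numbers in the dependency-graph example further down.

```python
from collections import deque


def propagate_delta(dependents, source, delta, gamma=0.8):
    """Spread a Q-value change from `source` to every downstream skill.

    `dependents` maps a skill to the skills that depend on it directly.
    Each hop multiplies the change by `gamma`; edge weights are ignored
    in this simplified (assumed) model.
    """
    updates = {}
    queue = deque([(source, delta)])
    while queue:
        skill, change = queue.popleft()
        for child in dependents.get(skill, []):
            if child in updates:  # keep the first (shortest-path) update
                continue
            updates[child] = change * gamma
            queue.append((child, updates[child]))
    return updates


# Hypothetical graph: api-caller depends on http-client, and so on.
dependents = {"http-client": ["api-caller"], "api-caller": ["data-pipeline"]}
result = propagate_delta(dependents, "http-client", delta=-0.2)
print({name: round(value, 3) for name, value in result.items()})
# → {'api-caller': -0.16, 'data-pipeline': -0.128}
```

Breadth-first traversal means a skill reachable by several paths receives the update from its shortest path, which is one reasonable policy among several.
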
| Component | Description |
|---|---|
| Conflict Detector | Detects skill overlaps, contradictions, and dependency cycles. Uses Jaccard similarity for overlap detection (threshold > 0.6). |
| Health Monitor | 4-weight health scoring: Q-value (35%), success rate (30%), recency decay (20%, half-life 14 days), usage frequency (15%, log-scaled saturation). |
| Skill Creator | Auto-creates skills from successful execution trajectories. Pattern frequency threshold (>60%), optional LLM-assisted extraction. |
| Skill Analyzer | Clustering, pattern detection, and recommendations. Identifies underperforming skill clusters and suggests improvements. |
| Skill Optimizer | Compression (deduplicate, reorder), splitting (decompose monolithic skills), and merging (consolidate similar skills). |
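
The Health Monitor weights above can be combined into a single score roughly as follows. This is a sketch under stated assumptions: the table gives only the four weights, the 14-day half-life, and "log-scaled saturation", so the exact decay and saturation formulas, the saturation point of 100 uses, and the name `health_score` are illustrative guesses rather than the library's code.

```python
import math

WEIGHTS = {"q": 0.35, "success": 0.30, "recency": 0.20, "usage": 0.15}
HALF_LIFE_DAYS = 14.0
USAGE_SATURATION = 100  # assumed point where the usage term reaches 1.0


def health_score(q_value, success_rate, days_since_use, usage_count):
    """Weighted health score in [0, 1] for inputs that are themselves in range."""
    # Recency halves every HALF_LIFE_DAYS days without use.
    recency = 0.5 ** (days_since_use / HALF_LIFE_DAYS)
    # Log scaling: early uses count the most; the term is capped at 1.0.
    usage = min(1.0, math.log1p(usage_count) / math.log1p(USAGE_SATURATION))
    return (
        WEIGHTS["q"] * q_value
        + WEIGHTS["success"] * success_rate
        + WEIGHTS["recency"] * recency
        + WEIGHTS["usage"] * usage
    )


fresh = health_score(q_value=1.0, success_rate=1.0, days_since_use=0, usage_count=100)
stale = health_score(q_value=0.0, success_rate=0.0, days_since_use=14, usage_count=0)
print(f"fresh={fresh:.2f} stale={stale:.2f}")
```

With these assumptions, a skill that was perfect but unused for two weeks keeps only half of its recency weight, which is what the half-life is meant to express.
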
| Component | Description |
|---|---|
| Elastic Memory | Adaptive memory store with importance tracking. Consolidation via Jaccard similarity > 0.7. Auto-compact with composite retention score (importance 0.45, recency 0.30, access frequency 0.25). |
| Alert Manager | Threshold, trend, and anomaly detection alerts. Alert lifecycle (active → acknowledged → resolved). Cooldown support, rule targeting, callback registration for webhook/email/Slack notifications. |
| Skill Generator | Zero-shot skill generation from natural language descriptions. Template-driven (no LLM required), TF-IDF-lite keyword extraction, domain-aware tag inference. |
| Enhanced RL Optimizer | Contextual bandits (epsilon-greedy with decay), replay buffer (fixed-capacity deque), reward model (linear SGD), curriculum scheduler (difficulty-based ordering). |
| Performance Predictor | OLS linear regression for performance trend prediction. Predicts skill improvement/decline from last 20 outcomes. Population variance for consistency measurement. |
| Skill Transfer Engine | Cross-agent skill export/import. Formats skills for Hermes, OpenClaw, LangChain, CrewAI. Includes metadata, core content, and dependencies. |
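
Elastic Memory consolidation and the Conflict Detector both rely on Jaccard similarity (thresholds 0.7 and 0.6 respectively). A minimal sketch of the measure is shown below; the lowercase whitespace tokenization and the helper names `jaccard` and `find_similar_pairs` are assumptions for illustration, and SkillForge may tokenize differently.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two texts: |A ∩ B| / |A ∪ B|."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def find_similar_pairs(texts, threshold=0.7):
    """Return (i, j, score) for every pair whose similarity exceeds the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        score = jaccard(a, b)
        if score > threshold:
            pairs.append((i, j, score))
    return pairs


memories = [
    "Found critical bug in auth module",
    "Found critical bug in the auth module",
    "Summarize long documents",
]
pairs = find_similar_pairs(memories, threshold=0.7)
print(pairs)  # the first two entries share 6 of 7 distinct words
```

Pairs above the threshold would be candidates for merging; how the merged entry is built is left open here.
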
| Component | Description |
|---|---|
| REST API Server | 13 endpoints on port 8742 (stdlib http.server). CRUD for skills, outcomes, evolution, health, graph, metrics. Zero external dependencies. |
| MCP Server | 8 tools via stdin/stdout JSON-RPC. Compatible with Claude Desktop, Cursor, and other MCP clients. |
| CLI Tool | 8 commands: serve, skills list/health/search, evolve, dashboard, import, export. |
| Web Dashboard | React 18 + Vite + Tailwind CSS 4 + shadcn/ui. 5 pages (Dashboard, Skills, Evolution, Graph, Settings). Recharts visualization. Dark theme. Mock data fallback. |
| Component | Description |
|---|---|
| Skill Marketplace | Publish, install, rate, and deprecate skills. SQLite-backed catalogue with search, filtering, and install tracking. |
| Observability — Tracing | OpenTelemetry-inspired span-based distributed tracing. SQLite persistence, nested spans, parent-child relationships. |
| Observability — Metrics | Counters, gauges, timings with summarization (min/max/mean/p50/p95/p99), histograms, label-based filtering. |
| Observability — Logging | JSON-structured log entries with 5 severity levels, trace/span correlation, component/skill filtering, full-text search. |
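
The timing summary listed for Observability metrics (min/max/mean/p50/p95/p99) can be computed with a few lines of stdlib Python. The sketch below uses the nearest-rank percentile method, which is an assumption: SkillForge may interpolate instead, and the function names here are hypothetical.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples):
    """Summarize a non-empty collection of timing samples."""
    values = sorted(samples)
    return {
        "min": values[0],
        "max": values[-1],
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }


summary = summarize(range(1, 101))  # latencies 1..100 ms
print(summary)
```

Nearest-rank always returns an observed sample, which keeps the result easy to explain when only a handful of timings have been recorded.
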
| Component | Description |
|---|---|
| Async Skill Registry | Async wrapper via asyncio.to_thread() for all registry CRUD operations. |
| Async Q-Value Tracker | Async outcome recording, Q-value queries, and TD(λ) updates. |
| Async Progressive Loader | Async skill loading and sticky skill retrieval. |
| Version Manager | Semantic versioning (semver 2.0.0). Full version history with snapshots, rollback to any version, content diff, auto-generated changelogs. |
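
The async wrappers above are described as thin layers over `asyncio.to_thread()`. The pattern can be sketched as follows; `SyncStore` and `AsyncStore` are hypothetical stand-ins, not SkillForge classes, and the in-memory SQLite table exists only to give the blocking calls something to do.

```python
import asyncio
import sqlite3
import threading


class SyncStore:
    """Blocking, SQLite-backed store (hypothetical stand-in for a registry)."""

    def __init__(self):
        # check_same_thread=False because calls arrive from worker threads.
        self._conn = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()
        self._conn.execute("CREATE TABLE skills (name TEXT PRIMARY KEY, prompt TEXT)")

    def put(self, name, prompt):
        with self._lock, self._conn:  # the connection context commits on success
            self._conn.execute(
                "INSERT OR REPLACE INTO skills VALUES (?, ?)", (name, prompt)
            )

    def get(self, name):
        with self._lock:
            row = self._conn.execute(
                "SELECT prompt FROM skills WHERE name = ?", (name,)
            ).fetchone()
        return row[0] if row else None


class AsyncStore:
    """Async facade: each blocking call runs in a worker thread."""

    def __init__(self, store):
        self._store = store

    async def put(self, name, prompt):
        await asyncio.to_thread(self._store.put, name, prompt)

    async def get(self, name):
        return await asyncio.to_thread(self._store.get, name)


async def main():
    store = AsyncStore(SyncStore())
    await store.put("greeter", "Greet users warmly")
    return await store.get("greeter")


prompt = asyncio.run(main())
print(prompt)  # → Greet users warmly
```

The event loop stays responsive while SQLite does its work, at the cost of one thread hop per call.
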
| Component | Description |
|---|---|
| A/B Testing | Full experiment lifecycle. 4 assignment strategies (RANDOM, ROUND_ROBIN, WEIGHTED_RANDOM, HASH_BASED). Two-proportion z-test (2 variants) + chi-squared (3+ variants). Normal CDF via Abramowitz & Stegun (~1.5e-7 accuracy). |
| DB Abstraction | DatabaseBackend Protocol interface. SQLiteBackend (WAL mode, threading lock) + MemoryBackend (ephemeral for tests). DatabaseConnection context manager with auto-commit/rollback. |
| Integration Adapters | GitHubAdapter (REST API with SHA tracking), SlackAdapter (incoming webhooks), WebhookAdapter (HTTP POST + HMAC-SHA256), FileAdapter (JSON/markdown export). AdapterRegistry for dispatch and health checks. |
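
The two-variant significance check mentioned for A/B Testing can be sketched with the standard library alone. The code below uses the Abramowitz and Stegun approximation 7.1.26 for the error function (maximum absolute error about 1.5e-7) and a pooled two-proportion z-test. Whether SkillForge pools the variance or uses a two-sided test is not stated in this README, so both are assumptions, as are the function names.

```python
import math


def erf_as(x):
    """Abramowitz & Stegun 7.1.26 approximation of erf(x)."""
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
           + t * (-1.453152027 + t * 1.061405429))))
    return sign * (1.0 - poly * math.exp(-x * x))


def normal_cdf(z):
    """Standard normal CDF expressed through erf."""
    return 0.5 * (1.0 + erf_as(z / math.sqrt(2.0)))


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both success rates are equal."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:  # all successes or all failures: no evidence of a difference
        return 0.0, 1.0
    z = (p_b - p_a) / se
    return z, 2.0 * (1.0 - normal_cdf(abs(z)))


# Hypothetical counts: control 60/100 successes, treatment 75/100.
z, p_value = two_proportion_z_test(60, 100, 75, 100)
print(f"z={z:.2f}, p={p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

For three or more variants the table names a chi-squared test instead, which this sketch does not cover.
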
| Component | Description |
|---|---|
| Circuit Breaker | CLOSED → OPEN → HALF_OPEN state machine. Configurable failure threshold, recovery timeout, half-open probes. Thread-safe via threading.Lock. |
| Retry Policy | Exponential backoff with jitter. Configurable max attempts, base delay, max delay, backoff factor, retriable exceptions. |
| Bulkhead | Concurrency limiter with semaphore. Configurable max concurrent executions and max wait time. |
| Graceful Degradation | Cascading fallback chains. Primary function + ordered fallback list. FallbackChainExhaustedError on total failure. |
| Resilient Executor | Composable wrapper: Bulkhead → Circuit Breaker → Retry → Callable. Single API for all resilience layers. |
| Caching | TTLCache (time-to-live with lazy expiry + max_size eviction), LRUCache (Least Recently Used), CachedStore (composable over data source), @cached decorator. CacheStats for hit/miss tracking. |
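
The Retry Policy row describes exponential backoff with jitter. A minimal sketch of that schedule follows; the parameter names mirror the ones listed in the table, but `backoff_delay`, `retry`, and the use of "full jitter" (a uniform draw between 0 and the capped delay) are assumptions for illustration rather than SkillForge's implementation.

```python
import random
import time


def backoff_delay(attempt, base_delay=1.0, backoff_factor=2.0,
                  max_delay=60.0, jitter=True):
    """Delay in seconds before retry number `attempt` (1-based)."""
    delay = min(max_delay, base_delay * backoff_factor ** (attempt - 1))
    if jitter:
        delay = random.uniform(0.0, delay)  # full jitter spreads out retries
    return delay


def retry(fn, max_attempts=3, retriable=(ConnectionError, TimeoutError),
          sleep=time.sleep, **backoff_kwargs):
    """Call `fn` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, **backoff_kwargs))


schedule = [backoff_delay(n, jitter=False) for n in range(1, 8)]
print(schedule)  # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; exceptions outside `retriable` propagate immediately.
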
# From Git
git clone https://github.com/dwickyfp/skillforge.git
cd skillforge
pip install -e .
# Or install directly
pip install git+https://github.com/dwickyfp/skillforge.git

Requirements: Python 3.10+, SQLite 3.35+ (ships with Python)
Zero external dependencies — SkillForge uses only Python stdlib.
from skillforge.forge import SkillForge
# Initialize — creates ~/.skillforge/skillforge.db automatically
forge = SkillForge()
# Register a skill with 3-tier content
skill = forge.register_skill(
    name="code-reviewer",
    tier1_metadata="Review code for bugs, style, and performance issues",
    tier2_core="You are an expert code reviewer. Analyze the given code for...",
    tier3_resources=["examples/review_template.md", "rules/style_guide.md"],
    tags=["code", "review", "quality"],
)
# Load skills by natural-language query (ranked by Q-value)
skills = forge.load_skill("review my code", tier=2, limit=3)
print(f"Best match: {skills[0].name} (Q={skills[0].q_value:.2f})")
# Record execution outcomes
forge.record_outcome(
    skill_id=skill.id,
    success=True,
    latency_ms=340,
    tokens_used=1200,
    user_feedback=4.5,
)
# Run evolution cycle — diagnose failures, patch skills, prune dead ones
report = forge.run_evolution_loop()
print(report.summary())
# Inspect skill stats
stats = forge.get_skill_stats(skill.id)
print(f"Success rate: {stats['success_rate']:.0%}")
forge.close()

Or use the context manager:

with SkillForge() as forge:
    forge.register_skill("greeter", "Greet users warmly")
    # ... all operations ...
# automatically closed

┌───────────────────────────────────────────────────────────────────────────┐
│ SkillForge Orchestrator │
│ (forge.py) │
├───────────────┬───────────────┬──────────────────┬────────────────────────┤
│ │ │ │ │
│ ┌────────────▼──────────┐ │ ┌───────────────▼────────────┐ │
│ │ Skill Registry │ │ │ Effectiveness Tracker │ │
│ │ ───────────────── │ │ │ ────────────────────── │ │
│ │ • 3-tier loading │ │ │ • Outcome recording │ │
│ │ • CRUD + version │ │ │ • Q-values (TD(λ)) │ │
│ │ • Lifecycle mgmt │ │ │ • Rolling statistics │ │
│ │ • SQLite backend │ │ │ • SQLite backend │ │
│ └────────────┬──────────┘ │ └───────────────┬────────────┘ │
│ │ │ │ │
│ ┌────────────▼───────────────▼──────────────────▼──────────────┐ │
│ │ Progressive Loader │ │
│ │ ────────────────── │ │
│ │ • Tier-by-tier retrieval │ │
│ │ • Q-value routing / relevance / success rate │ │
│ │ • Sticky skills (recently used get priority) │ │
│ └──────────────────────────────┬────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────────▼────────────────────────────────┐ │
│ │ Evolution Loop │ │
│ │ ────────────── │ │
│ │ • Health assessment (healthy/warning/critical) │ │
│ │ • Triggers diagnosis → patch → version bump │ │
│ │ • Prunes dead skills (low-Q, low-usage, stale) │ │
│ └───────┬────────────────────────────────────────┬───────────────┘ │
│ │ │ │
│ ┌───────▼──────────┐ ┌──────────────▼──────────────┐ │
│ │ Self-Diagnosis │ │ Skill Dependency Graph │ │
│ │ ─────────────── │ │ ────────────────────── │ │
│ │ • Failure │ │ • DAG of dependencies │ │
│ │ pattern │ │ • Topological sort │ │
│ │ analysis │ │ • Impact analysis │ │
│ │ • Rule-based + │ │ • Q-value propagation │ │
│ │ LLM insights │ │ • In-memory graph │ │
│ │ • Auto-patching │ │ │ │
│ └──────────────────┘ └─────────────────────────────┘ │
│ │
├───────────────────────────────────────────────────────────────────────────┤
│ Intelligence Layer │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Conflict Detector │ Health Monitor │ Skill Creator │ │
│ │ Skill Analyzer │ Skill Optimizer │ Elastic Memory │ │
│ │ Alert Manager │ Skill Generator │ Enhanced RL Optimizer │ │
│ │ Performance Predictor │ Skill Transfer Engine │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
├───────────────────────────────────────────────────────────────────────────┤
│ Platform Layer │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ REST API (port 8742) │ MCP Server │ CLI │ Web Dashboard │ │
│ │ Skill Marketplace │ Observability (Tracing/Metrics/Logging) │ │
│ │ Async Support │ Versioning │ A/B Testing │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
├───────────────────────────────────────────────────────────────────────────┤
│ Production Layer │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Circuit Breaker │ Retry Policy │ Bulkhead │ Graceful Degradation │ │
│ │ Resilient Executor │ TTL/LRU Caching │ DB Abstraction │ │
│ │ Integration Adapters (GitHub/Slack/Webhook/File) │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
├───────────────────────────────────────────────────────────────────────────┤
│ Integrations │
│ ┌──────────────────────────────────────────┐ │
│ │ Hermes SkillForge Adapter │ │
│ │ • Import SKILL.md → SkillForge │ │
│ │ • Export SkillForge → SKILL.md │ │
│ │ • Bidirectional sync │ │
│ └──────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────┐ │
│ │ Benchmark Runner │ │
│ │ • A/B comparison (vanilla vs +SF) │ │
│ │ • Correctness, cost, latency metrics │ │
│ │ • Statistical significance testing │ │
│ └──────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────┘
The central store for all skills, backed by SQLite with WAL mode for concurrent access.
from skillforge.core.registry import SkillRegistry, SkillLifecycle
registry = SkillRegistry(db_path="./my_skills.db")
# Register
skill = registry.register_skill(
    name="summarizer",
    tier1_metadata="Summarize long documents into key points",
    tier2_core="You are an expert summarizer. Given the following document...",
    tier3_resources=["examples/summary_format.md"],
    tags=["nlp", "summarization"],
)
# Progressive retrieval — control detail level
s1 = registry.get_skill(skill.id, tier=1) # metadata only
s2 = registry.get_skill(skill.id, tier=2) # + core prompt
s3 = registry.get_skill(skill.id, tier=3) # + resources
# Search and filter
results = registry.search_skills("summarize")
active = registry.list_skills(lifecycle=SkillLifecycle.ACTIVE, tags=["nlp"])
# Update and version
registry.update_skill(skill.id, {"tier2_core": "Updated prompt..."})
registry.version_skill(skill.id) # bumps version number
registry.close()

Records execution outcomes and maintains Q-values using TD(λ) temporal-difference learning.
from skillforge.core.tracker import QValueTracker, Outcome
tracker = QValueTracker(db_path="./tracker.db")
# Record outcomes
tracker.record_outcome(Outcome(
    skill_id="summarizer",
    success=True,
    latency_ms=250.0,
    tokens_used=800,
    user_feedback=4.0,
))
# Get rolling statistics
stats = tracker.get_stats("summarizer")
# → {'q_value': 0.72, 'success_rate': 0.85, 'avg_latency_ms': 250.0, ...}
# TD(λ) update — called automatically or manually
new_q = tracker.td_lambda_update("summarizer", reward=0.85)
tracker.close()

Analyzes failure patterns and generates patch suggestions.
from skillforge.core.diagnosis import SelfDiagnosisEngine
# Rule-based analysis (no LLM needed)
diagnosis = SelfDiagnosisEngine(registry=registry, tracker=tracker)
insights = diagnosis.analyze_failures("summarizer", window=10)
for insight in insights:
    print(f"Issue: {insight.failure_type}")
    print(f"Patch: {insight.patch_suggestion}")
    print(f"Confidence: {insight.confidence:.0%}")
# With LLM-assisted analysis
def my_llm(prompt: str) -> str:
    # call_my_model is a placeholder for your own model client
    return call_my_model(prompt)
diagnosis_llm = SelfDiagnosisEngine(
    registry=registry,
    tracker=tracker,
    llm_fn=my_llm,
)
# Auto-patch (guarded, because the analysis may return no insights)
if insights:
    result = diagnosis.auto_patch_skill("summarizer", insights[0])
    print(f"Patched: {result['success']}, new version: {result.get('new_version')}")

A directed acyclic graph (DAG) that models relationships between skills.
from skillforge.core.graph import SkillDependencyGraph
graph = SkillDependencyGraph()
graph.add_skill("http-client")
graph.add_skill("api-caller")
graph.add_skill("data-pipeline")
# api-caller depends on http-client (weight 0.8)
graph.add_dependency("api-caller", "http-client", weight=0.8)
graph.add_dependency("data-pipeline", "api-caller", weight=0.5)
# Topological sort
order = graph.topological_sort()
# → ["http-client", "api-caller", "data-pipeline"]
# Impact analysis — what breaks if http-client degrades?
impact = graph.downstream_impact("http-client")
# → ["api-caller", "data-pipeline"]
# Q-value propagation
propagated = graph.propagate_q_update("http-client", delta=-0.2, gamma=0.8)
# → {"api-caller": -0.16, "data-pipeline": -0.128}

Loads skills at the right detail level with configurable routing strategies.
from skillforge.core.loader import ProgressiveLoader
loader = ProgressiveLoader(registry, tracker)
# Load by query — tier 1 (fast, metadata-only)
quick = loader.load_skill("email management", tier=1, limit=5)
# Load by query — tier 2 (core prompt for execution)
ready = loader.load_skill("email management", tier=2, routing="q_value")
# Different routing strategies
by_quality = loader.load_skill("code", routing="q_value")
by_success = loader.load_skill("code", routing="success_rate")
by_usage = loader.load_skill("code", routing="usage_count")
by_relevance = loader.load_skill("code", routing="relevance")
# Sticky skills — recently and frequently used
sticky = loader.get_sticky_skills(limit=5)

Continuous lifecycle management for all skills.
from skillforge.core.evolution import EvolutionLoop
evolution = EvolutionLoop(
    registry=registry,
    tracker=tracker,
    graph=graph,
    diagnosis=diagnosis,
)
# Full evolution cycle
report = evolution.run_evolution_loop()
print(f"Evaluated: {report.total_skills_evaluated}")
print(f"Healthy: {report.skills_healthy}")
print(f"Warning: {report.skills_warning}")
print(f"Critical: {report.skills_critical}")
print(f"Evolved: {report.skills_evolved}")
print(f"Pruned: {report.skills_pruned}")
# Custom thresholds
report = evolution.run_evolution_loop(thresholds={
    "q_warning": 0.6,
    "q_critical": 0.35,
    "prune_max_age_days": 60,
})

Adaptive memory store with consolidation and auto-compact.
from skillforge.advanced.elastic_memory import ElasticMemory, MemoryEntry
memory = ElasticMemory(db_path="./memory.db")
# Remember a skill execution
entry = memory.remember(
    skill_id="code-reviewer",
    content="Found critical bug in auth module",
    importance=0.9,
    metadata={"severity": "high", "module": "auth"},
)
# Recall relevant memories
results = memory.recall("auth bug critical")
for result in results:
    print(f"[{result.importance:.2f}] {result.content}")
# Consolidate similar memories (Jaccard > 0.7)
memory.consolidate()
# Auto-compact (LRU with composite retention score)
memory.auto_compact(max_memories=1000)
memory.close()

Threshold, trend, and anomaly detection alerts.
from skillforge.intelligence.alert_manager import AlertManager, AlertRule, AlertRuleType
alerts = AlertManager(registry, tracker, health_monitor)
# Add alert rules
alerts.add_rule(AlertRule(
    name="critical-skill-degradation",
    rule_type=AlertRuleType.THRESHOLD,
    metric="health_score",
    threshold=0.4,
    operator="less_than",
    cooldown_seconds=3600,
))
# Check alerts (evaluates all rules)
fired = alerts.check_alerts()
for alert in fired:
    print(f"🚨 {alert.severity}: {alert.message}")
# Get active alerts
active = alerts.get_active_alerts()
# Acknowledge and resolve each active alert
for alert in active:
    alerts.acknowledge_alert(alert.id)
    alerts.resolve_alert(alert.id, resolution="Fixed by updating prompt")

Zero-shot skill generation from natural language.
from skillforge.advanced.skill_generator import SkillGenerator, GenerationRequest
generator = SkillGenerator(registry)
# Generate a skill from description
skill = generator.generate(GenerationRequest(
    description="A skill for reviewing pull requests on GitHub, checking for bugs, style issues, and performance problems",
    domain="code-review",
    complexity="medium",
    include_examples=True,
))
print(f"Generated: {skill.name}")
print(f"Tags: {skill.tags}")
print(f"Confidence: {skill.confidence:.2f}")
# Generate and register in one step
skill_id = generator.generate_and_register(
    description="Summarize long documents into bullet points",
    domain="nlp",
)
# Batch generation
skills = generator.generate_batch([
    GenerationRequest(description="Debug Python code", domain="debugging"),
    GenerationRequest(description="Write unit tests", domain="testing"),
])

Contextual bandits, replay buffer, reward model, curriculum scheduling.
from skillforge.advanced.rl_optimizer import RLOptimizer
rl = RLOptimizer(registry, tracker, evolution)
# Contextual bandit: recommend next action
action = rl.recommend_action("code-reviewer")
# → "evolve" | "patch" | "prune" | "keep"
# Predict reward for an action
reward = rl.predict_reward("code-reviewer", action="evolve")
# Curriculum optimization (difficulty-based ordering)
rl.curriculum_optimize(skill_ids=["skill-1", "skill-2", "skill-3"])
# Batch optimize all underperforming skills
rl.batch_optimize(threshold=0.5)
# Get diagnostics
diag = rl.get_diagnostics()
print(f"Bandit epsilon: {diag['epsilon']:.3f}")
print(f"Replay buffer size: {diag['buffer_size']}")

OLS linear regression for performance trend prediction.
from skillforge.advanced.predictor import SkillPredictor
predictor = SkillPredictor(registry, tracker, graph)
# Predict performance trend
trend = predictor.predict_performance("code-reviewer")
print(f"Predicted Q in 7 days: {trend.predicted_q:.2f}")
print(f"Confidence: {trend.confidence:.2f}")
print(f"Trend: {trend.direction}") # "improving" | "stable" | "declining"
# Get recommendations for a task
recommendations = predictor.predict_for_task("review github pr")
for rec in recommendations:
    print(f"  {rec.skill_id}: Q={rec.predicted_q:.2f} (σ²={rec.variance:.3f})")

Cross-agent skill export/import.
from skillforge.advanced.transfer import SkillTransferEngine
transfer = SkillTransferEngine(registry)
# Export to Hermes format
hermes_path = transfer.export_to_hermes(
    skill_id="code-reviewer",
    output_dir="~/.hermes/skills/productivity/",
)
# Export to OpenClaw format
openclaw_path = transfer.export_to_openclaw(
    skill_id="code-reviewer",
    output_dir="./openclaw-skills/",
)
# Export to LangChain format
langchain_path = transfer.export_to_langchain(
    skill_id="code-reviewer",
    output_dir="./langchain-skills/",
)
# Import from another agent
imported = transfer.import_skill(
    format="hermes",
    path="~/.hermes/skills/productivity/code-reviewer/SKILL.md",
)

Detects skill overlaps, contradictions, and dependency cycles.
from skillforge.intelligence.conflict_detector import ConflictDetector
detector = ConflictDetector(registry, tracker)
# Detect overlapping skills (Jaccard similarity > 0.6)
overlaps = detector.detect_overlaps()
for overlap in overlaps:
    print(f"Overlap: {overlap.skill_a} ↔ {overlap.skill_b} ({overlap.similarity:.2f})")
# Detect contradictory skills
contradictions = detector.detect_contradictions()
# Detect dependency cycles
cycles = detector.detect_cycles()

4-weight health scoring with exponential decay.
from skillforge.intelligence.health_monitor import HealthMonitor
health = HealthMonitor(registry, tracker, graph)
# Check all skills
results = health.check_all()
for result in results:
    print(f"{result.skill_id}: {result.status} (score={result.score:.2f})")
# Health formula:
#   0.35 × Q_value
# + 0.30 × Success_Rate
# + 0.20 × Recency_Decay (half-life 14 days)
# + 0.15 × Usage_Frequency (log-scaled, saturates at ~100 uses)

# Start the API server
skillforge serve --port 8742
# Endpoints:
# GET  /api/skills               → List all skills
# GET  /api/skills/:id           → Skill detail
# POST /api/skills               → Create skill
# POST /api/skills/:id/outcomes  → Record outcome
# POST /api/evolution/run        → Trigger evolution
# GET  /api/health               → Health dashboard
# GET  /api/metrics              → Aggregated metrics
# GET  /api/graph                → Dependency graph
# GET  /api/evolution/history    → Evolution timeline

from skillforge.api.client import SkillForgeClient
client = SkillForgeClient(base_url="http://localhost:8742")
# List skills
skills = client.list_skills()
# Record outcome
client.record_outcome("skill-id", success=True, latency_ms=300, tokens_used=1000)
# Trigger evolution
client.run_evolution()
# Get health dashboard
health = client.get_health()

# List all skills
skillforge skills list
# Search skills
skillforge skills search "code review"
# Check health
skillforge health
# Run evolution
skillforge evolve
# Start API server
skillforge serve --port 8742
# Open dashboard
skillforge dashboard
# Import from Hermes
skillforge import --source ~/.hermes/skills
# Export to Hermes
skillforge export --skill code-reviewer --target ~/.hermes/skills/

# Start MCP server (stdin/stdout JSON-RPC)
skillforge-mcp
# 8 tools available:
# - list_skills
# - get_skill
# - search_skills
# - record_outcome
# - run_evolution
# - get_health
# - get_graph
# - get_metrics

from skillforge.advanced.ab_testing import ABTestRunner, ExperimentConfig, Variant
runner = ABTestRunner(db_path="./ab_tests.db")
# Create experiment
config = ExperimentConfig(
    name="prompt-optimization",
    variants=[
        Variant(id="control", skill_id="code-reviewer-v1", weight=0.5),
        Variant(id="treatment", skill_id="code-reviewer-v2", weight=0.5),
    ],
    significance_level=0.05,
    min_sample_size=50,
)
exp_id = runner.create_experiment(config)
runner.start_experiment(exp_id)
# Assign users to variants (4 strategies: RANDOM, ROUND_ROBIN, WEIGHTED_RANDOM, HASH_BASED)
variant = runner.assign_variant(exp_id, user_id="user-123")
# Record outcomes
runner.record_outcome(exp_id, variant_id="control", success=True, latency_ms=300)
runner.record_outcome(exp_id, variant_id="treatment", success=True, latency_ms=250)
# Evaluate statistical significance
result = runner.evaluate(exp_id)
print(f"Significant: {result.is_significant}")
print(f"p-value: {result.p_value:.4f}")
print(f"Winner: {result.winner}")

from skillforge.marketplace.registry import MarketplaceRegistry
from skillforge.marketplace.publisher import SkillPublisher
from skillforge.marketplace.installer import SkillInstaller
marketplace = MarketplaceRegistry(db_path="./marketplace.db")
publisher = SkillPublisher(marketplace, registry)
installer = SkillInstaller(marketplace, registry)
# Publish a skill
publisher.publish(
    skill_id="code-reviewer",
    version="1.2.0",
    description="Expert code review for Python projects",
    tags=["code", "review", "python"],
    visibility="public",
)
# Search marketplace
results = marketplace.search("code review", tags=["python"])
# Install a skill
installed = installer.install(
    skill_id="code-reviewer",
    version="1.2.0",
    target_dir="~/.hermes/skills/productivity/",
)
# Rate a skill
marketplace.rate("code-reviewer", rating=5, review="Excellent skill!")

from skillforge.observability.tracer import SkillTracer
from skillforge.observability.metrics import MetricsCollector
from skillforge.observability.logger import StructuredLogger
# Tracing
tracer = SkillTracer(db_path="./traces.db")
with tracer.start_span("skill-execution", skill_id="code-reviewer") as span:
    # ... execute skill ...
    span.set_attribute("tokens_used", 1200)
    span.set_status("ok")
# Metrics
metrics = MetricsCollector(db_path="./metrics.db")
metrics.counter("skill_loads", labels={"skill": "code-reviewer"}).inc()
metrics.gauge("q_value", labels={"skill": "code-reviewer"}).set(0.85)
metrics.timing("execution_time", labels={"skill": "code-reviewer"}).observe(340.0)
# Get summary
summary = metrics.summarize("execution_time")
print(f"p50: {summary['p50']:.1f}ms, p95: {summary['p95']:.1f}ms")
# Structured logging
logger = StructuredLogger(db_path="./logs.db")
logger.info("Skill loaded", component="loader", skill="code-reviewer")
logger.warning("Low Q-value", component="evolution", skill="old-skill", q_value=0.3)

from skillforge.versioning.version_manager import VersionManager, SemanticVersion
vm = VersionManager(db_path="./versions.db")
# Save a version snapshot
vm.save_version("code-reviewer", version="1.2.0", data={
    "tier1_metadata": "Review code for bugs...",
    "tier2_core": "You are an expert code reviewer...",
})
# List version history
history = vm.list_versions("code-reviewer")
for v in history:
    print(f"  v{v.version} - {v.created_at}")
# Diff between versions
diff = vm.diff("code-reviewer", "1.1.0", "1.2.0")
print(f"Added: {diff['added']}")
print(f"Modified: {diff['modified']}")
print(f"Removed: {diff['removed']}")
# Rollback to a previous version
vm.rollback("code-reviewer", version="1.1.0")

# CircuitOpenError is assumed to be exported alongside CircuitBreaker
from skillforge.core.resilience import CircuitBreaker, CircuitOpenError, CircuitState
cb = CircuitBreaker(
    name="llm-api",
    failure_threshold=5,                # Open after 5 consecutive failures
    recovery_timeout=30.0,              # Wait 30s before half-open
    half_open_max=2,                    # Allow 2 probe calls
    excluded_exceptions=(ValueError,),  # Don't count validation errors
)
# Use the circuit breaker
try:
    result = cb.call(lambda: call_llm_api(prompt))
except CircuitOpenError as e:
    print(f"Circuit open, retry in {e.remaining_seconds:.1f}s")
# Check state
stats = cb.get_stats()
print(f"State: {stats.state}") # CLOSED / OPEN / HALF_OPEN
print(f"Failures: {stats.consecutive_failures}")

from skillforge.core.resilience import RetryPolicy
rp = RetryPolicy(
    max_attempts=3,
    base_delay=1.0,
    max_delay=60.0,
    backoff_factor=2.0,
    jitter=True,
    retriable_exceptions=(ConnectionError, TimeoutError),
)
# Execute with retries
result = rp.execute(lambda: fetch_remote_data(url))

Composable wrapper combining circuit breaker, retry, and bulkhead.
from skillforge.core.resilience import ResilientExecutor, CircuitBreaker, RetryPolicy, Bulkhead
executor = ResilientExecutor(
    name="llm-call",
    circuit_breaker=CircuitBreaker("llm", failure_threshold=5),
    retry_policy=RetryPolicy(max_attempts=3, base_delay=1.0),
    bulkhead=Bulkhead("llm-pool", max_concurrent=10),
)
# Execute through all resilience layers
result = executor.execute(lambda: call_llm(prompt))
# Get aggregated stats
stats = executor.get_stats()
print(f"Total calls: {stats.total_attempts}")
print(f"Successes: {stats.total_successes}")

from skillforge.core.cache import TTLCache, LRUCache, CachedStore, cached
# TTL Cache
cache = TTLCache(ttl_seconds=300, max_size=1024)
cache.put("skill-stats:code-reviewer", {"q_value": 0.85, "usage": 42})
stats = cache.get("skill-stats:code-reviewer")
# LRU Cache
lru = LRUCache(max_size=256)
lru.put("key", "value")
# CachedStore — wraps a data source with TTL
store = CachedStore(
    fetch_fn=lambda key: expensive_query(key),
    ttl_seconds=60,
)
result = store.get("query-key") # cached after first call
# @cached decorator
@cached(ttl_seconds=120)
def get_skill_stats(skill_id: str) -> dict:
    return forge.get_skill_stats(skill_id)
# Cache stats
print(f"Hit rate: {cache.stats.hit_rate:.1%}")

from skillforge.core.db import create_backend, SQLiteBackend, MemoryBackend
# SQLite backend (production)
db = create_backend("sqlite", path="./skillforge.db")
# Memory backend (testing)
db = create_backend("memory")
# Use the connection
with db.connect() as conn:
    result = conn.execute("SELECT * FROM skills WHERE q_value > ?", (0.7,))
    for row in result:
        print(row)

from skillforge.integrations.adapters import (
    GitHubAdapter, SlackAdapter, WebhookAdapter, FileAdapter, AdapterRegistry
)
registry = AdapterRegistry()
# Register adapters
registry.register("github", GitHubAdapter(
    repo="user/skills-repo",
    token="ghp_xxx",
))
registry.register("slack", SlackAdapter(
    webhook_url="https://hooks.slack.com/services/xxx",
))
registry.register("webhook", WebhookAdapter(
    url="https://api.example.com/skills",
    secret="hmac-secret",  # HMAC-SHA256 signature
))
# Dispatch events to all adapters
registry.dispatch(event_type="skill_evolved", payload={
    "skill_id": "code-reviewer",
    "old_q": 0.45,
    "new_q": 0.72,
})
# Health check all adapters
health = registry.health_check()
for name, status in health.items():
    print(f"  {name}: {'✅' if status else '❌'}")

React 18 + Vite + Tailwind CSS 4 + shadcn/ui dashboard with 5 pages.
cd dashboard
npm install
npm run dev # Dev server at http://localhost:5173
npm run build # Production build in dist/

| Page | Features |
|---|---|
| Dashboard | KPI cards (Total Skills, Avg Q-Value, Success Rate, Outcomes), health pie chart, token usage trend, top 5 skills, recent evolution events |
| Skills | Searchable skill list, expandable rows (tier1/2/3), health badges, Q-value progress bars, per-skill evolution |
| Evolution | Timeline visualization, stat cards, global evolution trigger, event cards with success/failure |
| Graph | SVG dependency visualization, Q-value color-coded nodes, hover highlighting, click-to-select |
| Settings | API endpoint config, auto-evolution toggle, alert thresholds, import/export |

# Terminal 1: Start SkillForge API server
python -m skillforge.api --port 8742
# Terminal 2: Start dashboard dev server
cd dashboard && npm run dev
# Dashboard fetches from localhost:8742 (Vite proxy)

The dashboard automatically falls back to mock data when the API server is offline.
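
The fallback behaviour described above is a common pattern and can be sketched in Python for scripts that consume the same API. This is only an illustration under stated assumptions: the dashboard itself is a React app, and `fetch_with_fallback`, `offline_api`, and `MOCK_SKILLS` are hypothetical names that do not come from the SkillForge codebase.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def fetch_with_fallback(fetch: Callable[[], T], fallback: T) -> Tuple[T, bool]:
    """Return (data, is_live); use the fallback when the fetch cannot connect."""
    try:
        return fetch(), True
    except OSError:
        # urllib.error.URLError, ConnectionRefusedError and TimeoutError
        # are all subclasses of OSError, so one clause covers "server offline".
        return fallback, False


def offline_api():
    # Hypothetical stand-in for an HTTP call to a server that is not running.
    raise ConnectionRefusedError("API server is not running")


# Hypothetical mock payload, shaped loosely like a skill listing.
MOCK_SKILLS = [{"skill_id": "code-reviewer", "q_value": 0.85}]

data, is_live = fetch_with_fallback(offline_api, MOCK_SKILLS)
print(is_live, data[0]["skill_id"])  # False code-reviewer
```

Returning the `is_live` flag alongside the data lets the caller show that it is displaying mock data rather than silently presenting it as real.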
SkillForge ships with a benchmark runner for A/B comparison: vanilla agent vs. agent + SkillForge.
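
As a rough illustration of what an A/B comparison of this kind measures, the sketch below computes a correctness lift and a token ratio from paired runs. It is not `BenchmarkRunner`'s implementation: `ab_summary` and the run dictionaries are hypothetical, and the numbers are made up for the example.

```python
from statistics import mean


def ab_summary(baseline, treatment):
    """Compare two equally sized lists of runs.

    Each run is a dict with 'correct' (bool) and 'tokens' (int).
    """
    if len(baseline) != len(treatment) or not baseline:
        raise ValueError("need two non-empty run lists of equal length")
    base_acc = mean(1.0 if r["correct"] else 0.0 for r in baseline)
    treat_acc = mean(1.0 if r["correct"] else 0.0 for r in treatment)
    base_tokens = mean(r["tokens"] for r in baseline)
    treat_tokens = mean(r["tokens"] for r in treatment)
    return {
        # Difference in success rate, treatment minus baseline.
        "correctness_lift": treat_acc - base_acc,
        # Below 1.0 means the treatment used fewer tokens on average.
        "token_ratio": treat_tokens / base_tokens,
    }


# Made-up runs: vanilla agent vs. the same agent with skills enabled.
vanilla = [
    {"correct": False, "tokens": 1000},
    {"correct": True, "tokens": 1000},
    {"correct": False, "tokens": 1000},
    {"correct": True, "tokens": 1000},
]
with_skills = [
    {"correct": True, "tokens": 800},
    {"correct": True, "tokens": 900},
    {"correct": False, "tokens": 700},
    {"correct": True, "tokens": 800},
]

result = ab_summary(vanilla, with_skills)
print(f"Correctness lift: {result['correctness_lift']:+.1%}")  # +25.0%
print(f"Token ratio: {result['token_ratio']:.2f}x")            # 0.80x
```

A sample this small says nothing about statistical significance; a real comparison needs enough retries per task for a significance test to be meaningful.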
from skillforge.benchmark.runner import BenchmarkRunner
from skillforge.benchmark.tasks import Task, TaskSuite, TaskCategory
# Define a task suite
tasks = TaskSuite(
    name="coding-benchmark",
    tasks=[
        Task(
            id="task-001",
            description="Write a Python function to parse CSV files",
            category=TaskCategory.SKILL_INTENSIVE,
            expected_output="Working CSV parser",
            optimal_steps=3,
        ),
        Task(
            id="task-002",
            description="Debug a segmentation fault in C code",
            category=TaskCategory.REASONING,
            expected_output="Root cause identified",
            optimal_steps=5,
        ),
    ],
)
# Run the benchmark
runner = BenchmarkRunner(skillforge=forge)
results = runner.run_full_benchmark(tasks, n_retries=3)
# View summary
summary = results["summary"]
print(f"Correctness lift: {summary['avg_correctness_lift']:+.1%}")
print(f"Cost ratio: {summary['avg_cost_ratio']:.2f}x")
print(f"Token ratio: {summary['avg_token_ratio']:.2f}x")
print(f"Stat. significant: {summary['significant']} (p={summary['avg_p_value']:.4f})")

SkillForge provides a bidirectional adapter for Hermes Agent skills.

from skillforge.forge import SkillForge
from skillforge.integrations.hermes.adapter import HermesSkillForgeAdapter
forge = SkillForge()
adapter = HermesSkillForgeAdapter(
    skillforge=forge._registry,
    hermes_skills_dir="~/.hermes/skills",
)
# Import all SKILL.md files
imported = adapter.import_hermes_skills()
for skill in imported:
    print(f"Imported: {skill.name} (v{skill.version}, Q={skill.q_value})")

# Export a SkillForge skill as a SKILL.md file
path = adapter.export_skill_to_hermes(
    skill_id="code-reviewer",
    category="productivity",
)
print(f"Written to: {path}")
# → ~/.hermes/skills/productivity/code-reviewer/SKILL.md

# Sync in both directions
summary = adapter.sync()
print(f"Imported: {summary['imported']}")
print(f"Exported: {summary['exported']}")
print(f"Errors: {len(summary['errors'])}")

skillforge/
├── skillforge/
│   ├── __init__.py
│   ├── forge.py  # Main SkillForge orchestrator
│   ├── core/
│   │   ├── registry.py  # SkillRegistry (SQLite)
│   │   ├── tracker.py  # EffectivenessTracker + QValueTracker
│   │   ├── loader.py  # ProgressiveLoader
│   │   ├── graph.py  # SkillDependencyGraph
│   │   ├── diagnosis.py  # SelfDiagnosisEngine
│   │   ├── evolution.py  # EvolutionLoop
│   │   ├── db.py  # Database abstraction (SQLite/Memory)
│   │   ├── resilience.py  # CircuitBreaker, Retry, Bulkhead, ResilientExecutor
│   │   └── cache.py  # TTLCache, LRUCache, @cached
│   ├── intelligence/
│   │   ├── conflict_detector.py  # Overlap/contradiction/cycle detection
│   │   ├── health_monitor.py  # 4-weight health scoring
│   │   ├── skill_creator.py  # Auto-create from trajectories
│   │   ├── analyzer.py  # Clustering + recommendations
│   │   ├── optimizer.py  # Compression, splitting, merging
│   │   └── alert_manager.py  # Threshold/trend/anomaly alerts
│   ├── advanced/
│   │   ├── elastic_memory.py  # Adaptive memory store + consolidation
│   │   ├── skill_generator.py  # Zero-shot skill generation
│   │   ├── rl_optimizer.py  # Bandits + replay + curriculum
│   │   ├── predictor.py  # OLS performance prediction
│   │   ├── transfer.py  # Cross-agent skill export/import
│   │   ├── multi_agent.py  # Shared skill pool + access control
│   │   └── ab_testing.py  # A/B testing with statistical tests
│   ├── marketplace/
│   │   ├── registry.py  # Marketplace catalogue
│   │   ├── publisher.py  # Skill validation + packaging
│   │   └── installer.py  # Install + update tracking
│   ├── observability/
│   │   ├── tracer.py  # Distributed tracing (spans)
│   │   ├── metrics.py  # Counters, gauges, timings
│   │   └── logger.py  # Structured JSON logging
│   ├── async_support/
│   │   ├── async_registry.py  # Async SkillRegistry wrapper
│   │   ├── async_tracker.py  # Async QValueTracker wrapper
│   │   └── async_loader.py  # Async ProgressiveLoader wrapper
│   ├── versioning/
│   │   └── version_manager.py  # Semver + history + rollback + diff
│   ├── integrations/
│   │   ├── hermes/
│   │   │   └── adapter.py  # HermesSkillForgeAdapter
│   │   └── adapters.py  # GitHub/Slack/Webhook/File adapters
│   ├── api/
│   │   ├── server.py  # REST API server (stdlib http.server)
│   │   ├── client.py  # Python API client
│   │   └── cli.py  # CLI tool
│   ├── mcp/
│   │   └── server.py  # MCP server (stdin/stdout JSON-RPC)
│   └── benchmark/
│       ├── runner.py  # BenchmarkRunner
│       ├── tasks.py  # Task + TaskSuite definitions
│       ├── metrics.py  # MetricCollector
│       └── report.py  # Report generation
├── dashboard/  # React + Vite + Tailwind + shadcn/ui
│   ├── src/
│   │   ├── pages/  # Dashboard, Skills, Evolution, Graph, Settings
│   │   ├── components/ui/  # shadcn components (Radix UI)
│   │   ├── components/layout/  # Sidebar, Header, ThemeToggle
│   │   └── lib/  # API client, utils, mock data
│   ├── package.json
│   └── vite.config.ts
├── tests/  # 373 tests
│   ├── test_registry.py
│   ├── test_tracker.py
│   ├── test_graph.py
│   ├── test_forge.py
│   ├── test_phase5a.py
│   ├── test_phase6a.py
│   └── test_phase8.py
├── docs/
│   ├── ARTICLE.md  # Medium article draft
│   ├── INTEGRATION_GUIDE.md
│   └── BENCHMARK_GUIDE.md
├── examples/
│   └── hermes_integration/
│       └── example.py
└── README.md

# Run all 373 tests
pytest
# With coverage
pytest --cov=skillforge --cov-report=term-missing
# Run specific test module
pytest tests/test_registry.py -v

SkillForge draws on a survey of 2026 research in agent skill systems. The papers below directly inspired its architecture:

| Paper | Key Finding | SkillForge Usage |
|---|---|---|
| Memento-Skills | Self-evolving skill library: +13.7pp GAIA, +20.8pp HLE | EvolutionLoop, SkillCreator |
| Skill-Pro | Non-parametric PPO for skill evolution | Q-value update mechanism |
| AutoSkill | Version-controlled skill lifecycle | SkillRegistry lifecycle management |
| AgentFactory | Skills as executable Python subagent code | SkillTransferEngine |
| MUSE-Autoskill | 5-stage lifecycle with per-skill memory | Skill lifecycle stages |
| XSkill | Dual-stream: skills (task-level) + experiences (action-level) | Tier architecture (L1/L2/L3) |
| MemSkill | Memory operations as learnable meta-skills | SelfDiagnosisEngine |
| SkillFlow | Opus 4.6 improves from 62.65% to 71.08% (+8.43pp) | Benchmark targets |

| Paper | Key Finding | SkillForge Usage |
|---|---|---|
| LSE — Learning to Self-Evolve | 4B model beats GPT-5 via learned self-evolution | Core philosophy |
| Evolving-RL | 98.7% improvement on ALFWorld unseen tasks | RLOptimizer |
| AEL — Agent Evolving Learning | "Less is more" — self-diagnosis > more mechanisms (Sharpe 2.13) | SelfDiagnosisEngine (minimal mechanism) |
| Native Evolution | 14B model outperforms unassisted Gemini-2.5-Flash | Evolution philosophy |
| AutoAgent | Evolving cognition + elastic memory, closed-loop | Architecture design |

| Paper | Key Finding | SkillForge Usage |
|---|---|---|
| Anthropic Progressive Disclosure | 98.7% token reduction (150K → 2K) | ProgressiveLoader 3-tier |
| SKILLREDUCER | Compression improves quality (48% desc + 39% body) | SkillOptimizer |
| GenericAgent | Context density maximization, 4-tier memory | Architecture design |
| Cloudflare Code Mode | 99.9% token reduction (1.17M → 1K) | Token efficiency targets |
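
The token-reduction percentages quoted in the table above follow directly from the before and after counts. The check below is a small arithmetic sketch; `token_reduction` is a hypothetical helper, not part of SkillForge.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when context shrinks from `before` to `after`."""
    return 1.0 - after / before


cases = [
    ("Progressive disclosure", 150_000, 2_000),
    ("Code Mode", 1_170_000, 1_000),
]
for label, before, after in cases:
    print(f"{label}: {token_reduction(before, after):.1%}")
# Progressive disclosure: 98.7%
# Code Mode: 99.9%
```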

| Paper | Key Finding | SkillForge Usage |
|---|---|---|
| SEARL — Tool Graph Memory | 23% higher completion, 68% tool reuse rate | SkillDependencyGraph |
| MemQ — Provenance DAG | Q-learning on provenance DAG, TD(λ) traces | EffectivenessTracker Q-values |
| ERL — Experiential Reflective Learning | Heuristics > raw trajectories, +7.8% over ReAct | SelfDiagnosisEngine insights |
| DeepAgent | Autonomous memory folding, brain-inspired | Architecture design |
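
To give a feel for the Q-value idea referenced in the table above, here is a deliberately simplified sketch: a single-step incremental update with no bootstrapping and no eligibility traces. It is not the TD(λ) scheme that `EffectivenessTracker` is described as using, and `update_q`, the learning rate, and the starting value are assumptions made for illustration.

```python
def update_q(q: float, reward: float, alpha: float = 0.1) -> float:
    """Move the estimate q toward the observed reward by a step of size alpha."""
    return q + alpha * (reward - q)


# Assumed starting estimate for a skill with no history.
q = 0.5
outcomes = [True, True, False, True]  # made-up execution results
for success in outcomes:
    q = update_q(q, 1.0 if success else 0.0)

print(f"{q:.3f}")  # 0.582
```

With a fixed `alpha`, recent outcomes weigh more than old ones, so a skill that starts failing loses rank within a few executions instead of coasting on its past average.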

| Paper | Key Finding | SkillForge Usage |
|---|---|---|
| SEA-Eval | SR + T convergence detects genuine vs pseudo-evolution | BenchmarkRunner evolution metrics |
| SWE-Bench | Standard for coding task evaluation | Benchmark task suite |
| GAIA Benchmark | General AI assistant benchmark | Benchmark correctness metric |

This project is licensed under the MIT License. See LICENSE for details.
SkillForge — Making every agent interaction a training signal.
Built with research. Shaped by usage. Evolved by intelligence.