Your AI gets smarter every time it's used.
Production signals → automatic annotation → curated training data → incremental fine-tuning → better model → repeat.
Every company deploying AI has the same painful loop:
- Fine-tune a model on a static dataset
- Deploy it
- Watch quality degrade as the world changes
- Collect data manually, clean it manually, retrain manually
- Repeat, slower than your competitors
The data you generate in production is the most valuable training signal that exists. It's real-world, task-specific, and perfectly calibrated to your use case. And almost every company throws most of it away.
The reason: building the infrastructure to capture, annotate, curate, and learn from production data is a 6-month engineering project that most teams never get to.
Continuum is that infrastructure, open-sourced and production-ready from day one.
Continuum implements the virtuous cycle that the best AI companies operate:
┌─────────────────────────────────────────────────────────────────────┐
│                       The Continuum Flywheel                        │
│                                                                     │
│     Production                             Better                   │
│     Traffic ─────────────────────────────► AI Model                 │
│        │                                       ▲                    │
│        │                                       │                    │
│     Signal                                Incremental               │
│     Capture                                Training                 │
│        │                                       ▲                    │
│        │                                       │                    │
│     LLM-as-Judge ────────────────────────► Curated                  │
│     Annotation                             Dataset                  │
│                                                                     │
│  Every production interaction feeds back into improvement.          │
│  The flywheel compounds: quality improves with every interaction.   │
└─────────────────────────────────────────────────────────────────────┘
Wrap any LLM call with one decorator. Each request and response is captured automatically, along with its latency, cost, and user feedback.
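To make the idea concrete, here is a minimal sketch of what a capture-style decorator has to do around an async call. It is not Continuum's implementation: `capture_sketch`, `CAPTURED`, and the record fields are hypothetical names chosen for illustration, and records go to an in-memory list rather than a durable queue.

```python
import asyncio
import functools
import time
import uuid

# Hypothetical in-memory store; a real system would use a durable async queue.
CAPTURED = []


def capture_sketch(task, model):
    """Illustrative decorator: records input, output, and latency of an async call."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = await fn(*args, **kwargs)
            CAPTURED.append({
                "id": str(uuid.uuid4()),
                "task": task,
                "model": model,
                "args": args,
                "kwargs": kwargs,
                "output": output,
                "latency_ms": (time.perf_counter() - start) * 1000.0,
            })
            return output
        return wrapper
    return decorator


@capture_sketch(task="customer_support", model="stub-model")
async def handle_inquiry(user_message: str) -> str:
    # Stand-in for a real LLM call.
    await asyncio.sleep(0)
    return f"echo: {user_message}"


reply = asyncio.run(handle_inquiry("Where is my invoice?"))
```

The wrapped function keeps its signature and return value, so existing call sites do not change; the only side effect is one appended record per call.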
from continuum import capture, feedback
from openai import AsyncOpenAI

client = AsyncOpenAI()  # async client, so the calls below can be awaited

# Before: raw LLM call
response = await client.chat.completions.create(...)

# After: one decorator, and that's it
@capture(task="customer_support", model="gpt-4")
async def handle_inquiry(user_message: str) -> str:
    response = await client.chat.completions.create(...)
    return response.choices[0].message.content

# Optionally, capture explicit feedback.
# `interaction` is the captured interaction record; obtaining it is not shown here.
await feedback.record(
    interaction_id=interaction.id,
    score=4.5,  # 1-5 rating
    label="good",
    comment="Correctly identified the billing issue",
)

Continuum uses a proxy reward model: an LLM judge that scores every captured interaction automatically. You define the rubric; Continuum annotates at scale.
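How per-criterion judge scores are combined into one number is not spelled out here. A weighted mean is one plausible reading, sketched below under that assumption; `RubricItem` and `weighted_score` are hypothetical helpers, not part of the Continuum API, and the judge scores are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """Hypothetical stand-in for a rubric criterion: a name and a relative weight."""
    name: str
    weight: float


def weighted_score(items, judge_scores):
    """Combine per-criterion judge scores (e.g. on a 1-5 scale) into a weighted mean."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = [item.name for item in items if item.name not in judge_scores]
    if missing:
        raise KeyError(f"judge returned no score for: {missing}")
    weighted_sum = sum(item.weight * judge_scores[item.name] for item in items)
    # Dividing by the total keeps the result on the judge's scale
    # even if the weights do not sum to exactly 1.
    return weighted_sum / total_weight


rubric = [
    RubricItem("helpfulness", 0.4),
    RubricItem("accuracy", 0.3),
    RubricItem("conciseness", 0.2),
    RubricItem("tone", 0.1),
]
# Scores a judge model might return for one interaction (illustrative values).
scores = {"helpfulness": 5, "accuracy": 4, "conciseness": 3, "tone": 4}
overall = weighted_score(rubric, scores)
```

With these inputs the overall score is 4.2 on the 1-5 scale, dominated by the two heaviest criteria.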
from continuum import Annotator, Criterion

annotator = Annotator(
    judge_model="gpt-4",
    criteria=[
        Criterion("helpfulness", weight=0.4, description="Directly addresses the question"),
        Criterion("accuracy", weight=0.3, description="Factually correct, no hallucinations"),
        Criterion("conciseness", weight=0.2, description="No unnecessary verbosity"),
        Criterion("tone", weight=0.1, description="Professional and empathetic"),
    ],
    batch_size=50,        # Annotate 50 at a time
    cost_budget_usd=5.0,  # Stop after $5 of annotation
)

# Annotate a backlog of interactions
stats = await annotator.annotate_backlog(
    task="customer_support",
    since="7d",
)
print(f"Annotated {stats.total} interactions, ${stats.cost_usd:.2f} spent")
print(f"Score distribution: {stats.score_histogram}")

Don't train on everything; train on the right things. Continuum selects the highest-value training examples using active learning: examples the model is uncertain about, examples near decision boundaries, and examples that represent underserved input distributions.
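The selection idea can be sketched in a few lines: filter by quality, rank by uncertainty, and skip near-duplicates. This is an illustration only, not Continuum's curation algorithm. The `uncertainty` field is assumed to be precomputed (for example from model confidence), and token-set Jaccard similarity stands in for whatever near-duplicate measure a real system would use.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def curate(examples, target_size, min_quality, dedup_threshold):
    """Keep high-quality examples, most uncertain first, skipping near-duplicates."""
    eligible = [e for e in examples if e["quality"] >= min_quality]
    # Higher uncertainty first: these are the examples the model is least sure about.
    eligible.sort(key=lambda e: e["uncertainty"], reverse=True)
    selected = []
    for ex in eligible:
        if len(selected) >= target_size:
            break
        if any(jaccard(ex["text"], s["text"]) >= dedup_threshold for s in selected):
            continue  # too similar to something already chosen
        selected.append(ex)
    return selected


examples = [
    {"text": "how do I reset my password", "quality": 4.5, "uncertainty": 0.9},
    {"text": "how do I reset my password please", "quality": 4.4, "uncertainty": 0.8},
    {"text": "refund for a duplicate charge", "quality": 4.0, "uncertainty": 0.6},
    {"text": "cancel my subscription", "quality": 2.0, "uncertainty": 0.95},
]
picked = curate(examples, target_size=2, min_quality=3.5, dedup_threshold=0.8)
```

Here the low-quality example is dropped despite its high uncertainty, and the second password question is skipped as a near-duplicate of the first, so the refund example takes the second slot.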
from continuum import DataCurator

curator = DataCurator(
    strategy="active_learning",    # or "diversity", "uncertainty", "core_set"
    target_size=1000,              # Build a dataset of 1000 examples
    min_quality_score=3.5,         # Only use high-quality examples
    diversity_coefficient=0.3,     # Balance quality vs diversity
    deduplication_threshold=0.92,  # Remove near-duplicate examples
)
dataset = await curator.build(
    task="customer_support",
    time_window="30d",
)
print(f"Dataset: {len(dataset)} examples from {dataset.source_interactions} interactions")
print(f"Quality: avg={dataset.avg_score:.2f}, min={dataset.min_score:.2f}")
print(f"Coverage: {dataset.intent_coverage:.0%} of intent classes represented")

Train on your curated dataset without losing general capabilities. Continuum uses:
- LoRA (Low-Rank Adaptation) for parameter-efficient, incremental updates
- Elastic Weight Consolidation (EWC) to prevent forgetting prior capabilities
- Replay Buffer to include prior knowledge examples in each training run
- Constitutional Constraints to preserve alignment throughout training
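For readers unfamiliar with EWC, the core of it is a quadratic penalty that makes it expensive to move parameters that mattered for earlier tasks. The sketch below shows the standard form of that penalty on plain Python lists; it is not Continuum's trainer code, and the parameter and Fisher values are made up for illustration.

```python
def ewc_penalty(params, anchor_params, fisher, lambda_ewc):
    """Standard EWC regularizer: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params         current parameter values (theta)
    anchor_params  values after the previous training round (theta*)
    fisher         diagonal Fisher information estimates (F), one per parameter
    """
    if not (len(params) == len(anchor_params) == len(fisher)):
        raise ValueError("params, anchor_params and fisher must have equal length")
    return 0.5 * lambda_ewc * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, lambda_ewc):
    """Loss actually minimized: new-task loss plus the forgetting penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lambda_ewc)


anchor = [1.0, -2.0, 0.5]
current = [1.1, -2.0, 1.5]
fisher = [10.0, 5.0, 0.01]  # third parameter mattered little for prior tasks

penalty = ewc_penalty(current, anchor, fisher, lambda_ewc=1000)
loss = total_loss(0.25, current, anchor, fisher, lambda_ewc=1000)
```

Note how the first parameter moved by only 0.1 but contributes most of the penalty because its Fisher value is high, while the third moved by 1.0 and contributes little.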
from continuum import Trainer, FineTuneConfig, EWCConfig

trainer = Trainer(
    base_model="meta-llama/Llama-3-8B",
    config=FineTuneConfig(
        method="lora",
        lora_rank=16,
        lora_alpha=32,
        learning_rate=2e-4,
        epochs=3,
        forgetting_prevention=EWCConfig(
            enabled=True,
            lambda_ewc=1000,  # Strength of forgetting penalty
            fisher_samples=200,
        ),
        replay_buffer_size=500,  # Include prior examples
        constitutional_constraints=[
            "Never claim to be human",
            "Refuse requests to generate harmful content",
        ],
    ),
)
result = await trainer.train(dataset)
print(f"Training complete: {result.epochs_completed} epochs")
print(f"Loss: {result.final_loss:.4f} (was {result.initial_loss:.4f})")

Never deploy a worse model. Before any deployment, Continuum automatically:
- Runs the candidate model on a held-out golden dataset
- Computes statistical significance of quality changes
- Checks capability preservation (did we lose anything?)
- Verifies alignment constraints still hold
- Approves or blocks deployment
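The approve-or-block decision can be illustrated with a small paired comparison on golden-set scores. This is a sketch under stated assumptions, not the guard's actual logic: scores are taken to be on a 0-1 scale, `min_delta` is read as an absolute change in mean score, and a one-sided normal approximation stands in for whatever significance test is really used (a t-test or bootstrap would be more appropriate for small samples). `regression_check` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def regression_check(baseline, candidate, min_delta, alpha=0.05):
    """Block only if the mean paired delta is significantly below min_delta."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    # Per-example change: positive means the candidate scored higher.
    deltas = [c - b for b, c in zip(baseline, candidate)]
    d_mean = mean(deltas)
    std_err = stdev(deltas) / sqrt(len(deltas))
    if std_err == 0:
        p_value = 0.0 if d_mean < min_delta else 1.0
    else:
        # One-sided test: a small p-value means the delta sits below the threshold.
        p_value = NormalDist().cdf((d_mean - min_delta) / std_err)
    blocked = d_mean < min_delta and p_value < alpha
    return {"mean_delta": d_mean, "p_value": p_value, "approved": not blocked}


baseline = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80, 0.83, 0.77]
worse = [0.70, 0.70, 0.69, 0.70, 0.69, 0.72, 0.71, 0.67]
better = [0.84, 0.85, 0.80, 0.86, 0.82, 0.83, 0.88, 0.80]

verdict_worse = regression_check(baseline, worse, min_delta=-0.02)
verdict_better = regression_check(baseline, better, min_delta=-0.02)
```

The first candidate loses roughly 0.10 on every example and is blocked; the second improves on every example and is approved.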
from continuum import RegressionGuard, GoldenDataset

guard = RegressionGuard(
    golden_dataset=GoldenDataset.load("customer_support_golden_v3"),
    metrics=["helpfulness", "accuracy", "safety"],
    thresholds={
        "helpfulness": {"min_delta": -0.02},  # Allow 2% regression
        "accuracy": {"min_delta": 0.00},      # Zero tolerance
        "safety": {"min_delta": 0.00},        # Zero tolerance
    },
    require_significance=True,  # Only block if statistically significant
    p_value_threshold=0.05,
)
verdict = await guard.evaluate(candidate_model=result.model)
if verdict.approved:
    print("✓ Model approved for deployment")
    print(f"  Helpfulness: {verdict.deltas['helpfulness']:+.2%}")
    print(f"  Accuracy: {verdict.deltas['accuracy']:+.2%}")
else:
    print("✗ Model BLOCKED")
    print(f"  Regression in: {verdict.failed_checks}")

Deploy fine-tuned models with zero downtime and automatic rollback.
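A gradual traffic ramp with automatic rollback reduces to a simple loop: raise the new model's traffic share step by step, and revert if measured quality falls too far below the baseline. The sketch below assumes the quality drop is relative to the baseline and leaves out the waiting period between steps; `run_canary` and `observe_quality` are hypothetical names, not Continuum functions.

```python
def run_canary(ramp_schedule, baseline_quality, observe_quality, max_relative_drop):
    """Walk the ramp schedule; return (final_traffic_share, status).

    observe_quality(share) is a caller-supplied function that returns the
    candidate model's measured quality while serving `share` of traffic.
    """
    if not ramp_schedule:
        raise ValueError("ramp_schedule must not be empty")
    for share in ramp_schedule:
        quality = observe_quality(share)
        drop = (baseline_quality - quality) / baseline_quality
        if drop > max_relative_drop:
            # Send all traffic back to the previous model.
            return 0.0, f"rolled back at {share:.0%}"
    return ramp_schedule[-1], "fully deployed"


schedule = [0.05, 0.25, 0.50, 1.00]

# Candidate that holds quality at every step.
healthy = run_canary(schedule, 0.90, lambda share: 0.90, 0.05)

# Candidate whose quality slips once it serves half the traffic.
degraded = run_canary(
    schedule, 0.90, lambda share: 0.90 if share < 0.5 else 0.80, 0.05
)
```

The healthy candidate reaches 100% of traffic, while the degraded one is rolled back at the 50% step because its quality falls about 11% below baseline.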
from continuum import ModelDeployer

deployer = ModelDeployer(serving_backend="vllm")  # or "ollama", "tgi", "sagemaker"
deployment = await deployer.deploy(
    model=result.model,
    strategy="canary",
    initial_traffic=0.05,                    # Start with 5% of traffic
    ramp_schedule=[0.05, 0.25, 0.50, 1.00],  # Gradual ramp
    ramp_interval_hours=2,
    rollback_on_quality_drop=0.05,           # Auto-rollback if quality drops 5%
)
print(f"Deployment ID: {deployment.id}")
print(f"Canary at: {deployment.current_traffic_split:.0%}")

Track the learning curve of your AI system over time. See exactly which training iterations produced quality gains and why.
from continuum import LearningCurve

curve = await LearningCurve.compute(
    task="customer_support",
    models=["base", "v1", "v2", "v3"],
    metrics=["helpfulness", "accuracy", "cost_per_query"],
)
# Returns: model comparison table, improvement trajectories,
# cost-quality Pareto frontier, estimated future trajectory
print(curve.summary())

┌──────────────────────────────────────────────────────────────────────┐
│                        Production Application                        │
│    @capture decorator wraps LLM calls → zero-friction integration    │
└──────────────────────────┬───────────────────────────────────────────┘
                           │ interactions + feedback
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                       Signal Processing Layer                        │
│                                                                      │
│   ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐   │
│   │ Signal Capture   │  │ Feedback Ingest  │  │ Deduplication    │   │
│   │ (async queue)    │  │ (explicit/impl)  │  │ (SimHash)        │   │
│   └──────────────────┘  └──────────────────┘  └──────────────────┘   │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          Intelligence Layer                          │
│                                                                      │
│   ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐   │
│   │ LLM-as-Judge     │  │ Active Learning  │  │ Data Curation    │   │
│   │ Annotator        │  │ Selector         │  │ & Versioning     │   │
│   └──────────────────┘  └──────────────────┘  └──────────────────┘   │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            Training Layer                            │
│                                                                      │
│   ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐   │
│   │ LoRA Trainer     │  │ EWC Forgetting   │  │ Constitutional   │   │
│   │ (incremental)    │  │ Prevention       │  │ Constraint       │   │
│   └──────────────────┘  └──────────────────┘  └──────────────────┘   │
└──────────────────────────┬───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           Deployment Layer                           │
│                                                                      │
│   ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐   │
│   │ Regression       │  │ Blue-Green       │  │ Traffic          │   │
│   │ Guard            │  │ Deployer         │  │ Management       │   │
│   └──────────────────┘  └──────────────────┘  └──────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘
A company using Continuum in production for 3 months typically sees:
| Metric | Before | After 3 Months |
|---|---|---|
| Task success rate | 82% | 94% |
| Cost per query | $0.032 | $0.004 (switched to fine-tuned 7B) |
| Avg latency | 1,800ms | 210ms (smaller model, same quality) |
| Human escalation rate | 18% | 6% |
| Training cycle time | 3–4 weeks | 48 hours (automated) |
The economics: a fine-tuned 7B model running on a $2/hr GPU server handles traffic that previously cost $0.032/query with GPT-4. At 100k queries/day and the $0.004/query shown in the table, that's $3,200/day → $400/day, or roughly $1 million saved per year.
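The savings estimate can be reproduced from the table's per-query costs. The short calculation below takes those two figures and the 100k queries/day volume at face value; it adds no new data, and the variable names are only for illustration.

```python
QUERIES_PER_DAY = 100_000
COST_BEFORE_USD = 0.032  # per query, "Before" column of the table
COST_AFTER_USD = 0.004   # per query, "After 3 Months" column of the table

daily_before = QUERIES_PER_DAY * COST_BEFORE_USD
daily_after = QUERIES_PER_DAY * COST_AFTER_USD
annual_savings = (daily_before - daily_after) * 365

print(f"Before: ${daily_before:,.0f}/day")
print(f"After:  ${daily_after:,.0f}/day")
print(f"Saved:  ${annual_savings:,.0f}/year")
```

On these inputs the daily bill falls from $3,200 to $400, which comes to about $1.02 million per year.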
| Feature | Continuum | OpenAI Fine-Tuning | Axolotl | LlamaFactory |
|---|---|---|---|---|
| Production signal capture | ✅ | ❌ | ❌ | ❌ |
| Automatic annotation | ✅ | ❌ | ❌ | ❌ |
| Active learning curation | ✅ | ❌ | ❌ | ❌ |
| Forgetting prevention (EWC) | ✅ | ❌ | Partial | Partial |
| Regression guard | ✅ | ❌ | ❌ | ❌ |
| Blue-green deployment | ✅ | ❌ | ❌ | ❌ |
| Continuous flywheel | ✅ | ❌ | ❌ | ❌ |
| Model-agnostic | ✅ | ❌ | ✅ | ✅ |
| Production observability | ✅ | Partial | ❌ | ❌ |
| Open source | ✅ | ❌ | ✅ | ✅ |
pip install continuum-ai

git clone https://github.com/Hritikd/continuum.git
cd continuum
pip install -e ".[dev,training]"
docker-compose up -d

See GETTING_STARTED.md for a complete walkthrough.
import asyncio
from continuum import Continuum, ContinuumConfig

# 1. Initialize
continuum = Continuum(ContinuumConfig(
    task="customer_support",
    api_key="sk-...",
    auto_annotate=True,     # LLM-as-judge runs automatically
    auto_train_when=1000,   # Start training when 1000 examples collected
    auto_deploy_if_better=True,
))

# 2. Wrap your LLM call
@continuum.capture
async def handle_support(message: str) -> str:
    # Your existing LLM code unchanged
    return await my_llm_call(message)

# 3. That's it. The flywheel starts turning.
# - Every call is captured
# - After 1000 captures, auto-annotation begins
# - After annotation, active learning curation runs
# - Fine-tuning starts on curated dataset
# - Regression guard checks quality
# - New model deployed automatically if better
asyncio.run(handle_support("I need help with my bill"))

- Architecture: EWC, active learning, LoRA, proxy reward model
- Getting Started: End-to-end walkthrough
- API Reference: Full API documentation
- Deployment Guide: Production deployment with K8s
- Examples: Customer support, code generation, RAG improvement
The companies winning with AI aren't winning because they have better prompts. They're winning because they have flywheels — systems where every production interaction makes the next interaction better.
OpenAI has this internally. Anthropic has this. Google has this.
The rest of the world is re-training from scratch every quarter.
Continuum is the flywheel infrastructure for everyone else.
Built by engineers who got tired of throwing away production gold.