Other tools show what happened. Rewind lets you fix it - without re-running.
Website • Why • Demo • Install • Quickstart • Guides • Roadmap
Single binary · zero dependencies · MIT licensed
Rewind is an open-source time-travel debugger for LLM-powered AI agents. Every observability tool - Langfuse, LangSmith, Helicone - shows you what happened. None of them let you change the past and observe a different future. Rewind does.
AI agents are shipping to production - tool-calling chains with 10, 30, 50 LLM steps. When they fail, debugging is brutal:
- You can't see what the model saw. What was in the context window at step 41? What got truncated?
- You can't reproduce it. Re-run the agent and you get a different result. The LLM is non-deterministic.
- You can't isolate the failure. Was it step 5 or step 2? You have to re-run all 50 steps ($$$, minutes) just to test a theory.
- You can't prove your fix works. You changed the prompt - did it actually improve things, or just shift the problem?
Agent broke at step 30? Fix step 30 - not steps 1 through 29 again. Each re-run costs tokens, time, and a different answer.
Rewind is Chrome DevTools for AI agents - fork at any failure, replay with the fix, prove it works.
| Capability | What it means |
|---|---|
| `rewind fix` | Agent broke? `rewind fix latest` — an LLM diagnoses the failure, suggests a fix (model swap, system prompt, temperature, retry), and optionally forks + replays with the patch to verify it works. One command from "broken" to "proven fix." Diagnosis works on all sessions; `--apply` requires proxy-recorded sessions. No other tool does this. |
| Fork & Replay | Branch the execution timeline at any step. Fix your code, run `rewind replay --from 4`. Steps 1-4 served from cache (0 tokens, 0ms). Only the fixed step re-runs live. |
| Prove the Fix | Score original vs. forked timelines with LLM-as-judge: `rewind replay` → `rewind eval score` → proof the fix works. Correctness, coherence, safety, relevance scored automatically. |
| Import & Debug | Import production traces from Langfuse, Datadog, or any OTel backend (`rewind import otel`). Fork at the failure, replay locally, export the fix back. Debug production failures without re-running in production. |
| Record | A transparent proxy captures every LLM call. Streaming works in real time - zero added latency. Or one-line Python SDK: `rewind_agent.init()`. |
| Inspect | See the exact context window at each step. Every message, system prompt, and tool response the model saw. |
| Diff | Compare original and forked timelines. See exactly where they diverge and why. |
| Langfuse Import | See a broken trace in Langfuse? `rewind import from-langfuse --trace <id>` - import it, fork at the failure, replay with the fix. One command from "broken production trace" to "forked, fixed, verified." |
| Replay Savings | Every replay shows concrete ROI: tokens saved, estimated cost saved, time saved. Available in the CLI, Python SDK, and Web API. Know exactly how much each debug cycle is worth. |
| Session Sharing | `rewind share latest` - generate a self-contained HTML file. Open it in any browser, share via Slack or email. No install, no login, works offline. Like a Jupyter notebook export for debug sessions. |
| Instant Replay | Identical requests are served from cache at 0 tokens, 0ms latency. Run the same agent 10 times - only the first run hits the LLM. |
| Evaluation | Create datasets, run your agent against them, score with 7 evaluator types (exact match, contains, regex, JSON schema, tool use, custom, LLM-as-judge). CI-ready with `--fail-below` thresholds. |
| Regression Testing | Turn any session into a baseline. After code changes, check step types, models, tool calls, and token counts. 3-line GitHub Action. |
| Multi-Agent Tracing | Span-tree visualization groups LLM calls, tool invocations, and handoffs under their parent agent. Thread view for multi-turn conversations. |
| Snapshots | Capture your entire workspace. Restore it in one command if your agent breaks something. No git dependency. |
The only tool where debugging, tracing, and evals share the same data model. Fork a session, replay it, diff it, score it - all on the same timeline.
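For example, the whole debug loop runs against a single recorded session (a sketch using only commands shown later in this README; `latest`, `main`, and `fixed` refer to the demo session and its forks):

```sh
rewind replay latest --from 4   # fork the timeline and replay from step 4
rewind diff latest main fixed   # diff the original fork against the fixed one
rewind fix latest               # LLM-diagnose the failure on the same session
rewind share latest             # export that session as a shareable HTML file
```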
```sh
# try it now - no API keys needed
rewind demo && rewind inspect latest
```

If Rewind is useful to you, a ⭐ helps others find it.
```
⏪ Rewind - Session Trace
Session: customer-service  Steps: 12  Tokens: 3,450
Agents: supervisor, researcher, writer

▼ ✓ 🤖 supervisor (agent) 1.2s
├ ✓ 🧠 gpt-4o "Route to researcher" 320ms 156↓ 28↑
▼ ✓ 🤖 researcher (agent) 2.1s
│ ├ ✓ 🧠 gpt-4o "Search for information" 890ms 312↓ 35↑
│ ├ ✓ 🔧 web_search("Tokyo population") 45ms
│ └ ✓ 🧠 gpt-4o "Synthesize results" 650ms 280↓ 95↑
├ ✓ 🔀 handoff: researcher → writer
▼ ✗ 🤖 writer (agent) 1.8s
│ ├ ✓ 🧠 gpt-4o "Draft article" 1200ms 450↓ 180↑
│ └ ✗ 🧠 gpt-4o "Polish final draft" 600ms 320↓ 120↑
│     ERROR: Hallucination - used stale data
└ ✓ 🧠 gpt-4o "Final review" 400ms 200↓ 45↑
```
The writer agent hallucinated at step 8 because the researcher used stale data. Without the span tree, you'd see a flat list of 12 steps with no agent boundaries.
```sh
# fix your code, then replay from the fork point:
rewind replay latest --from 4
# Steps 1-3: cached instantly (0ms, 0 tokens)
# Steps 4-5: re-run live with corrected context
```
```sh
rewind diff latest main fixed
```

```
⏪ Rewind - Timeline Diff (main vs fixed, diverge at step 4)
═ Step 1  identical
═ Step 2  identical
═ Step 3  identical
≠ Step 4  [stale data] → [fresh data]
≠ Step 5  [error] 700tok → [success] 715tok
```
```sh
rewind fix latest
```

```
⏪ Diagnosing session "research-agent-demo" (5 steps)...

Failure: Step 5 — llmcall (gpt-4o) — error
Error: HALLUCINATION: Agent used stale 2019 projection as current fact
Root cause: The agent relied on outdated data due to a search API rate
            limit, leading to incorrect population figures.

Suggested fix: retry_step
Confidence: high

To apply this fix automatically:
  rewind fix latest --apply
```
One command: diagnose the failure, suggest a fix, and optionally verify it. `--apply --command` automates the entire loop:

```sh
rewind fix latest --apply --yes --command "python agent.py"
# Fork at step 4, replay with fix...
# Steps 1-4: cached (0 tokens, 0ms)
# ✓ Agent finished — 531 tokens saved, 1.2s saved
```

Skip the AI entirely and test your own theory:

```sh
rewind fix latest --hypothesis "swap_model:gpt-4o" --apply --yes --command "python agent.py"
```

```sh
# Install hooks for Claude Code / Cursor, then open the dashboard
rewind hooks install
rewind web
```

Activity timeline with swim lanes, context window viewer, visual diff, regression baselines — all in the browser. Plus in-browser fork/replay/diff/delete: branch any timeline from a step, copy a `rewind replay --fork-id …` command, diff a fork against its parent in one click, or hard-delete a fork with invariant checks. Works with Claude Code sessions, Cursor, or any agent recorded via the SDK. See docs/web-ui.md.
```python
import rewind_agent
# exact_match is a built-in evaluator; my_agent is your agent function

result = rewind_agent.evaluate(
    dataset="booking-tests",
    target_fn=my_agent,
    evaluators=[
        exact_match,
        rewind_agent.llm_judge_evaluator(criteria="correctness"),
    ],
    fail_below=0.9,  # CI fails if score drops below 90%
)
# Score: 95.0%, Pass rate: 100% - ship it
```

pip (recommended - installs Python SDK + CLI):

```sh
pip install rewind-agent
```

Binary only (macOS / Linux):

```sh
curl -fsSL https://raw.githubusercontent.com/agentoptics/rewind/master/install.sh | sh
```

From source (requires Rust):

```sh
cargo install --git https://github.com/agentoptics/rewind rewind-cli
```

Direct mode - one line, no proxy:
```python
import rewind_agent
import openai

rewind_agent.init()  # that's it - all LLM calls are now recorded

client = openai.OpenAI()
client.chat.completions.create(model="gpt-4o", messages=[...])
# rewind show latest → see the trace
```

Proxy mode - works with any language:
```sh
# Terminal 1: Start the recording proxy
rewind record --name "my-agent" --upstream https://api.openai.com --replay

# Terminal 2: Point your agent at the proxy
export OPENAI_BASE_URL=http://127.0.0.1:8443/v1
python3 my_agent.py   # or node, go, rust - anything

# See what happened (trace view / interactive TUI)
rewind show latest
rewind inspect latest
```

If the proxy is unreachable, the SDK automatically falls back to direct recording mode. Your agent never stops working. See proxy-resilience.md.
Claude Code - observe sessions via plugin:
```sh
# Install the plugin (one-time)
claude marketplace add agentoptics --source github --repo agentoptics/rewind-plugin
claude plugin install agentoptics/rewind

# Start the dashboard
rewind web --port 4800
# Open http://127.0.0.1:4800 - sessions appear automatically
```

Or manually with the CLI: `rewind hooks install`
See the Getting Started guide for more options.
| Feature | Description | Guide | Example |
|---|---|---|---|
| Recording | One line to start (`init()`), or a transparent HTTP proxy for any language. Monkey-patches OpenAI + Anthropic SDKs in-process. Streaming pass-through with zero added latency. | recording.md | 05_direct_mode.py |
| `rewind fix` | Agent broke? `rewind fix latest` diagnoses the failure with an LLM, suggests a fix (model swap, system prompt, temperature, retry), and optionally forks + replays with the patch applied. `--hypothesis` lets you skip diagnosis and test your own theory. | fix.md | - |
| Replay from Failure | Agent fails at step 5? Fix your code, replay from step 4. Steps 1-4 served from cache (0 tokens, 0ms). Only the fixed step re-runs live. Diff the original vs. replayed timeline. | replay-and-forking.md | 06_replay_from_failure.py |
| Regression Testing | Turn any recorded session into a baseline. After code changes, check step types, models, tool calls, token counts, and error status. 3-line GitHub Action for CI. | regression-testing.md | 07_regression_testing.py |
| Evaluation | Create datasets of test cases, run your agent against them, score with built-in evaluators (exact match, contains, regex, JSON schema, tool use, LLM-as-judge), compare experiments side-by-side. CI-ready with `--fail-below` thresholds. | evaluation.md | 08_evaluation.py |
| LLM-as-Judge | Score agent outputs with an LLM on correctness, coherence, safety, relevance, and task completion. Score timelines, compare original vs. forks, prove fixes work. | evaluation.md | 13_llm_judge.py, 14_fork_and_score.py |
| Custom Evaluators | Define domain-specific scoring with `@evaluator()` - check keyword coverage, format compliance, or any custom logic. Plugs into the same experiment/comparison pipeline. | evaluation.md | 12_custom_evaluator.py |
| Snapshots | Checkpoint your entire workspace before an agent runs. If it breaks something, restore in one command. Compressed tar+gz in the blob store - no git required. | snapshots.md | 11_snapshots.sh |
| Web Dashboard | Browser-based session explorer with activity timeline (swim-lane visualization), step list, context window viewer, visual timeline diff, multi-metric axis (duration/tokens/cost), and live recording via WebSocket. Everything embedded in the single binary. | web-ui.md | - |
| Multi-Agent Tracing | Hierarchical span tree and activity timeline for multi-agent workflows. Each agent gets its own swim lane with duration bars. Auto-captures agent boundaries, tool calls, and handoffs from the OpenAI Agents SDK. Manual `@span()` decorator for custom grouping. Thread view for multi-turn conversations. | multi-agent-tracing.md | - |
| Framework Integrations | Native support for OpenAI Agents SDK and Pydantic AI (auto-detected on `init()`). Wrapper support for LangGraph and CrewAI. Any other framework works via the HTTP proxy. | framework-integrations.md | 09_pydantic_ai.py, 10_openai_agents_sdk.py |
| Claude Code Observation | Observe Claude Code sessions in real time via hooks. See every tool call (Read, Edit, Bash, Grep, Agent), user prompts, and session lifecycle. Token usage extracted from transcripts. One-command setup: `rewind hooks install`. | - | - |
| MCP Server | 26 tools for AI assistants (Claude Code, Cursor, Windsurf) to query recordings, view span trees, browse threads, diff timelines, create baselines, run evals - all from your IDE. | mcp-server.md | - |
| OpenTelemetry Export | Export recorded sessions as OTel traces via OTLP to Langfuse, Datadog, Grafana Tempo, Jaeger, or any OTel-compatible backend. CLI, Python SDK, and Web API. Uses `gen_ai.*` semantic conventions. Privacy-first: message content requires explicit opt-in. | otel-export.md | - |
| OpenTelemetry Import | Import OTLP traces from any source into Rewind for time-travel debugging. Accepts protobuf or JSON via HTTP API (`POST /v1/traces`), CLI (`rewind import otel`), or Python SDK. Imported sessions with content blobs are forkable and replayable - debug production failures locally. | otel-import.md | - |
| Langfuse Import | Fetch a trace from Langfuse by ID, convert to OTLP, import into Rewind. CLI: `rewind import from-langfuse --trace <id>`. Python: `rewind_agent.import_from_langfuse(trace_id="...")`. Supports Cloud and self-hosted. Zero dependencies (urllib only). | langfuse-import.md | - |
| Replay Savings | After fork-and-execute replays, shows tokens saved, estimated cost (model-aware price table), and time saved. Displayed in `rewind show`, the Python SDK (stderr), and the Web API (`GET /api/sessions/{id}/savings`). | replay-and-forking.md | - |
| Session Sharing | Export a session as a self-contained HTML file that works offline. Step tree, span tree, timeline diffs, scores - all in one portable file. `rewind share latest` for metadata-only, `--include-content` for full LLM content. | - | - |
| SQL Query Explorer | Run ad-hoc SQL against the Rewind database. Token usage by model, average step duration, sessions with errors, cost estimation - read-only, safe to explore. | sql-queries.md | - |
| CLI Reference | Full command reference for all 29 CLI commands. | cli-reference.md | - |
| CLI Walkthrough | Every command run with real output — install, demo, fork, replay, diff, assert, eval, share, query, and more. Copy-paste examples with actual terminal output. | cli-walkthrough.md | - |
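For instance, a token-usage query might look like this sketch (`rewind query` is the command listed in the CLI walkthrough; the table and column names here are assumptions - see sql-queries.md for the actual schema):

```sh
# Hypothetical schema: a "steps" table with model and token columns (see sql-queries.md)
rewind query "SELECT model, SUM(total_tokens) AS tokens FROM steps GROUP BY model ORDER BY tokens DESC"
```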
| Provider | Non-streaming | Streaming (SSE) |
|---|---|---|
| OpenAI (GPT-4o, o1, etc.) | ✅ | ✅ |
| Anthropic (Claude) | ✅ | ✅ |
| AWS Bedrock | ✅ | - |
| Any OpenAI-compatible (Ollama, vLLM, LiteLLM) | ✅ | ✅ |
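Anything speaking the OpenAI wire format can be recorded the same way. For example, pointing the proxy at a local Ollama server might look like this (a sketch following the `--upstream` pattern from the quickstart; the upstream URL assumes Ollama's default port):

```sh
# Terminal 1: record, forwarding to the local server (port 11434 is Ollama's default)
rewind record --name "local-agent" --upstream http://localhost:11434 --replay

# Terminal 2: point the agent at the proxy, as in the quickstart
export OPENAI_BASE_URL=http://127.0.0.1:8443/v1
python3 my_agent.py
```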
Agent frameworks:
| Level | Frameworks | What it means |
|---|---|---|
| Native - auto-detected on `init()` | OpenAI Agents SDK, Pydantic AI | Zero config. Agent boundaries, tool calls, and handoffs captured automatically. |
| Wrapper - manual setup | LangGraph, CrewAI | Thin integration via `wrap_langgraph()` / `wrap_crew()`. CrewAI requires proxy mode. |
| Works via proxy | Any framework using OpenAI/Anthropic APIs | Point `OPENAI_BASE_URL` at the proxy. Works with Autogen, smolagents, custom code, any language. |
Already using Langfuse, LangSmith, or Datadog? You don't have to choose. Rewind works alongside them:
| Direction | How | Use Case |
|---|---|---|
| Import traces into Rewind | `rewind import otel --file trace.pb`, `POST /v1/traces`, or `rewind import from-langfuse --trace <id>` | Debug a production failure locally - fork, replay, fix |
| Export sessions to your backend | `rewind export otel latest --endpoint <langfuse>` | Send debugging sessions to the team dashboard |
| Dual-ship traces to both | Configure your agent's OTel exporter to send to both endpoints | Record locally + observe in production simultaneously |
Use your existing tool for production dashboards and alerting. Use Rewind when something breaks and you need to fix it, not just see it.
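In practice the round trip can be as short as three commands (a sketch; placeholders as in the table above):

```sh
rewind import from-langfuse --trace <id>         # pull the failing production trace
rewind fix latest                                # diagnose and fix it locally
rewind export otel latest --endpoint <langfuse>  # ship the debugged session back
```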
| Phase | Features | Status |
|---|---|---|
| v0.1 | Record, inspect, fork, diff, TUI, streaming, Instant Replay, Snapshots, Python SDK, LangGraph + CrewAI | ✅ Shipped |
| v0.2 | Direct recording, fork-and-execute replay, regression testing, MCP server | ✅ Shipped |
| v0.3 | Web UI (flight recorder + live dashboard) | ✅ Shipped |
| v0.4 | Evaluation system - datasets, evaluators, experiments, comparison, CI | ✅ Shipped |
| v0.5 | Multi-agent tracing (spans, threads, span tree UI) | ✅ Shipped |
| v0.6 | Claude Code hooks integration, transcript token parsing, session observation | ✅ Shipped |
| v0.7 | OpenTelemetry export (CLI, Python SDK, Web API, Dashboard) | ✅ Shipped |
| v0.8 | LLM-as-judge evaluators, timeline scoring, `rewind eval score` command | ✅ Shipped |
| v0.9 | OTel trace ingestion - import OTLP traces, debug production failures locally | ✅ Shipped |
| v0.10 | Langfuse import, replay cost savings calculator, session sharing (HTML export) | ✅ Shipped |
| v0.11 | `rewind fix` - AI-powered diagnosis, proxy request rewriting, hypothesis testing | ✅ Shipped |
| v1.0 | Enterprise readiness | Planned |
Agent debugging today is where web debugging was before Chrome DevTools. You had `alert()` and `console.log()`. Then DevTools gave you breakpoints, time-travel debugging, and network inspection - and everything changed.
Rewind brings that same leap to AI agents.
We welcome contributions! See CONTRIBUTING.md for guidelines.
```sh
git clone https://github.com/agentoptics/rewind.git
cd rewind
cargo build                  # build all crates
cargo run -- demo            # seed demo data
cargo run -- inspect latest  # open TUI
```

MIT License. See LICENSE for details.


