Rewind

Other tools show what happened. Rewind lets you fix it - without re-running.

Website  •  Why  •  Demo  •  Install  •  Quickstart  •  Guides  •  Roadmap

Single binary · zero dependencies · MIT licensed


Rewind is an open-source time-travel debugger for LLM-powered AI agents. Every observability tool - Langfuse, LangSmith, Helicone - shows you what happened. None of them let you change the past and observe a different future. Rewind does.

The Problem

AI agents are shipping to production - tool-calling chains with 10, 30, 50 LLM steps. When they fail, debugging is brutal:

  • You can't see what the model saw. What was in the context window at step 41? What got truncated?
  • You can't reproduce it. Re-run the agent and you get a different result. The LLM is non-deterministic.
  • You can't isolate the failure. Was it step 5 or step 2? You have to re-run all 50 steps ($$$, minutes) just to test a theory.
  • You can't prove your fix works. You changed the prompt - did it actually improve things, or just shift the problem?

Agent broke at step 30? Fix step 30 - not steps 1 through 29 again. Each re-run costs tokens, time, and a different answer.

The Solution

Rewind is Chrome DevTools for AI agents - fork at any failure, replay with the fix, prove it works.

| Capability | What it means |
| --- | --- |
| rewind fix | Agent broke? rewind fix latest - an LLM diagnoses the failure, suggests a fix (model swap, system prompt, temperature, retry), and optionally forks + replays with the patch to verify it works. One command from "broken" to "proven fix." Diagnosis works on all sessions; --apply requires proxy-recorded sessions. No other tool does this. |
| Fork & Replay | Branch the execution timeline at any step. Fix your code, run rewind replay --from 4. Steps 1-4 served from cache (0 tokens, 0ms). Only the fixed step re-runs live. |
| Prove the Fix | Score original vs. forked timelines with LLM-as-judge: rewind replay → rewind eval score → proof the fix works. Correctness, coherence, safety, and relevance scored automatically. |
| Import & Debug | Import production traces from Langfuse, Datadog, or any OTel backend (rewind import otel). Fork at the failure, replay locally, export the fix back. Debug production failures without re-running in production. |
| Record | A transparent proxy captures every LLM call. Streaming works in real time with zero added latency. Or one line of Python: rewind_agent.init(). |
| Inspect | See the exact context window at each step: every message, system prompt, and tool response the model saw. |
| Diff | Compare original and forked timelines. See exactly where they diverge and why. |
| Langfuse Import | See a broken trace in Langfuse? rewind import from-langfuse --trace <id> - import it, fork at the failure, replay with the fix. One command from "broken production trace" to "forked, fixed, verified." |
| Replay Savings | Every replay shows concrete ROI: tokens saved, estimated cost saved, time saved. Available from the CLI, Python SDK, and Web API. Know exactly how much each debug cycle is worth. |
| Session Sharing | rewind share latest generates a self-contained HTML file. Open it in any browser, share it via Slack or email. No install, no login, works offline. Like a Jupyter notebook export for debug sessions. |
| Instant Replay | Identical requests are served from cache at 0 tokens, 0ms latency. Run the same agent 10 times - only the first run hits the LLM. |
| Evaluation | Create datasets, run your agent against them, score with 7 evaluator types (exact match, contains, regex, JSON schema, tool use, custom, LLM-as-judge). CI-ready with --fail-below thresholds. |
| Regression Testing | Turn any session into a baseline. After code changes, check step types, models, tool calls, and token counts. 3-line GitHub Action. |
| Multi-Agent Tracing | Span-tree visualization groups LLM calls, tool invocations, and handoffs under their parent agent. Thread view for multi-turn conversations. |
| Snapshots | Capture your entire workspace. Restore it in one command if your agent breaks something. No git dependency. |

The only tool where debugging, tracing, and evals share the same data model. Fork a session, replay it, diff it, score it - all on the same timeline.

See It in Action

Debug your agent — without re-running it

Rewind CLI demo - trace, fork, diff, assert, share

# try it now - no API keys needed
rewind demo && rewind inspect latest

If Rewind is useful to you, a ⭐ helps others find it.

See what the model saw - find the bug in 5 seconds

⏪ Rewind - Session Trace

  Session: customer-service   Steps: 12   Tokens: 3,450
  Agents: supervisor, researcher, writer

  ▼ ✓ 🤖 supervisor (agent)                          1.2s
    ├ ✓ 🧠  gpt-4o  "Route to researcher"           320ms  156↓ 28↑
    ▼ ✓ 🤖 researcher (agent)                        2.1s
    │ ├ ✓ 🧠  gpt-4o  "Search for information"      890ms  312↓ 35↑
    │ ├ ✓ 🔧  web_search("Tokyo population")          45ms
    │ └ ✓ 🧠  gpt-4o  "Synthesize results"          650ms  280↓ 95↑
    ├ ✓ 🔀 handoff: researcher → writer
    ▼ ✗ 🤖 writer (agent)                            1.8s
    │ ├ ✓ 🧠  gpt-4o  "Draft article"              1200ms  450↓ 180↑
    │ └ ✗ 🧠  gpt-4o  "Polish final draft"          600ms  320↓ 120↑
    │     ERROR: Hallucination - used stale data
    └ ✓ 🧠  gpt-4o  "Final review"                   400ms  200↓ 45↑

The writer agent hallucinated at step 8 because the researcher used stale data. Without the span tree, you'd see a flat list of 12 steps with no agent boundaries.

Fix and replay - only re-run what changed

# fix your code, then replay from the fork point:
rewind replay latest --from 4
# Steps 1-3: cached instantly (0ms, 0 tokens)
# Steps 4-5: re-run live with corrected context
rewind diff latest main fixed
⏪ Rewind - Timeline Diff (main vs fixed, diverge at step 4)

  ═ Step  1  identical
  ═ Step  2  identical
  ═ Step  3  identical
  ≠ Step  4  [stale data]  →  [fresh data]
  ≠ Step  5  [error] 700tok   →  [success] 715tok

AI diagnosis - let the debugger debug

rewind fix demo - diagnose, fork, replay

rewind fix latest
⏪ Diagnosing session "research-agent-demo" (5 steps)...

  Failure: Step 5 — llmcall (gpt-4o) — error
  Error: HALLUCINATION: Agent used stale 2019 projection as current fact
  Root cause: The agent relied on outdated data due to a search API rate
              limit, leading to incorrect population figures.

  Suggested fix: retry_step
  Confidence:    high

  To apply this fix automatically:
    rewind fix latest --apply

One command: diagnose the failure, suggest a fix, and optionally verify it. --apply --command automates the entire loop:

rewind fix latest --apply --yes --command "python agent.py"
# Fork at step 4, replay with fix...
# Steps 1-4: cached (0 tokens, 0ms)
# ✓ Agent finished — 531 tokens saved, 1.2s saved

Skip the AI entirely and test your own theory:

rewind fix latest --hypothesis "swap_model:gpt-4o" --apply --yes --command "python agent.py"

Web dashboard — see everything your AI does

Rewind web dashboard - activity timeline, context window, diff view

# Install hooks for Claude Code / Cursor, then open the dashboard
rewind hooks install
rewind web

Activity timeline with swim lanes, context window viewer, visual diff, regression baselines — all in the browser. Plus in-browser fork/replay/diff/delete: branch any timeline from a step, copy a rewind replay --fork-id … command, diff a fork against its parent in one click, or hard-delete a fork with invariant checks. Works with Claude Code sessions, Cursor, or any agent recorded via the SDK. See docs/web-ui.md.

Evaluate before shipping - catch regressions in CI

result = rewind_agent.evaluate(
    dataset="booking-tests",
    target_fn=my_agent,
    evaluators=[
        exact_match,
        rewind_agent.llm_judge_evaluator(criteria="correctness"),
    ],
    fail_below=0.9,   # CI fails if score drops below 90%
)
# Score: 95.0%, Pass rate: 100% - ship it
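
Custom evaluators plug into the same evaluate() call (see the Custom Evaluators row in the Feature Guides below). A rough sketch of the shape - the @evaluator() decorator name comes from that guide, but the import path and the scoring-function signature shown here are assumptions; evaluation.md is authoritative:

from rewind_agent import evaluator   # import path is an assumption

@evaluator()
def cites_refund_policy(output: str, expected: str) -> float:
    # Hypothetical domain-specific check: full score only if the agent
    # quoted the refund policy somewhere in its answer.
    return 1.0 if "refund policy" in output.lower() else 0.0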

Install

pip (recommended - installs Python SDK + CLI):

pip install rewind-agent

Binary only (macOS / Linux):

curl -fsSL https://raw.githubusercontent.com/agentoptics/rewind/master/install.sh | sh

From source (requires Rust):

cargo install --git https://github.com/agentoptics/rewind rewind-cli

Quickstart

Direct mode - one line, no proxy:

import rewind_agent
import openai

rewind_agent.init()  # that's it - all LLM calls are now recorded

client = openai.OpenAI()
client.chat.completions.create(model="gpt-4o", messages=[...])
# rewind show latest → see the trace
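
The same init() call also covers the Anthropic SDK (the Recording guide notes that both the OpenAI and Anthropic clients are patched in-process). A minimal sketch, assuming ANTHROPIC_API_KEY is set; the model name is just an example:

import rewind_agent
import anthropic

rewind_agent.init()  # records Anthropic calls too - no other changes needed

client = anthropic.Anthropic()
client.messages.create(
    model="claude-sonnet-4-20250514",   # example model id
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize today's support tickets"}],
)
# rewind show latest → see the trace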

Proxy mode - works with any language:

# Terminal 1: Start the recording proxy
rewind record --name "my-agent" --upstream https://api.openai.com --replay

# Terminal 2: Point your agent at the proxy
export OPENAI_BASE_URL=http://127.0.0.1:8443/v1
python3 my_agent.py   # or node, go, rust - anything

# See what happened (trace view / interactive TUI)
rewind show latest
rewind inspect latest

If the proxy is unreachable, the SDK automatically falls back to direct recording mode. Your agent never stops working. See proxy-resilience.md.

Claude Code - observe sessions via plugin:

# Install the plugin (one-time)
claude marketplace add agentoptics --source github --repo agentoptics/rewind-plugin
claude plugin install agentoptics/rewind

# Start the dashboard
rewind web --port 4800
# Open http://127.0.0.1:4800 - sessions appear automatically

Or manually with the CLI: rewind hooks install

See the Getting Started guide for more options.

Feature Guides

| Feature | Description | Guide | Example |
| --- | --- | --- | --- |
| Recording | One line to start (init()), or a transparent HTTP proxy for any language. Monkey-patches the OpenAI + Anthropic SDKs in-process. Streaming pass-through with zero added latency. | recording.md | 05_direct_mode.py |
| rewind fix | Agent broke? rewind fix latest diagnoses the failure with an LLM, suggests a fix (model swap, system prompt, temperature, retry), and optionally forks + replays with the patch applied. --hypothesis lets you skip diagnosis and test your own theory. | fix.md | - |
| Replay from Failure | Agent fails at step 5? Fix your code, replay from step 4. Steps 1-4 served from cache (0 tokens, 0ms). Only the fixed step re-runs live. Diff the original vs. replayed timeline. | replay-and-forking.md | 06_replay_from_failure.py |
| Regression Testing | Turn any recorded session into a baseline. After code changes, check step types, models, tool calls, token counts, and error status. 3-line GitHub Action for CI. | regression-testing.md | 07_regression_testing.py |
| Evaluation | Create datasets of test cases, run your agent against them, score with built-in evaluators (exact match, contains, regex, JSON schema, tool use, LLM-as-judge), compare experiments side-by-side. CI-ready with --fail-below thresholds. | evaluation.md | 08_evaluation.py |
| LLM-as-Judge | Score agent outputs with an LLM on correctness, coherence, safety, relevance, and task completion. Score timelines, compare original vs. forks, prove fixes work. | evaluation.md | 13_llm_judge.py, 14_fork_and_score.py |
| Custom Evaluators | Define domain-specific scoring with @evaluator() - check keyword coverage, format compliance, or any custom logic. Plugs into the same experiment/comparison pipeline. | evaluation.md | 12_custom_evaluator.py |
| Snapshots | Checkpoint your entire workspace before an agent runs. If it breaks something, restore in one command. Compressed tar+gz in the blob store - no git required. | snapshots.md | 11_snapshots.sh |
| Web Dashboard | Browser-based session explorer with an activity timeline (swim-lane visualization), step list, context window viewer, visual timeline diff, multi-metric axis (duration/tokens/cost), and live recording via WebSocket. Everything embedded in the single binary. | web-ui.md | - |
| Multi-Agent Tracing | Hierarchical span tree and activity timeline for multi-agent workflows. Each agent gets its own swim lane with duration bars. Auto-captures agent boundaries, tool calls, and handoffs from the OpenAI Agents SDK. Manual @span() decorator for custom grouping. Thread view for multi-turn conversations. | multi-agent-tracing.md | - |
| Framework Integrations | Native support for the OpenAI Agents SDK and Pydantic AI (auto-detected on init()). Wrapper support for LangGraph and CrewAI. Any other framework works via the HTTP proxy. | framework-integrations.md | 09_pydantic_ai.py, 10_openai_agents_sdk.py |
| Claude Code Observation | Observe Claude Code sessions in real time via hooks. See every tool call (Read, Edit, Bash, Grep, Agent), user prompts, and session lifecycle. Token usage extracted from transcripts. One-command setup: rewind hooks install. | - | - |
| MCP Server | 26 tools for AI assistants (Claude Code, Cursor, Windsurf) to query recordings, view span trees, browse threads, diff timelines, create baselines, and run evals - all from your IDE. | mcp-server.md | - |
| OpenTelemetry Export | Export recorded sessions as OTel traces via OTLP to Langfuse, Datadog, Grafana Tempo, Jaeger, or any OTel-compatible backend. CLI, Python SDK, and Web API. Uses gen_ai.* semantic conventions. Privacy-first: message content requires explicit opt-in. | otel-export.md | - |
| OpenTelemetry Import | Import OTLP traces from any source into Rewind for time-travel debugging. Accepts protobuf or JSON via the HTTP API (POST /v1/traces), the CLI (rewind import otel), or the Python SDK. Imported sessions with content blobs are forkable and replayable - debug production failures locally. | otel-import.md | - |
| Langfuse Import | Fetch a trace from Langfuse by ID, convert it to OTLP, import it into Rewind. CLI: rewind import from-langfuse --trace <id>. Python: rewind_agent.import_from_langfuse(trace_id="..."). Supports Cloud and self-hosted. Zero dependencies (urllib only). | langfuse-import.md | - |
| Replay Savings | After fork-and-execute replays, shows tokens saved, estimated cost (model-aware price table), and time saved. Displayed in rewind show, the Python SDK (stderr), and the Web API (GET /api/sessions/{id}/savings). | replay-and-forking.md | - |
| Session Sharing | Export a session as a self-contained HTML file that works offline. Step tree, span tree, timeline diffs, scores - all in one portable file. rewind share latest for metadata only, --include-content for full LLM content. | - | - |
| SQL Query Explorer | Run ad-hoc SQL against the Rewind database: token usage by model, average step duration, sessions with errors, cost estimation. Read-only, safe to explore. | sql-queries.md | - |
| CLI Reference | Full command reference for all 29 CLI commands. | cli-reference.md | - |
| CLI Walkthrough | Every command run with real output - install, demo, fork, replay, diff, assert, eval, share, query, and more. Copy-paste examples with actual terminal output. | cli-walkthrough.md | - |

Compatibility

| Provider | Non-streaming | Streaming (SSE) |
| --- | --- | --- |
| OpenAI (GPT-4o, o1, etc.) | ✓ | ✓ |
| Anthropic (Claude) | ✓ | ✓ |
| AWS Bedrock | ✓ | - |
| Any OpenAI-compatible (Ollama, vLLM, LiteLLM) | ✓ | ✓ |

Agent frameworks:

| Level | Frameworks | What it means |
| --- | --- | --- |
| Native - auto-detected on init() | OpenAI Agents SDK, Pydantic AI | Zero config. Agent boundaries, tool calls, and handoffs captured automatically. |
| Wrapper - manual setup | LangGraph, CrewAI | Thin integration via wrap_langgraph() / wrap_crew(). CrewAI requires proxy mode. |
| Works via proxy | Any framework using OpenAI/Anthropic APIs | Point OPENAI_BASE_URL at the proxy. Works with Autogen, smolagents, custom code, any language. |
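
For the native tier, recording really is zero-config. A minimal sketch with Pydantic AI - the Pydantic AI calls follow that library's own documented API, rewind_agent.init() is the only Rewind-specific line, and an OPENAI_API_KEY is assumed to be set:

import rewind_agent
from pydantic_ai import Agent

rewind_agent.init()  # framework auto-detected; agent boundaries and tool calls recorded

agent = Agent("openai:gpt-4o")
result = agent.run_sync("What is the population of Tokyo?")
print(result)
# rewind inspect latest → the agent's steps grouped in the span tree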

Works With Your Observability Stack

Already using Langfuse, LangSmith, or Datadog? You don't have to choose. Rewind works alongside them:

| Direction | How | Use Case |
| --- | --- | --- |
| Import traces into Rewind | rewind import otel --file trace.pb, POST /v1/traces, or rewind import from-langfuse --trace <id> | Debug a production failure locally - fork, replay, fix |
| Export sessions to your backend | rewind export otel latest --endpoint <langfuse> | Send debugging sessions to the team dashboard |
| Dual-ship traces to both | Configure your agent's OTel exporter to send to both endpoints (see the sketch below) | Record locally + observe in production simultaneously |

Use your existing tool for production dashboards and alerting. Use Rewind when something breaks and you need to fix it, not just see it.
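
For the dual-ship row above, a sketch using the standard OpenTelemetry Python SDK: two OTLP exporters on one tracer provider, one pointed at Rewind's OTLP ingest path (POST /v1/traces - the host and port below are placeholders, not documented defaults) and one at your production backend (also a placeholder URL):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Local Rewind ingest (placeholder address - use wherever your Rewind web/API is listening)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://127.0.0.1:4800/v1/traces"))
)

# Production observability backend (placeholder URL)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel.example.com/v1/traces"))
)

trace.set_tracer_provider(provider)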

Roadmap

| Phase | Features | Status |
| --- | --- | --- |
| v0.1 | Record, inspect, fork, diff, TUI, streaming, Instant Replay, Snapshots, Python SDK, LangGraph + CrewAI | ✅ Shipped |
| v0.2 | Direct recording, fork-and-execute replay, regression testing, MCP server | ✅ Shipped |
| v0.3 | Web UI (flight recorder + live dashboard) | ✅ Shipped |
| v0.4 | Evaluation system - datasets, evaluators, experiments, comparison, CI | ✅ Shipped |
| v0.5 | Multi-agent tracing (spans, threads, span tree UI) | ✅ Shipped |
| v0.6 | Claude Code hooks integration, transcript token parsing, session observation | ✅ Shipped |
| v0.7 | OpenTelemetry export (CLI, Python SDK, Web API, Dashboard) | ✅ Shipped |
| v0.8 | LLM-as-judge evaluators, timeline scoring, rewind eval score command | ✅ Shipped |
| v0.9 | OTel trace ingestion - import OTLP traces, debug production failures locally | ✅ Shipped |
| v0.10 | Langfuse import, replay cost savings calculator, session sharing (HTML export) | ✅ Shipped |
| v0.11 | rewind fix - AI-powered diagnosis, proxy request rewriting, hypothesis testing | ✅ Shipped |
| v1.0 | Enterprise readiness | Planned |

Why "Rewind"?

Agent debugging today is where web debugging was before Chrome DevTools. You had alert() and console.log(). Then DevTools gave you breakpoints, time-travel debugging, and network inspection - and everything changed.

Rewind brings that same leap to AI agents.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

git clone https://github.com/agentoptics/rewind.git
cd rewind
cargo build          # build all crates
cargo run -- demo    # seed demo data
cargo run -- inspect latest   # open TUI

License

MIT License. See LICENSE for details.
