IgnitionAI · salim4n · May 29, 2026
diff --git a/README.md b/README.md
@@ -93,6 +93,7 @@ IgnitionRL
 
 To author an environment from a blank TypeScript project, follow the first guide in [`docs/BUILD_YOUR_FIRST_ENVIRONMENT.md`](docs/BUILD_YOUR_FIRST_ENVIRONMENT.md).
 To turn a stored learner checkpoint into an inference run and replay, follow [`docs/EXPORT_AND_REPLAY_TRAINED_POLICY.md`](docs/EXPORT_AND_REPLAY_TRAINED_POLICY.md).
+To debug reward shaping with named terms and replay frames, follow [`docs/REWARD_DEBUGGING_GUIDE.md`](docs/REWARD_DEBUGGING_GUIDE.md).
 
 After cloning and installing dependencies, generate a local project with traces, metrics and JSON exports:
 

diff --git a/docs/REWARD_DEBUGGING_GUIDE.md b/docs/REWARD_DEBUGGING_GUIDE.md
@@ -0,0 +1,212 @@
+# Reward Debugging Guide
+
+Reward debugging answers one question: did the agent receive the right signal
+for the behavior you wanted?
+
+IgnitionRL records named reward terms in every trace. The CLI can inspect those
+terms directly today, and the Studio shell reads the same exported JSON
+artifacts for replay and reward panels.
+
+This guide uses `DroneTarget-v0` because it has a useful mix of shaping,
+success, safety and time-cost terms:
+
+- `progress`;
+- `target_reached`;
+- `collision`;
+- `out_of_bounds`;
+- `step_penalty`.
+
+## 1. Create a Project With Failed and Improved Runs
+
+Run the current DroneTarget demo:
+
+```sh
+bun run --cwd packages/cli start demo drone-target ./drone-target-demo.ignitionrl \
+  --seed 42 \
+  --random-episodes 2 \
+  --heuristic-episodes 2 \
+  --learner-episodes 2 \
+  --inference-episodes 2 \
+  --max-steps 12 \
+  --json
+```
+
+The demo creates:
+
+- `drone-target-random`;
+- `drone-target-heuristic`;
+- `drone-target-linear-policy-search`;
+- `drone-target-linear-policy-search-inference`.
+
+Use `compare` to rank the runs by best reward:
+
+```sh
+bun run --cwd packages/cli start compare ./drone-target-demo.ignitionrl \
+  --score-by summary.bestReward \
+  --json
+```
+
+For reward debugging, start with a bad run and a better run. In this demo,
+`drone-target-random` is the failed baseline and
+`drone-target-linear-policy-search` is the improved trained run.
+
+## 2. Inspect Reward Terms
+
+Inspect the failed run:
+
+```sh
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-random \
+  --step 0 \
+  --export \
+  --json
+```
+
+Inspect the improved run:
+
+```sh
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --step 0 \
+  --export \
+  --json
+```
+
+The payload includes:
+
+- `termNames`: all reward terms found in the selected trace;
+- `terms`: per-term total, min, max, active-step count and last value;
+- `timeline`: every step with total reward, cumulative reward and term values;
+- `selectedStep`: the requested step;
+- `artifact`: the exported reward-debugger JSON path when `--export` is used.
+
+A healthy improved DroneTarget run usually shows positive `progress` over many
+steps, a single `target_reached` bonus and only the expected `step_penalty`.
+A failed random run often shows negative `progress` and no `target_reached`
+bonus.
+
+## 3. Pair Rewards With Replay Frames
+
+Reward terms explain the numeric signal. Replay frames explain what the agent
+was seeing and doing when it received that signal.
+
+Open the first frame of the failed run:
+
+```sh
+bun run --cwd packages/cli start replay ./drone-target-demo.ignitionrl \
+  drone-target-random \
+  --frame 0 \
+  --json
+```
+
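+Open the final frame of the improved run. Frame indices start at 0, so `--frame 11` is the last frame only when the episode ran the full 12 steps; use a lower index if it ended earlier: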
+
+```sh
+bun run --cwd packages/cli start replay ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --frame 11 \
+  --json
+```
+
+Look at these fields together:
+
+- `selectedFrame.observation`;
+- `selectedFrame.action`;
+- `selectedFrame.rewardTerms`;
+- `selectedFrame.reason`;
+- `actionDistribution`;
+- `observationDimensions`.
+
+The pattern to look for is simple:
+
+- if `progress` is negative, inspect whether the action moved away from the target;
+- if `target_reached` is active, inspect whether the agent is near the target in the selected frame;
+- if `collision` or `out_of_bounds` is active, inspect the terminal frame and done reason;
+- if `step_penalty` dominates, inspect whether the agent is taking too many steps without progress.
+
+## 4. Compare the Failed and Improved Runs
+
+Use `compare` for the summary view:
+
+```sh
+bun run --cwd packages/cli start compare ./drone-target-demo.ignitionrl \
+  --score-by summary.bestReward \
+  --json
+```
+
+Then use `rewards` for attribution:
+
+```sh
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-random \
+  --json
+
+bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --json
+```
+
+A good comparison usually identifies at least one of these differences:
+
+- the improved run has higher cumulative `progress`;
+- the improved run reaches `target_reached` while the failed run never does;
+- the failed run loses reward to `collision`, `out_of_bounds` or accumulated step penalties;
+- the action distribution changed from random-looking noise to a purposeful control pattern.
+
+## 5. Common Reward Bugs
+
+| Symptom | Likely bug | What to inspect |
+| --- | --- | --- |
+| Agent moves away from the target but reward is positive. | `progress` sign is inverted. | Compare `selectedFrame.observation` target-relative dimensions with `selectedFrame.rewardTerms.progress`. |
+| Agent reaches the target but gets no bonus. | Target radius, success condition and `target_reached` condition do not agree. | Inspect the terminal replay frame, `reason` and `target_reached.activeSteps`. |
+| Training prefers crashing or leaving bounds. | Penalty is missing, too small or inactive. | Check `collision`, `out_of_bounds`, done `reason` and per-term totals. |
+| Reward is dominated by time cost. | `step_penalty` is too large or progress reward is too weak. | Compare `step_penalty.total` with `progress.total` and episode `length`. |
+| Reward terms appear only as one scalar. | Environment returned a raw scalar or unnamed aggregate instead of named terms. | Return `reward().add("term", value)` terms from the environment. |
+| A term is always zero. | The condition is never true or the wrong state is used. | Inspect `activeSteps`, `lastValue` and whether the term uses `state` vs `nextState`. |
+| Run succeeds in training but fails in inference. | Checkpoint policy, exploration settings or seed distribution differ. | Compare training run replay with checkpoint inference replay and reward terms. |
+
+## 6. CLI Flow to Studio Flow
+
+The current CLI commands map directly to planned Studio panels:
+
+| CLI command | Current artifact | Future Studio panel |
+| --- | --- | --- |
+| `compare` | project report and run history rows | Experiment history |
+| `replay` | `episode-replay` JSON | Replay timeline and frame inspector |
+| `rewards --export` | `reward-debugger` JSON | Reward attribution panel |
+| `run --export` | `studio-run-view` JSON | Selected run detail |
+| `studio --export` | `studio-workspace-view` JSON | Workspace bootstrap |
+
+Refresh the workspace after exporting reward-debugger payloads:
+
+```sh
+bun run --cwd packages/cli start studio ./drone-target-demo.ignitionrl \
+  --run-id drone-target-linear-policy-search \
+  --score-by summary.bestReward \
+  --export \
+  --json
+```
+
+Refresh the selected run detail:
+
+```sh
+bun run --cwd packages/cli start run ./drone-target-demo.ignitionrl \
+  drone-target-linear-policy-search \
+  --export \
+  --json
+```
+
+The Studio should not recompute rewards to explain a run. It should read the
+same trace, replay and reward-debugger payloads produced by the CLI.
+
+## 7. Reward Authoring Checklist
+
+Before trusting a learner result, check that:
+
+- every reward cause has a stable name;
+- success bonus and done success condition use the same threshold;
+- penalties use `nextState` when they depend on the result of the action;
+- shaping terms are positive for desired movement and negative for undesired movement;
+- `step_penalty` is large enough to discourage wandering, but its episode total stays smaller than the achievable `progress` reward;
+- terminal penalties and bonuses are large enough to dominate incidental shaping;
+- replay frames explain the sign and magnitude of the reward terms.
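
Some of the checklist items above can be checked mechanically from per-term totals. The following is a minimal sketch, not part of IgnitionRL: `TermSummary`, `TermSummaries` and `checkRewardTerms` are hypothetical names, the field names `total`, `activeSteps` and `lastValue` follow the ones mentioned in the guide, and the assumption that `terms` is an object keyed by term name may not match the real `reward-debugger` payload. The example numbers are made up for illustration.

```typescript
// Hypothetical per-term summary. Field names follow the guide's wording;
// the actual reward-debugger payload shape is an assumption here.
interface TermSummary {
  total: number;
  activeSteps: number;
  lastValue: number;
}

type TermSummaries = Record<string, TermSummary>;

// Returns one warning per checklist item that the per-term totals violate.
function checkRewardTerms(terms: TermSummaries): string[] {
  const warnings: string[] = [];
  const total = (name: string): number => terms[name]?.total ?? 0;
  const active = (name: string): number => terms[name]?.activeSteps ?? 0;

  // Time cost should stay smaller than useful progress.
  if (Math.abs(total("step_penalty")) > Math.abs(total("progress"))) {
    warnings.push("step_penalty outweighs progress");
  }

  // A terminal bonus should dominate incidental shaping.
  if (
    active("target_reached") > 0 &&
    Math.abs(total("target_reached")) <= Math.abs(total("progress"))
  ) {
    warnings.push("target_reached does not dominate progress shaping");
  }

  // An active penalty term should contribute a negative total.
  for (const name of ["collision", "out_of_bounds"]) {
    if (active(name) > 0 && total(name) >= 0) {
      warnings.push(`${name} is active but not negative`);
    }
  }

  return warnings;
}

// Illustrative values only, not output from a real run.
const example: TermSummaries = {
  progress: { total: 1.5, activeSteps: 10, lastValue: 0.1 },
  target_reached: { total: 1.0, activeSteps: 1, lastValue: 1.0 },
  collision: { total: 0, activeSteps: 0, lastValue: 0 },
  out_of_bounds: { total: 0, activeSteps: 0, lastValue: 0 },
  step_penalty: { total: -2.4, activeSteps: 12, lastValue: -0.2 },
};

const warnings = checkRewardTerms(example);
console.log(warnings);
```

With these made-up numbers the sketch flags two items: the step penalty outweighs progress, and the success bonus is too small to dominate shaping. Checks that depend on `state` versus `nextState`, or on what the agent was doing, still need the replay frames.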