Merged
1 change: 1 addition & 0 deletions README.md
@@ -93,6 +93,7 @@ IgnitionRL

To author an environment from a blank TypeScript project, follow the first guide in [`docs/BUILD_YOUR_FIRST_ENVIRONMENT.md`](docs/BUILD_YOUR_FIRST_ENVIRONMENT.md).
To turn a stored learner checkpoint into an inference run and replay, follow [`docs/EXPORT_AND_REPLAY_TRAINED_POLICY.md`](docs/EXPORT_AND_REPLAY_TRAINED_POLICY.md).
To debug reward shaping with named terms and replay frames, follow [`docs/REWARD_DEBUGGING_GUIDE.md`](docs/REWARD_DEBUGGING_GUIDE.md).

After cloning and installing dependencies, generate a local project with traces, metrics and JSON exports:

212 changes: 212 additions & 0 deletions docs/REWARD_DEBUGGING_GUIDE.md
@@ -0,0 +1,212 @@
# Reward Debugging Guide

Reward debugging answers one question: did the agent receive the right signal
for the behavior you wanted?

IgnitionRL records named reward terms in every trace. The CLI can inspect those
terms directly today, and the Studio shell reads the same exported JSON
artifacts for replay and reward panels.

This guide uses `DroneTarget-v0` because it has a useful mix of shaping,
success, safety and time-cost terms:

- `progress`;
- `target_reached`;
- `collision`;
- `out_of_bounds`;
- `step_penalty`.

## 1. Create a Project With Failed and Improved Runs

Run the current DroneTarget demo:

```sh
bun run --cwd packages/cli start demo drone-target ./drone-target-demo.ignitionrl \
--seed 42 \
--random-episodes 2 \
--heuristic-episodes 2 \
--learner-episodes 2 \
--inference-episodes 2 \
--max-steps 12 \
--json
```

The demo creates:

- `drone-target-random`;
- `drone-target-heuristic`;
- `drone-target-linear-policy-search`;
- `drone-target-linear-policy-search-inference`.

Use `compare` to rank the runs by their best episode reward:

```sh
bun run --cwd packages/cli start compare ./drone-target-demo.ignitionrl \
--score-by summary.bestReward \
--json
```

For reward debugging, start with a bad run and a better run. In this demo,
`drone-target-random` is the failed baseline and
`drone-target-linear-policy-search` is the improved trained run.

## 2. Inspect Reward Terms

Inspect the failed run:

```sh
bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
drone-target-random \
--step 0 \
--export \
--json
```

Inspect the improved run:

```sh
bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
drone-target-linear-policy-search \
--step 0 \
--export \
--json
```

The payload includes:

- `termNames`: all reward terms found in the selected trace;
- `terms`: per-term total, min, max, active-step count and last value;
- `timeline`: every step with total reward, cumulative reward and term values;
- `selectedStep`: the requested step;
- `artifact`: the exported reward-debugger JSON path when `--export` is used.
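
As a rough illustration of what the per-term statistics mean, the sketch below derives total, min, max, active-step count and last value from a timeline. The `TimelineStep` and `TermSummary` shapes and the `summarizeTerms` helper are hypothetical and written only for this example; the real payload's field names may differ, and "active" is assumed here to mean "non-zero".

```typescript
// Hypothetical shapes for illustration; the exported payload may use other field names.
type TimelineStep = {
  step: number;
  terms: Record<string, number>; // named reward terms recorded for this step
};

type TermSummary = {
  total: number;
  min: number;
  max: number;
  activeSteps: number; // assumption: a step is "active" when the term is non-zero
  lastValue: number;
};

function summarizeTerms(timeline: TimelineStep[]): Record<string, TermSummary> {
  const summary: Record<string, TermSummary> = {};
  for (const { terms } of timeline) {
    for (const [name, value] of Object.entries(terms)) {
      const s = summary[name] ?? {
        total: 0,
        min: Infinity,
        max: -Infinity,
        activeSteps: 0,
        lastValue: 0,
      };
      s.total += value;
      s.min = Math.min(s.min, value);
      s.max = Math.max(s.max, value);
      if (value !== 0) s.activeSteps += 1;
      s.lastValue = value;
      summary[name] = s;
    }
  }
  return summary;
}

// Made-up values, not output from a real DroneTarget run.
const demoTimeline: TimelineStep[] = [
  { step: 0, terms: { progress: 0.5, step_penalty: -0.25, target_reached: 0 } },
  { step: 1, terms: { progress: 0.25, step_penalty: -0.25, target_reached: 0 } },
  { step: 2, terms: { progress: 0.25, step_penalty: -0.25, target_reached: 10 } },
];
const demoSummary = summarizeTerms(demoTimeline);
console.log(demoSummary);
```

Reading `terms` this way makes it easy to spot a term that never fires (`activeSteps` of 0) or one whose total swamps the others.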

A healthy improved DroneTarget run usually shows positive `progress` over many
steps, a single `target_reached` bonus and only the expected `step_penalty`.
A failed random run often shows negative `progress` and no `target_reached`
bonus.
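
The heuristic in the paragraph above can be turned into a quick check. This is only a sketch: `TermStat` and `classifyRun` are hypothetical names, the input is assumed to carry a `total` and an `activeSteps` count per term, and the rule encodes "usually", not a guarantee.

```typescript
// Hypothetical per-term statistics; only the two fields used by the check are modeled.
type TermStat = { total: number; activeSteps: number };

// Returns "healthy" when progress is net positive and the success bonus fired exactly once.
function classifyRun(terms: Record<string, TermStat>): "healthy" | "suspect" {
  const progress = terms["progress"];
  const reached = terms["target_reached"];
  const healthy =
    progress !== undefined &&
    progress.total > 0 &&
    reached !== undefined &&
    reached.activeSteps === 1;
  return healthy ? "healthy" : "suspect";
}

// Made-up numbers for an improved run and a failed random run.
const improvedVerdict = classifyRun({
  progress: { total: 2.5, activeSteps: 9 },
  target_reached: { total: 10, activeSteps: 1 },
});
const randomVerdict = classifyRun({
  progress: { total: -1.5, activeSteps: 12 },
  target_reached: { total: 0, activeSteps: 0 },
});
console.log(improvedVerdict, randomVerdict);
```

A "suspect" verdict is a prompt to open the replay, not proof of a reward bug.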

## 3. Pair Rewards With Replay Frames

Reward terms explain the numeric signal. Replay frames explain what the agent
was seeing and doing when it received that signal.

Open the first frame of the failed run:

```sh
bun run --cwd packages/cli start replay ./drone-target-demo.ignitionrl \
drone-target-random \
--frame 0 \
--json
```

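Open the final frame of the improved run. With `--max-steps 12`, frame 11 is the last frame only if the episode ran the full 12 steps; if it ended earlier, pass its last frame index instead: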

```sh
bun run --cwd packages/cli start replay ./drone-target-demo.ignitionrl \
drone-target-linear-policy-search \
--frame 11 \
--json
```

Look at these fields together:

- `selectedFrame.observation`;
- `selectedFrame.action`;
- `selectedFrame.rewardTerms`;
- `selectedFrame.reason`;
- `actionDistribution`;
- `observationDimensions`.

The pattern to look for is simple:

- if `progress` is negative, inspect whether the action moved away from the target;
- if `target_reached` is active, inspect whether the selected frame ended near the target;
- if `collision` or `out_of_bounds` is active, inspect the terminal frame and done reason;
- if `step_penalty` dominates, inspect whether the agent is taking too many steps without progress.
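
The checks in the list above can be sketched as a small triage helper. The `triageFrame` function is hypothetical and not part of the CLI. It assumes the frame's reward terms are available as a plain name-to-value map, and it treats "dominates" as "larger in magnitude than `progress` on this frame".

```typescript
type RewardTermValues = Record<string, number>;

// Maps one frame's reward terms to the follow-up inspections suggested by the guide.
function triageFrame(terms: RewardTermValues): string[] {
  const value = (name: string): number => terms[name] ?? 0;
  const hints: string[] = [];
  if (value("progress") < 0) {
    hints.push("progress is negative: check whether the action moved away from the target");
  }
  if (value("target_reached") !== 0) {
    hints.push("target_reached is active: check whether the frame ended near the target");
  }
  if (value("collision") !== 0 || value("out_of_bounds") !== 0) {
    hints.push("terminal penalty is active: inspect the terminal frame and done reason");
  }
  if (Math.abs(value("step_penalty")) > Math.abs(value("progress"))) {
    hints.push("step_penalty dominates: check for steps taken without progress");
  }
  return hints;
}

// Made-up terminal frame of a failed run that left the arena.
const failedFrameHints = triageFrame({
  progress: -0.5,
  step_penalty: -0.25,
  collision: 0,
  out_of_bounds: -5,
  target_reached: 0,
});
console.log(failedFrameHints);
```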

## 4. Compare the Failed and Improved Runs

Use `compare` for the summary view:

```sh
bun run --cwd packages/cli start compare ./drone-target-demo.ignitionrl \
--score-by summary.bestReward \
--json
```

Then use `rewards` for attribution:

```sh
bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
drone-target-random \
--json

bun run --cwd packages/cli start rewards ./drone-target-demo.ignitionrl \
drone-target-linear-policy-search \
--json
```

Good comparisons usually identify one of these:

- the improved run has higher cumulative `progress`;
- the improved run reaches `target_reached` while the failed run never does;
- the failed run spends reward budget on `collision`, `out_of_bounds` or accumulated step penalties;
- the action distribution changed from random-looking noise to a purposeful control pattern.
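
One way to get from two `rewards` payloads to the findings above is to subtract per-term totals and sort by the size of the change. The sketch below assumes the totals have already been pulled out into plain maps; `diffTermTotals` and `TermDelta` are hypothetical names, and the numbers are invented.

```typescript
type TermDelta = { term: string; baseline: number; improved: number; delta: number };

// Compares per-term totals of two runs; a term missing from one run counts as 0.
function diffTermTotals(
  baseline: Record<string, number>,
  improved: Record<string, number>,
): TermDelta[] {
  const names = Array.from(new Set(Object.keys(baseline).concat(Object.keys(improved))));
  const rows = names.map((term) => {
    const b = baseline[term] ?? 0;
    const i = improved[term] ?? 0;
    return { term, baseline: b, improved: i, delta: i - b };
  });
  // Largest absolute change first, so the main cause of the improvement is on top.
  return rows.sort((x, y) => Math.abs(y.delta) - Math.abs(x.delta));
}

// Made-up totals for a random baseline and a trained run.
const termDeltas = diffTermTotals(
  { progress: -1.5, target_reached: 0, out_of_bounds: -5, step_penalty: -3 },
  { progress: 2.5, target_reached: 10, step_penalty: -1.5 },
);
console.table(termDeltas);
```

In this invented example the top rows are `target_reached` and `out_of_bounds`, which would point at "reaches the target" and "stops leaving bounds" as the main differences.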

## 5. Common Reward Bugs

| Symptom | Likely bug | What to inspect |
| --- | --- | --- |
| Agent moves away from the target but reward is positive. | `progress` sign is inverted. | Compare `selectedFrame.observation` target-relative dimensions with `selectedFrame.rewardTerms.progress`. |
| Agent reaches the target but gets no bonus. | The target radius, the success condition and the `target_reached` condition disagree. | Inspect the terminal replay frame, `reason` and `target_reached.activeSteps`. |
| Training prefers crashing or leaving bounds. | Penalty is missing, too small or inactive. | Check `collision`, `out_of_bounds`, done `reason` and per-term totals. |
| Reward is dominated by time cost. | `step_penalty` is too large or progress reward is too weak. | Compare `step_penalty.total` with `progress.total` and episode `length`. |
| Reward terms appear only as one scalar. | Environment returned a raw scalar or unnamed aggregate instead of named terms. | Return `reward().add("term", value)` terms from the environment. |
| A term is always zero. | The condition is never true or the wrong state is used. | Inspect `activeSteps`, `lastValue` and whether the term uses `state` vs `nextState`. |
| Run succeeds in training but fails in inference. | Checkpoint policy, exploration settings or seed distribution differ. | Compare training run replay with checkpoint inference replay and reward terms. |
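
The first row of the table (inverted `progress` sign) lends itself to an automated check. The sketch below assumes you can derive the distance to the target before and after each step from the replay frames; the `ProgressSample` shape and `findSignMismatches` helper are hypothetical and not part of IgnitionRL.

```typescript
// Hypothetical per-step sample built from replay frames and reward terms.
type ProgressSample = {
  step: number;
  distanceBefore: number; // distance to target before the action
  distanceAfter: number; // distance to target after the action
  progress: number; // value of the `progress` reward term for this step
};

// Returns the steps where the sign of `progress` disagrees with the actual movement.
function findSignMismatches(samples: ProgressSample[], epsilon = 1e-9): number[] {
  return samples
    .filter((s) => {
      const moved = s.distanceBefore - s.distanceAfter; // > 0 means the agent got closer
      if (Math.abs(moved) <= epsilon || Math.abs(s.progress) <= epsilon) return false;
      return Math.sign(moved) !== Math.sign(s.progress);
    })
    .map((s) => s.step);
}

// Made-up samples: step 1 moves away from the target but is still rewarded.
const mismatchedSteps = findSignMismatches([
  { step: 0, distanceBefore: 5, distanceAfter: 4, progress: 1 },
  { step: 1, distanceBefore: 4, distanceAfter: 5, progress: 1 },
  { step: 2, distanceBefore: 5, distanceAfter: 5, progress: 0 },
]);
console.log(mismatchedSteps);
```

If most steps are flagged, the sign is probably inverted; if only a few are, the term may be reading `state` where it should read `nextState`.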

## 6. CLI Flow to Studio Flow

Each current CLI command produces an artifact that a planned Studio panel will read:

| CLI command | Current artifact | Future Studio panel |
| --- | --- | --- |
| `compare` | project report and run history rows | Experiment history |
| `replay` | `episode-replay` JSON | Replay timeline and frame inspector |
| `rewards --export` | `reward-debugger` JSON | Reward attribution panel |
| `run --export` | `studio-run-view` JSON | Selected run detail |
| `studio --export` | `studio-workspace-view` JSON | Workspace bootstrap |

Refresh the workspace after exporting reward debugger payloads:

```sh
bun run --cwd packages/cli start studio ./drone-target-demo.ignitionrl \
--run-id drone-target-linear-policy-search \
--score-by summary.bestReward \
--export \
--json
```

Refresh the selected run detail:

```sh
bun run --cwd packages/cli start run ./drone-target-demo.ignitionrl \
drone-target-linear-policy-search \
--export \
--json
```

The Studio should not recompute rewards to explain a run. It should read the
same trace, replay and reward-debugger payloads produced by the CLI.
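
A consumer that follows this rule only parses and reshapes the exported payload. The sketch below shows one way that might look; `parseRewardDebugger`, `termSeries` and the nested `terms` field are assumptions made for illustration, and only `termNames` and `timeline` are taken from the payload description earlier in this guide.

```typescript
// Assumed minimal shape; the real reward-debugger JSON carries more fields.
type RewardDebuggerPayload = {
  termNames: string[];
  timeline: Array<{ terms: Record<string, number> }>;
};

// Validates the exported JSON text without recomputing any reward.
function parseRewardDebugger(jsonText: string): RewardDebuggerPayload {
  const data: unknown = JSON.parse(jsonText);
  if (typeof data !== "object" || data === null) {
    throw new Error("payload must be a JSON object");
  }
  const { termNames, timeline } = data as { termNames?: unknown; timeline?: unknown };
  if (!Array.isArray(termNames) || !termNames.every((n) => typeof n === "string")) {
    throw new Error("termNames must be an array of strings");
  }
  if (!Array.isArray(timeline)) {
    throw new Error("timeline must be an array");
  }
  return {
    termNames: termNames as string[],
    timeline: timeline as RewardDebuggerPayload["timeline"],
  };
}

// Extracts one term's per-step values for plotting; missing values are shown as 0.
function termSeries(payload: RewardDebuggerPayload, term: string): number[] {
  return payload.timeline.map((s) => s.terms?.[term] ?? 0);
}

// Stand-in for the contents of an exported file; values are made up.
const exportedText = JSON.stringify({
  termNames: ["progress", "step_penalty"],
  timeline: [
    { terms: { progress: 0.5, step_penalty: -0.25 } },
    { terms: { progress: 0.25, step_penalty: -0.25 } },
  ],
});
const progressSeries = termSeries(parseRewardDebugger(exportedText), "progress");
console.log(progressSeries);
```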

## 7. Reward Authoring Checklist

Before trusting a learner result, check that:

- every reward cause has a stable name;
- success bonus and done success condition use the same threshold;
- penalties use `nextState` when they depend on the result of the action;
- shaping terms are positive for desired movement and negative for undesired movement;
- the step penalty is large enough to discourage wandering but smaller in magnitude than the progress reward for a useful step;
- terminal penalties and bonuses are large enough to dominate incidental shaping;
- replay frames explain the sign and magnitude of the reward terms.
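
Several items on the checklist can be seen together in a small reward function. This is a sketch under stated assumptions, not the DroneTarget implementation: `DroneState`, `TARGET_RADIUS`, the bonus and penalty magnitudes and the plain object return value are all invented for illustration, and the IgnitionRL reward builder is deliberately not used here.

```typescript
// Hypothetical 2D state; the real environment state will differ.
type DroneState = { x: number; y: number; targetX: number; targetY: number; collided: boolean };

// One threshold shared by the success bonus and the done condition.
const TARGET_RADIUS = 0.5;

function distanceToTarget(s: DroneState): number {
  return Math.hypot(s.targetX - s.x, s.targetY - s.y);
}

function isSuccess(nextState: DroneState): boolean {
  return distanceToTarget(nextState) <= TARGET_RADIUS;
}

// Every reward cause gets a stable name; outcomes are read from nextState.
function rewardTerms(state: DroneState, nextState: DroneState): Record<string, number> {
  return {
    // Positive when the action reduced the distance, negative when it increased it.
    progress: distanceToTarget(state) - distanceToTarget(nextState),
    target_reached: isSuccess(nextState) ? 10 : 0,
    collision: nextState.collided ? -10 : 0,
    step_penalty: -0.01,
  };
}

// Made-up transition: the drone moves from (0, 0) to (3, 0) with the target at (3, 4).
const before: DroneState = { x: 0, y: 0, targetX: 3, targetY: 4, collided: false };
const after: DroneState = { ...before, x: 3 };
const stepTerms = rewardTerms(before, after);
console.log(stepTerms);
```

Because `isSuccess` is the only place the radius is checked, the bonus and the done condition cannot drift apart, and the terminal terms are an order of magnitude larger than a single step of shaping.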