diff --git a/README.md b/README.md
index 99819f1..e7c028f 100644
--- a/README.md
+++ b/README.md
@@ -91,6 +91,8 @@ IgnitionRL
 
 ## Headless Demo
 
+To author an environment from a blank TypeScript project, follow the step-by-step guide in [`docs/BUILD_YOUR_FIRST_ENVIRONMENT.md`](docs/BUILD_YOUR_FIRST_ENVIRONMENT.md).
+
 After cloning and installing dependencies, generate a local project with traces, metrics and JSON exports:
 
 ```sh
diff --git a/docs/BUILD_YOUR_FIRST_ENVIRONMENT.md b/docs/BUILD_YOUR_FIRST_ENVIRONMENT.md
new file mode 100644
index 0000000..1c23d24
--- /dev/null
+++ b/docs/BUILD_YOUR_FIRST_ENVIRONMENT.md
@@ -0,0 +1,317 @@
+# Build Your First Environment
+
+This guide starts from an empty TypeScript project and ends with a working
+IgnitionRL environment that can reset, step, emit observations, validate
+actions, explain rewards and produce a replay trace.
+
+The example is a small `Target2D-v0` task: an agent starts at the origin and
+must move toward a seeded target.
+
+## 1. Create a Blank Project
+
+```sh
+mkdir target-2d-env
+cd target-2d-env
+bun init -y
+bun add @ignitionrl/core
+mkdir src
+```
+
+Inside this repository, the same APIs are available from
+`packages/core/src/index.ts`; in a user project, import from
+`@ignitionrl/core`.
+
+## 2. Define State and Helpers
+
+Create `src/target-2d.ts`.
+
+```ts
+import { defineEnvironment, reward } from "@ignitionrl/core";
+
+type Vec2 = {
+  x: number;
+  y: number;
+};
+
+export type Target2DState = {
+  agent: Vec2;
+  target: Vec2;
+  previousDistance: number;
+  steps: number;
+};
+
+const ACTIONS = ["up", "down", "left", "right"] as const;
+
+export type Target2DAction = typeof ACTIONS[number];
+
+const STEP_SIZE = 0.25;
+const TARGET_RADIUS = 0.5;
+const MAX_STEPS = 120;
+
+function distance(a: Vec2, b: Vec2): number {
+  return Math.hypot(a.x - b.x, a.y - b.y);
+}
+```
+
+The state can be any serializable shape that your environment owns. Keep the
+state explicit: observations should be derived from it, rewards should be
+computed from it and `step()` should return the next state without mutating the
+previous one.
+
+## 3. Add Reset and Seeding
+
+In IgnitionRL, reset logic lives in `createInitialState()`. The runner calls it
+when it is created and every time you call `runner.reset()`.
+
+```ts
+export const Target2D = defineEnvironment({
+  id: "Target2D-v0",
+  metadata: {
+    description: "A compact 2D target-reaching tutorial environment.",
+    maxSteps: MAX_STEPS,
+    tags: ["tutorial", "2d", "discrete"],
+    observationLabels: [
+      "agent.x",
+      "agent.y",
+      "target.dx",
+      "target.dy",
+      "previousDistance",
+    ],
+    actionLabels: ACTIONS,
+  },
+
+  createInitialState: ({ rng }): Target2DState => {
+    const agent = { x: 0, y: 0 };
+    const target = {
+      x: rng.float(-5, 5),
+      y: rng.float(-5, 5),
+    };
+
+    return {
+      agent,
+      target,
+      previousDistance: distance(agent, target),
+      steps: 0,
+    };
+  },
+```
+
+Use the provided `rng` instead of `Math.random()`. A runner created with the
+same seed and stepped with the same action sequence should produce the same
+observations, rewards and done results.
+
+## 4. Return Observations
+
+Observations are numeric vectors. Arrays and numeric typed arrays are accepted.
+
+```ts
+  observations: ({ state }) => [
+    state.agent.x,
+    state.agent.y,
+    state.target.x - state.agent.x,
+    state.target.y - state.agent.y,
+    state.previousDistance,
+  ],
+```
+
+Keep the observation order stable. Learners treat dimension `0`, dimension `1`
+and so on as separate signals, so reordering values between runs changes the
+meaning of a checkpoint.
+
+## 5. Declare Actions
+
+This guide uses a discrete action space with literal values.
+
+```ts
+  actions: {
+    type: "discrete",
+    values: ACTIONS,
+    labels: ACTIONS,
+  },
+```
+
+Other supported action spaces are:
+
+```ts
+// Discrete index action: action is 0, 1, 2, ...
+{ type: "discrete", n: 4, labels: ["up", "down", "left", "right"] }
+
+// Continuous action: action is a number array whose flattened size matches shape (here 4).
+{ type: "continuous", shape: [4], low: -1, high: 1, labels: ["throttle", "yaw", "pitch", "roll"] }
+
+// Multi-discrete action: action is a number array with one integer per branch.
+{ type: "multi-discrete", nvec: [3, 2], labels: ["move", "fire"] }
+```
+
+## 6. Step the Environment
+
+`step()` receives the current state and a validated action. Return the next
+state.
+
+```ts
+  step: ({ state, action }) => {
+    const next: Target2DState = structuredClone(state);
+
+    next.previousDistance = distance(state.agent, state.target);
+
+    if (action === "up") next.agent.y += STEP_SIZE;
+    if (action === "down") next.agent.y -= STEP_SIZE;
+    if (action === "left") next.agent.x -= STEP_SIZE;
+    if (action === "right") next.agent.x += STEP_SIZE;
+
+    next.steps += 1;
+
+    return next;
+  },
+```
+
+Avoid mutating `state` in place. Traces, reward debugging and comparisons are
+easier to reason about when `state` and `nextState` are distinct snapshots.
+
+## 7. Name Reward Terms
+
+The learner sees a single scalar total, but named terms make rewards easier to
+debug. Use one term for each reason the agent gains or loses reward.
+
+```ts
+  reward: ({ state, nextState }) => {
+    const previousDistance = distance(state.agent, state.target);
+    const nextDistance = distance(nextState.agent, nextState.target);
+
+    return reward()
+      .add("progress", previousDistance - nextDistance)
+      .add("target_reached", nextDistance < TARGET_RADIUS, 10)
+      .stepPenalty(-0.01);
+  },
+```
+
+The reward builder also supports penalties:
+
+```ts
+reward()
+  .add("progress", 0.1)
+  .penalty("collision", didCollide, 3)
+  .stepPenalty(-0.01)
+```
+
+## 8. End Episodes
+
+`done()` can return a boolean or a structured result. Prefer structured results
+when you know whether the episode succeeded or was truncated.
+
+```ts
+  done: ({ nextState }) => {
+    const nextDistance = distance(nextState.agent, nextState.target);
+
+    if (nextDistance < TARGET_RADIUS) {
+      return { done: true, reason: "target_reached", success: true };
+    }
+
+    if (nextState.steps >= MAX_STEPS) {
+      return { done: true, reason: "max_steps", truncated: true };
+    }
+
+    return false;
+  },
+});
+```
+
+The full `src/target-2d.ts` file now exports `Target2D`, a typed environment
+with reset, observation, action, step, reward and done behavior.
+
+## 9. Run a Smoke Script
+
+Create `src/smoke.ts`.
+
+```ts
+import { randomPolicy } from "@ignitionrl/core";
+import { Target2D } from "./target-2d.ts";
+
+const spec = Target2D.getSpec({ seed: "tutorial-spec" });
+
+console.log("Environment:", spec.id);
+console.log("Observation shape:", spec.observation.shape);
+console.log("Action spec:", spec.actions);
+
+const runner = Target2D.createRunner({
+  seed: "tutorial-run",
+  runId: "tutorial-target-2d",
+  maxSteps: 120,
+  collectTrace: true,
+});
+
+const reset = runner.reset({ seed: "tutorial-reset" });
+
+console.log("Initial observation:", reset.observation);
+
+for (const action of ["right", "up", "right", "up"] as const) {
+  const step = runner.step(action);
+
+  console.log({
+    t: step.t,
+    action: step.action,
+    reward: step.reward.total,
+    terms: step.reward.terms,
+    done: step.done,
+    reason: step.reason,
+  });
+
+  if (step.done) break;
+}
+
+const trace = await Target2D
+  .createRunner({ seed: "tutorial-random", maxSteps: 120, collectTrace: true })
+  .runEpisode(randomPolicy(Target2D.actions, { seed: "tutorial-policy" }), {
+    reset: false,
+    maxSteps: 120,
+  });
+
+console.log("Random baseline:", trace.summary);
+console.log("Trace steps:", trace.steps.length);
+```
+
+Run it:
+
+```sh
+bun src/smoke.ts
+```
+
+You should see:
+
+- an environment spec with a vector observation shape;
+- an action spec with four discrete values;
+- per-step rewards with named terms;
+- a replay trace summary from the random baseline episode.
+
+## 10. What IgnitionRL Validates
+
+IgnitionRL validates the contract at definition time and step time:
+
+- `defineEnvironment()` checks that `id`, the action spec and the required functions are valid.
+- `getSpec()` calls `createInitialState()` and `observations()` to infer observation shape.
+- `runner.step(action)` validates the action before calling your `step()`.
+- Every observation is normalized into a finite number array.
+- Every reward is normalized into `{ total, terms }`.
+- Every done result is normalized into `{ done, reason?, success?, truncated? }`.
+
+## Troubleshooting
+
+| Symptom | Likely cause | Fix |
+| --- | --- | --- |
+| `[IgnitionRL] observations() must return a numeric array.` | `observations()` returned an object, scalar, `undefined` or mixed structure. | Return `number[]`, `Float32Array` or another numeric typed array. |
+| `[IgnitionRL] observation[2] must be finite, got NaN.` | An observation dimension is `NaN` or `Infinity`. | Guard division by zero, missing state fields and invalid physics values. |
+| `[IgnitionRL] Invalid discrete action: jump.` | The policy returned a value not listed in `actions.values`. | Return one of the exact literal values, or use `env.sampleAction()` while debugging. |
+| `[IgnitionRL] Continuous action length must be 4, got 3.` | A continuous policy returned the wrong vector size. | Match the flattened size of `actions.shape`. |
+| `[IgnitionRL] continuous action[0] must be in [-1, 1]` | A continuous policy exceeded `low` or `high`. | Clamp action outputs or widen the action spec intentionally. |
+| `[IgnitionRL] reward() must return a RewardBuilder or RewardResult.` | `reward()` returned a raw number or a malformed object. | Return `reward().add(...)` or `{ total, terms }`. |
+| `[IgnitionRL] Reward term names must be non-empty.` | A reward term name is `""` or whitespace. | Name every reward cause, such as `progress`, `collision` or `step_penalty`. |
+| `[IgnitionRL] done() must return a boolean or DoneResult.` | `done()` returned `undefined`, a number or another shape. | Return `false`, `true` or `{ done: boolean, reason?: string }`. |
+| `[IgnitionRL] Episode is done. Call reset() before stepping again.` | Code called `runner.step()` after termination. | Call `runner.reset()` before continuing. |
+| Runs are not reproducible. | Randomness is coming from `Math.random()` or time-based state. | Use the provided `rng` in `createInitialState()` and policy code. |
+
+## Next Step
+
+Once the environment works under `@ignitionrl/core`, wire it into a local
+project through the SDK or one of the CLI templates. The CLI training command
+currently has first-class support for the built-in `GridWorld-v0`, `Target2D-v0`
+and `DroneTarget-v0` environments; custom project registration is still maturing.