Conversation
- Integrated Rolldown for hot module replacement (HMR) in `arkor dev`, allowing real-time updates to the training interface without page refresh. - Implemented `requestEarlyStop` and `replaceCallbacks` methods in the Trainer API to facilitate graceful stopping of training jobs and dynamic callback updates during execution. - Updated documentation to reflect new features and usage patterns for improved developer guidance. - Adjusted cleanup logic for better resource management during development sessions.
…l shutdown - Introduced `replaceCallbacks` method in the Trainer API to allow dynamic updates of lifecycle callbacks during training runs. - Enhanced signal handling for graceful early stopping, ensuring in-flight checkpoints are preserved during HMR rebuilds. - Added `registerCleanupHook` for streamlined resource management on process exit, improving cleanup logic across development commands. - Updated documentation to reflect new features and usage patterns for better developer guidance.
…internal callback swapping - Removed the public `replaceCallbacks` method from the Trainer interface to prevent exposure of the hot-swapping functionality. - Introduced an internal mechanism for callback swapping using a `Symbol.for`-keyed brand, allowing for dynamic updates during training runs without affecting the public API. - Updated signal handling to ensure seamless integration with the new callback swapping logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal callback management process.
…lement internal early-stop handling - Removed the public `requestEarlyStop` method from the Trainer interface to prevent exposure of the early-stop functionality. - Introduced an internal mechanism for early stopping using a `Symbol.for`-keyed brand, allowing for graceful stopping after the next checkpoint without affecting the public API. - Updated signal handling to ensure seamless integration with the new early-stop logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal early-stop management process.
…llback replacement - Updated the Trainer API to remove public exposure of `requestEarlyStop` and `replaceCallbacks` methods, enhancing encapsulation. - Implemented internal mechanisms for early stopping and callback swapping using `Symbol.for`-keyed brands, ensuring seamless integration during training runs. - Revised signal handling to improve the hot module replacement (HMR) experience and maintain clean resource management. - Updated documentation to reflect these changes and clarify the internal management processes for developers.
- Replaced esbuild with Rolldown for building the project, ensuring external dependencies are resolved correctly. - Implemented a cleanup hook system to manage resource disposal on process termination signals. - Enhanced the development experience with hot module replacement (HMR) capabilities, allowing for seamless updates during training runs. - Updated documentation to reflect changes in the development loop and HMR behavior. - Added tests for new signal handling and configuration hashing functionality.
|
Preview deployment for your docs. Learn more about Mintlify Previews.
💡 Tip: Enable Workflows to automatically generate PRs for you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8abc594369
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const hotSwapTargets = activeTrains.notifyCallbackReload(nextHash); | ||
| const restartTargets = | ||
| activeTrains.requestEarlyStopOnMismatch(nextHash); |
There was a problem hiding this comment.
Dispatch train signals only once per rebuild
notifyCallbackReload/requestEarlyStopOnMismatch run inside each /api/dev/events subscriber callback, so one rebuild sends signals once per connected SSE client (and again on reconnect when createHmrCoordinator.subscribe replays lastEvent). In the common case of two open Studio tabs, the same child can receive two SIGTERMs for one config-changing rebuild; the second signal triggers the runner’s forced-exit path (exit(143)) instead of checkpoint-preserving early stop. Move signal dispatch out of per-subscriber delivery (or dedupe per rebuild hash) so each child is signaled at most once per rebuild event.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR upgrades the local Studio dev experience by migrating the trainer bundling pipeline from esbuild to Rolldown, adding an HMR/SSE channel to propagate rebuild events to the Studio SPA, and introducing signal-based “hot-swap callbacks vs early-stop + restart” behavior for in-flight training subprocesses.
Changes:
- Replace esbuild-based trainer bundling with Rolldown (one-shot build + watch/HMR).
- Add
/api/dev/eventsSSE stream + SPA client wiring to refresh manifest and coordinate restarts/hot-swaps. - Introduce internal trainer inspection/callback replacement/early-stop hooks (via
Symbol.forbrands), plus cleanup utilities and updated docs/tests.
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pnpm-lock.yaml | Removes esbuild and adds rolldown to the lockfile graph. |
| packages/arkor/package.json | Swaps dependency from esbuild to rolldown. |
| packages/arkor/src/core/rolldownConfig.ts | Centralizes Rolldown input options + entry/outDir resolution and Node target derivation. |
| packages/arkor/src/cli/commands/build.ts | Implements arkor build via Rolldown bundling instead of esbuild. |
| packages/arkor/src/studio/hmr.ts | Adds lazy Rolldown watcher coordinator that emits ready/rebuild/error events. |
| packages/arkor/src/studio/hmr.test.ts | Tests HMR coordinator behavior (ready/rebuild/error/replay/dispose). |
| packages/arkor/src/studio/manifest.ts | Adds configHash to manifest summary and splits “summarise built manifest” vs “build + summarise”. |
| packages/arkor/src/studio/trainRegistry.ts | Adds per-child registry + policy for SIGUSR2 hot-swap vs SIGTERM early-stop. |
| packages/arkor/src/studio/trainRegistry.test.ts | Tests TrainRegistry signaling decisions and error tolerance. |
| packages/arkor/src/studio/server.ts | Adds /api/dev/events SSE route and integrates TrainRegistry into rebuild handling. |
| packages/arkor/src/studio/server.test.ts | Adds tests for /api/dev/events token rules, loopback guard, and SSE framing. |
| packages/arkor/src/core/configHash.ts | Adds stable hashing of JobConfig for “callbacks-only vs restart” decisions. |
| packages/arkor/src/core/configHash.test.ts | Tests determinism and sensitivity of hashJobConfig. |
| packages/arkor/src/core/trainerInspection.ts | Introduces internal inspection + callback replacement + early-stop brands via Symbol.for. |
| packages/arkor/src/core/trainer.ts | Implements callback hot-swap support + early-stop latch; attaches internal brands. |
| packages/arkor/src/core/trainer.test.ts | Adds coverage for early-stop and callback hot-swap behavior via internal brands. |
| packages/arkor/src/core/runnerSignals.ts | Adds signal handlers: SIGTERM graceful early-stop + SIGUSR2 callback reload. |
| packages/arkor/src/core/runnerSignals.test.ts | Tests SIGTERM/SIGUSR2 handler behaviors with branded trainers. |
| packages/arkor/src/core/runner.ts | Installs/removes new signal handlers around trainer.start()/trainer.wait(). |
| packages/arkor/src/core/runner.test.ts | Adds a test for SIGTERM early-stop behavior in the runner. |
| packages/arkor/src/cli/cleanupHooks.ts | Adds reusable cleanup-hook registration for exit/signals. |
| packages/arkor/src/cli/commands/dev.ts | Wires HMR coordinator into Studio server and refactors shutdown cleanup via cleanup hooks. |
| packages/arkor/src/cli/commands/start.test.ts | Updates test comment to refer to Rolldown instead of esbuild. |
| packages/studio-app/src/lib/api.ts | Adds typed DevEvent + openDevEvents() for SSE notifications. |
| packages/studio-app/src/components/RunTraining.tsx | SPA listens to dev SSE events, refreshes manifest, and coordinates restart/hot-swap UI state. |
| docs/concepts/studio.mdx | Documents HMR workflow, including hash-based hot-swap vs restart behavior (EN). |
| docs/ja/concepts/studio.mdx | Documents HMR workflow (JA). |
| AGENTS.md | Updates repository policy/docs for new SSE allow-list, HMR internals, and Rolldown build pipeline. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const [log, setLog] = useState(""); | ||
| const [manifest, setManifest] = useState<ManifestResult | null>(null); | ||
| const [hmrStatus, setHmrStatus] = useState< | ||
| "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped" |
| import { ensureProjectState } from "../core/projectState"; | ||
| import { readState } from "../core/state"; | ||
| import { readManifestSummary } from "./manifest"; | ||
| import { readManifestSummary, summariseBuiltManifest } from "./manifest"; |
| function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void { | ||
| // Registered before the studio-token cleanup so it runs first on | ||
| // shutdown — Node fires signal handlers in registration order, and we | ||
| // want the watcher to release file handles before the outermost | ||
| // process.exit. | ||
| registerCleanupHook({ cleanup: () => hmr.dispose() }); | ||
| } |
| requestEarlyStopOnMismatch( | ||
| nextConfigHash: string | null, | ||
| ): RestartTarget[] { | ||
| const targets: RestartTarget[] = []; | ||
| for (const [pid, entry] of this.entries) { | ||
| if ( | ||
| nextConfigHash === null || | ||
| entry.configHash === null || | ||
| entry.configHash !== nextConfigHash | ||
| ) { | ||
| try { | ||
| entry.child.kill("SIGTERM"); | ||
| } catch { | ||
| // child already exited; close handler will clean up. | ||
| } | ||
| targets.push({ pid, trainFile: entry.trainFile }); | ||
| } |
| await callbacks.onCheckpoint?.(ctx); | ||
| // Early-stop latch: a checkpoint just landed, so the in-flight work | ||
| // is durable. Cancel the cloud job and end `wait()` cleanly. | ||
| if (earlyStopRequested && earlyStopDeferred) { | ||
| await trainer.cancel(); | ||
| if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); | ||
| earlyStopDeferred.resolve(); | ||
| earlyStopDeferred = null; | ||
| return { terminal: true, artifacts: terminalResult?.artifacts ?? [] }; | ||
| } |
| function startWatcher(): void { | ||
| if (watcher || disposed) return; | ||
| if (!existsSync(resolved.entry)) { | ||
| broadcast({ | ||
| type: "error", | ||
| message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`, | ||
| }); | ||
| return; | ||
| } |
…ous BUNDLE_END on macOS - Modified the test to assert that the late subscriber receives the same event as the prior subscriber, addressing issues with spurious BUNDLE_END events in rolldown@1.0.0-rc.17 on macOS. - Added comments to clarify the reasoning behind the change and noted a future task to revisit the assertion after rolldown stabilizes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 31f510869c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| restartPendingRef.current = true; | ||
| setHmrStatus(runningRef.current ? "early-stopping" : "idle"); |
There was a problem hiding this comment.
Ignore restart hints when no local run is active
restartPendingRef is set for every payload.restart event even when this tab is not running a training stream. Since /api/dev/events broadcasts restart events to all connected tabs, a passive tab can cache restartPendingRef=true from another tab’s run; the next time the user runs training in this tab, finally will auto-trigger an extra run, creating an unintended duplicate job and extra cloud/GPU spend. Only mark restart pending when runningRef.current is true (or otherwise correlate the restart target to the local run).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 27 out of 28 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } catch { | ||
| // child already exited; close handler will clean up. | ||
| } | ||
| targets.push({ pid, trainFile: entry.trainFile }); |
| import { ensureProjectState } from "../core/projectState"; | ||
| import { readState } from "../core/state"; | ||
| import { readManifestSummary } from "./manifest"; | ||
| import { readManifestSummary, summariseBuiltManifest } from "./manifest"; |
| const [log, setLog] = useState(""); | ||
| const [manifest, setManifest] = useState<ManifestResult | null>(null); | ||
| const [hmrStatus, setHmrStatus] = useState< | ||
| "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped" |
| // Early-stop latch: a checkpoint just landed, so the in-flight work | ||
| // is durable. Cancel the cloud job and end `wait()` cleanly. | ||
| if (earlyStopRequested && earlyStopDeferred) { | ||
| await trainer.cancel(); | ||
| if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); | ||
| earlyStopDeferred.resolve(); | ||
| earlyStopDeferred = null; | ||
| return { terminal: true, artifacts: terminalResult?.artifacts ?? [] }; | ||
| } | ||
| return { terminal: false, artifacts: terminalResult?.artifacts ?? [] }; |
…ce failures - Added a test to verify that the SIGINT exit handler is armed even if the `persistStudioToken` function fails, preventing the dev server from idling indefinitely. - Updated the `runDev` function to register the studio-token cleanup hook unconditionally, ensuring it executes on process termination signals regardless of token persistence success. - Enhanced cleanup logic to maintain proper signal handling during development sessions, improving overall stability and user experience.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 55b4a2e8f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (payload.restart) { | ||
| // `/api/dev/events` is a broadcast — every open Studio tab gets | ||
| // this event. Only flip the auto-restart latch when *this* tab | ||
| // is actually running a stream right now; otherwise a passive | ||
| // tab would silently auto-spawn an extra job the next time the | ||
| // user clicks Run training, doubling cloud spend. | ||
| if (runningRef.current) { | ||
| // Training run is early-stopping; the active stream will |
There was a problem hiding this comment.
Scope restart latch to runs actually signaled for restart
The handler sets restartPendingRef for any payload.restart while this tab is running, but rebuild events are broadcast across all tabs and can contain mixed outcomes (restartTargets and hotSwapTargets) for different subprocesses. In a multi-run setup, a tab whose run was only hot-swapped can still latch restartPendingRef=true, then auto-spawn an extra training job when its current run eventually finishes, increasing unintended cloud/GPU usage.
Useful? React with 👍 / 👎.
| entry.child.kill("SIGTERM"); | ||
| entry.earlyStopRequested = true; |
There was a problem hiding this comment.
Check signal delivery before marking restart targets
ChildProcess.kill() returns a boolean and does not throw when a process is already gone, but this path treats any non-throw as success. If kill("SIGTERM") returns false during an exit race, the entry is still marked earlyStopRequested and added to targets, so the UI may wait for/restart a run that was never actually signaled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| useEffect(() => { | ||
| const es = openDevEvents(); | ||
| const onMessage = (raw: MessageEvent) => { |
| setHmrStatus("hot-swapped"); | ||
| window.setTimeout(() => { | ||
| setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s)); | ||
| }, 1500); | ||
| } else { |
| }, | ||
| cancel() { | ||
| activeTrains.unregister(child.pid); | ||
| child.kill(); |
| // Synchronous wrapper so signal handlers preserve "cleanup landed | ||
| // before this function returns" — important for sync cleanups (e.g. | ||
| // `unlinkSync`) and for tests that assert the side effect right after | ||
| // invoking the handler. Async cleanups are fire-and-forget with a | ||
| // catch so a hung dispose doesn't block exit. | ||
| const run = (): void => { | ||
| if (done) return; | ||
| done = true; | ||
| try { | ||
| const result = options.cleanup(); | ||
| if (result && typeof (result as Promise<void>).catch === "function") { | ||
| (result as Promise<void>).catch(() => { | ||
| // best-effort: shutdown is racing other cleanup paths | ||
| }); | ||
| } | ||
| } catch { | ||
| // best-effort | ||
| } | ||
| }; | ||
|
|
||
| process.on("exit", run); | ||
|
|
||
| for (const sig of TERMINATING_SIGNALS) { | ||
| process.on(sig, () => { | ||
| run(); | ||
| if (options.exitOnSignal) process.exit(0); |
Code Coverage OverviewLanguages: TypeScript TypeScript / code-coverage/arkorThe overall coverage in the branch is 97%. The coverage in the branch is 99%. Show a code coverage summary of the most impacted files.
TypeScript / code-coverage/create-arkorThe overall coverage remains at 56%, unchanged from the branch. TypeScript / code-coverage/cli-internalThe overall coverage remains at 97%, unchanged from the branch. TypeScript / code-coverage/studio-appThe overall coverage in the branch is 52%. The coverage in the branch is 53%. Show a code coverage summary of the most impacted files.
Updated |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa6a826f27
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const configHash: string | null = options.hmr | ||
| ? options.hmr.getCurrentConfigHash() | ||
| : null; |
There was a problem hiding this comment.
Route explicit train files with their own HMR hash
When /api/train is called with body.file, the spawned arkor start <file> rebuilds that explicit entry into .arkor/build/index.mjs, but this snapshot still records the HMR coordinator's current hash for the default src/arkor/index.ts. In HMR-enabled Studio, a run launched from an alternate file can therefore be classified for SIGUSR2/restart based on a different artifact, and the default watcher may overwrite the same index.mjs that the child reloads from; e.g. editing only the default entry can hot-swap callbacks into an alternate-entry run from the wrong bundle. Treat explicit files as unhashable/restart-only or build/watch a separate artifact for that entry.
Useful? React with 👍 / 👎.
| try { | ||
| await trainer.cancel(); | ||
| } catch { | ||
| // intentionally ignored; see comment above. | ||
| } |
There was a problem hiding this comment.
Propagate fallback cancel failures
For hand-rolled trainers that the runner accepts but that lack the internal early-stop brand, a SIGTERM/HMR stop falls back to trainer.cancel() here. If that cancel rejects (for example, a transient cloud/network failure), this catch resolves requestTrainerEarlyStop() successfully, so installShutdownHandlers exits with code 0 even though the remote job may still be running; this reintroduces the orphaned-job false-success path for the unbranded trainer shape.
Useful? React with 👍 / 👎.
| const manifest = await readManifestSummary(trainCwd, { | ||
| prebuiltOutFile: hmrOutFile, | ||
| }); |
There was a problem hiding this comment.
Avoid serving stale artifacts before HMR is ready
On HMR-enabled startup, this always passes the prebuilt path to readManifestSummary, which imports it whenever .arkor/build/index.mjs already exists. If that file is left over from a previous build and the watcher has not emitted its first ready yet, the first /api/manifest poll can show an old trainer/config; because arkor start also skips rebuilding when the artifact exists, a quick Run click can launch the stale artifact until the watcher catches up. Gate this fast path on the coordinator having a successful current build, not just file existence.
Useful? React with 👍 / 👎.
| if (process.platform !== "win32") { | ||
| const r = safeKill(entry.child, "SIGUSR2"); | ||
| if (r === "ok") { | ||
| hotSwapTargets.push(target); |
There was a problem hiding this comment.
Delay hot-swap signals until the child is ready
A same-hash rebuild can hit this branch immediately after /api/train registers a newly spawned child, before that child has imported runner.ts and installed its SIGUSR2 callback-reload handler. In that startup window on Unix, child.kill("SIGUSR2") can terminate the Node process with the default signal action while the server reports it as a hot-swap target, so the SPA does not schedule a restart and the user sees the run die instead of continue. Add a child-ready marker before allowing SIGUSR2, or route pre-ready children through restart.
Useful? React with 👍 / 👎.
| "@types/node": "^24", | ||
| "@vitest/coverage-v8": "^4.1.5", | ||
| "tsdown": "^0.22.0", | ||
| "typescript": "^5", | ||
| "vitest": "^4.1.5" |
There was a problem hiding this comment.
Keep published devDependencies catalog-bound
The repository's AGENTS.md dependency-catalog rule says the published packages' runtime dependencies are the only literal-semver carve-out and that their devDependencies stay catalog:-bound. Replacing these with literals here (and the same pattern in packages/create-arkor/package.json) breaks the single-source-of-truth catalog policy, so future bumps of TypeScript/Vitest/tsdown can drift between the published packages and the rest of the workspace instead of being one-line catalog updates.
Useful? React with 👍 / 👎.
… for process termination
- Introduced a NO_OP_DISPOSE function for consistent signal registration failure handling. - Enhanced shutdown signal handling to ensure accurate exit codes based on received signals. - Updated requestTrainerEarlyStop to handle errors more gracefully and ensure proper exit codes. - Improved callback reload handler to avoid crashing on unsupported platforms. - Refactored trainer test cases to utilize Response.json for consistency and clarity. - Updated various promises to use void for better readability and to avoid unhandled promise rejections. - Enhanced type safety and clarity in several areas, including the use of readonly arrays and improved error handling. - Cleaned up unnecessary comments and improved code readability across multiple files.
WalkthroughThis PR implements Studio Hot Module Replacement (HMR) with graceful early-stop and callback hot-swap for in-flight training runs. It migrates the build system from esbuild to rolldown, adds deterministic config hashing for rebuild routing decisions, implements signal-coordinated trainer lifecycle control (SIGUSR2 for callbacks, SIGTERM for graceful restart), and provides frontend auto-restart orchestration with configurable grace windows. The changes span build tooling, runtime primitives, server-side train process orchestration, CLI shutdown integration, frontend SSE streaming, and comprehensive testing. ChangesStudio HMR End-to-End Flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes This PR introduces multiple interconnected subsystems (HMR coordinator, train registry, signal coordination, frontend state machine) with dense logic, cross-platform signal handling, race-condition defenses (nonce spoofing, pre-ready backfill, sequence gating), and extensive test coverage across build, runtime, server, and frontend layers. The changes span heterogeneous domains requiring separate reasoning for each checkpoint. Suggested reviewers
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
✨ Simplify code
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d4030b9c43
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Clearing here makes the user's broken-state edit the | ||
| // source of truth: no auto-restart fires until the next | ||
| // successful rebuild re-arms the latch. | ||
| restartPendingRef.current = false; |
There was a problem hiding this comment.
Preserve restart after a fixed HMR error
When an HMR error arrives after a prior rebuild already put this child into the registry's earlyStopRequested state, clearing restartPendingRef here loses the only restart signal for that child. The server's rebuild dispatch skips entries with earlyStopRequested, so if the user fixes the compile error before the early-stop child exits, the successful rebuild will not include this pid in restartTargets and the SPA cannot re-arm the latch; the child then exits and stays idle instead of restarting with the fixed bundle.
Useful? React with 👍 / 👎.
| void fetchManifest() | ||
| .then(setManifest) |
There was a problem hiding this comment.
Ignore stale manifest fetches after HMR errors
Fresh evidence beyond the server-side /api/manifest guard: a ready/rebuild event can start this async fetchManifest(), then a newer HMR error event can set the manifest to { error } before this request resolves. Because the promise is not versioned or cancelled, its later setManifest can overwrite the error with the last-good manifest, re-enabling the Run button for stale code until a later poll restores the error state.
Useful? React with 👍 / 👎.
| async function inspectBundle(outFile: string): Promise<InspectionResult> { | ||
| try { | ||
| const mod = (await import(moduleCacheBustUrl(outFile))) as Record< | ||
| string, | ||
| unknown | ||
| >; | ||
| const inspection = findInspectableTrainer(mod); | ||
| if (!inspection) return null; | ||
| return { | ||
| configHash: hashJobConfig(inspection.config), | ||
| trainerName: inspection.name, | ||
| }; | ||
| } catch { | ||
| return null; |
There was a problem hiding this comment.
Propagate inspection failures to HMR error state
If a rebuilt module throws during evaluation (for example top-level user code throws), Rolldown still emits BUNDLE_END, but this catch converts the failed import into configHash: null and a successful ready/rebuild event. That means /api/manifest will not stay in the HMR error state and the registry may SIGTERM-restart active runs against an artifact that arkor start cannot import, leaving the SPA to churn into a failed restart instead of surfacing the build/import failure as an error.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/arkor/src/core/trainerInspection.test.ts (1)
91-121: ⚡ Quick winAdd the missing
default: trainerinspection case.This block calls out shape
#4in the regression note, but it never assertsfindInspectableTrainer({ default: trainer }). A directexport default createTrainer(...)path can regress without failing this suite.As per coding guidelines, "Add vitest test cases under `packages/*/src/**/*.test.ts` for SDK/CLI/scaffold logic changes."Proposed test addition
const trainerC = brandedTrainer("default-arkor"); const inspectionC = findInspectableTrainer({ default: createArkor({ trainer: trainerC }), }); expect(inspectionC?.name).toBe("default-arkor"); - const trainerD = brandedTrainer("default-nested"); + const trainerD = brandedTrainer("default-direct"); + const inspectionD = findInspectableTrainer({ default: trainerD }); + expect(inspectionD?.name).toBe("default-direct"); + + const trainerE = brandedTrainer("default-nested"); + const inspectionE = findInspectableTrainer({ + default: { trainer: trainerE }, + }); + expect(inspectionE?.name).toBe("default-nested"); - const inspectionD = findInspectableTrainer({ - default: { trainer: trainerD }, - }); - expect(inspectionD?.name).toBe("default-nested");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/core/trainerInspection.test.ts` around lines 91 - 121, Add a new assertion to cover the missing "default: trainer" shape by calling findInspectableTrainer with an object whose default property is the bare trainer; specifically create a trainer via brandedTrainer (e.g., brandedTrainer("default-bare")) and assert that findInspectableTrainer({ default: trainer }) returns the expected inspection (check .name), similar to the other cases in the test that use findInspectableTrainer, brandedTrainer, and createArkor.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/arkor/src/core/configHash.ts`:
- Around line 64-85: stableStringifyRec currently calls a value's toJSON before
adding the object to the seen set and uses Array.prototype.map which skips
sparse array holes; change it so that when maybeToJSON is a function you add the
current value to seen before invoking (so circular returns from toJSON are
detected like JSON.stringify), then call (maybeToJSON as (key: string) =>
unknown).call(value, key) and recurse on that result; for arrays, stop using
value.map and instead iterate numeric indices 0..value.length-1, treat missing
slots (i in value === false) and slots that serialize to undefined as "null"
(i.e. use stableStringifyRec per index and fall back to "null") to match
JSON.stringify behavior and ensure circulars are detected via the seen set.
In `@packages/arkor/src/core/trainer.test.ts`:
- Around line 1817-1849: The test's fetcher mock expects events for job id
"j-falsy" but the create-job stub returns minimalJobRow.id (currently "j-stop"),
causing wait() to hit the wrong stream URL; update the create-job response to
return a job with id "j-falsy" (i.e. ensure minimalJobRow.id or the object
returned in the POST branch matches "j-falsy") so the GET to
/v1/jobs/j-falsy/events/stream in the fetcher mock is exercised by the
Trainer.wait() flow.
---
Nitpick comments:
In `@packages/arkor/src/core/trainerInspection.test.ts`:
- Around line 91-121: Add a new assertion to cover the missing "default:
trainer" shape by calling findInspectableTrainer with an object whose default
property is the bare trainer; specifically create a trainer via brandedTrainer
(e.g., brandedTrainer("default-bare")) and assert that findInspectableTrainer({
default: trainer }) returns the expected inspection (check .name), similar to
the other cases in the test that use findInspectableTrainer, brandedTrainer, and
createArkor.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 429d904f-75a0-497d-bfbb-9ea6225bd3e4
⛔ Files ignored due to path filters (1)
pnpm-lock.yamlis excluded by!**/pnpm-lock.yaml
📒 Files selected for processing (43)
AGENTS.mddocs/concepts/studio.mdxdocs/ja/concepts/studio.mdxdocs/ja/studio/jobs.mdxe2e/cli/vitest.config.tse2e/studio/src/specs/hmr.spec.tseslint.config.tspackages/arkor/package.jsonpackages/arkor/src/cli/cleanupHooks.test.tspackages/arkor/src/cli/cleanupHooks.tspackages/arkor/src/cli/commands/build.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/start.test.tspackages/arkor/src/core/configHash.test.tspackages/arkor/src/core/configHash.tspackages/arkor/src/core/moduleCacheBust.test.tspackages/arkor/src/core/moduleCacheBust.tspackages/arkor/src/core/projectState.test.tspackages/arkor/src/core/rolldownConfig.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/runner.tspackages/arkor/src/core/runnerSignals.test.tspackages/arkor/src/core/runnerSignals.tspackages/arkor/src/core/schemas.test.tspackages/arkor/src/core/signalExit.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/core/trainerInspection.test.tspackages/arkor/src/core/trainerInspection.tspackages/arkor/src/studio/hmr.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/studio/manifest.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/studio/server.tspackages/arkor/src/studio/trainRegistry.test.tspackages/arkor/src/studio/trainRegistry.tspackages/cli-internal/src/templates.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/vite.config.tspnpm-workspace.yaml
…andling fix: improve inspectBundle error handling and manifest fetching logic in HMR test: add assertions for default-exported trainer shapes in trainerInspection tests feat: implement monotonic counter for HMR SSE handler to prevent stale state
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e726548bca
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // restored the error. The polling effect's own fetches don't need | ||
| // this because they observe `cancelled` synchronously inside the | ||
| // `tick()` closure; this counter only protects the cross-handler |
There was a problem hiding this comment.
Version manifest polling against HMR errors
When HMR is enabled, a regular 5s manifest poll that started before a later error SSE can still resolve after the error handler has set { error }, because this code explicitly leaves polling fetches outside the manifestFetchSeqRef guard. In projects whose manifest import/build is slow (for example top-level await or filesystem contention), that stale poll can overwrite the current build-error state and re-enable the Run button for the last-good artifact until the next poll restores the error.
Useful? React with 👍 / 👎.
| if (restartPendingRef.current) { | ||
| // Fast path: SSE event already landed before exit. Fire the | ||
| // restart synchronously without waiting for the grace | ||
| // window so the common case has no perceptible delay. | ||
| restartPendingRef.current = false; |
There was a problem hiding this comment.
Suppress auto-restart after failed early-stop exits
When an HMR-triggered early-stop fails to cancel the cloud job, the runner now exits nonzero (exit=143), but streamTraining() still resolves normally for any /api/train stream that closed after a 200 response. If restartPendingRef is set, this branch auto-spawns a new run without checking that exit marker, so the fixed nonzero exit path is still treated as restartable and can overlap a new cloud job with the one whose cancel failed.
Useful? React with 👍 / 👎.
| const { baseUrl: rpcBaseUrl, token: rpcToken } = | ||
| await resolveCredentialsAndBaseUrl(); |
There was a problem hiding this comment.
Capture cancel credentials at spawn time
This server-side Stop path re-reads credentials/base URL at cancel time, while the child created the cloud job using whatever credentials it loaded at trainer.start(). If the user logs in/out or otherwise changes ~/.arkor credentials while a run is in flight, Stop can send the cancel POST to the wrong account/control plane, swallow the failure, and then SIGKILL the child that still held the original credentials, leaving the original cloud job running.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e41acc7f28
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| // Hash mismatch (or one side is null): graceful restart. | ||
| const r = safeKill(entry.child, "SIGTERM"); |
There was a problem hiding this comment.
Avoid SIGTERM restarts on Windows
When Studio HMR runs on Windows and a rebuild requires a restart, this SIGTERM does not reach the runner's installShutdownHandlers; Node's subprocess.kill() docs state that on Windows these signals cause the process to be killed "forcefully and abruptly" (https://nodejs.org/api/child_process.html#subprocesskillsignal). In that environment the child cannot early-stop or send cancel(), so a config-changing rebuild can orphan the cloud job and lose the checkpoint-preservation behavior this path is meant to provide.
Useful? React with 👍 / 👎.
|
|
||
| async function emitBuildSucceeded(): Promise<void> { | ||
| if (disposed) return; | ||
| const seq = ++buildSeq; |
There was a problem hiding this comment.
Invalidate stale inspections when rebuild starts
This sequence guard only advances after a later build reaches BUNDLE_END (or ERROR), so if the previous bundle's inspectBundle() is slow (for example user top-level await during dynamic import) and a new edit has already started compiling but has not ended yet, the older inspection still passes seq === buildSeq and broadcasts/signals active children for stale source. Rolldown's watch cycle includes BUNDLE_START before BUNDLE_END, so the in-flight inspection should be invalidated when the newer rebuild starts, not only when it finishes.
Useful? React with 👍 / 👎.
| } | ||
| watcher = watch({ | ||
| ...rolldownInputOptions(resolved), | ||
| output: { file: resolved.outFile, format: "esm" }, |
There was a problem hiding this comment.
Avoid importing while the watcher writes
With HMR enabled, the watcher writes .arkor/build/index.mjs at this same path while /api/train can concurrently spawn arkor start, whose runStart skips rebuilding whenever that artifact already exists. If the user clicks Run during the watcher's non-atomic write window, the child can import a partially-written bundle and fail the run instead of starting from a coherent artifact. Gate training while a rebuild is in progress or publish the watched bundle via temp-file-and-rename before making it visible to arkor start.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
packages/arkor/src/studio/server.ts (1)
1620-1624:⚠️ Potential issue | 🟠 Major | ⚡ Quick winConstrain static asset reads to
assetsDir.
readAsset()joins the request path directly ontoassetsDir, andapp.get("*")feeds itc.req.pathverbatim. Any traversal sequence that survives routing normalization can therefore escape the bundled assets tree and read arbitrary local files. Because these GETs are only host-guarded, not Studio-token gated, this is a local file disclosure on the loopback server.🔒 Suggested hardening
async function readAsset(relPath: string): Promise<Response | null> { const cleaned = relPath.replace(/^\/+/, ""); try { - const file = await readFile(join(assetsDir, cleaned)); + const filePath = resolve(assetsDir, cleaned); + if ( + filePath !== assetsDir && + !filePath.startsWith(`${assetsDir}${sep}`) + ) { + return null; + } + const file = await readFile(filePath); const ext = cleaned.slice(cleaned.lastIndexOf(".") + 1); if (ext === "html") {Also applies to: 1647-1655
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/studio/server.ts` around lines 1620 - 1624, readAsset currently concatenates the request path onto assetsDir and can be abused for path traversal; to fix, decode and normalize the incoming relPath, then compute the absolute target with path.resolve(assetsDir, cleaned) (use the same resolved base: const base = path.resolve(assetsDir)), and verify the resulting target is inside base (e.g. use path.relative(base, target) and ensure it does not start with ".." and does not contain path.sep at the front); if the check fails return null (or an appropriate error) and do not call readFile. Also reject suspicious inputs (null bytes, absolute paths) before resolving. Apply this same guard to the other occurrence around lines 1647-1655.packages/arkor/src/cli/commands/dev.test.ts (1)
663-682:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThese persist-failure tests do not actually fail on Windows.
Line 671 and Line 738 rely on
chmodSync(..., 0o555)to block writes to~/.arkor, but NTFS does not enforce directory writes that way. On Windows both tests can go green without ever hitting thepersistStudioTokencatch path, so this regression coverage disappears on part of the CI matrix. Use a platform-neutral failure trigger here, such as makingstudioTokenPath()a directory or mocking the write boundary.Also applies to: 726-747
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/cli/commands/dev.test.ts` around lines 663 - 682, The tests relying on chmodSync to make ~/.arkor read-only are flaky on Windows; replace the platform-specific permission trick with a platform-neutral write-failure trigger: locate the token path via studioTokenPath(...) (or the exported persistStudioToken/writeCredentials function) and either (a) create a directory at that exact path before calling runDev({ port: 4203 }) so the token write will fail with ENOTDIR, or (b) mock/spyon the module method that writes the token (persistStudioToken or writeCredentials) to throw an EACCES-like error for this test; ensure you restore the original behavior in the finally block and keep assertions (expect(runDev...).resolves.toBeUndefined and expect(serve).toHaveBeenCalledTimes(1)) unchanged.packages/studio-app/src/components/RunTraining.tsx (2)
141-167:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe poller can still overwrite an HMR error with stale manifest data.
Line 151 is only gated by
cancelled, which flips on unmount, not on newer HMR events. If a/api/manifestpoll started before an SSEerrorframe resolves afterward, this effect will restore the last-good manifest and re-enable Run against broken code until the next refresh. Reuse the same seq/generation gate for poll fetches, or invalidate in-flight polls when HMR events land.Possible fix
async function tick() { + const seq = ++manifestFetchSeqRef.current; try { const m = await fetchManifest(); - if (!cancelled) setManifest(m); + if (!cancelled && seq === manifestFetchSeqRef.current) { + setManifest(m); + } } catch (err: unknown) { - if (!cancelled) { + if (!cancelled && seq === manifestFetchSeqRef.current) { setManifest({ error: err instanceof Error ? err.message : String(err), }); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/studio-app/src/components/RunTraining.tsx` around lines 141 - 167, The poller can overwrite a newer HMR error because it only checks cancelled; modify the effect to use a generation/sequence gate: create a ref (e.g., manifestSeqRef) that you increment whenever an SSE/HMR error frame or any HMR event invalidates the manifest, capture the current seq at the start of tick (before calling fetchManifest), then before calling setManifest or scheduling the next timer ensure the captured seq still equals manifestSeqRef.current; if it differs, discard the poll result (and do not schedule another timeout). Keep using the existing symbols (useEffect, cancelled, timer, tick, fetchManifest, setManifest) but add and check the manifestSeqRef around the async fetch and before state updates so in-flight polls are invalidated when HMR events arrive.
334-510: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy liftAdd a jsdom test suite for this restart state machine.
This component now owns the risky HMR logic - buffered pre-spawn events, late-SSE grace windows, and "Stop beats auto-restart" arbitration - but the provided coverage only hits API helpers and a caption-refresh E2E. A small Testing Library suite here would catch regressions before they become flaky cross-process Playwright failures. As per coding guidelines: Include jsdom-based Testing Library cases for Studio components. Based on learnings: Do not split implementation across multiple PRs; land docs, tests, and code in the same PR unless the user explicitly says otherwise.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/studio-app/src/components/RunTraining.tsx` around lines 334 - 510, Add a jsdom Testing Library suite for the RunTraining component exercising the run() state machine: write tests to simulate streamTraining callbacks and SSE-like pre-spawn events to validate that pendingPreSpawnEventsRef is drained, restartPendingRef wins over hot-swap, user Abort (trainingAbortRef)/ac.signal.aborted prevents logging and cancels restarts, and the 250ms restart grace window behavior (currentPidRef, restartGraceTimerRef) toggles hmrStatus correctly; stub/mock streamTraining and isHmrEnabled, drive onSpawn via the pid callback, and assert UI/state transitions (setHmrStatus outcomes "idle"/"early-stopping"/"restarting"/"hot-swapped") and that run(file) re-invokes when appropriate. Ensure tests run under jsdom (Testing Library + jest) and are added alongside the component.Sources: Coding guidelines, Learnings
packages/arkor/src/cli/commands/dev.ts (1)
199-230:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftRead-then-unlink still lets a concurrent
arkor devinstance lose its token.Line 224 and Line 230 are a TOCTOU pair: another process can rewrite
~/.arkor/studio-tokenafter the read succeeds but before this process unlinks it. In that interleaving, we compare our bytes and then delete the other process's fresh token anyway, so the shared-path 403 regression is still reachable. This needs an atomic ownership/lease scheme, not a best-effort content check.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/arkor/src/cli/commands/dev.ts` around lines 199 - 230, The current read-then-unlink in registerCleanupHook (readFileSync → tokensEqual → unlinkSync) is TOCTOU and can delete a freshly written token from another process; replace the best-effort content check with an explicit ownership/lease: when creating the token (persistStudioToken or wherever the token is written) also atomically acquire a per-process lease by creating a companion lock/ownership file using fs.openSync(lockPath, 'wx') (or writeFileSync with flag 'wx') containing a unique owner id (pid/uuid); on cleanup, instead of just reading the token file, read the lock file and compare the owner id to our stored id and only then unlink the token and remove the lock; update registerCleanupHook, persistStudioToken (or the token-writing site), and use tokensEqual only for debug — use lockPath, persistStudioToken, registerCleanupHook, and the stored owner id to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@packages/arkor/src/cli/commands/dev.test.ts`:
- Around line 663-682: The tests relying on chmodSync to make ~/.arkor read-only
are flaky on Windows; replace the platform-specific permission trick with a
platform-neutral write-failure trigger: locate the token path via
studioTokenPath(...) (or the exported persistStudioToken/writeCredentials
function) and either (a) create a directory at that exact path before calling
runDev({ port: 4203 }) so the token write will fail with ENOTDIR, or (b)
mock/spyon the module method that writes the token (persistStudioToken or
writeCredentials) to throw an EACCES-like error for this test; ensure you
restore the original behavior in the finally block and keep assertions
(expect(runDev...).resolves.toBeUndefined and
expect(serve).toHaveBeenCalledTimes(1)) unchanged.
In `@packages/arkor/src/cli/commands/dev.ts`:
- Around line 199-230: The current read-then-unlink in registerCleanupHook
(readFileSync → tokensEqual → unlinkSync) is TOCTOU and can delete a freshly
written token from another process; replace the best-effort content check with
an explicit ownership/lease: when creating the token (persistStudioToken or
wherever the token is written) also atomically acquire a per-process lease by
creating a companion lock/ownership file using fs.openSync(lockPath, 'wx') (or
writeFileSync with flag 'wx') containing a unique owner id (pid/uuid); on
cleanup, instead of just reading the token file, read the lock file and compare
the owner id to our stored id and only then unlink the token and remove the
lock; update registerCleanupHook, persistStudioToken (or the token-writing
site), and use tokensEqual only for debug — use lockPath, persistStudioToken,
registerCleanupHook, and the stored owner id to locate the changes.
In `@packages/arkor/src/studio/server.ts`:
- Around line 1620-1624: readAsset currently concatenates the request path onto
assetsDir and can be abused for path traversal; to fix, decode and normalize the
incoming relPath, then compute the absolute target with path.resolve(assetsDir,
cleaned) (use the same resolved base: const base = path.resolve(assetsDir)), and
verify the resulting target is inside base (e.g. use path.relative(base, target)
and ensure it does not start with ".." and does not contain path.sep at the
front); if the check fails return null (or an appropriate error) and do not call
readFile. Also reject suspicious inputs (null bytes, absolute paths) before
resolving. Apply this same guard to the other occurrence around lines 1647-1655.
In `@packages/studio-app/src/components/RunTraining.tsx`:
- Around line 141-167: The poller can overwrite a newer HMR error because it
only checks cancelled; modify the effect to use a generation/sequence gate:
create a ref (e.g., manifestSeqRef) that you increment whenever an SSE/HMR error
frame or any HMR event invalidates the manifest, capture the current seq at the
start of tick (before calling fetchManifest), then before calling setManifest or
scheduling the next timer ensure the captured seq still equals
manifestSeqRef.current; if it differs, discard the poll result (and do not
schedule another timeout). Keep using the existing symbols (useEffect,
cancelled, timer, tick, fetchManifest, setManifest) but add and check the
manifestSeqRef around the async fetch and before state updates so in-flight
polls are invalidated when HMR events arrive.
- Around line 334-510: Add a jsdom Testing Library suite for the RunTraining
component exercising the run() state machine: write tests to simulate
streamTraining callbacks and SSE-like pre-spawn events to validate that
pendingPreSpawnEventsRef is drained, restartPendingRef wins over hot-swap, user
Abort (trainingAbortRef)/ac.signal.aborted prevents logging and cancels
restarts, and the 250ms restart grace window behavior (currentPidRef,
restartGraceTimerRef) toggles hmrStatus correctly; stub/mock streamTraining and
isHmrEnabled, drive onSpawn via the pid callback, and assert UI/state
transitions (setHmrStatus outcomes
"idle"/"early-stopping"/"restarting"/"hot-swapped") and that run(file)
re-invokes when appropriate. Ensure tests run under jsdom (Testing Library +
jest) and are added alongside the component.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: be0f3b24-0b07-46b0-87c3-4a8c45682ee7
📒 Files selected for processing (18)
AGENTS.mddocs/concepts/studio.mdxe2e/cli/vitest.config.tse2e/studio/src/specs/hmr.spec.tseslint.config.tspackages/arkor/src/cli/commands/dev.test.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/cli/commands/start.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/studio/server.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/studio-app/src/lib/api.tspackages/studio-app/vite.config.ts
📜 Review details
🧰 Additional context used
📓 Path-based instructions (19)
docs/**/*.mdx
📄 CodeRabbit inference engine (AGENTS.md)
Verify Mintlify heading slugs with curl before adding cross-page anchor links. Mintlify preserves
/,=, and full-width JP parens()in ids, strips ASCII parens()and backticks`, and converts spaces to-. Confirm actual rendered ids before committing links.
Files:
docs/concepts/studio.mdx
**
⚙️ CodeRabbit configuration file
**: # Arkor Development GuideNote: Claude Code automatically loads this file.
Repository shape
pnpm + Turbo monorepo. Workspaces are declared in
pnpm-workspace.yaml(packages/*,e2e/*,docs).
Path Role packages/arkor Published arkorSDK + CLI + bundled local Studio server (Hono).bin/arkor→dist/bin.mjs. Library entry →dist/index.mjs.packages/create-arkor Published create-arkorscaffolder (pnpm create arkor).packages/cli-internal Private workspace package. Source is bundled into arkorandcreate-arkorvia tsdown'sdeps.alwaysBundle. Never appears as a runtime dependency on npm.packages/studio-app Private Vite + React 19 SPA. pnpm --filter@arkor/studio-appbundlebuilds it;packages/arkor/scripts/copy-studio-assets.mjscopiesdist/intopackages/arkor/dist/assets/.e2e/cli Private vitest suite that spawns the built dist/bin.mjsof both CLIs in temp dirs.e2e/studio Private Playwright suite that spawns arkor devagainst an in-process fake cloud-api and drives the Studio SPA in Chromium.docs Mintlify sources for docs.arkor.ai. Common commands
Root scripts fan out via Turbo (which respects
^builddeps in turbo.json):pnpm install pnpm build # turbo run build across all packages pnpm typecheck # tsc --noEmit across all packages pnpm lint # oxlint --deny-warnings, then strict ESLint 10 (single root config each) pnpm format # oxfmt --write across the repo (config in oxfmt.config.ts) pnpm format:check # oxfmt --check; CI gates on this (no writes) pnpm test # vitest run across all packages (incl. e2e) pnpm test:coverage # writes lcov + cobertura + junit per package; CI uploads lcov to Codecov, cobertura to GitHub nativ...
Files:
docs/concepts/studio.mdxpackages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tsAGENTS.mdeslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs}: Avoid the em dash (U+2014) in code comments, string literals, and template literals. Use colons, periods, commas, parentheses, spaced hyphens (" - "), or reworded sentences instead
Runpnpm lint(oxlint --deny-warnings followed by strict ESLint 10 with type-aware rules) before submitting code
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs,cts,cjs,json,css,html}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Run
pnpm formatusing oxfmt to format all JS/TS/JSON/CSS/HTML files before submitting code
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.test.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (CONTRIBUTING.md)
Include vitest cases for SDK/CLI/scaffolder logic, jsdom-based Testing Library cases for Studio components, or visual UI documentation when submitting PRs
Files:
packages/arkor/src/cli/commands/start.test.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/studio-app/src/lib/api.test.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx,js,jsx,mts,mjs}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
Do not use em-dashes (U+2014) anywhere in code comments, strings, and template literals. Instead use colons, periods, commas, parentheses, space-separated hyphens (
-), or rephrase sentences.
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
packages/{arkor,create-arkor,cli-internal}/**/*.{ts,tsx,js,mts}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
CLI runtime messages, generated file templates, and test names must also follow the no em-dash rule
Files:
packages/arkor/src/cli/commands/start.test.tspackages/arkor/src/studio/server.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx}: Preferinterfacefor defining object shapes in TypeScript (TypeScript-specific type definition pattern)
Use camelCase for variable names in TypeScript/JavaScript code
Always use async/await for promises in TypeScript/JavaScript code
UsetimingSafeEqualfor security-critical token comparisons (e.g., Studio CSRF token validation)
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/components/RunTraining.tsxpackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
packages/arkor/src/**/*.test.ts
📄 CodeRabbit inference engine (AGENTS.md)
Add vitest test cases for SDK/CLI/scaffold logic changes
Files:
packages/arkor/src/cli/commands/start.test.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/cli/commands/dev.test.ts
**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
When bumping
oxlintoroxfmtin the catalog, runpnpm install --config.minimum-release-age=0once so platform bindings land inpnpm-lock.yaml. Regularpnpm installthereafter will resolve them normally.
Files:
packages/arkor/src/cli/commands/start.test.tspackages/studio-app/vite.config.tse2e/cli/vitest.config.tspackages/studio-app/src/lib/api.tspackages/arkor/src/studio/server.tseslint.config.tspackages/arkor/src/studio/server.test.tspackages/arkor/src/core/runner.test.tse2e/studio/src/specs/hmr.spec.tspackages/arkor/src/core/trainer.test.tspackages/arkor/src/core/trainer.tspackages/arkor/src/cli/commands/dev.tspackages/studio-app/src/lib/api.test.tspackages/arkor/src/studio/hmr.tspackages/arkor/src/cli/commands/dev.test.ts
packages/studio-app/vite.config.ts
📄 CodeRabbit inference engine (AGENTS.md)
Configure the
arkor-studio-tokenVite plugin withapply: "serve"(dev-only) to inject<meta name="arkor-studio-token">into index.html on each request. Never apply this plugin duringvite buildto prevent baking a stale token into production.
Files:
packages/studio-app/vite.config.ts
e2e/cli/**/*.{ts,tsx,js,mts}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
E2E CLI tests must consume the built dist/bin.mjs from arkor and create-arkor packages, ensuring pnpm build is run first when executing tests standalone
Files:
e2e/cli/vitest.config.ts
packages/arkor/src/studio/server.ts
📄 CodeRabbit inference engine (AGENTS.md)
Validate every
/api/*request viaX-Arkor-Studio-Tokenheader (or?studioToken=query forEventSource) usingtimingSafeEqualcomparison. Only GET stream-only routes can be added to theeventStreamPathPatternallow-list regex; never add mutation endpoints.
Files:
packages/arkor/src/studio/server.ts
AGENTS.md
📄 CodeRabbit inference engine (CLAUDE.md)
Maintain AGENTS.md as a living document with current agent status and architectural decisions
Files:
AGENTS.md
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Maintain English/Japanese documentation pairs:
README.md↔README.ja.md,CONTRIBUTING.md↔CONTRIBUTING.ja.md, anddocs/↔docs/ja/. Update both sides in the same PR.
Files:
AGENTS.md
eslint.config.ts
📄 CodeRabbit inference engine (AGENTS.md)
ESLint 10 reads
eslint.config.tsflat config at repo root (type-aware viaprojectService). Configure strict rules: typescript-eslintstrictTypeChecked+stylisticTypeChecked, unicorn, import-x, promise, n, react-hooks, jsx-a11y, vitest. Add per-file overrides for SPA (React), tests, Playwright fixtures, and scripts. Keepauto-fixunsafe rules (unicorn/no-typeof-undefined,no-unnecessary-type-assertion) set tooffwith comments explaining the failure modes.
Files:
eslint.config.ts
e2e/studio/**/*.{ts,tsx,js,mts}
📄 CodeRabbit inference engine (CONTRIBUTING.ja.md)
E2E Studio tests must consume the built dist/bin.mjs from arkor and the bundled Studio assets, requiring playwright chromium installation and pnpm build as prerequisites
Files:
e2e/studio/src/specs/hmr.spec.ts
packages/arkor/src/cli/commands/dev.ts
📄 CodeRabbit inference engine (AGENTS.md)
Generate a 32-byte base64url CSRF token per dev server launch and pass it to
buildStudioApp({ studioToken, hmr })
Files:
packages/arkor/src/cli/commands/dev.ts
packages/arkor/src/studio/hmr.ts
📄 CodeRabbit inference engine (AGENTS.md)
Keep a Rolldown watcher over
src/arkor/to track rebuilds and push events over/api/dev/events(SSE). Dynamic-import artifacts on success, extractTrainerInspectionsnapshots viaSymbol.for("arkor.trainer.inspect")brand, and compute stableconfigHashfrom cloud-sideJobConfig.
Files:
packages/arkor/src/studio/hmr.ts
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Use the Repository pattern for data access
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Treat `createArkor` as a function that returns a frozen, opaque manifest tagged with `_kind: "arkor"`; pass it to tooling, not as a programmable client for user code
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Both e2e/cli and e2e/studio declare `arkor` (and `create-arkor`) as `workspace:*` devDependencies so Turbo's `^build` produces `dist/bin.mjs` exactly once before test runs (no pretest hooks, no concurrent races). Standalone runs need a prior `pnpm build`.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: E2E coverage uses `c8` wrapping vitest (not vitest's native coverage) so child CLI V8 coverage is captured and remapped through tsdown sourcemaps back to `src/`. Set `CREATE_ARKOR_BUILD_SOURCEMAP=1` for E2E coverage builds; CI declares it in turbo.json so it propagates and busts the no-sourcemap cache variant. Release builds do NOT set this and ship without `.map` files.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Release SBOMs are generated with `pnpm sbom --prod --sbom-spec-version 1.7` (CycloneDX 1.7 and SPDX 2.3 formats), attested under the trusted builder's OIDC identity, and uploaded to both Codecov and GitHub native code coverage. Bundled runtime inputs of `arkor/studio-app` (e.g. `react`, `fontsource-variable/geist*`) must be `dependencies`, not `devDependencies`, so they appear in the production SBOM.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Do NOT split implementation (features, schema, behavior changes) across multiple PRs. Land docs (English + Japanese), tests, and code in the same PR unless the user explicitly says otherwise.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Node version: published packages declare `engines.node >=22.22.0` (raised from 22.6 to dodge the Jan 2026 async-hooks CVE). Use Node 24 (latest preferred) for development.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Preserve the SIGTERM-and-let-the-child-handle-it pattern in HMR subprocess lifecycle. Do not replace it with SIGKILL escalation (orphans Cloud jobs) or always-hot-swap (risks stale JobConfig). Do not surface `requestEarlyStop()` or `replaceCallbacks()` as public Trainer interface methods; keep them behind `Symbol.for` brands (dev-only HMR primitives).
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: HMR Symbols (`Symbol.for("arkor.trainer.inspect")`, `Symbol.for("arkor.trainer.replaceCallbacks")`, `Symbol.for("arkor.trainer.requestEarlyStop")`) are dev-only HMR primitives and must never be exposed in the published Trainer public interface. User code that wants similar semantics should compose `abortSignal` + `cancel()` per the cookbook.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`); running it during `vite build` would bake a stale per-launch token into production.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: CORS is intentionally NOT configured on `/api/*`. The SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's `<meta>`. This prevents another browser tab on the same machine from POSTing `/api/train` (RCE-grade exposure).
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Generated files copied into package directories are gitignored: `packages/*/CONTRIBUTING.md` (from root), `packages/arkor/docs/` (from root `docs/`). Always edit the source under repo root, not the copies.
Learnt from: CR
Repo: arkorlab/arkor
Timestamp: 2026-06-11T14:01:54.065Z
Learning: Do not call a HuggingFace model name "non-existent" based on training data alone. Verify (e.g. WebFetch) before flagging in issues or PR comments. If unverifiable, hedge ("could not confirm") rather than asserting absence.
🪛 LanguageTool
AGENTS.md
[style] ~100-~100: Consider using the typographical ellipsis character here instead.
Context: ...emoved. The build / coverage lanes set ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor so scaffolded fixtures ...
(ELLIPSIS)
[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...TOCOL`. - Release SBOMs (build.yaml): each release sh...
(GITHUB)
[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...ifact, and postflight in release.yaml attaches them t...
(GITHUB)
[uncategorized] ~132-~132: The official name of this software platform is spelled with a capital “H”.
Context: ...npm emits 2.3), so release-dry-run.yaml regener...
(GITHUB)
🔇 Additional comments (11)
packages/arkor/src/core/trainer.ts (1)
32-32: LGTM!Also applies to: 157-157, 248-248
eslint.config.ts (1)
4-4: LGTM!Also applies to: 31-38, 40-98, 154-167, 176-176, 334-334, 357-357
packages/arkor/src/core/runner.test.ts (1)
171-171: LGTM!packages/arkor/src/core/trainer.test.ts (1)
279-279: LGTM!Also applies to: 674-675, 856-858, 1841-1841
packages/arkor/src/studio/hmr.ts (1)
108-108: LGTM!Also applies to: 111-111, 347-347
packages/studio-app/src/components/RunTraining.tsx (1)
241-251: Em-dash check: no U+2014 characters in RunTraining.tsx
packages/studio-app/src/components/RunTraining.tsxcontains no U+2014 em-dash characters, so the em-dash guideline issue is not present here.> Likely an incorrect or invalid review comment.packages/studio-app/src/lib/api.ts (1)
174-183: LGTM!Also applies to: 262-268, 414-424, 447-453, 476-485, 510-515
packages/studio-app/vite.config.ts (1)
61-63: LGTM!AGENTS.md (1)
100-100: LGTM!Also applies to: 132-132
docs/concepts/studio.mdx (1)
15-15: LGTM!e2e/cli/vitest.config.ts (1)
22-25: LGTM!Also applies to: 45-46
This pull request migrates the Arkor build and development workflow from esbuild to Rolldown, introduces a robust hot module replacement (HMR) system for the Studio dev server, and refines the cleanup and shutdown logic for long-lived resources. Documentation has been updated to explain the new HMR behavior in both English and Japanese, and dependency management has been adjusted accordingly. These changes streamline the developer experience, enable live code updates during training runs, and improve resource handling during shutdown.
Build system migration and HMR integration:
arkor build, ensuring bundles target the current Node version and keeping external dependencies unbundled. (packages/arkor/package.json,packages/arkor/src/cli/commands/build.ts,packages/arkor/src/core/rolldownConfig) [1] [2] [3]src/arkor/in dev mode, enabling HMR for trainer code. The watcher notifies the SPA via SSE (/api/dev/events), and the dev server can hot-swap callbacks or gracefully early-stop training runs based on config changes. (packages/arkor/src/cli/commands/dev.ts,packages/arkor/src/studio/hmr.ts,packages/arkor/src/studio/trainRegistry.ts) [1] [2] [3] [4] [5] [6]Cleanup and shutdown improvements:
registerCleanupHookutility to handle resource cleanup on process exit and signals, ensuring proper teardown order (e.g., HMR watcher before token file removal). (packages/arkor/src/cli/cleanupHooks.ts,packages/arkor/src/cli/commands/dev.ts) [1] [2]Documentation updates:
docs/concepts/studio.mdx,docs/ja/concepts/studio.mdx,AGENTS.md) [1] [2] [3]AGENTS.md)Dependency and test adjustments:
packages/arkor/package.json,packages/arkor/src/cli/commands/start.test.ts) [1] [2]Security and internal API notes:
Symbol.forbrands, not exposed on the public SDK surface. (AGENTS.md) [1] [2]Summary by CodeRabbit
Release Notes
New Features
Documentation