From 4139e5c6d8dccc90085a1a12322a95cc84d9fb86 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 14:29:18 -0400 Subject: [PATCH 01/15] docs: add SO-101 remediation design spec 8-PR sequence prioritizing arm/gripper correctness fixes (lifecycle, speed/accel, opMgr) before calibration sensor and discovery work. Defers the singleton-vs-arm-as-core architecture decision to a post-remediation revisit. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-04-27-so101-remediation-design.md | 170 ++++++++++++++++++ 1 file changed, 170 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-27-so101-remediation-design.md diff --git a/docs/superpowers/specs/2026-04-27-so101-remediation-design.md b/docs/superpowers/specs/2026-04-27-so101-remediation-design.md new file mode 100644 index 0000000..ba71f7b --- /dev/null +++ b/docs/superpowers/specs/2026-04-27-so101-remediation-design.md @@ -0,0 +1,170 @@ +# SO-101 Module Remediation Design + +**Date:** 2026-04-27 +**Status:** Draft +**Author:** Brainstorm with Nick Hehr + +## Context + +A full review of the `so_arm` module surfaced ~30 issues spanning critical bugs (silent calibration corruption, lifecycle leaks, a misleading speed/acceleration config), missing features (`IsHoldingSomething`, `MoveOptions`), structural problems (3-4 implementations of the same conversion math, type-explosion around calibration, `runtime.Caller`-based reference tracking), and gaps in testing/CI/docs. + +This spec sequences the remediation work into 8 risk-ordered PRs, prioritizing arm and gripper fixes before calibration sensor and discovery, and explicitly defers a possible architectural shift (singleton shared-controller → arm-as-core with gripper using DoCommand) until the in-place fixes are complete. + +## Goals + +- Land the most user-visible correctness fixes (the silent calibration write bug, the speed/accel lie, the broken `Stop`) within the first two PRs. +- Make every PR independently reviewable: behavior changes and refactors do not share a PR. +- Add test infrastructure (mock-bus harness, conversion-math round-trip tests) inside the PRs that need it, rather than as a separate phase, so each PR ships covered. +- Establish PR-time CI so future regressions of this class are caught at review. +- Leave the codebase in a state where the singleton-vs-arm-as-core architectural question can be re-evaluated honestly post-remediation. + +## Non-goals + +- **No architectural shift in this spec.** The shared-controller singleton stays. Whether to flip to arm-as-core (gripper communicates with arm via DoCommand) is a deferred decision, re-evaluated after PR8 lands. +- **No package-layout reshuffle** (`internal/`, sub-packages). Discussed in the review, intentionally out of scope here. +- **No changes to the setup-app SvelteKit application.** Calibration sensor stays one Viam resource — the web app's UX depends on it. +- **No new features beyond filling already-promised contracts.** `IsHoldingSomething`/`MoveOptions`/`Get3DModels` are already-promised interfaces returning errors; we make them work. We do not add servo diagnostics, runtime joint limits, or park-on-close in this spec. + +## Architecture decisions + +### Defer arm-as-core refactor + +The current shared-controller singleton has real boundary problems (gripper reaches across files into the controller's unexported state; calibration sensor reads `cs.controller.bus` directly), but the most painful manifestations — `runtime.Caller` PC tracking, `s.initCtx` reuse, the gripper-closes-first race — are independently fixable in place. After PR1-PR4 we will have a clean baseline against which the arm-as-core question is a more honest one. The arm-as-core option is not better than the singleton in the abstract; it is only better than the *broken* singleton. + +### Same-pointer registry semantics + +`getExistingController` currently returns a fresh `*SafeSoArmController` struct copied from the registry's cached one. The fix is to cache one `*SafeSoArmController` per port and return that exact pointer to all callers. When the bus closes, an internal `closed` flag is set; subsequent method calls return `ErrClosed`. This eliminates the gripper-closes-first race without introducing a new handle type. + +### Per-move speed, init-time acceleration + +Speed flows through `ServoGroup.SetPositionsWithSpeed` on every move, with the configured `speed_degs_per_sec` as the default and `MoveOptions.MaxVelocityRPS` as a per-call override. Acceleration is written to the `RegAcceleration` register at init/reconfigure (and re-written if reconfigure changes it). This matches Viam's expectation that `MoveOptions` overrides per call, and avoids the surprise of acceleration silently changing mid-move. + +### Free conversion helpers, not methods + +A new `conversion.go` provides `radiansFor(cal *MotorCalibration, raw int) (float64, error)` and `rawFor(cal *MotorCalibration, rad float64) (int, error)`. The gripper's "percent encoded as `[-π, π]` so the Viam gripper API can pretend it's a joint" convention lives in these helpers, not on `MotorCalibration` itself — that keeps `MotorCalibration` as a pure servo-math type. The DriveMode flip stays inside `Normalize`/`Denormalize` (single source of truth), eliminating the asymmetry bug where the gripper applied DriveMode at the radians layer while `MotorCalibration` applied it at the normalized layer. + +### Strict opMgr semantics on the arm + +`opMgr.New(ctx)` at the top of `MoveToJointPositions` cancels any in-flight move; `opMgr.CancelRunning(ctx)` from `Stop` actually stops; `IsMoving` becomes `opMgr.OpRunning()`. `moveLock` and `isMoving` are deleted. This is a user-observable behavior change: a second move now cancels the first instead of queueing behind a mutex. This is intentional — it is the correct arm semantics, matches RDK convention, and unblocks `Stop`. + +## PR sequence + +### PR1 — Lifecycle correctness + +Refactor only; no user-visible behavior change. + +- Delete `runtime.Caller`/`callerPorts` machinery in `registry.go:332-360`. Replace with explicit port tracking on each component struct (`s.controllerPort = config.Port`); `Close` calls `globalRegistry.ReleaseController(s.controllerPort)` directly. +- Stop reusing `s.initCtx` (`arm.go:111`). Pass per-call `ctx` through to `doServoInitialization`, `diagnoseConnection`, `verifyServoConfig`. Remove the `initCtx` field. +- Fix gripper-closes-first race: in `getExistingController`, cache one `*SafeSoArmController` per port in `ControllerEntry` and return that exact pointer (`registry.go:98-104`). Add `closed atomic.Bool` to `SafeSoArmController`; method calls return a sentinel `ErrControllerClosed` when set. `ReleaseController` at refcount zero sets the flag and closes the bus. +- Tests: introduce mock-bus harness based on `feetech/_examples/mock_transport`. Cover refcount-0 close, double-acquire returns same pointer, post-close method call returns sentinel error. + +### PR2 — Move-path correctness + +User-visible behavior change. Fixes the speed/accel lie, `Stop` actually stopping, and the silent calibration corruption bug. + +- Wire `speed_degs_per_sec` through `ServoGroup.SetPositionsWithSpeed`. Default from config; `MoveOptions.MaxVelocityRPS` per-call override. +- Write `RegAcceleration` per-servo at init and on reconfigure (when value changes). +- Replace `time.Sleep(moveTimeSeconds)` block (`arm.go:380-389`) with `group.WaitForStop(ctx, timeoutMs)`. `timeoutMs` derived from the current speed estimate but used only as upper-bound safety net. Honors `ctx.Done()`. +- Drop the unused `speed, acc int` parameters from `MoveServosToPositions` and `MoveToJointPositions` controller methods (the latter is dead anyway, deleted in PR3). +- Fix `writeHomingOffset` (`calibration.go:872`): correct register name (`position_offset`), use the actual `homingOffset` argument, encode as 2-byte sign-magnitude using `feetech.RegPositionOffset.SignBit`. +- Tests: mock-bus test that `MoveToJointPositions` with non-default speed produces the expected sync-write packet; mock-bus test that `Stop` cancels an in-flight move (context observed); regression test for `writeHomingOffset` correct register + payload. + +### PR3 — Conversion math consolidation + dead code removal + +Refactor only; no user-visible behavior change. Protected by tests added in this PR. + +- New `conversion.go` with `radiansFor(cal, raw)` and `rawFor(cal, rad)` covering both arm (degrees) and gripper (percent → `[-π, π]`) conventions. +- Replace inline math in: `MoveServosToPositions` (`manager.go:73-88`), `GetJointPositionsForServos` (`manager.go:150-153`), `gripper.percentToRadians` and `radiansToPercent` (`gripper.go:370-424`), `arm.calculateJointLimits` (`arm.go:130-167`), and the broken estimator at `calibration.go:838`. +- Delete dead methods: `SafeSoArmController.MoveToJointPositions` (`manager.go:31`), `GetJointPositions` (`manager.go:94`), `getCalibrationForServo` (`manager.go:253`), `CalibratedServo.SetPositionWithSpeed` (`calibrated_servo.go:213`), `GetCurrentCalibration` (`manager.go:356`). +- Delete dead config fields: `SoArm101Config.SpeedDegsPerSec`, `AccelerationDegsPerSec` (`config.go:23-24`). +- Consolidate the three calibration-update loops (`registry.go:78-90, 185-198`, `manager.go:228-244`) into one method on `SafeSoArmController`. +- Delete commented-out radians-conversion blocks at `calibration.go:427-457` and `:573-603`. +- Tests: round-trip property tests for `Normalize`/`Denormalize` over all four `NormMode` × DriveMode toggle × range edges. Round-trip for new `radiansFor`/`rawFor`. Unit tests for `calculateJointLimits` over hand-built calibrations. + +### PR4 — Surface fixes + concurrency cleanup + +Mostly behavior-correcting; the `opMgr` change is user-observable. + +- `IsHoldingSomething` (`gripper.go:358`): extract the position-vs-threshold inference from `Grab` into a private helper, reuse from `IsHoldingSomething`. Cache last `Grab` outcome optionally. +- `MoveThroughJointPositions` (`arm.go:394`): honor `MoveOptions` per step — `MaxVelocityRPS` overrides default speed for that step's `SetPositionsWithSpeed` call. +- `Get3DModels` (`arm.go:444`): return `nil, errors.ErrUnsupported` instead of a hard error. +- `arm.DoCommand` `default:` branch (`arm.go:550-600`): replace with explicit `case "set_speed"`, `case "set_acceleration"`, `case "get_motion_params"`. The default case becomes a clean unknown-command error that lists valid commands. +- `opMgr` integration: `s.opMgr.New(ctx)` at top of `MoveToJointPositions`, `s.opMgr.CancelRunning(ctx)` from `Stop`, `IsMoving` returns `s.opMgr.OpRunning()`. Delete `moveLock` and `isMoving`. +- Cache `calculateJointLimits` result on `so101` struct; invalidate on `SetCalibration` and on successful `reload_calibration` DoCommand. +- Replace `CalibratedServo.mu` with `atomic.Pointer[MotorCalibration]`. The bus already serializes wire access; the per-servo lock was redundant. +- Tests: `IsHoldingSomething` returns true after a `Grab` that succeeded and false after one that didn't (using mock bus to set position); `MoveOptions.MaxVelocityRPS` override propagates to the sync-write packet; `Stop` cancels in-flight move via opMgr (verified via context cancellation in the mock). + +### PR5 — Calibration sensor bug-fixes + +Sensor stays one resource (web app constraint). No structural extraction. + +- State-machine race teardown: consistent `if cs.recordingCancel != nil` checks in `Close`, `abortCalibration`, `resetCalibration`, `stopRangeRecording`. Audit for double-call. +- Replace `positionHistory []map[int]int` (`calibration.go:540-630`) with `var sampleCount atomic.Int64`. The history was only ever read for `len()`. Saves ~100KB churn during recording. +- Delete commented-out radians-conversion blocks (`:427-457`, `:573-603` — already covered in PR3 if we get there first; otherwise here). +- Replace emoji status strings (`calibration.go:1077-1080`) with plain text. +- Tests: state-machine progression test (idle → started → homing_position → range_recording → completed), backed by mock bus. Asserts on register writes and state transitions. Catches the writeHomingOffset class of regressions. + +### PR6 — Discovery improvements + canonical joint mapping + +- Worker-pool parallelize port scanning (`discovery.go:82-93`). Cap concurrency at 6 to avoid pathological hub behavior. Use `errgroup.WithContext` for cancellation propagation. +- Multi-baudrate discovery: either delegate to `feetech.Bus.Discover` (which sweeps internally) or explicit sweep `[1000000, 500000, 115200, 57600]`. Decision driven by what `Bus.Discover` actually does — verify in implementation. +- Extract canonical joint mapping into a shared file (likely `config.go` or new `joints.go`): + ```go + var SO101Joints = []struct{ ID int; Name string }{ + {1, "shoulder_pan"}, {2, "shoulder_lift"}, {3, "elbow_flex"}, + {4, "wrist_flex"}, {5, "wrist_roll"}, {6, "gripper"}, + } + ``` +- Replace `expectedMotors` literals at `calibration.go:1026-1033, :1105` and similar in discovery/arm with references to the canonical mapping. + +### PR7 — Documentation + +- Add `speed_degs_per_sec`, `acceleration_degs_per_sec_per_sec`, `motion` to the arm attributes table in README. +- Document gripper `calibrate_positions`, `set_motion_params`, `get_motion_params`. +- Document arm `set_speed`, `set_acceleration`, `get_motion_params`. +- Document the missing ~half of calibration sensor DoCommands. +- Fix the dead `MOTOR_SETUP.md` reference (line 432). +- Lift the duplicated "Communication" boilerplate (~25 lines × 3 sites) into one section; cross-link from each model. +- Add a table of contents. +- Add package-level godoc in a new `doc.go`. Backfill exported-symbol comments to ~80% coverage (today: ~30%). + +### PR8 — CI + build hygiene + +- New `.github/workflows/ci.yml` running on PRs: `go vet ./...`, `go test -race ./...`, `make`. Would have caught both the `writeHomingOffset` SA4006-class issue (unused parameter) and the `cmd/cli/` build break. +- Add `staticcheck` to `make lint`. +- Hardware build tag: `//go:build hardware` on tests currently using `t.Skip("hardware-dependent")`. Add `make test-hardware` target. +- Fix `cmd/cli/`: split the kept tools (`debug_cli`, `position_reader`, `read_servo`, `torque_disable`) into per-directory `cmd//main.go`. Delete throwaways (`simple_test_try`, `sync_test_again`, `gentle_move`, `raw_servo`). Add a `make tools` target that builds the kept ones. + +## Test strategy + +Test infrastructure is built incrementally inside PRs that need it: + +- **PR1** introduces the mock-bus harness based on `feetech/_examples/mock_transport`. From this PR onward, lifecycle/move/calibration tests can use a real `feetech.Bus` over an in-process transport without hardware. +- **PR2** adds the first move-path tests using the harness. +- **PR3** adds pure-logic tests for conversion math (no harness needed). +- **PR4** adds surface tests using the harness (gripper inference, opMgr cancellation). +- **PR5** adds the first state-machine tests using the harness. + +By PR8, the existing `t.Skip("hardware-dependent")` tests in `registry_test.go` should be re-evaluated. Many can be re-enabled with the harness; the rest are tagged `//go:build hardware` and excluded from the default run. + +## Migration & compatibility + +- **PR2** is the first user-observable behavior change: speed/accel actually take effect. Users who configured these expecting them to do nothing (unlikely but possible) will see slower motion. Document in release notes. +- **PR4** is the second: a second `MoveToJointPositions` request cancels the first instead of queueing. Document in release notes. Likely to surface in the motion service's corrective-move path; this is the *correct* behavior per RDK convention. +- All other PRs are refactors or surface fixes with no user-observable contract changes. + +## Open questions + +- `cmd/cli/` triage: the 4 "kept" tools listed in PR8 are a guess — Nick may want a different cut. Settled at implementation time by skimming each tool's actual utility. +- Whether to extract `internal/` packages comes after PR8 in the architecture revisit, not in this spec. +- `Reconfigure` support (today everything is `AlwaysRebuild`) is not in scope; revisit when the architecture decision is made. + +## Architecture revisit (post-PR8) + +After PR8, evaluate: + +1. Are the singleton's remaining boundary issues (cross-file access from calibration sensor into `cs.controller.bus`, the controller still being a separate type from the arm) painful enough to justify the arm-as-core flip? +2. Would the gripper-as-DoCommand-client model actually solve those issues, or just relocate them? +3. What's the cost of the migration vs. the cost of living with the cleaned-up singleton? + +Decision and ADR live in a separate spec written at that time. From ce9d699f8fdcb43a76a4da7f938ad97a313a6adc Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 14:46:48 -0400 Subject: [PATCH 02/15] docs: refine SO-101 remediation spec per Nick's review - Move commented radians-block deletion fully into PR3 (no overlap with PR5) - Commit PR6 to delegating multi-baudrate sweep to feetech.Bus.Discover - Narrow PR8 cmd/cli triage to keeping only debug_cli; delete the other 8 Co-Authored-By: Claude Opus 4.7 (1M context) --- .../specs/2026-04-27-so101-remediation-design.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/superpowers/specs/2026-04-27-so101-remediation-design.md b/docs/superpowers/specs/2026-04-27-so101-remediation-design.md index ba71f7b..b8ac4ab 100644 --- a/docs/superpowers/specs/2026-04-27-so101-remediation-design.md +++ b/docs/superpowers/specs/2026-04-27-so101-remediation-design.md @@ -100,14 +100,14 @@ Sensor stays one resource (web app constraint). No structural extraction. - State-machine race teardown: consistent `if cs.recordingCancel != nil` checks in `Close`, `abortCalibration`, `resetCalibration`, `stopRangeRecording`. Audit for double-call. - Replace `positionHistory []map[int]int` (`calibration.go:540-630`) with `var sampleCount atomic.Int64`. The history was only ever read for `len()`. Saves ~100KB churn during recording. -- Delete commented-out radians-conversion blocks (`:427-457`, `:573-603` — already covered in PR3 if we get there first; otherwise here). - Replace emoji status strings (`calibration.go:1077-1080`) with plain text. +- (Commented-out radians-conversion blocks at `:427-457` and `:573-603` are deleted in PR3.) - Tests: state-machine progression test (idle → started → homing_position → range_recording → completed), backed by mock bus. Asserts on register writes and state transitions. Catches the writeHomingOffset class of regressions. ### PR6 — Discovery improvements + canonical joint mapping - Worker-pool parallelize port scanning (`discovery.go:82-93`). Cap concurrency at 6 to avoid pathological hub behavior. Use `errgroup.WithContext` for cancellation propagation. -- Multi-baudrate discovery: either delegate to `feetech.Bus.Discover` (which sweeps internally) or explicit sweep `[1000000, 500000, 115200, 57600]`. Decision driven by what `Bus.Discover` actually does — verify in implementation. +- Multi-baudrate discovery: delegate to `feetech.Bus.Discover`, which sweeps the package's `DefaultBaudRates` internally. Discovery loop opens a bus per candidate port at a placeholder baudrate, then calls `Discover(ctx)` and lets feetech do the sweep. If `Discover` returns no servos, the port is dropped from the result set. - Extract canonical joint mapping into a shared file (likely `config.go` or new `joints.go`): ```go var SO101Joints = []struct{ ID int; Name string }{ @@ -133,7 +133,7 @@ Sensor stays one resource (web app constraint). No structural extraction. - New `.github/workflows/ci.yml` running on PRs: `go vet ./...`, `go test -race ./...`, `make`. Would have caught both the `writeHomingOffset` SA4006-class issue (unused parameter) and the `cmd/cli/` build break. - Add `staticcheck` to `make lint`. - Hardware build tag: `//go:build hardware` on tests currently using `t.Skip("hardware-dependent")`. Add `make test-hardware` target. -- Fix `cmd/cli/`: split the kept tools (`debug_cli`, `position_reader`, `read_servo`, `torque_disable`) into per-directory `cmd//main.go`. Delete throwaways (`simple_test_try`, `sync_test_again`, `gentle_move`, `raw_servo`). Add a `make tools` target that builds the kept ones. +- Fix `cmd/cli/`: keep only `debug_cli` — move it to `cmd/debug/main.go`. Delete the other 8 files (`main.go`, `gentle_move.go`, `position_reader.go`, `raw_servo.go`, `read_servo.go`, `simple_test_try.go`, `sync_test_again.go`, `torque_disable.go`). Add a `make tools` target that builds `cmd/debug/`. ## Test strategy @@ -155,7 +155,6 @@ By PR8, the existing `t.Skip("hardware-dependent")` tests in `registry_test.go` ## Open questions -- `cmd/cli/` triage: the 4 "kept" tools listed in PR8 are a guess — Nick may want a different cut. Settled at implementation time by skimming each tool's actual utility. - Whether to extract `internal/` packages comes after PR8 in the architecture revisit, not in this spec. - `Reconfigure` support (today everything is `AlwaysRebuild`) is not in scope; revisit when the architecture decision is made. From 9c2d4daf0f0fe614f69aa234823490b26ee66a1a Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 15:02:39 -0400 Subject: [PATCH 03/15] docs: add PR1 lifecycle-correctness implementation plan Includes mock-bus harness scaffolding for reuse across PR2-PR5. Reviewer feedback applied: drop unused atomic import in test, keep strings import in registry.go, scope verification to package path until cmd/cli is fixed in PR8, comment harness design choice. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-04-27-pr1-lifecycle-correctness.md | 1045 +++++++++++++++++ 1 file changed, 1045 insertions(+) create mode 100644 docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md diff --git a/docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md b/docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md new file mode 100644 index 0000000..9d44f85 --- /dev/null +++ b/docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md @@ -0,0 +1,1045 @@ +# PR1: Lifecycle Correctness Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Eliminate three lifecycle bugs in the SO-101 module — `runtime.Caller`-based reference tracking, `s.initCtx` reuse after construction, and the gripper-closes-first race when arm + gripper share a controller — without changing user-observable behavior. + +**Architecture:** Replace caller-PC tracking with explicit port handles stored on each component. Stop reusing the constructor's context after `NewSO101` returns; thread per-call ctx through all controller-touching helpers. Make the registry return the same `*SafeSoArmController` pointer to all callers per port, with an internal `closed` flag that all controller methods check. + +**Tech Stack:** Go 1.25, Viam RDK (`go.viam.com/rdk`), `github.com/hipsterbrown/feetech-servo` v0.4.0, `feetech.MockTransport` for tests. + +**Spec reference:** `docs/superpowers/specs/2026-04-27-so101-remediation-design.md` § PR1. + +--- + +## File Structure + +**Created:** +- `mock_bus_test.go` — test-only helper that builds a `feetech.Bus` with a `MockTransport` and wires it into a `SafeSoArmController` via the registry. Reused by PR2-PR5. +- `lifecycle_test.go` — test file for this PR's correctness behaviors (refcount, same-pointer, post-close error, port-explicit release). + +**Modified:** +- `manager.go` — add `closed atomic.Bool` field + `ErrControllerClosed` sentinel; check at the top of every method; remove `releaseFromCaller` plumbing. +- `registry.go` — delete `callerPorts`, `callerMu`, `trackCaller`, `releaseFromCaller`; cache the controller pointer in `ControllerEntry` and return that exact pointer from `getExistingController` / `createNewController`; `ReleaseController` sets `closed` before closing the bus. +- `arm.go` — store `controllerPort string` on the `so101` struct; `Close` calls `globalRegistry.ReleaseController(s.controllerPort)`; remove `initCtx` field; `doServoInitialization`, `diagnoseConnection`, `verifyServoConfig` take `ctx context.Context` parameters. +- `gripper.go` — store `controllerPort string` on `so101Gripper`; `Close` calls `globalRegistry.ReleaseController(g.controllerPort)`. +- `calibration.go` — store `controllerPort string` on `so101CalibrationSensor`; `Close` calls `globalRegistry.ReleaseController(cs.controllerPort)`. + +**Deleted (within `manager.go`):** `ReleaseSharedController()` (the caller-PC variant) is removed; callers move to the explicit port form. + +**Pre-existing tests touched:** +- `registry_test.go` — drop now-meaningless tests of `callerPorts`/`releaseFromCaller`; the new tests in `lifecycle_test.go` cover the replacement contract. + +--- + +## Task 1: Mock-bus test harness + +This harness is the foundation for every test in this PR (and PR2-PR5). It must come first. + +**Files:** +- Create: `mock_bus_test.go` + +- [ ] **Step 1: Confirm `MockTransport` is exported and usable** + +Run: `grep -n "type MockTransport" $HOME/go/pkg/mod/github.com/hipsterbrown/feetech-servo@v0.4.0/feetech/transport.go` +Expected: line showing `type MockTransport struct {` (already verified during planning, but re-confirm). + +- [ ] **Step 2: Write the harness file** + +```go +package so_arm + +import ( + "sync" + "testing" + "time" + + "github.com/hipsterbrown/feetech-servo/feetech" + "go.viam.com/rdk/logging" +) + +// scriptedMockTransport is a custom Transport (not feetech.MockTransport) whose +// Read responses are queued per-request. We use a custom type rather than the +// upstream MockTransport because tests in PR2-PR5 need to script multiple +// round-trips with different responses, and MockTransport's single ReadData +// buffer doesn't support that pattern cleanly. +type scriptedMockTransport struct { + mu sync.Mutex + written []byte + responses [][]byte + closed bool +} + +func (m *scriptedMockTransport) Read(p []byte) (int, error) { + m.mu.Lock() + defer m.mu.Unlock() + if len(m.responses) == 0 { + return 0, nil + } + resp := m.responses[0] + n := copy(p, resp) + if n >= len(resp) { + m.responses = m.responses[1:] + } else { + m.responses[0] = resp[n:] + } + return n, nil +} + +func (m *scriptedMockTransport) Write(p []byte) (int, error) { + m.mu.Lock() + defer m.mu.Unlock() + m.written = append(m.written, p...) + return len(p), nil +} + +func (m *scriptedMockTransport) Close() error { + m.mu.Lock() + defer m.mu.Unlock() + m.closed = true + return nil +} + +func (m *scriptedMockTransport) SetReadTimeout(timeout time.Duration) error { + return nil +} + +func (m *scriptedMockTransport) Flush() error { + return nil +} + +// queueResponse appends a raw frame to the mock's response queue. +func (m *scriptedMockTransport) queueResponse(b []byte) { + m.mu.Lock() + defer m.mu.Unlock() + m.responses = append(m.responses, b) +} + +// reset clears the written-data buffer (responses queue is preserved). +func (m *scriptedMockTransport) resetWritten() { + m.mu.Lock() + defer m.mu.Unlock() + m.written = nil +} + +// snapshotWritten returns a copy of all bytes written since the last reset. +func (m *scriptedMockTransport) snapshotWritten() []byte { + m.mu.Lock() + defer m.mu.Unlock() + out := make([]byte, len(m.written)) + copy(out, m.written) + return out +} + +// pingResponse builds the ping ACK frame for a given servo ID using the STS protocol. +// Frame: 0xFF 0xFF 0x02 0x00 +func pingResponse(servoID byte) []byte { + checksum := ^(servoID + 0x02 + 0x00) + return []byte{0xFF, 0xFF, servoID, 0x02, 0x00, checksum} +} + +// newMockBus returns a *feetech.Bus backed by a scriptedMockTransport. +// Caller is responsible for queuing responses before issuing bus operations. +func newMockBus(t *testing.T) (*feetech.Bus, *scriptedMockTransport) { + t.Helper() + mock := &scriptedMockTransport{} + bus, err := feetech.NewBus(feetech.BusConfig{ + Transport: mock, + Protocol: feetech.ProtocolSTS, + Timeout: 100 * time.Millisecond, + }) + if err != nil { + t.Fatalf("newMockBus: NewBus failed: %v", err) + } + t.Cleanup(func() { _ = bus.Close() }) + return bus, mock +} + +// newTestLogger returns a logger that discards output unless the test fails. +func newTestLogger(t *testing.T) logging.Logger { + t.Helper() + return logging.NewTestLogger(t) +} +``` + +- [ ] **Step 3: Build to confirm it compiles** + +Run: `go build ./...` +Expected: no output (success). The harness has no test functions yet; it's pure infrastructure. + +- [ ] **Step 4: Sanity test the harness — write a tiny test that pings via mock** + +Append to `mock_bus_test.go`: + +```go +func TestMockBus_PingRoundTrip(t *testing.T) { + bus, mock := newMockBus(t) + mock.queueResponse(pingResponse(1)) + + if _, err := bus.Ping(t.Context(), 1); err != nil { + t.Fatalf("ping failed: %v", err) + } + + written := mock.snapshotWritten() + if len(written) < 6 { + t.Fatalf("expected ping packet to be written, got %d bytes: %X", len(written), written) + } + // Ping packet: 0xFF 0xFF + if written[0] != 0xFF || written[1] != 0xFF || written[2] != 0x01 { + t.Errorf("malformed ping packet: %X", written[:6]) + } +} +``` + +- [ ] **Step 5: Run the harness sanity test** + +Run: `go test -run TestMockBus_PingRoundTrip -v` +Expected: PASS. + +If `t.Context()` is not available (Go < 1.24), substitute `context.Background()` and add the `context` import. + +- [ ] **Step 6: Commit** + +```bash +git add mock_bus_test.go +git commit -m "test: add scripted mock-bus harness for so-101 tests" +``` + +--- + +## Task 2: Add `ErrControllerClosed` sentinel and `closed` flag + +User contract: any controller method called after `Close` returns `ErrControllerClosed`. Internal methods check via a single helper. + +**Files:** +- Modify: `manager.go` (add field, sentinel, helper; gate every public method) +- Test: `lifecycle_test.go` (new file) + +- [ ] **Step 1: Write failing test for the closed-flag contract** + +Create `lifecycle_test.go`: + +```go +package so_arm + +import ( + "context" + "errors" + "testing" +) + +// TestController_PostCloseReturnsSentinel verifies that calling any controller +// method after the bus has been closed returns ErrControllerClosed rather than +// a panic or a serial-port error. +func TestController_PostCloseReturnsSentinel(t *testing.T) { + bus, _ := newMockBus(t) + + ctrl := &SafeSoArmController{ + bus: bus, + logger: newTestLogger(t), + } + ctrl.closed.Store(true) + + if err := ctrl.Ping(context.Background()); !errors.Is(err, ErrControllerClosed) { + t.Errorf("Ping after close: expected ErrControllerClosed, got %v", err) + } + if err := ctrl.SetTorqueEnable(context.Background(), true); !errors.Is(err, ErrControllerClosed) { + t.Errorf("SetTorqueEnable after close: expected ErrControllerClosed, got %v", err) + } + if _, err := ctrl.GetJointPositionsForServos(context.Background(), []int{1}); !errors.Is(err, ErrControllerClosed) { + t.Errorf("GetJointPositionsForServos after close: expected ErrControllerClosed, got %v", err) + } +} +``` + +- [ ] **Step 2: Run the test to see it fail** + +Run: `go test -run TestController_PostCloseReturnsSentinel -v` +Expected: BUILD FAIL — `ErrControllerClosed` undefined, `closed` field missing on `SafeSoArmController`. + +- [ ] **Step 3: Add the sentinel and field to `manager.go`** + +In `manager.go`, in the imports block ensure `errors` and `sync/atomic` are imported (they already are; `sync/atomic` is present). Add at the top of the file, after the imports: + +```go +// ErrControllerClosed is returned by SafeSoArmController methods after the +// underlying bus has been closed via the registry. Callers holding a stale +// reference should treat this as a permanent failure for that controller. +var ErrControllerClosed = errors.New("so101: controller is closed") +``` + +Add `closed atomic.Bool` to the `SafeSoArmController` struct (currently lines 22-29): + +```go +type SafeSoArmController struct { + bus *feetech.Bus + group *feetech.ServoGroup + calibratedServos map[int]*CalibratedServo + logger logging.Logger + calibration SO101FullCalibration + mu sync.RWMutex + closed atomic.Bool +} +``` + +Add a helper just below the struct definition: + +```go +// checkClosed returns ErrControllerClosed if the controller has been released. +func (s *SafeSoArmController) checkClosed() error { + if s.closed.Load() { + return ErrControllerClosed + } + return nil +} +``` + +- [ ] **Step 4: Gate every public method on `SafeSoArmController`** + +For each of these methods in `manager.go`, add `if err := s.checkClosed(); err != nil { return err }` (or `return nil, err` for two-return methods, or `return SO101FullCalibration{}, err` for `GetCalibration`) as the *first* statement of the function body, **before** any lock acquisition: + +- `MoveToJointPositions` (line 31) +- `MoveServosToPositions` (line 60) +- `GetJointPositions` (line 94) +- `GetJointPositionsForServos` (line 132) +- `SetTorqueEnable` (line 161) +- `Stop` (line 177) +- `Ping` (line 199) +- `WriteServoRegister` (line 212) +- `SetCalibration` (line 224) + +Do NOT gate `GetCalibration` (line 246) or `getCalibrationForServo` (line 253) — these are read-only, lock-protected, and a stale read is acceptable. Do NOT gate `Close` (line 189); it is the legitimate path to set the flag (see next task). + +Example for `Ping`: + +```go +func (s *SafeSoArmController) Ping(ctx context.Context) error { + if err := s.checkClosed(); err != nil { + return err + } + s.mu.RLock() + defer s.mu.RUnlock() + // ... existing body +} +``` + +- [ ] **Step 5: Run the failing test to verify it now passes** + +Run: `go test -run TestController_PostCloseReturnsSentinel -v` +Expected: PASS. + +- [ ] **Step 6: Run the full test suite to confirm nothing regressed** + +Run: `go test ./...` +Expected: PASS (existing tests should be unaffected; if any registry tests fail, they're test-fixture issues addressed in Task 5). + +- [ ] **Step 7: Commit** + +```bash +git add manager.go lifecycle_test.go +git commit -m "feat(controller): add ErrControllerClosed sentinel and gate methods" +``` + +--- + +## Task 3: Same-pointer registry semantics + +Make `getExistingController` return the cached pointer instead of building a fresh struct copy. `ReleaseController` sets `closed` on the cached controller before closing the bus. + +**Files:** +- Modify: `registry.go` (cache pointer in entry, return it from getters, set closed on release) +- Modify: `lifecycle_test.go` (add same-pointer test) + +- [ ] **Step 1: Write failing test for same-pointer contract** + +Append to `lifecycle_test.go`: + +```go +// TestRegistry_SamePointerForSamePort verifies that two callers acquiring a +// controller for the same port receive the *same* *SafeSoArmController, so +// that close-state propagates correctly across all consumers. +func TestRegistry_SamePointerForSamePort(t *testing.T) { + registry := NewControllerRegistry() + port := "/dev/test-port" + cfg := testConfig(port) + + // Inject a pre-built entry so we don't need a real bus. + bus, _ := newMockBus(t) + ctrl := &SafeSoArmController{ + bus: bus, + logger: cfg.Logger, + } + registry.entries[port] = &ControllerEntry{ + controller: ctrl, + config: cfg, + calibration: DefaultSO101FullCalibration, + refCount: 0, + } + + first, err := registry.GetController(port, cfg, DefaultSO101FullCalibration, false) + if err != nil { + t.Fatalf("first GetController: %v", err) + } + second, err := registry.GetController(port, cfg, DefaultSO101FullCalibration, false) + if err != nil { + t.Fatalf("second GetController: %v", err) + } + + if first != second { + t.Errorf("expected same pointer for same port; got %p and %p", first, second) + } + if first != ctrl { + t.Errorf("expected cached controller pointer to be returned") + } +} +``` + +- [ ] **Step 2: Run the test to see it fail** + +Run: `go test -run TestRegistry_SamePointerForSamePort -v` +Expected: FAIL — current `getExistingController` builds a new `&SafeSoArmController{...}` (registry.go:98-104), so the pointers differ. + +- [ ] **Step 3: Modify `getExistingController` to return the cached pointer** + +In `registry.go`, replace the return statement at lines 95-104: + +```go + atomic.AddInt64(&entry.refCount, 1) + r.trackCaller(entry.config.Port) + + return &SafeSoArmController{ + bus: entry.controller.bus, + group: entry.controller.group, + calibratedServos: entry.controller.calibratedServos, + logger: config.Logger, + calibration: entry.calibration, + }, nil +} +``` + +with: + +```go + atomic.AddInt64(&entry.refCount, 1) + + // Return the cached pointer so that all consumers observe close-state + // (and any future calibration updates) atomically. + return entry.controller, nil +} +``` + +The `r.trackCaller` line goes too — it's deleted in Task 4. + +- [ ] **Step 4: Modify `createNewController` to also return the cached pointer** + +In `registry.go`, the bottom of `createNewController` (lines 220-226) currently returns a new struct. Change it to return the same pointer it just stored on `entry.controller`: + +```go + entry.controller = &SafeSoArmController{ + bus: bus, + group: group, + calibratedServos: calibratedServos, + logger: config.Logger, + calibration: finalCalibration, + } + entry.calibration = finalCalibration + entry.lastError = nil + atomic.StoreInt64(&entry.refCount, 1) + + r.entries[portPath] = entry + + if config.Logger != nil { + config.Logger.Debugf("Created new feetech servo bus with %d servos for port %s", len(calibratedServos), portPath) + } + + return entry.controller, nil +} +``` + +(The `r.trackCaller(portPath)` call goes too — deleted in Task 4.) + +- [ ] **Step 5: Run the test to verify it passes** + +Run: `go test -run TestRegistry_SamePointerForSamePort -v` +Expected: PASS. + +- [ ] **Step 6: Add and run the close-propagation test** + +Append to `lifecycle_test.go`: + +```go +// TestRegistry_ReleaseClosesAllConsumers verifies that ReleaseController +// at refcount zero closes the bus and sets the closed flag on the shared +// controller, so other holders observe ErrControllerClosed on next call. +func TestRegistry_ReleaseClosesAllConsumers(t *testing.T) { + registry := NewControllerRegistry() + port := "/dev/test-port" + cfg := testConfig(port) + + bus, _ := newMockBus(t) + ctrl := &SafeSoArmController{ + bus: bus, + logger: cfg.Logger, + } + registry.entries[port] = &ControllerEntry{ + controller: ctrl, + config: cfg, + calibration: DefaultSO101FullCalibration, + refCount: 2, // simulate arm + gripper both holding + } + + // First release: refcount drops to 1, controller stays alive. + registry.ReleaseController(port) + if ctrl.closed.Load() { + t.Fatalf("controller closed prematurely at refcount > 0") + } + + // Second release: refcount drops to 0, controller closes. + registry.ReleaseController(port) + if !ctrl.closed.Load() { + t.Errorf("expected controller.closed=true after final release") + } + if err := ctrl.Ping(t.Context()); !errors.Is(err, ErrControllerClosed) { + t.Errorf("Ping after final release: expected ErrControllerClosed, got %v", err) + } +} +``` + +Run: `go test -run TestRegistry_ReleaseClosesAllConsumers -v` +Expected: FAIL — `ReleaseController` doesn't yet set `closed`. + +- [ ] **Step 7: Set `closed` in `ReleaseController` before closing the bus** + +In `registry.go`, in `ReleaseController` (lines 229-259), modify the refcount-zero branch: + +```go + currentRefCount := atomic.AddInt64(&entry.refCount, -1) + if currentRefCount <= 0 { + if entry.controller != nil { + entry.controller.closed.Store(true) + if entry.controller.bus != nil { + if err := entry.controller.bus.Close(); err != nil && entry.config != nil && entry.config.Logger != nil { + entry.config.Logger.Warnf("error closing shared controller for port %s: %v", portPath, err) + } + } + } + + r.mu.Lock() + delete(r.entries, portPath) + r.mu.Unlock() + + entry.controller = nil + entry.config = nil + entry.calibration = SO101FullCalibration{} + atomic.StoreInt64(&entry.refCount, 0) + entry.lastError = nil + } +``` + +Also update `ForceCloseController` (lines 261-287) to set `closed` similarly: + +```go + if entry.controller != nil { + entry.controller.closed.Store(true) + err = entry.controller.bus.Close() + entry.controller = nil + // ... rest unchanged + } +``` + +- [ ] **Step 8: Run the test to verify it passes** + +Run: `go test -run TestRegistry_ReleaseClosesAllConsumers -v` +Expected: PASS. + +- [ ] **Step 9: Run the full suite** + +Run: `go test ./...` +Expected: PASS (some pre-existing `registry_test.go` tests will skip; that's fine). + +- [ ] **Step 10: Commit** + +```bash +git add registry.go lifecycle_test.go +git commit -m "feat(registry): cache controller pointer; propagate close to all holders" +``` + +--- + +## Task 4: Delete `runtime.Caller` machinery and add explicit-port release + +Replace `releaseFromCaller` with explicit `ReleaseController(port)` calls. Components store `controllerPort` on their struct. + +**Files:** +- Modify: `registry.go` (delete `callerPorts`, `callerMu`, `trackCaller`, `releaseFromCaller`) +- Modify: `manager.go` (delete `ReleaseSharedController()`) +- Modify: `arm.go` (store `controllerPort`, call explicit `Release` in `Close`) +- Modify: `gripper.go` (same pattern) +- Modify: `calibration.go` (same pattern) +- Modify: `lifecycle_test.go` (add explicit-port release test) + +- [ ] **Step 1: Write failing test for the explicit-port contract** + +Append to `lifecycle_test.go`: + +```go +// TestRegistry_ExplicitPortReleaseDecrementsRefcount verifies that callers +// can release a controller by passing the port path directly, with no +// dependence on runtime.Caller PC tracking. +func TestRegistry_ExplicitPortReleaseDecrementsRefcount(t *testing.T) { + registry := NewControllerRegistry() + port := "/dev/test-port" + cfg := testConfig(port) + bus, _ := newMockBus(t) + registry.entries[port] = &ControllerEntry{ + controller: &SafeSoArmController{bus: bus, logger: cfg.Logger}, + config: cfg, + refCount: 3, + } + + registry.ReleaseController(port) + + got := atomic.LoadInt64(®istry.entries[port].refCount) + if got != 2 { + t.Errorf("expected refCount=2 after release, got %d", got) + } +} +``` + +Run: `go test -run TestRegistry_ExplicitPortReleaseDecrementsRefcount -v` +Expected: PASS (this already works — `ReleaseController(port)` exists; the test guards against future regression). + +- [ ] **Step 2: Delete `callerPorts`, `callerMu`, `trackCaller`, `releaseFromCaller` from `registry.go`** + +In `registry.go`: + +1. From the `ControllerRegistry` struct (lines 24-31), remove the `callerPorts` and `callerMu` fields: + +```go +type ControllerRegistry struct { + entries map[string]*ControllerEntry // port path -> entry + mu sync.RWMutex +} +``` + +2. From `NewControllerRegistry` (lines 33-38), remove the `callerPorts: ...` initializer: + +```go +func NewControllerRegistry() *ControllerRegistry { + return &ControllerRegistry{ + entries: make(map[string]*ControllerEntry), + } +} +``` + +3. Delete the `trackCaller` method (lines 332-341 — already-removed call sites in Task 3). + +4. Delete the `releaseFromCaller` method (lines 343-360). + +5. Remove the now-unused `runtime` import from the imports block. **Keep `strings`** — it's still used by `compareConfigs`. + +- [ ] **Step 3: Delete `ReleaseSharedController` from `manager.go`** + +In `manager.go`, delete the `ReleaseSharedController` function (lines 299-301): + +```go +func ReleaseSharedController() { + globalRegistry.releaseFromCaller() +} +``` + +Keep `GetSharedController`, `GetSharedControllerWithCalibration`, `ForceCloseSharedController`, `GetControllerStatus`, `GetCurrentCalibrationForPort`. Delete `GetCurrentCalibration` too — its godoc admits it's wrong, and PR3 was going to remove it anyway. Removing it now keeps the API surface honest. + +```go +// DELETE THIS FUNCTION: +// With multiple controllers, this returns the default calibration +// Use GetCurrentCalibrationForPort for port-specific calibration +func GetCurrentCalibration() SO101FullCalibration { + return DefaultSO101FullCalibration +} +``` + +- [ ] **Step 4: Add `controllerPort` field to `so101` and update `Close`** + +In `arm.go`: + +1. Add a field to the `so101` struct (lines 87-112), removing `initCtx`: + +```go +type so101 struct { + resource.AlwaysRebuild + + name resource.Name + logger logging.Logger + cfg *SO101ArmConfig + opMgr *operation.SingleOperationManager + controller *SafeSoArmController + controllerPort string // port path used to acquire the shared controller + + mu sync.RWMutex + moveLock sync.Mutex + isMoving atomic.Bool + model referenceframe.Model + + armServoIDs []int + + defaultSpeed float32 + defaultAcc float32 + + motion motion.Service + + cancelCtx context.Context + cancelFunc func() +} +``` + +2. In `NewSO101` (line 252-266), set `controllerPort` and remove `initCtx`: + +```go + arm := &so101{ + name: name, + cfg: conf, + opMgr: operation.NewSingleOperationManager(), + logger: logger, + controller: controller, + controllerPort: controllerConfig.Port, + model: model, + armServoIDs: conf.ServoIDs, + defaultSpeed: speedDegsPerSec, + defaultAcc: accelerationDegsPerSec, + motion: ms, + cancelCtx: cancelCtx, + cancelFunc: cancelFunc, + } +``` + +3. In the two `ReleaseSharedController()` cleanup-on-error sites (lines 230 and 274), replace with the explicit call: + +```go + model, err := makeSO101ModelFrame() + if err != nil { + globalRegistry.ReleaseController(controllerConfig.Port) + return nil, fmt.Errorf("failed to create kinematic model: %w", err) + } +``` + +and: + +```go + if err := arm.initializeServos(ctx); err != nil { + globalRegistry.ReleaseController(controllerConfig.Port) + return nil, fmt.Errorf("failed to initialize servos: %w", err) + } +``` + +(Note `initializeServos(ctx)` — fixed in Task 5.) + +4. In `Close` (lines 620-624): + +```go +func (s *so101) Close(context.Context) error { + s.cancelFunc() + globalRegistry.ReleaseController(s.controllerPort) + return nil +} +``` + +- [ ] **Step 5: Same pattern in `gripper.go`** + +In `gripper.go`: + +1. Add `controllerPort string` to `so101Gripper` struct (lines 58-76). +2. In `newSO101Gripper`, set `controllerPort: controllerConfig.Port` in the struct literal. +3. In `Close` (lines 341-344): + +```go +func (g *so101Gripper) Close(ctx context.Context) error { + globalRegistry.ReleaseController(g.controllerPort) + return nil +} +``` + +- [ ] **Step 6: Same pattern in `calibration.go`** + +In `calibration.go`: + +1. Add `controllerPort string` to `so101CalibrationSensor` struct (lines 108-135). +2. In `NewSO101CalibrationSensor`, set `controllerPort: controllerConfig.Port` in the struct literal. +3. In `Close` (lines 1221-1237), replace `ReleaseSharedController()` with `globalRegistry.ReleaseController(cs.controllerPort)`. + +- [ ] **Step 7: Build and confirm** + +Run: `go build ./...` +Expected: success. If there are remaining `ReleaseSharedController()` references (likely in `cmd/cli/`), leave them for PR8 — `cmd/cli/` is already broken on disk and not in the build path. + +If the build complains about `cmd/cli/` references during `./...`, scope the build to the module package: `go build .` + +- [ ] **Step 8: Run the full test suite** + +Run: `go test ./...` +Expected: PASS. Pre-existing `registry_test.go` tests that touched `callerPorts` need updating — see next task. + +- [ ] **Step 9: Commit** + +```bash +git add registry.go manager.go arm.go gripper.go calibration.go lifecycle_test.go +git commit -m "feat(registry): replace runtime.Caller tracking with explicit port handles" +``` + +--- + +## Task 5: Stop reusing `s.initCtx` + +Thread per-call `ctx` through the constructor's helpers. Remove the `initCtx` field. Update DoCommand handlers to pass their own ctx. + +**Files:** +- Modify: `arm.go` (remove `initCtx` field; helper signatures take ctx) +- Modify: `lifecycle_test.go` (add cancellation test if feasible without hardware) + +- [ ] **Step 1: Write a sentinel test that the helper accepts ctx** + +Append to `lifecycle_test.go`: + +```go +// TestArmHelpers_AcceptContext is a compile-time guard that the diagnose/ +// verify/initialize helpers take an explicit ctx parameter, not s.initCtx. +// This test documents the contract; the assertion is that the file compiles +// after the refactor. +func TestArmHelpers_AcceptContext(t *testing.T) { + // We can't construct a real *so101 without hardware, but we can assert + // the method set via type checks at the call site. If the methods change + // signature, this file will fail to compile. + var s *so101 + if s == nil { + t.Skip("documentation test; behavior covered indirectly via hardware tests") + } + _ = s.doServoInitialization + _ = s.diagnoseConnection + _ = s.verifyServoConfig +} +``` + +- [ ] **Step 2: Run the test to confirm it currently passes (the methods exist, just with wrong signatures)** + +Run: `go test -run TestArmHelpers_AcceptContext -v` +Expected: PASS (skipped at runtime). + +- [ ] **Step 3: Update `doServoInitialization` to take ctx** + +In `arm.go`: + +1. Change the signature of `doServoInitialization` (line 659): + +```go +func (s *so101) doServoInitialization(ctx context.Context) error { + s.logger.Debug("Pinging all servos...") + if err := s.controller.Ping(ctx); err != nil { + return fmt.Errorf("servo ping failed: %w", err) + } + // ... existing body, with `s.initCtx` replaced by `ctx` +} +``` + +Find every `s.initCtx` reference inside this function and replace with `ctx`. + +2. Change the signature of `initializeServosWithRetry` (line 632): + +```go +func (s *so101) initializeServosWithRetry(ctx context.Context, maxRetries int) error { + // ... existing body + for attempt := 1; attempt <= maxRetries; attempt++ { + // ... + if err := s.doServoInitialization(ctx); err != nil { + // ... + } + } +} +``` + +3. Change the signature of `initializeServos` (line 627): + +```go +func (s *so101) initializeServos(ctx context.Context) error { + return s.initializeServosWithRetry(ctx, 3) +} +``` + +4. Change `diagnoseConnection` (line 693) and `verifyServoConfig` (line 721) to take `ctx`: + +```go +func (s *so101) diagnoseConnection(ctx context.Context) error { + s.logger.Debug("Starting SO-101 arm connection diagnosis...") + // replace s.initCtx with ctx throughout +} + +func (s *so101) verifyServoConfig(ctx context.Context) error { + // replace s.initCtx with ctx throughout +} +``` + +- [ ] **Step 4: Update DoCommand handlers in `arm.go` to pass their own `ctx`** + +In `arm.go` `DoCommand` (line 448), the cases that call these helpers need to pass `ctx` (the function's own parameter): + +- `case "diagnose":` (line 472) — change `err := s.diagnoseConnection()` to `err := s.diagnoseConnection(ctx)`. +- `case "verify_config":` (line 479) — change `err := s.verifyServoConfig()` to `err := s.verifyServoConfig(ctx)`. +- `case "reinitialize":` (line 486) — change `err := s.initializeServosWithRetry(retries)` to `err := s.initializeServosWithRetry(ctx, retries)`. + +- [ ] **Step 5: Update the constructor call site** + +In `arm.go` `NewSO101` (line 273), change: + +```go + if err := arm.initializeServos(); err != nil { +``` + +to: + +```go + if err := arm.initializeServos(ctx); err != nil { +``` + +(The `ctx` here is `NewSO101`'s parameter, which is correct for construction-time init.) + +- [ ] **Step 6: Remove the `initCtx` field from the struct** + +The struct change was already made in Task 4 step 4 (when we listed the new struct without `initCtx`). Confirm that no compilation references to `s.initCtx` remain: + +Run: `grep -n "initCtx" arm.go` +Expected: no output. + +- [ ] **Step 7: Build and run tests** + +Run: `go build . && go test ./...` +Expected: PASS. + +- [ ] **Step 8: Commit** + +```bash +git add arm.go lifecycle_test.go +git commit -m "fix(arm): pass per-call ctx to init/diagnose/verify helpers" +``` + +--- + +## Task 6: Update pre-existing `registry_test.go` for new contract + +Some pre-existing tests reference `callerPorts` indirectly or assert state that no longer exists. Update them to test the new contract. + +**Files:** +- Modify: `registry_test.go` + +- [ ] **Step 1: Identify failing tests** + +Run: `go test -run TestRegistry -v` +Expected: most pass. Specifically inspect `TestRegistryCreation` (lines 30-48) — it asserts `registry.callerPorts != nil`, which no longer exists. + +- [ ] **Step 2: Update `TestRegistryCreation`** + +In `registry_test.go`, the test `TestRegistryCreation` currently asserts `callerPorts` is initialized. Remove that assertion: + +```go +func TestRegistryCreation(t *testing.T) { + registry := NewControllerRegistry() + + if registry == nil { + t.Fatal("NewControllerRegistry returned nil") + } + + if registry.entries == nil { + t.Fatal("Registry entries map not initialized") + } + + if len(registry.entries) != 0 { + t.Fatal("Registry should start empty") + } +} +``` + +- [ ] **Step 3: Run the full registry test file** + +Run: `go test -run TestRegistry -v` +Expected: PASS for non-skipped tests. + +- [ ] **Step 4: Commit** + +```bash +git add registry_test.go +git commit -m "test(registry): remove assertions on deleted callerPorts field" +``` + +--- + +## Task 7: Final verification + +Confirm the PR's intent end-to-end before declaring done. + +- [ ] **Step 1: Confirm no `runtime.Caller` references remain** + +Run: `grep -rn "runtime.Caller\|callerPorts\|releaseFromCaller\|trackCaller" *.go` +Expected: no output. + +- [ ] **Step 2: Confirm no `initCtx` references remain** + +Run: `grep -rn "initCtx" *.go` +Expected: no output. + +- [ ] **Step 3: Confirm the legacy `ReleaseSharedController` is gone** + +Run: `grep -n "func ReleaseSharedController" *.go` +Expected: no output. + +Run: `grep -n "ReleaseSharedController" *.go` +Expected: no output. (If output appears in `cmd/cli/`, leave it for PR8.) + +- [ ] **Step 4: Run the full test suite with race detector** + +Run: `go test -race .` +Expected: PASS, no race warnings. + +(Use the package-only path `.` rather than `./...` because `cmd/cli/` is currently broken on disk — multiple `package main` declarations in one directory. PR8 fixes it; until then, scope verification to the so_arm package.) + +- [ ] **Step 5: Build the module binary** + +Run: `make bin/arm` +Expected: success. + +- [ ] **Step 6: Verify the `make` target still works end-to-end** + +Run: `make` +Expected: success (builds module.tar.gz). + +- [ ] **Step 7: Open the PR** + +```bash +git push -u origin nhehr/speed_accel_setting +gh pr create --title "PR1: lifecycle correctness — explicit port handles, ctx, same-pointer registry" --body "$(cat <<'EOF' +## Summary +- Delete `runtime.Caller`-based reference tracking; replace with explicit `controllerPort` stored on each component +- Stop reusing `s.initCtx` after construction; thread per-call `ctx` through `doServoInitialization`, `initializeServosWithRetry`, `initializeServos`, `diagnoseConnection`, `verifyServoConfig`, and the corresponding DoCommand handlers +- Make `ControllerRegistry` return the cached `*SafeSoArmController` pointer for a port (instead of a fresh struct copy), so `Release` setting `closed=true` is observed by all holders — eliminates the gripper-closes-first race +- Add `ErrControllerClosed` sentinel; gate every public controller method +- Add scripted-mock-bus test harness (`mock_bus_test.go`); foundation for PR2-PR5 tests +- Delete `GetCurrentCalibration()` (its godoc admitted it returned the wrong value); callers should use `GetCurrentCalibrationForPort` + +Spec: docs/superpowers/specs/2026-04-27-so101-remediation-design.md § PR1. + +## Test plan +- [x] `go test -race ./...` passes +- [x] `make` succeeds +- [x] New tests cover: post-close sentinel returns, same-pointer-per-port, refcount-zero close propagates, explicit-port release decrements +- [ ] Manual smoke: rebuild module, run on hardware with arm + gripper sharing port, confirm both close cleanly without "controller leak" warnings +EOF +)" +``` + +(Skip the `gh pr create` if you'd rather review locally first.) + +--- + +## Done criteria + +- All tasks ticked. +- `go test -race ./...` passes. +- `make` produces a working module.tar.gz. +- Manual smoke on hardware confirms arm + gripper can both Close without leaving stale registry entries (verifiable via the `controller_status` DoCommand). From 5ee8cf27a210c49b302962f5e011b5a73f4598f6 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 15:53:12 -0400 Subject: [PATCH 04/15] test: add scripted mock-bus harness for so-101 tests Co-Authored-By: Claude Opus 4.7 (1M context) --- mock_bus_test.go | 135 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 mock_bus_test.go diff --git a/mock_bus_test.go b/mock_bus_test.go new file mode 100644 index 0000000..e47749a --- /dev/null +++ b/mock_bus_test.go @@ -0,0 +1,135 @@ +package so_arm + +import ( + "sync" + "testing" + "time" + + "github.com/hipsterbrown/feetech-servo/feetech" + "go.viam.com/rdk/logging" +) + +// scriptedMockTransport is a custom Transport (not feetech.MockTransport) whose +// Read responses are queued per-request. We use a custom type rather than the +// upstream MockTransport because tests in PR2-PR5 need to script multiple +// round-trips with different responses, and MockTransport's single ReadData +// buffer doesn't support that pattern cleanly. +type scriptedMockTransport struct { + mu sync.Mutex + written []byte + responses [][]byte + closed bool +} + +func (m *scriptedMockTransport) Read(p []byte) (int, error) { + m.mu.Lock() + defer m.mu.Unlock() + if len(m.responses) == 0 { + return 0, nil + } + resp := m.responses[0] + n := copy(p, resp) + if n >= len(resp) { + m.responses = m.responses[1:] + } else { + m.responses[0] = resp[n:] + } + return n, nil +} + +func (m *scriptedMockTransport) Write(p []byte) (int, error) { + m.mu.Lock() + defer m.mu.Unlock() + m.written = append(m.written, p...) + return len(p), nil +} + +func (m *scriptedMockTransport) Close() error { + m.mu.Lock() + defer m.mu.Unlock() + m.closed = true + return nil +} + +func (m *scriptedMockTransport) SetReadTimeout(timeout time.Duration) error { + return nil +} + +func (m *scriptedMockTransport) Flush() error { + return nil +} + +// queueResponse appends a raw frame to the mock's response queue. +func (m *scriptedMockTransport) queueResponse(b []byte) { + m.mu.Lock() + defer m.mu.Unlock() + m.responses = append(m.responses, b) +} + +// reset clears the written-data buffer (responses queue is preserved). +func (m *scriptedMockTransport) resetWritten() { + m.mu.Lock() + defer m.mu.Unlock() + m.written = nil +} + +// snapshotWritten returns a copy of all bytes written since the last reset. +func (m *scriptedMockTransport) snapshotWritten() []byte { + m.mu.Lock() + defer m.mu.Unlock() + out := make([]byte, len(m.written)) + copy(out, m.written) + return out +} + +// pingResponse builds the ping ACK frame for a given servo ID using the STS protocol. +// Frame: 0xFF 0xFF 0x02 0x00 +func pingResponse(servoID byte) []byte { + checksum := ^(servoID + 0x02 + 0x00) + return []byte{0xFF, 0xFF, servoID, 0x02, 0x00, checksum} +} + +// newMockBus returns a *feetech.Bus backed by a scriptedMockTransport. +// Caller is responsible for queuing responses before issuing bus operations. +func newMockBus(t *testing.T) (*feetech.Bus, *scriptedMockTransport) { + t.Helper() + mock := &scriptedMockTransport{} + bus, err := feetech.NewBus(feetech.BusConfig{ + Transport: mock, + Protocol: feetech.ProtocolSTS, + Timeout: 100 * time.Millisecond, + }) + if err != nil { + t.Fatalf("newMockBus: NewBus failed: %v", err) + } + t.Cleanup(func() { _ = bus.Close() }) + return bus, mock +} + +// newTestLogger returns a logger that discards output unless the test fails. +func newTestLogger(t *testing.T) logging.Logger { + t.Helper() + return logging.NewTestLogger(t) +} + +func TestMockBus_PingRoundTrip(t *testing.T) { + bus, mock := newMockBus(t) + // Bus.Ping does two round trips: a ping ACK, then a model-number register read. + // Queue both responses so the call completes. + mock.queueResponse(pingResponse(1)) + // Model number 777 (0x0309) read response: FF FF + mock.queueResponse([]byte{0xFF, 0xFF, 0x01, 0x04, 0x00, 0x09, 0x03, 0xEE}) + + if _, err := bus.Ping(t.Context(), 1); err != nil { + t.Fatalf("ping failed: %v", err) + } + + written := mock.snapshotWritten() + if len(written) < 6 { + t.Fatalf("expected ping packet to be written, got %d bytes: %X", len(written), written) + } + // Ping packet: 0xFF 0xFF + if written[0] != 0xFF || written[1] != 0xFF || written[2] != 0x01 { + t.Errorf("malformed ping packet: %X", written[:6]) + } +} From faa8a72f8ef86631c53114e6981a17d3d4096676 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 15:59:15 -0400 Subject: [PATCH 05/15] test: return io.EOF on empty mock response queue Code-review follow-up to 5ee8cf2. Returning (0, nil) caused feetech.Bus.readRawBytesLocked to busy-spin under the bus mutex until the deadline; io.EOF triggers its 1ms-sleep retry path instead. Important for PR5's planned 100Hz calibration reader. Also documents Flush no-op semantics. Co-Authored-By: Claude Opus 4.7 (1M context) --- mock_bus_test.go | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mock_bus_test.go b/mock_bus_test.go index e47749a..d283ec2 100644 --- a/mock_bus_test.go +++ b/mock_bus_test.go @@ -1,6 +1,7 @@ package so_arm import ( + "io" "sync" "testing" "time" @@ -25,7 +26,11 @@ func (m *scriptedMockTransport) Read(p []byte) (int, error) { m.mu.Lock() defer m.mu.Unlock() if len(m.responses) == 0 { - return 0, nil + // Match feetech.MockTransport semantics: returning io.EOF on an empty + // queue lets feetech.Bus.readRawBytesLocked fall into its 1ms-sleep + // retry path instead of busy-spinning under the bus mutex. Important + // for PR5's planned 100Hz calibration-sensor reader. + return 0, io.EOF } resp := m.responses[0] n := copy(p, resp) @@ -56,6 +61,9 @@ func (m *scriptedMockTransport) SetReadTimeout(timeout time.Duration) error { } func (m *scriptedMockTransport) Flush() error { + // No-op: tests that need to drop unconsumed responses should clear + // m.responses explicitly. Mirroring SerialTransport.Flush would risk + // silently swallowing scripted frames between operations. return nil } From 3eebff68e5d847206bb37c9a7219ac3470856fb0 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:01:32 -0400 Subject: [PATCH 06/15] feat(controller): add ErrControllerClosed sentinel and gate methods --- lifecycle_test.go | 30 ++++++++++++++++++++++++++++++ manager.go | 42 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+) create mode 100644 lifecycle_test.go diff --git a/lifecycle_test.go b/lifecycle_test.go new file mode 100644 index 0000000..ab5693c --- /dev/null +++ b/lifecycle_test.go @@ -0,0 +1,30 @@ +package so_arm + +import ( + "context" + "errors" + "testing" +) + +// TestController_PostCloseReturnsSentinel verifies that calling any controller +// method after the bus has been closed returns ErrControllerClosed rather than +// a panic or a serial-port error. +func TestController_PostCloseReturnsSentinel(t *testing.T) { + bus, _ := newMockBus(t) + + ctrl := &SafeSoArmController{ + bus: bus, + logger: newTestLogger(t), + } + ctrl.closed.Store(true) + + if err := ctrl.Ping(context.Background()); !errors.Is(err, ErrControllerClosed) { + t.Errorf("Ping after close: expected ErrControllerClosed, got %v", err) + } + if err := ctrl.SetTorqueEnable(context.Background(), true); !errors.Is(err, ErrControllerClosed) { + t.Errorf("SetTorqueEnable after close: expected ErrControllerClosed, got %v", err) + } + if _, err := ctrl.GetJointPositionsForServos(context.Background(), []int{1}); !errors.Is(err, ErrControllerClosed) { + t.Errorf("GetJointPositionsForServos after close: expected ErrControllerClosed, got %v", err) + } +} diff --git a/manager.go b/manager.go index 0eb0d07..59a809e 100644 --- a/manager.go +++ b/manager.go @@ -2,6 +2,7 @@ package so_arm import ( "context" + "errors" "fmt" "math" "sync" @@ -12,6 +13,11 @@ import ( "go.viam.com/rdk/utils" ) +// ErrControllerClosed is returned by SafeSoArmController methods after the +// underlying bus has been closed via the registry. Callers holding a stale +// reference should treat this as a permanent failure for that controller. +var ErrControllerClosed = errors.New("so101: controller is closed") + // isGripperServo checks if a servo ID is the gripper (servo 6) func isGripperServo(servoID int) bool { return servoID == 6 @@ -26,9 +32,21 @@ type SafeSoArmController struct { logger logging.Logger calibration SO101FullCalibration mu sync.RWMutex + closed atomic.Bool +} + +// checkClosed returns ErrControllerClosed if the controller has been released. +func (s *SafeSoArmController) checkClosed() error { + if s.closed.Load() { + return ErrControllerClosed + } + return nil } func (s *SafeSoArmController) MoveToJointPositions(ctx context.Context, jointAngles []float64, speed, acc int) error { + if err := s.checkClosed(); err != nil { + return err + } s.mu.Lock() defer s.mu.Unlock() @@ -58,6 +76,9 @@ func (s *SafeSoArmController) MoveToJointPositions(ctx context.Context, jointAng } func (s *SafeSoArmController) MoveServosToPositions(ctx context.Context, servoIDs []int, jointAngles []float64, speed, acc int) error { + if err := s.checkClosed(); err != nil { + return err + } s.mu.Lock() defer s.mu.Unlock() @@ -92,6 +113,9 @@ func (s *SafeSoArmController) MoveServosToPositions(ctx context.Context, servoID } func (s *SafeSoArmController) GetJointPositions(ctx context.Context) ([]float64, error) { + if err := s.checkClosed(); err != nil { + return nil, err + } s.mu.RLock() defer s.mu.RUnlock() @@ -130,6 +154,9 @@ func (s *SafeSoArmController) GetJointPositions(ctx context.Context) ([]float64, } func (s *SafeSoArmController) GetJointPositionsForServos(ctx context.Context, servoIDs []int) ([]float64, error) { + if err := s.checkClosed(); err != nil { + return nil, err + } s.mu.RLock() defer s.mu.RUnlock() @@ -159,6 +186,9 @@ func (s *SafeSoArmController) GetJointPositionsForServos(ctx context.Context, se } func (s *SafeSoArmController) SetTorqueEnable(ctx context.Context, enable bool) error { + if err := s.checkClosed(); err != nil { + return err + } s.mu.Lock() defer s.mu.Unlock() @@ -175,6 +205,9 @@ func (s *SafeSoArmController) SetTorqueEnable(ctx context.Context, enable bool) } func (s *SafeSoArmController) Stop(ctx context.Context) error { + if err := s.checkClosed(); err != nil { + return err + } s.mu.Lock() defer s.mu.Unlock() @@ -197,6 +230,9 @@ func (s *SafeSoArmController) Close() error { } func (s *SafeSoArmController) Ping(ctx context.Context) error { + if err := s.checkClosed(); err != nil { + return err + } s.mu.RLock() defer s.mu.RUnlock() @@ -210,6 +246,9 @@ func (s *SafeSoArmController) Ping(ctx context.Context) error { // WriteServoRegister writes to a specific servo register by name func (s *SafeSoArmController) WriteServoRegister(ctx context.Context, servoID int, registerName string, data []byte) error { + if err := s.checkClosed(); err != nil { + return err + } s.mu.Lock() defer s.mu.Unlock() @@ -222,6 +261,9 @@ func (s *SafeSoArmController) WriteServoRegister(ctx context.Context, servoID in } func (s *SafeSoArmController) SetCalibration(calibration SO101FullCalibration) error { + if err := s.checkClosed(); err != nil { + return err + } s.mu.Lock() defer s.mu.Unlock() From ec1b03445b9010159d68375374d5b4eddb5eea68 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:05:50 -0400 Subject: [PATCH 07/15] test(controller): table-drive close-sentinel test across all 9 methods Code-review follow-up to 3eebff6. Covers every gated method so a silently-added 10th method without checkClosed gating is a visible omission. Pattern reused by Task 3-5 tests. Co-Authored-By: Claude Opus 4.7 (1M context) --- lifecycle_test.go | 62 ++++++++++++++++++++++++++++++++++------------- 1 file changed, 45 insertions(+), 17 deletions(-) diff --git a/lifecycle_test.go b/lifecycle_test.go index ab5693c..dbc41a6 100644 --- a/lifecycle_test.go +++ b/lifecycle_test.go @@ -6,25 +6,53 @@ import ( "testing" ) -// TestController_PostCloseReturnsSentinel verifies that calling any controller -// method after the bus has been closed returns ErrControllerClosed rather than -// a panic or a serial-port error. +// TestController_PostCloseReturnsSentinel verifies that every gated controller +// method returns ErrControllerClosed when the controller's closed flag is set, +// rather than panicking or hitting the (closed) bus. Table-driven so adding a +// new method to the gated list without an entry here is a visible omission. func TestController_PostCloseReturnsSentinel(t *testing.T) { - bus, _ := newMockBus(t) - - ctrl := &SafeSoArmController{ - bus: bus, - logger: newTestLogger(t), + ctx := context.Background() + cases := []struct { + name string + call func(*SafeSoArmController) error + }{ + {"Ping", func(c *SafeSoArmController) error { return c.Ping(ctx) }}, + {"SetTorqueEnable", func(c *SafeSoArmController) error { return c.SetTorqueEnable(ctx, true) }}, + {"Stop", func(c *SafeSoArmController) error { return c.Stop(ctx) }}, + {"MoveToJointPositions", func(c *SafeSoArmController) error { + return c.MoveToJointPositions(ctx, []float64{0, 0, 0, 0, 0}, 0, 0) + }}, + {"MoveServosToPositions", func(c *SafeSoArmController) error { + return c.MoveServosToPositions(ctx, []int{1}, []float64{0}, 0, 0) + }}, + {"WriteServoRegister", func(c *SafeSoArmController) error { + return c.WriteServoRegister(ctx, 1, "goal_position", []byte{0, 0}) + }}, + {"SetCalibration", func(c *SafeSoArmController) error { + return c.SetCalibration(SO101FullCalibration{}) + }}, + {"GetJointPositions", func(c *SafeSoArmController) error { + _, err := c.GetJointPositions(ctx) + return err + }}, + {"GetJointPositionsForServos", func(c *SafeSoArmController) error { + _, err := c.GetJointPositionsForServos(ctx, []int{1}) + return err + }}, } - ctrl.closed.Store(true) - if err := ctrl.Ping(context.Background()); !errors.Is(err, ErrControllerClosed) { - t.Errorf("Ping after close: expected ErrControllerClosed, got %v", err) - } - if err := ctrl.SetTorqueEnable(context.Background(), true); !errors.Is(err, ErrControllerClosed) { - t.Errorf("SetTorqueEnable after close: expected ErrControllerClosed, got %v", err) - } - if _, err := ctrl.GetJointPositionsForServos(context.Background(), []int{1}); !errors.Is(err, ErrControllerClosed) { - t.Errorf("GetJointPositionsForServos after close: expected ErrControllerClosed, got %v", err) + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + bus, _ := newMockBus(t) + ctrl := &SafeSoArmController{ + bus: bus, + logger: newTestLogger(t), + } + ctrl.closed.Store(true) + + if err := tc.call(ctrl); !errors.Is(err, ErrControllerClosed) { + t.Errorf("%s after close: expected ErrControllerClosed, got %v", tc.name, err) + } + }) } } From 16140373f8aa16ce7540c19f277fc068fd8b0f73 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:10:11 -0400 Subject: [PATCH 08/15] feat(registry): cache controller pointer; propagate close to all holders GetController now returns the cached *SafeSoArmController pointer instead of building a fresh struct copy per caller. ReleaseController and ForceCloseController both set closed=true on the cached controller before closing the bus, so all consumers (arm + gripper sharing a port) observe ErrControllerClosed atomically on next call. Also bundles two Task 2 follow-ups flagged by the code-quality review: - Document the checkClosed race window: it is best-effort and callers must hold a registry refcount for the duration of any controller call. - Make Close idempotent: set s.closed.Store(true) at the top so direct callers (tests bypassing the registry) get the same close contract as registry-driven release. The trackCaller function definition is left in place for Task 4 to delete; only the two call sites are removed here. --- lifecycle_test.go | 74 +++++++++++++++++++++++++++++++++++++++++++++++ manager.go | 2 ++ registry.go | 31 +++++++------------- 3 files changed, 87 insertions(+), 20 deletions(-) diff --git a/lifecycle_test.go b/lifecycle_test.go index dbc41a6..2370c52 100644 --- a/lifecycle_test.go +++ b/lifecycle_test.go @@ -56,3 +56,77 @@ func TestController_PostCloseReturnsSentinel(t *testing.T) { }) } } + +// TestRegistry_SamePointerForSamePort verifies that two callers acquiring a +// controller for the same port receive the *same* *SafeSoArmController, so +// that close-state propagates correctly across all consumers. +func TestRegistry_SamePointerForSamePort(t *testing.T) { + registry := NewControllerRegistry() + port := "/dev/test-port" + cfg := testConfig(port) + + // Inject a pre-built entry so we don't need a real bus. + bus, _ := newMockBus(t) + ctrl := &SafeSoArmController{ + bus: bus, + logger: cfg.Logger, + } + registry.entries[port] = &ControllerEntry{ + controller: ctrl, + config: cfg, + calibration: DefaultSO101FullCalibration, + refCount: 0, + } + + first, err := registry.GetController(port, cfg, DefaultSO101FullCalibration, false) + if err != nil { + t.Fatalf("first GetController: %v", err) + } + second, err := registry.GetController(port, cfg, DefaultSO101FullCalibration, false) + if err != nil { + t.Fatalf("second GetController: %v", err) + } + + if first != second { + t.Errorf("expected same pointer for same port; got %p and %p", first, second) + } + if first != ctrl { + t.Errorf("expected cached controller pointer to be returned") + } +} + +// TestRegistry_ReleaseClosesAllConsumers verifies that ReleaseController +// at refcount zero closes the bus and sets the closed flag on the shared +// controller, so other holders observe ErrControllerClosed on next call. +func TestRegistry_ReleaseClosesAllConsumers(t *testing.T) { + registry := NewControllerRegistry() + port := "/dev/test-port" + cfg := testConfig(port) + + bus, _ := newMockBus(t) + ctrl := &SafeSoArmController{ + bus: bus, + logger: cfg.Logger, + } + registry.entries[port] = &ControllerEntry{ + controller: ctrl, + config: cfg, + calibration: DefaultSO101FullCalibration, + refCount: 2, // simulate arm + gripper both holding + } + + // First release: refcount drops to 1, controller stays alive. + registry.ReleaseController(port) + if ctrl.closed.Load() { + t.Fatalf("controller closed prematurely at refcount > 0") + } + + // Second release: refcount drops to 0, controller closes. + registry.ReleaseController(port) + if !ctrl.closed.Load() { + t.Errorf("expected controller.closed=true after final release") + } + if err := ctrl.Ping(t.Context()); !errors.Is(err, ErrControllerClosed) { + t.Errorf("Ping after final release: expected ErrControllerClosed, got %v", err) + } +} diff --git a/manager.go b/manager.go index 59a809e..c37532e 100644 --- a/manager.go +++ b/manager.go @@ -36,6 +36,7 @@ type SafeSoArmController struct { } // checkClosed returns ErrControllerClosed if the controller has been released. +// checkClosed is best-effort: callers must hold a registry refcount for the duration of any controller call. A concurrent ReleaseController can race the unlocked Load() with an in-flight method. func (s *SafeSoArmController) checkClosed() error { if s.closed.Load() { return ErrControllerClosed @@ -223,6 +224,7 @@ func (s *SafeSoArmController) Close() error { s.mu.Lock() defer s.mu.Unlock() + s.closed.Store(true) if s.bus != nil { return s.bus.Close() } diff --git a/registry.go b/registry.go index 2493377..5a87377 100644 --- a/registry.go +++ b/registry.go @@ -93,15 +93,10 @@ func (r *ControllerRegistry) getExistingController(entry *ControllerEntry, confi } atomic.AddInt64(&entry.refCount, 1) - r.trackCaller(entry.config.Port) - return &SafeSoArmController{ - bus: entry.controller.bus, - group: entry.controller.group, - calibratedServos: entry.controller.calibratedServos, - logger: config.Logger, - calibration: entry.calibration, - }, nil + // Return the cached pointer so that all consumers observe close-state + // (and any future calibration updates) atomically. + return entry.controller, nil } func (r *ControllerRegistry) createNewController(portPath string, config *SoArm101Config, calibration SO101FullCalibration, fromFile bool) (*SafeSoArmController, error) { @@ -211,19 +206,11 @@ func (r *ControllerRegistry) createNewController(portPath string, config *SoArm1 r.entries[portPath] = entry - r.trackCaller(portPath) - if config.Logger != nil { config.Logger.Debugf("Created new feetech servo bus with %d servos for port %s", len(calibratedServos), portPath) } - return &SafeSoArmController{ - bus: bus, - group: group, - calibratedServos: calibratedServos, - logger: config.Logger, - calibration: finalCalibration, - }, nil + return entry.controller, nil } func (r *ControllerRegistry) ReleaseController(portPath string) { @@ -240,9 +227,12 @@ func (r *ControllerRegistry) ReleaseController(portPath string) { currentRefCount := atomic.AddInt64(&entry.refCount, -1) if currentRefCount <= 0 { - if entry.controller != nil && entry.controller.bus != nil { - if err := entry.controller.bus.Close(); err != nil && entry.config != nil && entry.config.Logger != nil { - entry.config.Logger.Warnf("error closing shared controller for port %s: %v", portPath, err) + if entry.controller != nil { + entry.controller.closed.Store(true) + if entry.controller.bus != nil { + if err := entry.controller.bus.Close(); err != nil && entry.config != nil && entry.config.Logger != nil { + entry.config.Logger.Warnf("error closing shared controller for port %s: %v", portPath, err) + } } } @@ -275,6 +265,7 @@ func (r *ControllerRegistry) ForceCloseController(portPath string) error { var err error if entry.controller != nil { + entry.controller.closed.Store(true) err = entry.controller.bus.Close() entry.controller = nil entry.config = nil From a30449c49c5e099911253995905b29d9e8e7f019 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:16:21 -0400 Subject: [PATCH 09/15] fix(registry): guard nil entry.config in not-available error path Code-review follow-up to 1614037. ReleaseController nils entry.config at refcount zero; a concurrent GetController that blocked on entry.mu during release would observe a nil config and panic on the error formatting path. Prefer the caller's config.Port as fallback. Co-Authored-By: Claude Opus 4.7 (1M context) --- registry.go | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/registry.go b/registry.go index 5a87377..ce62a73 100644 --- a/registry.go +++ b/registry.go @@ -57,7 +57,13 @@ func (r *ControllerRegistry) getExistingController(entry *ControllerEntry, confi if entry.lastError != nil { return nil, fmt.Errorf("cached controller creation error: %w", entry.lastError) } - return nil, fmt.Errorf("controller not available for port %s", entry.config.Port) + // entry.config may have been nil'd by a concurrent ReleaseController + // at refcount zero. Prefer the caller's config to avoid a nil deref. + port := config.Port + if entry.config != nil { + port = entry.config.Port + } + return nil, fmt.Errorf("controller not available for port %s", port) } if !configsEqual(entry.config, config) { From 40e7c287879a7f6696235cc40944aa88d991c820 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:21:04 -0400 Subject: [PATCH 10/15] feat(registry): replace runtime.Caller tracking with explicit port handles Components now store their controllerPort and call globalRegistry.ReleaseController(port) directly, replacing the brittle runtime.Caller(2)/Caller(3) PC-based release path. Removes callerPorts, callerMu, trackCaller, releaseFromCaller from registry.go and the ReleaseSharedController + GetCurrentCalibration shims from manager.go. Updates TestRegistryCreation to no longer reference the deleted callerPorts field. Adds TestRegistry_ExplicitPortReleaseDecrementsRefcount to lock in the new contract. Also removes the so101.initCtx field; the three call sites that read it (doServoInitialization, diagnoseConnection, verifyServoConfig) now use s.cancelCtx as a temporary placeholder. Task 5 will thread per-call ctx properly. Co-Authored-By: Claude Opus 4.7 (1M context) --- arm.go | 56 +++++++++++++++++++++++------------------------ calibration.go | 12 +++++----- gripper.go | 14 +++++++----- lifecycle_test.go | 23 +++++++++++++++++++ manager.go | 10 --------- registry.go | 38 +------------------------------- registry_test.go | 4 ---- 7 files changed, 67 insertions(+), 90 deletions(-) diff --git a/arm.go b/arm.go index 505b607..25538ed 100644 --- a/arm.go +++ b/arm.go @@ -87,11 +87,12 @@ func (cfg *SO101ArmConfig) Validate(path string) ([]string, []string, error) { type so101 struct { resource.AlwaysRebuild - name resource.Name - logger logging.Logger - cfg *SO101ArmConfig - opMgr *operation.SingleOperationManager - controller *SafeSoArmController + name resource.Name + logger logging.Logger + cfg *SO101ArmConfig + opMgr *operation.SingleOperationManager + controller *SafeSoArmController + controllerPort string // port path used to acquire the shared controller mu sync.RWMutex moveLock sync.Mutex @@ -108,7 +109,6 @@ type so101 struct { cancelCtx context.Context cancelFunc func() - initCtx context.Context // Context for initialization operations } func makeSO101ModelFrame() (referenceframe.Model, error) { @@ -227,7 +227,7 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam model, err := makeSO101ModelFrame() if err != nil { - ReleaseSharedController() // Clean up on error + globalRegistry.ReleaseController(controllerConfig.Port) return nil, fmt.Errorf("failed to create kinematic model: %w", err) } @@ -250,19 +250,19 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam cancelCtx, cancelFunc := context.WithCancel(context.Background()) arm := &so101{ - name: name, - cfg: conf, - opMgr: operation.NewSingleOperationManager(), - logger: logger, - controller: controller, - model: model, - armServoIDs: conf.ServoIDs, // Store which servos this arm controls - defaultSpeed: speedDegsPerSec, - defaultAcc: accelerationDegsPerSec, - motion: ms, - cancelCtx: cancelCtx, - cancelFunc: cancelFunc, - initCtx: ctx, // Store initialization context + name: name, + cfg: conf, + opMgr: operation.NewSingleOperationManager(), + logger: logger, + controller: controller, + controllerPort: controllerConfig.Port, + model: model, + armServoIDs: conf.ServoIDs, // Store which servos this arm controls + defaultSpeed: speedDegsPerSec, + defaultAcc: accelerationDegsPerSec, + motion: ms, + cancelCtx: cancelCtx, + cancelFunc: cancelFunc, } logger.Debugf("SO-101 configured with speed: %.1f deg/s, acceleration: %.1f deg/s²", @@ -271,7 +271,7 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam // Initialize and verify servo connections if err := arm.initializeServos(); err != nil { - ReleaseSharedController() // Clean up on error + globalRegistry.ReleaseController(controllerConfig.Port) return nil, fmt.Errorf("failed to initialize servos: %w", err) } @@ -619,7 +619,7 @@ func (s *so101) Geometries(ctx context.Context, extra map[string]interface{}) ([ func (s *so101) Close(context.Context) error { s.cancelFunc() - ReleaseSharedController() + globalRegistry.ReleaseController(s.controllerPort) return nil } @@ -657,8 +657,8 @@ func (s *so101) initializeServosWithRetry(maxRetries int) error { // doServoInitialization performs the actual initialization steps func (s *so101) doServoInitialization() error { - // Use stored initialization context instead of creating new one - ctx := s.initCtx + // Use cancelCtx as a temporary placeholder; Task 5 will thread per-call ctx properly. + ctx := s.cancelCtx // Ping all servos to ensure they're responding s.logger.Debug("Pinging all servos...") @@ -691,8 +691,8 @@ func (s *so101) doServoInitialization() error { // diagnoseConnection provides detailed diagnostics for troubleshooting func (s *so101) diagnoseConnection() error { - // Use stored initialization context instead of creating new one - ctx := s.initCtx + // Use cancelCtx as a temporary placeholder; Task 5 will thread per-call ctx properly. + ctx := s.cancelCtx s.logger.Debug("Starting SO-101 arm connection diagnosis...") @@ -719,8 +719,8 @@ func (s *so101) diagnoseConnection() error { // verifyServoConfig checks servo configuration func (s *so101) verifyServoConfig() error { - // Use stored initialization context instead of creating new one - ctx := s.initCtx + // Use cancelCtx as a temporary placeholder; Task 5 will thread per-call ctx properly. + ctx := s.cancelCtx s.logger.Debug("Verifying arm servo configuration...") diff --git a/calibration.go b/calibration.go index aa1bb75..f0cdc59 100644 --- a/calibration.go +++ b/calibration.go @@ -108,10 +108,11 @@ func (cfg *SO101CalibrationSensorConfig) Validate(path string) ([]string, []stri type so101CalibrationSensor struct { resource.AlwaysRebuild - name resource.Name - logger logging.Logger - cfg *SO101CalibrationSensorConfig - controller *SafeSoArmController + name resource.Name + logger logging.Logger + cfg *SO101CalibrationSensorConfig + controller *SafeSoArmController + controllerPort string // port path used to acquire the shared controller // Calibration state mu sync.RWMutex @@ -205,6 +206,7 @@ func NewSO101CalibrationSensor( logger: logger, cfg: conf, controller: controller, + controllerPort: controllerConfig.Port, state: StateIdle, joints: joints, servoNames: servoNames, @@ -1230,7 +1232,7 @@ func (cs *so101CalibrationSensor) Close(ctx context.Context) error { cs.recordingActive = false if cs.controller != nil { - ReleaseSharedController() + globalRegistry.ReleaseController(cs.controllerPort) } return nil diff --git a/gripper.go b/gripper.go index 0c32f5a..5c18695 100644 --- a/gripper.go +++ b/gripper.go @@ -58,11 +58,12 @@ func (cfg *SO101GripperConfig) Validate(path string) ([]string, []string, error) type so101Gripper struct { resource.AlwaysRebuild - name resource.Name - logger logging.Logger - controller *SafeSoArmController - geometries []spatialmath.Geometry - servoID int + name resource.Name + logger logging.Logger + controller *SafeSoArmController + controllerPort string // port path used to acquire the shared controller + geometries []spatialmath.Geometry + servoID int mu sync.Mutex isMoving atomic.Bool @@ -131,6 +132,7 @@ func newSO101Gripper(ctx context.Context, deps resource.Dependencies, conf resou name: conf.ResourceName(), logger: logger, controller: controller, + controllerPort: controllerConfig.Port, geometries: geometries, servoID: cfg.ServoID, speed: 30, @@ -339,7 +341,7 @@ func (g *so101Gripper) DoCommand(ctx context.Context, cmd map[string]interface{} } func (g *so101Gripper) Close(ctx context.Context) error { - ReleaseSharedController() + globalRegistry.ReleaseController(g.controllerPort) return nil } diff --git a/lifecycle_test.go b/lifecycle_test.go index 2370c52..3aec893 100644 --- a/lifecycle_test.go +++ b/lifecycle_test.go @@ -3,6 +3,7 @@ package so_arm import ( "context" "errors" + "sync/atomic" "testing" ) @@ -130,3 +131,25 @@ func TestRegistry_ReleaseClosesAllConsumers(t *testing.T) { t.Errorf("Ping after final release: expected ErrControllerClosed, got %v", err) } } + +// TestRegistry_ExplicitPortReleaseDecrementsRefcount verifies that callers +// can release a controller by passing the port path directly, with no +// dependence on runtime.Caller PC tracking. +func TestRegistry_ExplicitPortReleaseDecrementsRefcount(t *testing.T) { + registry := NewControllerRegistry() + port := "/dev/test-port" + cfg := testConfig(port) + bus, _ := newMockBus(t) + registry.entries[port] = &ControllerEntry{ + controller: &SafeSoArmController{bus: bus, logger: cfg.Logger}, + config: cfg, + refCount: 3, + } + + registry.ReleaseController(port) + + got := atomic.LoadInt64(®istry.entries[port].refCount) + if got != 2 { + t.Errorf("expected refCount=2 after release, got %d", got) + } +} diff --git a/manager.go b/manager.go index c37532e..d91b085 100644 --- a/manager.go +++ b/manager.go @@ -340,10 +340,6 @@ func GetSharedControllerWithCalibration(config *SoArm101Config, calibration SO10 return globalRegistry.GetController(config.Port, config, calibration, fromFile) } -func ReleaseSharedController() { - globalRegistry.releaseFromCaller() -} - func ForceCloseSharedController() error { globalRegistry.mu.RLock() portPaths := make([]string, 0, len(globalRegistry.entries)) @@ -395,12 +391,6 @@ func GetControllerStatus() (int64, bool, string) { return totalRefCount, hasController, configSummary } -// With multiple controllers, this returns the default calibration -// Use GetCurrentCalibrationForPort for port-specific calibration -func GetCurrentCalibration() SO101FullCalibration { - return DefaultSO101FullCalibration -} - func GetCurrentCalibrationForPort(portPath string) SO101FullCalibration { return globalRegistry.GetCurrentCalibration(portPath) } diff --git a/registry.go b/registry.go index ce62a73..5479d1f 100644 --- a/registry.go +++ b/registry.go @@ -3,7 +3,6 @@ package so_arm import ( "context" "fmt" - "runtime" "strings" "sync" "sync/atomic" @@ -24,16 +23,11 @@ type ControllerEntry struct { type ControllerRegistry struct { entries map[string]*ControllerEntry // port path -> entry mu sync.RWMutex - - // For backward API compatibility - track which caller uses which port - callerPorts map[uintptr]string // caller pointer -> port path - callerMu sync.RWMutex } func NewControllerRegistry() *ControllerRegistry { return &ControllerRegistry{ - entries: make(map[string]*ControllerEntry), - callerPorts: make(map[uintptr]string), + entries: make(map[string]*ControllerEntry), } } @@ -326,36 +320,6 @@ func (r *ControllerRegistry) GetCurrentCalibration(portPath string) SO101FullCal return entry.calibration } -func (r *ControllerRegistry) trackCaller(portPath string) { - pc, _, _, ok := runtime.Caller(3) // 3 levels up to get the actual caller - if !ok { - return - } - - r.callerMu.Lock() - r.callerPorts[pc] = portPath - r.callerMu.Unlock() -} - -func (r *ControllerRegistry) releaseFromCaller() { - pc, _, _, ok := runtime.Caller(2) // 2 levels up to get the actual caller - if !ok { - return - } - - r.callerMu.RLock() - portPath, exists := r.callerPorts[pc] - r.callerMu.RUnlock() - - if exists { - r.ReleaseController(portPath) - - r.callerMu.Lock() - delete(r.callerPorts, pc) - r.callerMu.Unlock() - } -} - // compareConfigs returns a string describing the differences between two configs func compareConfigs(a, b *SoArm101Config) string { diffs := []string{} diff --git a/registry_test.go b/registry_test.go index da03fa7..c0a2696 100644 --- a/registry_test.go +++ b/registry_test.go @@ -38,10 +38,6 @@ func TestRegistryCreation(t *testing.T) { t.Fatal("Registry entries map not initialized") } - if registry.callerPorts == nil { - t.Fatal("Registry callerPorts map not initialized") - } - if len(registry.entries) != 0 { t.Fatal("Registry should start empty") } From a03873d1a18a3af7b67125225bdd7ca5d90c4a15 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:28:54 -0400 Subject: [PATCH 11/15] fix(arm): plug controller leak in motion.FromProvider error paths Code-review follow-up to 40e7c28. Three error returns between controller acquisition and arm-struct construction were missing the explicit ReleaseController. Pre-existing leak that became visible once Task 4 made every other release path explicit. Also strengthens placeholder comments at the three doServoInit/ diagnose/verify sites to document the cancellation regression Task 5 will close, and adds a regression test that releasing a never-registered port is a safe no-op. Co-Authored-By: Claude Opus 4.7 (1M context) --- arm.go | 15 ++++++++++++--- lifecycle_test.go | 7 +++++++ 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/arm.go b/arm.go index 25538ed..9df7d61 100644 --- a/arm.go +++ b/arm.go @@ -234,15 +234,18 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam var ms motion.Service if conf.Motion != "" { if deps == nil { + globalRegistry.ReleaseController(controllerConfig.Port) return nil, fmt.Errorf("no deps") } ms, err = motion.FromProvider(deps, conf.Motion) if err != nil { + globalRegistry.ReleaseController(controllerConfig.Port) return nil, err } } else { ms, err = motion.FromProvider(deps, "builtin") if err != nil { + globalRegistry.ReleaseController(controllerConfig.Port) return nil, err } } @@ -657,7 +660,9 @@ func (s *so101) initializeServosWithRetry(maxRetries int) error { // doServoInitialization performs the actual initialization steps func (s *so101) doServoInitialization() error { - // Use cancelCtx as a temporary placeholder; Task 5 will thread per-call ctx properly. + // Task 5 will thread the construction ctx through this helper. Until then, + // cancelCtx (background-derived, lives until Close) cannot abort init on + // caller cancellation — a regression vs. the previous initCtx capture. ctx := s.cancelCtx // Ping all servos to ensure they're responding @@ -691,7 +696,9 @@ func (s *so101) doServoInitialization() error { // diagnoseConnection provides detailed diagnostics for troubleshooting func (s *so101) diagnoseConnection() error { - // Use cancelCtx as a temporary placeholder; Task 5 will thread per-call ctx properly. + // Task 5 will thread the DoCommand caller's ctx through this helper. + // Until then we ignore the per-call ctx that's available on the stack + // at the DoCommand site and use cancelCtx instead. ctx := s.cancelCtx s.logger.Debug("Starting SO-101 arm connection diagnosis...") @@ -719,7 +726,9 @@ func (s *so101) diagnoseConnection() error { // verifyServoConfig checks servo configuration func (s *so101) verifyServoConfig() error { - // Use cancelCtx as a temporary placeholder; Task 5 will thread per-call ctx properly. + // Task 5 will thread the DoCommand caller's ctx through this helper. + // Until then we ignore the per-call ctx that's available on the stack + // at the DoCommand site and use cancelCtx instead. ctx := s.cancelCtx s.logger.Debug("Verifying arm servo configuration...") diff --git a/lifecycle_test.go b/lifecycle_test.go index 3aec893..d00818f 100644 --- a/lifecycle_test.go +++ b/lifecycle_test.go @@ -153,3 +153,10 @@ func TestRegistry_ExplicitPortReleaseDecrementsRefcount(t *testing.T) { t.Errorf("expected refCount=2 after release, got %d", got) } } + +// TestRegistry_ReleaseUnknownPortIsNoop verifies that releasing a port that +// was never registered does not panic and does not affect other entries. +func TestRegistry_ReleaseUnknownPortIsNoop(t *testing.T) { + registry := NewControllerRegistry() + registry.ReleaseController("/dev/never-existed") +} From 2803bc7b5bfb47f41e501bfa6f4be581b244863b Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:31:59 -0400 Subject: [PATCH 12/15] fix(arm): pass per-call ctx to init/diagnose/verify helpers --- arm.go | 37 +++++++++++-------------------------- lifecycle_test.go | 17 +++++++++++++++++ 2 files changed, 28 insertions(+), 26 deletions(-) diff --git a/arm.go b/arm.go index 9df7d61..fe39689 100644 --- a/arm.go +++ b/arm.go @@ -273,7 +273,7 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam logger.Debugf("Arm controlling servo IDs: %v", arm.armServoIDs) // Initialize and verify servo connections - if err := arm.initializeServos(); err != nil { + if err := arm.initializeServos(ctx); err != nil { globalRegistry.ReleaseController(controllerConfig.Port) return nil, fmt.Errorf("failed to initialize servos: %w", err) } @@ -473,14 +473,14 @@ func (s *so101) DoCommand(ctx context.Context, cmd map[string]interface{}) (map[ }, nil case "diagnose": - err := s.diagnoseConnection() + err := s.diagnoseConnection(ctx) return map[string]interface{}{ "success": err == nil, "error": fmt.Sprintf("%v", err), }, nil case "verify_config": - err := s.verifyServoConfig() + err := s.verifyServoConfig(ctx) return map[string]interface{}{ "success": err == nil, "error": fmt.Sprintf("%v", err), @@ -491,7 +491,7 @@ func (s *so101) DoCommand(ctx context.Context, cmd map[string]interface{}) (map[ if r, ok := cmd["retries"].(float64); ok { retries = int(r) } - err := s.initializeServosWithRetry(retries) + err := s.initializeServosWithRetry(ctx, retries) return map[string]interface{}{ "success": err == nil, "error": fmt.Sprintf("%v", err), @@ -627,19 +627,19 @@ func (s *so101) Close(context.Context) error { } // initializeServos pings each servo and enables torque to ensure proper communication -func (s *so101) initializeServos() error { - return s.initializeServosWithRetry(3) +func (s *so101) initializeServos(ctx context.Context) error { + return s.initializeServosWithRetry(ctx, 3) } // initializeServosWithRetry attempts servo initialization with retries -func (s *so101) initializeServosWithRetry(maxRetries int) error { +func (s *so101) initializeServosWithRetry(ctx context.Context, maxRetries int) error { s.logger.Debug("Initializing SO-101 arm servos...") var lastErr error for attempt := 1; attempt <= maxRetries; attempt++ { s.logger.Debugf("Arm servo initialization attempt %d/%d", attempt, maxRetries) - if err := s.doServoInitialization(); err != nil { + if err := s.doServoInitialization(ctx); err != nil { lastErr = err s.logger.Warnf("Initialization attempt %d failed: %v", attempt, err) @@ -659,12 +659,7 @@ func (s *so101) initializeServosWithRetry(maxRetries int) error { } // doServoInitialization performs the actual initialization steps -func (s *so101) doServoInitialization() error { - // Task 5 will thread the construction ctx through this helper. Until then, - // cancelCtx (background-derived, lives until Close) cannot abort init on - // caller cancellation — a regression vs. the previous initCtx capture. - ctx := s.cancelCtx - +func (s *so101) doServoInitialization(ctx context.Context) error { // Ping all servos to ensure they're responding s.logger.Debug("Pinging all servos...") if err := s.controller.Ping(ctx); err != nil { @@ -695,12 +690,7 @@ func (s *so101) doServoInitialization() error { } // diagnoseConnection provides detailed diagnostics for troubleshooting -func (s *so101) diagnoseConnection() error { - // Task 5 will thread the DoCommand caller's ctx through this helper. - // Until then we ignore the per-call ctx that's available on the stack - // at the DoCommand site and use cancelCtx instead. - ctx := s.cancelCtx - +func (s *so101) diagnoseConnection(ctx context.Context) error { s.logger.Debug("Starting SO-101 arm connection diagnosis...") // Test overall ping @@ -725,12 +715,7 @@ func (s *so101) diagnoseConnection() error { } // verifyServoConfig checks servo configuration -func (s *so101) verifyServoConfig() error { - // Task 5 will thread the DoCommand caller's ctx through this helper. - // Until then we ignore the per-call ctx that's available on the stack - // at the DoCommand site and use cancelCtx instead. - ctx := s.cancelCtx - +func (s *so101) verifyServoConfig(ctx context.Context) error { s.logger.Debug("Verifying arm servo configuration...") positions, err := s.controller.GetJointPositionsForServos(ctx, s.armServoIDs) diff --git a/lifecycle_test.go b/lifecycle_test.go index d00818f..5c4c9af 100644 --- a/lifecycle_test.go +++ b/lifecycle_test.go @@ -160,3 +160,20 @@ func TestRegistry_ReleaseUnknownPortIsNoop(t *testing.T) { registry := NewControllerRegistry() registry.ReleaseController("/dev/never-existed") } + +// TestArmHelpers_AcceptContext is a compile-time guard that the diagnose/ +// verify/initialize helpers take an explicit ctx parameter, not a struct field. +// This test documents the contract; the assertion is that the file compiles +// after the refactor. +func TestArmHelpers_AcceptContext(t *testing.T) { + // We can't construct a real *so101 without hardware, but we can assert + // the method set via type checks. If the method signatures change in a + // way that drops the ctx parameter, this file will fail to compile. + var s *so101 + if s == nil { + t.Skip("documentation test; behavior covered indirectly via hardware tests") + } + _ = s.doServoInitialization + _ = s.diagnoseConnection + _ = s.verifyServoConfig +} From bc795be338942ac7a06a98b407d5656e1eb35fd2 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:36:34 -0400 Subject: [PATCH 13/15] refactor(arm): drop write-only cancelCtx; convert sentinel test to file-scope assertion Code-review follow-up to 2803bc7. The cancelCtx field was assigned but never read once Task 5 threaded per-call ctx through the helpers; only cancelFunc is consumed by Close. Replaces TestArmHelpers_AcceptContext (a runtime-skipped documentation-only test that the linter correctly flagged as tautological) with file-scope signature assertions that fail to compile if any helper drops its ctx parameter. Co-Authored-By: Claude Opus 4.7 (1M context) --- arm.go | 4 +--- lifecycle_test.go | 26 ++++++++++---------------- 2 files changed, 11 insertions(+), 19 deletions(-) diff --git a/arm.go b/arm.go index fe39689..3f5892a 100644 --- a/arm.go +++ b/arm.go @@ -107,7 +107,6 @@ type so101 struct { motion motion.Service - cancelCtx context.Context cancelFunc func() } @@ -250,7 +249,7 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam } } - cancelCtx, cancelFunc := context.WithCancel(context.Background()) + _, cancelFunc := context.WithCancel(context.Background()) arm := &so101{ name: name, @@ -264,7 +263,6 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam defaultSpeed: speedDegsPerSec, defaultAcc: accelerationDegsPerSec, motion: ms, - cancelCtx: cancelCtx, cancelFunc: cancelFunc, } diff --git a/lifecycle_test.go b/lifecycle_test.go index 5c4c9af..9883ab7 100644 --- a/lifecycle_test.go +++ b/lifecycle_test.go @@ -161,19 +161,13 @@ func TestRegistry_ReleaseUnknownPortIsNoop(t *testing.T) { registry.ReleaseController("/dev/never-existed") } -// TestArmHelpers_AcceptContext is a compile-time guard that the diagnose/ -// verify/initialize helpers take an explicit ctx parameter, not a struct field. -// This test documents the contract; the assertion is that the file compiles -// after the refactor. -func TestArmHelpers_AcceptContext(t *testing.T) { - // We can't construct a real *so101 without hardware, but we can assert - // the method set via type checks. If the method signatures change in a - // way that drops the ctx parameter, this file will fail to compile. - var s *so101 - if s == nil { - t.Skip("documentation test; behavior covered indirectly via hardware tests") - } - _ = s.doServoInitialization - _ = s.diagnoseConnection - _ = s.verifyServoConfig -} +// File-scope signature assertions: these guarantee the ctx-threading helpers +// keep their (ctx context.Context) parameter. Dropping ctx would fail to +// compile here — caught at build time, not at test runtime. +var ( + _ func(context.Context) error = (*so101)(nil).doServoInitialization + _ func(context.Context) error = (*so101)(nil).diagnoseConnection + _ func(context.Context) error = (*so101)(nil).verifyServoConfig + _ func(context.Context) error = (*so101)(nil).initializeServos + _ func(context.Context, int) error = (*so101)(nil).initializeServosWithRetry +) From 0413e4334abca61d7d075a6c411509edc16ec341 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 16:42:06 -0400 Subject: [PATCH 14/15] refactor(arm): drop dead cancelFunc field and Close call Final-review follow-up to bc795be. Once cancelCtx was removed, the context returned by WithCancel was being discarded into _, leaving cancelFunc canceling a context nobody observed. PR4's opMgr work will introduce proper cancellation; until then there's no reason to keep the dead lifecycle field. Co-Authored-By: Claude Opus 4.7 (1M context) --- arm.go | 6 ------ 1 file changed, 6 deletions(-) diff --git a/arm.go b/arm.go index 3f5892a..0ec5e6e 100644 --- a/arm.go +++ b/arm.go @@ -106,8 +106,6 @@ type so101 struct { defaultAcc float32 motion motion.Service - - cancelFunc func() } func makeSO101ModelFrame() (referenceframe.Model, error) { @@ -249,8 +247,6 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam } } - _, cancelFunc := context.WithCancel(context.Background()) - arm := &so101{ name: name, cfg: conf, @@ -263,7 +259,6 @@ func NewSO101(ctx context.Context, deps resource.Dependencies, name resource.Nam defaultSpeed: speedDegsPerSec, defaultAcc: accelerationDegsPerSec, motion: ms, - cancelFunc: cancelFunc, } logger.Debugf("SO-101 configured with speed: %.1f deg/s, acceleration: %.1f deg/s²", @@ -619,7 +614,6 @@ func (s *so101) Geometries(ctx context.Context, extra map[string]interface{}) ([ } func (s *so101) Close(context.Context) error { - s.cancelFunc() globalRegistry.ReleaseController(s.controllerPort) return nil } From c361b5f4c015a06d7a789298c96dd93ef5c595a8 Mon Sep 17 00:00:00 2001 From: HipsterBrown Date: Mon, 27 Apr 2026 17:12:52 -0400 Subject: [PATCH 15/15] chore: move plans to global location --- .../2026-04-27-pr1-lifecycle-correctness.md | 1045 ----------------- .../2026-04-27-so101-remediation-design.md | 169 --- 2 files changed, 1214 deletions(-) delete mode 100644 docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md delete mode 100644 docs/superpowers/specs/2026-04-27-so101-remediation-design.md diff --git a/docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md b/docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md deleted file mode 100644 index 9d44f85..0000000 --- a/docs/superpowers/plans/2026-04-27-pr1-lifecycle-correctness.md +++ /dev/null @@ -1,1045 +0,0 @@ -# PR1: Lifecycle Correctness Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Eliminate three lifecycle bugs in the SO-101 module — `runtime.Caller`-based reference tracking, `s.initCtx` reuse after construction, and the gripper-closes-first race when arm + gripper share a controller — without changing user-observable behavior. - -**Architecture:** Replace caller-PC tracking with explicit port handles stored on each component. Stop reusing the constructor's context after `NewSO101` returns; thread per-call ctx through all controller-touching helpers. Make the registry return the same `*SafeSoArmController` pointer to all callers per port, with an internal `closed` flag that all controller methods check. - -**Tech Stack:** Go 1.25, Viam RDK (`go.viam.com/rdk`), `github.com/hipsterbrown/feetech-servo` v0.4.0, `feetech.MockTransport` for tests. - -**Spec reference:** `docs/superpowers/specs/2026-04-27-so101-remediation-design.md` § PR1. - ---- - -## File Structure - -**Created:** -- `mock_bus_test.go` — test-only helper that builds a `feetech.Bus` with a `MockTransport` and wires it into a `SafeSoArmController` via the registry. Reused by PR2-PR5. -- `lifecycle_test.go` — test file for this PR's correctness behaviors (refcount, same-pointer, post-close error, port-explicit release). - -**Modified:** -- `manager.go` — add `closed atomic.Bool` field + `ErrControllerClosed` sentinel; check at the top of every method; remove `releaseFromCaller` plumbing. -- `registry.go` — delete `callerPorts`, `callerMu`, `trackCaller`, `releaseFromCaller`; cache the controller pointer in `ControllerEntry` and return that exact pointer from `getExistingController` / `createNewController`; `ReleaseController` sets `closed` before closing the bus. -- `arm.go` — store `controllerPort string` on the `so101` struct; `Close` calls `globalRegistry.ReleaseController(s.controllerPort)`; remove `initCtx` field; `doServoInitialization`, `diagnoseConnection`, `verifyServoConfig` take `ctx context.Context` parameters. -- `gripper.go` — store `controllerPort string` on `so101Gripper`; `Close` calls `globalRegistry.ReleaseController(g.controllerPort)`. -- `calibration.go` — store `controllerPort string` on `so101CalibrationSensor`; `Close` calls `globalRegistry.ReleaseController(cs.controllerPort)`. - -**Deleted (within `manager.go`):** `ReleaseSharedController()` (the caller-PC variant) is removed; callers move to the explicit port form. - -**Pre-existing tests touched:** -- `registry_test.go` — drop now-meaningless tests of `callerPorts`/`releaseFromCaller`; the new tests in `lifecycle_test.go` cover the replacement contract. - ---- - -## Task 1: Mock-bus test harness - -This harness is the foundation for every test in this PR (and PR2-PR5). It must come first. - -**Files:** -- Create: `mock_bus_test.go` - -- [ ] **Step 1: Confirm `MockTransport` is exported and usable** - -Run: `grep -n "type MockTransport" $HOME/go/pkg/mod/github.com/hipsterbrown/feetech-servo@v0.4.0/feetech/transport.go` -Expected: line showing `type MockTransport struct {` (already verified during planning, but re-confirm). - -- [ ] **Step 2: Write the harness file** - -```go -package so_arm - -import ( - "sync" - "testing" - "time" - - "github.com/hipsterbrown/feetech-servo/feetech" - "go.viam.com/rdk/logging" -) - -// scriptedMockTransport is a custom Transport (not feetech.MockTransport) whose -// Read responses are queued per-request. We use a custom type rather than the -// upstream MockTransport because tests in PR2-PR5 need to script multiple -// round-trips with different responses, and MockTransport's single ReadData -// buffer doesn't support that pattern cleanly. -type scriptedMockTransport struct { - mu sync.Mutex - written []byte - responses [][]byte - closed bool -} - -func (m *scriptedMockTransport) Read(p []byte) (int, error) { - m.mu.Lock() - defer m.mu.Unlock() - if len(m.responses) == 0 { - return 0, nil - } - resp := m.responses[0] - n := copy(p, resp) - if n >= len(resp) { - m.responses = m.responses[1:] - } else { - m.responses[0] = resp[n:] - } - return n, nil -} - -func (m *scriptedMockTransport) Write(p []byte) (int, error) { - m.mu.Lock() - defer m.mu.Unlock() - m.written = append(m.written, p...) - return len(p), nil -} - -func (m *scriptedMockTransport) Close() error { - m.mu.Lock() - defer m.mu.Unlock() - m.closed = true - return nil -} - -func (m *scriptedMockTransport) SetReadTimeout(timeout time.Duration) error { - return nil -} - -func (m *scriptedMockTransport) Flush() error { - return nil -} - -// queueResponse appends a raw frame to the mock's response queue. -func (m *scriptedMockTransport) queueResponse(b []byte) { - m.mu.Lock() - defer m.mu.Unlock() - m.responses = append(m.responses, b) -} - -// reset clears the written-data buffer (responses queue is preserved). -func (m *scriptedMockTransport) resetWritten() { - m.mu.Lock() - defer m.mu.Unlock() - m.written = nil -} - -// snapshotWritten returns a copy of all bytes written since the last reset. -func (m *scriptedMockTransport) snapshotWritten() []byte { - m.mu.Lock() - defer m.mu.Unlock() - out := make([]byte, len(m.written)) - copy(out, m.written) - return out -} - -// pingResponse builds the ping ACK frame for a given servo ID using the STS protocol. -// Frame: 0xFF 0xFF 0x02 0x00 -func pingResponse(servoID byte) []byte { - checksum := ^(servoID + 0x02 + 0x00) - return []byte{0xFF, 0xFF, servoID, 0x02, 0x00, checksum} -} - -// newMockBus returns a *feetech.Bus backed by a scriptedMockTransport. -// Caller is responsible for queuing responses before issuing bus operations. -func newMockBus(t *testing.T) (*feetech.Bus, *scriptedMockTransport) { - t.Helper() - mock := &scriptedMockTransport{} - bus, err := feetech.NewBus(feetech.BusConfig{ - Transport: mock, - Protocol: feetech.ProtocolSTS, - Timeout: 100 * time.Millisecond, - }) - if err != nil { - t.Fatalf("newMockBus: NewBus failed: %v", err) - } - t.Cleanup(func() { _ = bus.Close() }) - return bus, mock -} - -// newTestLogger returns a logger that discards output unless the test fails. -func newTestLogger(t *testing.T) logging.Logger { - t.Helper() - return logging.NewTestLogger(t) -} -``` - -- [ ] **Step 3: Build to confirm it compiles** - -Run: `go build ./...` -Expected: no output (success). The harness has no test functions yet; it's pure infrastructure. - -- [ ] **Step 4: Sanity test the harness — write a tiny test that pings via mock** - -Append to `mock_bus_test.go`: - -```go -func TestMockBus_PingRoundTrip(t *testing.T) { - bus, mock := newMockBus(t) - mock.queueResponse(pingResponse(1)) - - if _, err := bus.Ping(t.Context(), 1); err != nil { - t.Fatalf("ping failed: %v", err) - } - - written := mock.snapshotWritten() - if len(written) < 6 { - t.Fatalf("expected ping packet to be written, got %d bytes: %X", len(written), written) - } - // Ping packet: 0xFF 0xFF - if written[0] != 0xFF || written[1] != 0xFF || written[2] != 0x01 { - t.Errorf("malformed ping packet: %X", written[:6]) - } -} -``` - -- [ ] **Step 5: Run the harness sanity test** - -Run: `go test -run TestMockBus_PingRoundTrip -v` -Expected: PASS. - -If `t.Context()` is not available (Go < 1.24), substitute `context.Background()` and add the `context` import. - -- [ ] **Step 6: Commit** - -```bash -git add mock_bus_test.go -git commit -m "test: add scripted mock-bus harness for so-101 tests" -``` - ---- - -## Task 2: Add `ErrControllerClosed` sentinel and `closed` flag - -User contract: any controller method called after `Close` returns `ErrControllerClosed`. Internal methods check via a single helper. - -**Files:** -- Modify: `manager.go` (add field, sentinel, helper; gate every public method) -- Test: `lifecycle_test.go` (new file) - -- [ ] **Step 1: Write failing test for the closed-flag contract** - -Create `lifecycle_test.go`: - -```go -package so_arm - -import ( - "context" - "errors" - "testing" -) - -// TestController_PostCloseReturnsSentinel verifies that calling any controller -// method after the bus has been closed returns ErrControllerClosed rather than -// a panic or a serial-port error. -func TestController_PostCloseReturnsSentinel(t *testing.T) { - bus, _ := newMockBus(t) - - ctrl := &SafeSoArmController{ - bus: bus, - logger: newTestLogger(t), - } - ctrl.closed.Store(true) - - if err := ctrl.Ping(context.Background()); !errors.Is(err, ErrControllerClosed) { - t.Errorf("Ping after close: expected ErrControllerClosed, got %v", err) - } - if err := ctrl.SetTorqueEnable(context.Background(), true); !errors.Is(err, ErrControllerClosed) { - t.Errorf("SetTorqueEnable after close: expected ErrControllerClosed, got %v", err) - } - if _, err := ctrl.GetJointPositionsForServos(context.Background(), []int{1}); !errors.Is(err, ErrControllerClosed) { - t.Errorf("GetJointPositionsForServos after close: expected ErrControllerClosed, got %v", err) - } -} -``` - -- [ ] **Step 2: Run the test to see it fail** - -Run: `go test -run TestController_PostCloseReturnsSentinel -v` -Expected: BUILD FAIL — `ErrControllerClosed` undefined, `closed` field missing on `SafeSoArmController`. - -- [ ] **Step 3: Add the sentinel and field to `manager.go`** - -In `manager.go`, in the imports block ensure `errors` and `sync/atomic` are imported (they already are; `sync/atomic` is present). Add at the top of the file, after the imports: - -```go -// ErrControllerClosed is returned by SafeSoArmController methods after the -// underlying bus has been closed via the registry. Callers holding a stale -// reference should treat this as a permanent failure for that controller. -var ErrControllerClosed = errors.New("so101: controller is closed") -``` - -Add `closed atomic.Bool` to the `SafeSoArmController` struct (currently lines 22-29): - -```go -type SafeSoArmController struct { - bus *feetech.Bus - group *feetech.ServoGroup - calibratedServos map[int]*CalibratedServo - logger logging.Logger - calibration SO101FullCalibration - mu sync.RWMutex - closed atomic.Bool -} -``` - -Add a helper just below the struct definition: - -```go -// checkClosed returns ErrControllerClosed if the controller has been released. -func (s *SafeSoArmController) checkClosed() error { - if s.closed.Load() { - return ErrControllerClosed - } - return nil -} -``` - -- [ ] **Step 4: Gate every public method on `SafeSoArmController`** - -For each of these methods in `manager.go`, add `if err := s.checkClosed(); err != nil { return err }` (or `return nil, err` for two-return methods, or `return SO101FullCalibration{}, err` for `GetCalibration`) as the *first* statement of the function body, **before** any lock acquisition: - -- `MoveToJointPositions` (line 31) -- `MoveServosToPositions` (line 60) -- `GetJointPositions` (line 94) -- `GetJointPositionsForServos` (line 132) -- `SetTorqueEnable` (line 161) -- `Stop` (line 177) -- `Ping` (line 199) -- `WriteServoRegister` (line 212) -- `SetCalibration` (line 224) - -Do NOT gate `GetCalibration` (line 246) or `getCalibrationForServo` (line 253) — these are read-only, lock-protected, and a stale read is acceptable. Do NOT gate `Close` (line 189); it is the legitimate path to set the flag (see next task). - -Example for `Ping`: - -```go -func (s *SafeSoArmController) Ping(ctx context.Context) error { - if err := s.checkClosed(); err != nil { - return err - } - s.mu.RLock() - defer s.mu.RUnlock() - // ... existing body -} -``` - -- [ ] **Step 5: Run the failing test to verify it now passes** - -Run: `go test -run TestController_PostCloseReturnsSentinel -v` -Expected: PASS. - -- [ ] **Step 6: Run the full test suite to confirm nothing regressed** - -Run: `go test ./...` -Expected: PASS (existing tests should be unaffected; if any registry tests fail, they're test-fixture issues addressed in Task 5). - -- [ ] **Step 7: Commit** - -```bash -git add manager.go lifecycle_test.go -git commit -m "feat(controller): add ErrControllerClosed sentinel and gate methods" -``` - ---- - -## Task 3: Same-pointer registry semantics - -Make `getExistingController` return the cached pointer instead of building a fresh struct copy. `ReleaseController` sets `closed` on the cached controller before closing the bus. - -**Files:** -- Modify: `registry.go` (cache pointer in entry, return it from getters, set closed on release) -- Modify: `lifecycle_test.go` (add same-pointer test) - -- [ ] **Step 1: Write failing test for same-pointer contract** - -Append to `lifecycle_test.go`: - -```go -// TestRegistry_SamePointerForSamePort verifies that two callers acquiring a -// controller for the same port receive the *same* *SafeSoArmController, so -// that close-state propagates correctly across all consumers. -func TestRegistry_SamePointerForSamePort(t *testing.T) { - registry := NewControllerRegistry() - port := "/dev/test-port" - cfg := testConfig(port) - - // Inject a pre-built entry so we don't need a real bus. - bus, _ := newMockBus(t) - ctrl := &SafeSoArmController{ - bus: bus, - logger: cfg.Logger, - } - registry.entries[port] = &ControllerEntry{ - controller: ctrl, - config: cfg, - calibration: DefaultSO101FullCalibration, - refCount: 0, - } - - first, err := registry.GetController(port, cfg, DefaultSO101FullCalibration, false) - if err != nil { - t.Fatalf("first GetController: %v", err) - } - second, err := registry.GetController(port, cfg, DefaultSO101FullCalibration, false) - if err != nil { - t.Fatalf("second GetController: %v", err) - } - - if first != second { - t.Errorf("expected same pointer for same port; got %p and %p", first, second) - } - if first != ctrl { - t.Errorf("expected cached controller pointer to be returned") - } -} -``` - -- [ ] **Step 2: Run the test to see it fail** - -Run: `go test -run TestRegistry_SamePointerForSamePort -v` -Expected: FAIL — current `getExistingController` builds a new `&SafeSoArmController{...}` (registry.go:98-104), so the pointers differ. - -- [ ] **Step 3: Modify `getExistingController` to return the cached pointer** - -In `registry.go`, replace the return statement at lines 95-104: - -```go - atomic.AddInt64(&entry.refCount, 1) - r.trackCaller(entry.config.Port) - - return &SafeSoArmController{ - bus: entry.controller.bus, - group: entry.controller.group, - calibratedServos: entry.controller.calibratedServos, - logger: config.Logger, - calibration: entry.calibration, - }, nil -} -``` - -with: - -```go - atomic.AddInt64(&entry.refCount, 1) - - // Return the cached pointer so that all consumers observe close-state - // (and any future calibration updates) atomically. - return entry.controller, nil -} -``` - -The `r.trackCaller` line goes too — it's deleted in Task 4. - -- [ ] **Step 4: Modify `createNewController` to also return the cached pointer** - -In `registry.go`, the bottom of `createNewController` (lines 220-226) currently returns a new struct. Change it to return the same pointer it just stored on `entry.controller`: - -```go - entry.controller = &SafeSoArmController{ - bus: bus, - group: group, - calibratedServos: calibratedServos, - logger: config.Logger, - calibration: finalCalibration, - } - entry.calibration = finalCalibration - entry.lastError = nil - atomic.StoreInt64(&entry.refCount, 1) - - r.entries[portPath] = entry - - if config.Logger != nil { - config.Logger.Debugf("Created new feetech servo bus with %d servos for port %s", len(calibratedServos), portPath) - } - - return entry.controller, nil -} -``` - -(The `r.trackCaller(portPath)` call goes too — deleted in Task 4.) - -- [ ] **Step 5: Run the test to verify it passes** - -Run: `go test -run TestRegistry_SamePointerForSamePort -v` -Expected: PASS. - -- [ ] **Step 6: Add and run the close-propagation test** - -Append to `lifecycle_test.go`: - -```go -// TestRegistry_ReleaseClosesAllConsumers verifies that ReleaseController -// at refcount zero closes the bus and sets the closed flag on the shared -// controller, so other holders observe ErrControllerClosed on next call. -func TestRegistry_ReleaseClosesAllConsumers(t *testing.T) { - registry := NewControllerRegistry() - port := "/dev/test-port" - cfg := testConfig(port) - - bus, _ := newMockBus(t) - ctrl := &SafeSoArmController{ - bus: bus, - logger: cfg.Logger, - } - registry.entries[port] = &ControllerEntry{ - controller: ctrl, - config: cfg, - calibration: DefaultSO101FullCalibration, - refCount: 2, // simulate arm + gripper both holding - } - - // First release: refcount drops to 1, controller stays alive. - registry.ReleaseController(port) - if ctrl.closed.Load() { - t.Fatalf("controller closed prematurely at refcount > 0") - } - - // Second release: refcount drops to 0, controller closes. - registry.ReleaseController(port) - if !ctrl.closed.Load() { - t.Errorf("expected controller.closed=true after final release") - } - if err := ctrl.Ping(t.Context()); !errors.Is(err, ErrControllerClosed) { - t.Errorf("Ping after final release: expected ErrControllerClosed, got %v", err) - } -} -``` - -Run: `go test -run TestRegistry_ReleaseClosesAllConsumers -v` -Expected: FAIL — `ReleaseController` doesn't yet set `closed`. - -- [ ] **Step 7: Set `closed` in `ReleaseController` before closing the bus** - -In `registry.go`, in `ReleaseController` (lines 229-259), modify the refcount-zero branch: - -```go - currentRefCount := atomic.AddInt64(&entry.refCount, -1) - if currentRefCount <= 0 { - if entry.controller != nil { - entry.controller.closed.Store(true) - if entry.controller.bus != nil { - if err := entry.controller.bus.Close(); err != nil && entry.config != nil && entry.config.Logger != nil { - entry.config.Logger.Warnf("error closing shared controller for port %s: %v", portPath, err) - } - } - } - - r.mu.Lock() - delete(r.entries, portPath) - r.mu.Unlock() - - entry.controller = nil - entry.config = nil - entry.calibration = SO101FullCalibration{} - atomic.StoreInt64(&entry.refCount, 0) - entry.lastError = nil - } -``` - -Also update `ForceCloseController` (lines 261-287) to set `closed` similarly: - -```go - if entry.controller != nil { - entry.controller.closed.Store(true) - err = entry.controller.bus.Close() - entry.controller = nil - // ... rest unchanged - } -``` - -- [ ] **Step 8: Run the test to verify it passes** - -Run: `go test -run TestRegistry_ReleaseClosesAllConsumers -v` -Expected: PASS. - -- [ ] **Step 9: Run the full suite** - -Run: `go test ./...` -Expected: PASS (some pre-existing `registry_test.go` tests will skip; that's fine). - -- [ ] **Step 10: Commit** - -```bash -git add registry.go lifecycle_test.go -git commit -m "feat(registry): cache controller pointer; propagate close to all holders" -``` - ---- - -## Task 4: Delete `runtime.Caller` machinery and add explicit-port release - -Replace `releaseFromCaller` with explicit `ReleaseController(port)` calls. Components store `controllerPort` on their struct. - -**Files:** -- Modify: `registry.go` (delete `callerPorts`, `callerMu`, `trackCaller`, `releaseFromCaller`) -- Modify: `manager.go` (delete `ReleaseSharedController()`) -- Modify: `arm.go` (store `controllerPort`, call explicit `Release` in `Close`) -- Modify: `gripper.go` (same pattern) -- Modify: `calibration.go` (same pattern) -- Modify: `lifecycle_test.go` (add explicit-port release test) - -- [ ] **Step 1: Write failing test for the explicit-port contract** - -Append to `lifecycle_test.go`: - -```go -// TestRegistry_ExplicitPortReleaseDecrementsRefcount verifies that callers -// can release a controller by passing the port path directly, with no -// dependence on runtime.Caller PC tracking. -func TestRegistry_ExplicitPortReleaseDecrementsRefcount(t *testing.T) { - registry := NewControllerRegistry() - port := "/dev/test-port" - cfg := testConfig(port) - bus, _ := newMockBus(t) - registry.entries[port] = &ControllerEntry{ - controller: &SafeSoArmController{bus: bus, logger: cfg.Logger}, - config: cfg, - refCount: 3, - } - - registry.ReleaseController(port) - - got := atomic.LoadInt64(®istry.entries[port].refCount) - if got != 2 { - t.Errorf("expected refCount=2 after release, got %d", got) - } -} -``` - -Run: `go test -run TestRegistry_ExplicitPortReleaseDecrementsRefcount -v` -Expected: PASS (this already works — `ReleaseController(port)` exists; the test guards against future regression). - -- [ ] **Step 2: Delete `callerPorts`, `callerMu`, `trackCaller`, `releaseFromCaller` from `registry.go`** - -In `registry.go`: - -1. From the `ControllerRegistry` struct (lines 24-31), remove the `callerPorts` and `callerMu` fields: - -```go -type ControllerRegistry struct { - entries map[string]*ControllerEntry // port path -> entry - mu sync.RWMutex -} -``` - -2. From `NewControllerRegistry` (lines 33-38), remove the `callerPorts: ...` initializer: - -```go -func NewControllerRegistry() *ControllerRegistry { - return &ControllerRegistry{ - entries: make(map[string]*ControllerEntry), - } -} -``` - -3. Delete the `trackCaller` method (lines 332-341 — already-removed call sites in Task 3). - -4. Delete the `releaseFromCaller` method (lines 343-360). - -5. Remove the now-unused `runtime` import from the imports block. **Keep `strings`** — it's still used by `compareConfigs`. - -- [ ] **Step 3: Delete `ReleaseSharedController` from `manager.go`** - -In `manager.go`, delete the `ReleaseSharedController` function (lines 299-301): - -```go -func ReleaseSharedController() { - globalRegistry.releaseFromCaller() -} -``` - -Keep `GetSharedController`, `GetSharedControllerWithCalibration`, `ForceCloseSharedController`, `GetControllerStatus`, `GetCurrentCalibrationForPort`. Delete `GetCurrentCalibration` too — its godoc admits it's wrong, and PR3 was going to remove it anyway. Removing it now keeps the API surface honest. - -```go -// DELETE THIS FUNCTION: -// With multiple controllers, this returns the default calibration -// Use GetCurrentCalibrationForPort for port-specific calibration -func GetCurrentCalibration() SO101FullCalibration { - return DefaultSO101FullCalibration -} -``` - -- [ ] **Step 4: Add `controllerPort` field to `so101` and update `Close`** - -In `arm.go`: - -1. Add a field to the `so101` struct (lines 87-112), removing `initCtx`: - -```go -type so101 struct { - resource.AlwaysRebuild - - name resource.Name - logger logging.Logger - cfg *SO101ArmConfig - opMgr *operation.SingleOperationManager - controller *SafeSoArmController - controllerPort string // port path used to acquire the shared controller - - mu sync.RWMutex - moveLock sync.Mutex - isMoving atomic.Bool - model referenceframe.Model - - armServoIDs []int - - defaultSpeed float32 - defaultAcc float32 - - motion motion.Service - - cancelCtx context.Context - cancelFunc func() -} -``` - -2. In `NewSO101` (line 252-266), set `controllerPort` and remove `initCtx`: - -```go - arm := &so101{ - name: name, - cfg: conf, - opMgr: operation.NewSingleOperationManager(), - logger: logger, - controller: controller, - controllerPort: controllerConfig.Port, - model: model, - armServoIDs: conf.ServoIDs, - defaultSpeed: speedDegsPerSec, - defaultAcc: accelerationDegsPerSec, - motion: ms, - cancelCtx: cancelCtx, - cancelFunc: cancelFunc, - } -``` - -3. In the two `ReleaseSharedController()` cleanup-on-error sites (lines 230 and 274), replace with the explicit call: - -```go - model, err := makeSO101ModelFrame() - if err != nil { - globalRegistry.ReleaseController(controllerConfig.Port) - return nil, fmt.Errorf("failed to create kinematic model: %w", err) - } -``` - -and: - -```go - if err := arm.initializeServos(ctx); err != nil { - globalRegistry.ReleaseController(controllerConfig.Port) - return nil, fmt.Errorf("failed to initialize servos: %w", err) - } -``` - -(Note `initializeServos(ctx)` — fixed in Task 5.) - -4. In `Close` (lines 620-624): - -```go -func (s *so101) Close(context.Context) error { - s.cancelFunc() - globalRegistry.ReleaseController(s.controllerPort) - return nil -} -``` - -- [ ] **Step 5: Same pattern in `gripper.go`** - -In `gripper.go`: - -1. Add `controllerPort string` to `so101Gripper` struct (lines 58-76). -2. In `newSO101Gripper`, set `controllerPort: controllerConfig.Port` in the struct literal. -3. In `Close` (lines 341-344): - -```go -func (g *so101Gripper) Close(ctx context.Context) error { - globalRegistry.ReleaseController(g.controllerPort) - return nil -} -``` - -- [ ] **Step 6: Same pattern in `calibration.go`** - -In `calibration.go`: - -1. Add `controllerPort string` to `so101CalibrationSensor` struct (lines 108-135). -2. In `NewSO101CalibrationSensor`, set `controllerPort: controllerConfig.Port` in the struct literal. -3. In `Close` (lines 1221-1237), replace `ReleaseSharedController()` with `globalRegistry.ReleaseController(cs.controllerPort)`. - -- [ ] **Step 7: Build and confirm** - -Run: `go build ./...` -Expected: success. If there are remaining `ReleaseSharedController()` references (likely in `cmd/cli/`), leave them for PR8 — `cmd/cli/` is already broken on disk and not in the build path. - -If the build complains about `cmd/cli/` references during `./...`, scope the build to the module package: `go build .` - -- [ ] **Step 8: Run the full test suite** - -Run: `go test ./...` -Expected: PASS. Pre-existing `registry_test.go` tests that touched `callerPorts` need updating — see next task. - -- [ ] **Step 9: Commit** - -```bash -git add registry.go manager.go arm.go gripper.go calibration.go lifecycle_test.go -git commit -m "feat(registry): replace runtime.Caller tracking with explicit port handles" -``` - ---- - -## Task 5: Stop reusing `s.initCtx` - -Thread per-call `ctx` through the constructor's helpers. Remove the `initCtx` field. Update DoCommand handlers to pass their own ctx. - -**Files:** -- Modify: `arm.go` (remove `initCtx` field; helper signatures take ctx) -- Modify: `lifecycle_test.go` (add cancellation test if feasible without hardware) - -- [ ] **Step 1: Write a sentinel test that the helper accepts ctx** - -Append to `lifecycle_test.go`: - -```go -// TestArmHelpers_AcceptContext is a compile-time guard that the diagnose/ -// verify/initialize helpers take an explicit ctx parameter, not s.initCtx. -// This test documents the contract; the assertion is that the file compiles -// after the refactor. -func TestArmHelpers_AcceptContext(t *testing.T) { - // We can't construct a real *so101 without hardware, but we can assert - // the method set via type checks at the call site. If the methods change - // signature, this file will fail to compile. - var s *so101 - if s == nil { - t.Skip("documentation test; behavior covered indirectly via hardware tests") - } - _ = s.doServoInitialization - _ = s.diagnoseConnection - _ = s.verifyServoConfig -} -``` - -- [ ] **Step 2: Run the test to confirm it currently passes (the methods exist, just with wrong signatures)** - -Run: `go test -run TestArmHelpers_AcceptContext -v` -Expected: PASS (skipped at runtime). - -- [ ] **Step 3: Update `doServoInitialization` to take ctx** - -In `arm.go`: - -1. Change the signature of `doServoInitialization` (line 659): - -```go -func (s *so101) doServoInitialization(ctx context.Context) error { - s.logger.Debug("Pinging all servos...") - if err := s.controller.Ping(ctx); err != nil { - return fmt.Errorf("servo ping failed: %w", err) - } - // ... existing body, with `s.initCtx` replaced by `ctx` -} -``` - -Find every `s.initCtx` reference inside this function and replace with `ctx`. - -2. Change the signature of `initializeServosWithRetry` (line 632): - -```go -func (s *so101) initializeServosWithRetry(ctx context.Context, maxRetries int) error { - // ... existing body - for attempt := 1; attempt <= maxRetries; attempt++ { - // ... - if err := s.doServoInitialization(ctx); err != nil { - // ... - } - } -} -``` - -3. Change the signature of `initializeServos` (line 627): - -```go -func (s *so101) initializeServos(ctx context.Context) error { - return s.initializeServosWithRetry(ctx, 3) -} -``` - -4. Change `diagnoseConnection` (line 693) and `verifyServoConfig` (line 721) to take `ctx`: - -```go -func (s *so101) diagnoseConnection(ctx context.Context) error { - s.logger.Debug("Starting SO-101 arm connection diagnosis...") - // replace s.initCtx with ctx throughout -} - -func (s *so101) verifyServoConfig(ctx context.Context) error { - // replace s.initCtx with ctx throughout -} -``` - -- [ ] **Step 4: Update DoCommand handlers in `arm.go` to pass their own `ctx`** - -In `arm.go` `DoCommand` (line 448), the cases that call these helpers need to pass `ctx` (the function's own parameter): - -- `case "diagnose":` (line 472) — change `err := s.diagnoseConnection()` to `err := s.diagnoseConnection(ctx)`. -- `case "verify_config":` (line 479) — change `err := s.verifyServoConfig()` to `err := s.verifyServoConfig(ctx)`. -- `case "reinitialize":` (line 486) — change `err := s.initializeServosWithRetry(retries)` to `err := s.initializeServosWithRetry(ctx, retries)`. - -- [ ] **Step 5: Update the constructor call site** - -In `arm.go` `NewSO101` (line 273), change: - -```go - if err := arm.initializeServos(); err != nil { -``` - -to: - -```go - if err := arm.initializeServos(ctx); err != nil { -``` - -(The `ctx` here is `NewSO101`'s parameter, which is correct for construction-time init.) - -- [ ] **Step 6: Remove the `initCtx` field from the struct** - -The struct change was already made in Task 4 step 4 (when we listed the new struct without `initCtx`). Confirm that no compilation references to `s.initCtx` remain: - -Run: `grep -n "initCtx" arm.go` -Expected: no output. - -- [ ] **Step 7: Build and run tests** - -Run: `go build . && go test ./...` -Expected: PASS. - -- [ ] **Step 8: Commit** - -```bash -git add arm.go lifecycle_test.go -git commit -m "fix(arm): pass per-call ctx to init/diagnose/verify helpers" -``` - ---- - -## Task 6: Update pre-existing `registry_test.go` for new contract - -Some pre-existing tests reference `callerPorts` indirectly or assert state that no longer exists. Update them to test the new contract. - -**Files:** -- Modify: `registry_test.go` - -- [ ] **Step 1: Identify failing tests** - -Run: `go test -run TestRegistry -v` -Expected: most pass. Specifically inspect `TestRegistryCreation` (lines 30-48) — it asserts `registry.callerPorts != nil`, which no longer exists. - -- [ ] **Step 2: Update `TestRegistryCreation`** - -In `registry_test.go`, the test `TestRegistryCreation` currently asserts `callerPorts` is initialized. Remove that assertion: - -```go -func TestRegistryCreation(t *testing.T) { - registry := NewControllerRegistry() - - if registry == nil { - t.Fatal("NewControllerRegistry returned nil") - } - - if registry.entries == nil { - t.Fatal("Registry entries map not initialized") - } - - if len(registry.entries) != 0 { - t.Fatal("Registry should start empty") - } -} -``` - -- [ ] **Step 3: Run the full registry test file** - -Run: `go test -run TestRegistry -v` -Expected: PASS for non-skipped tests. - -- [ ] **Step 4: Commit** - -```bash -git add registry_test.go -git commit -m "test(registry): remove assertions on deleted callerPorts field" -``` - ---- - -## Task 7: Final verification - -Confirm the PR's intent end-to-end before declaring done. - -- [ ] **Step 1: Confirm no `runtime.Caller` references remain** - -Run: `grep -rn "runtime.Caller\|callerPorts\|releaseFromCaller\|trackCaller" *.go` -Expected: no output. - -- [ ] **Step 2: Confirm no `initCtx` references remain** - -Run: `grep -rn "initCtx" *.go` -Expected: no output. - -- [ ] **Step 3: Confirm the legacy `ReleaseSharedController` is gone** - -Run: `grep -n "func ReleaseSharedController" *.go` -Expected: no output. - -Run: `grep -n "ReleaseSharedController" *.go` -Expected: no output. (If output appears in `cmd/cli/`, leave it for PR8.) - -- [ ] **Step 4: Run the full test suite with race detector** - -Run: `go test -race .` -Expected: PASS, no race warnings. - -(Use the package-only path `.` rather than `./...` because `cmd/cli/` is currently broken on disk — multiple `package main` declarations in one directory. PR8 fixes it; until then, scope verification to the so_arm package.) - -- [ ] **Step 5: Build the module binary** - -Run: `make bin/arm` -Expected: success. - -- [ ] **Step 6: Verify the `make` target still works end-to-end** - -Run: `make` -Expected: success (builds module.tar.gz). - -- [ ] **Step 7: Open the PR** - -```bash -git push -u origin nhehr/speed_accel_setting -gh pr create --title "PR1: lifecycle correctness — explicit port handles, ctx, same-pointer registry" --body "$(cat <<'EOF' -## Summary -- Delete `runtime.Caller`-based reference tracking; replace with explicit `controllerPort` stored on each component -- Stop reusing `s.initCtx` after construction; thread per-call `ctx` through `doServoInitialization`, `initializeServosWithRetry`, `initializeServos`, `diagnoseConnection`, `verifyServoConfig`, and the corresponding DoCommand handlers -- Make `ControllerRegistry` return the cached `*SafeSoArmController` pointer for a port (instead of a fresh struct copy), so `Release` setting `closed=true` is observed by all holders — eliminates the gripper-closes-first race -- Add `ErrControllerClosed` sentinel; gate every public controller method -- Add scripted-mock-bus test harness (`mock_bus_test.go`); foundation for PR2-PR5 tests -- Delete `GetCurrentCalibration()` (its godoc admitted it returned the wrong value); callers should use `GetCurrentCalibrationForPort` - -Spec: docs/superpowers/specs/2026-04-27-so101-remediation-design.md § PR1. - -## Test plan -- [x] `go test -race ./...` passes -- [x] `make` succeeds -- [x] New tests cover: post-close sentinel returns, same-pointer-per-port, refcount-zero close propagates, explicit-port release decrements -- [ ] Manual smoke: rebuild module, run on hardware with arm + gripper sharing port, confirm both close cleanly without "controller leak" warnings -EOF -)" -``` - -(Skip the `gh pr create` if you'd rather review locally first.) - ---- - -## Done criteria - -- All tasks ticked. -- `go test -race ./...` passes. -- `make` produces a working module.tar.gz. -- Manual smoke on hardware confirms arm + gripper can both Close without leaving stale registry entries (verifiable via the `controller_status` DoCommand). diff --git a/docs/superpowers/specs/2026-04-27-so101-remediation-design.md b/docs/superpowers/specs/2026-04-27-so101-remediation-design.md deleted file mode 100644 index b8ac4ab..0000000 --- a/docs/superpowers/specs/2026-04-27-so101-remediation-design.md +++ /dev/null @@ -1,169 +0,0 @@ -# SO-101 Module Remediation Design - -**Date:** 2026-04-27 -**Status:** Draft -**Author:** Brainstorm with Nick Hehr - -## Context - -A full review of the `so_arm` module surfaced ~30 issues spanning critical bugs (silent calibration corruption, lifecycle leaks, a misleading speed/acceleration config), missing features (`IsHoldingSomething`, `MoveOptions`), structural problems (3-4 implementations of the same conversion math, type-explosion around calibration, `runtime.Caller`-based reference tracking), and gaps in testing/CI/docs. - -This spec sequences the remediation work into 8 risk-ordered PRs, prioritizing arm and gripper fixes before calibration sensor and discovery, and explicitly defers a possible architectural shift (singleton shared-controller → arm-as-core with gripper using DoCommand) until the in-place fixes are complete. - -## Goals - -- Land the most user-visible correctness fixes (the silent calibration write bug, the speed/accel lie, the broken `Stop`) within the first two PRs. -- Make every PR independently reviewable: behavior changes and refactors do not share a PR. -- Add test infrastructure (mock-bus harness, conversion-math round-trip tests) inside the PRs that need it, rather than as a separate phase, so each PR ships covered. -- Establish PR-time CI so future regressions of this class are caught at review. -- Leave the codebase in a state where the singleton-vs-arm-as-core architectural question can be re-evaluated honestly post-remediation. - -## Non-goals - -- **No architectural shift in this spec.** The shared-controller singleton stays. Whether to flip to arm-as-core (gripper communicates with arm via DoCommand) is a deferred decision, re-evaluated after PR8 lands. -- **No package-layout reshuffle** (`internal/`, sub-packages). Discussed in the review, intentionally out of scope here. -- **No changes to the setup-app SvelteKit application.** Calibration sensor stays one Viam resource — the web app's UX depends on it. -- **No new features beyond filling already-promised contracts.** `IsHoldingSomething`/`MoveOptions`/`Get3DModels` are already-promised interfaces returning errors; we make them work. We do not add servo diagnostics, runtime joint limits, or park-on-close in this spec. - -## Architecture decisions - -### Defer arm-as-core refactor - -The current shared-controller singleton has real boundary problems (gripper reaches across files into the controller's unexported state; calibration sensor reads `cs.controller.bus` directly), but the most painful manifestations — `runtime.Caller` PC tracking, `s.initCtx` reuse, the gripper-closes-first race — are independently fixable in place. After PR1-PR4 we will have a clean baseline against which the arm-as-core question is a more honest one. The arm-as-core option is not better than the singleton in the abstract; it is only better than the *broken* singleton. - -### Same-pointer registry semantics - -`getExistingController` currently returns a fresh `*SafeSoArmController` struct copied from the registry's cached one. The fix is to cache one `*SafeSoArmController` per port and return that exact pointer to all callers. When the bus closes, an internal `closed` flag is set; subsequent method calls return `ErrClosed`. This eliminates the gripper-closes-first race without introducing a new handle type. - -### Per-move speed, init-time acceleration - -Speed flows through `ServoGroup.SetPositionsWithSpeed` on every move, with the configured `speed_degs_per_sec` as the default and `MoveOptions.MaxVelocityRPS` as a per-call override. Acceleration is written to the `RegAcceleration` register at init/reconfigure (and re-written if reconfigure changes it). This matches Viam's expectation that `MoveOptions` overrides per call, and avoids the surprise of acceleration silently changing mid-move. - -### Free conversion helpers, not methods - -A new `conversion.go` provides `radiansFor(cal *MotorCalibration, raw int) (float64, error)` and `rawFor(cal *MotorCalibration, rad float64) (int, error)`. The gripper's "percent encoded as `[-π, π]` so the Viam gripper API can pretend it's a joint" convention lives in these helpers, not on `MotorCalibration` itself — that keeps `MotorCalibration` as a pure servo-math type. The DriveMode flip stays inside `Normalize`/`Denormalize` (single source of truth), eliminating the asymmetry bug where the gripper applied DriveMode at the radians layer while `MotorCalibration` applied it at the normalized layer. - -### Strict opMgr semantics on the arm - -`opMgr.New(ctx)` at the top of `MoveToJointPositions` cancels any in-flight move; `opMgr.CancelRunning(ctx)` from `Stop` actually stops; `IsMoving` becomes `opMgr.OpRunning()`. `moveLock` and `isMoving` are deleted. This is a user-observable behavior change: a second move now cancels the first instead of queueing behind a mutex. This is intentional — it is the correct arm semantics, matches RDK convention, and unblocks `Stop`. - -## PR sequence - -### PR1 — Lifecycle correctness - -Refactor only; no user-visible behavior change. - -- Delete `runtime.Caller`/`callerPorts` machinery in `registry.go:332-360`. Replace with explicit port tracking on each component struct (`s.controllerPort = config.Port`); `Close` calls `globalRegistry.ReleaseController(s.controllerPort)` directly. -- Stop reusing `s.initCtx` (`arm.go:111`). Pass per-call `ctx` through to `doServoInitialization`, `diagnoseConnection`, `verifyServoConfig`. Remove the `initCtx` field. -- Fix gripper-closes-first race: in `getExistingController`, cache one `*SafeSoArmController` per port in `ControllerEntry` and return that exact pointer (`registry.go:98-104`). Add `closed atomic.Bool` to `SafeSoArmController`; method calls return a sentinel `ErrControllerClosed` when set. `ReleaseController` at refcount zero sets the flag and closes the bus. -- Tests: introduce mock-bus harness based on `feetech/_examples/mock_transport`. Cover refcount-0 close, double-acquire returns same pointer, post-close method call returns sentinel error. - -### PR2 — Move-path correctness - -User-visible behavior change. Fixes the speed/accel lie, `Stop` actually stopping, and the silent calibration corruption bug. - -- Wire `speed_degs_per_sec` through `ServoGroup.SetPositionsWithSpeed`. Default from config; `MoveOptions.MaxVelocityRPS` per-call override. -- Write `RegAcceleration` per-servo at init and on reconfigure (when value changes). -- Replace `time.Sleep(moveTimeSeconds)` block (`arm.go:380-389`) with `group.WaitForStop(ctx, timeoutMs)`. `timeoutMs` derived from the current speed estimate but used only as upper-bound safety net. Honors `ctx.Done()`. -- Drop the unused `speed, acc int` parameters from `MoveServosToPositions` and `MoveToJointPositions` controller methods (the latter is dead anyway, deleted in PR3). -- Fix `writeHomingOffset` (`calibration.go:872`): correct register name (`position_offset`), use the actual `homingOffset` argument, encode as 2-byte sign-magnitude using `feetech.RegPositionOffset.SignBit`. -- Tests: mock-bus test that `MoveToJointPositions` with non-default speed produces the expected sync-write packet; mock-bus test that `Stop` cancels an in-flight move (context observed); regression test for `writeHomingOffset` correct register + payload. - -### PR3 — Conversion math consolidation + dead code removal - -Refactor only; no user-visible behavior change. Protected by tests added in this PR. - -- New `conversion.go` with `radiansFor(cal, raw)` and `rawFor(cal, rad)` covering both arm (degrees) and gripper (percent → `[-π, π]`) conventions. -- Replace inline math in: `MoveServosToPositions` (`manager.go:73-88`), `GetJointPositionsForServos` (`manager.go:150-153`), `gripper.percentToRadians` and `radiansToPercent` (`gripper.go:370-424`), `arm.calculateJointLimits` (`arm.go:130-167`), and the broken estimator at `calibration.go:838`. -- Delete dead methods: `SafeSoArmController.MoveToJointPositions` (`manager.go:31`), `GetJointPositions` (`manager.go:94`), `getCalibrationForServo` (`manager.go:253`), `CalibratedServo.SetPositionWithSpeed` (`calibrated_servo.go:213`), `GetCurrentCalibration` (`manager.go:356`). -- Delete dead config fields: `SoArm101Config.SpeedDegsPerSec`, `AccelerationDegsPerSec` (`config.go:23-24`). -- Consolidate the three calibration-update loops (`registry.go:78-90, 185-198`, `manager.go:228-244`) into one method on `SafeSoArmController`. -- Delete commented-out radians-conversion blocks at `calibration.go:427-457` and `:573-603`. -- Tests: round-trip property tests for `Normalize`/`Denormalize` over all four `NormMode` × DriveMode toggle × range edges. Round-trip for new `radiansFor`/`rawFor`. Unit tests for `calculateJointLimits` over hand-built calibrations. - -### PR4 — Surface fixes + concurrency cleanup - -Mostly behavior-correcting; the `opMgr` change is user-observable. - -- `IsHoldingSomething` (`gripper.go:358`): extract the position-vs-threshold inference from `Grab` into a private helper, reuse from `IsHoldingSomething`. Cache last `Grab` outcome optionally. -- `MoveThroughJointPositions` (`arm.go:394`): honor `MoveOptions` per step — `MaxVelocityRPS` overrides default speed for that step's `SetPositionsWithSpeed` call. -- `Get3DModels` (`arm.go:444`): return `nil, errors.ErrUnsupported` instead of a hard error. -- `arm.DoCommand` `default:` branch (`arm.go:550-600`): replace with explicit `case "set_speed"`, `case "set_acceleration"`, `case "get_motion_params"`. The default case becomes a clean unknown-command error that lists valid commands. -- `opMgr` integration: `s.opMgr.New(ctx)` at top of `MoveToJointPositions`, `s.opMgr.CancelRunning(ctx)` from `Stop`, `IsMoving` returns `s.opMgr.OpRunning()`. Delete `moveLock` and `isMoving`. -- Cache `calculateJointLimits` result on `so101` struct; invalidate on `SetCalibration` and on successful `reload_calibration` DoCommand. -- Replace `CalibratedServo.mu` with `atomic.Pointer[MotorCalibration]`. The bus already serializes wire access; the per-servo lock was redundant. -- Tests: `IsHoldingSomething` returns true after a `Grab` that succeeded and false after one that didn't (using mock bus to set position); `MoveOptions.MaxVelocityRPS` override propagates to the sync-write packet; `Stop` cancels in-flight move via opMgr (verified via context cancellation in the mock). - -### PR5 — Calibration sensor bug-fixes - -Sensor stays one resource (web app constraint). No structural extraction. - -- State-machine race teardown: consistent `if cs.recordingCancel != nil` checks in `Close`, `abortCalibration`, `resetCalibration`, `stopRangeRecording`. Audit for double-call. -- Replace `positionHistory []map[int]int` (`calibration.go:540-630`) with `var sampleCount atomic.Int64`. The history was only ever read for `len()`. Saves ~100KB churn during recording. -- Replace emoji status strings (`calibration.go:1077-1080`) with plain text. -- (Commented-out radians-conversion blocks at `:427-457` and `:573-603` are deleted in PR3.) -- Tests: state-machine progression test (idle → started → homing_position → range_recording → completed), backed by mock bus. Asserts on register writes and state transitions. Catches the writeHomingOffset class of regressions. - -### PR6 — Discovery improvements + canonical joint mapping - -- Worker-pool parallelize port scanning (`discovery.go:82-93`). Cap concurrency at 6 to avoid pathological hub behavior. Use `errgroup.WithContext` for cancellation propagation. -- Multi-baudrate discovery: delegate to `feetech.Bus.Discover`, which sweeps the package's `DefaultBaudRates` internally. Discovery loop opens a bus per candidate port at a placeholder baudrate, then calls `Discover(ctx)` and lets feetech do the sweep. If `Discover` returns no servos, the port is dropped from the result set. -- Extract canonical joint mapping into a shared file (likely `config.go` or new `joints.go`): - ```go - var SO101Joints = []struct{ ID int; Name string }{ - {1, "shoulder_pan"}, {2, "shoulder_lift"}, {3, "elbow_flex"}, - {4, "wrist_flex"}, {5, "wrist_roll"}, {6, "gripper"}, - } - ``` -- Replace `expectedMotors` literals at `calibration.go:1026-1033, :1105` and similar in discovery/arm with references to the canonical mapping. - -### PR7 — Documentation - -- Add `speed_degs_per_sec`, `acceleration_degs_per_sec_per_sec`, `motion` to the arm attributes table in README. -- Document gripper `calibrate_positions`, `set_motion_params`, `get_motion_params`. -- Document arm `set_speed`, `set_acceleration`, `get_motion_params`. -- Document the missing ~half of calibration sensor DoCommands. -- Fix the dead `MOTOR_SETUP.md` reference (line 432). -- Lift the duplicated "Communication" boilerplate (~25 lines × 3 sites) into one section; cross-link from each model. -- Add a table of contents. -- Add package-level godoc in a new `doc.go`. Backfill exported-symbol comments to ~80% coverage (today: ~30%). - -### PR8 — CI + build hygiene - -- New `.github/workflows/ci.yml` running on PRs: `go vet ./...`, `go test -race ./...`, `make`. Would have caught both the `writeHomingOffset` SA4006-class issue (unused parameter) and the `cmd/cli/` build break. -- Add `staticcheck` to `make lint`. -- Hardware build tag: `//go:build hardware` on tests currently using `t.Skip("hardware-dependent")`. Add `make test-hardware` target. -- Fix `cmd/cli/`: keep only `debug_cli` — move it to `cmd/debug/main.go`. Delete the other 8 files (`main.go`, `gentle_move.go`, `position_reader.go`, `raw_servo.go`, `read_servo.go`, `simple_test_try.go`, `sync_test_again.go`, `torque_disable.go`). Add a `make tools` target that builds `cmd/debug/`. - -## Test strategy - -Test infrastructure is built incrementally inside PRs that need it: - -- **PR1** introduces the mock-bus harness based on `feetech/_examples/mock_transport`. From this PR onward, lifecycle/move/calibration tests can use a real `feetech.Bus` over an in-process transport without hardware. -- **PR2** adds the first move-path tests using the harness. -- **PR3** adds pure-logic tests for conversion math (no harness needed). -- **PR4** adds surface tests using the harness (gripper inference, opMgr cancellation). -- **PR5** adds the first state-machine tests using the harness. - -By PR8, the existing `t.Skip("hardware-dependent")` tests in `registry_test.go` should be re-evaluated. Many can be re-enabled with the harness; the rest are tagged `//go:build hardware` and excluded from the default run. - -## Migration & compatibility - -- **PR2** is the first user-observable behavior change: speed/accel actually take effect. Users who configured these expecting them to do nothing (unlikely but possible) will see slower motion. Document in release notes. -- **PR4** is the second: a second `MoveToJointPositions` request cancels the first instead of queueing. Document in release notes. Likely to surface in the motion service's corrective-move path; this is the *correct* behavior per RDK convention. -- All other PRs are refactors or surface fixes with no user-observable contract changes. - -## Open questions - -- Whether to extract `internal/` packages comes after PR8 in the architecture revisit, not in this spec. -- `Reconfigure` support (today everything is `AlwaysRebuild`) is not in scope; revisit when the architecture decision is made. - -## Architecture revisit (post-PR8) - -After PR8, evaluate: - -1. Are the singleton's remaining boundary issues (cross-file access from calibration sensor into `cs.controller.bus`, the controller still being a separate type from the arm) painful enough to justify the arm-as-core flip? -2. Would the gripper-as-DoCommand-client model actually solve those issues, or just relocate them? -3. What's the cost of the migration vs. the cost of living with the cleaned-up singleton? - -Decision and ADR live in a separate spec written at that time.