gently-project · pskeshu · Jun 11, 2026
diff --git a/.gitignore b/.gitignore
@@ -145,3 +145,7 @@ electron/
 /stage_definitions_for_review.txt
 gently/ui/tui/node_modules/
 gently/ui/tui/dist/
+
+# Stray local storage: on Linux the Windows default GENTLY_STORAGE_PATH
+# (D:\Gently3) is created literally as ./D:/ under the repo. Not data we track.
+/D:/
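Review note on the `.gitignore` hunk above: a minimal sketch of why a Windows-style default ends up as a directory under the repo on Linux. It assumes the configured value reaches the path join in forward-slash form (`D:/Gently3`); the `resolve` helper and the `/repo` working directory are hypothetical and are not taken from the gently codebase.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Assumption: forward-slash form of the Windows default GENTLY_STORAGE_PATH.
DEFAULT_STORAGE = "D:/Gently3"


def resolve(cwd: str, configured: str, flavour=PurePosixPath):
    """Join a configured storage path onto the working directory.

    An absolute right-hand side replaces the left-hand side; a relative one
    is appended. Whether "D:/..." counts as absolute depends on the flavour.
    """
    return flavour(cwd) / configured


# POSIX has no drive letters, so "D:" is just a directory name.
posix_result = resolve("/repo", DEFAULT_STORAGE, PurePosixPath)

# On Windows the drive makes the path absolute, so the cwd is discarded.
windows_result = resolve("C:/repo", DEFAULT_STORAGE, PureWindowsPath)

print(posix_result)    # /repo/D:/Gently3
print(windows_result)  # D:\Gently3
```

If the value arrives with a backslash instead, POSIX treats the whole string as a single file name, which is a different stray artifact than the `./D:/` directory ignored here.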
diff --git a/docs/HEURISTICS-AUDIT.md b/docs/HEURISTICS-AUDIT.md
@@ -0,0 +1,112 @@
+# Heuristics audit — where to use the model (as a typed-output function) instead
+
+Codebase sweep (5 parallel scanners + synthesis) for heuristics that **fake
+judgment** an LLM would do better — in the spirit of the genotype→channel
+refactor (drop the lookup table, let the model infer, keep a typed provenance
+record + confirm-when-unsure). The flip side — logic that **must stay
+deterministic** (safety, math, calibration, transport) — is listed at the end so
+we don't mistakenly LLM-ify it.
+
+The unifying move for every candidate: **LLM with a typed structured-output
+schema + provenance + a confirm/UNCERTAIN escape**, never free-text-then-parse.
+
+## Model candidates (ranked)
+
+### High value
+
+1. **Hatching / time-to-stage prediction** — `organisms/celegans/developmental_tracker.py`
+   *(the closest twin of genotype→channel; medium effort)*
+   Three hardcoded 20 °C lookup tables (`STAGE_TIMING_20C`, `TIME_TO_HATCHING`,
+   `TIMING_VARIABILITY`) plus magic `{HIGH:1.0, MEDIUM:1.5, LOW:2.0}` uncertainty
+   fudge factors. Structurally **can't use the rig's actual temperature** (we run
+   a TEC), the strain, or the embryo's observed progression rate. Let the model
+   produce a calibrated, explained interval; **keep the literature table as a
+   deterministic sanity bracket** and flag when the estimate falls outside it.
+   → `{ predicted_minutes_to_hatching, low, high, basis, assumptions{temperature_c,strain,used_observed_rate}, confidence, reasoning }`
+
+2. **Citation → PubMed query** — `harness/plan_mode/tools/research.py` (`_search_pmid`)
+   A regex that only handles "Surname et al YEAR …" + six hand-rolled query-
+   relaxation strategies + a stopword/word-position ladder that drops load-bearing
+   nouns. The model parses the sloppy citation and proposes relaxed queries; **code
+   keeps the deterministic esearch call and never fabricates a PMID.**
+   → `{ author_last, year, journal, topic_keywords[], organism, pubmed_query, alt_queries[], confidence }`
+
+3. **Lab-history retrieval** — `harness/plan_mode/tools/lab_context.py`, `harness/memory/interface.py`
+   Semantic recall faked by substring-OR over query tokens (matches "we"/"before",
+   misses every paraphrase). Feed the model the candidate records and have it
+   **rank/select from provided ids only** (no fabrication). Read-only, no
+   acquisition risk.
+   → `{ matches:[{kind,id,summary,relevance,why_relevant}], answer }`
+
+4. **Stage-label parse via 22-entry synonym dict** — `developmental_tracker.py` (`_parse_stage_name`)
+   *(small effort, pure robustness win)* The Vision call already classifies; the
+   brittleness is a plain-text `STAGE:/CONFIDENCE:` block scraped line-by-line, with
+   off-vocabulary phrasings silently collapsing to `UNKNOWN` (which kills the
+   downstream hatching prediction). Constrained-enum structured output deletes the
+   parser + synonym table.
+   → `{ stage: enum(...), confidence: enum(high|medium|low), is_transitional, reasoning }`
+
+### Medium value (mostly small — fix the output contract, not the judgment)
+
+5. **Calibration Vision calls** — `hardware/dispim/claude_client.py`
+   Four Vision calls return positional free text recovered by `'yes' in first_line`
+   / `re.search(r'\d+')` / first-valid-letter, with silent defaults (so "no, this is
+   not yes…" reads as *yes*). Typed output deletes the parse + silent-default layer.
+
+6. **ML architecture ranking** — `ml/architectures.py` (`get_suitable_architectures`)
+   Hard feasibility gates (VRAM / dataset) are correct **and stay**; the `+2/+1/+1`
+   point-score ranking that follows discards the per-arch prose. Let the model rank
+   the *pre-filtered feasible set* (ids constrained to that set).
+
+7. **Training label normalization** — `ml/data_loader.py` (`build_labels_from_store`)
+   Class space built by exact-string identity over free-text human annotations —
+   "1.5-fold" and "1.5 fold" become different classes. Model normalizes to the
+   canonical staging vocabulary, flags novel/ambiguous ones.
+
+### Lower value
+
+8. **"Plan has a control?"** — `plan_mode/tools/validation.py` — substring scan of a
+   6-word keyword set, standing in for a scientific judgment over the whole plan.
+   It is a non-blocking warning, so it is safe to hand to the model.
+9. **CGC HTML scraping** — `research.py` (`_cgc_search`) — positional multi-group
+   regex over fetched HTML; structured extraction the model does better (HTTP GET
+   stays code; **mark strain names low-confidence to avoid sending someone to order
+   a hallucinated strain**).
+
+### Cross-cutting batch (small each): typed output for the detector/verifier cluster
+`harness/detection/verifier.py`, `app/detectors/hatching.py`,
+`app/detectors/dopaminergic_signal.py`, `hardware/dispim/sam_detection.py` — all
+already make the right model call but reconstruct the verdict via
+`startswith`/regex-JSON-scraping with silent defaults. A batch move to native
+structured output **strictly reduces parse-induced false negatives** without
+touching the deterministic vote-tally/consensus/enum-dispatch downstream.
+
+**Reference implementations already in the repo (imitate, don't change):**
+`dopaminergic_signal`'s perceiver→classifier rubric (typed enums, UNCERTAIN
+escape, conservative-on-tie) and onboarding's `_extract_with_llm` (typed
+extraction, degrade-to-verbatim fallback).
+
+## Keep deterministic (do NOT LLM-ify)
+Safety, math, calibration, and transport — where a hallucinated value is unsafe
+or breaks reproducibility:
+- Laser-power safety limits + wavelength→MM-property map (`hardware/dispim/devices/optical.py`)
+- SPIM trigger-timing arithmetic, piezo–galvo calibration, MM framing (`dispim/config.py`)
+- Calibration prior EMA + R²≥0.75 slope-lock gate (`dispim/calibration.py`)
+- SwitchBot GATT byte commands / status decoding (`hardware/switchbot.py`)
+- Temperature setpoint bound [0,99.9] °C + stabilization I/O (`hardware/temperature.py`)
+- Autofocus signal-processing, curve fitting, adaptive-sweep stop rules (`analysis/core.py`, `analysis/focus.py`)
+- Classical-CV ROI detection + pixel→stage coordinate transforms (`detection.py`, `sam_detection.py` geometry)
+- Timelapse rule dispatch + `confirm_timepoints` debounce + monotonic power ramp (`app/orchestration/timelapse.py`)
+- Volume→b64 dark/flat calibration + fixed brightness scaling (`dopaminergic_signal._volume_to_b64` — deliberately non-adaptive)
+- Wake-router debounce/throttle/stage-transition gate (`app/wake_router.py`)
+- Plan hardware limits, detector-preset membership, dependency-cycle DFS, stage-order normalization (`plan_mode/tools/validation.py`)
+- Ensemble vote tally + 0.70 quorum / unanimity consensus (`detection/verifier.py`)
+- ML metric/aggregation math: confusion matrix, F1, federated averaging (`ml/evaluation.py`, `federated.py`)
+- Core imaging geometry (max-projection, crop bounds, Euler rotations) + UI event reduction/routing/security (`core/imaging.py`, `ui/web/*`)
+- Device-state SSE watchdog/staleness timers (`app/device_state_monitor.py`)
+- Reference-type dispatch (PMID/DOI/URL by canonical syntax), `os.path.isfile` checks (`research.py`)
+
+## Note
+`gap_assessment.conversation_weight` (the 0.25/0.1/0.05 readiness scalar) is now
+largely **vestigial** — it only returns 'heavy' (lab onboarding) or 'none' — so
+it's not worth an API call. Left off the candidate list.
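Review note on `docs/HEURISTICS-AUDIT.md`: the "typed output + provenance + UNCERTAIN escape" move can be sketched in a few lines, using the substring failure from item 5 as the counterexample. This is an illustration under assumptions, not the repo's code: `Verdict`, `TypedVerdict`, `legacy_parse`, and `parse_typed` are hypothetical names, and the JSON payload stands in for whatever native structured-output mechanism the model call would use.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    YES = "yes"
    NO = "no"
    UNCERTAIN = "uncertain"  # explicit escape instead of a silent default


@dataclass(frozen=True)
class TypedVerdict:
    verdict: Verdict
    reasoning: str
    source: str  # provenance: "model" or "fallback"


def legacy_parse(first_line: str) -> bool:
    """The substring check described in item 5: any 'yes' anywhere wins."""
    return "yes" in first_line.lower()


def parse_typed(raw_json: str) -> TypedVerdict:
    """Validate a structured reply; anything malformed becomes UNCERTAIN."""
    try:
        payload = json.loads(raw_json)
        return TypedVerdict(
            verdict=Verdict(payload["verdict"]),  # rejects off-vocabulary values
            reasoning=str(payload.get("reasoning", "")),
            source="model",
        )
    except (ValueError, KeyError, TypeError):
        return TypedVerdict(Verdict.UNCERTAIN, "unparseable reply", "fallback")


# The free-text path reads a negative answer as positive.
print(legacy_parse("no, this is not yes"))  # True

# The typed path keeps the negative, and degrades to UNCERTAIN on bad input.
print(parse_typed('{"verdict": "no", "reasoning": "edge is blurred"}').verdict)
print(parse_typed("not json at all").verdict)
```

The deterministic consumers listed under "Keep deterministic" would then branch on the enum, with `UNCERTAIN` routed to a confirm step rather than defaulting to either answer.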
diff --git a/docs/ux-v2-flow-audit.md b/docs/ux-v2-flow-audit.md
@@ -0,0 +1,106 @@
+# UX v2 — interaction-flow / IA audit
+
+**Branch:** `feature/ux-v2` (now includes the 3D optical-space view).
+**Scope:** the *flow* of the agent-first UI — clicks, how each step renders, how the
+workspace is unveiled, moving back/forth between views, resume — **not** the visual look
+(the look is fine). Plus where the 3D optical-space view belongs in the new workspace IA.
+
+**Method:** live click-audit driven through a real browser as a *dev biologist* would use
+it, with the agent **live** (Opus 4.8, `--offline` hardware, `GENTLY_NO_AUTH=1` single
+controller), cross-checked against the code. Screenshots from the run are in `screenshots/audit-*.png`.
+
+> Correction to an earlier automated pass: the plan-wizard helpers
+> (`buildAskCard`/`answerChoice`/`togglePanel`) are **not** missing — `agent-chat.js`
+> exports them and the module loads; the plan wizard works. The real issues are below.
+
+---
+
+## What works (keep it)
+
+- **The forward path is good.** Entry → one calm choice (Plan / Quick look / "just tell me")
+  → overlay dismisses to reveal the workspace → grouped rail (NOW / LIBRARY / SYSTEM) drives
+  everything through one chokepoint (`app.js switchTab`). The welcome→workspace unveil is genuinely nice.
+- **The agent-driven plan wizard is strong.** Live, it asked a well-framed scientific question
+  ("What's the core scientific question this run should capture?") with real C. elegans options,
+  ran a `query_lab_history` tool with visible provenance, and **assembled THE PLAN panel as each
+  answer landed** (strain → wavelengths, etc.). The "plan builds as you answer" feel is excellent.
+- **The dual-render** (ask shows in the plan stage *and* the chat transcript) is implemented.
+
+---
+
+## Findings (prioritized)
+
+| # | Pri | Symptom (felt) | Root cause / evidence | Fix |
+|---|-----|----------------|------------------------|-----|
+| 1 | **P0** | First plan step sat on "working through the next step…" for **~90s** with a static spinner — feels hung. | The wait is the model *thinking*. The streaming call requests **no thinking config** and the stream loop reads only text deltas. `conversation.py:272-275` (only `output_config.effort`), `conversation.py:654-657` (only `event.delta.text`). | Set `thinking={"type":"adaptive","display":"summarized"}` on the stream (`conversation.py:552`); handle `thinking_delta` in the loop (`:654`) and emit as a `thinking` activity; render it live + add an elapsed timer. See §1. |
+| 2 | **P1** | Agent's first line renders as **"'d love to help…"** — leading "I" dropped. | Plan-feed streaming path drops the first character of the turn's first text block; the chat transcript renders it correctly (`12_41` vs `12_3` in the run). Plan feed: `landing.js applyActivity` `'text'` case (`:269`). | Most likely the first `AGENT_ACTIVITY`/`text` delta is missed by `landing.js`'s listener (subscribed after the first delta) or coalesced wrong. Confirm with a 1-line repro; the transcript path is the reference. |
+| 3 | **P1** | Clicked the primary "Plan an experiment" → plan stage spun forever; the *real* blocker ("Viewing only — control is held by another client / sign in to control") was **hidden in the chat panel**. | Control/auth state isn't surfaced on the landing/plan surface — only in the chat dock. A viewer can enter the plan flow and dead-end. | Surface control/sign-in state on the landing **before** the primary CTA; gate or relabel "Plan an experiment" when `!hasControl`; show the wall on the plan stage, not just chat. |
+| 4 | **P1** | (Structural) The same ask renders in **two** stage mounts plus the transcript. | `#v2-plan-ask` **and** `#ask-stage` both render the ask (the overlay covers the workspace copy, so only cosmetic/perf today). Two live regions seen in the run (`12_10` + `12_24`). | One stage mount at a time — suppress `#ask-stage` while the landing overlay owns the ask. |
+| 5 | **P1** | Cross-surface clear can desync. | `ASK_CLEARED` is **listened for but emitted nowhere** (`landing.js:624`, `ask-stage.js:43` listen; no emit in repo). Answering works locally because `renderAsk.onPick` clears directly, but stage↔transcript sync relies on the missing signal. | Emit `ASK_CLEARED` the instant a `choice_response` is sent (per the migration plan's Phase-1 blocker), plus on cancel/control-loss/socket-close. |
+| 6 | **P1** | **No way back.** Once the landing dismisses, there's no path back to welcome / "start a new plan" from the workspace — must reload. | `dismiss()` is one-way (`landing.js:42-54`); `V2Landing.show()` exists but is never called from the workspace. | Add a "New plan" / "Talk to Gently" entry in the rail or header that re-summons the welcome/plan surface. |
+| 7 | **P2** | Browser **Back / refresh don't mean anything**; refresh mid-plan loses state and may re-show the landing. | Entry hash is consumed (`app.js` → `replaceState('/')`, ~`:650-662`); no deliberate URL/state sync; in-memory plan state (`planKickedOff`, feed pages) resets on reload. | Real routing: sync screen/tab to URL/History so Back/forward/refresh resolve; persist or re-hydrate plan progress. |
+| 8 | **P2** | **Resume = full page reload** — jarring, re-shows landing, drops chat position. | `session_changed` → `window.location.href='/'` (`websocket.js:147`; `review.js resumeSession ~:101-116`). Flagged in the migration plan. | In-place re-hydration on `session_changed` instead of a hard reload. |
+| 9 | **P1 (IA)** | The **3D optical-space view is buried**: SYSTEM → Devices → (Map / Details / **3D**) — a sub-sub-toggle. | It was integrated into the *legacy* Devices tab structure; the ux-v2 grouped rail doesn't surface it. | Promote "the scope in space" to a first-class run-time surface (NOW tier), reconciled with the grouped rail. See §2. |
+| 10 | **P2** | Offline / agent-silent dead-ends the wizard at "working…". | `startPlan` campaign fetch falls through silently if offline (`landing.js ~:502-508`); no error path. | Timeout + inline error/retry on the plan stage. |
+
+---
+
+## §1 — Make the loading state legible (P0, the one the user wants first)
+
+The ~90s "working…" wait is the agent reasoning. The Claude streaming API exposes a turn on three
+channels; gently currently surfaces tool activity and text, but none of the reasoning:
+
+- **Thinking** — `content_block_delta` → `thinking_delta`. **Opus 4.8 defaults to
+  `display:"omitted"` (empty thinking text)**, and gently doesn't set the thinking config at
+  all on the stream, so there's nothing to show. Unlock: `thinking={"type":"adaptive","display":"summarized"}`.
+- **Tool activity** — `input_json_delta` + tool start/stop. **Already flowing** — the plan feed
+  renders tool cards (saw the `query_lab_history` card with input/result).
+- **Text** — `text_delta`. Already flowing (this is the path with the bug #2 truncation).
+
+**Backend (`gently/harness/conversation.py`):**
+1. `:552` `self.claude.messages.stream(...)` — add `thinking={"type":"adaptive","display":"summarized"}`
+   (keep `output_config.effort`).
+2. `:654` event loop: it currently checks only `if hasattr(event.delta, "text")`. Add a branch for
+   `event.delta.type == "thinking_delta"` → `yield {"type":"thinking","text": event.delta.thinking}`.
+
+**Frontend (`gently/ui/web/static/js/landing.js`):** `applyActivity` already has a `thinking`
+case (`:266`) that only sets a static label — render the streamed thinking text instead, and add
+an elapsed timer to `#v2-plan-thinking` so a long think reads as progress, not a hang.
+
+Net: the reasoning summary + current tool + a timer fill the wait. Only the backend `display`
+flag is a new capability; the rest is surfacing data gently already receives.
+
+---
+
+## §2 — Workspace organization & where the 3D view belongs (P1, IA)
+
+The ux-v2 workspace is organized differently from the old flat tab bar: a **grouped rail**
+(NOW: Home/Experiment/Embryos · LIBRARY: Plans/Sessions · SYSTEM: Devices/Calibration/Logs),
+a **session-context strip**, and the **AGENT'S VIEW** surface. The 3D optical-space view,
+however, lives in the *legacy* Devices structure (`devices.js switchView`, VIEWS =
+`['map','details','optical3d']`; `index.html` devices-content Map/Details/3D switcher).
+
+During an actual run, "where the scope is in space" + the live experiment + the agent's view are
+**NOW-tier** concerns, not a System utility three clicks deep. Proposal (to design next):
+- Promote the 3D optical-space + live experiment to a first-class run-time surface in the rail
+  (or make it the default workspace view while a run is active).
+- Keep the Devices Map/Details as the System-tier hardware utility; the 3D "scope in space"
+  graduates out of that toggle.
+
+---
+
+## Recommended sequencing
+
+1. **P0 loading state** (§1) — highest felt value, mostly surfacing existing data.
+2. **P1 quick correctness**: #2 truncation, #3 control-wall surfacing, #4 single ask mount, #5 `ASK_CLEARED` emit.
+3. **P1 reachability**: #6 "new plan"/back entry; then #9 the workspace-IA / 3D-placement redesign (its own design pass).
+4. **P2 navigation**: #7 real routing, #8 resume re-hydration, #10 offline error path.
+
+---
+
+## Notes / housekeeping
+
+- Findings 1–5, 10 verified live with the agent on; 6–9 verified from code + the live rail.
+- `screenshots/audit-*.png` (live run) and `screenshots/uxv2-*.png` are local evidence (untracked).
+- The earlier visual-design exploration (`docs/superpowers/mockups/`, `screenshots/dir-*.png`) is
+  superseded — the look is staying as-is — and can be deleted.
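Review note on §1 of `docs/ux-v2-flow-audit.md`: the event-loop change in backend step 2 can be sketched without the SDK by feeding stand-in events through a generator. The event and delta field names (`content_block_delta`, `thinking_delta`, `text_delta`, `.thinking`, `.text`) follow the streaming shapes the audit cites; the function name `activities_from_stream`, the `_delta` helper, and the shape of the `text` activity are assumptions made for illustration.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def activities_from_stream(events: Iterable) -> Iterator[dict]:
    """Map raw stream events to UI activities, branching on the delta type.

    Dispatching on delta.type (rather than hasattr(delta, "text")) is what
    lets thinking deltas through instead of silently dropping them.
    """
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue
        delta = event.delta
        kind = getattr(delta, "type", None)
        if kind == "thinking_delta":
            yield {"type": "thinking", "text": delta.thinking}
        elif kind == "text_delta":
            # Assumed activity shape for text; the audit only specifies "thinking".
            yield {"type": "text", "text": delta.text}
        # Other delta kinds (e.g. input_json_delta) are left to the existing
        # tool-activity path and are not handled in this sketch.


def _delta(kind: str, **fields) -> SimpleNamespace:
    """Build a stand-in content_block_delta event for the demo."""
    return SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type=kind, **fields),
    )


fake_stream = [
    SimpleNamespace(type="message_start"),
    _delta("thinking_delta", thinking="Checking lab history first."),
    _delta("text_delta", text="I'd love to help"),
    _delta("input_json_delta", partial_json='{"q":'),
]

for activity in activities_from_stream(fake_stream):
    print(activity)
```

Because the generator yields each text delta as it arrives, a harness like this can also serve as the reference when reproducing the first-character truncation in finding #2 on the frontend side.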