Closed
34 commits
4530ab1
UX v2 Phase 0: unify connection status across the three indicators
pskeshu Jun 11, 2026
4571a58
UX v2 Phase 1 (scaffold): add GENTLY_UX_V2 feature flag
pskeshu Jun 11, 2026
17e66cb
UX v2 Phase 1: dual-render agent asks (chat transcript + main stage)
pskeshu Jun 11, 2026
52503e7
UX v2 Phase 2: grouped rail nav + session-context strip (behind flag)
pskeshu Jun 11, 2026
288f97b
UX v2 Phase 3a: inference-first plan mode (model-driven, with provena…
pskeshu Jun 11, 2026
e947506
UX v2 Phase 3b: surface inferred imaging spec + per-field provenance …
pskeshu Jun 11, 2026
2387225
UX v2 Phase 4: co-editable shared-visibility surface (the agent's min…
pskeshu Jun 11, 2026
e8ce2bf
UX v2 Phase 4 follow-up: show an empty-state for the agent's-view panel
pskeshu Jun 12, 2026
c98a33f
Put the free-the-port command in the "port in use" error
pskeshu Jun 12, 2026
927900a
UX v2 Phase 5: Experiment view shows real tactics only — drop stubbed…
pskeshu Jun 12, 2026
f69bc4e
UX v2 Phase 6 (step 1): flip GENTLY_UX_V2 default ON, keep v1 fallback
pskeshu Jun 12, 2026
e4de3a1
UX v2: agent-first landing → in-page plan wizard (flag-gated)
pskeshu Jun 12, 2026
37bc534
Add UX v2 entry-paradigm prototype + migration plan (design reference)
pskeshu Jun 12, 2026
6c6bc70
Detectors: forced-tool structured output for hatching + verifier
pskeshu Jun 12, 2026
4910d06
Fix recurring port-8080 false positive: SO_REUSEADDR on the viz prefl…
pskeshu Jun 12, 2026
5ee9e32
chore: gitignore the stray D:/ storage dir
pskeshu Jun 12, 2026
a8fd3e5
UX v2 landing: fix welcome→plan jump, dark mode, and the entry flow
pskeshu Jun 12, 2026
1cefe2f
UX v2: agent-activity events + GFM markdown rendering
pskeshu Jun 12, 2026
64f9338
Models: migrate to Fable 5 / Opus 4.8 / Sonnet 4.6 with refusal+400 f…
pskeshu Jun 12, 2026
0457975
Add UX v2 landing screenshot for the PR
pskeshu Jun 12, 2026
b526b23
lint: conform UX v2 + model-migration code to #47 ruff tooling
pskeshu Jun 14, 2026
ec64daf
Models: revert main tier from Fable 5 to Opus 4.8
pskeshu Jun 14, 2026
d5dac56
UX v2: expandable tool cards show full (bounded) tool results
pskeshu Jun 15, 2026
a26ee99
WIP: 3D optical-space view in the Devices tab
pskeshu Jun 15, 2026
91ec6d2
docs: UX v2 interaction-flow / IA audit
pskeshu Jun 16, 2026
9edb766
UX v2 P0: live-stream the agent turn + show reasoning during the wait
pskeshu Jun 16, 2026
f209874
Fix create/update_plan_item crash when spec/references passed as JSON…
pskeshu Jun 16, 2026
144d8dd
UX v2: add a concision/communication-style section to the plan-mode p…
pskeshu Jun 16, 2026
d061e9d
UX v2: run independent read-only tool calls concurrently
pskeshu Jun 16, 2026
a779901
Fix create_plan_item crash when phase_number is a string
pskeshu Jun 16, 2026
06741eb
Restrict tool concurrency to read-only tools; nudge batched plan-item…
pskeshu Jun 16, 2026
faee181
UX v2: plan-done state + restructured THE PLAN panel
pskeshu Jun 16, 2026
f6f2ba3
UX v2: drop wrap-up reasoning litter from the plan feed
pskeshu Jun 16, 2026
b8df610
UX v2: export-plan button replaces the end-of-plan prose upsell
pskeshu Jun 16, 2026
4 changes: 4 additions & 0 deletions .gitignore
@@ -145,3 +145,7 @@ electron/
/stage_definitions_for_review.txt
gently/ui/tui/node_modules/
gently/ui/tui/dist/

# Stray local storage: on Linux the Windows default GENTLY_STORAGE_PATH
# (D:\Gently3) is created literally as ./D:/ under the repo. Not data we track.
/D:/
112 changes: 112 additions & 0 deletions docs/HEURISTICS-AUDIT.md
@@ -0,0 +1,112 @@
# Heuristics audit — where to use the model (as a typed-output function) instead

Codebase sweep (5 parallel scanners + synthesis) for heuristics that **fake
judgment** an LLM would do better — in the spirit of the genotype→channel
refactor (drop the lookup table, let the model infer, keep a typed provenance
record + confirm-when-unsure). The flip side — logic that **must stay
deterministic** (safety, math, calibration, transport) — is listed at the end so
we don't mistakenly LLM-ify it.

The unifying move for every candidate: **LLM with a typed structured-output
schema + provenance + a confirm/UNCERTAIN escape**, never free-text-then-parse.
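
A minimal sketch of what that move could look like in Python, assuming the model's structured output has already arrived as a plain `dict`. The names `TypedVerdict`, `Confidence`, `parse_verdict`, and `needs_confirmation` are hypothetical and do not come from the codebase; the point is only that validation raises on anything off-schema instead of falling back to a silent default.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    UNCERTAIN = "uncertain"  # explicit escape: the model may decline to commit


@dataclass(frozen=True)
class TypedVerdict:
    value: str              # the inferred answer, restricted to an allowed set
    confidence: Confidence
    basis: str              # provenance: what the inference was based on
    reasoning: str


def parse_verdict(payload: dict, allowed_values: set) -> TypedVerdict:
    """Validate a structured payload; raise rather than silently default."""
    value = payload["value"]
    if value not in allowed_values:
        raise ValueError(f"value {value!r} is outside the allowed set")
    return TypedVerdict(
        value=value,
        confidence=Confidence(payload["confidence"]),  # ValueError on unknown label
        basis=payload["basis"],
        reasoning=payload["reasoning"],
    )


def needs_confirmation(verdict: TypedVerdict) -> bool:
    """Route low-confidence or uncertain verdicts to a human confirm step."""
    return verdict.confidence in (Confidence.LOW, Confidence.UNCERTAIN)
```

Under these assumptions a caller would run `parse_verdict` on every model reply and ask the user to confirm whenever `needs_confirmation` returns `True`.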

## Model candidates (ranked)

### High value

1. **Hatching / time-to-stage prediction** — `organisms/celegans/developmental_tracker.py`
*(the closest twin of genotype→channel; medium effort)*
Three hardcoded 20 °C lookup tables (`STAGE_TIMING_20C`, `TIME_TO_HATCHING`,
`TIMING_VARIABILITY`) plus magic `{HIGH:1.0, MEDIUM:1.5, LOW:2.0}` uncertainty
fudge factors. Structurally **can't use the rig's actual temperature** (we run
a TEC), the strain, or the embryo's observed progression rate. Let the model
produce a calibrated, explained interval; **keep the literature table as a
deterministic sanity bracket** and flag when the estimate falls outside it.
→ `{ predicted_minutes_to_hatching, low, high, basis, assumptions{temperature_c,strain,used_observed_rate}, confidence, reasoning }`

2. **Citation → PubMed query** — `harness/plan_mode/tools/research.py` (`_search_pmid`)
A regex that only handles "Surname et al YEAR …" + six hand-rolled query-
relaxation strategies + a stopword/word-position ladder that drops load-bearing
nouns. The model parses the sloppy citation and proposes relaxed queries; **code
keeps the deterministic esearch call and never fabricates a PMID.**
→ `{ author_last, year, journal, topic_keywords[], organism, pubmed_query, alt_queries[], confidence }`

3. **Lab-history retrieval** — `harness/plan_mode/tools/lab_context.py`, `harness/memory/interface.py`
Semantic recall faked by substring-OR over query tokens (matches "we"/"before",
misses every paraphrase). Feed the model the candidate records and have it
**rank/select from provided ids only** (no fabrication). Read-only, no
acquisition risk.
→ `{ matches:[{kind,id,summary,relevance,why_relevant}], answer }`

4. **Stage-label parse via 22-entry synonym dict** — `developmental_tracker.py` (`_parse_stage_name`)
*(small effort, pure robustness win)* The Vision call already classifies; the
brittleness is a plain-text `STAGE:/CONFIDENCE:` block scraped line-by-line, with
off-vocabulary phrasings silently collapsing to `UNKNOWN` (which kills the
downstream hatching prediction). Constrained-enum structured output deletes the
parser + synonym table.
→ `{ stage: enum(...), confidence: enum(high|medium|low), is_transitional, reasoning }`
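
The "rank/select from provided ids only" constraint in item 3 can be enforced in code after the model replies. The sketch below is illustrative: `keep_known_ids` is a hypothetical helper, and the match dicts only assume the `id` field from the schema above.

```python
def keep_known_ids(candidate_ids, model_matches):
    """Split model matches into those citing a provided id and those that do not.

    candidate_ids: ids of the records that were actually shown to the model.
    model_matches: list of dicts from the model, each expected to carry an "id".
    """
    allowed = set(candidate_ids)
    kept, rejected = [], []
    for match in model_matches:
        if match.get("id") in allowed:
            kept.append(match)
        else:
            rejected.append(match)  # fabricated or malformed: never surfaced
    return kept, rejected
```

Returning the rejected entries separately, rather than dropping them, leaves room to log how often the model cites an id it was never given.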

### Medium value (mostly small — fix the output contract, not the judgment)

5. **Calibration Vision calls** — `hardware/dispim/claude_client.py`
Four Vision calls return positional free text recovered by `'yes' in first_line`
/ `re.search(r'\d+')` / first-valid-letter, with silent defaults (so "no, this is
not yes…" reads as *yes*). Typed output deletes the parse + silent-default layer.

6. **ML architecture ranking** — `ml/architectures.py` (`get_suitable_architectures`)
Hard feasibility gates (VRAM / dataset) are correct **and stay**; the `+2/+1/+1`
point-score ranking that follows discards the per-arch prose. Let the model rank
the *pre-filtered feasible set* (ids constrained to that set).

7. **Training label normalization** — `ml/data_loader.py` (`build_labels_from_store`)
Class space built by exact-string identity over free-text human annotations —
"1.5-fold" and "1.5 fold" become different classes. Model normalizes to the
canonical staging vocabulary, flags novel/ambiguous ones.
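
The failure mode described in item 5 is easy to reproduce. The sketch below is not the code in `claude_client.py`; `naive_is_yes` imitates the substring check quoted above, and `typed_answer` is a hypothetical typed alternative that rejects anything outside a fixed enum.

```python
from enum import Enum


def naive_is_yes(reply: str) -> bool:
    """Substring check on the first line, as in the pattern quoted above."""
    lines = reply.strip().splitlines()
    first_line = lines[0].lower() if lines else ""
    return "yes" in first_line


class Answer(str, Enum):
    YES = "yes"
    NO = "no"
    UNCERTAIN = "uncertain"


def typed_answer(payload: dict) -> Answer:
    """Accept only an exact enum label; raises ValueError or KeyError otherwise."""
    return Answer(payload["answer"])


print(naive_is_yes("No, this is not yes."))  # True: a negative reply is read as yes
print(typed_answer({"answer": "no"}))        # Answer.NO
```

With the typed form there is no first line to scrape, so a malformed reply surfaces as an exception instead of a default.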

### Lower value

8. **"Plan has a control?"** (`plan_mode/tools/validation.py`): a substring scan of a
6-word keyword set stands in for what is really a scientific judgment over the whole
plan. It is a non-blocking warning, so it is safe to hand to the model.
9. **CGC HTML scraping** — `research.py` (`_cgc_search`) — positional multi-group
regex over fetched HTML; structured extraction the model does better (HTTP GET
stays code; **mark strain names low-confidence to avoid sending someone to order
a hallucinated strain**).

### Cross-cutting batch (small each): typed output for the detector/verifier cluster
`harness/detection/verifier.py`, `app/detectors/hatching.py`,
`app/detectors/dopaminergic_signal.py`, `hardware/dispim/sam_detection.py` — all
already make the right model call but reconstruct the verdict via
`startswith`/regex-JSON-scraping with silent defaults. A batch move to native
structured output **strictly reduces parse-induced false negatives** without
touching the deterministic vote-tally/consensus/enum-dispatch downstream.

**Reference implementations already in the repo (imitate, don't change):**
`dopaminergic_signal`'s perceiver→classifier rubric (typed enums, UNCERTAIN
escape, conservative-on-tie) and onboarding's `_extract_with_llm` (typed
extraction, degrade-to-verbatim fallback).

## Keep deterministic (do NOT LLM-ify)
Safety, math, calibration, and transport — where a hallucinated value is unsafe
or breaks reproducibility:
- Laser-power safety limits + wavelength→MM-property map (`hardware/dispim/devices/optical.py`)
- SPIM trigger-timing arithmetic, piezo–galvo calibration, MM framing (`dispim/config.py`)
- Calibration prior EMA + R²≥0.75 slope-lock gate (`dispim/calibration.py`)
- SwitchBot GATT byte commands / status decoding (`hardware/switchbot.py`)
- Temperature setpoint bound [0,99.9] °C + stabilization I/O (`hardware/temperature.py`)
- Autofocus signal-processing, curve fitting, adaptive-sweep stop rules (`analysis/core.py`, `analysis/focus.py`)
- Classical-CV ROI detection + pixel→stage coordinate transforms (`detection.py`, `sam_detection.py` geometry)
- Timelapse rule dispatch + `confirm_timepoints` debounce + monotonic power ramp (`app/orchestration/timelapse.py`)
- Volume→b64 dark/flat calibration + fixed brightness scaling (`dopaminergic_signal._volume_to_b64` — deliberately non-adaptive)
- Wake-router debounce/throttle/stage-transition gate (`app/wake_router.py`)
- Plan hardware limits, detector-preset membership, dependency-cycle DFS, stage-order normalization (`plan_mode/tools/validation.py`)
- Ensemble vote tally + 0.70 quorum / unanimity consensus (`detection/verifier.py`)
- ML metric/aggregation math: confusion matrix, F1, federated averaging (`ml/evaluation.py`, `federated.py`)
- Core imaging geometry (max-projection, crop bounds, Euler rotations) + UI event reduction/routing/security (`core/imaging.py`, `ui/web/*`)
- Device-state SSE watchdog/staleness timers (`app/device_state_monitor.py`)
- Reference-type dispatch (PMID/DOI/URL by canonical syntax), `os.path.isfile` checks (`research.py`)
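
As an illustration of why items like the dependency-cycle DFS belong in code, here is a generic three-colour depth-first search. It is a sketch, not the implementation in `validation.py`: the `deps` mapping (item id to the ids it depends on) and the name `find_cycle` are assumptions made for the example.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of ids, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:      # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

The same input always yields the same answer, which is the property the list above is protecting.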

## Note
`gap_assessment.conversation_weight` (the 0.25/0.1/0.05 readiness scalar) is now
largely **vestigial** — it only returns 'heavy' (lab onboarding) or 'none' — so
it's not worth an API call. Left off the candidate list.
106 changes: 106 additions & 0 deletions docs/ux-v2-flow-audit.md
@@ -0,0 +1,106 @@
# UX v2 — interaction-flow / IA audit

**Branch:** `feature/ux-v2` (now includes the 3D optical-space view).
**Scope:** the *flow* of the agent-first UI — clicks, how each step renders, how the
workspace is unveiled, moving back/forth between views, resume — **not** the visual look
(the look is fine). Plus where the 3D optical-space view belongs in the new workspace IA.

**Method:** live click-audit driven through a real browser as a *dev biologist* would use
it, with the agent **live** (Opus 4.8, `--offline` hardware, `GENTLY_NO_AUTH=1` single
controller), cross-checked against the code. Screenshots from the run are in `screenshots/audit-*.png`.

> Correction to an earlier automated pass: the plan-wizard helpers
> (`buildAskCard`/`answerChoice`/`togglePanel`) are **not** missing — `agent-chat.js`
> exports them and the module loads; the plan wizard works. The real issues are below.

---

## What works (keep it)

- **The forward path is good.** Entry → one calm choice (Plan / Quick look / "just tell me")
→ overlay dismisses to reveal the workspace → grouped rail (NOW / LIBRARY / SYSTEM) drives
everything through one chokepoint (`app.js switchTab`). The welcome→workspace unveil is genuinely nice.
- **The agent-driven plan wizard is strong.** Live, it asked a well-framed scientific question
("What's the core scientific question this run should capture?") with real C. elegans options,
ran a `query_lab_history` tool with visible provenance, and **assembled THE PLAN panel as each
answer landed** (strain → wavelengths, etc.). The "plan builds as you answer" feel is excellent.
- **The dual-render** (ask shows in the plan stage *and* the chat transcript) is implemented.

---

## Findings (prioritized)

| # | Pri | Symptom (felt) | Root cause / evidence | Fix |
|---|-----|----------------|------------------------|-----|
| 1 | **P0** | First plan step sat on "working through the next step…" for **~90s** with a static spinner — feels hung. | The wait is the model *thinking*. The streaming call requests **no thinking config** and the stream loop reads only text deltas. `conversation.py:272-275` (only `output_config.effort`), `conversation.py:654-657` (only `event.delta.text`). | Set `thinking={"type":"adaptive","display":"summarized"}` on the stream (`conversation.py:552`); handle `thinking_delta` in the loop (`:654`) and emit as a `thinking` activity; render it live + add an elapsed timer. See §1. |
| 2 | **P1** | Agent's first line renders as **"'d love to help…"** — leading "I" dropped. | Plan-feed streaming path drops the first character of the turn's first text block; the chat transcript renders it correctly (`12_41` vs `12_3` in the run). Plan feed: `landing.js applyActivity` `'text'` case (`:269`). | Most likely the first `AGENT_ACTIVITY`/`text` delta is missed by `landing.js`'s listener (subscribed after the first delta) or coalesced wrong. Confirm with a 1-line repro; the transcript path is the reference. |
| 3 | **P1** | Clicked the primary "Plan an experiment" → plan stage spun forever; the *real* blocker ("Viewing only — control is held by another client / sign in to control") was **hidden in the chat panel**. | Control/auth state isn't surfaced on the landing/plan surface — only in the chat dock. A viewer can enter the plan flow and dead-end. | Surface control/sign-in state on the landing **before** the primary CTA; gate or relabel "Plan an experiment" when `!hasControl`; show the wall on the plan stage, not just chat. |
| 4 | **P1** | (Structural) The same ask renders in **two** stage mounts plus the transcript. | `#v2-plan-ask` **and** `#ask-stage` both render the ask (the overlay covers the workspace copy, so only cosmetic/perf today). Two live regions seen in the run (`12_10` + `12_24`). | One stage mount at a time — suppress `#ask-stage` while the landing overlay owns the ask. |
| 5 | **P1** | Cross-surface clear can desync. | `ASK_CLEARED` is **listened for but emitted nowhere** (`landing.js:624`, `ask-stage.js:43` listen; no emit in repo). Answering works locally because `renderAsk.onPick` clears directly, but stage↔transcript sync relies on the missing signal. | Emit `ASK_CLEARED` the instant a `choice_response` is sent (per the migration plan's Phase-1 blocker), plus on cancel/control-loss/socket-close. |
| 6 | **P1** | **No way back.** Once the landing dismisses, there's no path back to welcome / "start a new plan" from the workspace — must reload. | `dismiss()` is one-way (`landing.js:42-54`); `V2Landing.show()` exists but is never called from the workspace. | Add a "New plan" / "Talk to Gently" entry in the rail or header that re-summons the welcome/plan surface. |
| 7 | **P2** | Browser **Back / refresh don't mean anything**; refresh mid-plan loses state and may re-show the landing. | Entry hash is consumed (`app.js` → `replaceState('/')`, ~`:650-662`); no deliberate URL/state sync; in-memory plan state (`planKickedOff`, feed pages) resets on reload. | Real routing: sync screen/tab to URL/History so Back/forward/refresh resolve; persist or re-hydrate plan progress. |
| 8 | **P2** | **Resume = full page reload** — jarring, re-shows landing, drops chat position. | `session_changed` → `window.location.href='/'` (`websocket.js:147`; `review.js resumeSession ~:101-116`). Flagged in the migration plan. | In-place re-hydration on `session_changed` instead of a hard reload. |
| 9 | **P1 (IA)** | The **3D optical-space view is buried**: SYSTEM → Devices → (Map / Details / **3D**) — a sub-sub-toggle. | It was integrated into the *legacy* Devices tab structure; the ux-v2 grouped rail doesn't surface it. | Promote "the scope in space" to a first-class run-time surface (NOW tier), reconciled with the grouped rail. See §2. |
| 10 | **P2** | Offline / agent-silent dead-ends the wizard at "working…". | `startPlan` campaign fetch falls through silently if offline (`landing.js ~:502-508`); no error path. | Timeout + inline error/retry on the plan stage. |
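
For finding #2, one of the two suspected causes (a listener that subscribes after the first delta) can be modelled in a few lines. This is a toy model in Python, not the `landing.js` code: `ActivityBus` and `render_with_late_listener` are hypothetical, and the replay buffer is only one possible remedy if the hypothesis is confirmed.

```python
class ActivityBus:
    """Tiny pub/sub bus; with replay=True, late subscribers receive the backlog."""

    def __init__(self, replay: bool = False):
        self._listeners = []
        self._backlog = []
        self._replay = replay

    def subscribe(self, listener):
        if self._replay:
            for event in self._backlog:
                listener(event)
        self._listeners.append(listener)

    def emit(self, event):
        if self._replay:
            self._backlog.append(event)
        for listener in self._listeners:
            listener(event)


def render_with_late_listener(deltas, replay):
    """Emit the first delta before the listener attaches, then the rest."""
    bus = ActivityBus(replay=replay)
    rendered = []
    bus.emit(deltas[0])
    bus.subscribe(rendered.append)
    for delta in deltas[1:]:
        bus.emit(delta)
    return "".join(rendered)


deltas = ["I", "'d love", " to help"]
print(render_with_late_listener(deltas, replay=False))  # 'd love to help
print(render_with_late_listener(deltas, replay=True))   # I'd love to help
```

If the real bug is instead a coalescing error, this model does not apply, which is why the row above asks for a repro first.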

---

## §1 — Make the loading state legible (P0, the one the user wants first)

The 90s "working…" is the agent reasoning. The Claude streaming API exposes the agent's
work on three channels; gently currently surfaces tool activity and text, but none of the reasoning:

- **Thinking** — `content_block_delta` → `thinking_delta`. **Opus 4.8 defaults to
`display:"omitted"` (empty thinking text)**, and gently doesn't set the thinking config at
all on the stream, so there's nothing to show. Unlock: `thinking={"type":"adaptive","display":"summarized"}`.
- **Tool activity** — `input_json_delta` + tool start/stop. **Already flowing** — the plan feed
renders tool cards (saw the `query_lab_history` card with input/result).
- **Text** — `text_delta`. Already flowing (this is the path with the bug #2 truncation).

**Backend (`gently/harness/conversation.py`):**
1. `:552` `self.claude.messages.stream(...)` — add `thinking={"type":"adaptive","display":"summarized"}`
(keep `output_config.effort`).
2. `:654` event loop — currently only `if hasattr(event.delta, "text")`. Add a branch for
`event.delta.type == "thinking_delta"` → `yield {"type":"thinking","text": event.delta.thinking}`.
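
The loop change in step 2 could look roughly like the sketch below. It is written against stand-in event objects so it runs without the SDK: `activities_from_stream` and `fake_delta` are hypothetical names, the `{"type":"text", ...}` shape is an assumption mirroring the `thinking` dict above, and the real loop in `conversation.py` may be structured differently.

```python
from types import SimpleNamespace


def activities_from_stream(events):
    """Map streaming events to activity dicts, dispatching on the delta type."""
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue  # ignore message/content-block start and stop events
        delta = event.delta
        kind = getattr(delta, "type", None)
        if kind == "thinking_delta":
            yield {"type": "thinking", "text": delta.thinking}
        elif kind == "text_delta":
            yield {"type": "text", "text": delta.text}


def fake_delta(kind, **fields):
    """Build a stand-in for a content_block_delta event (test helper only)."""
    return SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type=kind, **fields),
    )


events = [
    fake_delta("thinking_delta", thinking="Weighing strain options"),
    fake_delta("text_delta", text="I'd love to help"),
    SimpleNamespace(type="message_stop"),
]
for activity in activities_from_stream(events):
    print(activity)
```

Dispatching on `delta.type` rather than `hasattr(event.delta, "text")` keeps each delta kind on its own branch, so adding a new kind later does not change how text is handled.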

**Frontend (`gently/ui/web/static/js/landing.js`):** `applyActivity` already has a `thinking`
case (`:266`) that only sets a static label — render the streamed thinking text instead, and add
an elapsed timer to `#v2-plan-thinking` so a long think reads as progress, not a hang.

Net: the reasoning summary + current tool + a timer fill the wait. Only the backend `display`
flag is a new capability; the rest is surfacing data gently already receives.

---

## §2 — Workspace organization & where the 3D view belongs (P1, IA)

The ux-v2 workspace is organized differently from the old flat tab bar: a **grouped rail**
(NOW: Home/Experiment/Embryos · LIBRARY: Plans/Sessions · SYSTEM: Devices/Calibration/Logs),
a **session-context strip**, and the **AGENT'S VIEW** surface. The 3D optical-space view,
however, lives in the *legacy* Devices structure (`devices.js switchView`, VIEWS =
`['map','details','optical3d']`; `index.html` devices-content Map/Details/3D switcher).

During an actual run, "where the scope is in space" + the live experiment + the agent's view are
**NOW-tier** concerns, not a System utility three clicks deep. Proposal (to design next):
- Promote the 3D optical-space + live experiment to a first-class run-time surface in the rail
(or make it the default workspace view while a run is active).
- Keep the Devices Map/Details as the System-tier hardware utility; the 3D "scope in space"
graduates out of that toggle.

---

## Recommended sequencing

1. **P0 loading state** (§1) — highest felt value, mostly surfacing existing data.
2. **P1 quick correctness**: #2 truncation, #3 control-wall surfacing, #4 single ask mount, #5 `ASK_CLEARED` emit.
3. **P1 reachability**: #6 "new plan"/back entry; then #9 the workspace-IA / 3D-placement redesign (its own design pass).
4. **P2 navigation**: #7 real routing, #8 resume re-hydration, #10 offline error path.

---

## Notes / housekeeping

- Findings 1–5, 10 verified live with the agent on; 6–9 verified from code + the live rail.
- `screenshots/audit-*.png` (live run) and `screenshots/uxv2-*.png` are local evidence (untracked).
- The earlier visual-design exploration (`docs/superpowers/mockups/`, `screenshots/dir-*.png`) is
superseded — the look is staying as-is — and can be deleted.