feat: voice dictation — Slack/Discord audio → text, supervised with the daemon (Parakeet-MLX / Canary-Qwen) #1

Open
kwliang1 wants to merge 56 commits into main from feat/voice-dictation-canary-qwen

Conversation

@kwliang1 kwliang1 commented Jun 28, 2026

Owner

What

Adds voice dictation to Hydra: inbound audio attachments (Discord voice notes, Slack voice clips / audio uploads) are transcribed to text so you can dictate prompts to Claude alongside text and images. Claude has no native audio input, so the daemon transcribes first and merges the result as a [voice transcript] ... block; the original audio stays in downloaded_files.
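A minimal sketch of the merge step, assuming a hypothetical `mergeTranscripts` helper and `Transcript` shape (the real names and exact block layout in daemon/transcription.ts are not shown in this description). It only illustrates appending a `[voice transcript]` block to the message text:

```typescript
// Hypothetical shape; the real type in daemon/transcription.ts may differ.
interface Transcript {
  file: string; // name of the downloaded audio file
  text: string; // text returned by the sidecar
}

// Append one "[voice transcript] ..." block per non-empty transcript.
// Pure: no I/O, so it is easy to unit-test.
function mergeTranscripts(message: string, transcripts: Transcript[]): string {
  const blocks = transcripts
    .map((t) => t.text.trim())
    .filter((text) => text.length > 0)
    .map((text) => `[voice transcript] ${text}`);
  if (blocks.length === 0) return message; // nothing to merge
  const base = message.trim();
  return base.length > 0 ? [base, ...blocks].join("\n\n") : blocks.join("\n\n");
}
```

A voice note with no accompanying text would yield just the transcript block; a message with text keeps the text first and the transcript after it.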

Rebased on the latest upstream main (the hydra CLI, Slack Home tab, markdown chunking, etc.). The unrelated proj:/dir: spawn-prefix commit that was previously on this branch is dropped — that's its own PR (upstream sf8193#60).

Integrated into daemon start (not a separate service to remember)

  • hydra up starts the sidecar with the daemon (kicked off right after the daemon tmux spawn so the model loads while the byte comes up). hydra down stops it. Every hydra watchdog tick revives it. Legacy start-daemon.sh / watchdog.sh call the same path.
  • All supervisors go through start-transcribe.sh --auto, one shared gate: explicit HYDRA_TRANSCRIBE_AUTOSTART=1/0 wins in both directions; when unset it starts the sidecar iff a real backend is set up (venv built) and quietly no-ops otherwise — machines without dictation never log a failure per watchdog cycle. After the one-time ./transcribe-server/setup.sh there is zero extra config.
  • One shared tmux session (hydra-transcribe) serves every platform daemon — per-platform sessions would race for the same port.
  • A crashed server parks instead of exiting: the session stays alive holding the error, so a broken config (bad port, missing ffmpeg, failed model download) fails once — not a model-load crash-loop every 120s watchdog tick. Fix the cause, tmux kill-session -t hydra-transcribe to retry.
  • The mock backend is never auto-supervised (manual run or explicit AUTOSTART=1 only) — leftover test config can't keep canned transcripts flowing into real prompts. A remote HYDRA_TRANSCRIBE_URL disables local autostart.
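The gate in the list above can be sketched as a pure function. This is an illustration in TypeScript, not the shell in start-transcribe.sh; `shouldAutostart` and `GateInput` are hypothetical names, and the position of the `HYDRA_TRANSCRIBE_ENABLED=0` and remote-URL checks relative to an explicit AUTOSTART value is an assumption:

```typescript
type Backend = "parakeet" | "canary" | "mock";

interface GateInput {
  autostart?: string; // HYDRA_TRANSCRIBE_AUTOSTART: "1", "0", or unset
  enabled?: string;   // HYDRA_TRANSCRIBE_ENABLED: "0" opts out
  remoteUrl: boolean; // HYDRA_TRANSCRIBE_URL points at another host
  backend: Backend;
  venvBuilt: boolean; // ./transcribe-server/setup.sh has been run
}

function shouldAutostart(g: GateInput): boolean {
  if (g.enabled === "0") return false;    // assumption: the opt-out is checked first
  if (g.autostart === "1") return true;   // explicit value wins in both directions
  if (g.autostart === "0") return false;
  if (g.remoteUrl) return false;          // a remote sidecar disables local autostart
  if (g.backend === "mock") return false; // mock is never auto-supervised
  return g.venvBuilt;                     // unset: start only if a real backend is set up
}
```

With AUTOSTART unset, a machine that never ran setup.sh returns `false` here, which corresponds to the quiet no-op on each watchdog tick.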

Slack audio attachments

  • Slack voice clips arrive as message events with subtype file_share/slack_audio and a files[] entry (mimetype: audio/mp4, .m4a); the gateway only filters bot_message, so they flow through downloadAttachments (bearer-auth url_private_download) into the transcription hook like any attachment.
  • Detection covers Slack shapes: audio/mp4, audio/webm;codecs=opus (browser recordings), extension fallback for generic mimetypes only — a definitive non-audio MIME (video/mp4 screen recording) is never re-classified as voice. Both sidecar servers resample via ffmpeg, so m4a/ogg/webm all work.
  • New transcribeDownloads tests pin the contract with a stubbed fetch: only audio files are POSTed, sidecar failure skips (never blocks delivery), the size cap is enforced before the network, the disabled flag short-circuits.
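The detection rule in the list above could look roughly like this. The extension and generic-MIME lists are illustrative assumptions, and the signature is a guess at `isAudioFile`, not a copy of it:

```typescript
// Illustrative lists; the real sets in daemon/transcription.ts are not shown in this PR.
const AUDIO_EXTENSIONS = new Set(["m4a", "mp3", "ogg", "oga", "opus", "wav", "webm", "flac"]);
const GENERIC_MIMES = new Set(["", "application/octet-stream", "binary/octet-stream"]);

function isAudioFile(name: string, mimetype?: string): boolean {
  // Strip codec parameters: "audio/webm;codecs=opus" becomes "audio/webm".
  const [head = ""] = (mimetype ?? "").split(";");
  const mime = head.trim().toLowerCase();
  if (mime.startsWith("audio/")) return true;
  // A definitive non-audio MIME (e.g. video/mp4) is never re-classified by extension.
  if (!GENERIC_MIMES.has(mime)) return false;
  // Only generic or missing MIME types fall back to the file extension.
  const dot = name.lastIndexOf(".");
  const ext = dot >= 0 ? name.slice(dot + 1).toLowerCase() : "";
  return AUDIO_EXTENSIONS.has(ext);
}
```

Under these assumptions a Slack clip (`audio/mp4`) and a browser recording (`audio/webm;codecs=opus`) both pass, while a `video/mp4` screen recording named `voice.m4a` does not.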

Packaged default (not opt-in)

  • On by default on the daemon side. Whenever a transcription sidecar is reachable, voice notes are transcribed; when it isn't, audio passes through untouched (fetch fails fast and is skipped; a live-but-slow sidecar delays only the voice message itself, bounded by HYDRA_TRANSCRIBE_TIMEOUT_MS). HYDRA_TRANSCRIBE_ENABLED=0 opts out.
  • Self-hosted, audio stays local. macOS (Apple Silicon) → Parakeet-MLX (default); Linux+GPU → NVIDIA Canary-Qwen 2.5B via NeMo. Swappable behind a one-line HTTP contract (POST /transcribe → {"text": ...}).
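On the daemon side, the response half of that contract might be validated along these lines. `parseTranscribeResponse` is a hypothetical helper; returning `null` stands in for "log and skip", and treating an empty `text` as a skip is an assumption:

```typescript
// Accepts the raw response body and returns the transcript, or null when the
// body does not match the {"text": ...} contract.
function parseTranscribeResponse(body: string): string | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return null; // not JSON: treat as a sidecar failure and skip
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const text = (parsed as { text?: unknown }).text;
  if (typeof text !== "string") return null;
  const trimmed = text.trim();
  return trimmed.length > 0 ? trimmed : null; // assumption: empty text is skipped
}
```

Because a backend only has to return this one field, swapping Parakeet-MLX for Canary-Qwen (or the mock) would not change the daemon side.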

Hardening (3-lens adversarial review: engineering / ops blast-radius / security)

  • The sidecar env gets only dictation keys from the state-dir .env (no more set -a sourcing that leaked bot tokens into the model server's environment); parsing matches shell sourcing (export prefix, quoted values verbatim, inline comments stripped).
  • PATH forwarded into the tmux pane (launchd-frozen server env broke ffmpeg lookup); every interpolated value shell-quoted (shq); mock backend falls back to system python (asdf shims fail without a pinned version).
  • transcribeFile checks size via statSync before reading (the cap protects daemon memory, not just sidecar latency); a URL without an explicit port binds the scheme default so a mismatch fails visibly instead of silently serving the wrong port; start-daemon.sh's sidecar step can't fail the script under set -e after a successful daemon start.
  • Fixed a test-infra bug from the original branch: a module-level process.stderr.write stub swallowed every later test file's output (and hid a suite crash).
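The key-allowlisted .env extraction from the first bullet is implemented in shell in start-transcribe.sh; the TypeScript below is only a sketch of the same parsing rules. `extractDictationEnv` is a hypothetical name, and the allowlist contains only keys mentioned in this PR, so the real list may be longer:

```typescript
// Illustrative allowlist built from keys named in this PR.
const DICTATION_KEYS = new Set([
  "HYDRA_TRANSCRIBE_ENABLED",
  "HYDRA_TRANSCRIBE_AUTOSTART",
  "HYDRA_TRANSCRIBE_URL",
  "HYDRA_TRANSCRIBE_TIMEOUT_MS",
  "PARAKEET_MODEL",
]);

function extractDictationEnv(dotenv: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rawLine of dotenv.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blank or comment line
    // Optional "export " prefix, then KEY=VALUE.
    const match = /^(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)=(.*)$/.exec(line);
    if (!match) continue;
    const [, key = "", raw = ""] = match;
    if (!DICTATION_KEYS.has(key)) continue; // bot tokens and other keys never pass
    const lead = raw.trimStart();
    const quote = lead[0];
    const close = quote === '"' || quote === "'" ? lead.indexOf(quote, 1) : -1;
    if (close > 0) {
      // Quoted: keep the content verbatim, so a # inside quotes is not a comment.
      out[key] = lead.slice(1, close);
    } else {
      // Unquoted: drop a trailing inline comment and surrounding whitespace.
      out[key] = raw.replace(/\s+#.*$/, "").trim();
    }
  }
  return out;
}
```

The sketch does not handle escaped quotes or multi-line values; it covers only the three cases the bullet names.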

Try it now — no GPU

./start-transcribe.sh mock        # GPU-free stub (manual only — never auto-supervised)
# send a voice note -> Claude gets "[voice transcript] This is a mock transcription..."

Real model (one-time; needs ffmpeg)

./transcribe-server/setup.sh      # venv + the right backend for your platform
# done — the sidecar now starts and stays up with the daemon

Changes

  • daemon/transcription.ts — audio detection + transcript merge (pure, unit-tested) + HTTP client; failures logged and skipped, never block delivery.
  • daemon/router.ts — hook into buildNotificationPayload after attachment download; voice_transcript meta marker.
  • cli/lifecycle.ts, cli/helpers.ts — sidecar supervision in hydra up / down / watchdog (startTranscribeAuto, shared transcribeTmux).
  • start-transcribe.sh — --auto gate, shared session, park-on-crash, key-allowlisted .env extraction, shq quoting.
  • start-daemon.sh, watchdog.sh — call the shared --auto path (replaces the watchdog's AUTOSTART=1 grep opt-in; sidecar step runs before the watchdog's early exits).
  • transcribe-server/server_mlx.py (Parakeet-MLX), server.py (Canary-Qwen via NeMo + ffmpeg), mock_server.py (stdlib stub), setup.sh, requirements*.txt, README.md.
  • daemon/__tests__/transcription.test.ts — helper tests + Slack shapes + transcribeDownloads with stubbed fetch + proper env save/restore.
  • .env.example, README.md.

Verified: bun build clean on daemon.ts / bridge.ts / cli/hydra.ts; bun test → 298/298; mock sidecar e2e over the real code path (transcribeDownloads → multipart POST → merge), incl. sidecar-down fail-fast; live checks of the full --auto gate matrix, park-on-crash (forced bind failure → parked once, no respawn), .env parsing edge cases, no token leak into the sidecar env (ps eww), hydra watchdog starting and hydra down stopping the sidecar. Reviewed by 3 parallel independent agents (engineering / ops / security) over 4 rounds until 2 successive clean rounds.

🤖 Generated with Claude Code

kwliang1 pushed a commit that referenced this pull request Jun 28, 2026
Sam's suggestion #1: the cached `status: 'live' | 'dead'` field can go
stale if a session dies outside daemon control. Replace with `deadAt?:
number` (records when death was detected) and `isAlive()` helper that
checks tmux directly. Dead code `isSessionDead()` removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kwliang1 kwliang1 force-pushed the feat/voice-dictation-canary-qwen branch from 4bbdf4d to 994ac68 Compare June 28, 2026 22:32
@kwliang1 kwliang1 changed the title from "feat: voice dictation via Canary-Qwen transcription sidecar" to "feat: voice dictation (Canary-Qwen) — packaged default" Jun 28, 2026
@kwliang1 kwliang1 changed the title from "feat: voice dictation (Canary-Qwen) — packaged default" to "feat: voice dictation — Parakeet-MLX (macOS) / Canary-Qwen (GPU)" Jun 29, 2026
dcetlin and others added 26 commits June 29, 2026 10:30
Moves all spawn prompt construction (spawn, fork, handoff, resurrect)
out of session-lifecycle.ts into dedicated prompt builders. Sharpens
the set_description instruction: "Lead with the domain if one is
clear. 5 words max. Rewrite it whenever your focus shifts."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…INSTRUCTION

- Move prompts.ts into daemon/prompts/session.ts alongside existing
  protocol prompts (build-critic, review-critic, design-*, etc.)
- Export DESCRIPTION_INSTRUCTION for protocol prompts to compose in
- Honest commit scope: this is refactor + behavioral change (10→5 words,
  domain-leading, lower rewrite threshold)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: extract session prompts to daemon/prompts.ts
Adds `hydra` CLI for programmatic session management over the Unix
socket. Commands: spawn, list, status, kill, health, clear-key.

- Idempotency keys prevent duplicate spawns (survives daemon restarts)
- Initiator tracked as structured field on SpawnOpts and SessionInfo
- @mentions allowFrom users in CLI-spawned threads for auto-join
- Kill race fixed: capture key before kill, overwrite after death handler
- DEFAULT_SESSION_CHANNEL updated to active server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fetch helpers now return null on API failure (distinct from [] for
success-with-no-items). pollPr() only advances lastCheckedAt when all
three comment fetches succeed, so a failed poll cycle retries the same
time window instead of permanently skipping comments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
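The null-versus-empty distinction in the commit above can be sketched as follows. `advanceIfComplete`, `FetchResult`, and `PollState` are hypothetical names; only `lastCheckedAt` and the "all three fetches must succeed" rule come from the commit message:

```typescript
// null = the API call failed; [] = it succeeded with no items.
type FetchResult<T> = T[] | null;

interface PollState {
  lastCheckedAt: number;
}

// Advance the window only when every fetch succeeded, so a failed cycle
// retries the same time range instead of skipping comments.
function advanceIfComplete<T>(state: PollState, results: FetchResult<T>[], now: number): PollState {
  const allSucceeded = results.every((r) => r !== null);
  return allSucceeded ? { lastCheckedAt: now } : state;
}
```

An empty array still counts as success, which is why the helpers return `null` rather than `[]` on failure.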
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: hydra CLI — programmatic daemon interface
Re-implements PRs sf8193#39 and sf8193#42 which were merged to `live` (not `main`)
and lost in the 2026-06-28 live rebuild.

- readAccessFile() now spreads parsed over defaults — new Access fields
  no longer silently drop
- defaultListen on Access and GroupPolicy types
- resolveListenState cascade: thread listenOverride → group defaultListen
  → global defaultListen → false
- listen/unlisten commands persist listenOverride to ThreadMetadata so
  respawned sessions inherit the thread's listen preference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
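A sketch of the two mechanisms in the commit above, the defaults spread and the listen cascade. The field sets on these types are assumptions (only `defaultListen`, `listenOverride`, and `allowFrom` appear in this PR), and `withDefaults` is a hypothetical stand-in for the spread inside `readAccessFile()`:

```typescript
interface ThreadMetadata { listenOverride?: boolean }
interface GroupPolicy { defaultListen?: boolean }
interface Access { allowFrom: string[]; defaultListen?: boolean }

const DEFAULT_ACCESS: Access = { allowFrom: [], defaultListen: false };

// Spread parsed values over defaults so a newly added field keeps its default
// when an older access file does not mention it.
function withDefaults(parsed: Partial<Access>): Access {
  return { ...DEFAULT_ACCESS, ...parsed };
}

// Cascade: thread listenOverride, then group defaultListen, then global
// defaultListen, then false. `??` only skips undefined/null, so an explicit
// `false` at a higher level still wins.
function resolveListenState(thread?: ThreadMetadata, group?: GroupPolicy, access?: Access): boolean {
  return thread?.listenOverride ?? group?.defaultListen ?? access?.defaultListen ?? false;
}
```

Using `??` rather than `||` matters here: a thread that was explicitly unlistened stays unlistened even if the group default is `true`.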
msg is not in doSpawnSession's scope — chatId is the correct parameter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ThrottledQueue previously swallowed errors without retry — if ch.setName()
threw for any reason, the visual update was permanently lost. Now re-enqueues
on failure (up to 3 attempts), preserving original priority and coalescing
with any newer value for the same key.

Documents the empirically measured Discord shared-scope rate limit on thread
renames: under burst conditions, ~2 rapid renames trigger 429 + retry-after
~600s (x-ratelimit-scope: shared). Per-channel vs global scoping unconfirmed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(discord): connectivity-aware resilience via gateway health contract
Discord enforces a shared-scope rate limit on thread renames (~2 per
burst window). Mid-protocol turn transitions consumed the budget before
completion could land.

Thread renames now fire only on outer state changes (spawn, protocol
start/end, kill, cancel). Mid-protocol progress moves to thread-visible
text — review uses a single live-edited status message (gateway.edit,
5/2s rate limit), build and design embed badges in existing status
messages. Design badges use formatPhaseBadge() (single source of truth).

A 3-round review goes from 8+ renames to 2 — completion always lands
immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ment, per-platform logs, byte revival, safe restart preflight
Typing `! <message>` in a thread sends Escape to the tmux session
(interrupting current work), then delivers the message normally.

- Uses Bun.spawn array form (no shell, no injection surface)
- Adds initiator field to SpawnOpts for CLI compatibility
- Resolves type errors from sf8193#53 visual system refactor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: ThrottledQueue retry on failure + rate limit docs
fix: restore defaultListen + persist listen across respawns
Publishes a live session overview to the bot's Home tab using Block Kit.
Auto-updates on session changes (debounced) and periodically every 5 min.
Shows status, thread links, description, and age for each live session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement onReaction in SlackGateway so the existing hocho handler in
the daemon router works for Slack, not just Discord. Handles thread
parents by deleting children first, reacts ⚠️ when threads contain
undeletable messages from other users. Includes bot self-reaction guard
for parity with Discord.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reverts the thread-parent delete logic that fetched and deleted all
children before retrying the parent. Single-message delete only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Track lastReplyId on sessions from both inbound messages (router) and
outbound bot replies (bridge-dispatch). Dashboard and CLI list use
gateway.getMessageUrl to build thread-scoped deep links that open
Slack to the latest message in the thread panel.

Includes debounced persist (2s coalesce), deleted-message cleanup,
startup backfill via Slack's latest_reply, and shutdown flush.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add spawn input to Home tab (text field with Enter dispatch)
- Show PR watch links as context blocks under each session
- Auto-unwatch PRs on merge/close during poll cycle
- Backfill PR titles from GitHub on daemon startup
- Auth checks on home:spawn and app_home_opened (allowFrom gate)
- Deduplicate PR API call in pollPr (pass prData to fetchCheckStatus)
- Fix lastReplyId semantic split — standardize on outbound reply ID
- Escape mrkdwn injection in session descriptions
- Cap block count at 31 to stay under Slack's 100-block view limit
- Input max_length: 500 with handler-side truncation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each daemon now writes daemon-{platform}.json alongside the legacy
daemon.json during bridge sync. The bridge checks the platform-keyed
file first, eliminating the last-writer-wins race when two daemons
share a plugin cache. CHAT_PLATFORM is now propagated to spawned
sessions and the Slack byte so bridges can resolve the correct file.

Fully backwards compatible — old bridges fall through to daemon.json,
old daemons still write daemon.json which new bridges accept as
legacy fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dcetlin and others added 28 commits July 1, 2026 11:28
feat: platform-keyed daemon config for dual-daemon operation
Add daemon+byte lifecycle commands to the hydra CLI. Platform is
always required (no default).

- `hydra up <platform>` — validate byte script exists, check for
  running tmux sessions and orphaned claude processes (prevents
  gotcha sf8193#32 ping-pong), start daemon, wait for socket, start byte
- `hydra down <platform>` — stop byte via stop-byte.sh (orphan
  cleanup), stop daemon, remove stale socket + PID file
- `hydra restart <platform>` — restart daemon only (picks up code
  changes)

No hardcoded platform enum — uses filesystem-based validation
(does start-{platform}-byte.sh exist?). New platforms work with
zero CLI changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: hydra up/down/restart — CLI lifecycle management
Replaces start-byte-v2.sh (discord) and start-slack-byte.sh with a
single platform-agnostic start-byte.sh. The v2 daemon+bridge
architecture is now the only architecture — the version suffix was a
migration artifact. Old names preserved as thin deprecation wrappers
(inject CHAT_PLATFORM, exec start-byte.sh, print notice to stderr).

Key changes beyond dedup:
- Shared env preamble (env-setup.sh) — PATH, .env sourcing, STATE_DIR
  in one place, sourced by every script (including preflight.sh)
- Strict mode (set -euo pipefail) on all executable scripts
- Progressive CHAT_PLATFORM enforcement — refuses to default when
  multiple platform state dirs exist (N-platform aware)
- Auth tokens read from file, not interpolated into command strings
- macOS assertion makes the platform contract explicit
- Unified log paths: ~/hydra-${CHAT_PLATFORM}-{daemon,byte}.log
- Consistent #!/bin/bash shebangs across all scripts
- Script architecture documented in README (layering, conventions)
- Deprecation wrappers for backwards compat (shell history, worktrees)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: shared preamble, unified byte script, strict mode
…adAt

Four fixes to the crash detection category:

1. Remove duplicate inline death check in bridge-server.ts socket.on('end')
   — checkSessionDeath() already covers this case via setTimeout. The inline
   copy fired in parallel, producing double death notices.

2. Health poll (daemon.ts) changed from OR to AND — only flag as crashed
   when BOTH tmux is dead AND bridge is disconnected. Bridge-only
   disconnects are handled by the bridge-server disconnect handler (3s
   delay + tmux check). The OR condition false-positived on temporary
   bridge drops and newly spawned sessions.

3. Health poll now sets info.deadAt + calls registry.persist() +
   threadRegistry update + refreshSessionVisual(), fully consistent
   with checkSessionDeath().

4. 60s spawn grace period — skip crash detection for sessions younger
   than SPAWN_GRACE_MS (bridge needs time to connect after spawn).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
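Fixes 2 and 4 above reduce to a small predicate. `SPAWN_GRACE_MS` is named in the commit; `looksCrashed` and the `SessionHealth` shape are hypothetical:

```typescript
const SPAWN_GRACE_MS = 60 * 1000; // 60s spawn grace period

interface SessionHealth {
  spawnedAt: number;        // epoch ms when the session was spawned
  tmuxAlive: boolean;       // result of a direct tmux check
  bridgeConnected: boolean; // bridge socket currently connected
}

function looksCrashed(s: SessionHealth, now: number): boolean {
  // Young sessions are skipped: the bridge needs time to connect after spawn.
  if (now - s.spawnedAt < SPAWN_GRACE_MS) return false;
  // AND, not OR: a bridge-only drop is left to the bridge-server disconnect handler.
  return !s.tmuxAlive && !s.bridgeConnected;
}
```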
fix: 60s grace period on crash detection for spawned sessions
DEFAULT_SESSION_CHANNEL was hardcoded to a Discord channel ID as
fallback. This is deployment-specific config that belongs in .env,
not source code. Now required from .env — daemon warns on startup
if missing.

Also fixes load order: export moved after .env sourcing so .env
values are actually read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: DEFAULT_SESSION_CHANNEL must read after .env sourcing
Absorb all 10 shell scripts (start-daemon, start-byte, stop-byte,
restart-daemon, watchdog, preflight, env-setup, compile-check,
kill-orphan-bytes) into the TypeScript CLI as cli/helpers.ts and
cli/lifecycle.ts. The CLI entry point (cli/hydra.ts) is now a slim
router that delegates to typed lifecycle commands.

New commands: hydra watchdog <platform>, hydra preflight <platform>.
Shell scripts retained for backward compat until production validated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hydra install <platform> generates a launchd plist with correct paths
for the current user, loads it, creates the state dir, and runs
preflight. hydra uninstall removes it. Simplifies new user setup to:
bun install → create .env → hydra install → hydra up.

README rewritten to use CLI commands instead of shell scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hydra install now accepts --cwd and --config-dir flags so users don't
need env vars. All executable shell scripts now print a deprecation
warning pointing to the CLI equivalent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If git worktree remove --force fails (e.g. corrupted .git file),
fall back to rm -rf but only when the path is inside a .worktrees/
directory to prevent accidental deletion of real repos.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
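The guard on the rm -rf fallback might be expressed like this. `isInsideWorktrees` is a hypothetical name, the check assumes POSIX-style paths, and rejecting any unresolved `..` segment is an extra precaution added for the sketch rather than something the commit states:

```typescript
// True only when the path points at something inside a .worktrees/ directory.
function isInsideWorktrees(path: string): boolean {
  const segments = path.split("/").filter((s) => s !== "" && s !== ".");
  if (segments.includes("..")) return false; // refuse unresolved traversal
  const idx = segments.indexOf(".worktrees");
  // Require at least one segment after .worktrees, so the directory itself is never a target.
  return idx >= 0 && idx < segments.length - 1;
}
```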
Splits that land inside a ``` code fence now close the fence at the
chunk boundary and reopen it (with the language tag) at the start of
the next chunk, so multi-part messages render correctly in Slack and
Discord. The new 'markdown' mode is the default; 'length' and 'newline'
modes are preserved for back-compat.

Split-point preference: paragraph break outside fence > line break
outside fence > line break inside fence > space > hard limit.
Best-effort avoidance of mid-table-row splits. Reserves 4 chars of
headroom for the fence closer so chunks never exceed the stated limit.
Progress guard prevents infinite loops when fence overhead >= cut size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
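A much-simplified sketch of the close-and-reopen idea from the commit above. It splits only at line breaks and assumes no single line is longer than the limit minus the fence overhead; the paragraph, table-row, and hard-limit handling of the real chunker are omitted, and `chunkMarkdown` is a hypothetical name:

```typescript
const FENCE = "`".repeat(3); // three backticks
const CLOSER = "\n" + FENCE; // 4 chars of headroom for the fence closer

function chunkMarkdown(text: string, limit: number): string[] {
  const chunks: string[] = [];
  let current = "";
  let openFence: string | null = null; // opening fence line (with language tag) while inside a fence

  for (const line of text.split("\n")) {
    const candidate = current === "" ? line : current + "\n" + line;
    const onlyReopener = openFence !== null && current === openFence;
    // Progress guard: never cut a chunk that holds nothing but a fence opener.
    if (current !== "" && !onlyReopener && candidate.length + CLOSER.length > limit) {
      // Close the fence at the chunk boundary and reopen it in the next chunk.
      chunks.push(openFence !== null ? current + CLOSER : current);
      current = openFence !== null ? openFence + "\n" + line : line;
    } else {
      current = candidate;
    }
    // Toggle fence state after the line has been placed.
    if (line.trimStart().startsWith(FENCE)) {
      openFence = openFence === null ? line.trim() : null;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Reserving the closer's length on every chunk, whether or not it ends inside a fence, keeps the arithmetic simple at the cost of a few unused characters per chunk.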
feat(daemon): add markdown-aware message chunking mode
Transcribe inbound audio attachments (Discord voice notes, Slack audio
clips) to text so users can dictate prompts to Claude alongside text and
images. Claude has no native audio input, so the daemon transcribes first
and merges the result into the message as a [voice transcript] block; the
original audio file stays in downloaded_files.

- daemon/transcription.ts: audio detection, transcript merging (pure,
  unit-tested), and an HTTP client to a self-hosted STT sidecar. Failures
  are logged and skipped — dictation never blocks message delivery.
- daemon/router.ts: hook transcription into buildNotificationPayload after
  attachment download.
- transcribe-server/: self-hosted sidecar serving NVIDIA Canary-Qwen 2.5B
  via NeMo (top of the Open ASR leaderboard for English accuracy), plus a
  GPU-free mock_server.py for testing the wiring locally.
- Off by default; enable with HYDRA_TRANSCRIBE_ENABLED=1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make dictation work as a packaged default rather than a manual opt-in:

- Daemon transcription is now ON by default ("auto"): it's attempted
  whenever audio arrives and silently no-ops if no sidecar is reachable
  (the fetch fails fast). HYDRA_TRANSCRIBE_ENABLED=0 opts out.
- start-transcribe.sh: idempotent launcher for the sidecar in a tmux
  session. Accepts a backend arg (`./start-transcribe.sh mock`) for a
  zero-GPU end-to-end test; canary backend refuses cleanly until set up.
- watchdog.sh: revive the sidecar when HYDRA_TRANSCRIBE_AUTOSTART is set,
  reusing the same supervision pattern as the bot session.
- transcribe-server/setup.sh: one-time venv + NeMo install.
- Docs/env updated for the packaged-default flow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pasting a doc line with an inline '# comment' into interactive zsh (which
does not strip '#') passed the comment as args, making the launcher try
backend '#'. Only accept mock|canary as the positional arg; ignore
anything else with a warning and fall back to env/default.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Canary-Qwen runs via NeMo and needs a CUDA GPU, so it can't run on Mac.
Add a Parakeet-MLX backend (NVIDIA Parakeet TDT on Apple's MLX runtime):
native, ~50x realtime on M-series, ~6% English WER, no GPU.

- transcribe-server/server_mlx.py — FastAPI server, same /transcribe
  contract, parakeet-mlx + ffmpeg resample.
- transcribe-server/requirements-mlx.txt — light deps (no torch/NeMo).
- start-transcribe.sh — `parakeet` backend; default by platform
  (Darwin -> parakeet, else canary).
- setup.sh — installs the right requirements per platform / arg.
- Docs updated across README, transcribe-server/README, .env.example.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MLX ships arm64-only wheels, and on Apple Silicon the shell frequently
runs under Rosetta (x86_64), where 'pip install mlx' fails with no
matching wheel. setup.sh now uses an arm64 Homebrew python@3.12 and
builds the venv via 'arch -arm64' for the parakeet backend; uv/system
python paths remain for canary. Also relax the parakeet-mlx pin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tmux does not reliably inherit env, so a PARAKEET_MODEL override in .env
never reached the server (it would fall back to the 0.6B default). Forward
model-selection vars explicitly, only when set. Enables pinning the smaller
110M Parakeet on constrained networks where the 2.4GB 0.6B is impractical.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tests

The module-top-level stub leaked into every test file loaded after it,
swallowing bun test's per-test output and final summary for the rest of
the suite (bun runs all files in one process). The stub was unnecessary:
these tests only exercise pure helpers that never write to stderr.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Voice dictation now comes up with the daemon instead of needing a separate
manual step or an explicit opt-in flag:

- start-transcribe.sh gains an --auto mode used by every supervisor:
  explicit HYDRA_TRANSCRIBE_AUTOSTART wins in both directions; when unset it
  starts the sidecar iff the backend is ready (venv built, or mock chosen),
  and quietly no-ops otherwise so unconfigured machines don't log a failure
  every watchdog cycle. Honors HYDRA_TRANSCRIBE_ENABLED=0.
- hydra up starts it right after the daemon (model loads while the byte
  comes up); hydra watchdog revives it each tick; hydra down stops it.
- Legacy start-daemon.sh and watchdog.sh call the same --auto path; the
  watchdog's AUTOSTART=1 opt-in grep is replaced by the shared gate.
- mock backend: resolve python3 up front and fall back to /usr/bin/python3 —
  asdf's shim fails when no python version is pinned for the dir, and
  mock_server.py is pure stdlib.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Slack voice clips arrive as files with mimetype audio/mp4 (m4a), or
audio/webm;codecs=opus from browser recordings — assert detection for those
shapes plus the extension fallback. Add transcribeDownloads tests with a
stubbed fetch: only audio files are POSTed, sidecar failure skips the file
instead of throwing, and the disabled flag short-circuits before the network.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…d 1)

Structural:
- ONE shared tmux session (hydra-transcribe) for all platform daemons —
  per-platform sessions raced for the same default port, and the loser
  reloaded the full model every watchdog tick. hydra down kills it; the
  other platform's watchdog revives it within a tick.
- A crashed server PARKS its session (error on screen + in log) instead of
  exiting, so supervisors' has-session check holds: broken configs fail
  once, not as a model-load crash-loop every 120s.
- The mock backend is excluded from --auto (manual or explicit
  AUTOSTART=1 only) — leftover test config must not keep canned
  transcripts flowing into real prompts. A remote HYDRA_TRANSCRIBE_URL
  also disables local autostart.

Environment/robustness:
- Extract only dictation keys from the state-dir .env instead of set -a
  sourcing the whole file — the model server has no business holding chat
  bot tokens (they leaked into the tmux server env when this script
  bootstrapped it).
- Forward PATH into the tmux pane (launchd-frozen server env lacks
  /opt/homebrew/bin, breaking the servers' ffmpeg lookup) and shell-quote
  every interpolated value (shq) so an embedded quote can't break out of
  the tmux command string.
- URL without an explicit port now binds the scheme default so a mismatch
  fails visibly instead of the sidecar silently serving a port the daemon
  never queries.
- start-daemon.sh: sidecar refusal no longer fails the whole script under
  set -e after a successful daemon start. Legacy watchdog runs the sidecar
  step before the daemon branches' early exits.
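Two of the items above, sketched in TypeScript for illustration (the script itself is shell). `shq` is the name used in this PR, but the body here is simply standard POSIX single-quote escaping and may not match the script's implementation; `bindPort` is a hypothetical name for the scheme-default rule:

```typescript
// POSIX single-quote escaping: close the quote, emit an escaped quote, reopen.
function shq(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// A URL without an explicit port binds the scheme default (80 or 443),
// so a mismatch between daemon and sidecar fails visibly.
function bindPort(url: string): number {
  const u = new URL(url);
  if (u.port !== "") return Number(u.port);
  return u.protocol === "https:" ? 443 : 80;
}
```

Quoting every interpolated value this way means an embedded quote in a path or model name cannot terminate the tmux command string early.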

Daemon:
- isAudioFile: a definitive non-audio MIME (video/mp4 screen recording) is
  no longer re-classified as audio by its extension; only generic types
  fall back. Codec suffixes (audio/webm;codecs=opus) parsed correctly.
- transcribeFile checks size via statSync BEFORE reading — the cap now
  protects daemon memory, not just sidecar latency.
- Tests: env save/restore moved into beforeEach/afterEach (the old
  describe-body restore ran at collection time and leaked env into later
  test files); network tests set HYDRA_TRANSCRIBE_ENABLED explicitly; new
  cases for video-MIME rejection and the pre-network size cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
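The ordering of the daemon-side checks in the commit above can be illustrated with a pure pre-network filter. `selectForTranscription`, `Download`, and `maxBytes` are hypothetical names; in the real code the size comes from `statSync` before the file is read:

```typescript
interface Download {
  path: string;
  isAudio: boolean;  // result of the audio-detection step
  sizeBytes: number; // from a stat call, before the file is read
}

interface TranscribeOptions {
  enabled: boolean; // false when HYDRA_TRANSCRIBE_ENABLED=0
  maxBytes: number; // hypothetical name for the size cap
}

// Decide which files would be POSTed, without reading any file or touching the network.
function selectForTranscription(files: Download[], opts: TranscribeOptions): Download[] {
  if (!opts.enabled) return []; // the disabled flag short-circuits
  return files.filter((f) => f.isAudio && f.sizeBytes <= opts.maxBytes);
}
```

Checking the size from file metadata first is what lets the cap protect daemon memory: an oversized file is rejected before its bytes are ever loaded.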
- --auto with AUTOSTART unset now also requires BACKEND != mock — with a
  built venv, leftover BACKEND=mock in .env passed the venv-only gate and
  auto-supervised canned transcripts, the exact residue case the previous
  commit claimed to prevent.
- .env key extraction now parses like shell sourcing: optional 'export'
  prefix, quoted values kept verbatim (a # inside quotes is not a
  comment), unquoted values lose trailing inline comments/whitespace.
  The grep|cut version kept ' 0  # never' whole, silently defeating
  explicit opt-outs and erroring per watchdog tick on commented backends.
- Document that multi-platform machines must keep dictation config
  identical across platform .env files (shared session = first
  supervisor wins).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@kwliang1 kwliang1 force-pushed the feat/voice-dictation-canary-qwen branch from 6136e07 to 2e5be02 Compare July 3, 2026 08:38
@kwliang1 kwliang1 changed the title from "feat: voice dictation — Parakeet-MLX (macOS) / Canary-Qwen (GPU)" to "feat: voice dictation — Slack/Discord audio → text, supervised with the daemon (Parakeet-MLX / Canary-Qwen)" Jul 3, 2026

3 participants