[codex] Add meeting memory transcriber#74

Draft
Savin99 wants to merge 13 commits into salute-developers:main from Savin99:codex/meeting-memory-postgres

Savin99 commented Jun 16, 2026

What changed

  • Adds the local mac_transcriber pipeline for GigaAM transcription, meeting report generation, Zoom backfill/import helpers, launchd examples, and local web UI assets.
  • Adds repository/agent instructions plus implementation notes for the meeting memory work.
  • Includes tests for archive handling, diarization/reporting, memory DB behavior, service endpoints, Markdown transcription CLI, Zoom backfill, and Zoom import.
  • Renders adaptive protocol sections in Markdown and HTML reports.

Validation

  • git diff --cached --check
  • .venv/bin/python -m pytest mac_transcriber/tests (83 passed)

Notes

Real secrets and local transcript artifacts were left out of git; examples use placeholders only.

Ilya and others added 13 commits June 16, 2026 16:14
…g-memory work

- AI report critic (MAC_TRANSCRIBER_REPORT_CRITIC_MODEL, off by default): structured
  edit-ops (keep/drop/merge/rewrite) for dedup, cross-section dedup and ASR/clarity
  rewrite of formal sections; citations preserved from source items, re-validated with
  rollback to the pre-critic report on failure.

- recovery: stop raw verbatim transcript utterances leaking into formal sections via
  recover_base_items (_is_clean_recovered_statement).

- includes pending meeting-memory-postgres pipeline work (asr/memory_db/service/
  scripts/tests).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
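
The critic's edit-ops and rollback described above could look roughly like the sketch below. It is only an illustration: `Item`, `apply_edit_ops`, `critic_pass`, and the op dictionary shape (`op`, `sources`, `text`) are hypothetical names, not the actual mac_transcriber API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Item:
    """One report statement plus the citations that back it (hypothetical shape)."""
    text: str
    citations: list[str] = field(default_factory=list)


def apply_edit_ops(items, ops):
    """Apply keep/drop/merge/rewrite ops; citations are copied from source items only."""
    result = []
    for op in ops:
        kind = op["op"]
        if kind == "drop":
            continue
        sources = [items[i] for i in op["sources"]]
        citations = []
        for src in sources:
            for cite in src.citations:
                if cite not in citations:
                    citations.append(cite)
        if kind == "keep":
            text = sources[0].text
        elif kind in ("merge", "rewrite"):
            text = op["text"]  # new wording comes from the critic model
        else:
            raise ValueError(f"unknown edit op: {kind!r}")
        result.append(Item(text, citations))
    return result


def critic_pass(items, ops, is_valid):
    """Return the edited report, or the untouched pre-critic report on any failure."""
    try:
        edited = apply_edit_ops(items, ops)
    except (KeyError, IndexError, ValueError):
        return items
    return edited if is_valid(edited) else items
```

Because the model only supplies wording and source indices, it cannot introduce a citation that was not already attached to a source item, and a failed re-validation returns the original list unchanged.
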
Add MAC_TRANSCRIBER_VAD flag to choose the per-track speech detector in build_segments: rms (default, unchanged) or silero (silero-vad neural VAD). silero fragments less and keeps in-sentence boundaries intact while capturing slightly more speech at comparable speed.

Segment timestamps stay absolute (source sample position), so chronological order across tracks and the pyannote diarization path are unaffected; in multi-track mode the speaker is still the track, not the VAD. Falls back to rms with a logged warning when silero-vad is missing.

Includes ab_vad_eval.py A/B harness and test_vad.py coverage.

Co-authored-by: Claude <claude@anthropic.com>
Flip MAC_TRANSCRIBER_VAD default from rms to silero so all new recordings use the neural detector; rms stays as the opt-out fallback. Validated across all multi-track meetings: cross-track speech overlap is the same or lower than rms on all but one recording (no systematic ghost turns), and silero captures >= speech on 23/33 tracks while fragmenting less and keeping in-sentence boundaries intact.

Co-authored-by: Claude <claude@anthropic.com>
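
A minimal sketch of the detector selection after the default flip, assuming the silero-vad package imports as `silero_vad`. The function name `resolve_vad`, its parameters, and the handling of unrecognized values are assumptions made for illustration; only the `MAC_TRANSCRIBER_VAD` flag, its two values, and the logged fallback come from the commit messages.

```python
import importlib.util
import logging
import os

logger = logging.getLogger("mac_transcriber.vad")


def resolve_vad(env=None, silero_available=None):
    """Pick the per-track speech detector: 'silero' by default, 'rms' as opt-out."""
    env = os.environ if env is None else env
    choice = env.get("MAC_TRANSCRIBER_VAD", "silero").strip().lower()
    if choice not in ("rms", "silero"):
        # Assumption: an unrecognized value falls back to rms rather than failing.
        logger.warning("unknown MAC_TRANSCRIBER_VAD=%r, using rms", choice)
        return "rms"
    if choice == "silero":
        if silero_available is None:
            silero_available = importlib.util.find_spec("silero_vad") is not None
        if not silero_available:
            logger.warning("silero-vad is not installed, falling back to rms")
            return "rms"
    return choice
```

Resolving the detector once up front keeps the fallback warning to a single log line per run instead of one per track.
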
Local-fallback reports put raw transcript fragments into decisions/tasks; those leaked into meeting_facts and biased later reports into fabricating decisions (a meeting's report invented a decision not in the discussion). _upsert_facts now skips facts unless report_health.json reports a non-local generated_by. Status (ok/degraded) is not a usable signal — local reports are often marked ok.

Existing junk (213 facts, all from local reports) was purged separately with a JSON backup; the fact axis refills only from AI reports going forward.

Co-authored-by: Claude <claude@anthropic.com>
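
The fact gate above can be sketched as a small predicate. The helper name `facts_are_trusted` and the assumption that local-fallback reports carry the literal value `local` in `generated_by` are illustrative; the commit only states that facts are skipped unless `report_health.json` reports a non-local `generated_by`.

```python
import json
from pathlib import Path


def facts_are_trusted(meeting_dir):
    """True only when report_health.json names a non-local report generator."""
    health_path = Path(meeting_dir) / "report_health.json"
    try:
        health = json.loads(health_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False  # missing or unreadable health file: do not trust the facts
    if not isinstance(health, dict):
        return False
    # The ok/degraded status is deliberately ignored: local reports are often "ok".
    generated_by = str(health.get("generated_by", "")).strip().lower()
    return bool(generated_by) and generated_by != "local"
```
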
…tracts

regen_facts.py integrates clean facts (produced offline by Claude subagents reading transcripts) into memory: it rewrites report.json fact lists, stamps report_health.json with a non-local generated_by so the fact gate trusts them, and upserts via upsert_meeting_memory WITHOUT embeddings — facts are token-searchable, so no OpenAI calls are made. Used to replace the purged local-fallback junk facts (213 raw-transcript items) with 344 clean, owner-attributed facts across 18 meetings, 0% raw-transcript.

Co-authored-by: Claude <claude@anthropic.com>
…of local fallback

Previously only quota errors parked the meeting; any other AI failure (network/timeout/5xx/rate-limit/missing key) fell back to the raw local keyword report — the same junk that polluted memory. New ReportUnavailableError marks API-availability failures and propagates through every path (direct/chunked/synthesis/critic/wrapper) without being downgraded to a generic error or swallowed as a skipped chunk. build_report re-raises it (no local fallback); the service parks the meeting as blocked_on_ai (transcript kept, no report written); reprocess_blocked.py drains both blocked_on_quota and blocked_on_ai when AI is back (schedulable for automatic draining).

Co-authored-by: Claude <claude@anthropic.com>
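
A rough sketch of the park-and-drain flow described above. `ReportUnavailableError` and the `blocked_on_quota`/`blocked_on_ai` states come from the commit message; the function names, the dictionary-based meeting record, the `done` status, and the skip-on-`ValueError` rule are hypothetical stand-ins.

```python
class ReportUnavailableError(RuntimeError):
    """The AI backend cannot be reached (network, timeout, 5xx, rate limit, missing key)."""


def summarize_chunks(chunks, summarize):
    """Chunked path: availability failures propagate instead of skipping the chunk."""
    summaries = []
    for chunk in chunks:
        try:
            summaries.append(summarize(chunk))
        except ReportUnavailableError:
            raise  # never swallowed as a skipped chunk
        except ValueError:
            continue  # assumption: only a malformed chunk is skipped
    return summaries


def process_meeting(meeting, build_report):
    """Write a report, or park the meeting with its transcript kept and no report."""
    try:
        report = build_report(meeting["transcript"])
    except ReportUnavailableError:
        meeting["status"] = "blocked_on_ai"
        return meeting
    meeting["report"] = report
    meeting["status"] = "done"
    return meeting


def drain_blocked(meetings, build_report):
    """Retry every parked meeting once the AI API is reachable again."""
    for meeting in meetings:
        if meeting["status"] in ("blocked_on_quota", "blocked_on_ai"):
            process_meeting(meeting, build_report)
```
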
…ueue

Adds com.slack-zoom.gigaam-reprocess: a periodic (StartInterval 900s) LaunchAgent that runs reprocess_blocked.py to drain blocked_on_quota/blocked_on_ai meetings once the AI API is back, making the 'queue until available, then process' loop fully automatic. Carries no secrets (reads .env.local via --env-file). Wired into install_launchd.sh and documented.

Co-authored-by: Claude <claude@anthropic.com>
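
For illustration, a LaunchAgent definition like the one described above can be generated with Python's standard `plistlib`. The label, the 900-second `StartInterval`, and the `--env-file` flag come from the commit message; the function name and the path arguments are placeholders, and the real plist in the repository may carry more keys.

```python
import plistlib


def build_reprocess_agent(python_bin, script_path, env_file):
    """Return LaunchAgent plist bytes for the periodic reprocess job."""
    agent = {
        "Label": "com.slack-zoom.gigaam-reprocess",
        # No secrets are embedded here; the script reads them from the env file.
        "ProgramArguments": [python_bin, script_path, "--env-file", env_file],
        "StartInterval": 900,  # run every 15 minutes
    }
    return plistlib.dumps(agent)
```
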
Coverage data (.coverage, *,cover, htmlcov) no longer ends up in the tree.

Co-authored-by: Claude <claude@anthropic.com>
The script uploads meeting recordings (the mixed-down audio.m4a plus the separate
per-speaker tracks) to Yandex Disk through the REST API, deduplicating by sha256.
Layout: one folder per recording, named '<YYYY-MM-DD> <name>'. The OAuth token is read
from .env.local (YANDEX_DISK_OAUTH_TOKEN); no secrets are hardcoded. Transcripts
stay local; the local audio copy is deleted only after a confirmed upload.
Run manually; not wired into launchd.

Co-authored-by: Claude <claude@anthropic.com>
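
The dedup-then-delete rule from this commit can be sketched without touching the network. Here `upload` is an injected callable standing in for the Yandex Disk REST call, and `archive_file`, `remote_folder`, and the status strings are hypothetical names. Leaving duplicates on disk is an assumption; the commit only says the local copy is removed after a confirmed upload.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def remote_folder(recorded_on, name):
    """One folder per recording: '<YYYY-MM-DD> <name>'."""
    return f"{recorded_on:%Y-%m-%d} {name}"


def archive_file(path, uploaded_hashes, upload):
    """Upload unless the content hash is already known; delete only on confirmation."""
    path = Path(path)
    checksum = sha256_of(path)
    if checksum in uploaded_hashes:
        return "duplicate"  # assumption: duplicates are skipped and left on disk
    if not upload(path, checksum):
        return "failed"  # upload not confirmed, so the local copy stays
    uploaded_hashes.add(checksum)
    path.unlink()
    return "uploaded"
```
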
Compares the home-grown retriever with LlamaIndex and mem0 on a single shared context_pack
(facts/segments/embedding_chunks). Stages: retrieve (cheap, metrics) and report
(expensive, full generation plus an LLM judge). A dev tool, run from an
isolated venv; not part of production. Secrets come from the environment (OPENAI_API_KEY,
DATABASE_URL).

Co-authored-by: Claude <claude@anthropic.com>
…hardening

Replaces the paid LLM API for report generation with a scheduled headless `claude -p` run
(launchd), and hardens the blocked-meeting queue and the Yandex Disk archive.

- reporting: kill switch MAC_TRANSCRIBER_REPORT_BACKEND=claude: build_ai_report
  raises ReportUnavailableError and the service parks the meeting as blocked_on_ai
- agent_report.py: prepare/finalize seam (transcript.json -> ai_payload -> render
  with the standard renderer; coverage is automatic, citations are filtered)
- drain_reports_via_claude.py: hourly drain of the blocked/stale queue via `claude -p`.
  Cross-process flock, health gate (failed stays blocked), per-item upload
  to Yandex Disk plus self-heal (report_disk_uploaded marker), temporary-limit handling
  (no penalty, retry), attempt cap for deterministic failures, DISABLE_AUTOUPDATER
- upload_reports_to_yandex.py: uploads/renames reports on Yandex Disk
  (title-named folders, dedup, size verification)
- promote_reports.py: dedup render plus distribution of reports across duplicate meetings
- launchd plist example + wrapper + report_agent_instructions.md
- reprocess_blocked.py / archive_audio_to_yandex.py: queue and archive refinements
- tests: kill switch, agent_report round trip, drainer helpers; conftest isolates
  the production report-backend flag in .env.local from the tests

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
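
Two of the drainer's safeguards, the cross-process lock and the attempt cap, might be sketched as follows. This is an assumption-laden illustration for Unix-like systems: `single_instance`, `next_action`, the outcome strings, and the cap of 3 are hypothetical and not taken from drain_reports_via_claude.py.

```python
import fcntl
import os
from contextlib import contextmanager

MAX_ATTEMPTS = 3  # hypothetical cap for deterministic failures


@contextmanager
def single_instance(lock_path):
    """Yield True if this process holds the drain lock, False if another run does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def next_action(state, outcome):
    """Decide what to do with a meeting after one drain attempt."""
    if outcome == "rate_limited":
        return "retry"  # temporary limit: retried later without counting an attempt
    if outcome == "failed":
        state["attempts"] += 1
        return "give_up" if state["attempts"] >= MAX_ATTEMPTS else "retry"
    return "done"
```

A scheduled run would wrap its whole loop in `with single_instance(path) as held:` and exit immediately when `held` is false, so an overlapping hourly trigger cannot process the same meeting twice.
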