[codex] Add meeting memory transcriber #74
Draft
Savin99 wants to merge 13 commits into
…g-memory work
- AI report critic (MAC_TRANSCRIBER_REPORT_CRITIC_MODEL, off by default): structured edit-ops (keep/drop/merge/rewrite) for dedup, cross-section dedup and ASR/clarity rewrite of formal sections; citations preserved from source items, re-validated with rollback to the pre-critic report on failure.
- recovery: stop raw verbatim transcript utterances leaking into formal sections via recover_base_items (_is_clean_recovered_statement).
- includes pending meeting-memory-postgres pipeline work (asr/memory_db/service/scripts/tests).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add MAC_TRANSCRIBER_VAD flag to choose the per-track speech detector in build_segments: rms (default, unchanged) or silero (silero-vad neural VAD). silero fragments less and keeps in-sentence boundaries intact while capturing slightly more speech at comparable speed. Segment timestamps stay absolute (source sample position), so chronological order across tracks and the pyannote diarization path are unaffected; in multi-track mode the speaker is still the track, not the VAD. Falls back to rms with a logged warning when silero-vad is missing. Includes ab_vad_eval.py A/B harness and test_vad.py coverage. Co-authored-by: Claude <claude@anthropic.com>
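The flag-plus-fallback behaviour described in this commit could look roughly like the sketch below. It is not the code from build_segments: the function name `resolve_vad_backend`, the `has_silero` parameter, the handling of unknown values, and the assumption that the silero-vad package imports as `silero_vad` are all illustrative.

```python
import importlib.util
import logging
import os

log = logging.getLogger("mac_transcriber.vad")  # hypothetical logger name


def resolve_vad_backend(env=None, has_silero=None):
    """Pick the speech detector from MAC_TRANSCRIBER_VAD, falling back to rms.

    `env` and `has_silero` are injectable so the decision can be tested
    without touching the real environment or installing silero-vad.
    """
    env = os.environ if env is None else env
    # Default is rms as of this commit.
    requested = env.get("MAC_TRANSCRIBER_VAD", "rms").strip().lower()

    if requested not in {"rms", "silero"}:
        # Assumption: unknown values are treated like a missing dependency.
        log.warning("unknown MAC_TRANSCRIBER_VAD=%r, using rms", requested)
        return "rms"

    if requested == "silero":
        if has_silero is None:
            # Assumption: the silero-vad distribution imports as `silero_vad`.
            has_silero = importlib.util.find_spec("silero_vad") is not None
        if not has_silero:
            log.warning("silero-vad is not installed, falling back to rms")
            return "rms"

    return requested
```

Resolving the backend once, before any track is segmented, keeps the warning to a single log line per run rather than one per track.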
Flip MAC_TRANSCRIBER_VAD default from rms to silero so all new recordings use the neural detector; rms stays as the opt-out fallback. Validated across all multi-track meetings: cross-track speech overlap is the same or lower than rms on all but one recording (no systematic ghost turns), and silero captures >= speech on 23/33 tracks while fragmenting less and keeping in-sentence boundaries intact. Co-authored-by: Claude <claude@anthropic.com>
Local-fallback reports put raw transcript fragments into decisions/tasks; those leaked into meeting_facts and biased later reports into fabricating decisions (a meeting's report invented a decision not in the discussion). _upsert_facts now skips facts unless report_health.json reports a non-local generated_by. Status (ok/degraded) is not a usable signal — local reports are often marked ok. Existing junk (213 facts, all from local reports) was purged separately with a JSON backup; the fact axis refills only from AI reports going forward. Co-authored-by: Claude <claude@anthropic.com>
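A minimal sketch of the gate this commit describes, assuming report_health.json is a JSON object and that local-fallback reports carry the literal value "local" in generated_by. Both helper names (`facts_are_trusted`, `select_facts`) are hypothetical and are not the body of _upsert_facts.

```python
import json
from pathlib import Path


def facts_are_trusted(meeting_dir):
    """Return True only when report_health.json names a non-local generator.

    The `status` field is deliberately ignored: per the commit message,
    local reports are often marked ok, so it is not a usable signal.
    """
    health_path = Path(meeting_dir) / "report_health.json"
    try:
        health = json.loads(health_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unreadable health file: do not trust the facts.
        return False
    if not isinstance(health, dict):
        return False
    generated_by = str(health.get("generated_by") or "").strip().lower()
    # Assumption: "local" is the marker written by the local fallback.
    return bool(generated_by) and generated_by != "local"


def select_facts(meeting_dir, facts):
    """Return the facts to upsert, or an empty list when the gate is closed."""
    return list(facts) if facts_are_trusted(meeting_dir) else []
```

Failing closed on a missing or malformed health file means an interrupted report run cannot reintroduce unvetted facts.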
…tracts
regen_facts.py integrates clean facts (produced offline by Claude subagents reading transcripts) into memory: it rewrites report.json fact lists, stamps report_health.json with a non-local generated_by so the fact gate trusts them, and upserts via upsert_meeting_memory WITHOUT embeddings. Facts are token-searchable, so no OpenAI calls are made. Used to replace the purged local-fallback junk facts (213 raw-transcript items) with 344 clean, owner-attributed facts across 18 meetings, 0% raw-transcript.
Co-authored-by: Claude <claude@anthropic.com>
…of local fallback
Previously only quota errors parked the meeting; any other AI failure (network/timeout/5xx/rate-limit/missing key) fell back to the raw local keyword report, the same junk that polluted memory. New ReportUnavailableError marks API-availability failures and propagates through every path (direct/chunked/synthesis/critic/wrapper) without being downgraded to a generic error or swallowed as a skipped chunk. build_report re-raises it (no local fallback); the service parks the meeting as blocked_on_ai (transcript kept, no report written); reprocess_blocked.py drains both blocked_on_quota and blocked_on_ai when AI is back (schedulable for automatic draining).
Co-authored-by: Claude <claude@anthropic.com>
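The park-instead-of-fallback flow can be illustrated with a small sketch. Only the names ReportUnavailableError and blocked_on_ai come from the commit message; `call_ai`, `process_meeting`, the dict-shaped meeting record, the "done" status, and the choice of which exceptions count as availability failures are assumptions made for the example.

```python
class ReportUnavailableError(RuntimeError):
    """The AI API could not be reached; the meeting should be parked, not degraded."""


def call_ai(generate, transcript):
    """Run the AI generator, tagging availability failures explicitly.

    Assumption: connection and timeout errors stand in for the whole
    network/timeout/5xx/rate-limit/missing-key family.
    """
    try:
        return generate(transcript)
    except (ConnectionError, TimeoutError) as exc:
        raise ReportUnavailableError(str(exc)) from exc


def process_meeting(meeting, generate):
    """Write a report on success; park the meeting when the AI is unavailable."""
    try:
        meeting["report"] = call_ai(generate, meeting["transcript"])
        meeting["status"] = "done"  # hypothetical success status
    except ReportUnavailableError as exc:
        # Transcript is kept, no report is written, nothing falls back to local.
        meeting["status"] = "blocked_on_ai"
        meeting["blocked_reason"] = str(exc)
    return meeting
```

Any other exception type passes through both functions unchanged, so a genuine bug in report generation is not mistaken for an outage and parked indefinitely.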
…ueue
Adds com.slack-zoom.gigaam-reprocess: a periodic (StartInterval 900s) LaunchAgent that runs reprocess_blocked.py to drain blocked_on_quota/blocked_on_ai meetings once the AI API is back, making the 'queue until available, then process' loop fully automatic. Carries no secrets (reads .env.local via --env-file). Wired into install_launchd.sh and documented.
Co-authored-by: Claude <claude@anthropic.com>
Coverage data (.coverage, *,cover, htmlcov) no longer ends up in the tree. Co-authored-by: Claude <claude@anthropic.com>
The script uploads meeting recordings (the mixed-down audio.m4a plus the separate per-speaker tracks) to Yandex.Disk via the REST API, deduplicating by sha256. Layout: one folder per recording, '<YYYY-MM-DD> <name>'. The OAuth token is read from .env.local (YANDEX_DISK_OAUTH_TOKEN); no secrets are hardcoded. Transcripts stay local; the local audio copy is deleted only after a confirmed upload. Run manually; not wired into launchd. Co-authored-by: Claude <claude@anthropic.com>
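The sha256 deduplication step can be sketched independently of the Yandex.Disk API. The helper names `sha256_of` and `plan_uploads` are hypothetical, and how the set of already-uploaded hashes is obtained is left out of scope here.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large audio files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_uploads(paths, remote_hashes):
    """Return the files whose content is not yet present remotely.

    `remote_hashes` is an iterable of hex digests already on the remote side.
    Files with identical content within the same batch are uploaded once.
    """
    seen = set(remote_hashes)
    to_upload = []
    for path in paths:
        file_hash = sha256_of(path)
        if file_hash in seen:
            continue
        seen.add(file_hash)
        to_upload.append(Path(path))
    return to_upload
```

Deleting the local copy would happen only after the upload of a planned file is confirmed, which this sketch does not attempt to model.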
Compares the homegrown retriever against LlamaIndex and mem0 on a single shared context_pack (facts/segments/embedding_chunks). Stages: retrieve (cheap, metrics) and report (expensive, full generation + LLM judge). A dev tool, run from an isolated venv; not part of production. Secrets come from env (OPENAI_API_KEY, DATABASE_URL). Co-authored-by: Claude <claude@anthropic.com>
…hardening
Replaces the paid LLM API for report generation with a scheduled headless `claude -p` run (launchd), plus reliability work on the blocked-meeting queue and the Yandex.Disk archive.
- reporting: kill-switch MAC_TRANSCRIBER_REPORT_BACKEND=claude: build_ai_report raises ReportUnavailableError and the service parks the meeting in blocked_on_ai
- agent_report.py: prepare/finalize seam (transcript.json -> ai_payload -> rendered by the standard renderer; coverage computed automatically, citations filtered)
- drain_reports_via_claude.py: hourly drain of the blocked/stale queue via `claude -p`. Cross-process flock, health gate (failed stays blocked), per-item upload to Yandex.Disk + self-heal (report_disk_uploaded marker), handling of temporary limits (no penalty, retry), attempt cap for deterministic failures, DISABLE_AUTOUPDATER
- upload_reports_to_yandex.py: upload/rename of reports on Yandex.Disk (title-named folders, dedup, size verification)
- promote_reports.py: dedup render + distribution of reports across duplicate meetings
- launchd plist example + wrapper + report_agent_instructions.md
- reprocess_blocked.py / archive_audio_to_yandex.py: queue and archive improvements
- tests: kill-switch, agent_report round-trip, drainer helpers; conftest isolates tests from the production report-backend flag in .env.local
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
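A cross-process flock guard of the kind listed for drain_reports_via_claude.py might look like the sketch below. The context-manager name `single_instance` and the lock file location are assumptions; `fcntl` is Unix-only, which matches the macOS/launchd setting of this PR.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this process holds the lock, False if another run has it.

    Uses a non-blocking exclusive flock, so an overlapping scheduled run
    can exit immediately instead of queueing behind the current one.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Another open file description already holds the lock.
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A drain script would wrap its main loop in `with single_instance(path) as acquired:` and return early when `acquired` is False. Because the kernel drops the lock when the process exits, a crashed run cannot leave the queue permanently locked.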
What changed
mac_transcriber pipeline for GigaAM transcription, meeting report generation, Zoom backfill/import helpers, launchd examples, and local web UI assets.
Validation
git diff --cached --check
.venv/bin/python -m pytest mac_transcriber/tests (83 passed)
Notes
Real secrets and local transcript artifacts were left out of git; examples use placeholders only.