feat(moderation): moderate transcript text for video assets by claude[bot] · Pull Request #208 · muxinc/ai

claude · 2026-06-29T08:35:36Z

Requested by Victor Boutté, Phil Cluff · Slack thread

Summary

Video assets previously had only their storyboard thumbnails moderated; the caption transcript was never checked. This PR adds an opt-in includeTranscript option so that a video asset's caption transcript is also moderated via OpenAI text moderation, returns transcript results in a dedicated transcriptScores array, and splits the transcript into dynamic, overlapping time windows whose size scales with the asset's duration.

Before

For video assets, getModerationScores only moderated storyboard thumbnails — the caption transcript was never checked. (Audio-only assets already moderate transcript text; the two paths were mutually exclusive.) Audio-only transcript results were returned in thumbnailScores using a synthetic transcript:-prefixed url.

After

A new opt-in includeTranscript?: boolean option (default false) makes a video asset also moderate its caption transcript text alongside thumbnails, and transcript results now live in a dedicated transcriptScores array instead of being folded into thumbnailScores. Each transcript score covers one time window and carries startTime/endTime timecodes, analogous to how a flagged thumbnail carries its time, so consumers can locate flagged speech on the timeline. Behavior and limits:

- Provider: only supported with provider: 'openai' (uses OpenAI text moderation). Throws if set with hive / google-vision-api.
- Skipped silently when no ready text track / no parseable caption cues exist. Transcription is never triggered.
- No effect on audio-only assets, which always moderate transcript text regardless.
- Surfaced categories: sexual + violence only, thresholds 0.8 / 0.8 (unchanged DEFAULT_THRESHOLDS).
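
As an illustration of how a consumer might use the windowed scores, here is a minimal sketch that picks the flagged windows out of a transcriptScores array. The interface is a local copy of the shape described in this PR, flaggedWindows is a hypothetical helper that is not part of the library, and the strict greater-than comparison against the 0.8 thresholds is an assumption.

```typescript
// Local, hypothetical copy of the shape described in this PR; the real type
// is defined in src/workflows/moderation.ts.
interface TranscriptModerationScore {
  startTime: number; // seconds
  endTime: number; // seconds
  sexual: number;
  violence: number;
  error: boolean;
  errorMessage?: string;
}

interface Thresholds {
  sexual: number;
  violence: number;
}

// Same values as the DEFAULT_THRESHOLDS quoted in this PR.
const SKETCH_THRESHOLDS: Thresholds = { sexual: 0.8, violence: 0.8 };

// Hypothetical helper: returns the windows whose score exceeds either
// threshold, skipping windows that failed to moderate (error: true).
// Assumption: "exceeds" means strictly greater than the threshold.
function flaggedWindows(
  scores: TranscriptModerationScore[],
  thresholds: Thresholds = SKETCH_THRESHOLDS,
): TranscriptModerationScore[] {
  return scores.filter(
    (s) =>
      !s.error &&
      (s.sexual > thresholds.sexual || s.violence > thresholds.violence),
  );
}

// Made-up sample data: three overlapping windows, one flagged, one failed.
const sampleScores: TranscriptModerationScore[] = [
  { startTime: 0, endTime: 20, sexual: 0.01, violence: 0.02, error: false },
  { startTime: 15, endTime: 35, sexual: 0.95, violence: 0.1, error: false },
  {
    startTime: 30,
    endTime: 50,
    sexual: 0,
    violence: 0,
    error: true,
    errorMessage: "request failed",
  },
];

for (const w of flaggedWindows(sampleScores)) {
  console.log(`flagged speech between ${w.startTime}s and ${w.endTime}s`);
}
```

Because consecutive windows overlap, the same stretch of speech can be flagged in two neighbouring windows; a consumer that wants a single range per incident may prefer to merge overlapping flagged ranges.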

How

src/workflows/moderation.ts
- TranscriptModerationScore is time-windowed: { startTime, endTime, sexual, violence, error, errorMessage? } (timecodes in seconds; error fields mirror ThumbnailModerationScore). There is no chunkIndex. ModerationResult.transcriptScores: TranscriptModerationScore[] is one entry per time window (empty [] when nothing was moderated). thumbnailScores holds image entries only (always a real url + time).
- Dynamic, overlapping windowing (replaces the fixed char-budget chunking): transcript fetch pulls the raw VTT (cleanTranscript: false) and parses per-cue timecodes via parseVTTCues. A new exported pure helper buildTranscriptWindows(cues, duration, params) builds overlapping time windows whose size scales with the asset's duration:
  - windowSeconds = clamp(duration / targetWindowCount, minWindowSeconds, maxWindowSeconds) (defaults clamp(duration / 40, 20, 120))
  - overlapSeconds = max(minOverlapSeconds, windowSeconds * overlapFraction) (defaults max(5, windowSeconds * 0.15))
  - stride = max(windowSeconds - overlapSeconds, 1)
  Window k covers [k*stride, k*stride + windowSeconds]; a cue belongs to window k when it intersects that interval (cue.startTime < end && cue.endTime > start), so cues are atomic and boundary-straddling cues appear in both neighbouring windows, keeping abuse across a boundary scored intact. Asset duration comes from the duration already computed in getModerationScores (min of getVideoTrackDurationSecondsFromAsset / getAssetDurationSecondsFromAsset), falling back to the last cue's endTime when missing/0. Empty windows (silence) are skipped, and two consecutive windows with the exact same cue set are deduped to avoid a redundant request. A rare safety guard splits any single window whose joined text would exceed the TRANSCRIPT_WINDOW_MAX_UTF16_CODE_UNITS (10k) ceiling into sub-windows under the cap (cues stay atomic, each carrying its own cue span). Consecutive windows' reported [startTime, endTime] ranges may overlap by ~overlapSeconds by design.
- Array-batched OpenAI requests (replaces one-request-per-window): requestOpenAITranscriptModeration packs windows into batches (capped at TRANSCRIPT_BATCH_MAX_UTF16_CODE_UNITS = 100,000 combined chars and TRANSCRIPT_BATCH_MAX_ITEMS = 100 items per request) and sends each batch as a single /v1/moderations POST with an array input; results[i] is mapped back to window i. callOpenAIModerationApi now accepts string[] input. Fallback: if a batch is rejected with a 400 (too large) and holds more than one window, it is split in half and retried recursively down to a single window. The existing 429/5xx retry/backoff inside callOpenAIModerationApi is reused; a window that still fails emits its startTime/endTime with sexual: 0, violence: 0, error: true, errorMessage. Concurrency (processConcurrently / maxConcurrent) now runs across batches.
- Tunable params: a new optional transcriptWindowing?: { targetWindowCount?, minWindowSeconds?, maxWindowSeconds?, overlapFraction?, minOverlapSeconds? } on ModerationOptions overrides the module-level DEFAULT_TRANSCRIPT_WINDOWING (defaults targetWindowCount 40 / minWindowSeconds 20 / maxWindowSeconds 120 / overlapFraction 0.15 / minOverlapSeconds 5), threaded through getModerationScores into the windowing helper.
- Audio-only path: windowed transcript results go into transcriptScores; thumbnailScores is []; mode === "transcript".
- Video includeTranscript path: windowed transcript results go into transcriptScores alongside populated thumbnailScores (empty [] if no caption track / no cues). Provider guard (openai-only) preserved.
- maxScores / exceedsThreshold take the max of sexual / violence across both thumbnailScores and all transcriptScores windows; the all-failed guard considers both arrays. Same DEFAULT_THRESHOLDS ({ sexual: 0.8, violence: 0.8 }).
- Coverage stays thumbnail-only. Audio-only results (zero thumbnails) stay confident — isLowConfidence is driven by transcript-window success when there are no thumbnails.
- mode is "thumbnails" | "transcript" | "combined": "transcript" for audio-only, "combined" when both arrays are non-empty, "thumbnails" otherwise.
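
The windowing rules in the notes above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code in this PR: sketchTranscriptWindows and windowGeometry are hypothetical names, each window reports its nominal [k*stride, k*stride + windowSeconds] interval (the real helper may report something else, such as the cue span), and the 10k code-unit safety split is omitted.

```typescript
// Hypothetical shapes for illustration; field names follow the PR text.
interface Cue {
  startTime: number; // seconds
  endTime: number; // seconds
  text: string;
}

interface WindowingParams {
  targetWindowCount: number;
  minWindowSeconds: number;
  maxWindowSeconds: number;
  overlapFraction: number;
  minOverlapSeconds: number;
}

interface TranscriptWindow {
  startTime: number;
  endTime: number;
  text: string;
}

// Defaults as listed for DEFAULT_TRANSCRIPT_WINDOWING in this PR.
const SKETCH_WINDOWING: WindowingParams = {
  targetWindowCount: 40,
  minWindowSeconds: 20,
  maxWindowSeconds: 120,
  overlapFraction: 0.15,
  minOverlapSeconds: 5,
};

function clamp(value: number, lo: number, hi: number): number {
  return Math.min(Math.max(value, lo), hi);
}

// Window size, overlap, and stride, following the three formulas above.
function windowGeometry(duration: number, p: WindowingParams = SKETCH_WINDOWING) {
  const windowSeconds = clamp(
    duration / p.targetWindowCount,
    p.minWindowSeconds,
    p.maxWindowSeconds,
  );
  const overlapSeconds = Math.max(p.minOverlapSeconds, windowSeconds * p.overlapFraction);
  const stride = Math.max(windowSeconds - overlapSeconds, 1);
  return { windowSeconds, overlapSeconds, stride };
}

function sketchTranscriptWindows(
  cues: Cue[],
  duration: number,
  p: WindowingParams = SKETCH_WINDOWING,
): TranscriptWindow[] {
  // Fall back to the last cue's endTime when the duration is missing or 0.
  const total = duration > 0 ? duration : Math.max(0, ...cues.map((c) => c.endTime));
  const { windowSeconds, stride } = windowGeometry(total, p);
  const windows: TranscriptWindow[] = [];
  let previousKey = "";
  for (let k = 0; k * stride < total; k++) {
    const start = k * stride;
    const end = start + windowSeconds;
    // Cues are atomic: a cue joins every window it intersects.
    const members = cues.filter((c) => c.startTime < end && c.endTime > start);
    if (members.length === 0) continue; // silence: skip empty windows
    const key = members.map((c) => `${c.startTime}-${c.endTime}`).join("|");
    if (key === previousKey) continue; // same cue set as the previous window
    previousKey = key;
    windows.push({
      startTime: start,
      endTime: end,
      text: members.map((c) => c.text).join(" "),
    });
  }
  return windows;
}
```

With the defaults, a 600 s asset gets 20 s windows with a 5 s overlap (stride 15 s), while a 3600 s asset gets 90 s windows with a 13.5 s overlap (stride 76.5 s). A cue that spans 14 s to 16 s on the short asset lands in both the first and second windows.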
docs/API.md — documented the dynamic, overlapping, duration-scaled windowing (clamp(duration / 40, 20s, 120s), ~15% / min-5s overlap), the array-batched requests with split-and-retry on 400, the transcriptWindowing tuning option and its defaults, and noted that transcriptScores entries may have slightly overlapping time ranges by design. Removed the stale fixed char-only chunking text.
tests/unit/moderation-coverage.test.ts — mock-based unit tests (mock the .vtt fetch + OpenAI fetch): (a) video + includeTranscript: true produces transcriptScores carrying startTime/endTime, no chunkIndex, thumbnailScores images-only, high transcript sexual (0.95) raises maxScores/exceedsThreshold, mode === "combined"; (b) overlapping windows — at least two windows whose consecutive time ranges overlap; (c) dynamic sizing via a direct buildTranscriptWindows unit test (a long duration yields fewer/larger windows than a short duration for the same cue density); (d) a boundary cue appears in two consecutive windows; (e) array batching — multiple windows packed into one request's input array; (f) a forced 400 triggers split-and-retry producing per-window results; (g) no ready text track produces empty transcriptScores, mode === "thumbnails"; (h) non-openai provider throws; (i) audio-only produces windowed transcriptScores, empty thumbnailScores, mode === "transcript", not low-confidence.
tests/integration/moderation.test.ts — assertions remain shape-compatible (startTime/endTime, no chunkIndex); network behavior unchanged.
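
The array batching and the split-and-retry fallback described above can be sketched as below. This is a simplified illustration, not the PR's requestOpenAITranscriptModeration: packBatches, moderateWithSplit, and isBadRequest are hypothetical names, the network call is injected as a send function so the sketch stays self-contained, a 400 is modelled as a thrown object with status: 400, and marking every window of a batch as failed on any other error is an assumption.

```typescript
// Caps as quoted in this PR.
const SKETCH_BATCH_MAX_UTF16_CODE_UNITS = 100_000;
const SKETCH_BATCH_MAX_ITEMS = 100;

// Greedily packs window texts into batches that respect both caps.
// A single text above the unit cap still gets a batch of its own.
function packBatches(
  texts: string[],
  maxUnits: number = SKETCH_BATCH_MAX_UTF16_CODE_UNITS,
  maxItems: number = SKETCH_BATCH_MAX_ITEMS,
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let units = 0;
  for (const text of texts) {
    const full =
      current.length > 0 &&
      (current.length >= maxItems || units + text.length > maxUnits);
    if (full) {
      batches.push(current);
      current = [];
      units = 0;
    }
    current.push(text);
    units += text.length; // String.length counts UTF-16 code units
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Assumption: the injected sender rejects with an object carrying the HTTP status.
function isBadRequest(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: number }).status === 400
  );
}

type Send<T> = (inputs: string[]) => Promise<T[]>;

// Sends one batch; on a 400 with more than one input, splits the batch in
// half and retries each half recursively. Results stay index-aligned with
// the inputs. Any other failure maps every input to onFailure's result.
async function moderateWithSplit<T>(
  inputs: string[],
  send: Send<T>,
  onFailure: (input: string, err: unknown) => T,
): Promise<T[]> {
  try {
    return await send(inputs);
  } catch (err) {
    if (isBadRequest(err) && inputs.length > 1) {
      const mid = Math.ceil(inputs.length / 2);
      const left = await moderateWithSplit(inputs.slice(0, mid), send, onFailure);
      const right = await moderateWithSplit(inputs.slice(mid), send, onFailure);
      return [...left, ...right];
    }
    return inputs.map((input) => onFailure(input, err));
  }
}
```

Injecting send keeps the split logic testable without network access, which mirrors the mock-based approach of the unit tests listed above. In the PR itself, retry and backoff for 429/5xx responses happen inside callOpenAIModerationApi and are not reproduced here.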

Breaking change

This is a breaking change to ModerationResult (pre-1.0, acceptable): audio-only transcript scores moved out of thumbnailScores into the new transcriptScores array, the synthetic transcript: URL convention is removed, and transcript scores are now time-windowed (startTime/endTime) rather than chunk-indexed. Approved by Phil.

Note for reviewers

There is an in-flight branch pc/variable-granularity-for-moderation also touching moderation. This PR is intentionally based on main, not that branch — please rebase intentionally to resolve any overlap.

Add an opt-in includeTranscript option to getModerationScores so that video assets can have their caption transcript moderated alongside storyboard thumbnails. Transcript moderation runs only with provider 'openai', is skipped silently when no ready text track exists (never triggering transcription), and folds transcript scores into thumbnailScores with a synthetic transcript: URL prefix. Thumbnail coverage excludes those transcript entries from its denominator.

snyk-io · 2026-06-29T08:35:54Z

✅ Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| ✅ | Licenses | 0 | 0 | 0 | 0 | 0 issues |


…ores field BREAKING CHANGE: transcript moderation results no longer appear in `thumbnailScores`. They now live in a dedicated `transcriptScores` array (`TranscriptModerationScore[]`), keyed by `chunkIndex` instead of a synthetic `transcript:` URL. `thumbnailScores` holds image entries only. Adds a `combined` value to `mode` for video assets moderated with `includeTranscript`. maxScores/exceedsThreshold aggregate across both arrays; thumbnail coverage is naturally image-only and audio-only results stay confident with zero thumbnails.

…codes Replace the arbitrary ~10k-char chunk model with time windows aligned to caption cues. Each transcript score now carries startTime/endTime (like a thumbnail's time) instead of a chunkIndex, so consumers can locate flagged speech on the timeline. Consecutive cues are grouped into contiguous, non-overlapping windows that stay under the OpenAI moderation input budget; a single over-budget cue is emitted as its own window covering its full time range. maxScores/exceedsThreshold and the all-failed guard aggregate across thumbnails plus all transcript windows.

… moderation Replace the fixed char-budget transcript windowing with dynamic, overlapping time windows whose size scales with the asset's duration (clamp(duration / 40, 20s, 120s)) and overlap (~15%, min 5s) so abuse straddling a window boundary is still scored intact. Windows are sent to OpenAI as array-batched /v1/moderations requests, with results mapped back index-aligned; an oversized batch (400) is split in half and retried down to a single window, reusing existing 429/5xx retry/backoff and per-batch concurrency. Add a tunable transcriptWindowing option and export a pure buildTranscriptWindows helper for direct unit testing.

claude added 3 commits June 29, 2026 09:32

claude[bot] wants to merge 4 commits into main from feat/moderate-transcript-for-video
