feat(diarizer): expose per-chunk embeddings on DiarizationResult by adamsro · Pull Request #633 · FluidInference/FluidAudio

adamsro · 2026-05-23T18:25:32Z

Why is this change needed?

Surfaces the per-chunk speaker embeddings the offline pipeline already computes internally (currently only reachable via --export-embeddings JSON dump in the CLI), so consumers can implement chunk-granularity post-processing without re-running the embedding model.

The motivating use case is downstream cluster-purity correction: the clustering step occasionally lands a fraction of one speaker's chunks in another speaker's cluster (mega-cluster contamination on long files, bidirectional smearing between similar voices). With per-chunk embeddings plus their cluster assignments exposed on DiarizationResult, downstream code can compute per-cluster centroids and migrate any chunk whose cosine similarity to another cluster's centroid clearly exceeds its similarity to its own cluster's centroid. This is pure NumPy/BLAS-equivalent post-processing with no extra model calls.
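To make the centroid-migration idea concrete, here is a minimal sketch of how a consumer might do it. It is not the downstream consumer's actual code: the `Chunk` type is a hypothetical stand-in that models only two of the fields described for ChunkEmbedding, and the `margin` threshold and the single-pass strategy are assumptions chosen for illustration.

```swift
// Hypothetical stand-in for the public ChunkEmbedding; only the two
// fields this sketch needs are modeled.
struct Chunk {
    var speakerId: String
    let embedding: [Float]
}

/// Cosine similarity of two equal-length vectors (0 if either has zero norm).
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have equal length")
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    let denom = (na * nb).squareRoot()
    return denom > 0 ? dot / denom : 0
}

/// Mean embedding per speaker ID.
func centroids(of chunks: [Chunk]) -> [String: [Float]] {
    var sums: [String: [Float]] = [:]
    var counts: [String: Int] = [:]
    for chunk in chunks {
        if var sum = sums[chunk.speakerId] {
            for i in 0..<sum.count { sum[i] += chunk.embedding[i] }
            sums[chunk.speakerId] = sum
        } else {
            sums[chunk.speakerId] = chunk.embedding
        }
        counts[chunk.speakerId, default: 0] += 1
    }
    var result: [String: [Float]] = [:]
    for (id, sum) in sums {
        let n = Float(counts[id] ?? 1)
        result[id] = sum.map { $0 / n }
    }
    return result
}

/// Single pass: move a chunk when another centroid beats its own by more
/// than `margin`. The default margin is illustrative, not from the PR.
func migrate(_ chunks: [Chunk], margin: Float = 0.1) -> [Chunk] {
    let cents = centroids(of: chunks)
    return chunks.map { chunk -> Chunk in
        guard let own = cents[chunk.speakerId] else { return chunk }
        let ownScore = cosine(chunk.embedding, own)
        var bestId = chunk.speakerId
        var bestScore = ownScore
        for (id, centroid) in cents where id != chunk.speakerId {
            let score = cosine(chunk.embedding, centroid)
            if score > bestScore {
                bestId = id
                bestScore = score
            }
        }
        var out = chunk
        if bestScore - ownScore > margin { out.speakerId = bestId }
        return out
    }
}

// Toy 2-D data: the third chunk is labeled S1 but points along S2's direction.
let demo = [
    Chunk(speakerId: "S1", embedding: [1, 0]),
    Chunk(speakerId: "S1", embedding: [0.9, 0.1]),
    Chunk(speakerId: "S1", embedding: [0, 1]),
    Chunk(speakerId: "S2", embedding: [0, 1]),
    Chunk(speakerId: "S2", embedding: [0.1, 0.9]),
]
let corrected = migrate(demo)
print(corrected.map { $0.speakerId })
```

Centroids are computed once from the original assignments, so a contaminated chunk still pulls on its original cluster's centroid. An iterative variant that recomputes centroids after each pass is a natural extension, but whether it helps is not something this PR states.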

Validated in production at a downstream consumer across a 58-file canonical diarization corpus: +0.225pp aggregate post-LLM SAA, no per-corpus regression, ~70–557ms compute on M-series silicon (well under 1% of total pipeline time vs ASR/snap/LLM stages).

What changed

New public ChunkEmbedding struct (Sendable, Codable) carrying speakerId, chunkIndex, speakerIndex, startTimeSeconds, endTimeSeconds, embedding256, and rho128. Speaker IDs follow the same "S\(cluster + 1)" convention as TimedSpeakerSegment.speakerId so chunk embeddings align to segments by string equality. rho128 is non-optional and empty when no PLDA model is loaded, matching the internal TimedEmbedding.rho128 shape.
New optional chunkEmbeddings: [ChunkEmbedding]? field on DiarizationResult (defaults to nil, populated only when opted in).
New OfflineDiarizerConfig.exposeChunkEmbeddings: Bool flag (defaults to false). When enabled, OfflineDiarizerManager.process(...) maps the internal [TimedEmbedding] + assignments to the public [ChunkEmbedding] array via buildPublicChunkEmbeddings(...).
9 new unit tests covering default values, opt-in behavior, initializer round-trip, the cluster-int → "S\(N)" mapping, length-mismatch handling (returns empty + warns), empty input, and Codable round-tripping.
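The shape and mapping described above can be sketched as follows. The field names are taken from this description, but the field types, the `InternalEmbedding` stand-in for TimedEmbedding, and the helper's signature are all assumptions; the real buildPublicChunkEmbeddings(...) may differ.

```swift
import Foundation

// Field names follow the PR description; the types are assumed.
struct ChunkEmbedding: Sendable, Codable, Equatable {
    let speakerId: String
    let chunkIndex: Int
    let speakerIndex: Int
    let startTimeSeconds: Double
    let endTimeSeconds: Double
    let embedding256: [Float]
    let rho128: [Float]  // empty when no PLDA model is loaded
}

// Hypothetical stand-in for the internal TimedEmbedding type.
struct InternalEmbedding {
    let chunkIndex: Int
    let speakerIndex: Int
    let start: Double
    let end: Double
    let embedding256: [Float]
    let rho128: [Float]
}

/// Maps internal embeddings plus cluster assignments to the public shape.
/// On a length mismatch it warns and returns an empty array, matching the
/// behavior the unit tests in this PR are described as covering.
func buildChunkEmbeddings(
    _ embeddings: [InternalEmbedding],
    assignments: [Int]
) -> [ChunkEmbedding] {
    guard embeddings.count == assignments.count else {
        let message = "warning: \(embeddings.count) embeddings vs \(assignments.count) assignments\n"
        FileHandle.standardError.write(Data(message.utf8))
        return []
    }
    return zip(embeddings, assignments).map { item, cluster in
        ChunkEmbedding(
            speakerId: "S\(cluster + 1)",  // same convention as TimedSpeakerSegment.speakerId
            chunkIndex: item.chunkIndex,
            speakerIndex: item.speakerIndex,
            startTimeSeconds: item.start,
            endTimeSeconds: item.end,
            embedding256: item.embedding256,
            rho128: item.rho128
        )
    }
}

// Two toy chunks with short vectors standing in for the real payloads.
let internalDemo = [
    InternalEmbedding(chunkIndex: 0, speakerIndex: 0, start: 0, end: 10,
                      embedding256: [0.5, 0.25], rho128: []),
    InternalEmbedding(chunkIndex: 1, speakerIndex: 1, start: 2, end: 12,
                      embedding256: [0.75, 0.5], rho128: []),
]
let publicDemo = buildChunkEmbeddings(internalDemo, assignments: [0, 2])
print(publicDemo.map { $0.speakerId })
```

Because the speaker ID is a plain string, a consumer could pair chunks with segments by comparing it against a segment's speakerId directly, as the description suggests.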

Backwards compatibility & performance

Fully opt-in. With exposeChunkEmbeddings = false (the default), the new code path is one boolean check that lands in the nil branch — no extra allocation, no extra compute, no memory cost.
DiarizationResult.init(...) adds chunkEmbeddings: between speakerDatabase: and timings: with default nil. All existing callers in this repo use named arguments (DiarizerManager.swift:220, with one positional-segments-only call at :223), so no call site changes.
OfflineDiarizerConfig.init(...) adds exposeChunkEmbeddings: with default false. The community/community-1 presets and existing init overloads keep their behavior.
When opted in: ~1–2 MB per hour of audio for the embedding + PLDA payload. Helper is O(n) over already-computed data — no model calls, no audio access, ~1 ms even on 90-min files.
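As a rough sanity check on the memory figure, the payload per chunk can be estimated from the vector sizes alone. This sketch assumes Float32 storage and the 256/128 dimensions implied by the field names embedding256 and rho128; the chunk count used below is hypothetical, since the PR does not state how many chunks an hour of audio produces.

```swift
/// Approximate bytes held by the embedding vectors, ignoring the small
/// fixed-size fields (IDs, indices, timestamps) and array overhead.
func payloadBytes(chunkCount: Int, hasPLDA: Bool) -> Int {
    let floatsPerChunk = 256 + (hasPLDA ? 128 : 0)
    return chunkCount * floatsPerChunk * MemoryLayout<Float>.size
}

// Hypothetical chunk count, chosen only to illustrate the arithmetic.
let withPLDA = payloadBytes(chunkCount: 1_000, hasPLDA: true)
let withoutPLDA = payloadBytes(chunkCount: 1_000, hasPLDA: false)
print(withPLDA, withoutPLDA)  // 1536000 1024000
```

Under these assumptions each chunk carries about 1.5 KB of vector data with PLDA loaded, so the quoted 1–2 MB per hour would correspond to chunk counts on the order of a thousand per hour.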

Tests & lint

$ swift test
Executed 1352 tests, with 24 tests skipped and 0 failures (0 unexpected)

$ swift format lint --recursive --configuration .swift-format Sources/ Tests/
# (only pre-existing warnings on Fa/Fb in OfflineDiarizerConfig and
#  Sortformer/CLI files; none introduced by this change)

Out of scope

The downstream consumer that motivated this API lives outside this repo and is not part of this PR.

Generated by Claude Code

Add EmbeddingSkipStrategy to OfflineDiarizerConfig that skips redundant speaker embedding model calls when consecutive segmentation windows have highly similar speaker masks. At the default config (stepRatio=0.20) this has minimal effect. At higher-overlap configs (e.g., stepRatio=0.15) it provides 1.4-2.3x embedding speedup with zero quality loss.

Add an opt-in `exposeChunkEmbeddings` flag on OfflineDiarizerConfig that surfaces per-chunk speaker embeddings (already computed internally) on DiarizationResult.chunkEmbeddings. Enables chunk-granularity post-processing — e.g. cluster-purity correction via centroid migration — without re-running the embedding model. Default off; no behavior change for existing callers.

adamsro added 2 commits April 3, 2026 16:32

adamsro wants to merge 2 commits into FluidInference:main from MimicScribe:feat/expose-chunk-embeddings
