feat(diarizer): expose per-chunk embeddings on DiarizationResult#633

Open
adamsro wants to merge 2 commits into FluidInference:main from MimicScribe:feat/expose-chunk-embeddings

@adamsro commented May 23, 2026

Why is this change needed?

Surfaces the per-chunk speaker embeddings the offline pipeline already computes internally (currently only reachable via --export-embeddings JSON dump in the CLI), so consumers can implement chunk-granularity post-processing without re-running the embedding model.

The motivating use case is downstream cluster-purity correction: the clustering step occasionally lands a fraction of one speaker's chunks in another speaker's cluster (mega-cluster contamination on long files, bidirectional smearing between similar voices). With per-chunk embeddings plus their cluster assignments exposed on DiarizationResult, downstream code can compute per-cluster centroids and migrate any chunk whose cosine similarity to another cluster's centroid exceeds its similarity to its own cluster's centroid. This is pure NumPy/BLAS-equivalent post-processing with no extra model calls.
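As a rough illustration of the centroid-migration idea described above, here is a minimal single-pass sketch. It is not the downstream consumer's implementation: the `Chunk` type, the `migrate` function, and the `margin` threshold are all hypothetical names and choices made for this example, and a real consumer would read `speakerId` and `embedding256` from the exposed `ChunkEmbedding` values instead.

```swift
// Hypothetical stand-in for the exposed chunk data; only the fields used here.
struct Chunk {
    var speakerId: String
    let embedding: [Float]
}

/// Cosine similarity of two equal-length vectors; returns 0 for zero-norm input.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same dimension")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}

/// Mean embedding per speaker ID, computed from the current (possibly contaminated) labels.
func centroids(of chunks: [Chunk]) -> [String: [Float]] {
    var sums: [String: [Float]] = [:]
    var counts: [String: Int] = [:]
    for chunk in chunks {
        if var sum = sums[chunk.speakerId] {
            for i in 0..<sum.count { sum[i] += chunk.embedding[i] }
            sums[chunk.speakerId] = sum
        } else {
            sums[chunk.speakerId] = chunk.embedding
        }
        counts[chunk.speakerId, default: 0] += 1
    }
    var result: [String: [Float]] = [:]
    for (id, sum) in sums {
        let n = Float(counts[id] ?? 1)
        result[id] = sum.map { $0 / n }
    }
    return result
}

/// Reassigns a chunk when another centroid beats its own by more than `margin`.
/// `margin` is an assumed guard against flip-flopping on near-ties.
func migrate(_ chunks: [Chunk], margin: Float = 0.1) -> [Chunk] {
    let cents = centroids(of: chunks)
    return chunks.map { chunk -> Chunk in
        guard let own = cents[chunk.speakerId] else { return chunk }
        var bestId = chunk.speakerId
        var bestScore = cosine(chunk.embedding, own) + margin
        for (id, centroid) in cents where id != chunk.speakerId {
            let score = cosine(chunk.embedding, centroid)
            if score > bestScore {
                bestId = id
                bestScore = score
            }
        }
        var updated = chunk
        updated.speakerId = bestId
        return updated
    }
}
```

The sketch uses plain loops to stay dependency-free; a production version would likely vectorize the dot products and might iterate until assignments stop changing.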

Validated in production at a downstream consumer across a 58-file canonical diarization corpus: +0.225pp aggregate post-LLM SAA, no per-corpus regression, ~70–557ms compute on M-series silicon (well under 1% of total pipeline time vs ASR/snap/LLM stages).

What changed

  • New public ChunkEmbedding struct (Sendable, Codable) carrying speakerId, chunkIndex, speakerIndex, startTimeSeconds, endTimeSeconds, embedding256, and rho128. Speaker IDs follow the same "S\(cluster + 1)" convention as TimedSpeakerSegment.speakerId so chunk embeddings align to segments by string equality. rho128 is non-optional and empty when no PLDA model is loaded, matching the internal TimedEmbedding.rho128 shape.
  • New optional chunkEmbeddings: [ChunkEmbedding]? field on DiarizationResult (defaults to nil, populated only when opted in).
  • New OfflineDiarizerConfig.exposeChunkEmbeddings: Bool flag (defaults to false). When enabled, OfflineDiarizerManager.process(...) maps the internal [TimedEmbedding] + assignments to the public [ChunkEmbedding] array via buildPublicChunkEmbeddings(...).
  • 9 new unit tests covering default values, opt-in behavior, initializer round-trip, the cluster-int → "S\(N)" mapping, length-mismatch handling (returns empty + warns), empty input, and Codable round-tripping.
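The shape described in the list above might look roughly like the following sketch. The field names come from this description, but every field type is an assumption, `InternalEmbedding` is a hypothetical stand-in for the internal `TimedEmbedding`, and `sketchBuildChunkEmbeddings` is an illustrative helper rather than the actual `buildPublicChunkEmbeddings(...)` implementation.

```swift
import Foundation

// Field names follow the PR description; all types are assumptions.
// Equatable is added only so the sketch can compare values after a round trip.
struct ChunkEmbedding: Sendable, Codable, Equatable {
    let speakerId: String
    let chunkIndex: Int
    let speakerIndex: Int
    let startTimeSeconds: Double
    let endTimeSeconds: Double
    let embedding256: [Float]
    let rho128: [Float]  // empty when no PLDA model is loaded
}

// Hypothetical stand-in for the internal TimedEmbedding type.
struct InternalEmbedding {
    let chunkIndex: Int
    let speakerIndex: Int
    let start: Double
    let end: Double
    let embedding256: [Float]
    let rho128: [Float]
}

/// Pairs each internal embedding with its cluster assignment and applies the
/// "S\(cluster + 1)" speaker ID convention. Returns an empty array on a
/// length mismatch, mirroring the behavior the unit tests describe.
func sketchBuildChunkEmbeddings(
    _ embeddings: [InternalEmbedding],
    assignments: [Int]
) -> [ChunkEmbedding] {
    guard embeddings.count == assignments.count else {
        // The PR warns here; this sketch writes to standard error instead.
        FileHandle.standardError.write(Data("chunk/assignment count mismatch\n".utf8))
        return []
    }
    return zip(embeddings, assignments).map { item, cluster in
        ChunkEmbedding(
            speakerId: "S\(cluster + 1)",
            chunkIndex: item.chunkIndex,
            speakerIndex: item.speakerIndex,
            startTimeSeconds: item.start,
            endTimeSeconds: item.end,
            embedding256: item.embedding256,
            rho128: item.rho128
        )
    }
}
```

Because the speaker ID is derived with the same string convention as the segments, a consumer can group chunks by `speakerId` and join them to segments without any extra lookup table.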

Backwards compatibility & performance

  • Fully opt-in. With exposeChunkEmbeddings = false (the default), the new code path is one boolean check that lands in the nil branch — no extra allocation, no extra compute, no memory cost.
  • DiarizationResult.init(...) adds chunkEmbeddings: between speakerDatabase: and timings: with default nil. All existing callers in this repo use named arguments (DiarizerManager.swift:220, with one positional-segments-only call at :223), so no call site changes.
  • OfflineDiarizerConfig.init(...) adds exposeChunkEmbeddings: with default false. The community/community-1 presets and existing init overloads keep their behavior.
  • When opted in: ~1–2 MB per hour of audio for the embedding + PLDA payload. Helper is O(n) over already-computed data — no model calls, no audio access, ~1 ms even on 90-min files.
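The compatibility argument above rests on Swift's defaulted initializer parameters. The sketch below illustrates the pattern with simplified, hypothetical types (`SketchConfig`, `SketchResult`, and `sketchProcess` are not the library's real API): callers that omit the new arguments compile unchanged and take the `nil` branch, while opted-in callers receive the extra payload.

```swift
// Hypothetical, simplified mirror of the config change; not the library's real type.
struct SketchConfig {
    let stepRatio: Double
    let exposeChunkEmbeddings: Bool

    // The new flag is appended with a default, so existing call sites still compile.
    init(stepRatio: Double = 0.20, exposeChunkEmbeddings: Bool = false) {
        self.stepRatio = stepRatio
        self.exposeChunkEmbeddings = exposeChunkEmbeddings
    }
}

// Hypothetical, simplified mirror of the result change.
struct SketchResult {
    let segments: [String]
    let chunkEmbeddings: [[Float]]?

    // The new optional field defaults to nil.
    init(segments: [String], chunkEmbeddings: [[Float]]? = nil) {
        self.segments = segments
        self.chunkEmbeddings = chunkEmbeddings
    }
}

func sketchProcess(_ config: SketchConfig) -> SketchResult {
    let segments = ["S1", "S2"]
    // Single boolean check: the disabled path never builds the embeddings array.
    guard config.exposeChunkEmbeddings else {
        return SketchResult(segments: segments)
    }
    // Placeholder values standing in for the already-computed embeddings.
    return SketchResult(segments: segments, chunkEmbeddings: [[0.1, 0.2], [0.3, 0.4]])
}

// Pre-existing style of call: unaffected by the new parameter.
let legacy = sketchProcess(SketchConfig(stepRatio: 0.15))
// Opted-in call.
let optedIn = sketchProcess(SketchConfig(exposeChunkEmbeddings: true))
```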

Tests & lint

$ swift test
Executed 1352 tests, with 24 tests skipped and 0 failures (0 unexpected)

$ swift format lint --recursive --configuration .swift-format Sources/ Tests/
# (only pre-existing warnings on Fa/Fb in OfflineDiarizerConfig and
#  Sortformer/CLI files; none introduced by this change)

Out of scope

The downstream consumer that motivated this API lives outside this repo and is not part of this PR.


Generated by Claude Code

adamsro added 2 commits April 3, 2026 16:32
Add EmbeddingSkipStrategy to OfflineDiarizerConfig that skips redundant
speaker embedding model calls when consecutive segmentation windows have
highly similar speaker masks.

At the default config (stepRatio=0.20) this has minimal effect. At
higher-overlap configs (e.g., stepRatio=0.15) it provides 1.4-2.3x
embedding speedup with zero quality loss.

Add an opt-in `exposeChunkEmbeddings` flag on OfflineDiarizerConfig that
surfaces per-chunk speaker embeddings (already computed internally) on
DiarizationResult.chunkEmbeddings. Enables chunk-granularity
post-processing — e.g. cluster-purity correction via centroid migration —
without re-running the embedding model. Default off; no behavior change
for existing callers.