feat(diarizer): expose per-chunk embeddings on DiarizationResult #633
Open
adamsro wants to merge 2 commits into
Add EmbeddingSkipStrategy to OfflineDiarizerConfig that skips redundant speaker embedding model calls when consecutive segmentation windows have highly similar speaker masks. At the default config (stepRatio=0.20) this has minimal effect. At higher-overlap configs (e.g., stepRatio=0.15) it provides 1.4-2.3x embedding speedup with zero quality loss.
Add an opt-in `exposeChunkEmbeddings` flag on OfflineDiarizerConfig that surfaces per-chunk speaker embeddings (already computed internally) on DiarizationResult.chunkEmbeddings. Enables chunk-granularity post-processing — e.g. cluster-purity correction via centroid migration — without re-running the embedding model. Default off; no behavior change for existing callers.
Why is this change needed?
Surfaces the per-chunk speaker embeddings the offline pipeline already computes internally (currently only reachable via the `--export-embeddings` JSON dump in the CLI), so consumers can implement chunk-granularity post-processing without re-running the embedding model.

The motivating use case is downstream cluster-purity correction: the clustering step occasionally lands a fraction of one speaker's chunks in another speaker's cluster (mega-cluster contamination on long files, bidirectional smearing between similar voices). With per-chunk embeddings plus their cluster assignments exposed on `DiarizationResult`, downstream code can compute per-cluster centroids and migrate chunks whose own-cluster cosine is dominated by an alternate centroid. This is pure NumPy/BLAS-equivalent post-processing with no extra model calls.

Validated in production at a downstream consumer across a 58-file canonical diarization corpus: +0.225pp aggregate post-LLM SAA, no per-corpus regression, ~70–557ms compute on M-series silicon (well under 1% of total pipeline time vs ASR/snap/LLM stages).
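The centroid-migration idea described above can be sketched in a few lines of Swift. This is a minimal illustration, not the downstream consumer's implementation: `Chunk`, `migrateChunks`, and the `margin` threshold are hypothetical names introduced here, and centroids are computed once from the original (possibly contaminated) assignments.

```swift
import Foundation

// Hypothetical stand-in for a public chunk record; only the fields used here.
struct Chunk {
    var speakerId: String
    let embedding: [Float]
}

// Cosine similarity between two vectors; returns 0 for a zero-norm input.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}

// Mean embedding per speaker ID.
func centroids(of chunks: [Chunk]) -> [String: [Float]] {
    var sums: [String: [Float]] = [:]
    var counts: [String: Int] = [:]
    for chunk in chunks {
        if var sum = sums[chunk.speakerId] {
            for i in 0..<min(sum.count, chunk.embedding.count) {
                sum[i] += chunk.embedding[i]
            }
            sums[chunk.speakerId] = sum
        } else {
            sums[chunk.speakerId] = chunk.embedding
        }
        counts[chunk.speakerId, default: 0] += 1
    }
    var result: [String: [Float]] = [:]
    for (id, sum) in sums {
        let n = Float(counts[id] ?? 1)
        result[id] = sum.map { $0 / n }
    }
    return result
}

// Reassign a chunk when another centroid beats its own by more than `margin`.
// `margin` is an illustrative threshold, not a value taken from this PR.
func migrateChunks(_ chunks: [Chunk], margin: Float = 0.1) -> [Chunk] {
    let cents = centroids(of: chunks)
    return chunks.map { chunk -> Chunk in
        guard let own = cents[chunk.speakerId] else { return chunk }
        let ownScore = cosine(chunk.embedding, own)
        var best = (id: chunk.speakerId, score: ownScore)
        for (id, centroid) in cents where id != chunk.speakerId {
            let score = cosine(chunk.embedding, centroid)
            if score > best.score { best = (id: id, score: score) }
        }
        var updated = chunk
        if best.score - ownScore > margin { updated.speakerId = best.id }
        return updated
    }
}
```

A production version would likely use Accelerate or another BLAS-backed routine for the dot products, and might iterate until assignments stabilize; the single pass here only shows the shape of the computation.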
What changed
- `ChunkEmbedding` struct (`Sendable`, `Codable`) carrying `speakerId`, `chunkIndex`, `speakerIndex`, `startTimeSeconds`, `endTimeSeconds`, `embedding256`, and `rho128`. Speaker IDs follow the same `"S\(cluster + 1)"` convention as `TimedSpeakerSegment.speakerId` so chunk embeddings align to segments by string equality. `rho128` is non-optional and empty when no PLDA model is loaded, matching the internal `TimedEmbedding.rho128` shape.
- `chunkEmbeddings: [ChunkEmbedding]?` field on `DiarizationResult` (defaults to `nil`, populated only when opted in).
- `OfflineDiarizerConfig.exposeChunkEmbeddings: Bool` flag (defaults to `false`). When enabled, `OfflineDiarizerManager.process(...)` maps the internal `[TimedEmbedding] + assignments` to the public `[ChunkEmbedding]` array via `buildPublicChunkEmbeddings(...)`.
- Tests covering the `"S\(N)"` mapping, length-mismatch handling (returns empty + warns), empty input, and `Codable` round-tripping.

Backwards compatibility & performance
- With `exposeChunkEmbeddings = false` (the default), the new code path is one boolean check that lands in the `nil` branch: no extra allocation, no extra compute, no memory cost.
- `DiarizationResult.init(...)` adds `chunkEmbeddings:` between `speakerDatabase:` and `timings:` with default `nil`. All existing callers in this repo use named arguments (`DiarizerManager.swift:220`, with one positional-segments-only call at `:223`), so no call site changes.
- `OfflineDiarizerConfig.init(...)` adds `exposeChunkEmbeddings:` with default `false`. The community/community-1 presets and existing init overloads keep their behavior.

Tests & lint
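The cases named under "What changed" (the `"S\(N)"` mapping, length-mismatch handling, empty input, and `Codable` round-tripping) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the PR's test code: `ChunkEmbeddingSketch`, `TimedEmbeddingSketch`, and `buildChunkSketches` are hypothetical stand-ins, and the field types are guesses based only on the field names listed above.

```swift
import Foundation

// Hypothetical mirror of the public struct; field types are assumptions.
struct ChunkEmbeddingSketch: Codable, Equatable, Sendable {
    let speakerId: String
    let chunkIndex: Int
    let speakerIndex: Int
    let startTimeSeconds: Double
    let endTimeSeconds: Double
    let embedding256: [Float]
    let rho128: [Float]   // empty when no PLDA model is loaded
}

// Hypothetical stand-in for the internal per-chunk record.
struct TimedEmbeddingSketch {
    let chunkIndex: Int
    let speakerIndex: Int
    let startTimeSeconds: Double
    let endTimeSeconds: Double
    let embedding256: [Float]
    let rho128: [Float]
}

// Pair each internal embedding with its cluster assignment.
// On a length mismatch, warn and return an empty array.
func buildChunkSketches(
    _ embeddings: [TimedEmbeddingSketch],
    assignments: [Int]
) -> [ChunkEmbeddingSketch] {
    guard embeddings.count == assignments.count else {
        print("warning: \(embeddings.count) embeddings vs \(assignments.count) assignments")
        return []
    }
    return zip(embeddings, assignments).map { embedding, cluster in
        ChunkEmbeddingSketch(
            speakerId: "S\(cluster + 1)",   // same convention as segment speaker IDs
            chunkIndex: embedding.chunkIndex,
            speakerIndex: embedding.speakerIndex,
            startTimeSeconds: embedding.startTimeSeconds,
            endTimeSeconds: embedding.endTimeSeconds,
            embedding256: embedding.embedding256,
            rho128: embedding.rho128
        )
    }
}

// True when the value survives a JSON encode/decode cycle unchanged.
func roundTrips(_ chunk: ChunkEmbeddingSketch) -> Bool {
    guard let data = try? JSONEncoder().encode(chunk),
          let decoded = try? JSONDecoder().decode(ChunkEmbeddingSketch.self, from: data)
    else { return false }
    return decoded == chunk
}
```

Because the speaker ID is built with the same string convention as the segments, a consumer could group chunks by `speakerId` and join them to segments with plain string equality.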
Out of scope
The downstream consumer that motivated this API lives outside this repo and is not part of this PR.
Generated by Claude Code