Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
89 changes: 89 additions & 0 deletions backend/docs/translation-architecture.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
# ADR-001: Full-Text vs Delta Translation in Streaming Coordinator

## Status

**Accepted** (2026-03-31, implicit — original implementation)
**Superseded By:** DD-008 improvements (PR #28, 2026-06-14)

## Context

Real-time translation during speech-to-text (STT) produces evolving text. As Deepgram streams transcript segments, each segment's text grows incrementally:

```
"Hola" → "Hola como" → "Hola como estas" → "Hola como estas bien"
```

The `TranslationCoordinator` must decide what to send to the Google Translate V3 API on each update cycle.

## Decision

Send **full segment text** (not just the delta/new portion) to the batch translator.

### Rationale

1. **Translation quality**: Google Translate V3's NMT model produces better output for complete sentences than for fragments. Full context enables:
- Correct gender agreement (`"Las estudiantes son inteligentes"` → `"The students are intelligent"`, not `"intelligent"` losing feminine)
- Idiom recognition (`"Estoy de acuerdo"` → `"I agree"`, not `"acuerdo"` → `"agreement"`)
- Proper word ordering in language pairs with different SVO structures

2. **Assembly simplicity**: The `assembled_translation` stored in `SegmentState` IS the final persisted result — it must be high quality since it's written to Firestore and displayed to users. No stitching logic needed.

3. **Stability gates filter most fragments**: Text only reaches the TRANSLATE gate when it has sentence-ending punctuation, a speaker switch, >700ms silence, STT `is_final` signal, or ≥12 tokens open for ≥3 seconds. By the time text reaches Google, it's usually a complete clause.

## Consequences

### Positive

| Aspect | Impact |
|--------|--------|
| Translation quality | High — full NMT context for all persisted results |
| Code simplicity | Single-phase architecture, no delta-stitching logic |
| Debuggability | Each translation is self-contained; easy to inspect |
| Cache correctness | Full-text cache entries are always valid as-is |

### Negative

| Aspect | Impact | Dollar Cost |
|--------|--------|-------------|
| Redundant API calls | Evolving text generates unique MD5 cache keys at every step | ~$1,700–2,350/mo avoidable |
| Cache hit rate depression | Identical sentences across different segments don't dedup | Current: ~9% effective hit rate |
| Firestore write amplification | Each intermediate translation overwrites previous one | ~5x writes per stabilized segment |
| Character throughput | 284M chars/mo vs ~120–170M chars/mo potential | $4,282/mo vs ~$1,900–2,500/mo target |

## Alternatives Considered

### Alternative A: Delta Text + Stitching
Send only `new_text` (the uncommitted portion) to the API. Stitch onto previous `assembled_translation`.

**Rejected initially** because:
- Fragment quality risk (see Rationale #1 above)
- Complex stitching logic needed for STT backtracking
- Would need to detect when STT revises earlier words

**Re-evaluated in DD-008** as Phase 2 (future work) with streaming best-effort + finalization quality guarantee.

### Alternative B: Two-Phase Architecture (Streaming + Finalization)
- **Streaming phase**: Send deltas for real-time UX (best-effort quality)
- **Finalization phase**: On stability signal, send full text split into sentences with per-sentence caching

**Selected as future direction** (DD-008 Phase 2). Not implemented yet because it requires significant architectural changes and STT backtracking handling.

### Alternative C: Sentence-Level Dedup Only (CHOSEN — This PR)
Keep sending full text, but split into sentences before cache lookup. Preserves full-sentence translation quality while eliminating redundant translations of identical sentences across segments/users.

**Implemented in PR #28** alongside merge-aware Redis lookup.

## This PR's Changes (DD-008 Phase 1 + 3)

| Change | File | What | Savings |
|--------|------|------|---------|
| Sentence-level dedup | `translation.py:translate_units_batch()` | Split texts → per-sentence cache check → reassemble | $1,070–$1,500/mo |
| Merge-aware Redis lookup | `translation_coordinator.py:199-214` | On prefix reset, check Redis before re-translating | $430–640/mo |
| Design-decision comments | `translation_coordinator.py` | Document why full text is sent, trade-offs, future path | $0 (documentation) |

## References

- DD-008 deep dive: `deep-dives/DD-008-translate-dedup-gap.md`
- DD-008 design review: `deep-dives/DD-008-design-review.md`
- Original implementation: commit `fd25ede51` (PR #6155/#6178 by Thinh, 2026-03-31)
- Google Cloud Translate V3 API: https://cloud.google.com/translate/docs/basic/translating-text
138 changes: 91 additions & 47 deletions backend/utils/translation.py
Original file line number Diff line number Diff line change
Expand Up @@ -530,60 +530,76 @@ def translate_text_by_sentence(self, dest_language: str, text: str) -> Tuple[str
def translate_units_batch(self, dest_language: str, units: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
"""Translate a batch of (unit_id, text) pairs in minimal GCP API calls.

Deduplicates identical texts, checks all cache layers, and batches
only truly uncached texts into a single API call.
Splits each text into sentences, checks all cache layers PER SENTENCE,
batches only truly uncached sentences into a single API call, then
reassembles per-unit results.

This sentence-level dedup means that if two different units share
a common sentence (e.g., "How are you?"), only the first occurrence
triggers an API call — subsequent units get the cached result.

Returns list of (unit_id, translated_text, detected_lang) in input order.
"""
if not units:
return []

# Build deduplicated mapping: text_hash -> (text, [indices])
results = [None] * len(units)
hash_to_info = {} # text_hash -> {'text': str, 'indices': [int]}

for i, (unit_id, text) in enumerate(units):
text_hash = hashlib.md5(text.encode()).hexdigest()
if text_hash not in hash_to_info:
hash_to_info[text_hash] = {'text': text, 'indices': [], 'hash': text_hash}
hash_to_info[text_hash]['indices'].append(i)

# Phase 1: Check caches for each unique text
uncached_hashes = []
for text_hash, info in hash_to_info.items():
# Phase 0: Split each unit's text into sentences
# unit_sentences[i] = list of (sentence_text, sentence_hash) for unit i
unit_sentences = []
for unit_id, text in units:
sentences = split_into_sentences(text)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check full-text caches before splitting

For multi-sentence segments, this now splits the text before any cache lookup, so the old full-text LRU/Redis/negative keys are never consulted. That means existing 14-day Redis entries produced by the previous batch path or by translate_text() for texts like Hola. Gracias. will be bypassed and sent to Google again until sentence-level keys happen to be populated, undercutting the cost-saving goal on deploy; add the full-text cache/negative-cache check before falling back to sentence-level dedup.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve abbreviations when splitting sentences

With this new batch path, any transcript containing common abbreviations or decimals is sent to Google as broken fragments because split_into_sentences() splits on every period: for example I live in the U.S. now. becomes I live in the U., S., now., and Version 3.11 is installed. becomes Version 3., 11 is installed.. The previous full-text batch translation preserved that context, so these inputs can now persist noticeably worse translations; use safer sentence-boundary rules or avoid sentence splitting for these cases.

Useful? React with 👍 / 👎.

hashed = [(s, hashlib.md5(s.encode()).hexdigest()) for s in sentences]
unit_sentences.append((unit_id, text, hashed))

# Build global sentence-level dedup map:
# sent_hash -> {'text': str, 'indices': [(unit_idx, sent_idx), ...]}
sent_hash_to_info = {}
for unit_idx, (_, _, sentences) in enumerate(unit_sentences):
for sent_idx, (sent_text, sent_hash) in enumerate(sentences):
if sent_hash not in sent_hash_to_info:
sent_hash_to_info[sent_hash] = {
'text': sent_text,
'indices': [],
}
sent_hash_to_info[sent_hash]['indices'].append((unit_idx, sent_idx))

# Phase 1: Check caches for each unique sentence
# sent_translation[hash] = (translated_text, detected_lang) or None
sent_translation = {} # hash -> (str, str)
uncached_sent_hashes = []

for sent_hash, info in sent_hash_to_info.items():
# Check negative cache first
if get_negative_cache(text_hash, dest_language):
for idx in info['indices']:
results[idx] = (units[idx][0], info['text'], '') # return original text
if get_negative_cache(sent_hash, dest_language):
sent_translation[sent_hash] = (info['text'], '') # return original
continue

# Check memory cache
cached = self._check_memory_cache(text_hash, dest_language)
cached = self._check_memory_cache(sent_hash, dest_language)
if cached:
for idx in info['indices']:
results[idx] = (units[idx][0], cached[0], cached[1])
sent_translation[sent_hash] = cached
continue

# Check Redis cache
redis_cached = get_cached_translation(text_hash, dest_language)
redis_cached = get_cached_translation(sent_hash, dest_language)
if redis_cached:
translated = redis_cached["text"]
detected = redis_cached.get("detected_lang", "")
self._set_memory_cache(text_hash, dest_language, translated, detected)
for idx in info['indices']:
results[idx] = (units[idx][0], translated, detected)
self._set_memory_cache(sent_hash, dest_language, translated, detected)
sent_translation[sent_hash] = (translated, detected)
continue

uncached_hashes.append(text_hash)
uncached_sent_hashes.append(sent_hash)

# Phase 2: Batch translate uncached texts
if uncached_hashes:
uncached_texts = [hash_to_info[h]['text'] for h in uncached_hashes]
# Phase 2: Batch translate uncached sentences
_failed_sent_hashes: set = set() # track which sentences fell back to original text
if uncached_sent_hashes:
uncached_texts = [sent_hash_to_info[h]['text'] for h in uncached_sent_hashes]

for chunk_start in range(0, len(uncached_texts), MAX_BATCH_SIZE):
chunk_end = min(chunk_start + MAX_BATCH_SIZE, len(uncached_texts))
chunk = uncached_texts[chunk_start:chunk_end]
chunk_hashes = uncached_hashes[chunk_start:chunk_end]
chunk_hashes = uncached_sent_hashes[chunk_start:chunk_end]

try:
response = _client.translate_text(
Expand All @@ -594,29 +610,57 @@ def translate_units_batch(self, dest_language: str, units: List[Tuple[str, str]]
)

for j, translation in enumerate(response.translations):
text_hash = chunk_hashes[j]
sent_hash = chunk_hashes[j]
translated_text = translation.translated_text
detected_lang = translation.detected_language_code or ""
info = hash_to_info[text_hash]

self._set_memory_cache(text_hash, dest_language, translated_text, detected_lang)
cache_translation(text_hash, dest_language, translated_text, detected_lang)

for idx in info['indices']:
results[idx] = (units[idx][0], translated_text, detected_lang)
self._set_memory_cache(sent_hash, dest_language, translated_text, detected_lang)
cache_translation(sent_hash, dest_language, translated_text, detected_lang)
sent_translation[sent_hash] = (translated_text, detected_lang)

except Exception as e:
logger.error(f"Batch translation error: {e}")
logger.error(f"Sentence-level batch translation error: {e}")
# Mark failed sentences as fallbacks so Phase 3 does NOT
# write them to Redis (avoids poisoning cache with raw text).
for h in chunk_hashes:
info = hash_to_info[h]
for idx in info['indices']:
if results[idx] is None:
results[idx] = (units[idx][0], info['text'], '')

# Fill any remaining None results (shouldn't happen, but be safe)
for i in range(len(results)):
if results[i] is None:
results[i] = (units[i][0], units[i][1], '')
if h not in sent_translation:
sent_translation[h] = (sent_hash_to_info[h]['text'], '')
_failed_sent_hashes.add(h)

# Phase 3: Reassemble per-unit results from sentence translations
results = []
for unit_idx, (unit_id, original_text, sentences) in enumerate(unit_sentences):
if not sentences:
results.append((unit_id, original_text, ''))
continue

translated_parts = []
detected_langs = []
for sent_text, sent_hash in sentences:
if sent_hash in sent_translation:
trans_text, det_lang = sent_translation[sent_hash]
translated_parts.append(trans_text)
if det_lang:
detected_langs.append(det_lang)
else:
# Fallback: should not happen, but use original text
translated_parts.append(sent_text)

assembled = ' '.join(translated_parts)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ASCII space joining breaks CJK target-language output

' '.join(translated_parts) inserts a Latin space between sentences even when the target language is Chinese, Japanese, or Korean. Those languages don't delimit sentences with spaces, so the assembled string will contain spurious U+0020 characters (e.g. 你好。 你好吗? instead of 你好。你好吗?). This affects every CJK target-language user and degrades the quality of the persisted Firestore translation.

Suggested change
assembled = ' '.join(translated_parts)
_CJK_LANGS = {'zh', 'ja', 'ko', 'zh-cn', 'zh-tw', 'zh-hk'}
sep = '' if dest_language.lower().split('-')[0] in _CJK_LANGS else ' '
assembled = sep.join(translated_parts)

# Dominant detected language from constituent sentences
dominant_lang = ''
if detected_langs:
lang_counts = Counter(detected_langs)
dominant_lang = lang_counts.most_common(1)[0][0]

text_hash = hashlib.md5(original_text.encode()).hexdigest()
# Only persist to any cache if NO sentence fell back to original text
# (avoids poisoning both in-memory LRU and Redis with untranslated output).
if not any(sh in _failed_sent_hashes for _, sh in sentences):
self._set_memory_cache(text_hash, dest_language, assembled, dominant_lang)
cache_translation(text_hash, dest_language, assembled, dominant_lang)

results.append((unit_id, assembled, dominant_lang))
Comment on lines +630 to +663

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Phase 3 unconditionally writes full-text positive cache even for same-language text

cache_translation(text_hash, ...) is always called in Phase 3 regardless of whether the assembled translation actually differs from the original. If all constituent sentences were detected as same-language (returning original text), the full-text cache entry will hold the original text — an entry that downstream callers (including the merge-aware Fix 2 lookup) will treat as a valid translation. The _flush_batch check in the coordinator catches this for the normal flow, but the Fix 2 shortcut path bypasses it, so a "translation" equalling the source text can reach on_translation_ready.


return results

Expand Down
Loading
Loading