feat(core-agent): add stripUnreadableSymbols for TTS text sanitization #14

Closed
vi70x4 wants to merge 8 commits into main from feat/unreadable-symbols-stripper
Conversation

@vi70x4

@vi70x4 vi70x4 commented Jun 6, 2026

Collaborator

Extends stripMarkdownFromSpeech with additional passes to strip emoji, decorative Unicode, standalone special chars, math operators, and repeated punctuation from TTS input text. All passes are configurable via options with sensible defaults (strip everything by default).

Streaming control tokens (<|ACT|>, <|DELAY|>, <|CALL|>) are preserved via Private Use Area placeholder extraction.
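
The placeholder technique described above can be sketched as follows. This is a minimal illustration, not the PR's code: the names `SENTINEL`, `protectTokens`, and `restoreTokens` are hypothetical, and the sentinel is assumed to be the single Private Use Area code point U+E000.

```typescript
// Hypothetical sentinel: a single Private Use Area code point (U+E000).
const SENTINEL = '\uE000'

// Swap each <|...|> control token for SENTINEL + base-36 index + SENTINEL.
function protectTokens(text: string): { processed: string, tokens: string[] } {
  const tokens: string[] = []
  const processed = text.replace(/<\|[^|]+\|>/g, (match) => {
    tokens.push(match)
    return `${SENTINEL}${(tokens.length - 1).toString(36)}${SENTINEL}`
  })
  return { processed, tokens }
}

// Put the original tokens back; an unknown index restores to an empty string.
function restoreTokens(text: string, tokens: string[]): string {
  const pattern = new RegExp(`${SENTINEL}([0-9a-z]+)${SENTINEL}`, 'g')
  return text.replace(pattern, (_, index: string) => tokens[Number.parseInt(index, 36)] ?? '')
}

const { processed, tokens } = protectTokens('Hi <|ACT|> a|b <|DELAY|>')
// Stand-in for a stripping pass that would otherwise damage the tokens.
const stripped = processed.replace(/[<>|]/g, '')
const restored = restoreTokens(stripped, tokens)
```

Because the tokens are swapped out before any stripping runs, a pass that removes `<`, `>`, or `|` cannot touch them, and they come back intact once the passes are done.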

Closes: extends plaintext-response-format spec

Summary

Extends the plaintext-response-format spec by adding a new stripUnreadableSymbols function that strips emoji, decorative Unicode, standalone special characters, math operators, and repeated punctuation from TTS input text — complementing the existing stripMarkdownFromSpeech function.

What changed

  • New file: packages/core-agent/src/runtime/unreadable-symbols-stripper.ts

    • stripUnreadableSymbols(text, options?) — 6-pass sanitizer (Markdown + emoji + decorative Unicode + standalone special chars + math operators + repeated punctuation collapsing)
    • StripUnreadableSymbolsOptions interface with 5 configurable boolean flags (all default true)
    • Streaming control tokens (<|ACT|>, <|DELAY|>, <|CALL|>) preserved via Private Use Area placeholder extraction
    • stripMarkdownFromSpeech remains exported unchanged (backward compatibility)
  • New file: packages/core-agent/src/runtime/unreadable-symbols-stripper.test.ts

    • 63 tests covering all stripping categories, token preservation, options behavior, edge cases, and backward compatibility
  • Modified: packages/core-agent/src/runtime/chat-orchestrator-runtime.ts

    • Both stripMarkdownFromSpeech calls (streaming path + final categorization path) replaced with stripUnreadableSymbols
  • Modified: packages/core-agent/src/index.ts

    • Added exports for stripUnreadableSymbols and StripUnreadableSymbolsOptions
  • New spec: .roo/specs/unreadable-symbols-stripper/ (requirements.md, design.md, tasks.md)
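
For orientation, the options handling could look roughly like the sketch below. The five flag names and the `{ ...DEFAULT_OPTIONS, ...options }` merge are taken from the code quoted later in this thread; the interface body and the `resolveOptions` helper are assumptions made for illustration.

```typescript
// Flag names as they appear in the quoted diff; the interface body is a reconstruction.
interface StripUnreadableSymbolsOptions {
  stripEmoji?: boolean
  stripDecorativeUnicode?: boolean
  stripStandaloneSpecialChars?: boolean
  stripMathOperators?: boolean
  collapseRepeatedPunctuation?: boolean
}

// Every pass is enabled unless the caller opts out.
const DEFAULT_OPTIONS: Required<StripUnreadableSymbolsOptions> = {
  stripEmoji: true,
  stripDecorativeUnicode: true,
  stripStandaloneSpecialChars: true,
  stripMathOperators: true,
  collapseRepeatedPunctuation: true,
}

// Hypothetical helper: caller-supplied flags override the defaults.
function resolveOptions(options?: StripUnreadableSymbolsOptions): Required<StripUnreadableSymbolsOptions> {
  return { ...DEFAULT_OPTIONS, ...options }
}

const resolved = resolveOptions({ collapseRepeatedPunctuation: false })
```

Only the flag that was passed changes; the other four keep their default of `true`.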

How tested

  • pnpm -F @proj-airi/core-agent typecheck — passed
  • pnpm -F @proj-airi/core-agent exec vitest run — 164 tests passed (14 test files)

Summary by Sourcery

Introduce a configurable TTS text sanitization utility and integrate it into the chat orchestrator to replace Markdown-only stripping.

New Features:

  • Add a new stripUnreadableSymbols function with configurable options for removing unreadable symbols from TTS input text and preserving streaming control tokens.
  • Export stripUnreadableSymbols and its StripUnreadableSymbolsOptions type from the core-agent public API.

Enhancements:

  • Update the chat orchestrator runtime to use the new stripUnreadableSymbols sanitizer instead of the Markdown-only stripper for both streaming and final categorization paths.

Documentation:

  • Add design, requirements, and task specs for the unreadable symbols stripper module.

Tests:

  • Add comprehensive unit tests for stripUnreadableSymbols covering symbol categories, options behavior, token preservation, edge cases, and backward compatibility.

@sourcery-ai

sourcery-ai Bot commented Jun 6, 2026


Reviewer's Guide

Adds a new configurable TTS sanitization utility stripUnreadableSymbols, wires it into the chat orchestrator in place of stripMarkdownFromSpeech, exports it from core-agent, and documents the behavior with a dedicated spec and tests.

Flow diagram for stripUnreadableSymbols sanitization pipeline

flowchart TD
  A[Input text with streaming tokens and markdown] --> B[extractStreamingTokens]
  B --> C[safeText]
  C --> D[stripMarkdownFromSpeech]
  D --> E[Pass 2: stripEmoji]
  E --> F[Pass 3: stripDecorativeUnicode]
  F --> G[Pass 4: stripStandaloneSpecialChars]
  G --> H[Pass 5: stripMathOperators]
  H --> I[Pass 6: collapseRepeatedPunctuation]
  I --> J[Collapse multiple spaces]
  J --> K[restoreStreamingTokens]
  K --> L[Trim result]
  L --> M[Sanitized TTS text output]
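
The ordering in the diagram can be read as a list of passes applied in sequence, followed by space collapsing and a trim. The sketch below shows only that shape, under stated assumptions: `Pass`, `runPipeline`, `stripArrows`, and `collapseBangs` are hypothetical names, and the token extract/restore steps are left out for brevity.

```typescript
type Pass = (text: string) => string

// Run the passes in order, then collapse runs of spaces and trim.
function runPipeline(text: string, passes: Pass[]): string {
  const out = passes.reduce((acc, pass) => pass(acc), text)
  return out.replace(/ {2,}/g, ' ').trim()
}

// Hypothetical stand-ins for two of the passes named in the diagram.
const stripArrows: Pass = text => text.replace(/[\u2190-\u21FF]/g, '')
const collapseBangs: Pass = text => text.replace(/!{3,}/g, '!')

const spoken = runPipeline('Go → left!!!!  now', [stripArrows, collapseBangs])
```

Removing the arrow leaves a double space behind, which is why the space-collapsing step sits after the passes rather than inside each one.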

File-Level Changes

Change Details Files
Introduce stripUnreadableSymbols TTS sanitization utility with configurable passes and streaming token preservation.
  • Define StripUnreadableSymbolsOptions with 5 boolean flags defaulting to true.
  • Implement stripUnreadableSymbols to first delegate to stripMarkdownFromSpeech, then run emoji, decorative Unicode, standalone special char, math-operator, and repeated-punctuation passes in sequence.
  • Add extractStreamingTokens/restoreStreamingTokens helpers using Private Use Area placeholders to protect <
...
Add comprehensive tests for the unreadable symbols stripper behavior and options.
  • Cover emoji, decorative Unicode, standalone special chars, math operators, repeated punctuation collapsing, and their combinations.
  • Verify streaming control tokens like <
ACT
Wire stripUnreadableSymbols into the chat orchestrator runtime instead of stripMarkdownFromSpeech for TTS speech text.
  • Import stripUnreadableSymbols alongside stripMarkdownFromSpeech in chat-orchestrator-runtime.
  • Replace the streaming path call that sanitized categorizer.filterToSpeech(...) with stripUnreadableSymbols.
  • Replace the final categorization speech sanitization with stripUnreadableSymbols while leaving reasoning handling unchanged.
packages/core-agent/src/runtime/chat-orchestrator-runtime.ts
Expose the new stripper utility and options type from the core-agent public API.
  • Export stripUnreadableSymbols from the package index next to stripMarkdownFromSpeech.
  • Export the StripUnreadableSymbolsOptions type for external configuration.
packages/core-agent/src/index.ts
Document the unreadable symbols stripper design, requirements, and implementation tasks.
  • Describe the multi-pass architecture, Unicode ranges used, and streaming token protection strategy in the design doc.
  • Capture functional requirements for each stripping category and integration expectations in requirements.md.
  • Record the concrete implementation and testing tasks and their completion status in tasks.md.
.roo/specs/unreadable-symbols-stripper/design.md
.roo/specs/unreadable-symbols-stripper/requirements.md
.roo/specs/unreadable-symbols-stripper/tasks.md

@deepsource-io

deepsource-io Bot commented Jun 6, 2026


DeepSource Code Review

We reviewed changes in 3bdd1ac...77f95c8 on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

Important

Some issues found as part of this review are outside of the diff in this pull request and aren't shown in the inline review comments due to GitHub's API limitations. You can see those issues on the DeepSource dashboard.

PR Report Card

Overall Grade  

Focus Area: Reliability
Security  

Reliability  

Complexity  

Hygiene  

Feedback

Logging pattern in browser-facing code

  • The same console usage appears across multiple browser runtime files, which is why it shows up as nine occurrences of the same reliability issue.
  • Might be worth deciding when/where you want logging in these paths, since TTS and chat runtimes are going to be user-facing and potentially noisy if this sticks around.

Code Review Summary

Analyzer Status Updated (UTC) Details
JavaScript Jun 6, 2026 11:25p.m. Review ↗
Shell Jun 6, 2026 11:25p.m. Review ↗
C & C++ Jun 6, 2026 11:25p.m. Review ↗

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • The streaming token placeholder implementation (TOKEN_PLACEHOLDER_BASE = '\uE0000', placed on both sides of the index) deviates from the spec's null-byte approach and relies on a specific Private Use Area codepoint; consider switching to a delimiter that cannot plausibly appear in user text (e.g. a \x00-wrapped index) to avoid accidental collisions and simplify the restore regex.
  • The stripMathOperators regex uses a lookbehind (?<=\s), which can be brittle across JS runtimes and is slightly inconsistent with the boundary-based approach documented in the design; you might want to rework this to use an explicit (^|\s)-style grouping so behavior is both more portable and easier to reason about at string boundaries.
  • The emoji/decorative Unicode passes chain many overlapping and partially duplicated ranges with multiple replace calls; it may be easier to maintain and reason about if you consolidate these into a smaller number of well-documented ranges (or a single compiled regex) that more directly reflects the categories described in the design.
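
The second point can be illustrated with a small sketch of the `(^|\s)` grouping. The character class matches the one quoted later in this thread; the names `STANDALONE_OPERATORS` and `stripStandaloneOperators` are hypothetical.

```typescript
// Standalone operator runs only: preceded by start-of-string or whitespace,
// followed (via lookahead) by whitespace or end-of-string.
const STANDALONE_OPERATORS = /(^|\s)[+=\-<>&^~|\\/%]+(?=\s|$)/g

function stripStandaloneOperators(text: string): string {
  // '$1' puts back the leading whitespace that the group captured.
  return text.replace(STANDALONE_OPERATORS, '$1')
}

const spaced = stripStandaloneOperators('a + b') // operator removed, both spaces kept
const attached = stripStandaloneOperators('C++ and A&B') // unchanged: operators touch letters
const leading = stripStandaloneOperators('= x') // operator at start of string removed
```

Since the group consumes the leading whitespace, the replacement has to re-insert it with `$1`; the trailing side is a lookahead and consumes nothing, so a double space can remain until a later space-collapsing step.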

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the stripUnreadableSymbols utility to sanitize LLM response text for TTS by stripping emoji, decorative Unicode, standalone special characters, math operators, and collapsing repeated punctuation. It integrates this utility into the chat orchestrator runtime and exports it from the core agent package. Feedback on the implementation highlights a bug in the Unicode escape sequence for the token placeholder base, which evaluates to a two-character string instead of a single character. Additionally, it is recommended to combine the numerous consecutive .replace() calls for emoji and decorative Unicode stripping into single regular expressions to improve streaming performance and eliminate redundant range matches.
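
The escape-sequence issue mentioned in this summary can be checked directly. In a JavaScript string literal, `\u` takes exactly four hex digits, so `'\uE0000'` is U+E000 followed by the digit `0`. The variable names below are illustrative only.

```typescript
const buggy = '\uE0000' // parsed as '\uE000' followed by the character '0'
const single = '\uE000' // one Private Use Area code point

const buggyLength = buggy.length // 2
const secondChar = buggy[1] // '0'

// The braced form names code point U+E0000, which is outside the Private
// Use Areas and is stored as a surrogate pair (two UTF-16 code units).
const braced = '\u{E0000}'
const bracedCodePoint = braced.codePointAt(0) // 0xE0000
```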

Comment thread packages/core-agent/src/runtime/unreadable-symbols-stripper.ts Outdated
Comment on lines +147 to +160
  if (opts.stripDecorativeUnicode) {
    // U+2190-U+21FF: Arrows
    result = result.replace(/[\u{2190}-\u{21FF}]/gu, '')
    // U+2500-U+257F: Box drawing
    result = result.replace(/[\u{2500}-\u{257F}]/gu, '')
    // U+2580-U+259F: Block elements
    result = result.replace(/[\u{2580}-\u{259F}]/gu, '')
    // U+25A0-U+25FF: Geometric shapes
    result = result.replace(/[\u{25A0}-\u{25FF}]/gu, '')
    // Specific decorative chars: © ® ™ § ¶ † ‡ • ‣ ⁃
    result = result.replace(/[©®™§¶†‡•‣⁃]/g, '')
    // U+2100-U+214F: Letterlike symbols
    result = result.replace(/[\u{2100}-\u{214F}]/gu, '')
  }

medium

Similarly to the emoji pass, we can combine all decorative Unicode ranges and characters into a single regular expression to reduce the number of .replace() calls from 6 to 1, improving performance during streaming.

  if (opts.stripDecorativeUnicode) {
    result = result.replace(/[\u{2190}-\u{21FF}\u{2500}-\u{257F}\u{2580}-\u{259F}\u{25A0}-\u{25FF}©®™§¶†‡•‣⁃\u{2100}-\u{214F}]/gu, '')
  }
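
As a rough check that one combined character class behaves like the chained calls, the sketch below compares the two on a sample string. The patterns mirror the ranges of the original pass (with the `u` flag added throughout); `CHAINED`, `COMBINED`, `stripChained`, and `stripCombined` are hypothetical names.

```typescript
// Six patterns mirroring the original decorative-Unicode pass.
const CHAINED: RegExp[] = [
  /[\u{2190}-\u{21FF}]/gu, // Arrows
  /[\u{2500}-\u{257F}]/gu, // Box drawing
  /[\u{2580}-\u{259F}]/gu, // Block elements
  /[\u{25A0}-\u{25FF}]/gu, // Geometric shapes
  /[©®™§¶†‡•‣⁃]/gu, // Specific decorative characters
  /[\u{2100}-\u{214F}]/gu, // Letterlike symbols
]

// One class covering the same code points. The three middle ranges are
// contiguous, so they collapse to U+2500-U+25FF.
const COMBINED = /[\u{2190}-\u{21FF}\u{2500}-\u{25FF}\u{2100}-\u{214F}©®™§¶†‡•‣⁃]/gu

function stripChained(text: string): string {
  return CHAINED.reduce((acc, re) => acc.replace(re, ''), text)
}

function stripCombined(text: string): string {
  return text.replace(COMBINED, '')
}

const sample = 'A → B ┌─┐ ■ © 2026 ™ • well-known'
const sameOutput = stripChained(sample) === stripCombined(sample)
```

A stray `-` between two ranges inside a class is treated as a literal hyphen, so it is worth confirming that ordinary hyphenated words such as `well-known` survive the combined pattern.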

@@ -21,6 +21,7 @@ import { createChatHooks } from './agent-hooks'
import { useLlmmarkerParser } from './llm-marker-parser'
import { categorizeResponse, createStreamingCategorizer } from './response-categoriser'
import { stripMarkdownFromSpeech } from './markdown-stripper'

'stripMarkdownFromSpeech' is defined but never used


Unused variables are generally considered a code smell and should be avoided.

Comment on lines +38 to +46
function extractStreamingTokens(text: string): { processed: string, tokens: string[] } {
  const tokens: string[] = []
  const processed = text.replace(/<\|[^|]+\|>/g, (match) => {
    const index = tokens.length
    tokens.push(match)
    return `${TOKEN_PLACEHOLDER_BASE}${index.toString(36)}${TOKEN_PLACEHOLDER_BASE}`
  })
  return { processed, tokens }
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.

Comment on lines +51 to +56
function restoreStreamingTokens(text: string, tokens: string[]): string {
  return text.replace(
    new RegExp(`${TOKEN_PLACEHOLDER_BASE}([0-9a-z]+)${TOKEN_PLACEHOLDER_BASE}`, 'g'),
    (_, indexStr) => tokens[Number.parseInt(indexStr, 36)] ?? '',
  )
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.

Comment on lines +84 to +195
export function stripUnreadableSymbols(
text: string,
options?: StripUnreadableSymbolsOptions,
): string {
const opts: Required<StripUnreadableSymbolsOptions> = { ...DEFAULT_OPTIONS, ...options }

// Protect streaming control tokens from stripping
const { processed: safeText, tokens } = extractStreamingTokens(text)

// Pass 1: Strip Markdown syntax (always run)
let result = stripMarkdownFromSpeech(safeText)

// Pass 2: Strip emoji and Unicode pictographic symbols
if (opts.stripEmoji) {
// Remove variation selectors, ZWJ, keycap combining chars first
result = result.replace(/\uFE0F/gu, '')
result = result.replace(/\u200D/gu, '')
result = result.replace(/\u20E3/gu, '')

// Remove emoji Unicode ranges
// U+1F300-U+1F9FF: Misc symbols, emoticons, transport, supplemental
result = result.replace(/[\u{1F300}-\u{1F9FF}]/gu, '')
// U+2600-U+26FF: Misc symbols
result = result.replace(/[\u{2600}-\u{26FF}]/gu, '')
// U+2700-U+27BF: Dingbats
result = result.replace(/[\u{2700}-\u{27BF}]/gu, '')
// U+1F3FB-U+1F3FF: Skin tone modifiers
result = result.replace(/[\u{1F3FB}-\u{1F3FF}]/gu, '')
// U+1F1E0-U+1F1FF: Regional indicator symbols (flags)
result = result.replace(/[\u{1F1E0}-\u{1F1FF}]/gu, '')
// U+1F600-U+1F64F: Emoticons (faces)
result = result.replace(/[\u{1F600}-\u{1F64F}]/gu, '')
// U+1F680-U+1F6FF: Transport and map symbols
result = result.replace(/[\u{1F680}-\u{1F6FF}]/gu, '')
// U+1FA00-U+1FAFF: Extended-A and beyond
result = result.replace(/[\u{1FA00}-\u{1FAFF}]/gu, '')
// U+2702-U+27B0: Dingbats (additional)
result = result.replace(/[\u{2702}-\u{27B0}]/gu, '')
// U+231A-U+231B: Watch, hourglass
result = result.replace(/[\u{231A}-\u{231B}]/gu, '')
// U+23E9-U+23F3: Media controls, clocks
result = result.replace(/[\u{23E9}-\u{23F3}]/gu, '')
// U+23F8-U+23FA: Media controls
result = result.replace(/[\u{23F8}-\u{23FA}]/gu, '')
// U+25AA-U+25AB, U+25B6, U+25C0, U+25FB-U+25FE: Geometric shapes
result = result.replace(/[\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}]/gu, '')
// U+2614-U+2615: Umbrella, hot beverage
result = result.replace(/[\u{2614}-\u{2615}]/gu, '')
// U+2648-U+2653: Zodiac
result = result.replace(/[\u{2648}-\u{2653}]/gu, '')
// U+267F, U+2693, U+26A1, U+26AA-U+26AB, U+26BD-U+26BE, U+26C4-U+26C5, U+26CE, U+26D4, U+26EA, U+26F2-U+26F3, U+26F5, U+26FA, U+26FD: Misc
result = result.replace(/[\u{267F}\u{2693}\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}]/gu, '')
// U+2934-U+2935: Arrows
result = result.replace(/[\u{2934}-\u{2935}]/gu, '')
// U+2B05-U+2B07: Arrows
result = result.replace(/[\u{2B05}-\u{2B07}]/gu, '')
// U+2B1B-U+2B1C, U+2B50, U+2B55: Geometric shapes
result = result.replace(/[\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}]/gu, '')
// U+3030, U+303D, U+3297, U+3299: CJK symbols
result = result.replace(/[\u{3030}\u{303D}\u{3297}\u{3299}]/gu, '')
}

// Pass 3: Strip decorative Unicode (arrows, box-drawing, shapes, dingbats)
if (opts.stripDecorativeUnicode) {
// U+2190-U+21FF: Arrows
result = result.replace(/[\u{2190}-\u{21FF}]/gu, '')
// U+2500-U+257F: Box drawing
result = result.replace(/[\u{2500}-\u{257F}]/gu, '')
// U+2580-U+259F: Block elements
result = result.replace(/[\u{2580}-\u{259F}]/gu, '')
// U+25A0-U+25FF: Geometric shapes
result = result.replace(/[\u{25A0}-\u{25FF}]/gu, '')
// Specific decorative chars: © ® ™ § ¶ † ‡ • ‣ ⁃
result = result.replace(/[©®™§¶†‡•‣⁃]/g, '')
// U+2100-U+214F: Letterlike symbols
result = result.replace(/[\u{2100}-\u{214F}]/gu, '')
}

// Pass 4: Strip standalone special characters
if (opts.stripStandaloneSpecialChars) {
// Matches standalone special chars surrounded by whitespace or at string boundaries.
// Re-inserts the leading whitespace via $1; the trailing whitespace is only asserted by the lookahead, not consumed.
result = result.replace(/(^|\s)[*#@|\\/~^`]+(?=\s|$)/g, '$1')
}

// Pass 5: Strip standalone math/operator symbols
if (opts.stripMathOperators) {
// Matches standalone operator sequences surrounded by whitespace or boundaries.
// Uses lookbehind for whitespace/start and lookahead for whitespace/end.
// Does NOT match when adjacent to non-whitespace characters (e.g., C++, A&B).
// Consumes no whitespace: both boundaries are zero-width assertions.
result = result.replace(/(?:^|(?<=\s))[+=\-<>&^~|\\/%]+(?=\s|$)/g, '')
}

// Pass 6: Collapse repeated punctuation
if (opts.collapseRepeatedPunctuation) {
result = result.replace(/!{3,}/g, '!')
result = result.replace(/\?{3,}/g, '?')
result = result.replace(/\.{4,}/g, '…')
result = result.replace(/-{3,}/g, '—')
result = result.replace(/~{2,}/g, '~')
}

// Collapse multiple spaces into one
result = result.replace(/ {2,}/g, ' ')

// Restore streaming control tokens
result = restoreStreamingTokens(result, tokens)

// Trim leading/trailing whitespace from the overall result
return result.trim()
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.

* stripUnreadableSymbols('Price is $5!!! Really???', { collapseRepeatedPunctuation: false })
* // -> 'Price is $5!!! Really???'
*/
export function stripUnreadableSymbols(

`stripUnreadableSymbols` has a cyclomatic complexity of 6 with "medium" risk


A function with high cyclomatic complexity can be hard to understand and
maintain. Cyclomatic complexity is a software metric that measures the number of
independent paths through a function. A higher cyclomatic complexity indicates
that the function has more decision points and is more complex.

// U+25A0-U+25FF: Geometric shapes
result = result.replace(/[\u{25A0}-\u{25FF}]/gu, '')
// Specific decorative chars: © ® ™ § ¶ † ‡ • ‣ ⁃
result = result.replace(/[©®™§¶†‡•‣⁃]/g, '')

Use the 'u' flag with regular expressions


It is recommended to use the u flag with regular expressions.
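
A short illustration of what the `u` flag changes, using an astral-plane emoji. The flagged pattern itself contains only Basic Multilingual Plane characters, so it happens to behave the same either way; the sketch shows the general hazard the rule guards against. Variable names are illustrative.

```typescript
const face = '\u{1F600}' // 😀, stored as two UTF-16 code units

// Without 'u', '.' and character classes operate on code units.
const dotWithoutU = /^.$/.test(face) // false: '.' matches only one code unit
const dotWithU = /^.$/u.test(face) // true: '.' matches the whole code point

// Without 'u', the class below holds two surrogate halves, so a non-global
// replace removes only the first half and leaves a lone surrogate behind.
const leftover = face.replace(/[😀]/, '')
const clean = face.replace(/[😀]/u, '')
```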

- Switch streaming token placeholders from PUA codepoints to null-byte
  delimiters (\x00) to match the spec and avoid collision risk
- Rework math operator regex to use explicit (^|\s) grouping instead of
  lookbehind for better portability across JS runtimes
- Consolidate emoji/decorative Unicode ranges into fewer, well-documented
  regex calls (2 passes each instead of ~20 individual replace calls)

…regex calls

Deduplicate overlapping ranges (e.g. \u{1F600}-\u{1F64F} is a subset of
\u{1F300}-\u{1F9FF}, \u{2614}-\u{2615} is a subset of \u{2600}-\u{26FF})
and combine variation selectors, ZWJ, and keycap chars into the same regex.
Reduces from ~23 individual .replace() calls to 2 for all emoji/decorative
symbol stripping.
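
The deduplication described in this commit message can also be done mechanically. The sketch below is one possible approach, not the PR's code: `mergeRanges` and `toCharClass` are hypothetical helpers that merge overlapping or adjacent code point ranges and emit a character-class body.

```typescript
type CodePointRange = [number, number]

// Sort by start, then fold each range into the previous one when they
// overlap or touch. The input array is not mutated.
function mergeRanges(ranges: CodePointRange[]): CodePointRange[] {
  const sorted = [...ranges].sort((a, b) => a[0] - b[0])
  const merged: CodePointRange[] = []
  for (const [start, end] of sorted) {
    const last = merged[merged.length - 1]
    if (last && start <= last[1] + 1)
      last[1] = Math.max(last[1], end)
    else
      merged.push([start, end])
  }
  return merged
}

// Render ranges as the body of a character class for a 'u'-flagged RegExp.
function toCharClass(ranges: CodePointRange[]): string {
  const hex = (n: number): string => `\\u{${n.toString(16).toUpperCase()}}`
  return ranges.map(([s, e]) => (s === e ? hex(s) : `${hex(s)}-${hex(e)}`)).join('')
}

const merged = mergeRanges([
  [0x1F300, 0x1F9FF],
  [0x1F600, 0x1F64F], // subset of the range above
  [0x2600, 0x26FF],
  [0x2614, 0x2615], // subset of the range above
])
const classBody = toCharClass(merged)
const emojiPattern = new RegExp(`[${classBody}]`, 'gu')
```

With the two subsets folded away, four input ranges become two, and the generated pattern still removes characters from the dropped subsets.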

Comment on lines +85 to +155
export function stripUnreadableSymbols(
text: string,
options?: StripUnreadableSymbolsOptions,
): string {
const opts: Required<StripUnreadableSymbolsOptions> = { ...DEFAULT_OPTIONS, ...options }

// Protect streaming control tokens from stripping
const { processed: safeText, tokens } = extractStreamingTokens(text)

// Pass 1: Strip Markdown syntax (always run)
let result = stripMarkdownFromSpeech(safeText)

// Pass 2: Strip emoji and Unicode pictographic symbols
if (opts.stripEmoji) {
// Remove variation selectors, ZWJ, keycap combining chars first
result = result.replace(/\uFE0F/gu, '')
result = result.replace(/\u200D/gu, '')
result = result.replace(/\u20E3/gu, '')

// Emoji ranges: emoticons, faces, transport, misc symbols, dingbats,
// skin tones, regional indicators, extended-A/B, supplemental symbols
result = result.replace(
/[\u{1F300}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{1FA00}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F3FB}-\u{1F3FF}\u{1F1E0}-\u{1F1FF}\u{2702}-\u{27B0}\u{231A}-\u{231B}\u{23E9}-\u{23F3}\u{23F8}-\u{23FA}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2614}-\u{2615}\u{2648}-\u{2653}\u{267F}\u{2693}\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}]/gu,
'',
)
}

// Pass 3: Strip decorative Unicode (arrows, box-drawing, shapes, dingbats)
if (opts.stripDecorativeUnicode) {
// Arrows, box-drawing, block elements, geometric shapes, letterlike symbols
result = result.replace(
/[\u{2190}-\u{21FF}\u{2500}-\u{257F}\u{2580}-\u{259F}\u{25A0}-\u{25FF}\u{2100}-\u{214F}]/gu,
'',
)
// Specific decorative chars: © ® ™ § ¶ † ‡ • ‣ ⁃
result = result.replace(/[©®™§¶†‡•‣⁃]/g, '')
}

// Pass 4: Strip standalone special characters
if (opts.stripStandaloneSpecialChars) {
// Matches standalone special chars surrounded by whitespace or at string boundaries.
// Re-inserts the leading whitespace via $1; the trailing whitespace is only asserted by the lookahead, not consumed.
result = result.replace(/(^|\s)[*#@|\\/~^`]+(?=\s|$)/g, '$1')
}

// Pass 5: Strip standalone math/operator symbols
if (opts.stripMathOperators) {
// Matches standalone operator sequences at string boundaries or surrounded by whitespace.
// Uses explicit (^|\s) grouping for boundary matching — more portable than lookbehind.
// Does NOT match when adjacent to non-whitespace characters (e.g., C++, A&B).
result = result.replace(/(^|\s)[+=\-<>&^~|\\/%]+(?=\s|$)/g, '$1')
}

// Pass 6: Collapse repeated punctuation
if (opts.collapseRepeatedPunctuation) {
result = result.replace(/!{3,}/g, '!')
result = result.replace(/\?{3,}/g, '?')
result = result.replace(/\.{4,}/g, '…')
result = result.replace(/-{3,}/g, '—')
result = result.replace(/~{2,}/g, '~')
}

// Collapse multiple spaces into one
result = result.replace(/ {2,}/g, ' ')

// Restore streaming control tokens
result = restoreStreamingTokens(result, tokens)

// Trim leading/trailing whitespace from the overall result
return result.trim()
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.

// Emoji ranges: emoticons, faces, transport, misc symbols, dingbats,
// skin tones, regional indicators, extended-A/B, supplemental symbols
result = result.replace(
/[\u{1F300}-\u{1F9FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{1FA00}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F3FB}-\u{1F3FF}\u{1F1E0}-\u{1F1FF}\u{2702}-\u{27B0}\u{231A}-\u{231B}\u{23E9}-\u{23F3}\u{23F8}-\u{23FA}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2614}-\u{2615}\u{2648}-\u{2653}\u{267F}\u{2693}\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}]/gu,

Unexpected modified Emoji in character class


Unicode includes the characters which are made with multiple code points. RegExp character class syntax (/[abc]/) cannot handle characters which are made by multiple code points as a character; those characters will be dissolved to each code point. Probably the most important concept about Unicode in JavaScript is to treat strings as sequences of code units, as they really are. The confusion appears when the developer thinks that strings are composed of graphemes (or symbols), ignoring the code unit sequence concept.
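
To make the warning concrete: a skin-toned emoji is two code points, and a character class treats each one separately. For a stripping pass that may well be the intended behavior, but it helps to see what the class actually matches. The variable names are illustrative.

```typescript
// U+1F44D (thumbs up) followed by U+1F3FD (medium skin tone modifier).
const thumbs = '\u{1F44D}\u{1F3FD}'

// One visible glyph, two code points.
const codePoints = Array.from(thumbs).map(ch => ch.codePointAt(0)!.toString(16))

// A class that lists only the modifier range removes just the modifier.
const modifierRemoved = thumbs.replace(/[\u{1F3FB}-\u{1F3FF}]/gu, '')

// A class that spans both code points removes each of them in turn.
const allRemoved = thumbs.replace(/[\u{1F300}-\u{1F9FF}]/gu, '')
```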

Comment on lines +85 to +178
export function stripUnreadableSymbols(
text: string,
options?: StripUnreadableSymbolsOptions,
): string {
const opts: Required<StripUnreadableSymbolsOptions> = { ...DEFAULT_OPTIONS, ...options }

// Protect streaming control tokens from stripping
const { processed: safeText, tokens } = extractStreamingTokens(text)

// Pass 1: Strip Markdown syntax (always run)
let result = stripMarkdownFromSpeech(safeText)

// Pass 2: Strip emoji, pictographic symbols, and decorative Unicode.
// Ranges are deduplicated: \u{1F300}-\u{1F9FF} already covers
// \u{1F600}-\u{1F64F} (emoticons) and \u{1F680}-\u{1F6FF} (transport),
// so those subsets are omitted. \u{2600}-\u{26FF} covers \u{2614}-\u{2615},
// \u{2648}-\u{2653}, etc. Variation selectors and ZWJ are included in the
// same regex to minimize .replace() calls.
if (opts.stripEmoji || opts.stripDecorativeUnicode) {
// Build a combined character class from all needed ranges
const ranges: string[] = []

if (opts.stripEmoji) {
// Variation selectors, ZWJ, keycap combining chars
ranges.push('\uFE0F', '\u200D', '\u20E3')
// Emoji & pictographic symbols (deduplicated — no subsets of the above)
ranges.push(
'\\u{1F300}-\\u{1F9FF}', // Misc symbols, emoticons, transport, supplemental
'\\u{1F1E0}-\\u{1F1FF}', // Regional indicator symbols (flags)
'\\u{1F3FB}-\\u{1F3FF}', // Skin tone modifiers
'\\u{1FA00}-\\u{1FAFF}', // Extended-A and beyond
'\\u{2600}-\\u{26FF}', // Misc symbols (covers \u{2614}-\u{2615}, \u{2648}-\u{2653}, etc.)
'\\u{2700}-\\u{27BF}', // Dingbats
'\\u{231A}-\\u{231B}', // Watch, hourglass
'\\u{23E9}-\\u{23F3}', // Media controls, clocks
'\\u{23F8}-\\u{23FA}', // Media controls
'\\u{25AA}-\\u{25AB}', '\\u{25B6}', '\\u{25C0}', '\\u{25FB}-\\u{25FE}', // Geometric shapes
'\\u{2934}-\\u{2935}', // Arrows
'\\u{2B05}-\\u{2B07}', // Arrows
'\\u{2B1B}-\\u{2B1C}', '\\u{2B50}', '\\u{2B55}', // Geometric shapes
'\\u{3030}', '\\u{303D}', '\\u{3297}', '\\u{3299}', // CJK symbols
)
}

if (opts.stripDecorativeUnicode) {
// Arrows, box-drawing, block elements, geometric shapes, letterlike symbols
ranges.push(
'\\u{2190}-\\u{21FF}', // Arrows
'\\u{2500}-\\u{257F}', // Box drawing
'\\u{2580}-\\u{259F}', // Block elements
'\\u{25A0}-\\u{25FF}', // Geometric shapes
'\\u{2100}-\\u{214F}', // Letterlike symbols
'\u00A9', '\u00AE', '\\u{2122}', // © ® ™
'\\u{00A7}', '\\u{00B6}', '\\u{2020}', '\\u{2021}', // § ¶ † ‡
'\\u{2022}', '\\u{2023}', '\\u{2043}', // • ‣ ⁃
)
}

result = result.replace(new RegExp(`[${ranges.join('')}]`, 'gu'), '')
}

// Pass 3: Strip standalone special characters
if (opts.stripStandaloneSpecialChars) {
// Matches standalone special chars surrounded by whitespace or at string boundaries.
// Keeps the leading whitespace (via $1); the trailing whitespace is only asserted by the
// lookahead, so any resulting double space is collapsed after Pass 5.
result = result.replace(/(^|\s)[*#@|\\/~^`]+(?=\s|$)/g, '$1')
}

// Pass 4: Strip standalone math/operator symbols
if (opts.stripMathOperators) {
// Matches standalone operator sequences at string boundaries or surrounded by whitespace.
// Uses explicit (^|\s) grouping for boundary matching — more portable than lookbehind.
// Does NOT match when adjacent to non-whitespace characters (e.g., C++, A&B).
result = result.replace(/(^|\s)[+=\-<>&^~|\\/%]+(?=\s|$)/g, '$1')
}

// Pass 5: Collapse repeated punctuation
if (opts.collapseRepeatedPunctuation) {
result = result.replace(/!{3,}/g, '!')
result = result.replace(/\?{3,}/g, '?')
result = result.replace(/\.{4,}/g, '…')
result = result.replace(/-{3,}/g, '—')
result = result.replace(/~{2,}/g, '~')
}

// Collapse multiple spaces into one
result = result.replace(/ {2,}/g, ' ')

// Restore streaming control tokens
result = restoreStreamingTokens(result, tokens)

// Trim leading/trailing whitespace from the overall result
return result.trim()
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.
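
The helpers `extractStreamingTokens` and `restoreStreamingTokens` are called in the snippet above but are not shown in this thread. A minimal sketch of Private Use Area placeholder extraction could look like the following, assuming tokens have the literal form `<|NAME|>` and that the input contains no Private Use Area characters of its own. The names `extractTokensSketch`, `restoreTokensSketch`, and `PUA_BASE` are hypothetical.

```typescript
interface Extracted {
  processed: string
  tokens: string[]
}

// First code point of the BMP Private Use Area (U+E000 to U+F8FF).
const PUA_BASE = 0xe000

// Replace each control token with a single placeholder character whose
// offset from PUA_BASE is the token's index in the `tokens` array.
function extractTokensSketch(text: string): Extracted {
  const tokens: string[] = []
  const processed = text.replace(/<\|[A-Z]+\|>/g, (match) => {
    tokens.push(match)
    return String.fromCharCode(PUA_BASE + tokens.length - 1)
  })
  return { processed, tokens }
}

// Map every placeholder back to the token it stands for.
function restoreTokensSketch(text: string, tokens: string[]): string {
  return text.replace(/[\uE000-\uF8FF]/g, (ch) => tokens[ch.charCodeAt(0) - PUA_BASE] ?? '')
}

const extracted = extractTokensSketch('Hi <|ACT|> there <|DELAY|>')
console.log(extracted.tokens) // the two tokens, in order of appearance
console.log(restoreTokensSketch(extracted.processed, extracted.tokens))
// "Hi <|ACT|> there <|DELAY|>"
```

Because the placeholders sit outside every range stripped in Pass 2 to Pass 5, the sanitizing passes leave them alone, and the tokens are put back unchanged at the end.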

* stripUnreadableSymbols('Price is $5!!! Really???', { collapseRepeatedPunctuation: false })
* // -> 'Price is $5!!! Really???'
*/
export function stripUnreadableSymbols(

`stripUnreadableSymbols` has a cyclomatic complexity of 8 with "medium" risk


A function with high cyclomatic complexity can be hard to understand and
maintain. Cyclomatic complexity is a software metric that measures the number of
independent paths through a function. A higher cyclomatic complexity indicates
that the function has more decision points and is more complex.
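
One common way to lower the number of decision points, offered here only as a sketch and not as the PR's approach, is to describe each optional pass as data and loop over the list. `Pass`, `runPasses`, and the sample passes are hypothetical.

```typescript
interface Pass {
  enabled: boolean
  apply: (text: string) => string
}

// Apply every enabled pass in order; the single `if` replaces one branch per option.
function runPasses(text: string, passes: Pass[]): string {
  let result = text
  for (const pass of passes) {
    if (pass.enabled) result = pass.apply(result)
  }
  return result
}

const collapsed = runPasses('Wow!!!   Really????', [
  { enabled: true, apply: (s) => s.replace(/!{3,}/g, '!') },
  { enabled: false, apply: (s) => s.replace(/\?{3,}/g, '?') },
  { enabled: true, apply: (s) => s.replace(/ {2,}/g, ' ') },
])
console.log(collapsed) // "Wow! Really????"
```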

…tch em/en dashes, arrows, and other symbols

Em dash (U+2014), en dash (U+2013), and other punctuation in the
General Punctuation block were leaking through to TTS. Adding
\u{2000}-\u{206F} to the decorative Unicode ranges ensures these
symbols are stripped along with arrows and box-drawing characters.
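
The General Punctuation block also contains the typographic apostrophe (U+2019) and the horizontal ellipsis (U+2026), so the new range has effects beyond dashes. The sketch below isolates that range to show them; `stripGeneralPunctuation` is a hypothetical helper, not a function from the PR.

```typescript
function stripGeneralPunctuation(text: string): string {
  return text.replace(new RegExp('[\\u{2000}-\\u{206F}]', 'gu'), '')
}

// U+2019 (apostrophe), U+2014 (em dash), and U+2026 (ellipsis) are all in the range.
const sample = 'It\u2019s ready \u2014 almost\u2026'
console.log(JSON.stringify(stripGeneralPunctuation(sample))) // "Its ready  almost"

// Pass 5 runs after Pass 2 and emits U+2026 and U+2014 itself, so those two
// characters can still reach the TTS engine when punctuation is collapsed.
const afterPass5 = 'Wait....'.replace(/\.{4,}/g, '\u2026')
console.log(afterPass5 === stripGeneralPunctuation(afterPass5)) // false
```

Contractions written with a typographic apostrophe lose it ("It's" becomes "Its"), which may or may not matter for a given TTS engine.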
Comment on lines +85 to +180
export function stripUnreadableSymbols(
text: string,
options?: StripUnreadableSymbolsOptions,
): string {
const opts: Required<StripUnreadableSymbolsOptions> = { ...DEFAULT_OPTIONS, ...options }

// Protect streaming control tokens from stripping
const { processed: safeText, tokens } = extractStreamingTokens(text)

// Pass 1: Strip Markdown syntax (always run)
let result = stripMarkdownFromSpeech(safeText)

// Pass 2: Strip emoji, pictographic symbols, and decorative Unicode.
// Ranges are deduplicated: \u{1F300}-\u{1F9FF} already covers
// \u{1F600}-\u{1F64F} (emoticons) and \u{1F680}-\u{1F6FF} (transport),
// so those subsets are omitted. \u{2600}-\u{26FF} covers \u{2614}-\u{2615},
// \u{2648}-\u{2653}, etc. Variation selectors and ZWJ are included in the
// same regex to minimize .replace() calls.
if (opts.stripEmoji || opts.stripDecorativeUnicode) {
// Build a combined character class from all needed ranges
const ranges: string[] = []

if (opts.stripEmoji) {
// Variation selectors, ZWJ, keycap combining chars
ranges.push('\uFE0F', '\u200D', '\u20E3')
// Emoji & pictographic symbols (deduplicated — no subsets of the above)
ranges.push(
'\\u{1F300}-\\u{1F9FF}', // Misc symbols, emoticons, transport, supplemental
'\\u{1F1E0}-\\u{1F1FF}', // Regional indicator symbols (flags)
'\\u{1F3FB}-\\u{1F3FF}', // Skin tone modifiers
'\\u{1FA00}-\\u{1FAFF}', // Extended-A and beyond
'\\u{2600}-\\u{26FF}', // Misc symbols (covers \u{2614}-\u{2615}, \u{2648}-\u{2653}, etc.)
'\\u{2700}-\\u{27BF}', // Dingbats
'\\u{231A}-\\u{231B}', // Watch, hourglass
'\\u{23E9}-\\u{23F3}', // Media controls, clocks
'\\u{23F8}-\\u{23FA}', // Media controls
'\\u{25AA}-\\u{25AB}', '\\u{25B6}', '\\u{25C0}', '\\u{25FB}-\\u{25FE}', // Geometric shapes
'\\u{2934}-\\u{2935}', // Arrows
'\\u{2B05}-\\u{2B07}', // Arrows
'\\u{2B1B}-\\u{2B1C}', '\\u{2B50}', '\\u{2B55}', // Geometric shapes
'\\u{3030}', '\\u{303D}', '\\u{3297}', '\\u{3299}', // CJK symbols
)
}

if (opts.stripDecorativeUnicode) {
// Arrows, box-drawing, block elements, geometric shapes, letterlike symbols,
// general punctuation (em/en dashes, typographic quotes, ellipsis, etc.)
ranges.push(
'\\u{2190}-\\u{21FF}', // Arrows
'\\u{2500}-\\u{257F}', // Box drawing
'\\u{2580}-\\u{259F}', // Block elements
'\\u{25A0}-\\u{25FF}', // Geometric shapes
'\\u{2100}-\\u{214F}', // Letterlike symbols
'\\u{2000}-\\u{206F}', // General punctuation (em/en dashes, quotes, ellipsis, etc.)
'\u00A9', '\u00AE', '\\u{2122}', // © ® ™
'\\u{00A7}', '\\u{00B6}', '\\u{2020}', '\\u{2021}', // § ¶ † ‡
'\\u{2022}', '\\u{2023}', '\\u{2043}', // • ‣ ⁃
)
}

result = result.replace(new RegExp(`[${ranges.join('')}]`, 'gu'), '')
}

// Pass 3: Strip standalone special characters
if (opts.stripStandaloneSpecialChars) {
// Matches standalone special chars surrounded by whitespace or at string boundaries.
// Keeps the leading whitespace (via $1); the trailing whitespace is only asserted by the
// lookahead, so any resulting double space is collapsed after Pass 5.
result = result.replace(/(^|\s)[*#@|\\/~^`]+(?=\s|$)/g, '$1')
}

// Pass 4: Strip standalone math/operator symbols
if (opts.stripMathOperators) {
// Matches standalone operator sequences at string boundaries or surrounded by whitespace.
// Uses explicit (^|\s) grouping for boundary matching — more portable than lookbehind.
// Does NOT match when adjacent to non-whitespace characters (e.g., C++, A&B).
result = result.replace(/(^|\s)[+=\-<>&^~|\\/%]+(?=\s|$)/g, '$1')
}

// Pass 5: Collapse repeated punctuation
if (opts.collapseRepeatedPunctuation) {
result = result.replace(/!{3,}/g, '!')
result = result.replace(/\?{3,}/g, '?')
result = result.replace(/\.{4,}/g, '…')
result = result.replace(/-{3,}/g, '—')
result = result.replace(/~{2,}/g, '~')
}

// Collapse multiple spaces into one
result = result.replace(/ {2,}/g, ' ')

// Restore streaming control tokens
result = restoreStreamingTokens(result, tokens)

// Trim leading/trailing whitespace from the overall result
return result.trim()
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable
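
The boundary behaviour of the Pass 4 pattern can be checked in isolation. The sketch below copies the pattern from the snippet above into a hypothetical `stripStandaloneOperators` helper and applies the same space-collapsing and trimming steps that follow it in the function.

```typescript
function stripStandaloneOperators(text: string): string {
  return text
    // $1 keeps the leading whitespace; the lookahead leaves the trailing one in place.
    .replace(/(^|\s)[+=\-<>&^~|\\/%]+(?=\s|$)/g, '$1')
    // The leftover double space is removed here, not by the pattern itself.
    .replace(/ {2,}/g, ' ')
    .trim()
}

console.log(stripStandaloneOperators('C++ & Java')) // "C++ Java"
console.log(stripStandaloneOperators('A&B'))        // "A&B"
console.log(stripStandaloneOperators('5 - 3'))      // "5 3"
console.log(stripStandaloneOperators('=> next'))    // "next"
```

Note that a spaced minus sign is dropped as well, so "5 - 3" would be spoken as two bare numbers.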

- Add stripMarkdown option to TtsInputChunkOptions and SpeechPipelineOptions
- Add stripMarkdownFromText function to strip Markdown formatting
- Fix starsUnclosed to detect unclosed ** (bold) patterns by counting **
- Fix early return in segmenter to also check stripMarkdown option
- Pass stripMarkdown to segmenter in speech-pipeline.ts
- Add tests for stripMarkdown option in speech-pipeline.test.ts

Fixes TTS reading **bold** markers as STARSTARboldSTARSTAR when split across chunks.
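
The counting check mentioned in the list above can be sketched as follows. The real `starsUnclosed` implementation is not shown in this thread, so `hasUnclosedBold` is a hypothetical stand-in that assumes a chunk is unclosed when it contains an odd number of `**` markers.

```typescript
function hasUnclosedBold(chunk: string): boolean {
  // Count non-overlapping "**" markers; an odd count means one marker has no partner.
  const markers = chunk.match(/\*\*/g)
  const count = markers ? markers.length : 0
  return count % 2 === 1
}

console.log(hasUnclosedBold('This is **very'))         // true
console.log(hasUnclosedBold('This is **very** clear')) // false
console.log(hasUnclosedBold('no markers here'))        // false
```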
Comment on lines +24 to +66
function stripMarkdownFromText(text: string): string {
let result = text

// Code fences (```...```) — must run before inline code
result = result.replace(/^```.*\n([\s\S]*?)^```$/gm, '$1')

// Inline code (`code`) — preserve inner text
result = result.replace(/`([^`]+)`/g, '$1')

// Bold (**text**) — preserve inner text
result = result.replace(/\*\*([^*]+?)\*\*/g, '$1')

// Strikethrough (~~text~~) — preserve inner text
result = result.replace(/~~([^~]+?)~~/g, '$1')

// Headings (# Heading) — remove # markers at line start, preserve text
result = result.replace(/^#{1,6}\s+/gm, '')

// Bullet lists (- item or * item) — remove marker at line start, preserve text
result = result.replace(/^[-*]\s+/gm, '')

// Numbered lists (1. item) — remove number+dot at line start, preserve text
result = result.replace(/^\d+\.\s+/gm, '')

// Blockquotes (> quote) — remove > marker at line start, preserve text
result = result.replace(/^>\s+/gm, '')

// Italic (*text*) — preserve inner text
result = result.replace(/\*([^*]+?)\*/g, '$1')

// Italic (_text_) — preserve inner text
result = result.replace(/_([^_]+?)_/g, '$1')

// Links [text](url) — preserve link text only
result = result.replace(/\[([^\]]+?)\]\([^)]+?\)/g, '$1')

// Horizontal rules (---, ***, ___) — remove entirely
result = result.replace(/^---+$/gm, '')
result = result.replace(/^\*\*\*+$/gm, '')
result = result.replace(/^___+$/gm, '')

return result
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable
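
One caveat of the `_text_` rule in `stripMarkdownFromText` is that it also matches underscores inside identifiers. The sketch below contrasts the pattern from the snippet with a boundary-aware variant; `stripItalicUnderscoreStrict` is a hypothetical alternative, not code from the PR.

```typescript
// Pattern as written in the snippet: any pair of underscores is treated as italics.
function stripItalicUnderscore(text: string): string {
  return text.replace(/_([^_]+?)_/g, '$1')
}

// Variant that only accepts underscores at word boundaries, so identifiers
// such as my_var_name are left alone.
function stripItalicUnderscoreStrict(text: string): string {
  return text.replace(/(^|\W)_([^_]+?)_(?=\W|$)/g, '$1$2')
}

console.log(stripItalicUnderscore('_note_ my_var_name'))       // "note myvarname"
console.log(stripItalicUnderscoreStrict('_note_ my_var_name')) // "note my_var_name"
```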

Comment on lines +320 to +323
async tts(request) {
ttsRequests.push(request)
return request.text
},

Found `async` function without any `await` expressions


A function that does not contain any await expressions should not be async (except for some edge cases in TypeScript which are discussed below). Asynchronous functions in JavaScript behave differently than other functions in two important ways:

Remove any remaining ** markers that survived the markdown pass.
This handles split markers like **bold + text** where neither chunk
has a complete **...** pattern.
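
To see why a per-chunk pattern cannot handle split markers, consider the sketch below. The chunk boundaries are an assumed example; `boldPair` is the bold rule shown in `stripMarkdownFromText`, and the second replacement is the fallback this commit adds.

```typescript
const boldPair = /\*\*([^*]+?)\*\*/g
const chunks = ['This is **very', ' important** news']

// Each chunk holds only one marker, so the paired pattern never matches.
const pairedOnly = chunks.map((c) => c.replace(boldPair, '$1'))
console.log(pairedOnly.join('')) // "This is **very important** news"

// Removing every remaining "**" works regardless of where the split fell.
const withFallback = pairedOnly.map((c) => c.replace(/\*\*/g, ''))
console.log(withFallback.join('')) // "This is very important news"
```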
Comment on lines +85 to +184
export function stripUnreadableSymbols(
text: string,
options?: StripUnreadableSymbolsOptions,
): string {
const opts: Required<StripUnreadableSymbolsOptions> = { ...DEFAULT_OPTIONS, ...options }

// Protect streaming control tokens from stripping
const { processed: safeText, tokens } = extractStreamingTokens(text)

// Pass 1: Strip Markdown syntax (always run)
let result = stripMarkdownFromSpeech(safeText)

// Aggressive star stripping for TTS: remove any ** that survived the markdown pass.
// This handles split markers like **bold + text** where neither chunk has complete **...**.
result = result.replace(/\*\*/g, '')

// Pass 2: Strip emoji, pictographic symbols, and decorative Unicode.
// Ranges are deduplicated: \u{1F300}-\u{1F9FF} already covers
// \u{1F600}-\u{1F64F} (emoticons) and \u{1F680}-\u{1F6FF} (transport),
// so those subsets are omitted. \u{2600}-\u{26FF} covers \u{2614}-\u{2615},
// \u{2648}-\u{2653}, etc. Variation selectors and ZWJ are included in the
// same regex to minimize .replace() calls.
if (opts.stripEmoji || opts.stripDecorativeUnicode) {
// Build a combined character class from all needed ranges
const ranges: string[] = []

if (opts.stripEmoji) {
// Variation selectors, ZWJ, keycap combining chars
ranges.push('\uFE0F', '\u200D', '\u20E3')
// Emoji & pictographic symbols (deduplicated — no subsets of the above)
ranges.push(
'\\u{1F300}-\\u{1F9FF}', // Misc symbols, emoticons, transport, supplemental
'\\u{1F1E0}-\\u{1F1FF}', // Regional indicator symbols (flags)
'\\u{1F3FB}-\\u{1F3FF}', // Skin tone modifiers
'\\u{1FA00}-\\u{1FAFF}', // Extended-A and beyond
'\\u{2600}-\\u{26FF}', // Misc symbols (covers \u{2614}-\u{2615}, \u{2648}-\u{2653}, etc.)
'\\u{2700}-\\u{27BF}', // Dingbats
'\\u{231A}-\\u{231B}', // Watch, hourglass
'\\u{23E9}-\\u{23F3}', // Media controls, clocks
'\\u{23F8}-\\u{23FA}', // Media controls
'\\u{25AA}-\\u{25AB}', '\\u{25B6}', '\\u{25C0}', '\\u{25FB}-\\u{25FE}', // Geometric shapes
'\\u{2934}-\\u{2935}', // Arrows
'\\u{2B05}-\\u{2B07}', // Arrows
'\\u{2B1B}-\\u{2B1C}', '\\u{2B50}', '\\u{2B55}', // Geometric shapes
'\\u{3030}', '\\u{303D}', '\\u{3297}', '\\u{3299}', // CJK symbols
)
}

if (opts.stripDecorativeUnicode) {
// Arrows, box-drawing, block elements, geometric shapes, letterlike symbols,
// general punctuation (em/en dashes, typographic quotes, ellipsis, etc.)
ranges.push(
'\\u{2190}-\\u{21FF}', // Arrows
'\\u{2500}-\\u{257F}', // Box drawing
'\\u{2580}-\\u{259F}', // Block elements
'\\u{25A0}-\\u{25FF}', // Geometric shapes
'\\u{2100}-\\u{214F}', // Letterlike symbols
'\\u{2000}-\\u{206F}', // General punctuation (em/en dashes, quotes, ellipsis, etc.)
'\u00A9', '\u00AE', '\\u{2122}', // © ® ™
'\\u{00A7}', '\\u{00B6}', '\\u{2020}', '\\u{2021}', // § ¶ † ‡
'\\u{2022}', '\\u{2023}', '\\u{2043}', // • ‣ ⁃
)
}

result = result.replace(new RegExp(`[${ranges.join('')}]`, 'gu'), '')
}

// Pass 3: Strip standalone special characters
if (opts.stripStandaloneSpecialChars) {
// Matches standalone special chars surrounded by whitespace or at string boundaries.
// Keeps the leading whitespace (via $1); the trailing whitespace is only asserted by the
// lookahead, so any resulting double space is collapsed after Pass 5.
result = result.replace(/(^|\s)[*#@|\\/~^`]+(?=\s|$)/g, '$1')
}

// Pass 4: Strip standalone math/operator symbols
if (opts.stripMathOperators) {
// Matches standalone operator sequences at string boundaries or surrounded by whitespace.
// Uses explicit (^|\s) grouping for boundary matching — more portable than lookbehind.
// Does NOT match when adjacent to non-whitespace characters (e.g., C++, A&B).
result = result.replace(/(^|\s)[+=\-<>&^~|\\/%]+(?=\s|$)/g, '$1')
}

// Pass 5: Collapse repeated punctuation
if (opts.collapseRepeatedPunctuation) {
result = result.replace(/!{3,}/g, '!')
result = result.replace(/\?{3,}/g, '?')
result = result.replace(/\.{4,}/g, '…')
result = result.replace(/-{3,}/g, '—')
result = result.replace(/~{2,}/g, '~')
}

// Collapse multiple spaces into one
result = result.replace(/ {2,}/g, ' ')

// Restore streaming control tokens
result = restoreStreamingTokens(result, tokens)

// Trim leading/trailing whitespace from the overall result
return result.trim()
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable

Comment on lines +85 to +187
export function stripUnreadableSymbols(
text: string,
options?: StripUnreadableSymbolsOptions,
): string {
console.debug('[TTS DEBUG] stripUnreadableSymbols INPUT:', JSON.stringify(text))
const opts: Required<StripUnreadableSymbolsOptions> = { ...DEFAULT_OPTIONS, ...options }

// Protect streaming control tokens from stripping
const { processed: safeText, tokens } = extractStreamingTokens(text)

// Pass 1: Strip Markdown syntax (always run)
let result = stripMarkdownFromSpeech(safeText)
console.debug('[TTS DEBUG] after stripMarkdownFromSpeech:', JSON.stringify(result))

// Aggressive star stripping for TTS: remove any ** that survived the markdown pass.
// This handles split markers like **bold + text** where neither chunk has complete **...**.
result = result.replace(/\*\*/g, '')
console.debug('[TTS DEBUG] after aggressive star strip:', JSON.stringify(result))

// Pass 2: Strip emoji, pictographic symbols, and decorative Unicode.
// Ranges are deduplicated: \u{1F300}-\u{1F9FF} already covers
// \u{1F600}-\u{1F64F} (emoticons) and \u{1F680}-\u{1F6FF} (transport),
// so those subsets are omitted. \u{2600}-\u{26FF} covers \u{2614}-\u{2615},
// \u{2648}-\u{2653}, etc. Variation selectors and ZWJ are included in the
// same regex to minimize .replace() calls.
if (opts.stripEmoji || opts.stripDecorativeUnicode) {
// Build a combined character class from all needed ranges
const ranges: string[] = []

if (opts.stripEmoji) {
// Variation selectors, ZWJ, keycap combining chars
ranges.push('\uFE0F', '\u200D', '\u20E3')
// Emoji & pictographic symbols (deduplicated — no subsets of the above)
ranges.push(
'\\u{1F300}-\\u{1F9FF}', // Misc symbols, emoticons, transport, supplemental
'\\u{1F1E0}-\\u{1F1FF}', // Regional indicator symbols (flags)
'\\u{1F3FB}-\\u{1F3FF}', // Skin tone modifiers
'\\u{1FA00}-\\u{1FAFF}', // Extended-A and beyond
'\\u{2600}-\\u{26FF}', // Misc symbols (covers \u{2614}-\u{2615}, \u{2648}-\u{2653}, etc.)
'\\u{2700}-\\u{27BF}', // Dingbats
'\\u{231A}-\\u{231B}', // Watch, hourglass
'\\u{23E9}-\\u{23F3}', // Media controls, clocks
'\\u{23F8}-\\u{23FA}', // Media controls
'\\u{25AA}-\\u{25AB}', '\\u{25B6}', '\\u{25C0}', '\\u{25FB}-\\u{25FE}', // Geometric shapes
'\\u{2934}-\\u{2935}', // Arrows
'\\u{2B05}-\\u{2B07}', // Arrows
'\\u{2B1B}-\\u{2B1C}', '\\u{2B50}', '\\u{2B55}', // Geometric shapes
'\\u{3030}', '\\u{303D}', '\\u{3297}', '\\u{3299}', // CJK symbols
)
}

if (opts.stripDecorativeUnicode) {
// Arrows, box-drawing, block elements, geometric shapes, letterlike symbols,
// general punctuation (em/en dashes, typographic quotes, ellipsis, etc.)
ranges.push(
'\\u{2190}-\\u{21FF}', // Arrows
'\\u{2500}-\\u{257F}', // Box drawing
'\\u{2580}-\\u{259F}', // Block elements
'\\u{25A0}-\\u{25FF}', // Geometric shapes
'\\u{2100}-\\u{214F}', // Letterlike symbols
'\\u{2000}-\\u{206F}', // General punctuation (em/en dashes, quotes, ellipsis, etc.)
'\u00A9', '\u00AE', '\\u{2122}', // © ® ™
'\\u{00A7}', '\\u{00B6}', '\\u{2020}', '\\u{2021}', // § ¶ † ‡
'\\u{2022}', '\\u{2023}', '\\u{2043}', // • ‣ ⁃
)
}

result = result.replace(new RegExp(`[${ranges.join('')}]`, 'gu'), '')
}

// Pass 3: Strip standalone special characters
if (opts.stripStandaloneSpecialChars) {
// Matches standalone special chars surrounded by whitespace or at string boundaries.
// Keeps the leading whitespace (via $1); the trailing whitespace is only asserted by the
// lookahead, so any resulting double space is collapsed after Pass 5.
result = result.replace(/(^|\s)[*#@|\\/~^`]+(?=\s|$)/g, '$1')
}

// Pass 4: Strip standalone math/operator symbols
if (opts.stripMathOperators) {
// Matches standalone operator sequences at string boundaries or surrounded by whitespace.
// Uses explicit (^|\s) grouping for boundary matching — more portable than lookbehind.
// Does NOT match when adjacent to non-whitespace characters (e.g., C++, A&B).
result = result.replace(/(^|\s)[+=\-<>&^~|\\/%]+(?=\s|$)/g, '$1')
}

// Pass 5: Collapse repeated punctuation
if (opts.collapseRepeatedPunctuation) {
result = result.replace(/!{3,}/g, '!')
result = result.replace(/\?{3,}/g, '?')
result = result.replace(/\.{4,}/g, '…')
result = result.replace(/-{3,}/g, '—')
result = result.replace(/~{2,}/g, '~')
}

// Collapse multiple spaces into one
result = result.replace(/ {2,}/g, ' ')

// Restore streaming control tokens
result = restoreStreamingTokens(result, tokens)

// Trim leading/trailing whitespace from the overall result
return result.trim()
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable

- Log in chat-orchestrator-runtime.ts (onLiteral, filterToSpeech, stripUnreadableSymbols)
- Log in unreadable-symbols-stripper.ts (input, after markdown, after aggressive strip)
- Log in pipeline-runtime.ts (applyToken)
- Log in streaming-pipeline.ts (appendText)
- Log in tts-session.ts (appendText segmenter path)
@vi70x3 vi70x3 closed this Jun 6, 2026
onLiteral: async (literal) => {
if (shouldAbort()) return

console.log('[TTS DEBUG] onLiteral received:', JSON.stringify(literal))

Avoid using console in code that runs on the browser


It is considered a best practice to avoid the use of any console methods in JavaScript code that will run on the browser.

NOTE: If your repository contains a server side project, you can add "nodejs" to the environment property of analyzer meta in .deepsource.toml.
This will prevent this issue from getting raised.
Documentation for the analyzer meta can be found here.
Alternatively, you can silence this issue for your repository as shown here.

If a specific console call is meant to stay for other reasons, you can add a skipcq comment to that line.
This will inform other developers about the reason behind the log's presence, and prevent DeepSource from flagging it.
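
If the trace output is worth keeping beyond a debugging session, one option is to route it through a logger that is disabled by default. This is a sketch only: `createDebugLogger` and its `sink` parameter are hypothetical and do not exist in the PR.

```typescript
type DebugLogger = (label: string, value: unknown) => void

// Build a logger that formats entries like the PR's trace lines but only
// emits them when `enabled` is true. The sink decides where lines go.
function createDebugLogger(enabled: boolean, sink: (line: string) => void): DebugLogger {
  return (label, value) => {
    if (!enabled) return
    sink(`[TTS DEBUG] ${label}: ${JSON.stringify(value)}`)
  }
}

const lines: string[] = []
const debug = createDebugLogger(true, (line) => lines.push(line))
debug('onLiteral received', '**bold')

const silent = createDebugLogger(false, (line) => lines.push(line))
silent('ignored', 'x')

// lines now holds one entry: [TTS DEBUG] onLiteral received: "**bold"
```

Passing the sink in keeps `console` out of the library code, which is what the analyzer objects to, while still allowing a caller to supply `console.debug` during development.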


const speechOnly = stripMarkdownFromSpeech(categorizer.filterToSpeech(literal, streamPosition))
const filtered = categorizer.filterToSpeech(literal, streamPosition)
console.log('[TTS DEBUG] after filterToSpeech:', JSON.stringify(filtered))

Avoid using console in code that runs on the browser

const filtered = categorizer.filterToSpeech(literal, streamPosition)
console.log('[TTS DEBUG] after filterToSpeech:', JSON.stringify(filtered))
const speechOnly = stripUnreadableSymbols(filtered)
console.log('[TTS DEBUG] after stripUnreadableSymbols:', JSON.stringify(speechOnly))

Avoid using console in code that runs on the browser

text: string,
options?: StripUnreadableSymbolsOptions,
): string {
console.log('[TTS DEBUG] stripUnreadableSymbols INPUT:', JSON.stringify(text))

Avoid using console in code that runs on the browser


// Pass 1: Strip Markdown syntax (always run)
let result = stripMarkdownFromSpeech(safeText)
console.log('[TTS DEBUG] after stripMarkdownFromSpeech:', JSON.stringify(result))

Avoid using console in code that runs on the browser

// Aggressive star stripping for TTS: remove any ** that survived the markdown pass.
// This handles split markers like **bold + text** where neither chunk has complete **...**.
result = result.replace(/\*\*/g, '')
console.log('[TTS DEBUG] after aggressive star strip:', JSON.stringify(result))

Avoid using console in code that runs on the browser

return {
appendText(text: string) {
if (text.length === 0) return
console.log('[TTS STREAMING] appendText:', JSON.stringify(text))

Avoid using console in code that runs on the browser

intentId: intent.intentId,
appendText: intent.writeLiteral,
appendText: (text) => {
console.log('[TTS SESSION] appendText (segmenter path):', JSON.stringify(text))

Avoid using console in code that runs on the browser


const applyToken = (payload: SpeechIntentTokenPayload, writer: (intent: IntentHandle, value?: string) => void) => {
if (!payload || payload.originId === originId) return
console.log('[TTS PIPELINE] applyToken:', JSON.stringify(payload.value))

Avoid using console in code that runs on the browser


2 participants