
feat(core-agent): add stripUnreadableSymbols for TTS text sanitization#13

Closed
vi70x3 wants to merge 1 commit into main from feat/unreadable-symbols-stripper
Conversation

@vi70x3 vi70x3 commented Jun 6, 2026

Collaborator

Summary

Extends the plaintext-response-format spec with a new stripUnreadableSymbols function. It strips emoji, decorative Unicode, standalone special characters, and standalone math operators from TTS input text and collapses repeated punctuation, complementing the existing stripMarkdownFromSpeech function.

What changed

  • New file: packages/core-agent/src/runtime/unreadable-symbols-stripper.ts

    • stripUnreadableSymbols(text, options?) — 6-pass sanitizer (Markdown + emoji + decorative Unicode + standalone special chars + math operators + repeated punctuation collapsing)
    • StripUnreadableSymbolsOptions interface with 5 configurable boolean flags (all default true)
    • Streaming control tokens (<|ACT|>, <|DELAY|>, <|CALL|>) preserved via Private Use Area placeholder extraction
    • stripMarkdownFromSpeech remains exported unchanged (backward compatibility)
  • New file: packages/core-agent/src/runtime/unreadable-symbols-stripper.test.ts

    • 63 tests covering all stripping categories, token preservation, options behavior, edge cases, and backward compatibility
  • Modified: packages/core-agent/src/runtime/chat-orchestrator-runtime.ts

    • Both stripMarkdownFromSpeech calls (streaming path + final categorization path) replaced with stripUnreadableSymbols
  • Modified: packages/core-agent/src/index.ts

    • Added exports for stripUnreadableSymbols and StripUnreadableSymbolsOptions
  • New spec: .roo/specs/unreadable-symbols-stripper/ (requirements.md, design.md, tasks.md)

How tested

  • pnpm -F @proj-airi/core-agent typecheck — passed
  • pnpm -F @proj-airi/core-agent exec vitest run — 164 tests passed (14 test files)

Extends stripMarkdownFromSpeech with additional passes to strip emoji,
decorative Unicode, standalone special chars, math operators, and
repeated punctuation from TTS input text. All passes are configurable
via options with sensible defaults (strip everything by default).
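
As a rough illustration of the "configurable with defaults" behaviour described above, the option merge can be sketched as follows. The flag names are taken from the `opts.*` reads in the diff; the interface body, the `DEFAULT_OPTIONS` literal, and the `resolveOptions` helper are assumptions made for this sketch, not the PR's actual declarations.

```typescript
// Assumed shape: five optional boolean flags, named after the opts.* reads in the diff.
interface StripUnreadableSymbolsOptions {
  stripEmoji?: boolean
  stripDecorativeUnicode?: boolean
  stripStandaloneSpecialChars?: boolean
  stripMathOperators?: boolean
  collapseRepeatedPunctuation?: boolean
}

// Assumed defaults: every pass enabled, as the PR description states.
const DEFAULT_OPTIONS: Required<StripUnreadableSymbolsOptions> = {
  stripEmoji: true,
  stripDecorativeUnicode: true,
  stripStandaloneSpecialChars: true,
  stripMathOperators: true,
  collapseRepeatedPunctuation: true,
}

// Hypothetical helper mirroring the spread used at the top of stripUnreadableSymbols.
function resolveOptions(
  options?: StripUnreadableSymbolsOptions,
): Required<StripUnreadableSymbolsOptions> {
  return { ...DEFAULT_OPTIONS, ...options }
}

// Only the punctuation pass is switched off; the other four flags keep their defaults.
const keepPunctuation = resolveOptions({ collapseRepeatedPunctuation: false })
```

One caveat of a plain spread merge: a flag passed explicitly as `undefined` overwrites the default with `undefined`, so callers would need to omit the key rather than set it to `undefined`.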

Streaming control tokens (<|ACT|>, <|DELAY|>, <|CALL|>) are preserved
via Private Use Area placeholder extraction.

Closes: extends plaintext-response-format spec
deepsource-io Bot commented Jun 6, 2026

DeepSource Code Review

We reviewed changes in add3f5d...e5e7dfb on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

Code Review Summary

Analyzer Status Updated (UTC) Details
JavaScript Jun 6, 2026 6:54p.m. Review ↗
Shell Jun 6, 2026 6:54p.m. Review ↗
C & C++ Jun 6, 2026 6:54p.m. Review ↗

@@ -21,6 +21,7 @@ import { createChatHooks } from './agent-hooks'
import { useLlmmarkerParser } from './llm-marker-parser'
import { categorizeResponse, createStreamingCategorizer } from './response-categoriser'
import { stripMarkdownFromSpeech } from './markdown-stripper'

'stripMarkdownFromSpeech' is defined but never used


Unused variables are generally considered a code smell and should be avoided.

Comment on lines +38 to +46
function extractStreamingTokens(text: string): { processed: string, tokens: string[] } {
  const tokens: string[] = []
  const processed = text.replace(/<\|[^|]+\|>/g, (match) => {
    const index = tokens.length
    tokens.push(match)
    return `${TOKEN_PLACEHOLDER_BASE}${index.toString(36)}${TOKEN_PLACEHOLDER_BASE}`
  })
  return { processed, tokens }
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.

Comment on lines +51 to +56
function restoreStreamingTokens(text: string, tokens: string[]): string {
  return text.replace(
    new RegExp(`${TOKEN_PLACEHOLDER_BASE}([0-9a-z]+)${TOKEN_PLACEHOLDER_BASE}`, 'g'),
    (_, indexStr) => tokens[Number.parseInt(indexStr, 36)] ?? '',
  )
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.
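
To see how the two helpers quoted above protect control tokens, here is a minimal round-trip sketch. The value of `TOKEN_PLACEHOLDER_BASE` is not visible in this diff, so U+E000 (the first Private Use Area code point) is assumed; the crude "strip" step in the middle is a stand-in for the real passes, not part of the PR.

```typescript
// Assumption: the real TOKEN_PLACEHOLDER_BASE is not shown in the PR; U+E000 is used here.
const TOKEN_PLACEHOLDER_BASE = '\uE000'

// Replaces each <|...|> token with a PUA-delimited base-36 index and remembers the token.
function extractStreamingTokens(text: string): { processed: string, tokens: string[] } {
  const tokens: string[] = []
  const processed = text.replace(/<\|[^|]+\|>/g, (match) => {
    const index = tokens.length
    tokens.push(match)
    return `${TOKEN_PLACEHOLDER_BASE}${index.toString(36)}${TOKEN_PLACEHOLDER_BASE}`
  })
  return { processed, tokens }
}

// Swaps each placeholder back for the token stored at that index.
function restoreStreamingTokens(text: string, tokens: string[]): string {
  return text.replace(
    new RegExp(`${TOKEN_PLACEHOLDER_BASE}([0-9a-z]+)${TOKEN_PLACEHOLDER_BASE}`, 'g'),
    (_, indexStr: string) => tokens[Number.parseInt(indexStr, 36)] ?? '',
  )
}

const input = 'Hello <|ACT|> world <|DELAY|>'
const { processed, tokens } = extractStreamingTokens(input)

// Stand-in for the stripping passes: delete every '<', '|' and '>'.
// The placeholders contain none of these, so the tokens survive untouched.
const stripped = processed.replace(/[<>|]/g, '')
const restored = restoreStreamingTokens(stripped, tokens)
```

In this sketch `restored` equals `input`, whereas running the same stand-in strip directly on `input` would have destroyed both tokens.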

Comment on lines +84 to +195
export function stripUnreadableSymbols(
  text: string,
  options?: StripUnreadableSymbolsOptions,
): string {
  const opts: Required<StripUnreadableSymbolsOptions> = { ...DEFAULT_OPTIONS, ...options }

  // Protect streaming control tokens from stripping
  const { processed: safeText, tokens } = extractStreamingTokens(text)

  // Pass 1: Strip Markdown syntax (always run)
  let result = stripMarkdownFromSpeech(safeText)

  // Pass 2: Strip emoji and Unicode pictographic symbols
  if (opts.stripEmoji) {
    // Remove variation selectors, ZWJ, keycap combining chars first
    result = result.replace(/\uFE0F/gu, '')
    result = result.replace(/\u200D/gu, '')
    result = result.replace(/\u20E3/gu, '')

    // Remove emoji Unicode ranges
    // U+1F300-U+1F9FF: Misc symbols, emoticons, transport, supplemental
    result = result.replace(/[\u{1F300}-\u{1F9FF}]/gu, '')
    // U+2600-U+26FF: Misc symbols
    result = result.replace(/[\u{2600}-\u{26FF}]/gu, '')
    // U+2700-U+27BF: Dingbats
    result = result.replace(/[\u{2700}-\u{27BF}]/gu, '')
    // U+1F3FB-U+1F3FF: Skin tone modifiers
    result = result.replace(/[\u{1F3FB}-\u{1F3FF}]/gu, '')
    // U+1F1E0-U+1F1FF: Regional indicator symbols (flags)
    result = result.replace(/[\u{1F1E0}-\u{1F1FF}]/gu, '')
    // U+1F600-U+1F64F: Emoticons (faces)
    result = result.replace(/[\u{1F600}-\u{1F64F}]/gu, '')
    // U+1F680-U+1F6FF: Transport and map symbols
    result = result.replace(/[\u{1F680}-\u{1F6FF}]/gu, '')
    // U+1FA00-U+1FAFF: Chess symbols, Symbols and Pictographs Extended-A
    result = result.replace(/[\u{1FA00}-\u{1FAFF}]/gu, '')
    // U+2702-U+27B0: Dingbats (additional)
    result = result.replace(/[\u{2702}-\u{27B0}]/gu, '')
    // U+231A-U+231B: Watch, hourglass
    result = result.replace(/[\u{231A}-\u{231B}]/gu, '')
    // U+23E9-U+23F3: Media controls, clocks
    result = result.replace(/[\u{23E9}-\u{23F3}]/gu, '')
    // U+23F8-U+23FA: Media controls
    result = result.replace(/[\u{23F8}-\u{23FA}]/gu, '')
    // U+25AA-U+25AB, U+25B6, U+25C0, U+25FB-U+25FE: Geometric shapes
    result = result.replace(/[\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}]/gu, '')
    // U+2614-U+2615: Umbrella, hot beverage
    result = result.replace(/[\u{2614}-\u{2615}]/gu, '')
    // U+2648-U+2653: Zodiac
    result = result.replace(/[\u{2648}-\u{2653}]/gu, '')
    // U+267F, U+2693, U+26A1, U+26AA-U+26AB, U+26BD-U+26BE, U+26C4-U+26C5, U+26CE, U+26D4, U+26EA, U+26F2-U+26F3, U+26F5, U+26FA, U+26FD: Misc
    result = result.replace(/[\u{267F}\u{2693}\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}]/gu, '')
    // U+2934-U+2935: Arrows
    result = result.replace(/[\u{2934}-\u{2935}]/gu, '')
    // U+2B05-U+2B07: Arrows
    result = result.replace(/[\u{2B05}-\u{2B07}]/gu, '')
    // U+2B1B-U+2B1C, U+2B50, U+2B55: Geometric shapes
    result = result.replace(/[\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}]/gu, '')
    // U+3030, U+303D, U+3297, U+3299: CJK symbols
    result = result.replace(/[\u{3030}\u{303D}\u{3297}\u{3299}]/gu, '')
  }

  // Pass 3: Strip decorative Unicode (arrows, box-drawing, shapes, dingbats)
  if (opts.stripDecorativeUnicode) {
    // U+2190-U+21FF: Arrows
    result = result.replace(/[\u{2190}-\u{21FF}]/gu, '')
    // U+2500-U+257F: Box drawing
    result = result.replace(/[\u{2500}-\u{257F}]/gu, '')
    // U+2580-U+259F: Block elements
    result = result.replace(/[\u{2580}-\u{259F}]/gu, '')
    // U+25A0-U+25FF: Geometric shapes
    result = result.replace(/[\u{25A0}-\u{25FF}]/gu, '')
    // Specific decorative chars: © ® ™ § ¶ † ‡ • ‣ ⁃
    result = result.replace(/[©®™§¶†‡•‣⁃]/g, '')
    // U+2100-U+214F: Letterlike symbols
    result = result.replace(/[\u{2100}-\u{214F}]/gu, '')
  }

  // Pass 4: Strip standalone special characters
  if (opts.stripStandaloneSpecialChars) {
    // Matches standalone special chars surrounded by whitespace or at string boundaries.
    // Keeps the leading whitespace via $1; leftover double spaces are collapsed after Pass 6.
    result = result.replace(/(^|\s)[*#@|\\/~^`]+(?=\s|$)/g, '$1')
  }

  // Pass 5: Strip standalone math/operator symbols
  if (opts.stripMathOperators) {
    // Matches standalone operator sequences surrounded by whitespace or boundaries.
    // Uses lookbehind for whitespace/start and lookahead for whitespace/end.
    // Does NOT match when adjacent to non-whitespace characters (e.g., C++, A&B).
    // Consumes no whitespace; leftover double spaces are collapsed after Pass 6.
    result = result.replace(/(?:^|(?<=\s))[+=\-<>&^~|\\/%]+(?=\s|$)/g, '')
  }

  // Pass 6: Collapse repeated punctuation
  if (opts.collapseRepeatedPunctuation) {
    result = result.replace(/!{3,}/g, '!')
    result = result.replace(/\?{3,}/g, '?')
    result = result.replace(/\.{4,}/g, '…')
    result = result.replace(/-{3,}/g, '—')
    result = result.replace(/~{2,}/g, '~')
  }

  // Collapse multiple spaces into one
  result = result.replace(/ {2,}/g, ' ')

  // Restore streaming control tokens
  result = restoreStreamingTokens(result, tokens)

  // Trim leading/trailing whitespace from the overall result
  return result.trim()
}

Unexpected function declaration in the global scope, wrap in an IIFE for a local variable, assign as global property for a global variable


It is considered a best practice to avoid 'polluting' the global scope with variables that are intended to be local to the script. Global variables created from a script can produce name collisions with global variables created from another script, which will usually lead to runtime errors or unexpected behavior. It is mostly useful for browser scripts.
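
For readers checking the "standalone only" claim in the Pass 5 comments, the behaviour of that pattern can be tried in isolation. The regular expression below is copied from the diff; the `stripStandaloneOperators` helper, which chains it with the later space-collapsing and trim steps, is a hypothetical wrapper for this sketch only.

```typescript
// Pattern copied from Pass 5: an operator run that starts at the string start or
// right after whitespace, and ends at whitespace or the string end.
const STANDALONE_OPERATORS = /(?:^|(?<=\s))[+=\-<>&^~|\\/%]+(?=\s|$)/g

// Hypothetical helper: Pass 5 followed by the space collapse and trim from the end
// of stripUnreadableSymbols.
function stripStandaloneOperators(text: string): string {
  return text
    .replace(STANDALONE_OPERATORS, '')
    .replace(/ {2,}/g, ' ')
    .trim()
}

const samples = ['a + b', 'C++ and A&B', 'x => y', '50% off']
const cleaned = samples.map(stripStandaloneOperators)
```

Here `cleaned` is `['a b', 'C++ and A&B', 'x y', '50% off']`: operators glued to a word are left alone, while free-standing runs such as `+` and `=>` are dropped.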

 * stripUnreadableSymbols('Price is $5!!! Really???', { collapseRepeatedPunctuation: false })
 * // -> 'Price is $5!!! Really???'
 */
export function stripUnreadableSymbols(

`stripUnreadableSymbols` has a cyclomatic complexity of 6 with "medium" risk


A function with high cyclomatic complexity can be hard to understand and
maintain. Cyclomatic complexity is a software metric that measures the number of
independent paths through a function. A higher cyclomatic complexity indicates
that the function has more decision points and is more complex.
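
One possible way to lower the branch count, offered only as a sketch and not as what this PR does, is to drive the optional passes from a table so the main function contains no per-flag `if`. The `Pass` shape, the `PASSES` table, and `runPasses` are hypothetical names, and only two of the passes are reproduced to keep the example short.

```typescript
// Hypothetical table-driven variant; only two of the PR's passes are shown.
type PassFlag = 'stripMathOperators' | 'collapseRepeatedPunctuation'
type PassOptions = Record<PassFlag, boolean>

interface Pass {
  flag: PassFlag
  apply: (text: string) => string
}

const PASSES: Pass[] = [
  {
    flag: 'stripMathOperators',
    // Same pattern as Pass 5 in the diff.
    apply: t => t.replace(/(?:^|(?<=\s))[+=\-<>&^~|\\/%]+(?=\s|$)/g, ''),
  },
  {
    flag: 'collapseRepeatedPunctuation',
    // Subset of Pass 6: only the '!' and '?' rules.
    apply: t => t.replace(/!{3,}/g, '!').replace(/\?{3,}/g, '?'),
  },
]

// Runs every enabled pass in table order; the branching lives in filter().
function runPasses(text: string, opts: PassOptions): string {
  return PASSES
    .filter(pass => opts[pass.flag])
    .reduce((acc, pass) => pass.apply(acc), text)
}

const allOn = runPasses('Wow!!! a + b', { stripMathOperators: true, collapseRepeatedPunctuation: true })
const mathOnly = runPasses('Wow!!! a + b', { stripMathOperators: true, collapseRepeatedPunctuation: false })
```

Whether this trade is worthwhile is a judgment call: the table keeps pass order explicit and each pass testable on its own, but a complexity of 6 is already modest.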

    // U+25A0-U+25FF: Geometric shapes
    result = result.replace(/[\u{25A0}-\u{25FF}]/gu, '')
    // Specific decorative chars: © ® ™ § ¶ † ‡ • ‣ ⁃
    result = result.replace(/[©®™§¶†‡•‣⁃]/g, '')

Use the 'u' flag with regular expressions


It is recommended to use the u flag with regular expressions.
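
For context on this finding, a small sketch of what the `u` flag changes. The first half uses a generic single-character pattern to show the difference on a non-BMP character; the second half applies the flag to the flagged character class. The variable names are illustrative only.

```typescript
// U+1F600 (grinning face) written as its UTF-16 surrogate pair.
const face = '\uD83D\uDE00'

// Without 'u', '.' matches one UTF-16 code unit, so a two-unit emoji is not "one character".
const matchesWithoutU = /^.$/.test(face)
// With 'u', the pattern works on whole code points.
const matchesWithU = /^.$/u.test(face)

// The flagged class with 'u' added. Every character in it is in the Basic
// Multilingual Plane, so the flag should not change what this pattern matches.
const decorative = /[©®™§¶†‡•‣⁃]/gu
const strippedLine = '© 2026 • done'.replace(decorative, '')
```

`matchesWithoutU` is `false`, `matchesWithU` is `true`, and `strippedLine` is `' 2026  done'` (the double space would be collapsed by the later whitespace step). For this particular class the change is about consistency with the surrounding `gu` patterns rather than a behaviour fix.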

@vi70x3 vi70x3 closed this Jun 6, 2026

2 participants