fix(security): sanitize crawled web text in intake analyze prompts (#113)#121
fix(security): sanitize crawled web text in intake analyze prompts (#113)#121matthewod11-stack wants to merge 1 commit into
Conversation
) The scoring signal-extraction path defends against prompt injection via sanitize_untrusted_text, but the intake research path embedded raw Exa-crawled web text directly into LLM prompts. A candidate could plant injection on their own crawled page to steer analysis, search strategy, and findSimilar seeds — the same class of attacker-controlled content sanitized in one path and raw in the other. Apply the existing sanitize_untrusted_text to crawled text before embedding in both intake prompt builders: - analyze_company: sanitize content.text - analyze_profile: sanitize the fetched crawl text (the user's own structured input is left as-is — it isn't untrusted web text) The sanitizer is reused unchanged (defangs angle brackets so untrusted text can't forge the <evidence>/<profile> sandbox delimiters, strips control/ zero-width chars, truncates). Added an intake injection test that drives the real analyze_company path through a capturing provider and asserts the </evidence> forgery is defanged and C0 controls stripped before reaching the LLM seam — porting the scoring path's injection-defense coverage. Verification: cargo test recruiting::intake 74 passed (incl. new test); full lib suite 789 passed, 0 failed; clippy introduces no new warnings in the changed file. Resolves #113. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Closing as a duplicate of #119 (opened 2026-06-12, still open), which already resolves #113 via an extracted-prompt-builder + TDD approach. This PR (#121) was opened by the 2026-06-15 orchestrator run, which verified the issue was open and the code still unsanitized on |
There was a problem hiding this comment.
Pull request overview
This PR hardens the recruiting intake “research → analyze” LLM prompt construction by sanitizing Exa-crawled web text before it is embedded into prompts, aligning the intake path with the existing scoring prompt-injection defenses.
Changes:
- Apply
sanitize_untrusted_textto crawled company page text before it enters theanalyze_companyprompt. - Apply
sanitize_untrusted_textto fetched profile crawl text before it enters theanalyze_profileprompt. - Add an intake regression test that captures provider-bound messages and asserts sanitization occurred before reaching the LLM seam.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| content: format!( | ||
| "Company page ({}):\n\n{}", | ||
| content.url, content.text | ||
| content.url, | ||
| sanitize_untrusted_text(&content.text) | ||
| ), |
| // `fetched` is crawled web text (attacker-controlled); sanitize before | ||
| // embedding. `serialized` is the user's own structured input. (#113) | ||
| format!( | ||
| "Profile input:\n{serialized}\n\nFetched content:\n{}", | ||
| sanitize_untrusted_text(&fetched) | ||
| ) |
Auto-generated by portfolio-orchestrator nightly run on 2026-06-15 (live mode).
What (security — Audit finding 2.2)
The scoring signal-extraction path defends against prompt injection via
sanitize_untrusted_text, but the intake research path embedded raw Exa-crawled web text directly into LLM prompts. A candidate could plant injection on their own crawled page → steer the analysis, search strategy, and findSimilar seeds. Same class of attacker-controlled content, sanitized in one path and raw in the other.Fix
Apply the existing
sanitize_untrusted_textto crawled text before embedding in both intake prompt builders:analyze_company→ sanitizecontent.textanalyze_profile→ sanitize thefetchedcrawl text (the user's own structured input is left as-is — it isn't untrusted web text)The sanitizer is reused unchanged (defangs
</>so untrusted text can't forge the<evidence>/<profile>sandbox delimiters, strips control/zero-width chars, truncates). No sanitizer behavior change → no escalation perbail-if.Test
Added
analyze_company_sanitizes_crawled_text_before_llm: drives the realanalyze_companypath through a capturing provider and asserts the</evidence>forgery is defanged to full-width and C0 controls stripped before reaching the LLM seam — porting the scoring path's injection coverage to intake.Verification
cargo test recruiting::intake→ 74 passed (incl. the new test) ✓max-files-changed: 3) ✓do-not-touchrespected: change is entirely withinsrc-tauri/src/recruiting/intake/;sanitize.rsreused, not modified ✓Resolves #113.