fix(security): sanitize crawled web text in intake analyze prompts (#113) #121

Closed
matthewod11-stack wants to merge 1 commit into main from chore/orchestrator-issue-113-2026-06-15


@matthewod11-stack

Owner

Auto-generated by portfolio-orchestrator nightly run on 2026-06-15 (live mode).

What (security — Audit finding 2.2)

The scoring signal-extraction path defends against prompt injection via sanitize_untrusted_text, but the intake research path embedded raw Exa-crawled web text directly into LLM prompts. A candidate could plant an injection payload on their own crawled page and thereby steer the analysis, search strategy, and findSimilar seeds. This is the same class of attacker-controlled content, sanitized in one path and left raw in the other.

Fix

Apply the existing sanitize_untrusted_text to crawled text before embedding in both intake prompt builders:

  • analyze_company → sanitize content.text
  • analyze_profile → sanitize the fetched crawl text (the user's own structured input is left as-is — it isn't untrusted web text)

The sanitizer is reused unchanged: it defangs < and > so untrusted text can't forge the <evidence>/<profile> sandbox delimiters, strips control and zero-width characters, and truncates. There is no sanitizer behavior change, so no escalation per bail-if.
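To make the three sanitizer steps concrete, here is a minimal sketch of a function with the behavior described above. It is not the project's sanitize_untrusted_text (sanitize.rs is not shown in this PR). The name sanitize_sketch, the MAX_CHARS cap, the exact zero-width set, and the decision to keep newlines and tabs are all assumptions made for illustration.

```rust
// Hypothetical stand-in for `sanitize_untrusted_text`; illustrative only.
const MAX_CHARS: usize = 4_000; // assumed cap, not the project's real limit

// Assumed set of zero-width characters to strip.
fn is_zero_width(c: char) -> bool {
    matches!(c, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}')
}

fn sanitize_sketch(input: &str) -> String {
    input
        .chars()
        // Strip control characters; keeping '\n' and '\t' is an assumption.
        .filter(|&c| !(c.is_control() && c != '\n' && c != '\t'))
        // Strip zero-width characters that can hide or split tokens.
        .filter(|&c| !is_zero_width(c))
        // Defang angle brackets to their full-width forms so the text
        // cannot close or open an <evidence>/<profile> delimiter.
        .map(|c| match c {
            '<' => '\u{FF1C}',
            '>' => '\u{FF1E}',
            other => other,
        })
        // Truncate by characters, so a multi-byte character is never split.
        .take(MAX_CHARS)
        .collect()
}
```

Truncating after the filters means the cap applies to what the model actually sees, not to the raw crawl length.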

Test

Added analyze_company_sanitizes_crawled_text_before_llm: drives the real analyze_company path through a capturing provider and asserts the </evidence> forgery is defanged to full-width and C0 controls stripped before reaching the LLM seam — porting the scoring path's injection coverage to intake.
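The shape of such a test can be sketched as follows. This is a simplified, synchronous illustration, not the test added in this PR: the LlmProvider trait, CapturingProvider, sanitize_stub, analyze_company_sketch, and captured_prompt_for are hypothetical names, and the real provider seam is likely richer (for example async, with structured messages).

```rust
use std::cell::RefCell;

// Hypothetical provider seam; the project's real trait is not shown here.
trait LlmProvider {
    fn complete(&self, prompt: &str) -> String;
}

// Records every prompt it receives instead of calling a model.
struct CapturingProvider {
    seen: RefCell<Vec<String>>,
}

impl LlmProvider for CapturingProvider {
    fn complete(&self, prompt: &str) -> String {
        self.seen.borrow_mut().push(prompt.to_string());
        String::from("{}") // canned reply; the content is irrelevant to the test
    }
}

// Minimal stand-in sanitizer: strips control characters (except '\n')
// and defangs angle brackets to full-width forms.
fn sanitize_stub(text: &str) -> String {
    text.chars()
        .filter(|&c| !c.is_control() || c == '\n')
        .map(|c| match c {
            '<' => '\u{FF1C}',
            '>' => '\u{FF1E}',
            other => other,
        })
        .collect()
}

// Stand-in for the intake path: sanitize crawled text, then call the provider.
fn analyze_company_sketch(provider: &dyn LlmProvider, url: &str, crawled: &str) -> String {
    let prompt = format!("Company page ({}):\n\n{}", url, sanitize_stub(crawled));
    provider.complete(&prompt)
}

// Drives the path end to end and returns what reached the provider.
fn captured_prompt_for(crawled: &str) -> String {
    let provider = CapturingProvider { seen: RefCell::new(Vec::new()) };
    analyze_company_sketch(&provider, "https://example.com", crawled);
    let first = provider.seen.borrow()[0].clone();
    first
}
```

Asserting on the captured prompt, rather than on the sanitizer's return value alone, is what catches a call site that forgets to sanitize.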

Verification

  • cargo test recruiting::intake → 74 passed (incl. the new test) ✓
  • Full lib suite → 789 passed, 0 failed
  • Clippy: no new warnings in the changed file ✓
  • 1 file changed (issue cap: max-files-changed: 3) ✓
  • do-not-touch respected: change is entirely within src-tauri/src/recruiting/intake/; sanitize.rs reused, not modified ✓

Resolves #113.


The scoring signal-extraction path defends against prompt injection via
sanitize_untrusted_text, but the intake research path embedded raw Exa-crawled
web text directly into LLM prompts. A candidate could plant injection on their
own crawled page to steer analysis, search strategy, and findSimilar seeds —
the same class of attacker-controlled content sanitized in one path and raw in
the other.

Apply the existing sanitize_untrusted_text to crawled text before embedding in
both intake prompt builders:
- analyze_company: sanitize content.text
- analyze_profile: sanitize the fetched crawl text (the user's own structured
  input is left as-is — it isn't untrusted web text)

The sanitizer is reused unchanged (defangs angle brackets so untrusted text
can't forge the <evidence>/<profile> sandbox delimiters, strips control/
zero-width chars, truncates). Added an intake injection test that drives the
real analyze_company path through a capturing provider and asserts the
</evidence> forgery is defanged and C0 controls stripped before reaching the
LLM seam — porting the scoring path's injection-defense coverage.

Verification: cargo test recruiting::intake 74 passed (incl. new test); full
lib suite 789 passed, 0 failed; clippy introduces no new warnings in the
changed file.

Resolves #113.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@matthewod11-stack

Owner Author

Closing as a duplicate of #119 (opened 2026-06-12, still open), which already resolves #113 via an extracted-prompt-builder + TDD approach. This PR (#121) was opened by the 2026-06-15 orchestrator run, which verified the issue was open and the code still unsanitized on main but did not detect the existing open PR #119 (its fix is unmerged, so neither the issue state nor main reflected it). Deferring to the older #119; deleting this branch. Apologies for the noise.

@matthewod11-stack matthewod11-stack deleted the chore/orchestrator-issue-113-2026-06-15 branch June 15, 2026 08:37

Copilot AI left a comment


Pull request overview

This PR hardens the recruiting intake “research → analyze” LLM prompt construction by sanitizing Exa-crawled web text before it is embedded into prompts, aligning the intake path with the existing scoring prompt-injection defenses.

Changes:

  • Apply sanitize_untrusted_text to crawled company page text before it enters the analyze_company prompt.
  • Apply sanitize_untrusted_text to fetched profile crawl text before it enters the analyze_profile prompt.
  • Add an intake regression test that captures provider-bound messages and asserts sanitization occurred before reaching the LLM seam.


Comment on lines 330 to 334
     content: format!(
         "Company page ({}):\n\n{}",
-        content.url, content.text
+        content.url,
+        sanitize_untrusted_text(&content.text)
     ),
Comment on lines +367 to +372
+    // `fetched` is crawled web text (attacker-controlled); sanitize before
+    // embedding. `serialized` is the user's own structured input. (#113)
+    format!(
+        "Profile input:\n{serialized}\n\nFetched content:\n{}",
+        sanitize_untrusted_text(&fetched)
+    )
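The trust boundary in the analyze_profile hunk can be illustrated with a small standalone sketch. It assumes a simplified defang helper in place of sanitize_untrusted_text, and build_profile_prompt is a hypothetical name. Only the format string is taken from the hunk.

```rust
// Simplified stand-in for `sanitize_untrusted_text`: only the bracket
// defanging step is modeled here (an assumption for brevity).
fn defang(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '<' => '\u{FF1C}',
            '>' => '\u{FF1E}',
            other => other,
        })
        .collect()
}

// Hypothetical builder mirroring the hunk: the user's own structured input
// is embedded as-is, while crawled text is sanitized first.
fn build_profile_prompt(serialized: &str, fetched: &str) -> String {
    format!(
        "Profile input:\n{serialized}\n\nFetched content:\n{}",
        defang(fetched)
    )
}
```

With this split, a forged closing tag in the crawled text is neutralized, while angle brackets in the user's own input pass through untouched.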

Development

Successfully merging this pull request may close these issues.

Intake analyze prompts embed raw Exa-crawled web text without sanitize_untrusted_text (scoring path sanitizes; intake doesn't)
