Skip to content

chore: resolve issue #113 — sanitize crawled web text in intake prompts#119

Open
matthewod11-stack wants to merge 1 commit into
mainfrom
chore/orchestrator-issue-113-2026-06-12
Open

chore: resolve issue #113 — sanitize crawled web text in intake prompts#119
matthewod11-stack wants to merge 1 commit into
mainfrom
chore/orchestrator-issue-113-2026-06-12

Conversation

@matthewod11-stack

Copy link
Copy Markdown
Owner

Auto-generated by portfolio-orchestrator nightly run on 2026-06-12 (live mode). Resolves #113.

What

The scoring path already defends against prompt injection (recruiting/scoring/sanitize.rs — defangs the <evidence>/<profile> sandbox delimiters, strips control/zero-width chars, applied at signal_extract.rs:56,85). The intake research path did not: analyze_company and analyze_profile embedded raw Exa-crawled web text directly into the LLM prompt. A candidate can plant injection text on their own crawled page → steer the company analysis, profile analysis, search strategy, and findSimilar seeds. Same attacker-controlled-content class, sanitized in one path and raw in the other (audit finding 2.2).

How

  • Extracted two pure prompt-builder helpers — build_company_user_prompt(url, text) and build_profile_user_prompt(serialized, fetched) — that run the existing sanitize_untrusted_text over the crawled text before it enters the prompt. No change to the sanitizer itself (bail-if: sanitizer needs intake-specific behavior changes did not trigger).
  • Ported the scoring injection test (defangs_angle_brackets_to_fullwidth) to the intake path: 3 new tests assert forged delimiters are defanged, zero-width/control chars stripped, and that trusted inputs (the seed URL, the user's own structured profile JSON) are preserved verbatim.

Verification

  • TDD red→green: with sanitization removed, both injection tests fail (raw delimiter must not survive); with it, all pass.
  • cargo test --manifest-path src-tauri/Cargo.toml790 passed / 0 failed (3 new).
  • cargo clippy --lib0 new warnings vs origin/main (intake/prod.rs clippy-clean).
  • Scope: 1 file (recruiting/intake/prod.rs), within max-files-changed: 3; nothing touched outside src-tauri/src/recruiting/.

Reviewer notes

  • This is a RECRUITING_ENABLED-flip gate per the app backlog — bounded today (recruiting flag-dead in prod, user's own key, structured output), so low blast radius, but it removes the raw-embed asymmetry before the flip.
  • Defense applies to crawled web text only; the seed URL and the user's structured profile input are intentionally left unsanitized (trusted seeds).

🤖 Generated with Claude Code

The scoring path already defends against prompt injection via
scoring::sanitize (defangs the <evidence>/<profile> sandbox delimiters,
strips control/zero-width chars), but the intake research path embedded
raw Exa-crawled web text directly into the analyze_company /
analyze_profile prompts. A candidate could plant injection text on their
own crawled page to steer analysis, search strategy, and findSimilar seeds.

Extract pure prompt-builder helpers (build_company_user_prompt /
build_profile_user_prompt) that run the existing sanitize_untrusted_text
over the crawled text before embedding, and port the scoring injection
test to the intake path. Trusted inputs (seed URL, the user's own
structured profile JSON) are left verbatim.

Verification: cargo test 790 passed / 0 failed (3 new); recruiting
clippy clean (0 new warnings vs origin/main). Gates the RECRUITING_ENABLED
flip per backlog.

Resolves #113.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds prompt-injection sanitization to the recruiting intake research path so Exa-crawled web text is treated the same way as scoring evidence text, closing the gap described in issue #113.

Changes:

  • Introduces build_company_user_prompt / build_profile_user_prompt helpers that apply sanitize_untrusted_text to untrusted crawled text before embedding it into LLM prompts.
  • Updates analyze_company / analyze_profile to use the new prompt-builder helpers instead of embedding raw crawled text.
  • Adds targeted unit tests to ensure forged delimiters are defanged, zero-width/control characters are stripped, and trusted inputs remain verbatim.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +320 to +329
fn build_profile_user_prompt(serialized: &str, fetched: &str) -> String {
if fetched.is_empty() {
format!("Profile input:\n{serialized}")
} else {
format!(
"Profile input:\n{serialized}\n\nFetched content:\n{}",
sanitize_untrusted_text(fetched)
)
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Intake analyze prompts embed raw Exa-crawled web text without sanitize_untrusted_text (scoring path sanitizes; intake doesn't)

2 participants