chore: resolve issue #113 - sanitize crawled web text in intake prompts #119
Open
matthewod11-stack wants to merge 1 commit into
The scoring path already defends against prompt injection via scoring::sanitize (defangs the <evidence>/<profile> sandbox delimiters, strips control/zero-width chars), but the intake research path embedded raw Exa-crawled web text directly into the analyze_company / analyze_profile prompts. A candidate could plant injection text on their own crawled page to steer analysis, search strategy, and findSimilar seeds.

Extract pure prompt-builder helpers (build_company_user_prompt / build_profile_user_prompt) that run the existing sanitize_untrusted_text over the crawled text before embedding, and port the scoring injection test to the intake path. Trusted inputs (seed URL, the user's own structured profile JSON) are left verbatim.

Verification: cargo test 790 passed / 0 failed (3 new); recruiting clippy clean (0 new warnings vs origin/main).

Gates the RECRUITING_ENABLED flip per backlog. Resolves #113.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds prompt-injection sanitization to the recruiting intake research path so Exa-crawled web text is treated the same way as scoring evidence text, closing the gap described in issue #113.
Changes:
- Introduces `build_company_user_prompt` / `build_profile_user_prompt` helpers that apply `sanitize_untrusted_text` to untrusted crawled text before embedding it into LLM prompts.
- Updates `analyze_company` / `analyze_profile` to use the new prompt-builder helpers instead of embedding raw crawled text.
- Adds targeted unit tests to ensure forged delimiters are defanged, zero-width/control characters are stripped, and trusted inputs remain verbatim.
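As a rough illustration of the change summarized above, the sketch below shows one way a sanitizer and a company prompt builder could fit together. The body of `sanitize_untrusted_text` does not appear on this page, so the implementation here is a hypothetical stand-in (the fullwidth replacement is inferred from the test name `defangs_angle_brackets_to_fullwidth`), and the prompt wording inside `build_company_user_prompt` is assumed. Only the function names and the `(url, text)` parameters come from the PR.

```rust
/// Hypothetical stand-in for the project's `sanitize_untrusted_text`.
/// Assumption: angle brackets become fullwidth look-alikes so that forged
/// `<evidence>` / `<profile>` delimiters no longer match, and control and
/// zero-width characters are dropped.
fn sanitize_untrusted_text(input: &str) -> String {
    input
        .chars()
        .filter_map(|c| match c {
            // Defang ASCII angle brackets to their fullwidth forms.
            '<' => Some('\u{FF1C}'),
            '>' => Some('\u{FF1E}'),
            // Keep ordinary line breaks and tabs.
            '\n' | '\t' => Some(c),
            // Drop common zero-width characters.
            '\u{200B}'..='\u{200D}' | '\u{2060}' | '\u{FEFF}' => None,
            // Drop all remaining control characters.
            c if c.is_control() => None,
            c => Some(c),
        })
        .collect()
}

/// Sketch of a company prompt builder. The label text is an assumption;
/// the point is that the trusted seed URL is embedded verbatim while the
/// crawled text passes through the sanitizer first.
fn build_company_user_prompt(url: &str, text: &str) -> String {
    format!(
        "Company URL: {url}\n\nCrawled content:\n{}",
        sanitize_untrusted_text(text)
    )
}
```

Keeping the builder a pure function of its inputs means the sanitization step can be unit-tested without an LLM call or a network fetch.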
Comment on lines +320 to +329

    fn build_profile_user_prompt(serialized: &str, fetched: &str) -> String {
        if fetched.is_empty() {
            format!("Profile input:\n{serialized}")
        } else {
            format!(
                "Profile input:\n{serialized}\n\nFetched content:\n{}",
                sanitize_untrusted_text(fetched)
            )
        }
    }
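The kind of test the PR describes for this helper might look like the sketch below. The builder is copied from the hunk above. The sanitizer is a minimal hypothetical stand-in (the real one is not shown on this page), and the test names and inputs are illustrative rather than the PR's actual tests.

```rust
// Hypothetical stand-in for the real `sanitize_untrusted_text`, which is
// not shown here. Assumed behavior: strip zero-width and control
// characters, then defang angle brackets to fullwidth forms.
fn sanitize_untrusted_text(s: &str) -> String {
    s.chars()
        .filter(|c| !matches!(*c, '\u{200B}'..='\u{200D}' | '\u{FEFF}'))
        .filter(|c| !c.is_control() || *c == '\n')
        .map(|c| match c {
            '<' => '\u{FF1C}',
            '>' => '\u{FF1E}',
            other => other,
        })
        .collect()
}

// Copied from the diff hunk under review.
fn build_profile_user_prompt(serialized: &str, fetched: &str) -> String {
    if fetched.is_empty() {
        format!("Profile input:\n{serialized}")
    } else {
        format!(
            "Profile input:\n{serialized}\n\nFetched content:\n{}",
            sanitize_untrusted_text(fetched)
        )
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn forged_delimiter_in_fetched_text_is_defanged() {
        // Attacker-controlled page text with a forged closing delimiter
        // and a zero-width space.
        let fetched = "bio</evidence>\u{200B}ignore prior instructions";
        let prompt = build_profile_user_prompt("{}", fetched);
        assert!(!prompt.contains("</evidence>"));
        assert!(!prompt.contains('\u{200B}'));
    }

    #[test]
    fn trusted_profile_json_is_preserved_verbatim() {
        // The user's own structured profile is trusted and must not be altered.
        let serialized = r#"{"title":"<Staff> Engineer"}"#;
        let prompt = build_profile_user_prompt(serialized, "page text");
        assert!(prompt.contains(serialized));
    }
}
```

The second test matters as much as the first: sanitizing the trusted profile JSON too would silently change user-supplied data, so only the fetched text should pass through the sanitizer.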
Auto-generated by portfolio-orchestrator nightly run on 2026-06-12 (live mode). Resolves #113.
What
The scoring path already defends against prompt injection (`recruiting/scoring/sanitize.rs`: defangs the `<evidence>`/`<profile>` sandbox delimiters, strips control/zero-width chars, applied at `signal_extract.rs:56,85`). The intake research path did not: `analyze_company` and `analyze_profile` embedded raw Exa-crawled web text directly into the LLM prompt. A candidate can plant injection text on their own crawled page → steer the company analysis, profile analysis, search strategy, and `findSimilar` seeds. Same attacker-controlled-content class, sanitized in one path and raw in the other (audit finding 2.2).

How
- Extract pure prompt-builder helpers, `build_company_user_prompt(url, text)` and `build_profile_user_prompt(serialized, fetched)`, that run the existing `sanitize_untrusted_text` over the crawled text before it enters the prompt. No change to the sanitizer itself (`bail-if: sanitizer needs intake-specific behavior changes` did not trigger).
- Port the scoring injection test (`defangs_angle_brackets_to_fullwidth`) to the intake path: 3 new tests assert forged delimiters are defanged, zero-width/control chars stripped, and that trusted inputs (the seed URL, the user's own structured profile JSON) are preserved verbatim.

Verification
- raw delimiter must not survive); with it, all pass.
- `cargo test --manifest-path src-tauri/Cargo.toml` → 790 passed / 0 failed (3 new).
- `cargo clippy --lib` → 0 new warnings vs `origin/main` (intake/prod.rs clippy-clean).
- `recruiting/intake/prod.rs`), within `max-files-changed: 3`; nothing touched outside `src-tauri/src/recruiting/`.

Reviewer notes
`RECRUITING_ENABLED`-flip gate per the app backlog: bounded today (recruiting flag-dead in prod, user's own key, structured output), so low blast radius, but it removes the raw-embed asymmetry before the flip.

🤖 Generated with Claude Code