Harden triage skills against instructions embedded in report content#9
Open
Aboudoc wants to merge 1 commit into
Open
Harden triage skills against instructions embedded in report content#9Aboudoc wants to merge 1 commit into
Aboudoc wants to merge 1 commit into
Conversation
Report fields, attachments, and comments are submitter-authored. Declare them as untrusted data, add a pre-validation screening gate, gate write actions behind human confirmation, restrict responder comments to templates, and keep reports isolated in bulk mode so one report cannot steer another. Adds an untrusted-input reference and a benign regression corpus.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Report fields, attachments, and comments are authored by the submitter, but the triage skills
currently treat that content the same as their own instructions. This PR adds an explicit trust
boundary so report content is handled as untrusted data.
Changes
hackenproof-triageandhackenproof-bulk-triage: outputfrom
get_report_details,fetch_attachment,get_comments, andsearch_commentsis data,not instructions.
drive triage (fake "system/team/internal" notes, claimed out-of-band pre-validation or
overrides, direct severity/state requests, or requests to disclose program data) is
disregarded and flagged for human review.
triggers one.
text or program data.
recommendation, and program info never appears in the output.
references/untrusted-input-handling.md(screening checklist) andreferences/injection-test-corpus.md(benign regression cases).Why
Defense-in-depth. Today the workflow relies on the model declining to follow injected
instructions, which is model-dependent. The screening gate and human confirmation make triage
integrity independent of model choice. Background and a runnable proof of concept were shared
privately with the team.
Testing
Run the triage skill against the cases in
references/injection-test-corpus.md; each should bedecided on evidence with the embedded directive ignored. No behavior change for well-formed
reports.