Skip to content

Harden triage skills against instructions embedded in report content#9

Open
Aboudoc wants to merge 1 commit into
hackenproof-public:mainfrom
Aboudoc:harden-triage-untrusted-input
Open

Harden triage skills against instructions embedded in report content#9
Aboudoc wants to merge 1 commit into
hackenproof-public:mainfrom
Aboudoc:harden-triage-untrusted-input

Conversation

@Aboudoc

@Aboudoc Aboudoc commented May 29, 2026

Copy link
Copy Markdown

Report fields, attachments, and comments are authored by the submitter, but the triage skills
currently treat that content the same as their own instructions. This PR adds an explicit trust
boundary so report content is handled as untrusted data.

Changes

  • Trust Boundary section in both hackenproof-triage and hackenproof-bulk-triage: output
    from get_report_details, fetch_attachment, get_comments, and search_comments is data,
    not instructions.
  • Gate 0 — Untrusted-Content Screen before the existing gates: report content that tries to
    drive triage (fake "system/team/internal" notes, claimed out-of-band pre-validation or
    overrides, direct severity/state requests, or requests to disclose program data) is
    disregarded and flagged for human review.
  • Action gating: write actions require human confirmation; report content alone never
    triggers one.
  • Comment provenance: responder comments come only from the templates; never echo report
    text or program data.
  • Cross-report isolation in bulk mode: one report's content can't influence another's
    recommendation, and program info never appears in the output.
  • New references/untrusted-input-handling.md (screening checklist) and
    references/injection-test-corpus.md (benign regression cases).

Why

Defense-in-depth. Today the workflow relies on the model declining to follow injected
instructions, which is model-dependent. The screening gate and human confirmation make triage
integrity independent of model choice. Background and a runnable proof of concept were shared
privately with the team.

Testing

Run the triage skill against the cases in references/injection-test-corpus.md; each should be
decided on evidence with the embedded directive ignored. No behavior change for well-formed
reports.

Report fields, attachments, and comments are submitter-authored. Declare them
as untrusted data, add a pre-validation screening gate, gate write actions
behind human confirmation, restrict responder comments to templates, and keep
reports isolated in bulk mode so one report cannot steer another. Adds an
untrusted-input reference and a benign regression corpus.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant