Skip to content

[Evaluation] Fix RedTeam.scan() decoding encoded attack prompts in stored results#47536

Open
huliang-microsoft wants to merge 2 commits into
Azure:mainfrom
huliang-microsoft:fix/redteam-encoded-prompts-fidelity
Open

[Evaluation] Fix RedTeam.scan() decoding encoded attack prompts in stored results#47536
huliang-microsoft wants to merge 2 commits into
Azure:mainfrom
huliang-microsoft:fix/redteam-encoded-prompts-fidelity

Conversation

@huliang-microsoft

Copy link
Copy Markdown

Fixes #47228

Problem

RedTeam.scan() was storing the decoded plaintext objective in
evaluation_results.json / results.json for every converter-based attack
strategy (Base64, Flip, Morse, ROT13, Caesar, Leetspeak,
AsciiArt, AnsiAttack, Atbash, Binary, CharacterSpace, CharSwap,
Diacritic, StringJoin, SuffixAppend, UnicodeConfusable,
UnicodeSubstitution, Url, AsciiSmuggler, Tense). The target callback
actually received the encoded converted_value, but persisted
attack_details[].conversation[].content was the pre-converter
original_value. Customer impact (from #47228):

  • Impossible to audit/verify the attack surface post-scan.
  • Cannot debug why specific encoding variants succeeded / failed.
  • Cannot correlate target responses to the exact encoded prompts received.
  • attack_technique metadata correctly said e.g. "base64", but
    conversation content was plaintext — breaking reproducibility.

Root cause

In FoundryResultProcessor._build_messages_from_pieces, user turns
deliberately preferred original_value over converted_value, dropping the
on-wire payload from the persisted conversation.

Fix

  • content now always reflects what was sent on the wire
    (converted_value, falling back to original_value only when
    converted_value is empty) for both user and assistant turns.
  • The pre-converter adversarial objective is preserved as a new sibling
    original_value field on user messages only when it differs from
    content, so the audit trail of "what the attack meant to say" is not
    lost and baseline (non-encoded) strategies stay byte-identical to the
    old output.

Tests

CHANGELOG

Added a 1.17.1 (Unreleased) entry under ### Bugs Fixed.

cc @singankit @w-javed

For converter-based attack strategies (Base64, Flip, Morse, ROT13, Caesar,
Leetspeak, AsciiArt, AnsiAttack, Atbash, Binary, CharacterSpace, CharSwap,
Diacritic, StringJoin, SuffixAppend, UnicodeConfusable, UnicodeSubstitution,
Url, AsciiSmuggler, Tense), FoundryResultProcessor was emitting the decoded
'original_value' as the user-message content while the target was actually
receiving 'converted_value'. This made evaluation_results.json /
results.json show plaintext where the audit trail should show the encoded
payload, breaking post-scan auditability and per-variant debugging.

This change makes conversation[].content always reflect the on-wire value
(converted_value) for both user and assistant turns, and preserves the
pre-converter objective as a sibling 'original_value' field on user
messages whenever it differs. Baseline (non-encoded) strategies are
unaffected since original_value == converted_value.

Adds two regression tests in TestFoundryResultProcessor and a CHANGELOG
entry. Resolves Azure#47228.
Copilot AI review requested due to automatic review settings June 16, 2026 23:59
@huliang-microsoft huliang-microsoft requested a review from a team as a code owner June 16, 2026 23:59
@github-actions github-actions Bot added Community Contribution Community members are working on the issue customer-reported Issues that are reported by GitHub users external to the Azure organization. Evaluation Issues related to the client library for Azure AI Evaluation labels Jun 17, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Thank you for your contribution @huliang-microsoft! We will review the pull request and get back to you soon.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes persisted red-team conversations so they reflect the actual encoded payload sent to the target (converter output), while preserving the pre-conversion prompt for auditability.

Changes:

  • Update FoundryResultProcessor._build_messages_from_pieces() to store converted_value as content and add original_value only when it differs.
  • Add unit tests covering encoded user prompts and fallback behavior when converted_value is missing.
  • Document the bug fix and new persisted field in the changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_foundry_result_processor.py Changes message serialization to persist wire payload and optionally include original_value for auditing.
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_foundry.py Adds regression + behavior tests for encoded prompts and fallback behavior.
sdk/evaluation/azure-ai-evaluation/CHANGELOG.md Documents the behavior change and new original_value field for persisted conversations.

Comment on lines +360 to +367
original = getattr(piece, "original_value", None)
converted = getattr(piece, "converted_value", None)
if isinstance(converted, str) and converted:
content = converted
elif isinstance(original, str) and original:
content = original
else:
content = getattr(piece, "converted_value", None) or getattr(piece, "original_value", "")
content = ""

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 493577f — dropped the isinstance(str) guards around content selection so non-string converted_value / original_value (bytes, structured/multimodal payloads) pass through unchanged. The str check is kept only on the original_value audit-field emission, where comparing two non-text values for inequality would be meaningless.

assistant_piece.prompt_metadata = {}
assistant_piece.labels = {}

messages = processor._build_messages_from_pieces([user_piece, assistant_piece])

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 493577ftest_build_messages_preserves_non_string_payloads covers a list-of-parts user payload and a bytes assistant payload, asserting both survive on content without being coerced to "".


### Bugs Fixed

- Fixed `RedTeam.scan()` storing decoded plaintext instead of the actual encoded payload for converter-based attack strategies (`Base64`, `Flip`, `Morse`, `ROT13`, `Caesar`, `Leetspeak`, `AsciiArt`, `AnsiAttack`, `Atbash`, `Binary`, `CharacterSpace`, `CharSwap`, `Diacritic`, `StringJoin`, `SuffixAppend`, `UnicodeConfusable`, `UnicodeSubstitution`, `Url`, `AsciiSmuggler`, `Tense`) in `evaluation_results.json` / `results.json`. The persisted `conversation[].content` for user turns now reflects what the target actually received (`converted_value`); the pre-converter adversarial objective is preserved on the same message as a new `original_value` field so the audit trail of what the attack meant to say is not lost. Baseline (non-encoded) strategies are unaffected. Resolves [Azure/azure-sdk-for-python#47228](https://github.com/Azure/azure-sdk-for-python/issues/47228).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 493577f — wrapped the 1.17.1 entry across multiple lines, dropped the exhaustive strategy enumeration (now reads Base64, Flip, Morse, ROT13, etc.), and kept the key behavior change (content uses converted_value, new original_value audit field) plus the issue link.

…en changelog

- _foundry_result_processor.py: stop forcing converted_value/original_value
  through isinstance(str) when computing content. Bytes / structured
  multimodal payloads now pass through unchanged; the original_value
  audit field is still gated on both sides being str so cross-type
  inequality cannot produce a misleading field.
- test_foundry.py: add test_build_messages_preserves_non_string_payloads
  covering list-of-parts and bytes payloads.
- CHANGELOG.md: wrap the 1.17.1 entry across multiple lines and drop the
  exhaustive strategy enumeration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Community Contribution Community members are working on the issue customer-reported Issues that are reported by GitHub users external to the Azure organization. Evaluation Issues related to the client library for Azure AI Evaluation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Azure AI Evaluation's RedTeam.scan() method decodes **all** encoded attack prompts when storing the result files

3 participants