[Evaluation] Fix RedTeam.scan() decoding encoded attack prompts in stored results#47536
huliang-microsoft wants to merge 2 commits into
For converter-based attack strategies (Base64, Flip, Morse, ROT13, Caesar, Leetspeak, AsciiArt, AnsiAttack, Atbash, Binary, CharacterSpace, CharSwap, Diacritic, StringJoin, SuffixAppend, UnicodeConfusable, UnicodeSubstitution, Url, AsciiSmuggler, Tense), FoundryResultProcessor was emitting the decoded 'original_value' as the user-message content while the target was actually receiving 'converted_value'. This made evaluation_results.json / results.json show plaintext where the audit trail should show the encoded payload, breaking post-scan auditability and per-variant debugging. This change makes conversation[].content always reflect the on-wire value (converted_value) for both user and assistant turns, and preserves the pre-converter objective as a sibling 'original_value' field on user messages whenever it differs. Baseline (non-encoded) strategies are unaffected since original_value == converted_value. Adds two regression tests in TestFoundryResultProcessor and a CHANGELOG entry. Resolves Azure#47228.
Thank you for your contribution @huliang-microsoft! We will review the pull request and get back to you soon.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes persisted red-team conversations so they reflect the actual encoded payload sent to the target (converter output), while preserving the pre-conversion prompt for auditability.
Changes:
- Update `FoundryResultProcessor._build_messages_from_pieces()` to store `converted_value` as `content` and add `original_value` only when it differs.
- Add unit tests covering encoded user prompts and fallback behavior when `converted_value` is missing.
- Document the bug fix and new persisted field in the changelog.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_foundry_result_processor.py | Changes message serialization to persist wire payload and optionally include original_value for auditing. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_redteam/test_foundry.py | Adds regression + behavior tests for encoded prompts and fallback behavior. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Documents the behavior change and new original_value field for persisted conversations. |
| original = getattr(piece, "original_value", None) | ||
| converted = getattr(piece, "converted_value", None) | ||
| if isinstance(converted, str) and converted: | ||
| content = converted | ||
| elif isinstance(original, str) and original: | ||
| content = original | ||
| else: | ||
| content = getattr(piece, "converted_value", None) or getattr(piece, "original_value", "") | ||
| content = "" |
Addressed in 493577f — dropped the isinstance(str) guards around content selection so non-string converted_value / original_value (bytes, structured/multimodal payloads) pass through unchanged. The str check is kept only on the original_value audit-field emission, where comparing two non-text values for inequality would be meaningless.
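The split described in this reply can be sketched as two small helpers. This is a minimal illustration, not the code in `_foundry_result_processor.py`: `select_content` and `audit_original` are hypothetical names, and the truthiness-based fallback is an assumption drawn from the diff context above.

```python
from typing import Any, Optional


def select_content(original: Any, converted: Any) -> Any:
    """Pick the persisted content without forcing it to be a str (hypothetical helper)."""
    # Any truthy payload (str, bytes, list of parts) passes through unchanged.
    return converted or original or ""


def audit_original(original: Any, content: Any) -> Optional[str]:
    """Return the audit value only when both sides are text and they differ."""
    if isinstance(original, str) and isinstance(content, str) and original != content:
        return original
    # Cross-type comparisons (for example str vs bytes) never produce an audit field.
    return None


parts = [{"type": "text", "text": "hi"}]
list_content = select_content(None, parts)
bytes_content = select_content("x", b"\x00raw")
```

Under these assumptions a list-of-parts or bytes payload reaches `content` untouched, while the audit field stays limited to text-to-text comparisons.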
| assistant_piece.prompt_metadata = {} | ||
| assistant_piece.labels = {} | ||
| messages = processor._build_messages_from_pieces([user_piece, assistant_piece]) |
Added in 493577f — test_build_messages_preserves_non_string_payloads covers a list-of-parts user payload and a bytes assistant payload, asserting both survive on content without being coerced to "".
| ### Bugs Fixed | ||
| - Fixed `RedTeam.scan()` storing decoded plaintext instead of the actual encoded payload for converter-based attack strategies (`Base64`, `Flip`, `Morse`, `ROT13`, `Caesar`, `Leetspeak`, `AsciiArt`, `AnsiAttack`, `Atbash`, `Binary`, `CharacterSpace`, `CharSwap`, `Diacritic`, `StringJoin`, `SuffixAppend`, `UnicodeConfusable`, `UnicodeSubstitution`, `Url`, `AsciiSmuggler`, `Tense`) in `evaluation_results.json` / `results.json`. The persisted `conversation[].content` for user turns now reflects what the target actually received (`converted_value`); the pre-converter adversarial objective is preserved on the same message as a new `original_value` field so the audit trail of what the attack meant to say is not lost. Baseline (non-encoded) strategies are unaffected. Resolves [Azure/azure-sdk-for-python#47228](https://github.com/Azure/azure-sdk-for-python/issues/47228). |
Addressed in 493577f — wrapped the 1.17.1 entry across multiple lines, dropped the exhaustive strategy enumeration (now reads Base64, Flip, Morse, ROT13, etc.), and kept the key behavior change (content uses converted_value, new original_value audit field) plus the issue link.
…en changelog

- _foundry_result_processor.py: stop forcing converted_value/original_value through isinstance(str) when computing content. Bytes / structured multimodal payloads now pass through unchanged; the original_value audit field is still gated on both sides being str so cross-type inequality cannot produce a misleading field.
- test_foundry.py: add test_build_messages_preserves_non_string_payloads covering list-of-parts and bytes payloads.
- CHANGELOG.md: wrap the 1.17.1 entry across multiple lines and drop the exhaustive strategy enumeration.
Fixes #47228
Problem
`RedTeam.scan()` was storing the decoded plaintext objective in `evaluation_results.json` / `results.json` for every converter-based attack strategy (`Base64`, `Flip`, `Morse`, `ROT13`, `Caesar`, `Leetspeak`, `AsciiArt`, `AnsiAttack`, `Atbash`, `Binary`, `CharacterSpace`, `CharSwap`, `Diacritic`, `StringJoin`, `SuffixAppend`, `UnicodeConfusable`, `UnicodeSubstitution`, `Url`, `AsciiSmuggler`, `Tense`). The target callback actually received the encoded `converted_value`, but the persisted `attack_details[].conversation[].content` was the pre-converter `original_value`. Customer impact (from #47228): `attack_technique` metadata correctly said e.g. `"base64"`, but `conversation` content was plaintext, breaking reproducibility.
Root cause
In `FoundryResultProcessor._build_messages_from_pieces`, user turns deliberately preferred `original_value` over `converted_value`, dropping the on-wire payload from the persisted conversation.
Fix
- `content` now always reflects what was sent on the wire (`converted_value`, falling back to `original_value` only when `converted_value` is empty) for both user and assistant turns.
- The pre-converter objective is kept in a new `original_value` field on user messages only when it differs from `content`, so the audit trail of "what the attack meant to say" is not lost and baseline (non-encoded) strategies stay byte-identical to the old output.
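The selection rule above can be sketched for a single conversation piece as follows. This is a minimal sketch under stated assumptions, not the actual body of `_build_messages_from_pieces`: `build_message` is a hypothetical stand-in, and `SimpleNamespace` objects stand in for the real piece type.

```python
from types import SimpleNamespace
from typing import Any, Dict


def build_message(piece: Any, role: str) -> Dict[str, Any]:
    """Hypothetical stand-in for how one piece becomes a persisted message."""
    original = getattr(piece, "original_value", None)
    converted = getattr(piece, "converted_value", None)

    # content is the on-wire value; fall back to the original only when
    # converted_value is empty or missing.
    content = converted or original or ""
    message: Dict[str, Any] = {"role": role, "content": content}

    # Emit the audit field on user turns only, and only when it adds information.
    if (
        role == "user"
        and isinstance(original, str)
        and isinstance(content, str)
        and original
        and original != content
    ):
        message["original_value"] = original
    return message


encoded = build_message(SimpleNamespace(original_value="hello", converted_value="aGVsbG8="), "user")
baseline = build_message(SimpleNamespace(original_value="hello", converted_value="hello"), "user")
fallback = build_message(SimpleNamespace(original_value="hello", converted_value=None), "user")
```

In this sketch only the encoded case gains an `original_value` key; the baseline and fallback cases produce the same two-key message, which matches the claim that non-encoded strategies are unaffected.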
Tests
- Updated `test_build_messages_from_pieces` to assert no `original_value` field appears when there is no encoding (no regression for baseline).
- `test_build_messages_preserves_encoded_user_prompt` (Base64 example): primary regression test for "Azure AI Evaluation's `RedTeam.scan()` method decodes **all** encoded attack prompts when storing the result files" (#47228).
- `test_build_messages_falls_back_to_original_when_converted_missing`: covers the `converted_value is None` path.
CHANGELOG
Added a `1.17.1 (Unreleased)` entry under `### Bugs Fixed`.
cc @singankit @w-javed