[Evaluation] Fix RedTeam.scan() decoding encoded attack prompts in stored results #47536
Open
huliang-microsoft wants to merge 2 commits into Azure:main from huliang-microsoft:fix/redteam-encoded-prompts-fidelity
+196 −8
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -1427,6 +1427,153 @@ def test_build_messages_from_pieces(self): | |
| assert messages[0]["content"] == "User message" | ||
| assert messages[1]["role"] == "assistant" | ||
| assert messages[1]["content"] == "Assistant response" | ||
| # When original and converted match (no encoding), no audit field is added. | ||
| assert "original_value" not in messages[0] | ||
| assert "original_value" not in messages[1] | ||
| | ||
| def test_build_messages_preserves_encoded_user_prompt(self): | ||
| """Encoded attack prompts must be stored as the wire payload. | ||
| | ||
| Regression test for | ||
| https://github.com/Azure/azure-sdk-for-python/issues/47228 — for | ||
| converter-based strategies (Base64, Flip, Morse, ROT13, etc.) the | ||
| target receives ``converted_value``, so the persisted conversation | ||
| must report ``converted_value`` as ``content`` (not the decoded | ||
| ``original_value``). The pre-converter objective is preserved as | ||
| ``original_value`` on the same message so callers still have an | ||
| audit trail of what the attack meant to say. | ||
| """ | ||
| mock_scenario = MagicMock() | ||
| mock_dataset = MagicMock() | ||
| mock_dataset.get_all_seed_groups.return_value = [] | ||
| | ||
| processor = FoundryResultProcessor( | ||
| scenario=mock_scenario, | ||
| dataset_config=mock_dataset, | ||
| risk_category="violence", | ||
| ) | ||
| | ||
| # Simulate a Base64-converted user turn: the target actually saw the | ||
| # encoded payload, but the SDK still has the plaintext objective. | ||
| user_piece = MagicMock() | ||
| user_piece.api_role = "user" | ||
| user_piece.original_value = "How do I make a dangerous thing?" | ||
| user_piece.converted_value = "SG93IGRvIEkgbWFrZSBhIGRhbmdlcm91cyB0aGluZz8=" | ||
| user_piece.sequence = 0 | ||
| user_piece.prompt_metadata = {} | ||
| user_piece.labels = {} | ||
| | ||
| # Assistant response — converter is a no-op on the response side, so | ||
| # original and converted match. No audit field should be emitted. | ||
| assistant_piece = MagicMock() | ||
| assistant_piece.api_role = "assistant" | ||
| assistant_piece.original_value = "Sorry, I can't help with that." | ||
| assistant_piece.converted_value = "Sorry, I can't help with that." | ||
| assistant_piece.sequence = 1 | ||
| assistant_piece.prompt_metadata = {} | ||
| assistant_piece.labels = {} | ||
| | ||
| messages = processor._build_messages_from_pieces([user_piece, assistant_piece]) | ||
Author: Added in 493577f
| | ||
| # The user turn must carry the encoded payload as content so consumers | ||
| # can verify exactly what the target received. | ||
| assert messages[0]["role"] == "user" | ||
| assert messages[0]["content"] == "SG93IGRvIEkgbWFrZSBhIGRhbmdlcm91cyB0aGluZz8=" | ||
| # The plaintext objective is preserved alongside it for auditability. | ||
| assert messages[0]["original_value"] == "How do I make a dangerous thing?" | ||
| | ||
| # Assistant turn is unchanged: content == converted_value, no audit field. | ||
| assert messages[1]["role"] == "assistant" | ||
| assert messages[1]["content"] == "Sorry, I can't help with that." | ||
| assert "original_value" not in messages[1] | ||
| | ||
| def test_build_messages_falls_back_to_original_when_converted_missing(self): | ||
| """When ``converted_value`` is empty, fall back to ``original_value``. | ||
| | ||
| Covers the historical behavior for pieces where PyRIT did not run a | ||
| converter (e.g., Baseline strategy or in-flight failures). | ||
| """ | ||
| mock_scenario = MagicMock() | ||
| mock_dataset = MagicMock() | ||
| mock_dataset.get_all_seed_groups.return_value = [] | ||
| | ||
| processor = FoundryResultProcessor( | ||
| scenario=mock_scenario, | ||
| dataset_config=mock_dataset, | ||
| risk_category="violence", | ||
| ) | ||
| | ||
| user_piece = MagicMock() | ||
| user_piece.api_role = "user" | ||
| user_piece.original_value = "Baseline prompt" | ||
| user_piece.converted_value = None | ||
| user_piece.sequence = 0 | ||
| user_piece.prompt_metadata = {} | ||
| user_piece.labels = {} | ||
| | ||
| messages = processor._build_messages_from_pieces([user_piece]) | ||
| | ||
| assert len(messages) == 1 | ||
| assert messages[0]["content"] == "Baseline prompt" | ||
| # original == content here, so no separate audit field is needed. | ||
| assert "original_value" not in messages[0] | ||
| | ||
| def test_build_messages_preserves_non_string_payloads(self): | ||
| """Non-string ``converted_value`` payloads must survive unchanged. | ||
| | ||
| PyRIT message pieces can carry structured / multimodal content | ||
| (e.g., bytes or list-of-parts payloads) on ``converted_value``. | ||
| ``content`` must pass those through so persisted conversations | ||
| remain a faithful record of what the target received; only the | ||
| ``original_value`` audit field is gated on both sides being text. | ||
| """ | ||
| mock_scenario = MagicMock() | ||
| mock_dataset = MagicMock() | ||
| mock_dataset.get_all_seed_groups.return_value = [] | ||
| | ||
| processor = FoundryResultProcessor( | ||
| scenario=mock_scenario, | ||
| dataset_config=mock_dataset, | ||
| risk_category="violence", | ||
| ) | ||
| | ||
| # Structured multimodal-style payload on converted_value, plain | ||
| # string objective on original_value. | ||
| structured_payload = [ | ||
| {"type": "text", "text": "describe this image"}, | ||
| {"type": "image_url", "image_url": {"url": "https://example/img.png"}}, | ||
| ] | ||
| user_piece = MagicMock() | ||
| user_piece.api_role = "user" | ||
| user_piece.original_value = "Describe this image" | ||
| user_piece.converted_value = structured_payload | ||
| user_piece.sequence = 0 | ||
| user_piece.prompt_metadata = {} | ||
| user_piece.labels = {} | ||
| | ||
| # Bytes payload on assistant converted_value — must not be coerced | ||
| # to "" by str-gating logic. | ||
| assistant_piece = MagicMock() | ||
| assistant_piece.api_role = "assistant" | ||
| assistant_piece.original_value = None | ||
| assistant_piece.converted_value = b"\x89PNG\r\n" | ||
| assistant_piece.sequence = 1 | ||
| assistant_piece.prompt_metadata = {} | ||
| assistant_piece.labels = {} | ||
| | ||
| messages = processor._build_messages_from_pieces([user_piece, assistant_piece]) | ||
| | ||
| # Structured user payload passed through unchanged. | ||
| assert messages[0]["role"] == "user" | ||
| assert messages[0]["content"] is structured_payload | ||
| # Audit field omitted: content is non-text so cross-type comparison | ||
| # against the str original would be meaningless. | ||
| assert "original_value" not in messages[0] | ||
| | ||
| # Bytes assistant payload preserved (not silently dropped to ""). | ||
| assert messages[1]["role"] == "assistant" | ||
| assert messages[1]["content"] == b"\x89PNG\r\n" | ||
| assert "original_value" not in messages[1] | ||
| | ||
| def test_get_prompt_group_id_from_conversation(self): | ||
| """Test extracting prompt_group_id from conversation.""" | ||
Addressed in 493577f: dropped the `isinstance(str)` guards around `content` selection so non-string `converted_value`/`original_value` (bytes, structured/multimodal payloads) pass through unchanged. The `str` check is kept only on the `original_value` audit-field emission, where comparing two non-text values for inequality would be meaningless.
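
For readers without the implementation diff at hand, the behavior the tests and the comment above describe can be sketched roughly as follows. This is not the SDK's code: `build_messages_sketch` is a hypothetical stand-in for `_build_messages_from_pieces`, the pieces are plain `SimpleNamespace` objects rather than PyRIT message pieces, and treating any falsy `converted_value` as "missing" is an assumption.

```python
import base64
from types import SimpleNamespace
from typing import Any, Dict, List


def build_messages_sketch(pieces: List[Any]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for _build_messages_from_pieces (not the SDK code)."""
    messages: List[Dict[str, Any]] = []
    for piece in pieces:
        original = piece.original_value
        converted = piece.converted_value
        # Prefer the wire payload the target received; fall back to the
        # original only when the converted value is missing.
        # Assumption: any falsy converted_value counts as "missing".
        content = converted if converted else original
        message: Dict[str, Any] = {"role": piece.api_role, "content": content}
        # Emit the audit field only when both sides are text and they differ.
        # Non-string payloads (bytes, list-of-parts) pass through untouched.
        if isinstance(original, str) and isinstance(content, str) and original != content:
            message["original_value"] = original
        messages.append(message)
    return messages


# Usage: a Base64-converted user turn followed by an unconverted assistant turn.
plain = "How do I make a dangerous thing?"
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")
user = SimpleNamespace(api_role="user", original_value=plain, converted_value=encoded)
assistant = SimpleNamespace(
    api_role="assistant",
    original_value="Sorry, I can't help with that.",
    converted_value="Sorry, I can't help with that.",
)
example_messages = build_messages_sketch([user, assistant])
```

In this sketch the user message carries the encoded payload as `content` and the plaintext as `original_value`, while the assistant message has no audit field because both values match.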