FEAT: Score partial content from content-filtered responses by jsong468 · Pull Request #1689 · microsoft/PyRIT

jsong468 · 2026-05-04T22:53:57Z

Description

When OpenAI/Azure content filters trigger mid-generation (HTTP 200 with finish_reason=content_filter), the model may have already produced partial content before being cut off. Currently, PyRIT discards this partial content and treats the response identically to a full block (HTTP 400), so scorers return hardcoded failures and attacks backtrack. From an adversary's perspective, partial harmful content may constitute a successful attack even though the output was eventually filtered.

This PR introduces a score_blocked_content flag that allows attacks to opt in to evaluating partial content from blocked responses instead of automatically treating them as failures.

Target layer — partial content extraction:

- Added an _extract_partial_content(response) template method to the OpenAITarget base class, returning None by default
- OpenAIChatTarget overrides it to extract response.choices[0].message.content
- OpenAIResponseTarget overrides it to extract text from response.output MESSAGE sections
- _handle_content_filter_response calls the hook and attaches the result to prompt_metadata["partial_content"] on the blocked MessagePiece before it is persisted to the DB
- No changes to response_error="blocked", so this is fully backward compatible
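As a rough illustration of the hook described above, here is a minimal sketch. BaseTargetSketch, ChatTargetSketch, and handle_content_filter are hypothetical stand-ins, not the PyRIT classes, and the response objects are built with SimpleNamespace rather than the OpenAI SDK types. Attaching the metadata only when the extracted text is non-empty is an assumption based on the "no metadata when no content" test listed further down.

```python
from types import SimpleNamespace
from typing import Any, Optional


class BaseTargetSketch:
    """Hypothetical stand-in for the OpenAITarget base class."""

    def _extract_partial_content(self, response: Any) -> Optional[str]:
        # Default hook: targets that cannot recover partial output return None.
        return None

    def handle_content_filter(self, response: Any) -> dict:
        # Simplified stand-in for _handle_content_filter_response. The piece is
        # still marked as blocked; partial text is attached only when present.
        piece = {"response_error": "blocked", "prompt_metadata": {}}
        partial = self._extract_partial_content(response)
        if partial:
            piece["prompt_metadata"]["partial_content"] = partial
        return piece


class ChatTargetSketch(BaseTargetSketch):
    """Hypothetical override that reads choices[0].message.content."""

    def _extract_partial_content(self, response: Any) -> Optional[str]:
        choices = getattr(response, "choices", None)
        if not choices:
            return None
        message = getattr(choices[0], "message", None)
        content = getattr(message, "content", None)
        return content or None


def fake_response(content: Optional[str]) -> SimpleNamespace:
    # Mimics an HTTP 200 response that was cut off by the content filter.
    choice = SimpleNamespace(
        finish_reason="content_filter",
        message=SimpleNamespace(content=content),
    )
    return SimpleNamespace(choices=[choice])


blocked_piece = ChatTargetSketch().handle_content_filter(fake_response("Step 1: gather"))
empty_piece = ChatTargetSketch().handle_content_filter(fake_response(None))
```

In this sketch blocked_piece keeps response_error="blocked" and gains a partial_content entry, while empty_piece carries no extra metadata.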

Scorer layer — call-site flag on score_async:

- Added a score_blocked_content: bool = False parameter to Scorer.score_async(), _score_async(), score_response_async(), and score_response_multiple_scorers_async()
- When True, _score_async creates a temporary in-memory substitute MessagePiece from the blocked piece's partial_content metadata with converted_value=<partial text>, converted_value_data_type="text", and response_error="none"
- The substitute replaces the original blocked piece (if already in supported pieces) or is added (if the original was filtered out by the validator). This means:
  - Scorers that accept all types (e.g., refusal scorer) get the substitute instead of the original, so the if response_error == "blocked" short-circuit does not fire and the scorer evaluates the actual content
  - Scorers that only accept text (e.g., task achieved scorer) get the substitute added where the original was filtered out
- When skip_on_error_result=True and score_blocked_content=True, blocked messages with partial content are not skipped, so the two flags can coexist. They serve different purposes: skip_on_error_result=True still skips other, non-content-filter errors. With both set to True, real processing/API response errors are skipped for scoring, while partial content from content-filter triggers is scored.
- The substitute is never persisted to the DB. The resulting Score references the original blocked piece's ID
- Updated TrueFalseScorer._score_async, FloatScaleThresholdScorer._score_async, and ConversationScorer._score_async to accept and forward the new parameter
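The substitution and the interaction between the two flags can be sketched roughly as follows. PieceSketch, text_piece_from_blocked, and pieces_to_score are hypothetical names; the real MessagePiece has many more fields, and treating the skip logic per piece is a simplification made for this example.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class PieceSketch:
    """Hypothetical, trimmed-down stand-in for MessagePiece."""

    id: str
    converted_value: str
    converted_value_data_type: str = "text"
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


def text_piece_from_blocked(piece: PieceSketch) -> Optional[PieceSketch]:
    partial = piece.prompt_metadata.get("partial_content")
    if piece.response_error != "blocked" or not partial:
        return None
    # The copy keeps the original ID, so a score built from it still
    # references the blocked piece. It only lives in memory.
    return replace(
        piece,
        converted_value=str(partial),
        converted_value_data_type="text",
        response_error="none",
    )


def pieces_to_score(pieces, *, skip_on_error_result=False, score_blocked_content=False):
    selected = []
    for piece in pieces:
        substitute = text_piece_from_blocked(piece) if score_blocked_content else None
        if substitute is not None:
            # Partial content wins over skipping.
            selected.append(substitute)
        elif skip_on_error_result and piece.response_error != "none":
            # Other errors, and blocked pieces without partial content, are skipped.
            continue
        else:
            selected.append(piece)
    return selected


pieces = [
    PieceSketch("a", "", "error", "blocked", {"partial_content": "partial text"}),
    PieceSketch("b", "", "error", "blocked"),
    PieceSketch("c", "boom", "error", "processing"),
    PieceSketch("d", "hello"),
]
both_flags = pieces_to_score(pieces, skip_on_error_result=True, score_blocked_content=True)
defaults = pieces_to_score(pieces)
```

With both flags set, only the substitute for piece "a" and the normal piece "d" remain; with the defaults, the list is returned unchanged and the original pieces are never mutated.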

Config + attack layer:

- Added score_blocked_content: bool = False to AttackScoringConfig and TAPAttackScoringConfig
- All attacks that use AttackScoringConfig store the flag and pass it through to scoring calls:
  - PromptSendingAttack: passes to Scorer.score_response_async
  - CrescendoAttack: passes to both refusal scorer and objective scorer calls
  - RedTeamingAttack: passes to score_async
  - MultiPromptSendingAttack: passes to Scorer.score_response_async
  - TreeOfAttacksWithPruning: passes through TAPAttackScoringConfig → node construction → node scoring

Design decisions

Flag on score_async (call-site), not on Scorer.__init__ (instance):

Follows the existing skip_on_error_result pattern — a runtime scoring policy parameter, not a scorer property
Avoids modifying __init__ signatures of ~24 scorer subclasses
Allows the same scorer instance to be used with different policies at different call sites (e.g., the same SelfAskScaleScorer used both inside an attack with the flag on and externally with the flag off)

Flag on AttackScoringConfig, not directly on individual attacks:

AttackScoringConfig is the single place where all scoring policy lives — objective_scorer, refusal_scorer, auxiliary_scorers, use_score_as_feedback are already there. Adding score_blocked_content keeps scoring concerns centralized rather than scattering them across attack constructors.
All attacks already read from AttackScoringConfig in their __init__, so the propagation pattern (self._score_blocked_content = attack_scoring_config.score_blocked_content) is consistent with how use_score_as_feedback is handled.

Users configure scoring behavior in one place:

AttackScoringConfig(
    objective_scorer=my_scorer,
    refusal_scorer=my_refusal_scorer,
    score_blocked_content=True,  # alongside other scoring policy
)
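To make the propagation pattern concrete, here is a small sketch under stated assumptions: RecordingScorer, ScoringConfigSketch, and AttackSketch are hypothetical stand-ins rather than the PyRIT classes, and score_turn is an invented method name. It shows the attack reading the flag once in __init__ and passing it at each call site, while the same scorer instance can still be called elsewhere with the default policy.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


class RecordingScorer:
    """Hypothetical scorer stand-in that records the flag it was called with."""

    def __init__(self) -> None:
        self.seen_flags: list = []

    async def score_async(self, message: str, *, score_blocked_content: bool = False) -> list:
        self.seen_flags.append(score_blocked_content)
        return []


@dataclass
class ScoringConfigSketch:
    """Hypothetical stand-in for AttackScoringConfig."""

    objective_scorer: RecordingScorer
    refusal_scorer: Optional[RecordingScorer] = None
    score_blocked_content: bool = False


class AttackSketch:
    def __init__(self, *, attack_scoring_config: ScoringConfigSketch) -> None:
        self._objective_scorer = attack_scoring_config.objective_scorer
        self._refusal_scorer = attack_scoring_config.refusal_scorer
        # Read once from the config, like use_score_as_feedback.
        self._score_blocked_content = attack_scoring_config.score_blocked_content

    async def score_turn(self, response: str) -> None:
        # One policy applies to every scorer the attack uses.
        for scorer in (self._refusal_scorer, self._objective_scorer):
            if scorer is not None:
                await scorer.score_async(
                    response, score_blocked_content=self._score_blocked_content
                )


objective, refusal = RecordingScorer(), RecordingScorer()
config = ScoringConfigSketch(objective, refusal, score_blocked_content=True)
asyncio.run(AttackSketch(attack_scoring_config=config).score_turn("partial output"))

# The same scorer instance, used outside the attack with the default policy.
asyncio.run(objective.score_async("unrelated message"))
```

Here the objective scorer sees the flag as True inside the attack and False outside it, without being re-instantiated.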

Tests

TestCreateTextPieceFromBlocked (6 tests in test_scorer.py) — substitute piece creation, field preservation, response_error="none", None when no partial content
TestScoreAsyncWithBlockedContent (7 tests in test_scorer.py) — text-only scorer filtering, substitute scoring, refusal scorer short-circuit bypass, no-partial-content handling, mixed pieces, normal piece unaffected
TestSkipOnErrorWithBlockedContent (3 tests in test_scorer.py) — interaction between skip_on_error_result=True and score_blocked_content=True
TestScoreResponseAsyncBlockedContent (3 tests in test_scorer.py) — flag passthrough via score_response_async and score_response_multiple_scorers_async
TestAttackScoringConfig (3 tests in test_attack_config.py) — default value, explicit True, validation with scorers
TestExtractPartialContentChatTarget (4 tests in test_openai_chat_target.py) — extraction from Chat Completions responses, None edge cases
TestContentFilterPreservesPartialContent (2 tests in test_openai_chat_target.py) — end-to-end: 200 + content_filter preserves metadata, no metadata when no content
TestExtractPartialContentResponseTarget (3 tests in test_openai_response_target.py) — extraction from Response API output sections, non-message section filtering
Updated test_score_response_async_parallel_execution in test_scorer.py for new parameter in assert_any_call

fdubut · 2026-05-05T00:02:10Z

Thanks for implementing this feature! I did only a cursory review of the code but read the thorough PR description and the overall design looks good to me.

adrian-gavrila · 2026-05-05T21:22:12Z

            scores = await self._score_async(
                message,
                objective=objective,
+                score_blocked_content=score_blocked_content,


Just calling out that this addition is not fully backward compatible: the unconditional forward to _score_async breaks subclasses that override it with the previous signature ((self, message, *, objective=None)). This can be verified by creating a Scorer subclass based on the old signature with the default score_async. The call then fails with a TypeError for the unexpected keyword argument score_blocked_content. Since backward compatibility is called out in the PR description, you may want to either note this there or change this to a conditional forward?

Good point, the logic for substituting partial content can be moved to the public-facing score_async to preserve backward compatibility.
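The break described in this thread can be reproduced with a few lines. The classes below are hypothetical stand-ins for the Scorer hierarchy, not PyRIT code; ForwardingBase mirrors the unconditional forward, and CentralizedBase mirrors the proposed fix of keeping the flag in the public method.

```python
import asyncio
from typing import Optional


class ForwardingBase:
    """Hypothetical base whose public method forwards the new keyword unconditionally."""

    async def score_async(
        self, message: str, *, objective: Optional[str] = None, score_blocked_content: bool = False
    ) -> list:
        return await self._score_async(
            message, objective=objective, score_blocked_content=score_blocked_content
        )


class CentralizedBase:
    """Hypothetical base that consumes the flag in the public method only."""

    async def score_async(
        self, message: str, *, objective: Optional[str] = None, score_blocked_content: bool = False
    ) -> list:
        # The partial-content substitution would happen here, before the
        # private hook is called with its original signature.
        return await self._score_async(message, objective=objective)


class LegacyHook:
    """Mixin with the pre-PR private signature (no score_blocked_content)."""

    async def _score_async(self, message: str, *, objective: Optional[str] = None) -> list:
        return [f"legacy:{message}"]


class LegacyOnForwarding(LegacyHook, ForwardingBase):
    pass


class LegacyOnCentralized(LegacyHook, CentralizedBase):
    pass


def outcome(scorer):
    try:
        return asyncio.run(scorer.score_async("hi", score_blocked_content=True))
    except TypeError:
        # Raised because the old hook does not accept the new keyword.
        return "TypeError"


forwarding_outcome = outcome(LegacyOnForwarding())
centralized_outcome = outcome(LegacyOnCentralized())
```

The subclass with the old signature fails on the forwarding base and keeps working on the centralized one.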

rlundeen2 · 2026-05-05T21:43:53Z

        return find_objective_metrics_by_eval_hash(eval_hash=eval_hash, file_path=result_file)

-    async def _score_async(self, message: Message, *, objective: Optional[str] = None) -> list[Score]:
+    async def _score_async(


I know this is a design change, but I think this would make more sense as an attribute of the scorer itself.

I think a scorer should be configured with whether or not it should score blocked content (e.g., take it in init). If this were an attribute, _score_async would just check self._score_blocked_content instead of receiving a param, and zero attacks would need to thread it.

And I also think it'd be nice, because we could have default True, which I think we'd mostly want.

WDYT?

It might also be worth a second opinion here if you are on the fence. It isn't a blocker for me necessarily but I think it's less error prone and easier to debug.

I actually had this design initially (hence the branch being called partial_blocked_design2 😄 ). The reason I moved away from it is that, to me, it's more intuitive for score_blocked_content to be an attack runtime scoring policy rather than an inherent property of a scorer (i.e., you tweak a config-level param for your attack without having to re-instantiate scorers, and it applies to the objective, refusal, and auxiliary scorers all at once). It also matches the pattern we use for the skip_on_error_result param, which we pass into each score_async / score_response_async call rather than defining it at the object level. (It also avoids having to change every scorer init signature, although that's more minor.)

I do see what you mean by being more error prone since it's a param that needs to be passed around more throughout attacks and not centralized to an instance variable.

All that said, I think there's a good argument for either 😄 let me know if you still lean toward the init option though and I'm okay with implementing that.

For now, I made some changes to fix the conversation_scorer bug (which I think is a unique case). I also changed the private _score_async to no longer take score_blocked_content as a param and centralized the content substitution in the public score_async method, so there's less threading of the param involved.

rlundeen2 · 2026-05-05T21:46:50Z

+    # When True, blocked responses that contain partial model output (e.g., from Azure Content Safety
+    # triggering mid-generation) will be evaluated by scorers instead of being skipped or
+    # auto-classified as failures/refusals.
+    score_blocked_content: bool = False


If this were an attribute on the scorer, it wouldn't change this

rlundeen2 · 2026-05-06T00:08:21Z

        role_filter: Optional[ChatMessageRole] = None,
        skip_on_error_result: bool = False,
        infer_objective_from_request: bool = False,
+        score_blocked_content: bool = False,


I think default True is better

I think default False is preferable for now because having a response filtered is still a meaningful result from the model that should be captured by default rather than silently going past and continuing an attack.

romanlutz · 2026-05-06T11:36:57Z

        # Check if the response has a mapped error before attempting normal scoring.
        # This prevents scorer failures when the target returns a blocked/filtered response
        # (e.g., content policy violations from image generation targets).
        if self._error_score_map and response.is_error():


If it's really blocked you'd actually end up in here. The change wouldn't affect scoring at all.

romanlutz · 2026-05-06T11:48:32Z

+                return None
+            parts: list[str] = []
+            for section in response.output:
+                if getattr(section, "type", None) == MessagePieceType.MESSAGE:


There's a lot of getattr here. The response type isn't actually Any, is it? We should have a well-defined object from the openai SDK and should be able to check section.type etc.

Am I missing something here?
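For comparison, a version with direct attribute access might look like the sketch below. The dataclasses are hypothetical stand-ins for the SDK's response objects, the "message" and "output_text" type strings and the text field are assumptions about their shape, and joining parts with a newline is an arbitrary choice for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextPart:
    """Hypothetical stand-in for a text content part."""

    text: str
    type: str = "output_text"


@dataclass
class MessageSection:
    """Hypothetical stand-in for a message section in response.output."""

    content: list
    type: str = "message"


@dataclass
class ReasoningSection:
    """Hypothetical non-message section that should be ignored."""

    summary: list = field(default_factory=list)
    type: str = "reasoning"


@dataclass
class ResponseSketch:
    output: list


def extract_partial_text(response: ResponseSketch) -> Optional[str]:
    if not response.output:
        return None
    parts: list = []
    for section in response.output:
        # Typed objects allow plain attribute access instead of getattr.
        if section.type != "message":
            continue
        for part in section.content:
            if part.type == "output_text" and part.text:
                parts.append(part.text)
    return "\n".join(parts) or None


mixed = ResponseSketch(
    output=[ReasoningSection(), MessageSection(content=[TextPart("first"), TextPart("second")])]
)
joined_text = extract_partial_text(mixed)
no_text = extract_partial_text(ResponseSketch(output=[ReasoningSection()]))
```

Non-message sections are filtered out, and a response without any message text yields None.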

romanlutz · 2026-05-06T11:50:46Z

+                    if use_partial_content and piece.is_blocked() and "partial_content" in piece.prompt_metadata:
+                        text = str(piece.prompt_metadata["partial_content"])
+                    else:
+                        text = piece.converted_value
+                    conversation_text += f"{role_display}: {text}\n"


We should perhaps tell the scorer that this is partial content because the full response was blocked? Isn't that relevant information?

I think it's relevant, but it kind of defeats the point of scoring blocked content. The scorer may be biased in an unknown way by that information. If the user does not want the partial content to be scored, they can switch the flag to False. Otherwise, partial content is scored as if it were normal content (which is what the flag is for); anything else would leave us with a blurry middle ground.

jsong468 added 3 commits May 4, 2026 15:24

score blocked content

668512d

docstring

e6fae92

merge conflicts

3beff4a

jsong468 changed the title from "score blocked content" to "FEAT: Score partial content from content-filtered responses" May 4, 2026

fix unit tests

9a4505a

adrian-gavrila reviewed May 5, 2026

rlundeen2 reviewed May 5, 2026

Comment thread pyrit/score/conversation_scorer.py

rlundeen2 reviewed May 5, 2026

jsong468 added 2 commits May 5, 2026 16:53

fix conversation_scorer bug and score_async

fc6c7e7

minor truthiness change

c49debc

rlundeen2 reviewed May 6, 2026

romanlutz reviewed May 6, 2026

Comment thread pyrit/prompt_target/openai/openai_target.py

romanlutz reviewed May 6, 2026

Comment thread tests/unit/prompt_target/target/test_openai_chat_target.py


5 participants
