FEAT: Score partial content from content-filtered responses#1689
FEAT: Score partial content from content-filtered responses#1689jsong468 wants to merge 6 commits intomicrosoft:mainfrom
Conversation
|
Thanks for implementing this feature! I did only a cursory review of the code but read the thorough PR description and the overall design looks good to me. |
| scores = await self._score_async( | ||
| message, | ||
| objective=objective, | ||
| score_blocked_content=score_blocked_content, |
There was a problem hiding this comment.
Just calling out that this addition is not fully back compatible since adding this unconditional forward in _score_async breaks subclasses using the previous signature ((self, message, *, objective=None)). This can be verified by creating a Scorer subclass based on the old signature and a default score_async. That causes a keyword error on score_blocked_content. Since back-compat is called out in the PR description you may want to either call this out, or change this to a conditional forward?
There was a problem hiding this comment.
good point, the logic for substituting partial content can be moved to the public-facing score_async to preserve backward compat.
| return find_objective_metrics_by_eval_hash(eval_hash=eval_hash, file_path=result_file) | ||
|
|
||
| async def _score_async(self, message: Message, *, objective: Optional[str] = None) -> list[Score]: | ||
| async def _score_async( |
There was a problem hiding this comment.
I know this is a design change, but I think this would make more sense as an attribute of the scorer itself.
I think a scorer should be configured if whether or not it should score blocked content (e.g. take it in init). If this were an attribute, _score_async would just check self._score_blocked_content instead of receiving a param, and zero attacks would need to thread it.
And I also think it'd be nice, because we could have default True, which I think we'd mostly want.
WDYT?
There was a problem hiding this comment.
It might also be worth a second opinion here if you are on the fence. It isn't a blocker for me necessarily but I think it's less error prone and easier to debug.
There was a problem hiding this comment.
I actually had this design initially (hence the branch being called partial_blocked_design2 😄 ). The reason why I moved away from it is that to me, it's more intuitive that scoring_blocked_content is an attack runtime scoring policy and not an inherent property of a scorer (i.e., tweaking config level param for your attack without having to re-instantiate scorers and also applying to the objective, refusal, and auxiliary scorers all at once). It also matches the pattern we use for the skip_on_error_result param in that we pass it into each score_async / score_response_async call rather than defining it at the object level. (It also avoids having to change every scorer init signature although that's more minor.)
I do see what you mean by being more error prone since it's a param that needs to be passed around more throughout attacks and not centralized to an instance variable.
All that said, I think there's a good argument for either 😄 let me know if you still lean toward the init option though and I'm okay with implementing that.
There was a problem hiding this comment.
For now, I made some changes to fix the conversation_scorer bug (which I think is a unique case) and also changed private _score_async to no longer take score_blocked_content as a param and centralizing the content substitution to the public score_async method so there's less threading of the param involved.
| # When True, blocked responses that contain partial model output (e.g., from Azure Content Safety | ||
| # triggering mid-generation) will be evaluated by scorers instead of being skipped or | ||
| # auto-classified as failures/refusals. | ||
| score_blocked_content: bool = False |
There was a problem hiding this comment.
If this were an attribute on the scorer, it wouldn't change this
| role_filter: Optional[ChatMessageRole] = None, | ||
| skip_on_error_result: bool = False, | ||
| infer_objective_from_request: bool = False, | ||
| score_blocked_content: bool = False, |
There was a problem hiding this comment.
I think default True is better
There was a problem hiding this comment.
I think default False is preferable for now because having a response filtered is still a meaningful result from the model that should be captured by default rather than silently going past and continuing an attack.
| # Check if the response has a mapped error before attempting normal scoring. | ||
| # This prevents scorer failures when the target returns a blocked/filtered response | ||
| # (e.g., content policy violations from image generation targets). | ||
| if self._error_score_map and response.is_error(): |
There was a problem hiding this comment.
If it's really blocked you'd actually end up in here. The change wouldn't affect scoring at all.
| return None | ||
| parts: list[str] = [] | ||
| for section in response.output: | ||
| if getattr(section, "type", None) == MessagePieceType.MESSAGE: |
There was a problem hiding this comment.
There's a lot of getattr here. The response type isn't actually Any, is it? We should have a well-defined object from the openai SDK and should be able to check section.type etc.
Am I missing something here?
| if use_partial_content and piece.is_blocked() and "partial_content" in piece.prompt_metadata: | ||
| text = str(piece.prompt_metadata["partial_content"]) | ||
| else: | ||
| text = piece.converted_value | ||
| conversation_text += f"{role_display}: {text}\n" |
There was a problem hiding this comment.
We should perhaps tell the scorer that this is partial content because the full response was blocked? Isn't that relevant information?
There was a problem hiding this comment.
I think it's relevant but kind of defeats the point of scoring blocked content. The scorer may be biased in an unknown way by doing that; if the user did not want the partial content to be scored, they can switch flag to False. Otherwise, partial content will be scored as if it were normal content (which is what the flag is for) or else we would have a blurry middle ground.
Description
When OpenAI/Azure content filters trigger mid-generation (HTTP 200 with
finish_reason=content_filter), the model may have already produced partial content before being cut off. Currently, PyRIT discards this partial content and treats the response identically to a full block (HTTP 400), so scorers return hardcoded failures and attacks backtrack. From an adversary's perspective, partial harmful content may constitute a successful attack even though the output was eventually filtered.This PR introduces a
score_blocked_contentflag that allows attacks to opt in to evaluating partial content from blocked responses instead of automatically treating them as failures.Target layer — partial content extraction:
_extract_partial_content(response)template method toOpenAITargetbase class, returningNoneby defaultOpenAIChatTargetoverrides it to extractresponse.choices[0].message.contentOpenAIResponseTargetoverrides it to extract text fromresponse.outputMESSAGE sections_handle_content_filter_responsecalls the hook and attaches the result toprompt_metadata["partial_content"]on the blockedMessagePiecebefore it is persisted to the DBresponse_error="blocked"— full backward compatibilityScorer layer — call-site flag on
score_async:score_blocked_content: bool = Falseparameter toScorer.score_async(),_score_async(),score_response_async(), andscore_response_multiple_scorers_async()True,_score_asynccreates a temporary in-memory substituteMessagePiecefrom the blocked piece'spartial_contentmetadata withconverted_value=<partial text>,converted_value_data_type="text", andresponse_error="none"if response_error == "blocked"short-circuit does not fire and the scorer evaluates the actual contentskip_on_error_result=Trueandscore_blocked_content=True, blocked messages with partial content are not skipped — allowing both flags to coexist. Note that these flags serve different purposes asskip_on_error_result = Truewill still skip other non-content-filter errors. When both are True, real processing/API response errors will be skipped for scoring but partial content from content filter triggers will be used for scoring.Scorereferences the original blocked piece's IDTrueFalseScorer._score_async,FloatScaleThresholdScorer._score_async, andConversationScorer._score_asyncto accept and forward the new parameterConfig + attack layer:
score_blocked_content: bool = FalsetoAttackScoringConfigandTAPAttackScoringConfigAttackScoringConfigstore the flag and pass it through to scoring calls:PromptSendingAttack— passes toScorer.score_response_asyncCrescendoAttack— passes to both refusal scorer and objective scorer callsRedTeamingAttack— passes toscore_asyncMultiPromptSendingAttack— passes toScorer.score_response_asyncTreeOfAttacksWithPruning— passes throughTAPAttackScoringConfig→ node construction → node scoringDesign decisions
Flag on
score_async(call-site), not onScorer.__init__(instance):skip_on_error_resultpattern — a runtime scoring policy parameter, not a scorer property__init__signatures of ~24 scorer subclassesSelfAskScaleScorerused both inside an attack with the flag on and externally with the flag off)Flag on
AttackScoringConfig, not directly on individual attacks:AttackScoringConfigis the single place where all scoring policy lives —objective_scorer,refusal_scorer,auxiliary_scorers,use_score_as_feedbackare already there. Addingscore_blocked_contentkeeps scoring concerns centralized rather than scattering them across attack constructors.AttackScoringConfigin their__init__, so the propagation pattern (self._score_blocked_content = attack_scoring_config.score_blocked_content) is consistent with howuse_score_as_feedbackis handled.Tests
TestCreateTextPieceFromBlocked(6 tests intest_scorer.py) — substitute piece creation, field preservation,response_error="none",Nonewhen no partial contentTestScoreAsyncWithBlockedContent(7 tests intest_scorer.py) — text-only scorer filtering, substitute scoring, refusal scorer short-circuit bypass, no-partial-content handling, mixed pieces, normal piece unaffectedTestSkipOnErrorWithBlockedContent(3 tests intest_scorer.py) — interaction betweenskip_on_error_result=Trueandscore_blocked_content=TrueTestScoreResponseAsyncBlockedContent(3 tests intest_scorer.py) — flag passthrough viascore_response_asyncandscore_response_multiple_scorers_asyncTestAttackScoringConfig(3 tests intest_attack_config.py) — default value, explicitTrue, validation with scorersTestExtractPartialContentChatTarget(4 tests intest_openai_chat_target.py) — extraction from Chat Completions responses,Noneedge casesTestContentFilterPreservesPartialContent(2 tests intest_openai_chat_target.py) — end-to-end: 200 + content_filter preserves metadata, no metadata when no contentTestExtractPartialContentResponseTarget(3 tests intest_openai_response_target.py) — extraction from Response API output sections, non-message section filteringtest_score_response_async_parallel_executionintest_scorer.pyfor new parameter inassert_any_call