
Fix: In-Group Evals#2057

Open
hubert-marek wants to merge 6 commits into PrimeIntellect-ai:main from hubert-marek:eval-fixes


@hubert-marek hubert-marek commented Mar 20, 2026

Fix eval grouping and preserve environment-declared state columns during deferred scoring.

Description

Two related fixes in vf_utils.py:

1. Fix eval group construction

evaluate() was passing rollouts_per_example to _get_eval_inputs() (which repeats examples) and then setting rollouts_per_example=1 on generate(). This meant each "group" had only one rollout, breaking group-based scoring and pass@k semantics.

Fixed by passing rollouts_per_example=1 to _get_eval_inputs() to get unique examples, then passing the real rollouts_per_example to generate() so groups are constructed correctly.
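The difference between the two call orders can be illustrated with a toy model. This is only a sketch: `get_eval_inputs` and `generate` below are hypothetical stand-ins for `_get_eval_inputs()` and `generate()`, and the assumption that `generate()` forms one group per input row is made for illustration rather than taken from the actual implementation.

```python
def get_eval_inputs(examples, rollouts_per_example):
    """Hypothetical stand-in for _get_eval_inputs(): repeat each example."""
    return [ex for ex in examples for _ in range(rollouts_per_example)]


def generate(inputs, rollouts_per_example):
    """Hypothetical stand-in for generate(): one group of rollouts per input row."""
    return [
        [{"example_id": row["example_id"], "rollout": i} for i in range(rollouts_per_example)]
        for row in inputs
    ]


examples = [{"example_id": 0}, {"example_id": 1}]
k = 4  # the real rollouts_per_example

# Old behavior: examples are repeated up front, then each row gets a single rollout.
before = generate(get_eval_inputs(examples, k), rollouts_per_example=1)

# New behavior: unique examples, and generate() builds groups of size k.
after = generate(get_eval_inputs(examples, 1), rollouts_per_example=k)

group_sizes_before = [len(group) for group in before]
group_sizes_after = [len(group) for group in after]
```

Under these assumptions both orders produce the same total number of rollouts, but only the second yields one group of `k` rollouts per unique example, which is what group-based scoring and pass@k need.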

Fixes: [image]

2. Preserve environment-declared state columns

run_rollout() now reads env.state_columns (if present) and includes those fields in the serialization columns. This allows environments to declare which custom state fields (e.g. eval_score, eval_error) should survive state_to_output() serialization, so deferred group scoring can access them without the orchestrator needing to hardcode environment-specific field names.

When running under EnvGroup, run_rollout() resolves the task-specific sub-environment via get_env_for_task() before reading state_columns.

The change falls back gracefully: if the environment has no state_columns attribute, getattr returns [] and behavior is unchanged.
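The resolution and fallback described above can be sketched as follows. The classes `ToyEnv` and `ToyEnvGroup`, the helper `resolve_columns`, and the contents of `BASE_COLUMNS` are all hypothetical and exist only for illustration; the actual code in `run_rollout()` may be structured differently.

```python
# Hypothetical default serialization columns; the real list is not shown in this PR.
BASE_COLUMNS = ["prompt", "completion", "reward"]


class ToyEnv:
    """Toy environment that may or may not declare state_columns."""

    def __init__(self, state_columns=None):
        if state_columns is not None:
            self.state_columns = state_columns


class ToyEnvGroup:
    """Toy stand-in for EnvGroup: maps a task name to a sub-environment."""

    def __init__(self, envs):
        self.envs = envs

    def get_env_for_task(self, task):
        return self.envs[task]


def resolve_columns(env, task=None):
    # Resolve the task-specific sub-environment when the env supports it.
    target = env.get_env_for_task(task) if hasattr(env, "get_env_for_task") else env
    # Environments without the attribute fall back to an empty list.
    extra = getattr(target, "state_columns", [])
    return [*BASE_COLUMNS, *extra]


group = ToyEnvGroup({"math": ToyEnv(["eval_score", "eval_error"]), "code": ToyEnv()})
math_columns = resolve_columns(group, "math")
code_columns = resolve_columns(group, "code")
plain_columns = resolve_columns(ToyEnv())
```

In this sketch only the `math` sub-environment adds `eval_score` and `eval_error`; the other two calls return the base columns unchanged, matching the no-op fallback.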

Companion PR: PrimeIntellect-ai/verifiers#1054

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running uv run pytest locally.
  • New tests have been added to cover the changes

Verified locally that state_columns declared by an environment are correctly appended to serialization columns. The eval grouping fix was validated by confirming _get_eval_inputs(..., 1) produces unique examples and generate() constructs proper groups.

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

The state_columns feature depends on PrimeIntellect-ai/verifiers#1054, which adds the state_columns attribute to Environment.__init__. Without that PR, the getattr fallback returns [] and the behavior is identical to today.


Note

Medium Risk
Changes evaluation batching semantics and rollout serialization columns, which can affect pass@k/group-scoring correctness and downstream metrics, but is localized to vf_utils wrappers.

Overview
Fixes eval grouping so evaluate() builds true groups of size rollouts_per_example by requesting unique eval inputs (_get_eval_inputs(..., rollouts_per_example=1)) and passing the real rollouts_per_example through to generate().

Updates run_rollout() to preserve environment-declared custom state fields by appending task-specific state_columns (via get_env_for_task() when available) to the serialized output columns, ensuring deferred/group scoring can access them.

Written by Cursor Bugbot for commit ef25835. This will update automatically on new commits.

@hubert-marek hubert-marek changed the title from "Fix In-Group Evals" to "Fix: In-Group Evals" on Mar 24, 2026
@hubert-marek hubert-marek marked this pull request as ready for review March 24, 2026 12:22

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread src/prime_rl/orchestrator/vf_utils.py Outdated
@hubert-marek hubert-marek marked this pull request as draft March 24, 2026 12:39
@hubert-marek hubert-marek marked this pull request as ready for review March 24, 2026 12:57
