Fix: In-Group Evals by hubert-marek · Pull Request #2057 · PrimeIntellect-ai/prime-rl

hubert-marek · 2026-03-20T18:06:44Z

Fix eval grouping and preserve environment-declared state columns during deferred scoring.

Description

Two related fixes in vf_utils.py:

1. Fix eval group construction

evaluate() was passing rollouts_per_example to _get_eval_inputs() (which repeats examples) and then setting rollouts_per_example=1 on generate(). This meant each "group" had only one rollout, breaking group-based scoring and pass@k semantics.

Fixed by passing rollouts_per_example=1 to _get_eval_inputs() to get unique examples, then passing the real rollouts_per_example to generate() so groups are constructed correctly.
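The difference between the two call patterns can be sketched with toy stand-ins. The `get_eval_inputs` and `generate` functions below are hypothetical simplifications written for illustration, not the actual helpers in vf_utils.py; they only mimic the behaviors described above (repeating examples, and building one group of rollouts per input).

```python
def get_eval_inputs(examples, rollouts_per_example):
    """Hypothetical stand-in: repeat each example rollouts_per_example times."""
    return [ex for ex in examples for _ in range(rollouts_per_example)]


def generate(inputs, rollouts_per_example):
    """Hypothetical stand-in: build one group per input,
    each holding rollouts_per_example (example_id, rollout_index) pairs."""
    return [
        [(ex["example_id"], i) for i in range(rollouts_per_example)]
        for ex in inputs
    ]


examples = [{"example_id": 0}, {"example_id": 1}]
k = 4  # the real rollouts_per_example

# Old pattern: examples are repeated up front, then each copy forms its own group.
before = generate(get_eval_inputs(examples, k), rollouts_per_example=1)

# Fixed pattern: unique examples, each forming one group of k rollouts.
after = generate(get_eval_inputs(examples, 1), rollouts_per_example=k)

before_sizes = [len(group) for group in before]
after_sizes = [len(group) for group in after]
print(before_sizes)
print(after_sizes)
```

Both patterns produce the same total number of rollouts, but only the second keeps all rollouts of one example in the same group, which is what group-based scoring and pass@k need.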

Fixes:

2. Preserve environment-declared state columns

run_rollout() now reads env.state_columns (if present) and includes those fields in the serialization columns. This allows environments to declare which custom state fields (e.g. eval_score, eval_error) should survive state_to_output() serialization, so deferred group scoring can access them without the orchestrator needing to hardcode environment-specific field names.

When running under EnvGroup, run_rollout() first resolves the task-specific sub-environment via get_env_for_task() and reads state_columns from that sub-environment.

Falls back gracefully: if the environment has no state_columns attribute, getattr returns [] and behavior is unchanged.
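A minimal sketch of the column resolution described above, under stated assumptions: `BASE_COLUMNS`, `resolve_env`, and `serialization_columns` are hypothetical names, the environments are plain `SimpleNamespace` objects rather than real verifiers environments, and the duplicate check is an illustrative addition. Only the `getattr(..., "state_columns", [])` fallback and the `get_env_for_task()` lookup come from the description.

```python
from types import SimpleNamespace

# Hypothetical default columns; the real list lives in the orchestrator.
BASE_COLUMNS = ["trajectory", "reward"]


def resolve_env(env, task):
    """Return the task-specific sub-environment if env exposes get_env_for_task()."""
    get_env = getattr(env, "get_env_for_task", None)
    return get_env(task) if callable(get_env) else env


def serialization_columns(env, task, base_columns=BASE_COLUMNS):
    """Append environment-declared state columns to the base columns."""
    task_env = resolve_env(env, task)
    # Fallback: environments without state_columns contribute nothing.
    extra = getattr(task_env, "state_columns", [])
    columns = list(base_columns)
    for col in extra:
        if col not in columns:  # illustrative: avoid duplicate column names
            columns.append(col)
    return columns


# An environment that declares custom state fields, and one that does not.
math_env = SimpleNamespace(state_columns=["eval_score", "eval_error"])
legacy_env = SimpleNamespace()
envs = {"math": math_env, "legacy": legacy_env}
group = SimpleNamespace(get_env_for_task=lambda task: envs[task])

print(serialization_columns(group, "math"))
print(serialization_columns(group, "legacy"))
print(serialization_columns(legacy_env, task=None))
```

Because the lookup goes through the resolved sub-environment, each task in a group can declare its own fields, and environments that predate the attribute serialize exactly as before.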

Companion PR: PrimeIntellect-ai/verifiers#1054

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Testing

All existing tests pass when running uv run pytest locally.
New tests have been added to cover the changes

Verified locally that state_columns declared by an environment are correctly appended to serialization columns. The eval grouping fix was validated by confirming _get_eval_inputs(..., 1) produces unique examples and generate() constructs proper groups.

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Additional Notes

The state_columns feature depends on PrimeIntellect-ai/verifiers#1054, which adds the state_columns attribute to Environment.__init__. Without that PR, the getattr fallback returns [] and the behavior is identical to today.

Note

Medium Risk
Changes evaluation batching semantics and rollout serialization columns, which can affect pass@k/group-scoring correctness and downstream metrics, but is localized to vf_utils wrappers.

Overview
Fixes eval grouping so evaluate() builds true groups of size rollouts_per_example by requesting unique eval inputs (_get_eval_inputs(..., rollouts_per_example=1)) and passing the real rollouts_per_example through to generate().

Updates run_rollout() to preserve environment-declared custom state fields by appending task-specific state_columns (via get_env_for_task() when available) to the serialized output columns, ensuring deferred/group scoring can access them.

Written by Cursor Bugbot for commit ef25835.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

hubert-marek added 5 commits March 20, 2026 23:36

Fix In-Group Evals

a2e6484

Fix In-Group Evals

e9e7040

Logic

3397ad4

Enironemnt state custom arguments

6d71697

State

ef25835

hubert-marek mentioned this pull request Mar 23, 2026

Allow environments to declare preserved state columns PrimeIntellect-ai/verifiers#1054

Open

13 tasks

hubert-marek changed the title ~~Fix In-Group Evals~~ Fix: In-Group Evals Mar 24, 2026

hubert-marek marked this pull request as ready for review March 24, 2026 12:22

cursor Bot reviewed Mar 24, 2026

Comment thread src/prime_rl/orchestrator/vf_utils.py Outdated

hubert-marek marked this pull request as draft March 24, 2026 12:39

Run Group

6a70749

hubert-marek marked this pull request as ready for review March 24, 2026 12:57

hubert-marek wants to merge 6 commits into PrimeIntellect-ai:main from hubert-marek:eval-fixes
