Fix a bug of preparing Ulysses inputs for loss function in PPO #434
garrett4wade merged 1 commit into main
Conversation
Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request resolves a critical bug in the Ulysses pipeline's PPO loss calculation by ensuring the correct, unsliced loss mask is utilized. It also enhances the overall input preparation logic for distributed training, making it more explicit and robust, particularly concerning how various tensors are handled.
Pull Request Overview
This PR fixes a critical bug in the Ulysses sequence parallelism implementation where the loss function was using sliced loss masks instead of the full mask, which could lead to incorrect loss calculations in distributed training scenarios.
- Corrected loss mask handling to preserve full masks for loss computation while using sliced versions for model inputs
- Refactored input preparation logic in `ulysses_prepare_inputs` with improved tensor type checking and key-specific handling
- Cleaned up unused variables and improved code formatting for better maintainability
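The mask handling described above can be sketched minimally. The helper names below (`sp_slice`, `prepare_inputs`) and the use of plain Python lists in place of tensors are illustrative assumptions, not the repository's actual `ulysses_prepare_inputs` implementation:

```python
def sp_slice(seq, sp_rank, sp_size):
    # Each sequence-parallel rank keeps one contiguous chunk of the sequence.
    chunk = len(seq) // sp_size
    return seq[sp_rank * chunk : (sp_rank + 1) * chunk]


def prepare_inputs(inputs, sp_rank, sp_size):
    # Hypothetical sketch: slice every tensor for this rank's model inputs,
    # but additionally keep the unsliced loss mask around for the loss function.
    out = {}
    for key, value in inputs.items():
        if key == "loss_mask":
            out["full_loss_mask"] = value  # unsliced, for loss computation
        out[key] = sp_slice(value, sp_rank, sp_size)
    return out
```

For example, with `sp_size=2`, rank 1 of an 8-token sequence sees only tokens 4..7 in `loss_mask`, while `full_loss_mask` still covers all 8 positions.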
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| areal/utils/ulysses.py | Enhanced `ulysses_prepare_inputs` function with proper loss mask handling and improved tensor slicing logic |
| areal/engine/ppo/actor.py | Fixed loss function to use `full_loss_mask` instead of sliced `loss_mask` |
| areal/engine/fsdp_engine.py | Removed unused variables and reformatted function calls for better readability |
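The `actor.py` change above swaps which mask the loss aggregation sees. A generic masked-mean reduction of the kind such a loss typically ends with might look like this sketch (the name `masked_mean` and the list-based tensors are illustrative, not the repository's code):

```python
def masked_mean(per_token_loss, mask):
    # Average the loss over unmasked (mask == 1) positions only; the mask
    # must be the same length as the per-token loss it weights.
    assert len(per_token_loss) == len(mask), "loss/mask length mismatch"
    total = sum(l * m for l, m in zip(per_token_loss, mask))
    denom = max(sum(mask), 1)  # avoid division by zero on all-masked rows
    return total / denom
```

Whether the full or the sliced mask is the correct argument depends on whether the per-token loss itself is full-length or sliced, which is exactly the point debated in the reviews below.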
Code Review
This pull request aims to fix a bug in how inputs are prepared for the PPO loss function when using Ulysses sequence parallelism. The changes correctly delegate input slicing to `ulysses_prepare_inputs` and introduce a `full_loss_mask` to be used by the loss function. The code is also refactored for clarity by cleaning up unused variables and improving function call formatting.
While the refactoring is good, I've identified two critical issues with the implementation that could lead to incorrect behavior or runtime errors during training with sequence parallelism:
- In `ulysses.py`, the set of tensors being sliced is incomplete. Important tensors for the actor loss like `advantages`, `logprobs`, and `prox_logp` are not being sliced, which will cause a shape mismatch.
- In `actor.py`, the actor's loss function is changed to use the `full_loss_mask`. However, since the other tensors it operates on are sliced, this will also result in a shape mismatch. The sliced `loss_mask` should be used instead.
I've left detailed comments with suggestions to fix these issues. Additionally, a minor comment points out a potentially misleading code comment.
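The shape mismatch the review warns about is easy to reproduce in miniature. This toy sketch (hypothetical `masked_mean_strict` helper, lists standing in for tensors) fails fast when a rank's sliced per-token loss is paired with the full-length mask:

```python
def masked_mean_strict(per_token_loss, mask):
    # Pairing a sliced loss tensor with the full-length mask trips this
    # check, which models the shape mismatch described in the review.
    if len(per_token_loss) != len(mask):
        raise ValueError(
            f"length mismatch: loss has {len(per_token_loss)} entries, "
            f"mask has {len(mask)}"
        )
    return sum(l * m for l, m in zip(per_token_loss, mask)) / max(sum(mask), 1)


full_mask = [1, 1, 1, 1, 1, 1, 0, 0]   # covers the whole 8-token sequence
sliced_loss = [0.5, 0.25, 0.75, 1.0]   # this rank's 4-token slice
sliced_mask = full_mask[:4]            # the matching slice of the mask

ok = masked_mean_strict(sliced_loss, sliced_mask)  # lengths agree
```

Calling `masked_mean_strict(sliced_loss, full_mask)` instead raises `ValueError`, mirroring the review's argument that the sliced `loss_mask` is the one the sliced loss terms must be paired with.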
This pull request refactors how loss masks and input slicing are handled in the Ulysses pipeline, with a focus on improving clarity and correctness for distributed sequence-parallel (SP) training. The main changes ensure that the full (unsliced) loss mask is preserved for the loss function, while the sliced version is used for model inputs. Additionally, the code for preparing and slicing inputs has been clarified and made more robust.
Loss mask handling improvements:

The loss function (`grpo_loss_fn` in `actor.py`) now uses `full_loss_mask` instead of the sliced `loss_mask`, ensuring the loss is computed on the correct mask. The slicing of `loss_mask` is now delegated to `ulysses_prepare_inputs`. In `ulysses_prepare_inputs` (`ulysses.py`), when processing the `loss_mask` key, the code now saves the full (unsliced) version as `full_loss_mask` for use in the loss function, and also stores the sliced version for model inputs.

Input preparation and slicing:

The `ulysses_prepare_inputs` function has been refactored for clarity, with improved logic for which tensors to slice and how to handle keys like `values`, `returns`, and `loss_mask`. This includes a more robust check for tensor types and shapes. The function signatures and calls to `ulysses_prepare_inputs` have been updated throughout the codebase (in both `fsdp_engine.py` and `ulysses.py`) to use multi-line arguments for clarity and maintain consistent parameter ordering.

Minor cleanups:

Unused variables were removed in `fsdp_engine.py`, simplifying the code and reducing potential confusion.
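The "more robust check for tensor types and shapes" mentioned above can be sketched as a per-key dispatch. Everything here is an illustrative assumption (names `should_slice` and `prepare`, lists standing in for torch tensors, the example keys), not the repository's actual logic:

```python
def should_slice(value, full_seqlen):
    # Hypothetical shape/type check: only sequence-shaped values whose length
    # matches the full sequence are sliced per rank; scalars and short
    # metadata lists pass through untouched.
    return isinstance(value, (list, tuple)) and len(value) == full_seqlen


def prepare(inputs, full_seqlen, sp_rank, sp_size):
    # Slice each qualifying value into this rank's contiguous chunk.
    chunk = full_seqlen // sp_size
    lo, hi = sp_rank * chunk, (sp_rank + 1) * chunk
    return {
        k: (list(v[lo:hi]) if should_slice(v, full_seqlen) else v)
        for k, v in inputs.items()
    }
```

The benefit of checking shape rather than hard-coding key names is that scalar hyperparameters or bookkeeping lists mixed into the input dict survive unchanged, while any full-length tensor gets sliced consistently.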