You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new global flag, acl_graph_decode_batch_size_limit, to manage the maximum batch size for ACL graph decoding. If the actual decode batch size surpasses this limit, the system will automatically revert to eager mode to prevent Out-Of-Memory (OOM) issues. This change necessitated refactoring various decoder layers (GLM4, GLM4-MoE, Qwen3, Qwen3-MoE) to support distinct parameter sets and execution nodes for both graph and eager modes, along with updates to their forward and build_node_variant_pack methods for dynamic mode selection. A new test case was also added to validate this fallback mechanism. The review comments suggest improving the handling of the acl_graph_decode_batch_size_limit flag by either documenting its std::max(1, ...) behavior or treating non-positive inputs as errors. Additionally, the manual copying and overriding of enableAclGraphPagedAttention for eager decode parameters across multiple layers is identified as repetitive and error-prone, recommending encapsulation through copy constructors, factory methods, or a shared utility for better safety and maintainability.
The reason will be displayed to describe this comment to others. Learn more.
The use of std::max(1, ...) is a good safety measure, but it should be explicitly documented or handled as a configuration error if the user provides a non-positive threshold, as this silently overrides the user's intent.
The reason will be displayed to describe this comment to others. Learn more.
Manual assignment of decode_eager_param_ from decode_graph_param_ followed by a specific member override is error-prone. Consider adding a copy constructor or a dedicated factory method to ChatglmLayerParam to handle this initialization safely.
The reason will be displayed to describe this comment to others. Learn more.
Similar to other decoder implementations, manual copying and modification of decode_eager_param_ is fragile. Please encapsulate this logic within the parameter struct or a factory method to ensure consistency.
The reason will be displayed to describe this comment to others. Learn more.
The manual override of enableAclGraphPagedAttention after copying the parameter struct is prone to maintenance issues. Encapsulate this initialization logic to prevent future regressions.
The reason will be displayed to describe this comment to others. Learn more.
The manual initialization of decode_eager_param_ by copying and overriding a flag is repetitive across different layer implementations. Consider refactoring this into a shared utility or a constructor-based approach.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.