feat: add configurable decode ACL-graph fallback threshold. #1233

Open
DongheJin wants to merge 1 commit into jd-opensource:main from DongheJin:bugfix/aclgraph_oom_main

Conversation

@DongheJin
Collaborator

No description provided.

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a new global flag, acl_graph_decode_batch_size_limit, to manage the maximum batch size for ACL graph decoding. If the actual decode batch size surpasses this limit, the system automatically falls back to eager mode to prevent Out-Of-Memory (OOM) issues. This change required refactoring several decoder layers (GLM4, GLM4-MoE, Qwen3, Qwen3-MoE) to support distinct parameter sets and execution nodes for graph and eager modes, along with updates to their forward and build_node_variant_pack methods for dynamic mode selection. A new test case was also added to validate this fallback mechanism.

The review comments suggest improving the handling of the acl_graph_decode_batch_size_limit flag by either documenting its std::max(1, ...) behavior or treating non-positive inputs as errors. Additionally, the manual copying and overriding of enableAclGraphPagedAttention for eager decode parameters across multiple layers is identified as repetitive and error-prone; the review recommends encapsulating it through copy constructors, factory methods, or a shared utility for better safety and maintainability.
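The fallback rule summarized above can be sketched in a few lines. This is a minimal illustration, not code from the PR: `DecodeMode` and `select_decode_mode` are hypothetical names, and the comparison is strict only because the summary says fallback happens when the batch size surpasses the limit.

```cpp
#include <cstdint>

// Hypothetical names for illustration; they do not appear in the PR.
enum class DecodeMode { kGraph, kEager };

// Chooses eager mode once the decode batch size exceeds the configured limit;
// a batch size equal to the limit still runs through the ACL graph path.
inline DecodeMode select_decode_mode(uint32_t batch_size, uint32_t limit) {
  return batch_size > limit ? DecodeMode::kEager : DecodeMode::kGraph;
}
```

Keeping the decision in one small function would let the new test case exercise the boundary (batch size equal to the limit, and one above it) without building a full layer.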

Comment on lines +990 to +991
const uint32_t decode_batch_size_limit =
std::max(1, FLAGS_acl_graph_decode_batch_size_limit);
Contributor

high

The use of std::max(1, ...) is a good safety measure, but it should be explicitly documented or handled as a configuration error if the user provides a non-positive threshold, as this silently overrides the user's intent.
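One way to act on this comment is to reject non-positive values explicitly instead of clamping them. The sketch below assumes the flag is a signed 32-bit integer (which the `std::max(1, ...)` call suggests, since both arguments must share a type); `parse_decode_batch_size_limit` is a hypothetical helper, not part of the PR.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: returns the limit when the raw flag value is positive,
// or std::nullopt so the caller can report a configuration error instead of
// silently replacing the value with 1.
inline std::optional<uint32_t> parse_decode_batch_size_limit(int32_t raw) {
  if (raw <= 0) {
    return std::nullopt;
  }
  return static_cast<uint32_t>(raw);
}
```

The caller would then decide whether an empty result should abort startup or log a warning and use a documented default; either choice makes the override visible to the user.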

Comment on lines +90 to +91
decode_eager_param_ = decode_graph_param_;
decode_eager_param_.enableAclGraphPagedAttention = false;
Contributor

high

Manual assignment of decode_eager_param_ from decode_graph_param_ followed by a specific member override is error-prone. Consider adding a copy constructor or a dedicated factory method to ChatglmLayerParam to handle this initialization safely.
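A factory method along the lines suggested here might look like the following. `LayerParamSketch` is a stand-in for a struct such as ChatglmLayerParam: only the `enableAclGraphPagedAttention` member comes from the diff, while `numAttentionHeads` and `as_eager_decode` are hypothetical and exist purely for illustration.

```cpp
// Stand-in for a layer parameter struct; not the real ChatglmLayerParam.
struct LayerParamSketch {
  bool enableAclGraphPagedAttention = false;
  int numAttentionHeads = 0;  // placeholder for the remaining fields

  // Returns a copy configured for eager decode: every field is preserved
  // except the graph-only paged-attention switch, which is turned off.
  LayerParamSketch as_eager_decode() const {
    LayerParamSketch eager = *this;
    eager.enableAclGraphPagedAttention = false;
    return eager;
  }
};
```

With such a method the call site reduces to a single assignment, for example `decode_eager_param_ = decode_graph_param_.as_eager_decode();`, so the override cannot be forgotten when a new layer is added.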

Comment on lines +51 to +52
decode_eager_param_ = decode_graph_param_;
decode_eager_param_.enableAclGraphPagedAttention = false;
Contributor

high

Similar to other decoder implementations, manual copying and modification of decode_eager_param_ is fragile. Please encapsulate this logic within the parameter struct or a factory method to ensure consistency.

Comment on lines +161 to +162
decode_eager_param_ = decode_graph_param_;
decode_eager_param_.enableAclGraphPagedAttention = false;
Contributor

high

The manual override of enableAclGraphPagedAttention after copying the parameter struct is prone to maintenance issues. Encapsulate this initialization logic to prevent future regressions.

Comment on lines +59 to +60
decode_eager_param_ = decode_graph_param_;
decode_eager_param_.enableAclGraphPagedAttention = false;
Contributor

high

The manual initialization of decode_eager_param_ by copying and overriding a flag is repetitive across different layer implementations. Consider refactoring this into a shared utility or a constructor-based approach.
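Because the same two-line pattern appears in several layer implementations, a shared utility could be a small function template. The sketch below is an assumption about how that might be shaped: `make_eager_decode_param`, `GlmParamSketch`, and `QwenParamSketch` are hypothetical, and the only requirement on the parameter type is that it is copyable and has an `enableAclGraphPagedAttention` member.

```cpp
// Hypothetical shared helper: copies any layer parameter struct and disables
// the graph-only paged-attention switch on the copy.
template <typename Param>
Param make_eager_decode_param(const Param& graph_param) {
  Param eager_param = graph_param;
  eager_param.enableAclGraphPagedAttention = false;
  return eager_param;
}

// Two stand-in parameter structs with different extra fields, to show that
// the helper does not depend on a common base class.
struct GlmParamSketch {
  bool enableAclGraphPagedAttention = false;
  int layerId = 0;
};

struct QwenParamSketch {
  bool enableAclGraphPagedAttention = false;
  float rmsNormEps = 0.0f;
};
```

A template avoids touching each parameter struct, at the cost of relying on a member name shared by convention rather than by a common type.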
