- Add DeviceRuntimeInfo dataclass to io_struct.py for runtime memory stats
- Add clear_memory() method to all platform classes (cuda, npu, cpu, unknown)
- Add get_device_stats() abstract method to TrainEngine API
- Implement get_device_stats() in FSDPEngine and MegatronEngine
- Update all usages from log_gpu_stats() to engine.get_device_stats().log()
- Delete areal/utils/device.py as functionality moved to platform/engine

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
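The shape of the API listed above can be sketched as follows. This is only an illustration, not the PR's code: the field names on DeviceRuntimeInfo, the log() signature, and the FakeEngine class are assumptions made up for the example, and the real FSDPEngine and MegatronEngine would query the accelerator instead of returning fixed numbers.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import gc
import logging

logger = logging.getLogger("device_stats_sketch")


@dataclass
class DeviceRuntimeInfo:
    # Hypothetical fields (bytes); the real dataclass in io_struct.py may differ.
    mem_allocated: int
    mem_reserved: int
    mem_total: int

    def log(self, label: str = "") -> str:
        """Format the stats in GiB, log them, and return the message."""
        gib = 1024 ** 3
        msg = (
            f"{label} allocated={self.mem_allocated / gib:.2f} GiB, "
            f"reserved={self.mem_reserved / gib:.2f} GiB, "
            f"total={self.mem_total / gib:.2f} GiB"
        )
        logger.info(msg)
        return msg


class Platform(ABC):
    @abstractmethod
    def clear_memory(self) -> None:
        """Release cached device memory where the platform supports it."""


class CpuPlatform(Platform):
    def clear_memory(self) -> None:
        # No accelerator cache on CPU; only run the Python garbage collector.
        gc.collect()


class TrainEngine(ABC):
    @abstractmethod
    def get_device_stats(self) -> DeviceRuntimeInfo:
        """Collect stats in the process that owns the device."""


class FakeEngine(TrainEngine):
    # Stand-in engine for the sketch; returns fixed values instead of real stats.
    def get_device_stats(self) -> DeviceRuntimeInfo:
        gib = 1024 ** 3
        return DeviceRuntimeInfo(
            mem_allocated=1 * gib, mem_reserved=2 * gib, mem_total=8 * gib
        )
```

With this layout a call site changes from log_gpu_stats("after step") to engine.get_device_stats().log("after step"), so the stats are gathered by the engine rather than by whichever process happens to run the caller.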
Summary of Changes
Hello @garrett4wade, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refactors the handling of device-specific operations, particularly memory management and logging. The core intent is to resolve an issue where GPU statistics were erroneously collected on the CPU in single-controller mode, and to establish a more robust, platform-aware mechanism for managing device resources. By integrating device utilities directly into the engine and platform classes, the changes enhance the reliability and consistency of device interaction across the codebase.
Code Review
This pull request is a solid refactoring that moves device-specific utilities like memory logging and clearing into platform-specific classes, fixing a bug where log_gpu_stats was running on the wrong node. The introduction of DeviceRuntimeInfo and the clear_memory method in platform classes is a good architectural improvement. The changes are well-executed across the codebase. I've identified a couple of areas for improvement: one to make the new DeviceRuntimeInfo class more robust against incorrect usage, and a potential bug in the UnknownPlatform where the clear_memory implementation seems to be missing.
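Why a missing clear_memory on one platform class matters can be shown with a small sketch. It assumes the platform base class declares clear_memory as an abstract method, which this page does not confirm; the class names below are modeled on the PR and are otherwise hypothetical. Under that assumption, a subclass that omits the method cannot be instantiated, while a no-op implementation keeps the fallback platform usable.

```python
from abc import ABC, abstractmethod


class Platform(ABC):
    @abstractmethod
    def clear_memory(self) -> None:
        """Release cached device memory where the platform supports it."""


class IncompletePlatform(Platform):
    # clear_memory is deliberately left out to show the failure mode.
    pass


class UnknownPlatform(Platform):
    def clear_memory(self) -> None:
        # Nothing to free on an unrecognized device; do nothing, safely.
        return None


def can_instantiate(cls) -> bool:
    """Return True if cls() succeeds, False if the ABC check rejects it."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

If the base class is not abstract, the failure would instead surface later as an AttributeError at the first clear_memory() call, which is harder to spot.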
Description
In single-controller mode, the log_gpu_stats function runs on the CPU node, which is a bug. This PR attaches the device logging method to the engine to avoid this issue.

🤖 Generated with Claude Code
Type of Change
work as expected)
Checklist
jb build docs/gemini review)