refactor(core): simplify HTTP client lifecycle with event loop cleanup #823

Merged: rchardx merged 3 commits into main from fw/session on Jan 13, 2026

@garrett4wade garrett4wade commented Jan 12, 2026

Description

This PR refactors HTTP client lifecycle management by replacing the AsyncTaskRunner shutdown hook mechanism with an asyncio-atexit pattern. The changes simplify the architecture by moving cleanup responsibility from AsyncTaskRunner to the event loop itself.

Key improvements:

  • Removed shutdown hook mechanism from AsyncTaskRunner (52 lines removed)
  • Removed related tests (128 lines removed)
  • Implemented per-thread HttpClientManager with event loop cleanup registration
  • Added register_loop_cleanup utility in areal/utils/concurrent.py for atexit-like behavior on loop closure
  • Fixed event loop change detection and stale client handling
  • Made get_aiohttp_session and get_httpx_client async to properly detect running loop
  • Removed WorkflowExecutor coupling to HTTP client lifecycle
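For readers unfamiliar with the pattern, here is a minimal sketch of what an atexit-like hook on loop closure can look like. It is not the code from areal/utils/concurrent.py: the name `register_loop_cleanup_sketch`, the `_callbacks` registry, and the reverse-order execution are all assumptions made for illustration. The idea is to wrap the loop's `close()` once and run the registered callbacks just before the real close.

```python
import asyncio
import inspect
import weakref

# Hypothetical registry: event loop -> list of cleanup callbacks.
_callbacks = weakref.WeakKeyDictionary()


def register_loop_cleanup_sketch(loop, callback):
    """Arrange for `callback` to run right before `loop.close()` does its work."""
    if loop not in _callbacks:
        _callbacks[loop] = []
        original_close = loop.close  # bound method of this particular loop

        def patched_close():
            # Run callbacks in reverse registration order, as atexit does.
            for cb in reversed(_callbacks.pop(loop, [])):
                try:
                    result = cb()
                    if inspect.isawaitable(result):
                        # The loop is stopped but not yet closed, so it can
                        # still drive an async cleanup to completion.
                        loop.run_until_complete(result)
                except Exception as exc:
                    print(f"cleanup callback failed: {exc!r}")
            original_close()

        # Shadow the method on the instance only; other loops are untouched.
        loop.close = patched_close
    _callbacks[loop].append(callback)
```

This relies on the loop object accepting instance attributes, which holds for the standard asyncio loops but may not hold for alternative loop implementations.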

Related Issue

Fixes thread-safety issues in HTTP client management where sessions could be reused across threads or event loops.
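To illustrate the cross-loop reuse problem, the sketch below shows one way an async getter can detect that the running loop has changed. `FakeSession` and `LoopBoundHolder` are hypothetical stand-ins so the example runs without aiohttp; the real HttpClientManager may differ in every detail.

```python
import asyncio


class FakeSession:
    """Stand-in for aiohttp.ClientSession (hypothetical, dependency-free)."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


class LoopBoundHolder:
    """Hypothetical holder that remembers which loop created its session."""

    def __init__(self):
        self._session = None
        self._loop = None

    async def get_session(self):
        # Because this is a coroutine, a running loop is guaranteed to exist.
        loop = asyncio.get_running_loop()
        if self._session is not None and self._loop is not loop:
            # The cached session belongs to a different loop: treat it as stale.
            self._session = None
        if self._session is None:
            self._session = FakeSession()
            self._loop = loop
        return self._session
```

A synchronous getter cannot make this check reliably, because `asyncio.get_running_loop()` raises `RuntimeError` when no loop is running in the calling thread.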

Type of Change

  • Code refactoring (no functional changes)
  • Bug fix (non-breaking change that fixes an issue)

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • My branch is up to date with main
  • No critical issues raised by AI reviewers

Additional Context

This change reduces overall complexity (net -19 lines) while improving thread safety and testability. The asyncio-atexit pattern is more idiomatic for async resource cleanup than custom shutdown hooks.

The changes affect:

  • areal/core/async_task_runner.py - Removed shutdown hooks
  • areal/core/workflow_context.py - Per-thread HttpClientManager with loop cleanup
  • areal/core/remote_inf_engine.py - Updated to await async session getter
  • areal/core/workflow_executor.py - Removed HTTP client configuration
  • areal/utils/concurrent.py - Added event loop cleanup utilities
  • areal/tests/test_async_task_runner.py - Removed shutdown hook tests

@gemini-code-assist

Summary of Changes

Hello @garrett4wade, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where aiohttp.ClientSession instances could retain stale event loop references, leading to problems when AsyncTaskRunner instances were destroyed and recreated within the same thread. The solution involves a significant refactoring of how HTTP clients are managed, transitioning to a thread-specific approach and tightly integrating their lifecycle with the AsyncTaskRunner via new initialization and updated shutdown hooks. This ensures robust and correct client behavior across thread lifecycles without functional changes to the application's external behavior.

Highlights

  • Initialization Hooks for AsyncTaskRunner: Introduced a new mechanism for registering asynchronous initialization hooks within the AsyncTaskRunner, allowing setup functions to run in the background thread's event loop at startup.
  • Enhanced Shutdown Hooks: Modified existing shutdown hooks in AsyncTaskRunner to receive the runner instance as a parameter, providing more context for cleanup operations.
  • Thread-Specific HTTP Client Management: Refactored HTTP client management from a module-level singleton to a global dictionary indexed by thread ID, ensuring each thread has its own HttpClientManager instance and preventing stale aiohttp.ClientSession references.
  • Lifecycle Integration of HTTP Clients: Integrated the configuration and cleanup of HTTP clients directly into the AsyncTaskRunner's lifecycle using the new initialization and enhanced shutdown hooks, ensuring proper client setup and teardown in the correct thread context.
  • Explicit Client Manager Removal: Added explicit removal of the HttpClientManager instance from the global thread-indexed dictionary during cleanup, ensuring a fresh instance is created upon subsequent reinitialization in the same thread.
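The thread-indexed dictionary described in these highlights could look roughly like the sketch below. `ManagerStub`, `get_manager`, and `remove_manager` are hypothetical names chosen for illustration, not the identifiers used in areal/core/workflow_context.py.

```python
import threading


class ManagerStub:
    """Stand-in for HttpClientManager (hypothetical)."""


# Global registry keyed by thread ID, guarded by a lock.
_managers = {}
_lock = threading.Lock()


def get_manager():
    """Return the calling thread's manager, creating it on first use."""
    tid = threading.get_ident()
    with _lock:
        if tid not in _managers:
            _managers[tid] = ManagerStub()
        return _managers[tid]


def remove_manager():
    """Drop the calling thread's manager so the next lookup builds a fresh one."""
    with _lock:
        return _managers.pop(threading.get_ident(), None)
```

Explicit removal matters with this layout because Python may recycle a thread identifier after the thread exits, so a leftover entry could otherwise be picked up by an unrelated thread.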

@garrett4wade garrett4wade changed the title from "refactor: improve HTTP client management across threads" to "fix: improve HTTP client management across threads" Jan 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request effectively refactors the HTTP client management to address potential issues with stale event loop references in aiohttp.ClientSession. The move from a singleton/threading.local() approach to a global dictionary of per-thread HttpClientManager instances, keyed by thread ID, is a solid solution. The introduction of initialization hooks in AsyncTaskRunner is a key part of this fix, ensuring that HTTP clients are configured within the correct background thread and event loop. The changes are well-contained and consistently applied across the codebase, including necessary updates to tests. I have one suggestion to improve the robustness of the new initialization hook registration.

Comment thread areal/core/async_task_runner.py Outdated
@garrett4wade garrett4wade added the safe-to-test label (Ready to run unit-tests in a PR.) Jan 12, 2026
@garrett4wade garrett4wade force-pushed the fw/session branch 3 times, most recently from b06a3ff to 46cb88e January 13, 2026 06:57
@garrett4wade garrett4wade changed the title from "fix: improve HTTP client management across threads" to "refactor(core): simplify HTTP client lifecycle with event loop cleanup" Jan 13, 2026
@garrett4wade

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request is a significant and well-motivated refactoring of the HTTP client lifecycle management. Moving from a custom shutdown hook to an asyncio-atexit pattern simplifies the architecture and improves thread safety, which is a great improvement. The changes are clean and the decoupling of WorkflowExecutor from the client lifecycle is a good design choice. I've found a critical memory leak in the new event loop cleanup utility and a potential unhandled exception that could crash operations. I've provided detailed comments and code suggestions to address these issues. Once these are fixed, this will be an excellent contribution.

Comment thread areal/utils/concurrent.py Outdated
Comment thread areal/utils/concurrent.py Outdated
Comment thread areal/core/workflow_context.py Outdated
@garrett4wade

Resolved Review Comments

I've addressed all the critical and high-priority issues raised by @gemini-code-assist:

1. ✅ Fixed Memory Leak in _LoopCleanupEntry (Critical)

  • Simplified the class to only store callbacks
  • Removed the _close_ref and get_original_close() logic that created a circular reference
  • The lambda closure lambda: loop._cleanup_orig_close was preventing the event loop from being garbage collected

2. ✅ Updated _patched_loop_close (Critical)

  • Now calls loop._cleanup_orig_close() directly instead of using entry.get_original_close()()
  • Eliminates unnecessary complexity while maintaining the same functionality

3. ✅ Added Error Handling for Stale Client Closure (High)

  • Wrapped aiohttp.ClientSession.close() and httpx.AsyncClient.aclose() calls in try-except blocks
  • Added warning logs for any exceptions that occur when closing stale clients
  • This prevents crashes when closing clients from different event loops

All changes have been committed and pushed. The fixes improve memory management and make the cleanup logic more robust.
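As a rough illustration of the leak in item 1: the thread does not show how the cleanup entries are stored, so this sketch assumes they live in a weak-keyed registry indexed by loop. Under that assumption, a value that captures its own key in a closure keeps the key alive indefinitely. `FakeLoop`, `leaky_entry`, and `simple_entry` are hypothetical names.

```python
import gc
import weakref


class FakeLoop:
    """Stand-in for an event loop (hypothetical, for illustration only)."""


def loop_survives(make_entry):
    """Return True if the loop is still alive after its last outside reference is dropped."""
    registry = weakref.WeakKeyDictionary()
    loop = FakeLoop()
    registry[loop] = make_entry(loop)
    probe = weakref.ref(loop)
    del loop
    gc.collect()
    return probe() is not None


def leaky_entry(loop):
    # The closure captures `loop`, so the registry value keeps its own key alive.
    return {"callbacks": [], "close_ref": lambda: loop}


def simple_entry(loop):
    # Only callbacks are stored; nothing in the value points back at the loop.
    return {"callbacks": []}
```

With `leaky_entry` the loop stays reachable through the registry value, while with `simple_entry` it is released as soon as the last external reference goes away, which matches the direction of the simplification described above.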

@gemini-code-assist

Thanks for the update, @garrett4wade! It's great to see that the critical and high-priority issues have been addressed. The detailed breakdown of the fixes for the memory leak, _patched_loop_close update, and error handling for stale client closure sounds comprehensive and robust. I appreciate your prompt action in resolving these. I'll proceed with a re-review of the changes.

@garrett4wade

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request is a great refactoring that simplifies HTTP client lifecycle management by moving from a custom shutdown hook to a more idiomatic asyncio-atexit pattern. The changes significantly improve thread safety by introducing a per-thread HttpClientManager and enhance robustness by handling event loop changes gracefully. The new register_loop_cleanup utility is a clean and effective way to manage resource cleanup within the event loop's lifecycle.

I've identified a potential resource leak issue in the stale client cleanup logic and an opportunity to reduce code duplication for better maintainability. Overall, this is a high-quality contribution that reduces complexity and improves the codebase.

Comment thread areal/core/workflow_context.py
Comment thread areal/core/workflow_context.py
@garrett4wade

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request is a well-executed refactoring that simplifies HTTP client lifecycle management. It successfully replaces the custom shutdown hook mechanism with a more idiomatic asyncio-atexit pattern, improving thread safety and reducing complexity. The introduction of a per-thread HttpClientManager and the new register_loop_cleanup utility are excellent changes. The overall implementation is robust and clean. I have a couple of minor suggestions to make the cleanup logic in workflow_context.py slightly more concise.

Comment thread areal/core/workflow_context.py
Comment thread areal/core/workflow_context.py
@garrett4wade

Resolved High-Priority Resource Leak Issue

I've addressed the resource leak concern raised in the latest review:

✅ Fixed State Reset Ordering (High Priority - Line 107)

Issue: State was being reset (_aiohttp_session = None) before attempting to close the clients. If the close() operation failed with an exception, the HttpClientManager would be in an inconsistent state - the attribute would be None but the underlying client object might not have been closed, leading to resource leaks.

Fix: Moved the state reset to after the close operations. Now the cleanup logic:

  1. Saves references to old clients
  2. Attempts to close them (with error handling)
  3. Only then resets the state attributes

This ensures we don't lose track of unclosed client objects even if exceptions occur during the close operations.
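The three steps above can be sketched as follows. `FlakyClient` and `ManagerSketch` are hypothetical stand-ins, and the attribute names only loosely mirror the ones mentioned in this thread; the actual code in workflow_context.py may be organized differently.

```python
import asyncio
import logging

logger = logging.getLogger("http_client_sketch")


class FlakyClient:
    """Stand-in client whose close() can be made to fail (hypothetical)."""

    def __init__(self, fail):
        self.fail = fail
        self.closed = False

    async def close(self):
        if self.fail:
            raise RuntimeError("Event loop is closed")
        self.closed = True


class ManagerSketch:
    def __init__(self, session):
        self._session = session
        self._event_loop = object()  # placeholder for the loop that owns the session

    async def drop_stale(self):
        old = self._session  # 1. save a reference to the old client
        if old is not None:
            try:
                await old.close()  # 2. attempt to close it
            except Exception as exc:
                logger.warning("Failed to close stale session: %r", exc)
        # 3. reset state only after the close attempt has finished
        self._session = None
        self._event_loop = None
```

If something that is not an `Exception` subclass escapes the close call (for example `asyncio.CancelledError`), the attributes are left untouched, so the manager still holds the unclosed client instead of silently dropping it.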

Note: I'm leaving the code duplication issue unresolved as it's acceptable in this context for clarity and maintainability.

garrett4wade and others added 3 commits January 13, 2026 15:54
Replace AsyncTaskRunner shutdown hooks with asyncio-atexit pattern for HTTP client cleanup:
- Remove shutdown hook mechanism from AsyncTaskRunner (52 lines) and related tests (128 lines)
- Implement per-thread HttpClientManager with event loop cleanup registration
- Add register_loop_cleanup utility in concurrent.py for atexit-like behavior on loop closure
- Fix event loop change detection and stale client handling
- Make get_aiohttp_session/get_httpx_client async to properly detect running loop
- Remove WorkflowExecutor coupling to HTTP client lifecycle

This simplifies the architecture by moving cleanup responsibility from AsyncTaskRunner
to the event loop itself, reducing complexity and improving testability.

Co-Authored-By: Claude <noreply@anthropic.com>
…op cleanup

Address code review feedback from Gemini Code Assist:
- Fix memory leak in _LoopCleanupEntry by removing circular reference through lambda closure
- Simplify _LoopCleanupEntry to only store callbacks, removing unnecessary _close_ref logic
- Update _patched_loop_close to call loop._cleanup_orig_close directly
- Add error handling when closing stale HTTP clients from different event loops

These changes improve memory management and robustness when event loops are reused or changed.

Co-Authored-By: Claude <noreply@anthropic.com>
…leaks

Address high-priority review comment from Gemini Code Assist:
- Move state reset (_aiohttp_session, _httpx_client, _event_loop = None) to after
  the close() operations instead of before
- This prevents inconsistent state if close() raises an exception
- Ensures we don't lose track of unclosed client objects, avoiding resource leaks

Previously, if an exception occurred during close(), the state would be reset but
the client might not have been closed, leading to resource leaks.

Co-Authored-By: Claude <noreply@anthropic.com>
@garrett4wade garrett4wade added and removed the safe-to-test label (Ready to run unit-tests in a PR.) Jan 13, 2026
@rchardx rchardx left a comment

guud

@rchardx rchardx merged commit b3622c7 into main Jan 13, 2026
7 checks passed
@rchardx rchardx deleted the fw/session branch January 13, 2026 11:43
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
inclusionAI#823)

* refactor(core): simplify HTTP client lifecycle with event loop cleanup

* fix(core): address memory leak and improve error handling in event loop cleanup

* fix(core): move state reset after client closure to prevent resource leaks

---------

Co-authored-by: Claude <noreply@anthropic.com>