fix: swarm bug "Failed to detach context" with opentelemetry #2281

Open
mehtarac wants to merge 2 commits into strands-agents:main from mehtarac:fix_swarm_bug

Conversation

@mehtarac
Member

Description

When using the Swarm multiagent pattern with OpenTelemetry tracing enabled, users encounter repeated "Failed to detach context" errors with ValueError: was created in a different Context. This happens because _stream_with_timeout uses asyncio.wait_for() to wrap each anext() call on the async generator. Each wait_for invocation creates a new asyncio.Task with a copied contextvars.Context, so OTEL span tokens attached in one iteration's context cannot be detached in a subsequent iteration's different context.
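The failure mode described above can be reproduced without OpenTelemetry or Strands. The following is a minimal sketch, assuming a plain `contextvars.ContextVar` as a stand-in for the tracing library's current-span variable; the names `current_span`, `events`, `step`, and the two `drive_*` functions are hypothetical and do not appear in the PR. Each `asyncio.create_task` call mimics the per-iteration task (with its own copied `Context`) that the description attributes to `asyncio.wait_for()`.

```python
import asyncio
import contextvars

# Hypothetical stand-in for the ContextVar a tracing library uses for the current span.
current_span = contextvars.ContextVar("current_span", default=None)


async def events():
    token = current_span.set("span-1")  # "attach" while producing the first event
    yield "event"
    current_span.reset(token)  # "detach" when the consumer asks for the next event


async def step(gen):
    # Advance the generator by one event.
    return await gen.__anext__()


async def drive_in_separate_tasks():
    gen = events()
    # Each task runs in its own copy of the current Context.
    await asyncio.create_task(step(gen))
    try:
        await asyncio.create_task(step(gen))
    except ValueError as exc:
        # The token was created in the first task's Context.
        return str(exc)
    except StopAsyncIteration:
        return None


async def drive_in_one_task():
    gen = events()
    await gen.__anext__()
    try:
        await gen.__anext__()
    except StopAsyncIteration:
        # set() and reset() ran in the same Context, so the detach succeeded.
        return "ok"


cross_task_error = asyncio.run(drive_in_separate_tasks())
same_task_result = asyncio.run(drive_in_one_task())
print(cross_task_error)
print(same_task_result)
```

Driving the generator from a single task keeps the attach and the detach in the same `Context`, which is the property the fix needs to preserve.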

Note:
On Python 3.10 only, a node that hangs indefinitely mid-await (e.g., unresponsive model API that never returns) will not be interrupted until the next event arrives. This is an unavoidable limitation of Python 3.10's async APIs and affects an edge case on a version approaching EOL (Oct 2026). Python 3.11+ retains full mid-await cancellation semantics via asyncio.timeout().
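To illustrate the version split described in the note, here is a hedged sketch of a deadline-bounded stream wrapper. It is not the PR's `_stream_with_timeout`; the helper name `stream_with_deadline`, the overall-deadline policy, and the demo generators `fast` and `slow` are assumptions made for illustration. On Python 3.11+ it uses `asyncio.timeout()`, which cancels the current task in place rather than spawning a new one. On Python 3.10 it only checks the deadline between events, so a hung await is not interrupted.

```python
import asyncio
import sys


async def stream_with_deadline(gen, timeout):
    """Hypothetical helper: yield events from gen until an overall deadline passes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise asyncio.TimeoutError("stream deadline exceeded")
        try:
            if sys.version_info >= (3, 11):
                # No new task is created, so the generator keeps running
                # in the caller's Context and can be cancelled mid-await.
                async with asyncio.timeout(remaining):
                    event = await gen.__anext__()
            else:
                # Python 3.10 fallback: the deadline is only checked between
                # events, so an await that never returns is not interrupted.
                event = await gen.__anext__()
        except StopAsyncIteration:
            return
        yield event


async def fast():
    for i in range(3):
        yield i


async def slow():
    await asyncio.sleep(0.2)
    yield "late"


async def collect(gen, timeout):
    return [event async for event in stream_with_deadline(gen, timeout)]
```

With these assumptions, `asyncio.run(collect(fast(), 1.0))` collects all three events, while `asyncio.run(collect(slow(), 0.05))` raises `asyncio.TimeoutError` on either branch: mid-await on 3.11+, and only after the late event arrives on 3.10.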

Related Issues

#1316

Documentation PR

Type of Change

Bug fix

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mehtarac mehtarac changed the title from "fix: swarm bug" to "fix: swarm bug "Failed to detach context" with opentelemetry" May 11, 2026
@codecov

codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@mehtarac mehtarac marked this pull request as ready for review May 11, 2026 16:50
@mehtarac mehtarac temporarily deployed to manual-approval May 11, 2026 16:53 — with GitHub Actions Inactive
Comment thread src/strands/multiagent/swarm.py Outdated
Comment thread src/strands/multiagent/swarm.py
@github-actions

Assessment: Comment

The fix correctly identifies the root cause (asyncio.wait_for() creating new tasks with copied contexts breaks OTEL span token attachment) and addresses it with an appropriate version-branched approach. The Python 3.11+ path using asyncio.timeout() is clean and correct. The Python 3.10 fallback trades timeout precision for correctness, which is a reasonable tradeoff for a version approaching EOL.

Review Details
  • Documentation: The docstring still describes the old wait_for behavior and should be updated to reflect the new version-branched semantics and 3.10 limitations.
  • API usage: asyncio.get_event_loop() should be asyncio.get_running_loop() for correctness and to avoid deprecation concerns.
  • Maintainability: The Python 3.10 fallback has a significant behavioral difference (no mid-await cancellation) that should be documented inline with a note about when it can be removed.

The approach is sound and the change is well-scoped.
