Skip to content

Deadlock during GC with Sentry + Opentelemetry + Tornado + aiohttp #5586

@brettdh

Description

@brettdh

How do you use Sentry?

Sentry Saas (sentry.io)

Version

2.53.0

Steps to Reproduce

Detailed minimal repro project here:

https://github.com/brettdh/sentry-aiohttp-deadlock-repro/

Note: this is a very specific example of a more general issue: if an error occurs in a context where a lock is held, Sentry's inspection and serialization of the stack variables can cause a deadlock when inspecting a variable causes another attempt to acquire a lock that's already held.

In this specific case, these details fill in the general pattern above:

  1. the context is an aiohttp client session's __del__ via garbage collection
    • This error also occurs with other aiohttp __del__ methods
    • but it could have been a different chain of events that led to the inspection of an object on the stack that implicitly tries to grab a lock that's already held by the same thread
  2. the lock is held in the Opentelemetry SDK, on an internal variable used when starting or ending a span
    • because the span starts in Sentry's Tornado instrumentation, which itself relies on the Opentelemetry SDK
    • but it could have been anything in any code, first- or third-party
  3. the lock is being held by the thread that is attempting to re-acquire it
    • so in this case, making it a re-entrant lock avoids the deadlock
    • but in another case, it could be held by another thread, and acquiring it during garbage collection (while other threads are stopped) could still lead to a deadlock
    • for example, the garbage-collection could have been triggered on the main thread while a worker thread was holding the lock, and the deadlock would still occur, since the worker thread cannot wake up and release the lock
    • in fact, we think we have seen this happen in production as well (but not 100% sure)
      • edit: it seems like this might not be the case after all, looking at the PEP 703 description of stop-the-world garbage collection in Python 3.12 and 3.13

and these workarounds are available:

  1. Fix all the "Unclosed connector" issues in our application that cause the logging during __del__
    • but this isn't always straightforward, and it may regress
  2. Ignore the aiohttp logger in Sentry config (or disable all logging integration)
    • but then we don't get notified in Sentry about the above warnings or others
    • and actually, this was a red herring and doesn't fix the problem
    • The __del__ method calls the loop's exception handler, which is what logs at error level and gets captured by Sentry
    • we'd have to ignore the asyncio logger instead - but we couldn't know this without digging in to the code
    • what other risky code paths do we not know about yet?
  3. Disable stack variable capture altogether
    • obvious downsides; local variable capture is extremely valuable for debugging
  4. Something else that I haven't thought of yet

As the repro project shows, this particular issue can be resolved by changing the Lock to a RLock in the otel SDK. That may be a good change to make, or it may have unacceptable side effects; I'm not sure yet. The reason I am reporting this as a Sentry issue is that enabling Sentry in our application has had the surprising effect of exposing us to the risk of deadlock (which we have seen happen reliably in production), and we've had to disable Sentry in our Python contexts for the time being.

I'd like to start a conversation about what changes in the SDK might be possible to reduce or eliminate this risk and similar risks caused by a large operation such as stack introspection in various contexts where locks could be held.

Expected Result

No deadlock triggered by Sentry, regardless of what code/stack Sentry is attempting to inspect

Actual Result

Deadlock triggered by Sentry (does not occur when Sentry is disabled)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    Status

    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions