### How do you use Sentry?
Sentry SaaS (sentry.io)
### Version
2.53.0
### Steps to Reproduce
Detailed minimal repro project here:
https://github.com/brettdh/sentry-aiohttp-deadlock-repro/
Note: this is a very specific example of a more general issue: if an error occurs in a context where a lock is held, Sentry's inspection and serialization of stack variables can cause a deadlock, because inspecting a variable may trigger another attempt to acquire a lock that is already held.
In this specific case, these details fill in the general pattern above:
- the context is an aiohttp client session's `__del__`, invoked via garbage collection
  - this error also occurs with other aiohttp `__del__` methods
  - but it could have been a different chain of events that led to the inspection of an object on the stack that implicitly tries to grab a lock that's already held by the same thread
- the lock is held in the OpenTelemetry SDK, on an internal variable used when starting or ending a span
  - because the span starts in Sentry's Tornado instrumentation, which itself relies on the OpenTelemetry SDK
  - but it could have been anything in any code, first- or third-party
- the lock is being held by the same thread that is attempting to re-acquire it
  - so in this case, making it a re-entrant lock avoids the deadlock
  - but in another case, it could be held by another thread, and acquiring it during garbage collection (while other threads are stopped) could still lead to a deadlock
    - for example, garbage collection could have been triggered on the main thread while a worker thread was holding the lock, and the deadlock would still occur, since the worker thread cannot wake up and release the lock
    - in fact, we think we have seen this happen in production as well (but we are not 100% sure)
    - edit: it seems this might not be the case after all, looking at the PEP 703 description of stop-the-world garbage collection in Python 3.12 and 3.13
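The core pattern can be demonstrated with a few lines of stdlib Python, independent of Sentry, aiohttp, or OpenTelemetry (all names below are illustrative, not the real SDK internals): an object's `__repr__` tries to take a lock that the inspecting thread already holds, which is what happens when a serializer walks stack variables inside a locked region.

```python
import threading

class SpanLike:
    """Illustrative stand-in for an object whose repr needs a lock."""
    def __init__(self, lock):
        self._lock = lock

    def __repr__(self):
        # A serializer calling repr() re-enters this lock. Using a
        # timeout here only so the demo terminates; a blocking acquire
        # on a plain Lock would hang forever on the same thread.
        if self._lock.acquire(timeout=0.1):
            try:
                return "<SpanLike>"
            finally:
                self._lock.release()
        return "<SpanLike: lock unavailable>"

def inspect_under_lock(lock):
    obj = SpanLike(lock)
    with lock:              # e.g. the SDK holding its internal lock
        return repr(obj)    # e.g. Sentry serializing stack variables

# Non-reentrant Lock: the inner acquire fails (a real serializer
# doing a blocking acquire would deadlock here).
print(inspect_under_lock(threading.Lock()))   # <SpanLike: lock unavailable>
# Re-entrant RLock: the same thread may re-acquire, so repr succeeds.
print(inspect_under_lock(threading.RLock()))  # <SpanLike>
```

Note that the RLock escape hatch only helps when the holder and the re-acquirer are the same thread, which matches the cross-thread caveat in the list above.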
and these workarounds are available:
- Fix all the "Unclosed connector" issues in our application that cause the logging during `__del__`
  - but this isn't always straightforward, and it may regress
- Ignore the `aiohttp` logger in the Sentry config (or disable the logging integration entirely)
  - but then we don't get notified in Sentry about the above warnings or others
  - and actually, this was a red herring and doesn't fix the problem
    - the `__del__` method calls the loop's exception handler, which is what logs at `error` level and gets captured by Sentry; we'd have to ignore the `asyncio` logger instead
    - but we couldn't know this without digging into the code; what other risky code paths do we not know about yet?
- Disable stack variable capture altogether
  - obvious downside: local variable capture is extremely valuable for debugging
- Something else that I haven't thought of yet
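For reference, the second and third workarounds look roughly like this with `sentry_sdk`'s documented options (a sketch, not a fix for the underlying issue; the DSN is a placeholder):

```python
import sentry_sdk
from sentry_sdk.integrations.logging import ignore_logger

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder
    # Workaround 3: stop capturing local variables entirely,
    # at the cost of losing them for debugging.
    include_local_variables=False,
)

# Workaround 2 (in its corrected form): the error is logged by the event
# loop's exception handler under the "asyncio" logger, not "aiohttp".
ignore_logger("asyncio")
```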
As the repro project shows, this particular issue can be resolved by changing the Lock to an RLock in the OpenTelemetry SDK. That may be a good change to make, or it may have unacceptable side effects; I'm not sure yet. The reason I am reporting this as a Sentry issue is that enabling Sentry in our application has had the surprising effect of exposing us to the risk of deadlock (which we have seen happen reliably in production), and we've had to disable Sentry in our Python contexts for the time being.
I'd like to start a conversation about what SDK changes might be possible to reduce or eliminate this risk, and similar risks caused by heavyweight operations such as stack introspection running in contexts where locks may be held.
### Expected Result
No deadlock triggered by Sentry, regardless of what code/stack Sentry is attempting to inspect
### Actual Result
Deadlock triggered by Sentry (does not occur when Sentry is disabled)