fix: don't report coroutine cancellations as resolve errors in telemetry #242

Merged
fabriziodemaria merged 5 commits into main from fix/cancellation-not-error on May 4, 2026
Conversation

@fabriziodemaria
Member

Summary

  • CancellationException in RemoteFlagResolver.resolve() was caught by the generic catch (e: Exception) branch and tracked as STATUS_ERROR in telemetry. This significantly inflated the confidence_sdk_telemetry_resolve_latency_seconds_count{status="STATUS_ERROR"} metric: every putContext(), fetchAndActivate(), or asyncFetch() call cancels any in-flight resolve before starting a new one, and each cancellation was misreported as an error.
  • With a simulated cancel-and-replace pattern (3 rapid context changes per cycle), the SDK would report ~47% STATUS_ERROR with zero actual network failures. This matches the consistently high error ratios observed in production across all PLATFORM_KOTLIN customers.
  • The fix catches CancellationException before the generic Exception handler and skips telemetry tracking entirely for cancelled requests.
  • Also fixes a pre-existing nullability mismatch in TelemetryTest.kt (encodedHeaderValue() returns String? but decodeMonitoring() expected non-null String).
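
The catch ordering described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `Status`, `Telemetry`, and `resolveWithTelemetry` are hypothetical names, the call is modelled as a plain lambda rather than a suspend function so that the example needs only the Kotlin standard library, and rethrowing the cancellation is an assumption about how the handler behaves.

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Hypothetical stand-ins for the SDK's telemetry types.
enum class Status { SUCCESS, ERROR }

class Telemetry {
    val traces = mutableListOf<Status>()
    fun track(status: Status) { traces += status }
}

// Runs `call` and records one trace, except when the call was cancelled.
fun <T> resolveWithTelemetry(telemetry: Telemetry, call: () -> T): T {
    try {
        val result = call()
        telemetry.track(Status.SUCCESS)
        return result
    } catch (e: CancellationException) {
        // Must precede the generic handler: CancellationException is itself an Exception.
        // No trace is recorded; rethrowing (an assumption here) lets callers see the cancellation.
        throw e
    } catch (e: Exception) {
        telemetry.track(Status.ERROR)
        throw e
    }
}

fun main() {
    val telemetry = Telemetry()
    runCatching { resolveWithTelemetry(telemetry) { throw CancellationException("replaced by a newer resolve") } }
    runCatching { resolveWithTelemetry(telemetry) { throw IllegalStateException("boom") } }
    resolveWithTelemetry(telemetry) { "ok" }
    println(telemetry.traces) // prints [ERROR, SUCCESS]; the cancelled call adds no entry
}
```

If the two catch branches were swapped, the generic branch would match first and the cancelled call would be recorded as an error again.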

Test plan

  • testCancelledResolveDoesNotTrackTelemetry — verifies cancelled resolve produces no telemetry trace
  • testSuccessfulResolveTracksTelemetry — verifies successful resolve still produces a telemetry trace
  • testFailedResolveTracksTelemetryAsError — verifies HTTP errors still produce a telemetry trace
  • All existing ConfidenceRemoteClientTests and TelemetryTest pass

Made with Cursor

CancellationException was caught by the generic `catch (e: Exception)`
branch in RemoteFlagResolver, causing cancelled resolves to be tracked
as STATUS_ERROR. This inflated the resolve error ratio significantly
because every putContext() / fetchAndActivate() call cancels any
in-flight resolve before starting a new one.

Skip telemetry tracking entirely for cancelled requests.

Also fix pre-existing nullability mismatch in TelemetryTest.

Co-authored-by: Cursor <cursoragent@cursor.com>
fabriziodemaria and others added 3 commits May 4, 2026 11:24
Co-authored-by: Cursor <cursoragent@cursor.com>
ConfidenceError.HttpError extends java.lang.Error, not Exception, so
it was bypassing the catch block and being tracked as STATUS_SUCCESS.
Add a catch(Error) block so HTTP errors are correctly reported as
STATUS_ERROR.
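
A small sketch of why the extra branch matters, assuming a hypothetical `HttpError` class that extends `Error` the way the commit message says `ConfidenceError.HttpError` does. `TraceStatus` and `statusOf` are illustrative names, not the SDK's API.

```kotlin
// Hypothetical stand-in for ConfidenceError.HttpError; extends Error, not Exception.
class HttpError(val code: Int) : Error("HTTP $code")

enum class TraceStatus { SUCCESS, ERROR }

// Maps the outcome of `call` to a telemetry status.
fun statusOf(call: () -> Unit): TraceStatus =
    try {
        call()
        TraceStatus.SUCCESS
    } catch (e: Exception) {
        TraceStatus.ERROR
    } catch (e: Error) {
        // Error is a sibling of Exception under Throwable, so the branch above never matches it.
        TraceStatus.ERROR
    }

fun main() {
    val httpError: Throwable = HttpError(500)
    println(httpError is Exception)                 // prints false
    println(statusOf { throw HttpError(500) })      // prints ERROR
    println(statusOf { throw RuntimeException() })  // prints ERROR
    println(statusOf { })                           // prints SUCCESS
}
```

Note that a `catch (e: Error)` branch also matches JVM errors such as `OutOfMemoryError`; catching the specific error type would be a narrower alternative.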

Also strengthen test assertions to decode the telemetry header and
verify the actual RequestTrace.Status value.

Co-authored-by: Cursor <cursoragent@cursor.com>
When the device has no network (DNS resolution fails or there is no
route to host), the SDK now reports STATUS_OFFLINE instead of
STATUS_ERROR in telemetry. This lets us distinguish genuine errors
from expected offline scenarios in metrics.
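
One way the offline classification could look is sketched below. The mapping of DNS failure to `java.net.UnknownHostException` and of a missing route to `java.net.NoRouteToHostException` is an assumption, as is walking the cause chain; `NetStatus` and `statusFor` are hypothetical names.

```kotlin
import java.io.IOException
import java.net.NoRouteToHostException
import java.net.UnknownHostException

// Hypothetical status enum covering only the two outcomes discussed here.
enum class NetStatus { ERROR, OFFLINE }

// Classifies a failure as OFFLINE when it, or any of its causes, indicates
// that the device could not reach the network at all (assumed exception types).
fun statusFor(failure: Throwable): NetStatus {
    val chain = generateSequence(failure) { it.cause }
    val offline = chain.any { it is UnknownHostException || it is NoRouteToHostException }
    return if (offline) NetStatus.OFFLINE else NetStatus.ERROR
}

fun main() {
    println(statusFor(UnknownHostException("example.invalid")))                      // prints OFFLINE
    println(statusFor(IOException("wrapped", NoRouteToHostException("no route"))))   // prints OFFLINE
    println(statusFor(IOException("connection reset")))                              // prints ERROR
}
```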

Co-authored-by: Cursor <cursoragent@cursor.com>
@fabriziodemaria merged commit 9f37e84 into main on May 4, 2026
2 checks passed
@fabriziodemaria deleted the fix/cancellation-not-error branch on May 4, 2026 12:27