Skip to content

feat(envoy-client): add ping health check#5001

Merged
NathanFlurry merged 1 commit into
mainfrom
05-08-feat_envoy-client_add_ping_health_check
May 9, 2026
Merged

feat(envoy-client): add ping health check#5001
NathanFlurry merged 1 commit into
mainfrom
05-08-feat_envoy-client_add_ping_health_check

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@MasterPtato MasterPtato requested a review from NathanFlurry May 9, 2026 02:25
@MasterPtato MasterPtato mentioned this pull request May 9, 2026
11 tasks
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

Code Review: feat(envoy-client): add ping health check

Overview

This PR adds an application-level ping health check to the envoy-client. When the engine goes silent (no pings for >20 s), EnvoyHandle::is_ping_healthy() returns false, and the /health HTTP endpoint returns 503 so container hosts (Cloud Run, k8s, etc.) can recycle the container. A new FreezeProxy test utility makes it possible to simulate a true network black-hole (Toxiproxy always relays FINs, which would defeat the test). The approach is sound and fills a real gap.


Positive

  • last_ping_ts initialized to construction time avoids false-negative recycling during startup before the first engine ping arrives.
  • Release/Acquire ordering on the AtomicI64 store/load pair is the right choice; SeqCst would be overkill.
  • FreezeProxy design rationale is clearly documented in the module doc-comment. The black-hole behavior (drain-and-discard without forwarding EOF) is exactly what the test needs.
  • Duplicate pub mod auth; removal in envoy/mod.rs is a good cleanup.
  • CLAUDE.md clarification on version converters is helpful.

Issues

1. last_ping_ts is not reset on reconnect

SharedContext and therefore last_ping_ts is created once in start_envoy_sync_inner and persists across reconnects. After a reconnect, the old stale timestamp carries over and no new ping arrives until the engine acks the new connection. If reconnect takes longer than ~20 s (e.g. engine restart), is_ping_healthy() returns false and the container host may recycle the actor while the envoy-client is already in the middle of re-establishing the link.

Suggestion: reset last_ping_ts to now_millis() when a new WebSocket connection is successfully established, in the connection handler just before starting to forward messages.

2. FreezeProxy spawns tasks with no shutdown path

The listener and per-connection forward tasks are spawned via tokio::spawn with no JoinHandle stored and no cancellation token. When FreezeProxy is dropped the spawned tasks keep running until the runtime exits. Connections frozen with pending().await are permanently stuck. For a test-only utility this is acceptable, but tasks are never joined so panics are silently swallowed, and any future test that wants to unfreeze or reuse the proxy cannot.

Suggestion: store the accept JoinHandle and add an AbortHandle or CancellationToken so the proxy can be shut down gracefully if tests need it.

3. timeout_upstream in toxiproxy.rs is never used

timeout_upstream is added to ToxiproxyProxy but not referenced anywhere in this PR. Either use it in a test here or remove it and add it when actually needed.

4. Clock skew edge case in is_ping_healthy

If the system clock is stepped backward (NTP correction, container clock sync), last can exceed now_millis(), producing a large negative i64 that compares less than 20,000 and reports healthy. This is the safer direction (false-positive healthy beats false-negative recycle), but it is not obvious. A brief comment noting the intentional behavior on negative delta would help.

5. Test relies on implicit timing relationship to engine ping timeout

The new test sleeps 25 s before the freeze and another 25 s after. The comment says the engine ping timeout is 15 s, but that value is not expressed as a constant visible in the test file. If the engine ping timeout is tuned, the wait may become either too short (flake) or needlessly long. Exposing the engine ping-timeout value as a test constant alongside PING_HEALTHY_THRESHOLD_MS would make the sleep duration derivable from actual thresholds rather than hand-picked numbers.


Minor

  • FreezeProxy::freeze uses Ordering::SeqCst while the production path uses the tighter Release/Acquire pair. No correctness issue, but inconsistent; prefer Release/Acquire in the test utility for symmetry.
  • now_millis() casts u128 to i64. No practical concern, but a brief inline comment acknowledging the narrowing cast would be clarifying.
  • The PR description template is not filled in (type of change, test description, checklist unchecked).

Summary

The core mechanism is correct and well-motivated. The main actionable items are: reset last_ping_ts on reconnect to avoid spurious recycling, remove or use timeout_upstream, and add a shutdown path to FreezeProxy. The timing-constant concern is lower-priority polish.

@NathanFlurry NathanFlurry force-pushed the 05-01-chore_engine_roxiproxi_tests branch from a5be52d to ea3cb2a Compare May 9, 2026 07:33
@NathanFlurry NathanFlurry force-pushed the 05-08-feat_envoy-client_add_ping_health_check branch from 2605f01 to 72cbc7c Compare May 9, 2026 07:33
Base automatically changed from 05-01-chore_engine_roxiproxi_tests to main May 9, 2026 07:50
@NathanFlurry NathanFlurry merged commit 72cbc7c into main May 9, 2026
11 of 19 checks passed
@NathanFlurry NathanFlurry deleted the 05-08-feat_envoy-client_add_ping_health_check branch May 9, 2026 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants