Skip to content

feat(rivetkit): expose low-cardinality metrics routes#5019

Draft
NathanFlurry wants to merge 1 commit into
counter-latency/serverless-docsfrom
counter-latency/rivetkit-metrics-routes
Draft

feat(rivetkit): expose low-cardinality metrics routes#5019
NathanFlurry wants to merge 1 commit into
counter-latency/serverless-docsfrom
counter-latency/rivetkit-metrics-routes

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

PR Review: feat(rivetkit): expose low-cardinality metrics routes

Overview

This PR does two related things:

  1. Metrics refactor reduces cardinality by collapsing per-actor-instance labels (actor_id_gen, actor_key, envoy_key) into a single actor_name label, renames all metrics to use consistent component prefixes (rivetkit_actor_*, envoy_*, pegboard_*), and converts some gauge metrics to histograms.
  2. New routes API exposes health, metadata, and metrics (Prometheus) as first-class endpoints via CoreRegistry in NAPI and as a routes property on the TypeScript Registry class.

The cardinality reduction is a meaningful improvement for deployments with many actor instances. The routes API is a clean addition. The code is generally well-structured.


Issues

Security regression: metrics endpoint lost auth check

File: rivetkit-rust/packages/rivetkit-core/src/registry/http.rs, handle_metrics_fetch

The old code checked authorize_metrics_request (gated on RIVETKIT_METRICS_TOKEN) before serving Prometheus metrics. The new code removed that check entirely. The new public metrics_endpoint functions are also exported without any auth wrapper. If this endpoint is reachable without an existing auth layer, metrics are now unauthenticated. The decision to remove auth may be intentional (caller is expected to gate access), but should be made explicit.

No-op methods leave dead computation at call sites

File: rivetkit-rust/packages/rivetkit-core/src/actor/metrics.rs

Several methods are now empty no-ops: set_queue_depth, set_active_connections, set_lifecycle_inbox_depth, set_dispatch_inbox_depth, set_lifecycle_event_inbox_depth, begin_user_task. The call sites still compute the values being passed. If these metrics are permanently dropped (not just deferred), the call sites should be cleaned up too. If deferred, a tracking note would help.

Test update reveals silent health degradation window

Files: tests/context.rs, tests/schedule.rs, tests/sqlite.rs, tests/task.rs

All test EnvoyHandleShared constructors now initialize last_ping_ts: AtomicI64::new(i64::MAX). In production, a freshly-connected registry starts with ping_healthy = false until the first engine ping arrives, so health() returns 503 with engine_ping_stale during that startup window. This is probably correct behavior, but load balancers watching the health endpoint will see a brief 503 on startup. Worth documenting.

Metadata throws before serve; inconsistent with other routes

File: rivetkit-typescript/packages/rivetkit-napi/src/registry.rs, metadata()

metadata() returns a hard NAPI error if route_metadata has not been populated yet, while health() and prometheusMetrics() both return graceful non-throwing 503 responses in the "not started" case. metadata() should do the same for consistency.

WasmCoreRuntime only implements registryHealth

registryMetadata and registryMetrics are absent from WasmCoreRuntime, so registry.routes.metadata() and registry.routes.prometheusMetrics() return 501 for wasm users. Probably intentional, but worth a comment or doc note.


Observations

Good: Removing the RETAINED_ACTORS map and the 10-minute metric retention window significantly simplifies the metrics lifecycle. The new Arc<ActorMetricInner> + Drop approach is cleaner and correct.

Good: serve_config_from_js DRYs up the duplicated JsServeConfig to ServeConfig conversion that previously existed in two call sites.

Good: The switch from per-instance gauges (high-cardinality) to per-actor-type counters/histograms is the right call for production deployments. Label naming is now consistent: rivetkit_actor_* for rivetkit-core, envoy_* for pegboard-envoy, pegboard_* for pegboard.

Minor: The renamed metrics (sqlite_commit_envoy_* to envoy_sqlite_commit_*, actor_active to rivetkit_actor_active_count, etc.) are a breaking change for existing dashboards and alerts. SQLITE_METRICS.md is updated, but the ~20 other renames have no migration note. A brief operator-facing entry somewhere would help.

Minor: health_response in registry.rs falls back to b"{}" if JSON serialization fails, silently swallowing the error. A tracing::warn! on that path would be more consistent with the project logging conventions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant