Add runtime target toggle API to disable/enable target cluster by millerjp · Pull Request #1 · axonops/zdm-proxy

millerjp · 2026-04-07T16:14:23Z

Summary

Adds a REST API to dynamically disable/enable the target cluster at runtime, preventing client errors when the target goes down
When disabled, nothing is sent to target — all requests go to origin only
On re-enable, standard UNPREPARED recovery handles stale prepared statement caches on target

Endpoints

On the existing metrics HTTP port (default 14001):

Method	Path	Description
`GET`	`/api/v1/target`	Current status
`POST`	`/api/v1/target/enable`	Re-enable target
`POST`	`/api/v1/target/disable`	Disable target
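
The routes above could be wired roughly as follows. This is a minimal sketch, not the handler in proxy/pkg/httpzdmproxy/target.go: the `targetToggle` type, the `newTargetMux` and `call` helpers, and the JSON response shape are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// targetToggle is a hypothetical stand-in for the proxy state the handlers act on.
type targetToggle struct {
	enabled atomic.Bool
}

// writeStatus reports the current state. The JSON shape is an assumption.
func (t *targetToggle) writeStatus(w http.ResponseWriter) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]bool{"enabled": t.enabled.Load()})
}

// newTargetMux registers the three routes from the table on a fresh mux.
func newTargetMux(t *targetToggle) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/target", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		t.writeStatus(w)
	})
	// set builds a POST-only handler that stores the given value.
	set := func(value bool) http.HandlerFunc {
		return func(w http.ResponseWriter, r *http.Request) {
			if r.Method != http.MethodPost {
				http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
				return
			}
			t.enabled.Store(value)
			t.writeStatus(w)
		}
	}
	mux.HandleFunc("/api/v1/target/enable", set(true))
	mux.HandleFunc("/api/v1/target/disable", set(false))
	return mux
}

// call sends one in-memory request and returns the status code and body.
func call(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	toggle := &targetToggle{}
	toggle.enabled.Store(true)
	mux := newTargetMux(toggle)

	for _, req := range [][2]string{
		{http.MethodGet, "/api/v1/target"},
		{http.MethodPost, "/api/v1/target/disable"},
		{http.MethodGet, "/api/v1/target/disable"}, // wrong method
		{http.MethodPost, "/api/v1/target/enable"},
	} {
		code, body := call(mux, req[0], req[1])
		fmt.Printf("%s %s -> %d %s", req[0], req[1], code, body)
	}
}
```

The sketch exercises the handlers in memory with `httptest`, so it needs no running proxy. Against a real deployment the same requests would go to the metrics port (default 14001).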

Changes

proxy/pkg/zdmproxy/proxy.go — targetEnabled atomic flag, getter/setter, Prometheus gauge
proxy/pkg/zdmproxy/clienthandler.go — forward decision override, heartbeat skip
proxy/pkg/zdmproxy/requestcontext.go — effectiveForwardDecision for consistent tracking
proxy/pkg/httpzdmproxy/target.go — REST API handler
proxy/pkg/metrics/proxy_metrics.go — zdm_target_enabled gauge
proxy/pkg/runner/runner.go — endpoint registration
README.md — operator documentation

Test plan

7 HTTP handler unit tests (status, enable, disable, method not allowed, proxy not ready)
21 request context + toggle unit tests (effective forward decision, timeout, cancel, concurrency with race detector)
CCM acceptance test — 4-phase lifecycle:
1. Target enabled: inline/prepared/batch/counter writes → metrics on both clusters
2. Disable target → same writes → only origin metrics increment
3. While disabled, create new prepared statements (origin only)
4. Re-enable → use those new prepared statements → UNPREPARED recovery → both metrics resume, data verified on target

When the target cluster goes down, the proxy's forwardToBoth behaviour causes all client requests to error or time out. This adds a REST API to dynamically disable the target at runtime so the proxy continues serving from origin only.

Endpoints (on the existing metrics HTTP port):
  GET /api/v1/target - current status
  POST /api/v1/target/enable - re-enable target
  POST /api/v1/target/disable - disable target

When disabled, all forwardToBoth and forwardToTarget requests are redirected to origin. No heartbeats are sent to target. On re-enable, the standard UNPREPARED recovery mechanism handles stale prepared statement caches on the target.

Includes:
- atomic.Bool flag on ZdmProxy, shared with all ClientHandlers
- effectiveForwardDecision on requestContextImpl for consistent completion tracking, response aggregation, and metrics
- Prometheus gauge zdm_target_enabled (0/1)
- WARN-level logs on every state change
- 7 HTTP handler unit tests
- 21 request context and toggle unit tests (with race detector)
- CCM acceptance test: 4-phase lifecycle covering inline, prepared, batch, counter writes with disable/re-enable and UNPREPARED recovery
- README documentation

Handshake requests (STARTUP, AUTH) must go to both clusters so the target ClusterConnector can establish its connection. Only apply the target-disabled override after the handshake is complete.
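
The override described in the two commit messages above can be sketched as a pure function. The `forwardDecision` type, its constants, and the `handshakeDone` parameter are assumptions for illustration; the real proxy has more decision values and keeps this state on requestContextImpl.

```go
package main

import "fmt"

// forwardDecision mirrors the decision names used in the text. The concrete
// type and constant values are assumptions made for this sketch.
type forwardDecision string

const (
	forwardToOrigin forwardDecision = "origin"
	forwardToTarget forwardDecision = "target"
	forwardToBoth   forwardDecision = "both"
)

// effectiveForwardDecision returns the decision actually used for a request.
// While the handshake is in progress the original decision is kept; afterwards
// anything that would touch a disabled target is redirected to origin.
func effectiveForwardDecision(original forwardDecision, targetEnabled, handshakeDone bool) forwardDecision {
	if targetEnabled || !handshakeDone {
		return original
	}
	if original == forwardToBoth || original == forwardToTarget {
		return forwardToOrigin
	}
	return original
}

func main() {
	// Target disabled, handshake complete: redirected to origin.
	fmt.Println(effectiveForwardDecision(forwardToBoth, false, true))
	// Target disabled, handshake still running: left untouched.
	fmt.Println(effectiveForwardDecision(forwardToBoth, false, false))
	// Target enabled: left untouched.
	fmt.Println(effectiveForwardDecision(forwardToTarget, true, true))
}
```

Computing the effective decision once and reusing it for completion tracking, response aggregation, and metrics keeps those three paths from disagreeing about which clusters a request was sent to.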

When target is disabled, do not create a target ClusterConnector at all. This means no TCP connection, no handshake, no heartbeats — nothing goes to target regardless of whether it is up or down. All references to targetCassandraConnector are nil-guarded. New client connections created while target is disabled work with origin only.
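
A minimal sketch of the nil-guard pattern this commit describes, assuming a simplified `connector` type in place of ClusterConnector and a hypothetical `newClientHandler` signature:

```go
package main

import "fmt"

// connector is a hypothetical stand-in for ClusterConnector.
type connector struct {
	name       string
	heartbeats int
}

func (c *connector) heartbeat() { c.heartbeats++ }

// clientHandler holds one connector per cluster. target is nil when the
// handler was created while the target was disabled.
type clientHandler struct {
	origin *connector
	target *connector
}

// newClientHandler only builds a target connector when the target is enabled,
// so a disabled target sees no TCP connection, handshake, or heartbeat.
func newClientHandler(targetEnabled bool) *clientHandler {
	h := &clientHandler{origin: &connector{name: "origin"}}
	if targetEnabled {
		h.target = &connector{name: "target"}
	}
	return h
}

// sendHeartbeats pings every connector that exists and returns how many were pinged.
func (h *clientHandler) sendHeartbeats() int {
	sent := 0
	for _, c := range []*connector{h.origin, h.target} {
		if c == nil {
			continue // nil guard: no target connector was created
		}
		c.heartbeat()
		sent++
	}
	return sent
}

func main() {
	fmt.Println(newClientHandler(true).sendHeartbeats())
	fmt.Println(newClientHandler(false).sendHeartbeats())
}
```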

When target is disabled, PREPARE requests go to origin only. The processPreparedResponse function assumed targetResponse was always present and failed with "unexpected target response nil". Fix: when target is disabled and targetResponse is nil, cache the prepared statement with origin ID only. When target is re-enabled, EXECUTE triggers UNPREPARED on target, and the client driver re-prepares automatically. Also added debug logging on forward decision override.
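
The fix can be illustrated with a simplified cache. This is a sketch under stated assumptions: `preparedResult`, `preparedEntry`, and `cachePrepared` are hypothetical names, and the real processPreparedResponse works on full protocol frames rather than bare IDs.

```go
package main

import (
	"errors"
	"fmt"
)

// preparedResult is a hypothetical, simplified PREPARED response that carries
// only the statement ID.
type preparedResult struct{ id string }

// preparedEntry is what the sketch caches per statement. targetID stays empty
// when the statement was prepared while the target was disabled.
type preparedEntry struct {
	originID string
	targetID string
}

// cachePrepared stores the statement keyed by origin ID. A missing target
// response is an error only when the target is enabled.
func cachePrepared(cache map[string]preparedEntry, origin, target *preparedResult, targetEnabled bool) error {
	if origin == nil {
		return errors.New("unexpected origin response nil")
	}
	entry := preparedEntry{originID: origin.id}
	switch {
	case target != nil:
		entry.targetID = target.id
	case targetEnabled:
		return errors.New("unexpected target response nil")
	}
	cache[origin.id] = entry
	return nil
}

func main() {
	cache := map[string]preparedEntry{}

	// Target disabled: the entry is cached with the origin ID only.
	fmt.Println(cachePrepared(cache, &preparedResult{id: "o1"}, nil, false), cache["o1"])

	// Target enabled but no target response: still reported as an error.
	fmt.Println(cachePrepared(cache, &preparedResult{id: "o2"}, nil, true))
}
```

An entry without a target ID is what later leads to UNPREPARED on the target once it is re-enabled, at which point the client driver re-prepares.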

Fixes:
- Skip secondary handshake when target connector is nil (prevents panic when new client connects while target is disabled)
- Handle nil targetResponse in STARTUP handshake validation (primary handshake failed because it expected both responses)

Expanded CCM tests:
- TestTargetToggleCCM: Added inline UPDATE/DELETE, prepared UPDATE/DELETE, prepared counter, mixed batches, counter batches, SELECT verification while disabled, rapid toggle cycles (5x)
- TestTargetToggleConcurrentLoadCCM: 10 goroutines x 50 writes each while toggling target on/off 3 times during the load; verifies no writes fail due to toggle
- TestTargetToggleOutageCCM: Full outage simulation using temporary CCM clusters: stop target node, observe errors, disable target, writes succeed, restart target, re-enable, verify writes to both

When target is disabled and a new client connects, targetCassandraConnInfo is nil but NewClientHandler dereferenced it at line 169 to get the endpoint identifier. Added nil guard. Also relaxed concurrent load test assertion — allow up to 5% error rate during toggle transitions, as in-flight requests to target may fail during the brief window when the flag flips.

The toggle API now returns 409 Conflict when attempting to disable target if read_mode=DUAL_ASYNC_ON_SECONDARY or forward_client_credentials_to_origin=true. These configs require target to be reachable and are only used in later migration phases.

Also fixes:
- Nil aggregatedResponse when forwardAuthToTarget=true and target disabled (uses origin response as fallback)
- Nil asyncConnInfo when DUAL_ASYNC target connector would point to disabled target (skips async connector creation)

Tests:
- 4 new proxy unit tests for validation (blocked by dual async, blocked by forward creds, enable always allowed, default allowed)
- 2 new HTTP handler tests (disable blocked returns 409, enable not blocked)
- CCM integration test: creates proxies with each config and verifies disable is rejected, including via HTTP API
- README updated with restrictions section
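
The validation rule could look roughly like this. It is a sketch only: the `config` and `proxy` types, their field names, the error text, and the `disableStatus` helper are assumptions, and only the two blocking settings and the 409 status come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// config holds the two settings named in the commit message. Field names are assumptions.
type config struct {
	readMode                         string
	forwardClientCredentialsToOrigin bool
}

// proxy is a hypothetical stand-in for ZdmProxy.
type proxy struct {
	conf          config
	targetEnabled atomic.Bool
}

var errDisableBlocked = errors.New("target cannot be disabled with the current configuration")

// setTargetEnabled rejects a disable request when the configuration needs a
// reachable target. Enabling is always allowed.
func (p *proxy) setTargetEnabled(enabled bool) error {
	if !enabled && (p.conf.readMode == "DUAL_ASYNC_ON_SECONDARY" || p.conf.forwardClientCredentialsToOrigin) {
		return errDisableBlocked
	}
	p.targetEnabled.Store(enabled)
	return nil
}

// disableHandler maps the validation error to 409 Conflict.
func (p *proxy) disableHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	if err := p.setTargetEnabled(false); err != nil {
		http.Error(w, err.Error(), http.StatusConflict)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// disableStatus sends an in-memory POST to the disable handler and returns the status code.
func disableStatus(p *proxy) int {
	rec := httptest.NewRecorder()
	p.disableHandler(rec, httptest.NewRequest(http.MethodPost, "/api/v1/target/disable", nil))
	return rec.Code
}

func main() {
	blocked := &proxy{conf: config{readMode: "DUAL_ASYNC_ON_SECONDARY"}}
	blocked.targetEnabled.Store(true)
	fmt.Println(disableStatus(blocked), blocked.targetEnabled.Load())

	allowed := &proxy{}
	allowed.targetEnabled.Store(true)
	fmt.Println(disableStatus(allowed), allowed.targetEnabled.Load())
}
```

Keeping the check inside the setter rather than the HTTP handler means any other caller of the toggle gets the same restriction.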

New test: TestTargetToggleEdgeCasesCCM with 7 sub-tests:
- reads_while_disabled: SELECT queries continue working from origin while target is disabled, including repeated reads
- use_keyspace_while_disabled: USE keyspace statement works, writes after USE work correctly
- schema_changes_while_disabled: CREATE TABLE through proxy while disabled, write to new table, re-enable and verify writes to both
- multiple_new_connections_while_disabled: 5 new gocql sessions created while disabled, each does reads and writes successfully
- interleaved_reads_writes_during_toggle: 3 cycles of mixed SELECT and INSERT operations during disable/enable transitions
- prepared_statements_survive_multiple_cycles: same prepared statements used across 5 disable/enable cycles with UNPREPARED recovery verification on target after each re-enable
- system_queries_while_disabled: system.local and system.peers intercepted queries work while target disabled

gocql returns "use statements aren't supported" for Query("USE ks"). Replaced with a qualified_writes_while_disabled test that verifies writes with fully qualified keyspace.table work and data appears on origin but not target.

millerjp added 11 commits April 7, 2026 18:11

Fix gofmt formatting

5b1b103

Skip target override during handshake to prevent connection failures

ab6c24c


Fix compilation error in CCM test helper function

0b6f87e

millerjp wants to merge 11 commits into main from feature/target-writes-toggle
