# Graphiti Connection Manager Refactor Plan

## Goal

Remove session-creation stalls caused by Graphiti connection setup by moving MCP
transport lifecycle management into a dedicated connection manager that starts
on plugin launch, stays alive for the process lifetime, reconnects
automatically, buffers requests while connecting, and rejects new requests
while offline with a typed error so higher-level memory features fail open.

## Current Problem

- `src/index.ts` awaits `client.connect()` during plugin initialization.
- OpenCode appears to instantiate the plugin lazily on first real session use,
  so the first session pays the MCP connection warmup cost.
- Higher-level methods in `src/services/client.ts` mix transport lifecycle,
  retry logic, request execution, and response parsing in one class.
- Timeouts and disconnects are handled per call, but there is no separate
  always-on connection state machine.

## Target Design

Introduce a dedicated `GraphitiConnectionManager` layer under `src/services/`.

Responsibilities:

- Own the MCP `Client` and `StreamableHTTPClientTransport` lifecycle.
- Start connecting as soon as the plugin launches, without blocking hook
  registration.
- Maintain explicit connection state: `connecting`, `connected`, `offline`, and
  `closing`.
- Auto-reconnect after disconnect with exponential backoff (see
  [Reconnect Strategy](#reconnect-strategy)).
- Classify transport-level failures (session expiry, network errors, timeouts)
  internally so callers never inspect raw transport errors.
- Queue requests that arrive while state is `connecting`, subject to per-request
  deadlines.
- Reject requests that arrive while state is `offline` with a typed error,
  allowing higher-level APIs to degrade gracefully instead of stalling.
- Expose a readiness signal (`ready(): Promise<boolean>`) that resolves when the
  first connection succeeds or a caller-supplied timeout elapses, so
  first-message hooks can bound their wait.
- Expose a single request API for tool execution so `GraphitiClient` becomes a
  thin domain adapter.

Non-goals:

- No durable disk-backed queue.
- No guaranteed delivery while Graphiti is offline.
- No change to memory search, injection, or compaction semantics beyond their
  behavior during transport failure.

## Proposed Architecture

### 1. New connection-manager service

Create `src/services/connection-manager.ts` with:

- A connection-state union type:
  `"connecting" | "connected" | "offline" | "closing"`.
- A manager class that stores:
  - endpoint
  - MCP client instance
  - transport instance
  - current state
  - in-flight connect promise (serialized; see below)
  - bounded queue of pending requests created during `connecting`
  - reconnect backoff metadata (attempt count, next delay, timer handle)
  - a readiness `Promise<boolean>` that resolves on first successful connect or
    on a configurable startup timeout
- Methods:
  - `start()` — begin background connection on plugin launch; transitions
    immediately to `connecting`.
  - `stop()` — transition to `closing`, drain or reject queued requests, close
    the MCP client, cancel any pending reconnect timer, then become inert. After
    `stop()` all subsequent `callTool` calls reject immediately.
  - `ready(timeoutMs?)` — returns a promise that resolves `true` when the
    manager reaches `connected`, or `false` if the timeout elapses first.
    Callers such as first-message hooks can use this to bound their wait.
  - `callTool(name, args, deadlineMs?)` — route requests according to current
    state; accepts an optional per-request deadline.
  - `reconnect()` — rebuild client and transport after disconnect/session loss.
    Serialized: concurrent callers share a single in-flight attempt.

#### State behavior

- **`connecting`** — execute `client.connect()`. Incoming `callTool` requests
  are enqueued. Each queued request carries a per-request deadline (default:
  configurable, e.g. 15 s). If the deadline fires before the connection is
  established, the request rejects with a typed timeout error so hook flows do
  not hang indefinitely.
- **`connected`** — execute `callTool` immediately. If a call fails with a
  transport error (network reset, socket hang-up, etc.) or an MCP 404
  session-expiry error, the manager transitions to `connecting` and triggers a
  serialized reconnect. The failed request is retried once after the reconnect
  succeeds.
- **`offline`** — the manager enters this state when a connect or reconnect
  attempt fails after exhausting the current backoff step. Incoming `callTool`
  requests reject immediately with a typed offline error. A background reconnect
  timer continues with exponential backoff; on success the manager transitions
  back to `connected`.
- **`closing`** — entered by `stop()`. All queued requests are rejected. No new
  requests are accepted. The MCP client is closed and the reconnect timer is
  cancelled.

#### Failure classification

The connection manager owns all transport-error classification so that callers
never inspect raw error shapes:

- **Session expiry** — MCP error code 404. Action: rebuild client + transport,
  retry the request once.
- **Transport failure** — network errors, socket resets, connection refused,
  unexpected stream termination. Action: transition to `connecting`, trigger
  serialized reconnect.
- **Request timeout** — MCP error code -32001 or message matching
  `request timed out`. Action: surface to caller as a typed timeout error (no
  reconnect needed).

This keeps transport concerns encapsulated inside the connection manager.

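A sketch of the classifier follows. The error shapes (a `code` field carrying
either an MCP numeric code or a Node-style network code string) are assumptions
to confirm against what the MCP SDK actually throws.

```typescript
// Hypothetical failure classifier. The `code`/`message` shapes below are
// assumptions; match them against the real MCP SDK error objects.
type FailureKind =
  | "session-expired"
  | "request-timeout"
  | "transport-failure"
  | "unknown";

function classifyFailure(err: unknown): FailureKind {
  const e = err as { code?: number | string; message?: string };
  if (e?.code === 404) return "session-expired"; // MCP session expiry
  if (e?.code === -32001 || /request timed out/i.test(e?.message ?? "")) {
    return "request-timeout"; // surface to caller; no reconnect needed
  }
  const netCodes = new Set(["ECONNRESET", "ECONNREFUSED", "EPIPE", "ETIMEDOUT"]);
  if (typeof e?.code === "string" && netCodes.has(e.code)) {
    return "transport-failure"; // transition to `connecting` and reconnect
  }
  return "unknown";
}
```
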
#### Serialized reconnects

All reconnect triggers (failed requests, transport errors, backoff timer) funnel
through a single `reconnect()` path that deduplicates concurrent attempts. If a
reconnect is already in flight, additional callers await the same promise. This
prevents thundering-herd behavior when multiple concurrent requests fail
simultaneously.

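The deduplication can be as simple as caching the in-flight promise. A sketch,
with `doReconnect` standing in for the client/transport rebuild:

```typescript
// Single-flight reconnect: concurrent triggers share one in-flight attempt.
class ReconnectGate {
  private inFlight: Promise<void> | null = null;
  attempts = 0; // exposed for observability/tests

  constructor(private readonly doReconnect: () => Promise<void>) {}

  reconnect(): Promise<void> {
    if (this.inFlight === null) {
      this.attempts += 1;
      this.inFlight = this.doReconnect().finally(() => {
        this.inFlight = null; // next trigger starts a fresh attempt
      });
    }
    return this.inFlight;
  }
}
```
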
#### Reconnect strategy

Auto-reconnect is mandatory, not optional. Use exponential backoff with jitter:

- Initial delay: 1 s.
- Max delay: 60 s.
- Multiplier: 2.
- Jitter: +/- 25%.
- Reset delay to initial on successful connect.

The backoff timer runs in `offline` state. On each tick the manager transitions
to `connecting` and attempts a reconnect. If the attempt fails, the manager
returns to `offline` with an increased delay.

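The schedule reduces to a pure function of the attempt count, which keeps it
easy to test. The injectable `rand` parameter exists only so tests can be
deterministic:

```typescript
// Exponential backoff with jitter: 1 s initial, x2 per attempt, 60 s cap,
// +/-25% jitter. attempt 0 is the first retry after entering `offline`.
function nextBackoffMs(attempt: number, rand: () => number = Math.random): number {
  const base = Math.min(1_000 * 2 ** attempt, 60_000);
  const jitter = (rand() * 2 - 1) * 0.25; // uniform in [-0.25, +0.25]
  return Math.round(base * (1 + jitter));
}
```
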
### 2. Refactor GraphitiClient into a domain adapter

Update `src/services/client.ts` so it:

- Depends on the new connection manager instead of directly owning MCP transport
  state.
- Keeps response parsing and Graphiti-specific helpers such as `searchFacts`,
  `searchNodes`, `getEpisodes`, and `addEpisode`.
- Treats offline errors as soft failures for **read** operations by returning
  empty results and logging at warn/debug level.
- Treats offline errors as hard failures for **write** operations by logging and
  **re-throwing** the error so higher-level code can decide whether to retry. In
  particular, `SessionManager.flushPendingMessages` already re-queues messages
  on failure; silently dropping writes here would break that retry path. The
  connection manager's typed offline error makes it easy for callers to
  distinguish "server unreachable" from permanent failures.

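The read-path rule can be sketched as a thin wrapper. `OfflineError` and the
`callTool` signature are stand-ins for the manager's typed error and request
API, and the tool name is illustrative only:

```typescript
// Stand-in for the connection manager's typed offline error.
class OfflineError extends Error {}

// Read helper: fail open on offline, propagate everything else.
async function searchFactsSketch(
  callTool: (name: string, args: unknown) => Promise<string[]>,
  query: string,
): Promise<string[]> {
  try {
    return await callTool("search_facts", { query }); // illustrative tool name
  } catch (err) {
    if (err instanceof OfflineError) {
      console.warn("graphiti offline; returning empty facts");
      return []; // soft failure: reads degrade to empty results
    }
    throw err; // application errors still surface
  }
}
```

A write helper such as `addEpisode` would do the opposite: log the offline
error and re-throw it so the re-queue path in `SessionManager` still fires.
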
### 3. Update plugin initialization and impacted files

**`src/index.ts`** — primary changes:

- Construct the connection manager first.
- Call `connectionManager.start()` without awaiting a full connect.
- Pass the manager into `GraphitiClient`.
- Optionally expose a cleanup hook that calls `connectionManager.stop()` if the
  plugin API supports lifecycle teardown.

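The essential property is that plugin init returns before the first connect
resolves. A reduced sketch — `launch` and `isConnected` are illustrative names,
not the plugin's real API:

```typescript
// Fire-and-forget launch: hook registration proceeds while the background
// connect is still pending.
function launch(connectFn: () => Promise<void>) {
  let connected = false;
  void connectFn().then(() => {
    connected = true;
  });
  // ...hook registration and GraphitiClient wiring would happen here...
  return { isConnected: () => connected };
}
```
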
**`src/session.ts`** — `SessionManager.flushPendingMessages` already re-queues
messages on `addEpisode` failure. No semantic change needed, but verify that the
new typed offline error propagates correctly through the catch block so the
re-queue path still triggers.

**`src/handlers/event.ts`** — calls `flushPendingMessages` and
`client.addEpisode` in session-idle and session-delete flows. These call sites
should continue to catch and log failures; no behavioral change beyond receiving
typed errors instead of raw transport errors.

**`src/handlers/chat.ts`** — calls `searchFacts` and `searchNodes` during memory
injection. These are read operations that already return empty on failure.
Optionally, the chat handler can call `connectionManager.ready(timeoutMs)`
before the first memory injection to avoid injecting empty context while the
connection is still warming up.

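The optional readiness gate before the first injection could look like the
following sketch; the 2 s bound and the function name are illustrative values:

```typescript
// Bound the first memory injection on readiness: if the manager is not
// connected within the timeout, skip injection rather than stall the session.
async function injectMemoryIfReady(
  ready: (timeoutMs: number) => Promise<boolean>,
  search: () => Promise<string[]>,
  timeoutMs = 2_000,
): Promise<string[]> {
  const up = await ready(timeoutMs);
  if (!up) return []; // fail open: empty context, session proceeds
  return search();
}
```
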
**`src/handlers/compacting.ts`** — calls `searchFacts` and `getEpisodes` via
`getCompactionContext`. Read-path only; same fail-open behavior as today.

**`src/services/client.ts`** — refactored as described in section 2.

### 4. Error model

Add typed internal errors or discriminators for:

- **offline** — request rejected because the manager is in `offline` or
  `closing` state.
- **queue-timeout** — request was queued during `connecting` but its per-request
  deadline elapsed before the connection was established.
- **transport-failure** — a connected call failed due to a network-level error
  (not a Graphiti application error); the manager is now reconnecting.
- **session-expired** — MCP 404; the manager is rebuilding the session.

These typed errors let `GraphitiClient` and `SessionManager` distinguish
transient transport problems from permanent failures without inspecting raw
error text.

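One workable shape is a single error class with a `kind` discriminator, so
callers match on a string instead of raw error text. The class name is a
suggestion, not an existing API:

```typescript
// Typed transport error with a discriminator covering the four cases above.
type GraphitiErrorKind =
  | "offline"
  | "queue-timeout"
  | "transport-failure"
  | "session-expired";

class GraphitiTransportError extends Error {
  constructor(readonly kind: GraphitiErrorKind, message: string) {
    super(message);
    this.name = "GraphitiTransportError";
  }
}

// Callers like SessionManager can branch without parsing error messages.
function isOffline(err: unknown): boolean {
  return err instanceof GraphitiTransportError && err.kind === "offline";
}
```
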
### 5. Queue policy

Use a small bounded in-memory queue only for the `connecting` state.

- FIFO dispatch order.
- Cap queue length (e.g. 32) to avoid unbounded growth if many requests arrive
  during a slow connect.
- Each queued request carries a per-request deadline (default configurable, e.g.
  15 s). When the deadline fires, the request is removed from the queue and
  rejected with a `queue-timeout` error.
- When the queue is full, **drop the oldest entry** (reject it with a
  `queue-timeout` error) and enqueue the new request. Rationale: in a
  hook-driven system the most recent request is likelier to carry the most
  relevant context (e.g. the latest user message). Older queued requests are
  already stale by the time the connection recovers.

This preserves the requested semantics: buffering while connecting, but
rejecting requests when the manager is offline.
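
The drop-oldest policy in isolation can be sketched as follows; deadlines and
dispatch are elided, and a plain `Error` stands in for the typed
`queue-timeout` error:

```typescript
// Bounded pending queue with a drop-oldest eviction policy: when capacity
// is reached, the oldest entry is rejected and the newest is kept.
class PendingQueue<T> {
  private items: { value: T; reject: (reason: Error) => void }[] = [];

  constructor(private readonly capacity: number) {}

  enqueue(value: T, reject: (reason: Error) => void): void {
    if (this.items.length >= this.capacity) {
      const oldest = this.items.shift()!; // drop-oldest policy
      oldest.reject(new Error("queue-timeout: evicted while connecting"));
    }
    this.items.push({ value, reject });
  }

  values(): T[] {
    return this.items.map((i) => i.value);
  }
}
```
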

## Implementation Steps

1. Add `src/services/connection-manager.ts` with state machine, queue with
   per-request deadlines, serialized reconnect, exponential backoff, readiness
   signal, and typed error classes.
2. Refactor `src/services/client.ts` to delegate raw tool calls to the manager.
   Remove transport/session-expiry logic from `GraphitiClient`. Preserve
   write-error propagation for `addEpisode` so
   `SessionManager.flushPendingMessages` retry semantics are maintained.
3. Update `src/index.ts` to construct the connection manager, call `start()`
   without awaiting, and pass it into `GraphitiClient`.
4. Verify `src/session.ts` — confirm the `flushPendingMessages` catch block
   handles the new typed offline error correctly (re-queue path).
5. Verify `src/handlers/event.ts`, `src/handlers/chat.ts`, and
   `src/handlers/compacting.ts` — confirm read-path fail-open behavior is
   unchanged. Optionally add a `ready()` call in `chat.ts` before first memory
   injection.
6. Update tests in `src/services/client.test.ts` and add focused tests for the
   connection manager (see [Testing Plan](#testing-plan)).
7. Run `deno test`, `deno check src/index.ts`, and any relevant linting.

## Testing Plan

Add or update tests for:

- startup does not block on a successful or failed background connect
- `ready()` resolves `true` on successful connect, `false` on timeout
- requests issued during `connecting` are queued and later resolved
- queued requests that exceed their per-request deadline reject with
  `queue-timeout`
- requests issued during `offline` reject immediately with a typed offline error
- mid-session transport disconnect triggers a serialized reconnect and retries
  the failed request once
- expired-session (MCP 404) errors trigger one reconnect and one retry
- concurrent transport failures share a single reconnect attempt (no thundering
  herd)
- auto-reconnect backoff fires in `offline` state and transitions back to
  `connected` on success
- read APIs return empty collections on offline/timeout conditions
- write APIs (`addEpisode`) propagate offline errors so
  `SessionManager.flushPendingMessages` can re-queue
- queue-full policy drops the oldest entry, not the newest
- `stop()` transitions to `closing`, rejects queued requests, and cancels the
  reconnect timer

## Resolved Design Decisions

- **Auto-reconnect is mandatory.** The manager always runs exponential backoff
  in `offline` state. There is no "stay offline until explicit trigger" mode.
- **No `idle` state.** `start()` transitions directly to `connecting`. Before
  `start()` is called the manager does not exist; after `stop()` it is inert.
- **Write errors propagate to callers.** `addEpisode` failures (offline or
  otherwise) throw so that higher-level retry logic such as
  `SessionManager.flushPendingMessages` can re-queue. Read operations continue
  to fail open with empty results.

## Open Questions

- Exact default values for the per-request deadline and queue capacity
  (proposed: 15 s and 32; confirm during implementation).
- Whether the `ready()` timeout should be configurable per call site or set once
  at construction.

## Expected Outcome

The first OpenCode session should no longer stall on Graphiti warmup. Graphiti
availability becomes a background concern managed by one process-wide transport
layer, while memory features continue to operate on a best-effort basis with
fast failure when the backend is unavailable.