
Commit ada7c0b

fix: add resilient Graphiti connection management
Move Graphiti transport lifecycle into a process-wide connection manager so session startup no longer blocks on warmup. Keep config loading and tests resilient under restricted runtimes while preserving home-directory fallback behavior.
1 parent f0316ba

15 files changed: 2176 additions & 911 deletions

plans/ConnectionManager.md

Lines changed: 296 additions & 0 deletions
# Graphiti Connection Manager Refactor Plan

## Goal

Remove session-creation stalls caused by Graphiti connection setup by moving MCP
transport lifecycle management into a dedicated connection manager that starts
on plugin launch, stays alive for the process lifetime, reconnects
automatically, buffers requests while connecting, and cleanly rejects new
requests while offline so higher-level memory features can fail open.

## Current Problem

- `src/index.ts` awaits `client.connect()` during plugin initialization.
- OpenCode appears to instantiate the plugin lazily on first real session use,
  so the first session pays the MCP connection warmup cost.
- Higher-level methods in `src/services/client.ts` mix transport lifecycle,
  retry logic, request execution, and response parsing in one class.
- Timeouts and disconnects are handled per call, but there is no separate
  always-on connection state machine.

## Target Design

Introduce a dedicated `GraphitiConnectionManager` layer under `src/services/`.

Responsibilities:

- Own the MCP `Client` and `StreamableHTTPClientTransport` lifecycle.
- Start connecting as soon as the plugin launches, without blocking hook
  registration.
- Maintain explicit connection state: `connecting`, `connected`, `offline`, and
  `closing`.
- Auto-reconnect after disconnect with exponential backoff (see
  [Reconnect Strategy](#reconnect-strategy)).
- Classify transport-level failures (session expiry, network errors, timeouts)
  internally so callers never inspect raw transport errors.
- Queue requests that arrive while state is `connecting`, subject to per-request
  deadlines.
- Reject requests that arrive while state is `offline` with a typed error,
  allowing higher-level APIs to degrade gracefully instead of stalling.
- Expose a readiness signal (`ready(): Promise<boolean>`) that resolves when the
  first connection succeeds or a caller-supplied timeout elapses, so
  first-message hooks can bound their wait.
- Expose a single request API for tool execution so `GraphitiClient` becomes a
  thin domain adapter.

Non-goals:

- No durable disk-backed queue.
- No guaranteed delivery while Graphiti is offline.
- No change to memory search, injection, or compaction semantics beyond their
  behavior during transport failure.

## Proposed Architecture

### 1. New connection-manager service

Create `src/services/connection-manager.ts` with:

- A connection-state union type:
  `"connecting" | "connected" | "offline" | "closing"`.
- A manager class that stores:
  - endpoint
  - MCP client instance
  - transport instance
  - current state
  - in-flight connect promise (serialized; see below)
  - bounded queue of pending requests created during `connecting`
  - reconnect backoff metadata (attempt count, next delay, timer handle)
  - a readiness `Promise<boolean>` that resolves on first successful connect or
    on a configurable startup timeout
- Methods:
  - `start()` — begin background connection on plugin launch; transitions
    immediately to `connecting`.
  - `stop()` — transition to `closing`, drain or reject queued requests, close
    the MCP client, cancel any pending reconnect timer, then become inert. After
    `stop()` all subsequent `callTool` calls reject immediately.
  - `ready(timeoutMs?)` — returns a promise that resolves `true` when the
    manager reaches `connected`, or `false` if the timeout elapses first.
    Callers such as first-message hooks can use this to bound their wait.
  - `callTool(name, args, deadlineMs?)` — route requests according to current
    state; accepts an optional per-request deadline.
  - `reconnect()` — rebuild client and transport after disconnect/session loss.
    Serialized: concurrent callers share a single in-flight attempt.

#### State behavior

- **`connecting`** — execute `client.connect()`. Incoming `callTool` requests
  are enqueued. Each queued request carries a per-request deadline (default:
  configurable, e.g. 15 s). If the deadline fires before the connection is
  established, the request rejects with a typed timeout error so hook flows do
  not hang indefinitely.
- **`connected`** — execute `callTool` immediately. If a call fails with a
  transport error (network reset, socket hang-up, etc.) or an MCP 404
  session-expiry error, the manager transitions to `connecting` and triggers a
  serialized reconnect. The failed request is retried once after the reconnect
  succeeds.
- **`offline`** — the manager enters this state when a connect or reconnect
  attempt fails after exhausting the current backoff step. Incoming `callTool`
  requests reject immediately with a typed offline error. A background reconnect
  timer continues with exponential backoff; on success the manager transitions
  back to `connected`.
- **`closing`** — entered by `stop()`. All queued requests are rejected. No new
  requests are accepted. The MCP client is closed and the reconnect timer is
  cancelled.

#### Failure classification

The connection manager owns all transport-error classification so that callers
never inspect raw error shapes:

- **Session expiry** — MCP error code 404. Action: rebuild client + transport,
  retry the request once.
- **Transport failure** — network errors, socket resets, connection refused,
  unexpected stream termination. Action: transition to `connecting`, trigger
  serialized reconnect.
- **Request timeout** — MCP error code -32001 or message matching
  `request timed out`. Action: surface to caller as a typed timeout error (no
  reconnect needed).

This keeps transport concerns encapsulated inside the connection manager.

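The three categories above could collapse into a single classifier along these lines. The error shapes checked here are assumptions about what the MCP SDK and Node networking surface, not verified signatures:

```typescript
type FailureKind = "session-expired" | "request-timeout" | "transport-failure";

// Assumed error shape: MCP errors carry a numeric `code`; network errors carry
// a string `code` like "ECONNRESET". Anything unrecognized is treated as a
// transport failure so the manager reconnects rather than hanging.
function classifyFailure(err: unknown): FailureKind {
  const e = err as { code?: number | string; message?: string };
  if (e.code === 404) return "session-expired";
  if (e.code === -32001 || /request timed out/i.test(e.message ?? "")) {
    return "request-timeout";
  }
  return "transport-failure";
}
```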
#### Serialized reconnects

All reconnect triggers (failed requests, transport errors, backoff timer) funnel
through a single `reconnect()` path that deduplicates concurrent attempts. If a
reconnect is already in flight, additional callers await the same promise. This
prevents thundering-herd behavior when multiple concurrent requests fail
simultaneously.

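The deduplication itself is small. A sketch, with a hypothetical helper name — the real manager would fold this into its `reconnect()` method:

```typescript
// Shares one in-flight connect attempt among all concurrent callers.
class ReconnectGate {
  private inFlight: Promise<void> | null = null;

  constructor(private readonly connect: () => Promise<void>) {}

  reconnect(): Promise<void> {
    if (!this.inFlight) {
      // Clear the slot once the attempt settles so a later failure can retry.
      this.inFlight = this.connect().finally(() => {
        this.inFlight = null;
      });
    }
    return this.inFlight;
  }
}
```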
#### Reconnect strategy

Auto-reconnect is mandatory, not optional. Use exponential backoff with jitter:

- Initial delay: 1 s.
- Max delay: 60 s.
- Multiplier: 2.
- Jitter: +/- 25%.
- Reset delay to initial on successful connect.

The backoff timer runs in `offline` state. On each tick the manager transitions
to `connecting` and attempts a reconnect. If the attempt fails, the manager
returns to `offline` with an increased delay.

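With those parameters the delay computation reduces to a few lines. The helper name and injectable `rng` are illustrative choices — the injection exists only so the jitter is testable:

```typescript
// Exponential backoff with +/- 25% jitter: 1 s initial, x2 per attempt,
// capped at 60 s. `rng` returns a value in [0, 1).
function backoffDelayMs(
  attempt: number,
  rng: () => number = Math.random,
): number {
  const base = Math.min(1_000 * 2 ** attempt, 60_000);
  const jitterFactor = 0.75 + 0.5 * rng(); // 0.75 .. 1.25
  return Math.round(base * jitterFactor);
}
```

On a successful connect the attempt counter resets to 0, so the next failure starts again at roughly 1 s.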
### 2. Refactor GraphitiClient into a domain adapter

Update `src/services/client.ts` so it:

- Depends on the new connection manager instead of directly owning MCP transport
  state.
- Keeps response parsing and Graphiti-specific helpers such as `searchFacts`,
  `searchNodes`, `getEpisodes`, and `addEpisode`.
- Treats offline errors as soft failures for **read** operations by returning
  empty results and logging at warn/debug level.
- Surfaces offline errors for **write** operations by logging and
  **re-throwing** the error so higher-level code can decide whether to retry. In
  particular, `SessionManager.flushPendingMessages` already re-queues messages
  on failure; silently dropping writes here would break that retry path. The
  connection manager's typed offline error makes it easy for callers to
  distinguish "server unreachable" from permanent failures.

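The asymmetry between the two policies can be sketched as a pair of small wrappers. Function names are hypothetical — in the real code this logic would live inside the `GraphitiClient` helpers themselves:

```typescript
// Read path: fail open. Offline/timeout errors degrade to an empty result.
async function readFailOpen<T>(call: () => Promise<T[]>): Promise<T[]> {
  try {
    return await call();
  } catch (err) {
    console.warn("graphiti read failed, returning empty result", err);
    return [];
  }
}

// Write path: fail loud. The error propagates so callers such as
// SessionManager.flushPendingMessages can re-queue the payload.
async function writeFailLoud(call: () => Promise<void>): Promise<void> {
  try {
    await call();
  } catch (err) {
    console.warn("graphiti write failed, propagating for re-queue", err);
    throw err;
  }
}
```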
### 3. Update plugin initialization and impacted files

**`src/index.ts`** — primary changes:

- Construct the connection manager first.
- Call `connectionManager.start()` without awaiting a full connect.
- Pass the manager into `GraphitiClient`.
- Optionally expose a cleanup hook that calls `connectionManager.stop()` if the
  plugin API supports lifecycle teardown.

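The startup order above, sketched with stand-in class bodies. The endpoint value is a placeholder, not the project's real configuration:

```typescript
// Stand-ins for the real classes; only the wiring order matters here.
class GraphitiConnectionManager {
  started = false;
  constructor(readonly endpoint: string) {}
  start(): void {
    this.started = true;
    // Real version: begin client.connect() in the background, unawaited.
  }
}

class GraphitiClient {
  constructor(readonly manager: GraphitiConnectionManager) {}
}

// Plugin initialization: start() is not awaited, so hook registration
// proceeds immediately while the connection warms up in the background.
const manager = new GraphitiConnectionManager("http://localhost:8000/mcp");
manager.start();
const client = new GraphitiClient(manager);
```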
**`src/session.ts`** — `SessionManager.flushPendingMessages` already re-queues
messages on `addEpisode` failure. No semantic change needed, but verify that the
new typed offline error propagates correctly through the catch block so the
re-queue path still triggers.

**`src/handlers/event.ts`** — calls `flushPendingMessages` and
`client.addEpisode` in session-idle and session-delete flows. These call sites
should continue to catch and log failures; no behavioral change beyond receiving
typed errors instead of raw transport errors.

**`src/handlers/chat.ts`** — calls `searchFacts` and `searchNodes` during
memory injection. These are read operations that already return empty on
failure. Optionally, the chat handler can call `connectionManager.ready(timeoutMs)`
before the first memory injection to avoid injecting empty context while the
connection is still warming up.

**`src/handlers/compacting.ts`** — calls `searchFacts` and `getEpisodes` via
`getCompactionContext`. Read-path only; same fail-open behavior as today.

**`src/services/client.ts`** — refactored as described in section 2.

### 4. Error model

Add typed internal errors or discriminators for:

- **offline** — request rejected because the manager is in `offline` or
  `closing` state.
- **queue-timeout** — request was queued during `connecting` but its per-request
  deadline elapsed before the connection was established.
- **transport-failure** — a connected call failed due to a network-level error
  (not a Graphiti application error); the manager is now reconnecting.
- **session-expired** — MCP 404; the manager is rebuilding the session.

These typed errors let `GraphitiClient` and `SessionManager` distinguish
transient transport problems from permanent failures without inspecting raw
error text.

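One compact shape for this: a single error class carrying a discriminant rather than four subclasses. That is a design choice sketched here, not something the plan mandates:

```typescript
type ManagerErrorKind =
  | "offline"
  | "queue-timeout"
  | "transport-failure"
  | "session-expired";

// Single class + discriminant keeps caller-side checks to one instanceof.
class ConnectionManagerError extends Error {
  constructor(readonly kind: ManagerErrorKind, message: string) {
    super(message);
    this.name = "ConnectionManagerError";
  }
}

// Example guard a caller like flushPendingMessages might use.
function isOffline(err: unknown): boolean {
  return err instanceof ConnectionManagerError && err.kind === "offline";
}
```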
### 5. Queue policy

Use a small bounded in-memory queue only for the `connecting` state.

- FIFO dispatch order.
- Cap queue length (e.g. 32) to avoid unbounded growth if many requests arrive
  during a slow connect.
- Each queued request carries a per-request deadline (default configurable, e.g.
  15 s). When the deadline fires, the request is removed from the queue and
  rejected with a `queue-timeout` error.
- When the queue is full, **drop the oldest entry** (reject it with a
  `queue-timeout` error) and enqueue the new request. Rationale: in a
  hook-driven system the most recent request is likelier to carry the most
  relevant context (e.g. the latest user message). Older queued requests are
  already stale by the time the connection recovers.

This preserves the requested semantics: buffering while connecting, but
rejecting requests when the manager is offline.

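The drop-oldest-on-overflow policy in isolation. The `onDrop` callback stands in for rejecting the evicted request with a `queue-timeout` error; the class name is hypothetical:

```typescript
// Bounded FIFO used only while state is "connecting". On overflow the oldest
// entry is evicted and handed to onDrop, which the manager would use to
// reject that request with a queue-timeout error.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(
    private readonly capacity: number,
    private readonly onDrop: (item: T) => void,
  ) {}

  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.onDrop(this.items.shift()!);
    }
    this.items.push(item);
  }

  // Called once the connection is established: dispatch in FIFO order.
  drain(): T[] {
    const drained = this.items;
    this.items = [];
    return drained;
  }
}
```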
## Implementation Steps

1. Add `src/services/connection-manager.ts` with state machine, queue with
   per-request deadlines, serialized reconnect, exponential backoff, readiness
   signal, and typed error classes.
2. Refactor `src/services/client.ts` to delegate raw tool calls to the manager.
   Remove transport/session-expiry logic from `GraphitiClient`. Preserve
   write-error propagation for `addEpisode` so
   `SessionManager.flushPendingMessages` retry semantics are maintained.
3. Update `src/index.ts` to construct the connection manager, call `start()`
   without awaiting, and pass it into `GraphitiClient`.
4. Verify `src/session.ts` — confirm the `flushPendingMessages` catch block
   handles the new typed offline error correctly (re-queue path).
5. Verify `src/handlers/event.ts`, `src/handlers/chat.ts`, and
   `src/handlers/compacting.ts` — confirm read-path fail-open behavior is
   unchanged. Optionally add a `ready()` call in `chat.ts` before first memory
   injection.
6. Update tests in `src/services/client.test.ts` and add focused tests for the
   connection manager (see [Testing Plan](#testing-plan)).
7. Run `deno test`, `deno check src/index.ts`, and any relevant linting.

## Testing Plan

Add or update tests for:

- startup does not block on a successful or failed background connect
- `ready()` resolves `true` on successful connect, `false` on timeout
- requests issued during `connecting` are queued and later resolved
- queued requests that exceed their per-request deadline reject with
  `queue-timeout`
- requests issued during `offline` reject immediately with typed offline error
- mid-session transport disconnect triggers serialized reconnect and retries the
  failed request once
- expired-session (MCP 404) errors trigger one reconnect and one retry
- concurrent transport failures share a single reconnect attempt (no thundering
  herd)
- auto-reconnect backoff fires in `offline` state and transitions back to
  `connected` on success
- read APIs return empty collections on offline/timeout conditions
- write APIs (`addEpisode`) propagate offline errors so
  `SessionManager.flushPendingMessages` can re-queue
- queue-full policy drops oldest entry, not newest
- `stop()` transitions to `closing`, rejects queued requests, cancels reconnect
  timer

## Resolved Design Decisions

- **Auto-reconnect is mandatory.** The manager always runs exponential backoff
  in `offline` state. There is no "stay offline until explicit trigger" mode.
- **No `idle` state.** `start()` transitions directly to `connecting`. Before
  `start()` is called the manager does not exist; after `stop()` it is inert.
- **Write errors propagate to callers.** `addEpisode` failures (offline or
  otherwise) throw so that higher-level retry logic such as
  `SessionManager.flushPendingMessages` can re-queue. Read operations continue
  to fail open with empty results.

## Open Questions

- Exact default values for the per-request deadline and queue capacity
  (proposed: 15 s and 32; confirm during implementation).
- Whether the `ready()` timeout should be configurable per call site or set once
  at construction.

## Expected Outcome

The first OpenCode session should no longer stall on Graphiti warmup. Graphiti
availability becomes a background concern managed by one process-wide transport
layer, while memory features continue to operate on a best-effort basis with
fast failure when the backend is unavailable.
