Commit 7c5573a (aksOps and claude committed; 1 parent ff517b7; 2 files changed: 421 additions & 0 deletions)

docs: update memory optimization spec with 3-command architecture (index/enrich/serve)

H2 for writes (indexing), Neo4j for reads (serving). Each store optimized for its purpose. Three CLI commands with clean separation. Also added service topology spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Memory Optimization Design

**Date:** 2026-03-29
**Status:** Approved
**Target:** 50K files on K8s 800m CPU / 4GB RAM (currently OOMs)

## Problem

The current pipeline holds all nodes/edges in `ArrayList`s before flushing. On 10K files (spring-boot): 3.9GB peak. On 50K files: OOM.

Also: the SQLite analysis cache uses JNI, which pins virtual threads to platform threads, reducing parallelism.
## Three-Command Architecture

### Why: Neo4j is read-optimized, not write-optimized

Neo4j Embedded excels at graph-traversal reads (Cypher, shortest path, impact trace). It is not designed for high-throughput sequential writes during indexing:

- Write amplification (indexes, WAL, relationship maintenance per insert)
- ~500MB runtime overhead even when only writing
- Wastes memory on CI runners that only need to write, not query

**Solution: H2 for writes (indexing), Neo4j for reads (serving).**
### Three CLI Commands

```
code-iq index /repo   → H2 file (fast sequential writes, pure Java, 2.5MB overhead)
code-iq enrich        → H2 read → Neo4j bulk load → linkers → classify → topology
code-iq serve         → Neo4j read-only (Cypher, graph algorithms, MCP)
```

| Command | Runs on | Memory | Store | What |
|---|---|---|---|---|
| `index` | CI (800m/4GB) | ~1.5GB | H2 file only | Scan + detect + batch write to H2 |
| `enrich` | CI or dev machine | ~2-3GB | H2 read → Neo4j write | Bulk load, linkers, classify, topology |
| `serve` | K8s pods (HPA) | ~1-2GB | Neo4j read-only | REST + MCP + UI, instant startup |
### Phase 1: H2 as Primary Store During Indexing

Replace in-memory `ArrayList`s with H2 file-based storage during `index`:
- Remove the `sqlite-jdbc` dependency
- Add the `h2` dependency (already in the Spring Boot BOM)
- H2 schema: `files`, `nodes`, `edges`, `analysis_runs` tables
- Pure Java — no JNI, no virtual-thread pinning
- MVCC concurrency — no `synchronized` blocks needed
- Batched writes: flush every 500 files to H2 (not in-memory lists)
- File path: `.code-intelligence/index.mv.db`
**Batch flush to H2:**

```
Discover files → batch 500 → analyze → INSERT nodes/edges to H2 → release memory → next batch
```

Peak memory: ~200MB per batch + 50MB H2 overhead = **under 1.5GB total**.
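The batch loop above can be sketched as follows. This is a minimal illustration, not the codebase's implementation: `IndexStore` and `insertBatch` are hypothetical names standing in for the H2-backed store, and files are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    /** Hypothetical abstraction over the H2 store; one call = one batch transaction. */
    interface IndexStore {
        void insertBatch(List<String> analyzedFiles); // INSERT nodes/edges, then commit
    }

    static final int BATCH_SIZE = 500; // matches the batch-size setting in this spec

    /** Analyze files in batches of 500 so only one batch is resident at a time. */
    static int index(List<String> files, IndexStore store) {
        int batches = 0;
        for (int start = 0; start < files.size(); start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, files.size());
            // Buffer holds at most BATCH_SIZE analysis results before flushing.
            List<String> batch = new ArrayList<>(files.subList(start, end));
            store.insertBatch(batch); // flush to H2, then drop the buffer
            batches++;
        }
        return batches;
    }
}
```

The key property is that the buffer never outlives its batch, so peak heap scales with `BATCH_SIZE`, not with repository size.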
### Phase 2: Enrich Command (H2 → Neo4j)

A separate command that:
1. Opens the H2 file (read-only)
2. Starts Neo4j Embedded
3. Bulk-loads all nodes from H2 → Neo4j (batch INSERT, no per-row overhead)
4. Bulk-loads all edges from H2 → Neo4j
5. Runs linkers (TopicLinker, EntityLinker, ModuleContainmentLinker) — these need graph traversal, where Neo4j excels
6. Runs LayerClassifier
7. Runs ServiceDetector (topology)
8. Creates Neo4j indexes for fast queries
9. Shuts down, producing the `graph.db/` directory

**Why separate:** Linkers need graph traversal (find all topics, match producers to consumers). H2 is a relational DB — it can't do graph traversal efficiently. Neo4j can.
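Steps 3-4 can be sketched as a batched transfer with a running row count (useful for the count assertion mentioned under Data Integrity). `H2Index` and the batch size are illustrative assumptions, not names from the codebase; the Neo4j write side is abstracted to a consumer so the shape of the loop is visible without embedding a database.

```java
import java.util.List;
import java.util.function.Consumer;

public class GraphEnricher {
    /** Hypothetical read-only view of the H2 index file. */
    interface H2Index {
        List<String> allNodes();
        List<String> allEdges();
    }

    static final int LOAD_BATCH = 1000; // rows per Neo4j write transaction (assumed size)

    /** Bulk-load nodes then edges in batches; returns rows written for the count check. */
    static long bulkLoad(H2Index h2, Consumer<List<String>> neo4jWrite) {
        long written = 0;
        for (List<String> rows : List.of(h2.allNodes(), h2.allEdges())) {
            for (int i = 0; i < rows.size(); i += LOAD_BATCH) {
                List<String> batch = rows.subList(i, Math.min(i + LOAD_BATCH, rows.size()));
                neo4jWrite.accept(batch); // one transaction per batch, no per-row commit
                written += batch.size();
            }
        }
        return written;
    }
}
```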
### Phase 3: Serve (Neo4j Read-Only)

`serve` loads the pre-enriched `graph.db/`:
- No indexing, no enrichment on startup
- Instant startup — mount `graph.db` and go
- HPA scales freely — all pods mount the same `graph.db` from PVC/S3
- Hazelcast caches query results across pods
### CI Pipeline

```bash
# On CI runner (800m/4GB)
code-iq index /repo --no-cache   # Produces: .code-intelligence/index.mv.db
code-iq enrich                   # Produces: .osscodeiq/graph.db/

# Bundle for artifact
code-iq bundle --tag v1.0        # ZIP: graph.db + source + flow.html

# On triage server (K8s, HPA)
code-iq serve --graph /path/to/graph.db   # Instant startup, read-only
```
### Memory Profile (50K files)

| Phase | Component | Memory |
|---|---|---|
| **index** | JVM + Spring (no web) | 400MB |
| | H2 file store | 50MB |
| | Batch buffer (500 files) | 200MB |
| | ANTLR/JavaParser (ThreadLocal) | 100MB |
| | Virtual thread stacks | 100MB |
| | **Total** | **~850MB** |
| **enrich** | JVM + Spring | 400MB |
| | Neo4j Embedded | 500MB |
| | H2 reader | 50MB |
| | Linker working set | 500MB |
| | **Total** | **~1.5GB** |
| **serve** | JVM + Spring (web) | 500MB |
| | Neo4j Embedded (read-only) | 500MB |
| | Hazelcast cache | 200MB |
| | **Total** | **~1.2GB** |

**All three phases fit comfortably in 4GB.**
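As a sanity check, the per-phase totals can be re-derived from the component figures in the table (numbers copied from the table; the 4096MB limit is the pod budget from the target spec):

```java
import java.util.Map;

public class MemoryBudget {
    static final int LIMIT_MB = 4096; // K8s pod budget (4GB)

    /** Sum a phase's component costs in MB. */
    static int total(Map<String, Integer> components) {
        return components.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        int index = total(Map.of("jvm", 400, "h2", 50, "batch", 200, "parsers", 100, "vthreads", 100));
        int enrich = total(Map.of("jvm", 400, "neo4j", 500, "h2Reader", 50, "linkers", 500));
        int serve = total(Map.of("jvmWeb", 500, "neo4j", 500, "hazelcast", 200));
        System.out.println(index + " " + enrich + " " + serve); // 850 1450 1200
    }
}
```

The sums (850MB, ~1.5GB, ~1.2GB) match the table's totals, and each sits well under the 4GB limit.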
### Data Integrity

- `index`: every node/edge is written to H2 in batch transactions; rollback on failure.
- `enrich`: bulk load with a count assertion — the H2 node count must equal the Neo4j node count after load.
- `serve`: read-only — no data mutation possible.
- No silent data loss — an exception stops the pipeline on any write failure.
### Configuration

```yaml
codeiq:
  analysis:
    batch-size: 500   # files per H2 flush batch
  index:
    store-path: .code-intelligence/index.mv.db
  graph:
    path: .osscodeiq/graph.db
```

CLI: `--batch-size 500` on the index command.
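In a Spring Boot app these settings would typically bind to a properties holder. A minimal sketch, assuming the YAML above; the type name and accessors are illustrative, not from the codebase:

```java
/** Hypothetical holder for the codeiq.* settings (would carry
 *  @ConfigurationProperties("codeiq") in a real Spring Boot app). */
public record CodeIqProperties(
        int batchSize,          // codeiq.analysis.batch-size
        String indexStorePath,  // codeiq.index.store-path
        String graphPath) {     // codeiq.graph.path

    /** Defaults matching the YAML sketch above. */
    public static CodeIqProperties defaults() {
        return new CodeIqProperties(
                500,
                ".code-intelligence/index.mv.db",
                ".osscodeiq/graph.db");
    }
}
```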
## What Doesn't Change

- Detectors, parsers, ANTLR, JavaParser — same
- File discovery — same
- Virtual threads — same (better with H2)
- Determinism — same
- CLI output — same
- Incremental cache logic — same (just H2 instead of SQLite)
