# Memory Optimization Design

**Date:** 2026-03-29
**Status:** Approved
**Target:** 50K files on K8s 800m CPU / 4GB RAM (currently OOMs)

## Problem

The current pipeline holds all nodes/edges in in-memory ArrayLists before flushing. On 10K files (spring-boot): 3.9GB peak. On 50K files: OOM.

Also, the SQLite analysis cache goes through JNI, which pins virtual threads to platform threads and reduces parallelism.

## Three-Command Architecture

### Why: Neo4j is read-optimized, not write-optimized

Neo4j Embedded excels at graph traversal reads (Cypher, shortest path, impact trace). It's not designed for high-throughput sequential writes during indexing:
- Write amplification (indexes, WAL, relationship maintenance per insert)
- 500MB runtime overhead even when only writing
- Wastes memory on CI runners that only need to write, not query

**Solution: H2 for writes (indexing), Neo4j for reads (serving).**

### Three CLI Commands

```
code-iq index /repo → H2 file (fast sequential writes, pure Java, 2.5MB overhead)
code-iq enrich → H2 read → Neo4j bulk load → linkers → classify → topology
code-iq serve → Neo4j read-only (Cypher, graph algorithms, MCP)
```

| Command | Runs on | Memory | Store | What |
|---|---|---|---|---|
| `index` | CI (800m/4GB) | ~1.5GB | H2 file only | Scan + detect + batch write to H2 |
| `enrich` | CI or dev machine | ~2-3GB | H2 read → Neo4j write | Bulk load, linkers, classify, topology |
| `serve` | K8s pods (HPA) | ~1-2GB | Neo4j read-only | REST + MCP + UI, instant startup |
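
A minimal sketch of the three-command layout, assuming a picocli-style CLI; the subcommand classes, option names, and defaults mirror the design above but are illustrative, and the project's actual CLI wiring may differ:

```java
// Illustrative only: three-command layout, assuming a picocli-style CLI.
import picocli.CommandLine;
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;
import picocli.CommandLine.Parameters;

@Command(name = "code-iq", subcommands = { CodeIq.Index.class, CodeIq.Enrich.class, CodeIq.Serve.class })
public class CodeIq {

    @Command(name = "index", description = "Scan + detect + batch write to H2")
    static class Index implements Runnable {
        @Parameters(index = "0") String repoPath;
        @Option(names = "--batch-size", defaultValue = "500") int batchSize;
        @Option(names = "--no-cache") boolean noCache;
        public void run() { /* scan repo, flush every batchSize files to H2 */ }
    }

    @Command(name = "enrich", description = "H2 read -> Neo4j bulk load -> linkers")
    static class Enrich implements Runnable {
        public void run() { /* bulk load, linkers, classify, topology */ }
    }

    @Command(name = "serve", description = "Neo4j read-only: REST + MCP + UI")
    static class Serve implements Runnable {
        @Option(names = "--graph", defaultValue = ".osscodeiq/graph.db") String graphPath;
        public void run() { /* start web/MCP server over the read-only graph */ }
    }

    public static void main(String[] args) {
        System.exit(new CommandLine(new CodeIq()).execute(args));
    }
}
```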

### Phase 1: H2 as Primary Store During Indexing

Replace in-memory ArrayLists with H2 file-based storage during `index`:
- Remove `sqlite-jdbc` dependency
- Add `h2` dependency (already in Spring Boot BOM)
- H2 schema: files, nodes, edges, analysis_runs tables (sketched below)
- Pure Java — no JNI, no virtual thread pinning
- MVCC concurrency — no `synchronized` blocks needed
- Batched writes: flush every 500 files to H2 (not in-memory lists)
- File path: `.code-intelligence/index.mv.db`
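
A minimal sketch of the store bootstrap, assuming plain JDBC against the H2 file; the table and column names are illustrative, not the final schema:

```java
// Illustrative only: opens the H2 file store and creates the four tables.
// Table/column names are assumptions for this sketch, not the final schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public final class IndexStoreBootstrap {

    // H2 appends .mv.db to the file path
    private static final String URL = "jdbc:h2:file:./.code-intelligence/index";

    public static Connection open() throws Exception {
        Connection conn = DriverManager.getConnection(URL);
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS analysis_runs("
                    + "id IDENTITY PRIMARY KEY, started_at TIMESTAMP, git_sha VARCHAR)");
            st.execute("CREATE TABLE IF NOT EXISTS files("
                    + "id IDENTITY PRIMARY KEY, path VARCHAR UNIQUE, content_hash VARCHAR)");
            st.execute("CREATE TABLE IF NOT EXISTS nodes("
                    + "id VARCHAR PRIMARY KEY, kind VARCHAR, file_id BIGINT, payload CLOB)");
            st.execute("CREATE TABLE IF NOT EXISTS edges("
                    + "from_id VARCHAR, to_id VARCHAR, kind VARCHAR)");
        }
        return conn;
    }
}
```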

**Batch flush to H2:**
```
Discover files → batch 500 → analyze → INSERT nodes/edges to H2 → release memory → next batch
```

Peak memory: ~200MB batch buffer + ~50MB H2 overhead on top of the JVM and parser baseline, which keeps `index` **under 1.5GB total**.
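
A minimal sketch of the flush step, assuming JDBC batch inserts against the illustrative `nodes`/`edges` tables above; `NodeRow` and `EdgeRow` are hypothetical holders for one analyzed batch:

```java
// Illustrative only: flushes one analyzed 500-file batch to H2, then lets it be GC'd.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

record NodeRow(String id, String kind, long fileId, String payload) {}
record EdgeRow(String fromId, String toId, String kind) {}

final class BatchFlusher {

    static void flush(Connection conn, List<NodeRow> nodes, List<EdgeRow> edges) throws Exception {
        conn.setAutoCommit(false); // one transaction per batch
        try (PreparedStatement insNode = conn.prepareStatement(
                     "INSERT INTO nodes(id, kind, file_id, payload) VALUES (?, ?, ?, ?)");
             PreparedStatement insEdge = conn.prepareStatement(
                     "INSERT INTO edges(from_id, to_id, kind) VALUES (?, ?, ?)")) {
            for (NodeRow n : nodes) {
                insNode.setString(1, n.id());
                insNode.setString(2, n.kind());
                insNode.setLong(3, n.fileId());
                insNode.setString(4, n.payload());
                insNode.addBatch();
            }
            for (EdgeRow e : edges) {
                insEdge.setString(1, e.fromId());
                insEdge.setString(2, e.toId());
                insEdge.setString(3, e.kind());
                insEdge.addBatch();
            }
            insNode.executeBatch();
            insEdge.executeBatch();
            conn.commit();      // rollback on failure keeps the store consistent
        } catch (Exception ex) {
            conn.rollback();
            throw ex;           // fail loudly: no silent data loss
        }
    }
}
```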

### Phase 2: Enrich Command (H2 → Neo4j)

Separate command that:
1. Opens H2 file (read-only)
2. Starts Neo4j Embedded
3. Bulk-loads all nodes from H2 → Neo4j (batch INSERT, no per-row overhead; sketched below)
4. Bulk-loads all edges from H2 → Neo4j
5. Runs linkers (TopicLinker, EntityLinker, ModuleContainmentLinker) — these need graph traversal, Neo4j excels here
6. Runs LayerClassifier
7. Runs ServiceDetector (topology)
8. Creates Neo4j indexes for fast queries
9. Shuts down, produces `graph.db/` directory

**Why separate:** Linkers need graph traversal (find all topics, match producers to consumers). H2 is a relational DB — it can't do graph traversal efficiently. Neo4j can.
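
A minimal sketch of the node bulk-load step, assuming the Neo4j Embedded Java API (`DatabaseManagementServiceBuilder`) and the illustrative H2 `nodes` table from Phase 1; the chunk size and label mapping are assumptions:

```java
// Illustrative only: streams nodes from H2 and writes them to Neo4j Embedded
// in chunked transactions so the whole graph is never held in memory.
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.neo4j.dbms.api.DatabaseManagementService;
import org.neo4j.dbms.api.DatabaseManagementServiceBuilder;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

final class GraphEnricher {

    static void bulkLoadNodes(Connection h2, Path graphDir) throws Exception {
        DatabaseManagementService dbms = new DatabaseManagementServiceBuilder(graphDir).build();
        GraphDatabaseService graph = dbms.database("neo4j");

        long written = 0;
        Transaction tx = graph.beginTx();
        try (Statement st = h2.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, kind, payload FROM nodes")) {
            while (rs.next()) {
                Node node = tx.createNode(Label.label(rs.getString("kind")));
                node.setProperty("id", rs.getString("id"));
                node.setProperty("payload", rs.getString("payload"));
                if (++written % 10_000 == 0) { // commit in chunks, not per row
                    tx.commit();
                    tx = graph.beginTx();
                }
            }
            tx.commit();
        } finally {
            dbms.shutdown();
        }
    }
}
```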

### Phase 3: Serve (Neo4j Read-Only)

`serve` loads pre-enriched `graph.db/`:
- No indexing, no enrichment on startup
- Instant startup — mount graph.db and go
- HPA scales freely — all pods mount same graph.db from PVC/S3
- Hazelcast caches query results across pods

### CI Pipeline

```bash
# On CI runner (800m/4GB)
code-iq index /repo --no-cache # Produces: .code-intelligence/index.mv.db
code-iq enrich # Produces: .osscodeiq/graph.db/

# Bundle for artifact
code-iq bundle --tag v1.0 # ZIP: graph.db + source + flow.html

# On triage server (K8s, HPA)
code-iq serve --graph /path/to/graph.db # Instant startup, read-only
```

### Memory Profile (50K files)

| Phase | Component | Memory |
|---|---|---|
| **index** | JVM + Spring (no web) | 400MB |
| | H2 file store | 50MB |
| | Batch buffer (500 files) | 200MB |
| | ANTLR/JavaParser (ThreadLocal) | 100MB |
| | Virtual thread stacks | 100MB |
| | **Total** | **~850MB** |
| **enrich** | JVM + Spring | 400MB |
| | Neo4j Embedded | 500MB |
| | H2 reader | 50MB |
| | Linker working set | 500MB |
| | **Total** | **~1.5GB** |
| **serve** | JVM + Spring (web) | 500MB |
| | Neo4j Embedded (read-only) | 500MB |
| | Hazelcast cache | 200MB |
| | **Total** | **~1.2GB** |

**All three phases fit comfortably in 4GB.**

### Data Integrity

- `index`: Every node/edge written to H2 in batch transactions. Rollback on failure.
- `enrich`: Bulk load with count assertion — H2 node count == Neo4j node count after load (sketched below).
- `serve`: Read-only — no data mutation possible.
- No silent data loss — exception stops pipeline on any write failure.
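
A minimal sketch of the post-load check, assuming the illustrative H2 `nodes` table and the Neo4j Embedded API used above; the exception type and query shape are assumptions:

```java
// Illustrative only: fails the enrich run if H2 and Neo4j node counts diverge.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

final class LoadVerifier {

    static void assertNodeCountsMatch(Connection h2, GraphDatabaseService graph) throws Exception {
        long h2Count;
        try (Statement st = h2.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM nodes")) {
            rs.next();
            h2Count = rs.getLong(1);
        }
        long neo4jCount;
        try (Transaction tx = graph.beginTx()) {
            neo4jCount = ((Number) tx.execute("MATCH (n) RETURN count(n) AS c")
                    .next().get("c")).longValue();
        }
        if (h2Count != neo4jCount) {
            throw new IllegalStateException(
                    "Node count mismatch: H2=" + h2Count + " Neo4j=" + neo4jCount);
        }
    }
}
```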

### Configuration

```yaml
codeiq:
  analysis:
    batch-size: 500 # files per H2 flush batch
  index:
    store-path: .code-intelligence/index.mv.db
  graph:
    path: .osscodeiq/graph.db
```

CLI: `--batch-size 500` on index command.
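
A minimal sketch of how these keys could bind in Spring Boot, assuming record-based `@ConfigurationProperties`; the class and property names mirror the YAML above but are illustrative:

```java
// Illustrative only: binds the codeiq.* keys shown above to typed config records.
import org.springframework.boot.context.properties.ConfigurationProperties;

@ConfigurationProperties(prefix = "codeiq")
public record CodeIqProperties(Analysis analysis, Index index, Graph graph) {

    public record Analysis(int batchSize) {}   // codeiq.analysis.batch-size
    public record Index(String storePath) {}   // codeiq.index.store-path
    public record Graph(String path) {}        // codeiq.graph.path
}
```

With constructor binding, Spring Boot's relaxed binding maps `batch-size` to `batchSize`; the record still has to be registered, for example via `@ConfigurationPropertiesScan`.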

## What Doesn't Change

- Detectors, parsers, ANTLR, JavaParser — same
- File discovery — same
- Virtual threads — same (better with H2)
- Determinism — same
- CLI output — same
- Incremental cache logic — same (just H2 instead of SQLite)