| 1 | +# Code Intelligence — Project Instructions |
| 2 | + |
| 3 | +## What This Project Is |
| 4 | + |
| 5 | +A CLI tool that scans codebases to build a deterministic code knowledge graph. No AI, no external APIs — pure pattern matching. 72 detectors, 35 languages, 3 storage backends (NetworkX, SQLite, KuzuDB). |
| 6 | + |
| 7 | +## Architecture |
| 8 | + |
| 9 | +``` |
| 10 | +FileDiscovery → Parsers → Detectors → GraphBuilder (buffered) → Linkers → LayerClassifier → GraphStore (backend) |
| 11 | +``` |
| 12 | + |
| 13 | +- **Detectors** follow the `Detector` Protocol in `detectors/base.py` — implement `name`, `supported_languages`, `detect(ctx) -> DetectorResult` |
| 14 | +- **Backends** follow the `GraphBackend` Protocol in `graph/backend.py` — implement 16 methods. `CypherBackend` is optional for Cypher-capable backends. |
| 15 | +- **GraphStore** is a facade delegating to a backend — never access backends directly |
| 16 | +- **GraphBuilder** buffers all nodes and edges, then flushes nodes before edges so every backend sees an edge's endpoints before the edge itself (ensures cross-backend parity) |
| 17 | +- **Linkers** run after all detectors, produce cross-file relationship edges |
| 18 | +- **LayerClassifier** runs after linkers, sets `layer` property on every node |
| 19 | + |
| 20 | +## Critical Rules |
| 21 | + |
| 22 | +### Determinism is Non-Negotiable |
| 23 | +- Same input MUST produce same output, every time, on every backend |
| 24 | +- No set iteration without `sorted()` first |
| 25 | +- No dependency on thread completion order (builder uses indexed result slots) |
| 26 | +- All detectors must be stateless and behave as pure functions of their input: no class-level mutable state |
| 27 | +- Benchmark after every change: run 2+ times, assert identical node/edge counts |
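
The determinism rules above can be illustrated with a minimal sketch. It is not the project's code: `graph_fingerprint` and `run_in_slots` are hypothetical helpers, and edges are assumed to be plain `(source, kind, target)` tuples. The first function only ever iterates sets through `sorted()`; the second writes each thread's result into the slot matching its submission index, so completion order cannot leak into the output.

```python
from __future__ import annotations

import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Sequence


def graph_fingerprint(nodes: set[str], edges: set[tuple[str, str, str]]) -> str:
    """Hash nodes and edges in sorted order so two runs can be compared exactly."""
    digest = hashlib.sha256()
    for node_id in sorted(nodes):  # never iterate the raw set
        digest.update(f"{node_id}\n".encode("utf-8"))
    for source, kind, target in sorted(edges):
        digest.update(f"{source}|{kind}|{target}\n".encode("utf-8"))
    return digest.hexdigest()


def run_in_slots(tasks: Sequence[Callable[[], Any]]) -> list[Any]:
    """Run tasks concurrently; results land in submission order, not completion order."""
    results: list[Any] = [None] * len(tasks)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(task): index for index, task in enumerate(tasks)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Comparing fingerprints from two runs is a stricter check than comparing node and edge counts, since it also catches runs that produce the same counts with different contents.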
| 28 | + |
| 29 | +### Cross-Backend Data Parity |
| 30 | +- All 3 backends (NetworkX, SQLite, KuzuDB) must produce identical node and edge counts |
| 31 | +- Edges are only added if both source and target nodes exist |
| 32 | +- Test parity after any change to builder, store, or backends |
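
As a rough illustration of the "nodes first, then edges" flush and the rule that an edge needs both endpoints, here is a small sketch. `BufferedGraph` is a hypothetical stand-in and does not reflect the real `GraphBuilder` API; edges are again assumed to be `(source, kind, target)` tuples.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BufferedGraph:
    """Hypothetical buffer: collects everything, validates only at flush time."""

    nodes: dict[str, dict] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, **properties: object) -> None:
        self.nodes[node_id] = dict(properties)

    def add_edge(self, source: str, kind: str, target: str) -> None:
        # Edges may arrive before their endpoints; nothing is checked yet.
        self.edges.append((source, kind, target))

    def flush(self) -> tuple[list[str], list[tuple[str, str, str]]]:
        """Return nodes first, then only the edges whose endpoints both exist."""
        node_ids = sorted(self.nodes)
        known = set(node_ids)
        kept = sorted(e for e in self.edges if e[0] in known and e[2] in known)
        return node_ids, kept
```

Because dangling edges are filtered before anything reaches a backend, no backend has to decide for itself whether to reject or silently create missing endpoints, which is one plausible source of count mismatches.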
| 33 | + |
| 34 | +### Adding a New Detector |
| 35 | +1. Create file in `detectors/<category>/my_detector.py` |
| 36 | +2. Implement `Detector` protocol (name, supported_languages, detect method) |
| 37 | +3. Add it to the hardcoded list in `detectors/registry.py` (manual for now; auto-discovery is planned, see Known Tech Debt) |
| 38 | +4. Create test in `tests/detectors/<category>/test_my_detector.py` |
| 39 | +5. Include a determinism test (run twice, assert identical output) |
| 40 | +6. Run `pytest tests/ -x -q` — all tests must pass |
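
The steps above might look roughly like the following sketch. Everything here is illustrative: `Ctx` and `Result` are simplified dataclass stand-ins for whatever `detectors/base.py` actually defines (the project uses Pydantic models), `FlaskRouteDetector` is a made-up detector, and the `route` prefix and `endpoint` type in the ID are assumptions chosen to match the documented ID format.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ctx:
    """Stand-in for the real detector context (assumed fields)."""
    file_path: str
    language: str
    content: str


@dataclass
class Result:
    """Stand-in for DetectorResult (assumed shape)."""
    nodes: list[dict] = field(default_factory=list)
    edges: list[dict] = field(default_factory=list)


# Compiled once at import time; a compiled pattern is immutable, not shared state.
_ROUTE = re.compile(r'@app\.route\(\s*["\']([^"\']+)["\']')


class FlaskRouteDetector:
    name = "flask_route"
    supported_languages = ("python",)

    def detect(self, ctx: Ctx) -> Result:
        result = Result()
        # Collect matches into a set, then sort before emitting anything.
        for path in sorted({m.group(1) for m in _ROUTE.finditer(ctx.content)}):
            result.nodes.append({
                "id": f"route:{ctx.file_path}:endpoint:{path}",
                "kind": "endpoint",
                "properties": {"framework": "flask"},
            })
        return result


def test_determinism() -> None:
    ctx = Ctx("app.py", "python", '@app.route("/b")\n@app.route("/a")\n')
    assert FlaskRouteDetector().detect(ctx) == FlaskRouteDetector().detect(ctx)
```

A real test module would pair `test_determinism` with a positive-match and a negative-match test, as the Testing section requires.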
| 41 | + |
| 42 | +### Adding a New Backend |
| 43 | +1. Create file in `graph/backends/my_backend.py` |
| 44 | +2. Implement `GraphBackend` protocol (16 methods) |
| 45 | +3. Optionally implement `CypherBackend` for Cypher support |
| 46 | +4. Add to factory in `graph/backends/__init__.py` |
| 47 | +5. Add to `GraphConfig` backend choices in `config.py` |
| 48 | +6. Test parity: same nodes/edges as NetworkX on the same input |
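
A parity test for step 6 could be shaped like the sketch below. The four-method `CountingBackend` protocol is a deliberately tiny, hypothetical subset, not the real 16-method `GraphBackend`; the two toy backends exist only to show the comparison, with the standard-library `sqlite3` module standing in for a persistent store.

```python
from __future__ import annotations

import sqlite3
from typing import Protocol


class CountingBackend(Protocol):
    """Hypothetical subset of a backend interface, just enough to compare counts."""
    def add_node(self, node_id: str) -> None: ...
    def add_edge(self, source: str, target: str) -> None: ...
    def node_count(self) -> int: ...
    def edge_count(self) -> int: ...


class DictBackend:
    def __init__(self) -> None:
        self._nodes: set[str] = set()
        self._edges: list[tuple[str, str]] = []

    def add_node(self, node_id: str) -> None:
        self._nodes.add(node_id)

    def add_edge(self, source: str, target: str) -> None:
        self._edges.append((source, target))

    def node_count(self) -> int:
        return len(self._nodes)

    def edge_count(self) -> int:
        return len(self._edges)


class SqliteBackend:
    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY)")
        self._db.execute("CREATE TABLE edges (source TEXT, target TEXT)")

    def add_node(self, node_id: str) -> None:
        self._db.execute("INSERT OR IGNORE INTO nodes VALUES (?)", (node_id,))

    def add_edge(self, source: str, target: str) -> None:
        self._db.execute("INSERT INTO edges VALUES (?, ?)", (source, target))

    def node_count(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]

    def edge_count(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM edges").fetchone()[0]


def counts_after_load(backend: CountingBackend, nodes: set[str],
                      edges: set[tuple[str, str]]) -> tuple[int, int]:
    """Load the same sorted input into a backend and report (nodes, edges)."""
    for node_id in sorted(nodes):
        backend.add_node(node_id)
    for source, target in sorted(edges):
        backend.add_edge(source, target)
    return backend.node_count(), backend.edge_count()
```

In a real parity test the reference side would be the NetworkX backend and the input would come from an actual scan; the assertion itself stays a plain equality on the two count tuples.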
| 49 | + |
| 50 | +## Code Conventions |
| 51 | + |
| 52 | +- Python 3.11+, `from __future__ import annotations` |
| 53 | +- Pydantic for data models, typer for CLI, rich for output |
| 54 | +- Regex-based detection (no tree-sitter dependency for new detectors unless needed) |
| 55 | +- `NodeKind` and `EdgeKind` enums in `models/graph.py` — add new values there |
| 56 | +- ID format: `"{prefix}:{filepath}:{type}:{identifier}"` for cross-file uniqueness |
| 57 | +- Properties dict for detector-specific metadata (`auth_type`, `framework`, `roles`, etc.) |
| 58 | +- `layer` property on every node: `frontend | backend | infra | shared | unknown` |
| 59 | + |
| 60 | +## Testing |
| 61 | + |
| 62 | +- `pytest tests/ -x -q` — must always pass (currently 361 tests) |
| 63 | +- Every detector needs: positive match test, negative match test, determinism test |
| 64 | +- Benchmark on spring-boot (10K files) for performance regression checks |
| 65 | +- Cross-backend parity test on contoso-real-estate for data quality |
| 66 | + |
| 67 | +## Key Files |
| 68 | + |
| 69 | +| File | Purpose | |
| 70 | +|------|---------| |
| 71 | +| `detectors/base.py` | Detector protocol (42 lines) | |
| 72 | +| `graph/backend.py` | GraphBackend + CypherBackend protocols | |
| 73 | +| `graph/store.py` | GraphStore facade | |
| 74 | +| `graph/builder.py` | GraphBuilder with buffered flush + linkers | |
| 75 | +| `graph/backends/networkx.py` | Default in-memory backend | |
| 76 | +| `graph/backends/kuzu.py` | KuzuDB embedded graph DB with Cypher | |
| 77 | +| `graph/backends/sqlite_backend.py` | SQLite file-based backend | |
| 78 | +| `classifiers/layer_classifier.py` | Deterministic layer classification | |
| 79 | +| `models/graph.py` | NodeKind, EdgeKind, GraphNode, GraphEdge | |
| 80 | +| `config.py` | Config with GraphConfig for backend selection | |
| 81 | +| `analyzer.py` | Pipeline orchestrator | |
| 82 | +| `cli.py` | CLI commands (analyze, graph, query, find, cypher, bundle, cache, plugins) | |
| 83 | + |
| 84 | +## Known Tech Debt (Phase 2) |
| 85 | + |
| 86 | +- Registry has 75-entry hardcoded detector list — needs auto-discovery |
| 87 | +- `imports_detector.py` is 723 lines — needs splitting per language |
| 88 | +- 60+ detectors have no tests — need coverage |
| 89 | +- `_parse_structured()` has 11-branch elif chain — needs dispatch table |
| 90 | +- Linker protocol uses `_new_module_nodes` private attribute hack — needs `LinkResult` |
| 91 | +- Missing extensions: `.html`, `.css`, `.mjs`, `.cjs`, `.jsonc`, `.groovy`, `.pyi` |
| 92 | +- No extensionless file support (Dockerfile, Makefile, go.mod) |
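
Two of the items above, the elif chain and the missing extensionless-file support, lend themselves to table lookups. The sketch below shows the general shape only: `parse_structured`, `language_for`, the two parsers, and the language labels are hypothetical and do not mirror the real `_parse_structured()` branches.

```python
from __future__ import annotations

import configparser
import json
from pathlib import Path
from typing import Any, Callable


def _parse_json(text: str) -> Any:
    return json.loads(text)


def _parse_ini(text: str) -> dict[str, dict[str, str]]:
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


# Dispatch table: adding a format means adding one entry, not another elif branch.
_PARSERS_BY_SUFFIX: dict[str, Callable[[str], Any]] = {
    ".json": _parse_json,
    ".ini": _parse_ini,
}

# Extensionless (or misleadingly suffixed) files are matched by exact name first.
_LANGUAGE_BY_FILENAME = {"Dockerfile": "dockerfile", "Makefile": "makefile", "go.mod": "gomod"}
_LANGUAGE_BY_SUFFIX = {".py": "python", ".pyi": "python", ".mjs": "javascript"}


def parse_structured(path: str, text: str) -> Any | None:
    parser = _PARSERS_BY_SUFFIX.get(Path(path).suffix.lower())
    return parser(text) if parser else None


def language_for(path: str) -> str | None:
    p = Path(path)
    return _LANGUAGE_BY_FILENAME.get(p.name) or _LANGUAGE_BY_SUFFIX.get(p.suffix.lower())
```

Checking the file name before the suffix matters for cases like `go.mod`, whose suffix `.mod` would otherwise be looked up as an ordinary extension.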
| 93 | + |
| 94 | +## Updating This File |
| 95 | + |
| 96 | +After significant changes (new detectors, new backends, architectural decisions, conventions learned), update this CLAUDE.md to reflect the current state. Keep it concise and actionable. |