
Commit 3b0342d

docs: update semble docs with latest benchmarks and API (#23)
1 parent 608e330 commit 3b0342d

4 files changed

Lines changed: 120 additions & 38 deletions

src/content/docs/packages/semble/benchmarks.mdx

Lines changed: 64 additions & 29 deletions
@@ -16,9 +16,11 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo
1616
| CodeRankEmbed | 0.765 | 57 s | 16 ms |
1717
| ColGREP | 0.693 | 5.8 s | 124 ms |
1818
| BM25 | 0.673 | 263 ms | 0.02 ms |
19-
| ripgrep | 0.126 || 12 ms |
19+
| grepai | 0.561 | 35 s | 48 ms |
20+
| probe | 0.387 | - | 207 ms |
21+
| ripgrep | 0.126 | - | 12 ms |
2022

21-
Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster** entirely on CPU.
23+
Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster**, entirely on CPU.
2224

2325
The charts below plot latency against NDCG@10. Marker size reflects model parameter count.
2426

@@ -28,32 +30,63 @@ The charts below plot latency against NDCG@10. Marker size reflects model parame
2830
![Speed vs quality (warm)](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/speed_vs_ndcg_warm.png)
2931
*Query latency on a warm index vs NDCG@10*
3032

33+
## Token Efficiency
34+
35+
Coding agents (Claude Code, OpenCode, etc.) typically find code by running `grep` on keywords and reading the matched files. We model that workflow and compare it against semble's chunk retrieval across our full benchmark of 1,251 queries.
36+
37+
![Token efficiency: recall vs. retrieved tokens](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/token_efficiency.png)
38+
39+
### Expected tokens per query
40+
41+
For each query, we count the tokens consumed up to the first relevant hit, or 32k if the method never finds a relevant result, then average across all 1,251 queries.
42+
43+
| Method | Expected tokens | Savings |
44+
|--------|----------------:|--------:|
45+
| ripgrep + read file | 45,692 | baseline |
46+
| **semble** | **566** | **98% fewer** |
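
A minimal sketch of the averaging behind this table, assuming each query is summarized by the token count at its first relevant hit, or `None` when nothing relevant was found. The function names and the input shape are hypothetical, and reading "32k" as 32,000 tokens is an assumption; none of this is taken from the semble benchmark code.

```python
from typing import Optional, Sequence

# Assumption: "32k" is read as 32,000 tokens, charged when a query has no relevant hit.
MISS_PENALTY_TOKENS = 32_000


def expected_tokens(first_hit_tokens: Sequence[Optional[int]]) -> float:
    """Average tokens-to-first-relevant-hit, charging a fixed penalty for misses."""
    if not first_hit_tokens:
        raise ValueError("need at least one query")
    costs = [MISS_PENALTY_TOKENS if t is None else t for t in first_hit_tokens]
    return sum(costs) / len(costs)


def savings_vs_baseline(method_tokens: float, baseline_tokens: float) -> float:
    """Fraction of tokens saved relative to the baseline method."""
    return 1.0 - method_tokens / baseline_tokens
```

Plugging in the two averages from the table, `savings_vs_baseline(566, 45_692)` comes out just under 0.99, in line with the reported "98% fewer".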
47+
48+
### Recall at fixed token budgets
49+
50+
A relevant file is "covered" once any retrieved unit comes from it.
51+
52+
| Method | 500 | 1k | 2k | 4k | 8k | 16k | 32k |
53+
|--------|----:|---:|---:|---:|---:|----:|----:|
54+
| **semble** | **0.685** | **0.849** | **0.938** | **0.976** | **0.991** | **0.996** | **0.996** |
55+
| ripgrep + read file | 0.001 | 0.008 | 0.037 | 0.088 | 0.212 | 0.379 | 0.583 |
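
The coverage rule above can be sketched for a single query as follows. This is an illustration under stated assumptions, not the benchmark's implementation: retrieved units are modeled as `(file_path, token_count)` pairs in ranked order, and a unit counts only if it fits entirely within the remaining budget. Both choices are hypothetical.

```python
from typing import Sequence, Set, Tuple

Unit = Tuple[str, int]  # (file path, token count of the retrieved unit)


def recall_at_budget(units: Sequence[Unit], relevant: Set[str], budget: int) -> float:
    """Fraction of relevant files covered by the units that fit in the token budget."""
    if not relevant:
        raise ValueError("query has no relevant files")
    spent = 0
    covered = set()
    for path, tokens in units:
        if spent + tokens > budget:
            break  # assumption: a unit that does not fully fit is not read
        spent += tokens
        if path in relevant:
            covered.add(path)
    return len(covered) / len(relevant)
```

Averaging this value over all queries at each budget would give one row of the table. The sketch also shows why reading whole files is expensive: one large irrelevant file early in the ranking can exhaust the budget before a relevant file is reached.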
56+
57+
<details>
58+
<summary>Methodology</summary>
59+
60+
Semble returns the top-50 ranked chunks. `ripgrep+read` splits the query into keywords (dropping stopwords and short words), runs `rg --fixed-strings --ignore-case` for each keyword, then reads matched files in full ranked by how many distinct keywords they contain. Both methods search the same set of file types and ignored directories. Tokens are counted with `cl100k_base` via `tiktoken`. A relevant file is "covered" once any retrieved unit overlaps its annotated span.
61+
62+
</details>
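
The keyword step of the `ripgrep+read` baseline can be sketched roughly as below. The stopword list, the minimum keyword length, and the function names are assumptions made for illustration. To keep the example self-contained, a case-insensitive substring check over in-memory file contents stands in for `rg --fixed-strings --ignore-case`.

```python
import re
from typing import Dict, List

# Assumptions: a tiny stopword list and a 3-character minimum; the benchmark's values may differ.
STOPWORDS = frozenset({"a", "an", "and", "are", "for", "how", "in", "is", "of", "the", "to", "with"})
MIN_KEYWORD_LEN = 3


def extract_keywords(query: str) -> List[str]:
    """Lowercase the query and keep distinct words that are neither stopwords nor short."""
    keywords: List[str] = []
    for word in re.findall(r"[A-Za-z0-9_]+", query.lower()):
        if len(word) >= MIN_KEYWORD_LEN and word not in STOPWORDS and word not in keywords:
            keywords.append(word)
    return keywords


def rank_files(query: str, files: Dict[str, str]) -> List[str]:
    """Rank files by how many distinct keywords they contain (ties broken by path)."""
    keywords = extract_keywords(query)
    hits: Dict[str, int] = {}
    for path, text in files.items():
        lowered = text.lower()
        count = sum(1 for kw in keywords if kw in lowered)
        if count:
            hits[path] = count
    return sorted(hits, key=lambda p: (-hits[p], p))
```

Files with no keyword match are dropped entirely, which is how a query can end up with no relevant hit at all under this baseline.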
63+
3164
## By Language
3265

3366
NDCG@10 per language. Best score per row is bolded.
3467

35-
| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
36-
|----------|-------:|-----------:|----:|--------:|--------:|
37-
| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
38-
| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
39-
| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
40-
| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
41-
| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
42-
| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
43-
| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
44-
| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
45-
| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
46-
| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
47-
| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
48-
| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
49-
| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
50-
| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
51-
| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
52-
| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
53-
| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
54-
| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
55-
| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
56-
| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |
68+
| Language | semble | CRE Hybrid | CRE | ColGREP | grepai | probe | ripgrep |
69+
|----------|-------:|-----------:|----:|--------:|-------:|------:|--------:|
70+
| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.330 | 0.392 | 0.180 |
71+
| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.731 | 0.375 | 0.126 |
72+
| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.643 | 0.382 | 0.230 |
73+
| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.669 | 0.412 | 0.134 |
74+
| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.675 | 0.588 | 0.176 |
75+
| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.755 | 0.369 | 0.000 |
76+
| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.277 | 0.392 | 0.117 |
77+
| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.722 | 0.410 | 0.133 |
78+
| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.634 | 0.488 | 0.202 |
79+
| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.402 | 0.340 | 0.123 |
80+
| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.429 | 0.280 | 0.160 |
81+
| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.723 | 0.226 | 0.000 |
82+
| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.699 | 0.336 | 0.000 |
83+
| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.386 | 0.536 | 0.198 |
84+
| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.478 | 0.335 | 0.166 |
85+
| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.519 | 0.242 | 0.162 |
86+
| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.555 | 0.384 | 0.000 |
87+
| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.483 | 0.313 | 0.000 |
88+
| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.394 | 0.354 | 0.128 |
89+
| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.561** | **0.387** | **0.126** |
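
For readers unfamiliar with the metric, here is a sketch of NDCG@10 in its standard linear-gain form. Whether the benchmark uses exactly this gain and discount is an assumption, and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain, log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked: Sequence[float],
    all_relevant: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """DCG of the returned ranking divided by the DCG of an ideal ranking."""
    # If the full set of relevance labels is not given, the ideal is built from the ranking itself.
    pool = ranked if all_relevant is None else all_relevant
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0
```

Here `ranked` holds the relevance label of each returned result in rank order. A relevant result at position 1 scores 1.0, and the same result at position 2 scores `1 / log2(3)`, so the metric rewards placing the right chunk early.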
5790

5891
## Ablations
5992

@@ -63,7 +96,7 @@ NDCG@10 per language. Best score per row is bolded.
6396
|-----------|----:|----------:|
6497
| BM25 | 0.675 | 0.834 |
6598
| potion-code-16M | 0.650 | 0.821 |
66-
| BM25 + potion-code-16M | | **0.854** |
99+
| BM25 + potion-code-16M | - | **0.854** |
67100

68101
By query category:
69102

@@ -89,7 +122,9 @@ Languages covered: bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotl
89122

90123
## Methods
91124

92-
- **[ripgrep](https://github.com/BurntSushi/ripgrep)** — fast regex search, included as a raw keyword-match baseline.
93-
- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)** — late-interaction code retrieval with the LateOn-Code-edge model.
94-
- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** — 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
95-
- **semble**[potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
125+
- **[ripgrep](https://github.com/BurntSushi/ripgrep)**: fast regex search, included as a raw keyword-match baseline.
126+
- **[probe](https://github.com/buger/probe)**: BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly.
127+
- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)**: late-interaction code retrieval with the LateOn-Code-edge model.
128+
- **[grepai](https://github.com/nicholasgasior/grepai)**: semantic search using [nomic-embed-text](https://huggingface.co/nomic-ai/nomic-embed-text-v1) (137M params) via a local Ollama daemon.
129+
- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)**: 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
130+
- **semble**: [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.

src/content/docs/packages/semble/installation.mdx

Lines changed: 2 additions & 2 deletions
@@ -7,8 +7,8 @@ sidebar:
77

88
## Requirements
99

10-
- Python 3.9 or higher
11-
- No GPU, API keys, or external services required — runs fully on CPU
10+
- Python 3.10 or higher
11+
- No GPU, API keys, or external services required. Runs fully on CPU.
1212

1313
## Install
1414

src/content/docs/packages/semble/introduction.mdx

Lines changed: 6 additions & 6 deletions
@@ -5,7 +5,7 @@ sidebar:
55
icon: open-book
66
---
77

8-
[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, cutting both token usage and waiting time on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.
8+
[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, using ~98% fewer tokens than grep+read and cutting latency on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.
99

1010
Run it as an [MCP server](/packages/semble/mcp-server/) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.
1111

@@ -60,10 +60,10 @@ Semble splits each file into code-aware chunks using [Chonkie](https://github.co
6060

6161
The two score lists are fused with Reciprocal Rank Fusion (RRF) and then reranked with a set of code-aware signals:
6262

63-
- **Adaptive weighting** symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
64-
- **Definition boosts** a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
65-
- **Identifier stems** query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
66-
- **File coherence** when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
67-
- **Noise penalties** test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.
63+
- **Adaptive weighting**: symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
64+
- **Definition boosts**: a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
65+
- **Identifier stems**: query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
66+
- **File coherence**: when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
67+
- **Noise penalties**: test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.
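
The fusion step that precedes these reranking signals can be sketched as follows. This is the textbook form of Reciprocal Rank Fusion. The constant `k = 60` is the value commonly used in the literature and is an assumption here, as is the function name; semble's actual parameters are not stated in this document.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Fuse several ranked lists: each item scores the sum of 1 / (k + rank) over the lists."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    return sorted(scores, key=lambda item: -scores[item])
```

Because only ranks enter the formula, the lexical and embedding scores never need to be normalized to a common scale before fusing.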
6868

6969
Because the embedding model is static with no transformer forward pass at query time, all of this runs in milliseconds on CPU.
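
As an illustration of the identifier-stem signal listed above, the sketch below splits identifiers on camelCase and snake_case boundaries and compares crude stems. The suffix-stripping stemmer and all function names are assumptions made for this example; the stemmer semble actually uses is not specified here.

```python
import re
from typing import List, Set


def split_identifier(identifier: str) -> List[str]:
    """Split camelCase, PascalCase, and snake_case identifiers into lowercase parts."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def crude_stem(word: str) -> str:
    """Very rough stemmer (assumption): strip one common suffix, keeping at least 3 characters."""
    for suffix in ("ing", "ers", "er", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_overlap(query: str, identifier: str) -> float:
    """Fraction of query stems that also occur among the identifier's stems."""
    query_stems: Set[str] = {crude_stem(w) for w in query.lower().split()}
    if not query_stems:
        return 0.0
    ident_stems = {crude_stem(p) for p in split_identifier(identifier)}
    return len(query_stems & ident_stems) / len(query_stems)
```

With this sketch, the query `parse config` fully overlaps `parseConfig`, `ConfigParser`, and `config_parser`, since `parse` and `parser` reduce to the same stem, while an unrelated identifier such as `getUserById` has no overlap.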

src/content/docs/packages/semble/usage.mdx

Lines changed: 48 additions & 1 deletion
@@ -21,6 +21,28 @@ index = SembleIndex.from_git("https://github.com/MinishLab/model2vec")
2121

2222
Indexing a full repo typically takes under 300 ms. Remote repos are cloned on first use and cached for the lifetime of the process.
2323

24+
### Advanced options
25+
26+
Both `from_path` and `from_git` accept optional parameters to control what gets indexed:
27+
28+
```python
29+
index = SembleIndex.from_path(
30+
"./my-project",
31+
extensions=frozenset({".py", ".ts"}), # only index these file types
32+
ignore=frozenset({"dist", "node_modules"}), # skip these directories
33+
include_text_files=True, # also index .md, .yaml, .json, etc.
34+
)
35+
```
36+
37+
`from_git` additionally accepts a `ref` parameter to check out a specific branch or tag:
38+
39+
```python
40+
index = SembleIndex.from_git(
41+
"https://github.com/MinishLab/model2vec",
42+
ref="v2.0.0", # branch or tag; defaults to the remote HEAD
43+
)
44+
```
45+
2446
## Searching
2547

2648
Search the index with a natural-language description or a code snippet:
@@ -34,6 +56,18 @@ for result in results:
3456
print()
3557
```
3658

59+
### Filtering
60+
61+
Restrict results to specific languages or files using `filter_languages` and `filter_paths`:
62+
63+
```python
64+
# Only return results from Python files
65+
results = index.search("parse config", filter_languages=["python"])
66+
67+
# Only return results from specific files
68+
results = index.search("parse config", filter_paths=["src/config.py", "src/settings.py"])
69+
```
70+
3771
## Finding Related Code
3872

3973
Given any search result, find other chunks that are semantically similar to it:
@@ -70,9 +104,22 @@ Each result object exposes:
70104
```python
71105
result = results[0]
72106

73-
result.score # float relevance score
107+
result.score # float, relevance score
74108
result.chunk.file_path # "src/config.py"
75109
result.chunk.start_line # 42
76110
result.chunk.end_line # 67
77111
result.chunk.content # raw source code of the chunk
78112
```
113+
114+
## Index Stats
115+
116+
Inspect the state of an index with the `stats` property:
117+
118+
```python
119+
stats = index.stats
120+
121+
stats.indexed_files # number of files indexed
122+
stats.total_chunks # total number of chunks
123+
stats.languages # dict mapping language name to chunk count
124+
# e.g. {"python": 412, "typescript": 88}
125+
```
