docs: update semble docs with latest benchmarks and API (#23)

Pringled · web-flow · commit 3b0342dadf2e · 2026-04-30T16:05:38.000+02:00
diff --git a/src/content/docs/packages/semble/benchmarks.mdx b/src/content/docs/packages/semble/benchmarks.mdx
@@ -16,9 +16,11 @@ We benchmark quality and speed across all methods on ~1,250 queries over 63 repo
 | CodeRankEmbed | 0.765 | 57 s | 16 ms |
 | ColGREP | 0.693 | 5.8 s | 124 ms |
 | BM25 | 0.673 | 263 ms | 0.02 ms |
-| ripgrep | 0.126 | — | 12 ms |
+| grepai | 0.561 | 35 s | 48 ms |
+| probe | 0.387 | - | 207 ms |
+| ripgrep | 0.126 | - | 12 ms |
 
-Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster** — entirely on CPU.
+Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster**, entirely on CPU.
 
 The charts below plot latency against NDCG@10. Marker size reflects model parameter count.
 
@@ -28,32 +30,63 @@ The charts below plot latency against NDCG@10. Marker size reflects model parame
 ![Speed vs quality (warm)](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/speed_vs_ndcg_warm.png)
 *Query latency on a warm index vs NDCG@10*
 
+## Token Efficiency
+
+Coding agents (Claude Code, OpenCode, etc.) typically find code by running `grep` on keywords and reading the matched files. We model that workflow and compare it against semble's chunk retrieval across our full benchmark of 1,251 queries.
+
+![Token efficiency: recall vs. retrieved tokens](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/token_efficiency.png)
+
+### Expected tokens per query
+
+For each query, we count the tokens consumed up to the first relevant hit, or 32k if the method never finds a relevant result. The values are then averaged across all 1,251 queries.
+
+| Method | Expected tokens | Savings |
+|--------|----------------:|--------:|
+| ripgrep + read file | 45,692 | baseline |
+| **semble** | **566** | **98% fewer** |
+
+### Recall at fixed token budgets
+
+A relevant file is "covered" once any retrieved unit comes from it.
+
+| Method | 500 | 1k | 2k | 4k | 8k | 16k | 32k |
+|--------|----:|---:|---:|---:|---:|----:|----:|
+| **semble** | **0.685** | **0.849** | **0.938** | **0.976** | **0.991** | **0.996** | **0.996** |
+| ripgrep + read file | 0.001 | 0.008 | 0.037 | 0.088 | 0.212 | 0.379 | 0.583 |
+
+<details>
+<summary>Methodology</summary>
+
+Semble returns the top-50 ranked chunks. `ripgrep+read` splits the query into keywords (dropping stopwords and short words), runs `rg --fixed-strings --ignore-case` for each keyword, then reads the matched files in full, ordered by how many distinct keywords they contain. Both methods search the same set of file types and skip the same ignored directories. Tokens are counted with `cl100k_base` via `tiktoken`. A relevant file is "covered" once any retrieved unit overlaps its annotated span.
+
+</details>
+
 ## By Language
 
 NDCG@10 per language. Best score per row is bolded.
 
-| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
-|----------|-------:|-----------:|----:|--------:|--------:|
-| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
-| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
-| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
-| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
-| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
-| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
-| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
-| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
-| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
-| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
-| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
-| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
-| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
-| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
-| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
-| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
-| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
-| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
-| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
-| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |
+| Language | semble | CRE Hybrid | CRE | ColGREP | grepai | probe | ripgrep |
+|----------|-------:|-----------:|----:|--------:|-------:|------:|--------:|
+| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.330 | 0.392 | 0.180 |
+| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.731 | 0.375 | 0.126 |
+| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.643 | 0.382 | 0.230 |
+| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.669 | 0.412 | 0.134 |
+| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.675 | 0.588 | 0.176 |
+| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.755 | 0.369 | 0.000 |
+| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.277 | 0.392 | 0.117 |
+| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.722 | 0.410 | 0.133 |
+| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.634 | 0.488 | 0.202 |
+| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.402 | 0.340 | 0.123 |
+| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.429 | 0.280 | 0.160 |
+| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.723 | 0.226 | 0.000 |
+| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.699 | 0.336 | 0.000 |
+| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.386 | 0.536 | 0.198 |
+| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.478 | 0.335 | 0.166 |
+| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.519 | 0.242 | 0.162 |
+| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.555 | 0.384 | 0.000 |
+| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.483 | 0.313 | 0.000 |
+| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.394 | 0.354 | 0.128 |
+| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.561** | **0.387** | **0.126** |
 
 ## Ablations
 
@@ -63,7 +96,7 @@ NDCG@10 per language. Best score per row is bolded.
 |-----------|----:|----------:|
 | BM25 | 0.675 | 0.834 |
 | potion-code-16M | 0.650 | 0.821 |
-| BM25 + potion-code-16M | — | **0.854** |
+| BM25 + potion-code-16M | - | **0.854** |
 
 By query category:
 
@@ -89,7 +122,9 @@ Languages covered: bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotl
 
 ## Methods
 
-- **[ripgrep](https://github.com/BurntSushi/ripgrep)** — fast regex search, included as a raw keyword-match baseline.
-- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)** — late-interaction code retrieval with the LateOn-Code-edge model.
-- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** — 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
-- **semble** — [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
+- **[ripgrep](https://github.com/BurntSushi/ripgrep)**: fast regex search, included as a raw keyword-match baseline.
+- **[probe](https://github.com/buger/probe)**: BM25 keyword ranking backed by tree-sitter parse trees. No persistent index; scans on the fly.
+- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)**: late-interaction code retrieval with the LateOn-Code-edge model.
+- **[grepai](https://github.com/nicholasgasior/grepai)**: semantic search using [nomic-embed-text](https://huggingface.co/nomic-ai/nomic-embed-text-v1) (137M params) via a local Ollama daemon.
+- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)**: 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
+- **semble**: [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
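
Note on the "Token Efficiency" section added to benchmarks.mdx: the two metrics it reports can be sketched roughly as below. This is a minimal illustration, not the benchmark harness itself. The function names, the `(file_path, token_count)` input shape, and the choices to charge the full hit unit and to stop at the first unit that crosses the budget are all assumptions; only the 32k miss penalty and the file-level notion of coverage come from the docs.

```python
from statistics import mean

# Tokens charged when a method never surfaces a relevant file (the "32k" in the docs).
MISS_PENALTY = 32_000


def tokens_to_first_hit(units, relevant_files):
    """Tokens consumed until the first retrieved unit from a relevant file.

    `units` is a ranked list of (file_path, token_count) pairs (hypothetical shape).
    Assumption: the tokens of the hit unit itself are included.
    """
    consumed = 0
    for file_path, token_count in units:
        consumed += token_count
        if file_path in relevant_files:
            return consumed
    return MISS_PENALTY


def expected_tokens(runs):
    """Average tokens-to-first-hit over (units, relevant_files) pairs, one per query."""
    return mean(tokens_to_first_hit(units, relevant) for units, relevant in runs)


def recall_at_budget(units, relevant_files, budget):
    """Fraction of relevant files covered before the token budget is exceeded.

    Assumption: a unit that would push consumption past the budget is not read.
    """
    if not relevant_files:
        return 0.0
    covered = set()
    consumed = 0
    for file_path, token_count in units:
        consumed += token_count
        if consumed > budget:
            break
        if file_path in relevant_files:
            covered.add(file_path)
    return len(covered) / len(relevant_files)
```

Under these assumptions, a method that returns small chunks reaches a relevant file after few tokens, while a method that reads whole files pays for every file read before the first relevant one.
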
diff --git a/src/content/docs/packages/semble/installation.mdx b/src/content/docs/packages/semble/installation.mdx
@@ -7,8 +7,8 @@ sidebar:
 
 ## Requirements
 
-- Python 3.9 or higher
-- No GPU, API keys, or external services required — runs fully on CPU
+- Python 3.10 or higher
+- No GPU, API keys, or external services required. Runs fully on CPU.
 
 ## Install
 
diff --git a/src/content/docs/packages/semble/introduction.mdx b/src/content/docs/packages/semble/introduction.mdx
@@ -5,7 +5,7 @@ sidebar:
   icon: open-book
 ---
 
-[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, cutting both token usage and waiting time on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.
+[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, using ~98% fewer tokens than grep+read and cutting latency on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.
 
 Run it as an [MCP server](/packages/semble/mcp-server/) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.
 
@@ -60,10 +60,10 @@ Semble splits each file into code-aware chunks using [Chonkie](https://github.co
 
 The two score lists are fused with Reciprocal Rank Fusion (RRF) and then reranked with a set of code-aware signals:
 
-- **Adaptive weighting** — symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
-- **Definition boosts** — a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
-- **Identifier stems** — query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
-- **File coherence** — when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
-- **Noise penalties** — test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.
+- **Adaptive weighting**: symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
+- **Definition boosts**: a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
+- **Identifier stems**: query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
+- **File coherence**: when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
+- **Noise penalties**: test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.
 
 Because the embedding model is static with no transformer forward pass at query time, all of this runs in milliseconds on CPU.
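
Note on the fusion step described in introduction.mdx: Reciprocal Rank Fusion combines ranked lists using only rank positions, so the BM25 and embedding scores never need to share a scale. The sketch below is a generic RRF implementation, not semble's code. The function name and the constant `k=60` (the value commonly used in the RRF literature) are assumptions, and the code-aware reranking signals listed above are not modeled.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. `k=60` is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical chunk ids ranked by a lexical and a dense retriever.
lexical_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
```

A chunk that ranks reasonably well in both lists tends to beat one that ranks first in a single list and poorly in the other.
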
diff --git a/src/content/docs/packages/semble/usage.mdx b/src/content/docs/packages/semble/usage.mdx
@@ -21,6 +21,28 @@ index = SembleIndex.from_git("https://github.com/MinishLab/model2vec")
 
 Indexing a full repo typically takes under 300 ms. Remote repos are cloned on first use and cached for the lifetime of the process.
 
+### Advanced options
+
+Both `from_path` and `from_git` accept optional parameters to control what gets indexed:
+
+```python
+index = SembleIndex.from_path(
+    "./my-project",
+    extensions=frozenset({".py", ".ts"}),        # only index these file types
+    ignore=frozenset({"dist", "node_modules"}),  # skip these directories
+    include_text_files=True,                     # also index .md, .yaml, .json, etc.
+)
+```
+
+`from_git` additionally accepts a `ref` parameter to check out a specific branch or tag:
+
+```python
+index = SembleIndex.from_git(
+    "https://github.com/MinishLab/model2vec",
+    ref="v2.0.0",  # branch or tag; defaults to the remote HEAD
+)
+```
+
 ## Searching
 
 Search the index with a natural-language description or a code snippet:
@@ -34,6 +56,18 @@ for result in results:
     print()
 ```
 
+### Filtering
+
+Restrict results to specific languages or files using `filter_languages` and `filter_paths`:
+
+```python
+# Only return results from Python files
+results = index.search("parse config", filter_languages=["python"])
+
+# Only return results from specific files
+results = index.search("parse config", filter_paths=["src/config.py", "src/settings.py"])
+```
+
 ## Finding Related Code
 
 Given any search result, find other chunks that are semantically similar to it:
@@ -70,9 +104,22 @@ Each result object exposes:
 ```python
 result = results[0]
 
-result.score              # float — relevance score
+result.score              # float, relevance score
 result.chunk.file_path    # "src/config.py"
 result.chunk.start_line   # 42
 result.chunk.end_line     # 67
 result.chunk.content      # raw source code of the chunk
 ```
+
+## Index Stats
+
+Inspect the state of an index with the `stats` property:
+
+```python
+stats = index.stats
+
+stats.indexed_files   # number of files indexed
+stats.total_chunks    # total number of chunks
+stats.languages       # dict mapping language name to chunk count
+                      # e.g. {"python": 412, "typescript": 88}
+```