Commit 608e330
docs: Add semble docs (#22)
1 parent df42c6f commit 608e330

13 files changed

Lines changed: 439 additions & 38 deletions

File tree

astro.config.mjs

Lines changed: 10 additions & 0 deletions
```diff
@@ -94,6 +94,16 @@ gtag('config', 'G-LQWDNXKF2X');`,
         { label: 'Supported Backends', link: '/packages/vicinity/supported-backends/' },
       ],
     },
+    {
+      label: 'Semble',
+      items: [
+        { label: 'Introduction', link: '/packages/semble/introduction/' },
+        { label: 'Installation', link: '/packages/semble/installation/' },
+        { label: 'Usage', link: '/packages/semble/usage/' },
+        { label: 'MCP Server', link: '/packages/semble/mcp-server/' },
+        { label: 'Benchmarks', link: '/packages/semble/benchmarks/' },
+      ],
+    },
     {
       label: 'Tokenlearn',
       items: [
```

package-lock.json

Lines changed: 37 additions & 36 deletions
Some generated files are not rendered by default.

src/components/home/HomepagePackageList.astro

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,7 +14,7 @@ const packages = await getHomepagePackageMetrics();
       <div class="pkg-row-main">
         <a href={pkg.docsHref} class="pkg-icon-link" aria-label={`${pkg.name} docs`}>
           <span class="pkg-icon-wrap">
-            <img class="pkg-icon" src={pkg.logoSrc} alt={pkg.logoAlt} loading="lazy" />
+            <img class={`pkg-icon pkg-icon--${pkg.name.toLowerCase().replace(/[^a-z0-9]+/g, '-')}`} src={pkg.logoSrc} alt={pkg.logoAlt} loading="lazy" />
           </span>
         </a>
         <a href={pkg.docsHref} class="pkg-row-info">
```

src/content/docs/packages/overview/index.mdx

Lines changed: 22 additions & 0 deletions
```diff
@@ -71,6 +71,28 @@ tableOfContents: false
   </div>
 </article>
 
+<article class="overview-item">
+  <div class="overview-item-top">
+    <img class="overview-icon" src="/images/logos/semble_logo.webp" alt="Semble" loading="lazy" />
+    <div class="overview-copy">
+      <h2><a href="/packages/semble/introduction/">Semble</a></h2>
+      <p>Fast and accurate code search for agents.</p>
+    </div>
+  </div>
+  <div class="overview-item-bottom">
+    <div class="overview-tags">
+      <span class="overview-tag">Code Search</span>
+      <span class="overview-tag">MCP Server</span>
+      <span class="overview-tag">Agents</span>
+      <span class="overview-tag">Python</span>
+    </div>
+    <div class="overview-actions">
+      <a class="overview-link overview-link-primary" href="/packages/semble/introduction/">Docs</a>
+      <a class="overview-link" href="https://github.com/minishlab/semble">Repo</a>
+    </div>
+  </div>
+</article>
+
 <article class="overview-item">
   <div class="overview-item-top">
     <img class="overview-icon" src="/images/logos/tokenlearn_logo.webp" alt="Tokenlearn" loading="lazy" />
```
Lines changed: 95 additions & 0 deletions
---
title: Benchmarks
description: Quality and speed benchmarks for Semble
sidebar:
  icon: chart-bar
---

We benchmark quality and speed across all methods on ~1,250 queries over 63 repositories in 19 languages.

## Main Results

| Method | NDCG@10 | Index time | Query p50 |
|--------|--------:|-----------:|----------:|
| CodeRankEmbed Hybrid | 0.862 | 57 s | 16 ms |
| **semble** | **0.854** | **263 ms** | **1.5 ms** |
| CodeRankEmbed | 0.765 | 57 s | 16 ms |
| ColGREP | 0.693 | 5.8 s | 124 ms |
| BM25 | 0.673 | 263 ms | 0.02 ms |
| ripgrep | 0.126 | n/a | 12 ms |

Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster** — entirely on CPU.
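For readers unfamiliar with the metric, NDCG@10 scores a ranked result list between 0 and 1, rewarding placement of relevant chunks near the top. A minimal sketch of the standard definition (illustrative only, not semble's evaluation harness, which scores judged relevance over the full dataset):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query.

    relevances: graded relevance of each returned result, in ranked order.
    Sketch simplification: the ideal ordering is derived from this same list;
    a full evaluator would rank all judged-relevant items.
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ranking scores 1.0; burying the only relevant chunk at rank 3 scores 0.5.
perfect = ndcg_at_k([1, 0, 0])
buried = ndcg_at_k([0, 0, 1])
```

The logarithmic discount is why the table's gap between semble (0.854) and plain BM25 (0.673) is meaningful: it reflects relevant code consistently appearing in the first few results, not just anywhere in the top 10.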
The charts below plot latency against NDCG@10. Marker size reflects model parameter count.

![Speed vs quality (cold start)](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/speed_vs_ndcg_cold.png)
*Time to first result (index + query) vs NDCG@10*

![Speed vs quality (warm)](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/speed_vs_ndcg_warm.png)
*Query latency on a warm index vs NDCG@10*

## By Language

NDCG@10 per language. The best score per row is bolded.

| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
|----------|-------:|-----------:|----:|--------:|--------:|
| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |

## Ablations

`raw` returns retrieval scores directly; `+ ranking` feeds them through semble's hybrid reranker.

| Retrieval | Raw | + ranking |
|-----------|----:|----------:|
| BM25 | 0.675 | 0.834 |
| potion-code-16M | 0.650 | 0.821 |
| BM25 + potion-code-16M | n/a | **0.854** |

By query category:

| Mode | Architecture | Semantic | Symbol |
|------|-------------:|---------:|-------:|
| BM25 raw | 0.628 | 0.676 | 0.719 |
| potion-code-16M raw | 0.626 | 0.666 | 0.629 |
| semble BM25 (+ ranking) | 0.770 | 0.819 | 0.957 |
| semble potion-code-16M (+ ranking) | 0.757 | 0.808 | 0.943 |
| **semble hybrid** | **0.802** | **0.846** | **0.958** |

## Dataset

~1,250 queries over 63 repositories in 19 languages, grouped into three categories:

| Category | Queries | What it tests |
|----------|--------:|---------------|
| semantic | 711 | Code that implements a specific behavior or concept |
| architecture | 343 | Design decisions, module boundaries, structural patterns |
| symbol | 204 | Named entity lookup (function, class, type, variable) |

Languages covered: bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotlin, Lua, PHP, Python, Ruby, Rust, Scala, Swift, TypeScript, Zig.

## Methods

- **[ripgrep](https://github.com/BurntSushi/ripgrep)** — fast regex search, included as a raw keyword-match baseline.
- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)** — late-interaction code retrieval with the LateOn-Code-edge model.
- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** — 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
- **semble** — [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
Lines changed: 37 additions & 0 deletions
---
title: Installation
description: How to install Semble
sidebar:
  icon: seti:config
---

## Requirements

- Python 3.9 or higher
- No GPU, API keys, or external services required — runs fully on CPU

## Install

```bash
pip install semble
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add semble
```

## MCP Server Extra

To use Semble as an [MCP server](/packages/semble/mcp-server/) with agents like Claude Code, Cursor, or OpenCode, install the `mcp` extra:

```bash
pip install "semble[mcp]"
```

Or, use [uvx](https://docs.astral.sh/uv/guides/tools/) to run it without a permanent install:

```bash
uvx --from "semble[mcp]" semble
```
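Once installed, the server is registered with an agent through that agent's MCP configuration. As an illustrative sketch (the file location and exact schema vary by client; the `mcpServers` map below follows the convention used by Claude Code and several other MCP clients), an entry reusing the uvx invocation above might look like:

```json
{
  "mcpServers": {
    "semble": {
      "command": "uvx",
      "args": ["--from", "semble[mcp]", "semble"]
    }
  }
}
```

Consult your agent's MCP documentation for the exact registration steps.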
Lines changed: 69 additions & 0 deletions
---
title: Semble
description: Fast and Accurate Code Search for Agents
sidebar:
  icon: open-book
---

[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, cutting both token usage and waiting time on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.

Run it as an [MCP server](/packages/semble/mcp-server/) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.

## Quick Start

Install Semble:

```bash
pip install semble  # Install with pip
uv add semble      # Install with uv
```

Index a repo and search it:

```python
from semble import SembleIndex

# Index a local directory
index = SembleIndex.from_path("./my-project")

# Index a remote git repository
index = SembleIndex.from_git("https://github.com/MinishLab/model2vec")

# Search with a natural-language or code query
results = index.search("save model to disk", top_k=3)

# Find code similar to a specific result
related = index.find_related(results[0], top_k=3)

# Each result exposes the matched chunk
result = results[0]
result.chunk.file_path   # "model2vec/model.py"
result.chunk.start_line  # 127
result.chunk.end_line    # 150
result.chunk.content     # "def save_pretrained(self, path: PathLike, ..."
```

## Main Features

- **Fast**: indexes a repo in ~250 ms and answers queries in ~1.5 ms, all on CPU.
- **Accurate**: NDCG@10 of 0.854 on the [benchmarks](/packages/semble/benchmarks/), on par with code-specialized transformer models at a fraction of the size and cost.
- **Local and remote**: pass a local path or a git URL; indexes are cached for the session.
- **MCP server**: drop-in tool for Claude Code, Cursor, Codex, OpenCode, and any other MCP-compatible agent.
- **Zero setup**: runs on CPU with no API keys, GPU, or external services required.

## How It Works

Semble splits each file into code-aware chunks using [Chonkie](https://github.com/chonkie-inc/chonkie), then scores every query with two complementary retrievers:

- **Semantic**: static [Model2Vec](https://github.com/MinishLab/model2vec) embeddings from the code-specialized [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) model.
- **Lexical**: [BM25](https://github.com/xhluca/bm25s) for exact matches on identifiers and API names.

The two score lists are fused with Reciprocal Rank Fusion (RRF) and then reranked with a set of code-aware signals:

- **Adaptive weighting** — symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
- **Definition boosts** — a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
- **Identifier stems** — query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
- **File coherence** — when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
- **Noise penalties** — test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.

Because the embedding model is static with no transformer forward pass at query time, all of this runs in milliseconds on CPU.
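For intuition, the RRF fusion step can be sketched in a few lines. This is an illustrative implementation using the conventional k = 60 constant, not semble's actual code, which also applies the adaptive weighting and boosts described above:

```python
def rrf_fuse(rankings, k=60):
    """Fuse several best-first ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists that retrieved it,
    so agreement between retrievers outweighs a single high rank.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "chunk_c" is mid-ranked by BOTH retrievers, so it overtakes "chunk_b",
# which only the lexical retriever found:
lexical = ["chunk_a", "chunk_b", "chunk_c"]
semantic = ["chunk_a", "chunk_c", "chunk_d"]
fused = rrf_fuse([lexical, semantic])
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and embedding similarities, which live on incomparable scales.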
