MinishLab
diff --git a/astro.config.mjs b/astro.config.mjs
Lines changed: 10 additions & 0 deletions
diff --git a/package-lock.json b/package-lock.json
Lines changed: 37 additions & 36 deletions
diff --git a/public/images/logos/semble_logo.png b/public/images/logos/semble_logo.png
760 KB
diff --git a/public/images/logos/semble_logo.webp b/public/images/logos/semble_logo.webp
120 KB
diff --git a/src/components/home/HomepagePackageList.astro b/src/components/home/HomepagePackageList.astro
Lines changed: 1 addition & 1 deletion
diff --git a/src/content/docs/packages/overview/index.mdx b/src/content/docs/packages/overview/index.mdx
Lines changed: 22 additions & 0 deletions
diff --git a/src/content/docs/packages/semble/benchmarks.mdx b/src/content/docs/packages/semble/benchmarks.mdx
Lines changed: 95 additions & 0 deletions
diff --git a/src/content/docs/packages/semble/installation.mdx b/src/content/docs/packages/semble/installation.mdx
Lines changed: 37 additions & 0 deletions
diff --git a/src/content/docs/packages/semble/introduction.mdx b/src/content/docs/packages/semble/introduction.mdx
Lines changed: 69 additions & 0 deletions
@@ -94,6 +94,16 @@ gtag('config', 'G-LQWDNXKF2X');`,
                 { label: 'Supported Backends', link: '/packages/vicinity/supported-backends/' },
               ],
             },
+            {
+              label: 'Semble',
+              items: [
+                { label: 'Introduction',  link: '/packages/semble/introduction/' },
+                { label: 'Installation',  link: '/packages/semble/installation/' },
+                { label: 'Usage',         link: '/packages/semble/usage/' },
+                { label: 'MCP Server',    link: '/packages/semble/mcp-server/' },
+                { label: 'Benchmarks',    link: '/packages/semble/benchmarks/' },
+              ],
+            },
             {
               label: 'Tokenlearn',
               items: [
 
@@ -14,7 +14,7 @@ const packages = await getHomepagePackageMetrics();
       <div class="pkg-row-main">
         <a href={pkg.docsHref} class="pkg-icon-link" aria-label={`${pkg.name} docs`}>
           <span class="pkg-icon-wrap">
-            <img class="pkg-icon" src={pkg.logoSrc} alt={pkg.logoAlt} loading="lazy" />
+            <img class={`pkg-icon pkg-icon--${pkg.name.toLowerCase().replace(/[^a-z0-9]+/g, '-')}`} src={pkg.logoSrc} alt={pkg.logoAlt} loading="lazy" />
           </span>
         </a>
         <a href={pkg.docsHref} class="pkg-row-info">
 
@@ -71,6 +71,28 @@ tableOfContents: false
     </div>
   </article>
 
+  <article class="overview-item">
+    <div class="overview-item-top">
+      <img class="overview-icon" src="/images/logos/semble_logo.webp" alt="Semble" loading="lazy" />
+      <div class="overview-copy">
+        <h2><a href="/packages/semble/introduction/">Semble</a></h2>
+        <p>Fast and accurate code search for agents.</p>
+      </div>
+    </div>
+    <div class="overview-item-bottom">
+      <div class="overview-tags">
+        <span class="overview-tag">Code Search</span>
+        <span class="overview-tag">MCP Server</span>
+        <span class="overview-tag">Agents</span>
+        <span class="overview-tag">Python</span>
+      </div>
+      <div class="overview-actions">
+        <a class="overview-link overview-link-primary" href="/packages/semble/introduction/">Docs</a>
+        <a class="overview-link" href="https://github.com/minishlab/semble">Repo</a>
+      </div>
+    </div>
+  </article>
+
   <article class="overview-item">
     <div class="overview-item-top">
       <img class="overview-icon" src="/images/logos/tokenlearn_logo.webp" alt="Tokenlearn" loading="lazy" />
 
@@ -0,0 +1,95 @@
+---
+title: Benchmarks
+description: Quality and speed benchmarks for Semble
+sidebar:
+  icon: chart-bar
+---
+
+We benchmark quality and speed across all methods on ~1,250 queries over 63 repositories in 19 languages.
+
+## Main Results
+
+| Method | NDCG@10 | Index time | Query p50 |
+|--------|--------:|-----------:|----------:|
+| CodeRankEmbed Hybrid | 0.862 | 57 s | 16 ms |
+| **semble** | **0.854** | **263 ms** | **1.5 ms** |
+| CodeRankEmbed | 0.765 | 57 s | 16 ms |
+| ColGREP | 0.693 | 5.8 s | 124 ms |
+| BM25 | 0.673 | 263 ms | 0.02 ms |
+| ripgrep | 0.126 | — | 12 ms |
+
+Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid, while indexing **218× faster** and answering queries **11× faster** — entirely on CPU.
+
+The charts below plot latency against NDCG@10. Marker size reflects model parameter count.
+
+![Speed vs quality (cold start)](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/speed_vs_ndcg_cold.png)
+*Time to first result (index + query) vs NDCG@10*
+
+![Speed vs quality (warm)](https://raw.githubusercontent.com/MinishLab/semble/main/assets/images/speed_vs_ndcg_warm.png)
+*Query latency on a warm index vs NDCG@10*
+
+## By Language
+
+NDCG@10 per language. The best score in each language row is bolded; the final row repeats the overall scores. CRE abbreviates CodeRankEmbed.
+
+| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
+|----------|-------:|-----------:|----:|--------:|--------:|
+| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
+| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
+| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
+| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
+| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
+| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
+| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
+| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
+| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
+| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
+| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
+| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
+| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
+| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
+| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
+| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
+| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
+| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
+| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
+| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |
+
+## Ablations
+
+`raw` returns retrieval scores directly; `+ ranking` feeds them through semble's hybrid reranker.
+
+| Retrieval | Raw | + ranking |
+|-----------|----:|----------:|
+| BM25 | 0.675 | 0.834 |
+| potion-code-16M | 0.650 | 0.821 |
+| BM25 + potion-code-16M | — | **0.854** |
+
+By query category:
+
+| Mode | Architecture | Semantic | Symbol |
+|------|-------------:|---------:|-------:|
+| BM25 raw | 0.628 | 0.676 | 0.719 |
+| potion-code-16M raw | 0.626 | 0.666 | 0.629 |
+| semble BM25 (+ ranking) | 0.770 | 0.819 | 0.957 |
+| semble potion-code-16M (+ ranking) | 0.757 | 0.808 | 0.943 |
+| **semble hybrid** | **0.802** | **0.846** | **0.958** |
+
+## Dataset
+
+~1,250 queries over 63 repositories in 19 languages, grouped into three categories:
+
+| Category | Queries | What it tests |
+|----------|--------:|---------------|
+| semantic | 711 | Code that implements a specific behavior or concept |
+| architecture | 343 | Design decisions, module boundaries, structural patterns |
+| symbol | 204 | Named entity lookup (function, class, type, variable) |
+
+Languages covered: Bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotlin, Lua, PHP, Python, Ruby, Rust, Scala, Swift, TypeScript, Zig.
+
+## Methods
+
+- **[ripgrep](https://github.com/BurntSushi/ripgrep)** — fast regex search, included as a raw keyword-match baseline.
+- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)** — late-interaction code retrieval with the LateOn-Code-edge model.
+- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)** — 137M-param transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
+- **semble** — [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
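For readers unfamiliar with the metric reported in the tables above, NDCG@10 can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the function names, the chunk identifiers, and the use of a relevance dictionary are assumptions made for the example, and the benchmark's actual relevance grading is not described in this page.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is discounted by log2 of its 1-based rank + 1."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: result identifiers in the order the retriever returned them.
    relevance:  hypothetical mapping from identifier to a relevance grade
                (missing identifiers count as 0).
    """
    gains = [relevance.get(item, 0.0) for item in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical query with one relevant chunk, returned at rank 2.
ranked = ["a.py:10", "b.py:5", "c.py:1"]
judged = {"b.py:5": 1.0}
score = ndcg_at_k(ranked, judged)
print(round(score, 3))
```

A score of 1.0 means every relevant chunk was returned in the ideal order within the top 10; the per-method numbers in the tables are averages of this per-query value.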
@@ -0,0 +1,37 @@
+---
+title: Installation
+description: How to install Semble
+sidebar:
+  icon: seti:config
+---
+
+## Requirements
+
+- Python 3.9 or higher
+- No GPU, API keys, or external services required — runs fully on CPU
+
+## Install
+
+```bash
+pip install semble
+```
+
+Or with [uv](https://docs.astral.sh/uv/):
+
+```bash
+uv add semble
+```
+
+## MCP Server Extra
+
+To use Semble as an [MCP server](/packages/semble/mcp-server/) with agents like Claude Code, Cursor, or OpenCode, install the `mcp` extra:
+
+```bash
+pip install "semble[mcp]"
+```
+
+Or, use [uvx](https://docs.astral.sh/uv/guides/tools/) to run it without a permanent install:
+
+```bash
+uvx --from "semble[mcp]" semble
+```
@@ -0,0 +1,69 @@
+---
+title: Semble
+description: Fast and Accurate Code Search for Agents
+sidebar:
+  icon: open-book
+---
+
+[Semble](https://github.com/MinishLab/semble) is a code search library built for agents. It returns the exact code snippets they need instantly, cutting both token usage and waiting time on every step. Indexing and searching a full codebase end-to-end takes under a second, with ~200x faster indexing and ~10x faster queries than a code-specialized transformer, at 99% of its retrieval quality (see [benchmarks](/packages/semble/benchmarks/)). Everything runs on CPU with no API keys, GPU, or external services.
+
+Run it as an [MCP server](/packages/semble/mcp-server/) and any agent (Claude Code, Cursor, Codex, OpenCode, etc.) gets instant access to any repo, cloned and indexed on demand.
+
+## Quick Start
+
+Install Semble:
+
+```bash
+pip install semble  # Install with pip
+uv add semble       # Or install with uv
+```
+
+Index a repo and search it:
+
+```python
+from semble import SembleIndex
+
+# Index a local directory
+index = SembleIndex.from_path("./my-project")
+
+# Or index a remote git repository instead (this replaces the local index above)
+index = SembleIndex.from_git("https://github.com/MinishLab/model2vec")
+
+# Search with a natural-language or code query
+results = index.search("save model to disk", top_k=3)
+
+# Find code similar to a specific result
+related = index.find_related(results[0], top_k=3)
+
+# Each result exposes the matched chunk
+result = results[0]
+result.chunk.file_path   # "model2vec/model.py"
+result.chunk.start_line  # 127
+result.chunk.end_line    # 150
+result.chunk.content     # "def save_pretrained(self, path: PathLike, ..."
+```
+
+## Main Features
+
+- **Fast**: indexes a repo in ~250 ms and answers queries in ~1.5 ms, all on CPU.
+- **Accurate**: NDCG@10 of 0.854 on the [benchmarks](/packages/semble/benchmarks/), on par with code-specialized transformer models at a fraction of the size and cost.
+- **Local and remote**: pass a local path or a git URL; indexes are cached for the session.
+- **MCP server**: drop-in tool for Claude Code, Cursor, Codex, OpenCode, and any other MCP-compatible agent.
+- **Zero setup**: runs on CPU with no API keys, GPU, or external services required.
+
+## How It Works
+
+Semble splits each file into code-aware chunks using [Chonkie](https://github.com/chonkie-inc/chonkie), then scores every chunk against the query with two complementary retrievers:
+
+- **Semantic**: static [Model2Vec](https://github.com/MinishLab/model2vec) embeddings from the code-specialized [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) model.
+- **Lexical**: [BM25](https://github.com/xhluca/bm25s) for exact matches on identifiers and API names.
+
+The two ranked lists are fused with Reciprocal Rank Fusion (RRF) and then reranked with a set of code-aware signals:
+
+- **Adaptive weighting** — symbol-like queries (`Foo::bar`, `getUserById`) get more lexical weight; natural-language queries stay balanced.
+- **Definition boosts** — a chunk that defines the queried symbol (`class`, `def`, `func`) ranks above chunks that merely reference it.
+- **Identifier stems** — query tokens are stemmed and matched against identifier stems, so `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.
+- **File coherence** — when multiple chunks from the same file match, the file is boosted so the top result reflects broad file-level relevance.
+- **Noise penalties** — test files, `compat`/`legacy` shims, example code, and `.d.ts` stubs are down-ranked so canonical implementations surface first.
+
+Because the embedding model is static with no transformer forward pass at query time, all of this runs in milliseconds on CPU.
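The fusion step described under "How It Works" can be illustrated with a small, generic sketch of Reciprocal Rank Fusion. This is not Semble's implementation: the function name, the chunk identifiers, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions, and the adaptive weighting and reranking signals listed above are not modeled.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list.

    Each chunk receives 1 / (k + rank) from every list it appears in,
    where rank is 1-based. k=60 is an assumed, conventional constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical rankings from the semantic and lexical retrievers.
semantic = ["a", "b", "c"]
lexical = ["b", "d", "a"]
fused = reciprocal_rank_fusion([semantic, lexical])
print(fused)
```

Because RRF works on ranks rather than raw scores, the embedding similarities and BM25 scores never need to be normalized onto a common scale, which is one reason it is a common choice for hybrid retrieval.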