---
title: Benchmarks
description: Quality and speed benchmarks for Semble
sidebar:
  icon: chart-bar
---

We benchmark quality and speed across all methods on ~1,250 queries over 63 repositories in 19 languages.

## Main Results

| Method | NDCG@10 | Index time | Query p50 |
|--------|--------:|-----------:|----------:|
| CodeRankEmbed Hybrid | 0.862 | 57 s | 16 ms |
| **semble** | **0.854** | **263 ms** | **1.5 ms** |
| CodeRankEmbed | 0.765 | 57 s | 16 ms |
| ColGREP | 0.693 | 5.8 s | 124 ms |
| BM25 | 0.673 | 263 ms | 0.02 ms |
| ripgrep | 0.126 | n/a | 12 ms |

Semble achieves 99% of the retrieval quality of the 137M-parameter CodeRankEmbed Hybrid while indexing **218× faster** and answering queries **11× faster**, entirely on CPU.

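As a sanity check, the headline ratios can be recomputed from the rounded values in the Main Results table. This is a minimal sketch; the dictionary and variable names are illustrative and not part of semble. Because the published values are rounded, the recomputed indexing speedup lands near 217×, slightly below the quoted 218×, which was presumably derived from unrounded timings.

```python
# Values copied from the Main Results table (rounded as published).
cre_hybrid = {"ndcg": 0.862, "index_s": 57.0, "query_ms": 16.0}
semble = {"ndcg": 0.854, "index_s": 0.263, "query_ms": 1.5}

# Fraction of CodeRankEmbed Hybrid's NDCG@10 that semble retains.
quality_ratio = semble["ndcg"] / cre_hybrid["ndcg"]
# How many times faster semble indexes and answers queries.
index_speedup = cre_hybrid["index_s"] / semble["index_s"]
query_speedup = cre_hybrid["query_ms"] / semble["query_ms"]

print(f"quality retained: {quality_ratio:.1%}")
print(f"index speedup:    {index_speedup:.0f}x")
print(f"query speedup:    {query_speedup:.1f}x")
```
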
The charts below plot latency against NDCG@10. Marker size reflects model parameter count.

*Time to first result (index + query) vs NDCG@10*

*Query latency on a warm index vs NDCG@10*

## By Language

NDCG@10 per language. The best score in each row is bolded; the overall row is bolded throughout. CRE abbreviates CodeRankEmbed.

| Language | semble | CRE Hybrid | CRE | ColGREP | ripgrep |
|----------|-------:|-----------:|----:|--------:|--------:|
| scala | 0.909 | **0.922** | 0.845 | 0.765 | 0.180 |
| cpp | **0.915** | 0.913 | 0.846 | 0.626 | 0.126 |
| ruby | **0.909** | **0.909** | 0.769 | 0.708 | 0.230 |
| elixir | 0.894 | **0.905** | 0.869 | 0.808 | 0.134 |
| javascript | 0.917 | 0.903 | **0.920** | 0.823 | 0.176 |
| zig | **0.913** | 0.901 | 0.807 | 0.474 | 0.000 |
| csharp | 0.885 | **0.889** | 0.743 | 0.614 | 0.117 |
| go | **0.895** | 0.884 | 0.676 | 0.785 | 0.133 |
| python | 0.867 | **0.880** | 0.794 | 0.777 | 0.202 |
| php | 0.858 | **0.874** | 0.758 | 0.663 | 0.123 |
| swift | 0.860 | **0.873** | 0.721 | 0.710 | 0.160 |
| bash | 0.825 | 0.852 | **0.892** | 0.706 | 0.000 |
| lua | 0.823 | **0.847** | 0.803 | 0.798 | 0.000 |
| java | **0.849** | 0.841 | 0.706 | 0.641 | 0.198 |
| kotlin | 0.821 | **0.830** | 0.670 | 0.637 | 0.166 |
| rust | **0.856** | 0.827 | 0.627 | 0.662 | 0.162 |
| c | 0.741 | **0.806** | 0.706 | 0.676 | 0.000 |
| haskell | 0.765 | 0.771 | **0.776** | 0.683 | 0.000 |
| typescript | 0.706 | **0.708** | 0.545 | 0.430 | 0.128 |
| **overall** | **0.854** | **0.862** | **0.765** | **0.693** | **0.126** |

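For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 returned results against the best possible ordering, so a score of 1.0 means every relevant item was ranked as high as it could be. The sketch below uses the common linear-gain formulation; the function names are hypothetical, and the benchmark harness may compute the metric differently (for example with exponential gain).

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    # rank is 0-based, so the first result is discounted by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# One relevant item exists and the system returned it at rank 3.
print(ndcg_at_k([0, 0, 1], [1]))  # 1 / log2(4) = 0.5
```

Here `ranked_relevances` holds the relevance label of each returned result in order, and `all_relevances` holds the labels of every relevant item for the query, which is what defines the ideal ranking.
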
## Ablations

`Raw` uses the retrieval scores directly; `+ ranking` feeds them through semble's hybrid reranker.

| Retrieval | Raw | + ranking |
|-----------|----:|----------:|
| BM25 | 0.675 | 0.834 |
| potion-code-16M | 0.650 | 0.821 |
| BM25 + potion-code-16M | n/a | **0.854** |

By query category:

| Mode | Architecture | Semantic | Symbol |
|------|-------------:|---------:|-------:|
| BM25 raw | 0.628 | 0.676 | 0.719 |
| potion-code-16M raw | 0.626 | 0.666 | 0.629 |
| semble BM25 (+ ranking) | 0.770 | 0.819 | 0.957 |
| semble potion-code-16M (+ ranking) | 0.757 | 0.808 | 0.943 |
| **semble hybrid** | **0.802** | **0.846** | **0.958** |

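One way to read the first ablation table is to separate the gain from reranking from the gain of combining both retrievers. The snippet below is only arithmetic on the published NDCG@10 values; the variable names are illustrative.

```python
# NDCG@10 values copied from the first ablation table.
raw = {"BM25": 0.675, "potion-code-16M": 0.650}
ranked = {"BM25": 0.834, "potion-code-16M": 0.821}
hybrid = 0.854

# Gain from adding the reranker on top of each single retriever.
uplift = {name: round(ranked[name] - raw[name], 3) for name in raw}
# Extra gain from combining both retrievers over the best single one.
hybrid_gain = round(hybrid - max(ranked.values()), 3)

print(uplift)       # reranking uplift per retriever
print(hybrid_gain)  # additional gain from the hybrid
```

On these numbers, reranking adds 0.159 (BM25) and 0.171 (potion-code-16M), and combining the two retrievers adds a further 0.020 over the best single-retriever configuration.
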
## Dataset

~1,250 queries over 63 repositories in 19 languages, grouped into three categories:

| Category | Queries | What it tests |
|----------|--------:|---------------|
| semantic | 711 | Code that implements a specific behavior or concept |
| architecture | 343 | Design decisions, module boundaries, structural patterns |
| symbol | 204 | Named entity lookup (function, class, type, variable) |

Languages covered: Bash, C, C++, C#, Elixir, Go, Haskell, Java, JavaScript, Kotlin, Lua, PHP, Python, Ruby, Rust, Scala, Swift, TypeScript, Zig.

## Methods

- **[ripgrep](https://github.com/BurntSushi/ripgrep)**: fast regex search, included as a raw keyword-match baseline.
- **[ColGREP](https://github.com/lightonai/next-plaid/tree/main/colgrep)**: late-interaction code retrieval with the LateOn-Code-edge model.
- **[CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)**: 137M-parameter transformer embedding model. *CRE Hybrid* fuses its dense scores with BM25.
- **semble**: [potion-code-16M](https://huggingface.co/minishlab/potion-code-16M) static embeddings + BM25 + the semble reranking stack.
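
This page does not specify how *CRE Hybrid* or semble combine dense and BM25 results. As an illustration of hybrid fusion in general, the sketch below uses reciprocal rank fusion (RRF), a common technique that is assumed here and is not confirmed to be what either method uses. The function name, the constant `k = 60`, and the file names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a lexical and a dense retriever.
bm25_ranking = ["a.py", "b.py", "c.py"]
dense_ranking = ["b.py", "c.py", "a.py"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF only needs ranks, not raw scores, so it sidesteps the problem that BM25 scores and embedding similarities live on different scales.
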