2 changes: 1 addition & 1 deletion README.md
@@ -311,7 +311,7 @@ Please help us add more systems and run the benchmarks on more types of VMs:
- [ ] MS SQL Server with Column Store Index (without publishing)
- [ ] OceanBase
- [ ] Planetscale (without publishing)
- [ ] Quickwit
- [x] Quickwit
- [ ] Redshift Spectrum
- [ ] Seafowl
- [ ] ShitholeDB
58 changes: 58 additions & 0 deletions quickwit/README.md
@@ -0,0 +1,58 @@
# Quickwit

[Quickwit](https://quickwit.io) is a Rust-based search engine for log analytics, built on top of [Tantivy](https://github.com/quickwit-oss/tantivy). It exposes an Elasticsearch-compatible REST API for ingestion and search, but does not implement an SQL endpoint, so this benchmark uses the native Elasticsearch query DSL directly.

## Methodology

Infrastructure:
- Single-node Quickwit 0.8.2 on an AWS EC2 c6a.4xlarge instance

Index configuration (`index_config.yaml`):
- All scalar fields are declared with `fast: true` so they can participate in aggregations and sorts (Quickwit aggregations require fast fields).
- Keyword-like text fields use the `raw` tokenizer with the `raw` fast-field normalizer to mimic Elasticsearch's `keyword` mapping.
- `EventTime` is set as the index's timestamp field, providing time-based pruning.

Ingestion (`load.py`):
- Reads `hits.json.gz` and streams NDJSON to the Elasticsearch-compatible bulk endpoint at `/api/v1/_elastic/hits/_bulk`.
- Quickwit's bulk endpoint honors only the `create` action and rejects payloads larger than 10 MB, so batches are smaller than those of the Elasticsearch loader (a sketch of the payload shape follows below).
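
For illustration, a minimal sketch of the bulk payload shape (the endpoint and index name come from `load.py`; the two inline documents are made-up placeholders, not real ClickBench rows):

```python
import json

# Quickwit honors only the `create` bulk action; `index`, `update`, and
# `delete` are not supported.
action = json.dumps({"create": {"_index": "hits"}})

docs = [
    {"WatchID": 1, "URL": "http://example.com/a"},  # placeholder documents
    {"WatchID": 2, "URL": "http://example.com/b"},
]

# NDJSON: an action line before each document line, every line newline-terminated.
body = "".join(f"{action}\n{json.dumps(doc)}\n" for doc in docs)
print(body)
```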

Queries (`queries.json`):
- Each query in `queries.sql` is hand-translated to the Elasticsearch DSL on the corresponding line of `queries.json` and submitted to `/api/v1/_elastic/hits/_search` (an example translation follows below).
- Timing is taken from the `took` field returned by Quickwit (milliseconds, engine-internal).
- Queries that are not expressible in Quickwit's DSL are recorded as `null`.
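
For example, a filter such as `SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0` maps to a `bool`/`must_not` query with `size: 0`. A minimal sketch, assuming the endpoint above; `track_total_hits` is included for Elasticsearch parity and assumed to be accepted, and the exact DSL in `queries.json` may differ:

```python
import requests

QW_SEARCH = "http://localhost:7280/api/v1/_elastic/hits/_search"

# SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0:
# exclude AdvEngineID = 0, return no documents, count every match.
query = {
    "size": 0,
    "track_total_hits": True,
    "query": {"bool": {"must_not": [{"term": {"AdvEngineID": {"value": 0}}}]}},
}

resp = requests.post(QW_SEARCH, json=query, timeout=120)
body = resp.json()
print(body["hits"]["total"]["value"], "rows in", body["took"], "ms")
```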

## Unsupported queries

Quickwit's aggregation and query model is narrower than Elasticsearch's. The following ClickBench queries cannot currently be expressed and are reported as `null`:

| Q | Reason |
|----|-----------------------------------------------------------------------|
| 5 | `COUNT(DISTINCT)` — Quickwit has no `cardinality` aggregation |
| 6 | `COUNT(DISTINCT)` |
| 9 | `COUNT(DISTINCT)` |
| 10 | `COUNT(DISTINCT)` |
| 11 | `COUNT(DISTINCT)` |
| 12 | `COUNT(DISTINCT)` |
| 14 | `COUNT(DISTINCT)` |
| 19 | `extract(minute FROM …)` — no scripted/runtime fields |
| 21 | `LIKE '%…%'` — leading wildcards rejected, no `wildcard`/`regexp` |
| 22 | `LIKE '%…%'` |
| 23 | `COUNT(DISTINCT)` |
| 24 | `LIKE '%…%'` |
| 26 | `ORDER BY` on text field — not supported by the search backend |
| 27 | `ORDER BY` on text field |
| 28 | `AVG(length(URL))` — no scripted/runtime fields |
| 29 | `REGEXP_REPLACE` — not supported |
| 30 | `SUM(col + N)` — no scripted aggregations |
| 36 | `ClientIP - N` — no scripted aggregations |
| 40 | `CASE WHEN …` — no scripted/runtime fields |

All other queries run through the native Elasticsearch DSL.
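
For the `GROUP BY` queries that do run, the translation relies on a `terms` aggregation over a fast field. A hedged sketch for a query of the shape `SELECT RegionID, COUNT(*) FROM hits GROUP BY RegionID ORDER BY COUNT(*) DESC LIMIT 10` (again, the exact DSL in `queries.json` may differ):

```python
import requests

QW_SEARCH = "http://localhost:7280/api/v1/_elastic/hits/_search"

# GROUP BY RegionID ORDER BY COUNT(*) DESC LIMIT 10 becomes a `terms`
# aggregation; this works only because RegionID is a fast field.
query = {
    "size": 0,
    "aggs": {
        "by_region": {
            "terms": {"field": "RegionID", "size": 10, "order": {"_count": "desc"}}
        }
    },
}

resp = requests.post(QW_SEARCH, json=query, timeout=120)
for bucket in resp.json()["aggregations"]["by_region"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```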

## Running

```bash
bash benchmark.sh
```

This installs Quickwit, creates the index, downloads `hits.json.gz`, ingests the data via the ES bulk API, and then runs `run.sh` to time each query three times with caches dropped between runs.
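
`run.sh` itself is not part of this diff; a minimal sketch of what such a harness could look like, assuming `queries.json` holds one DSL query (or `null`) per line as described above, and omitting the cache-dropping step, which requires root:

```python
import json

import requests

QW_SEARCH = "http://localhost:7280/api/v1/_elastic/hits/_search"
TRIES = 3

with open("queries.json") as f:
    for qnum, line in enumerate(f, start=1):
        query = json.loads(line)
        if query is None:
            # Recorded as null: not expressible in Quickwit's DSL.
            print(f"Q{qnum}: null")
            continue
        # Engine-internal timing from the `took` field, in milliseconds.
        times = [
            requests.post(QW_SEARCH, json=query, timeout=600).json()["took"]
            for _ in range(TRIES)
        ]
        print(f"Q{qnum}: {times}")
```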
65 changes: 65 additions & 0 deletions quickwit/benchmark.sh
@@ -0,0 +1,65 @@
#!/bin/bash
set -e

# Install prerequisites
sudo apt-get update -y
sudo apt-get install -y wget curl jq bc python3 python3-requests

# Download Quickwit
QW_VERSION="0.8.2"
ARCH=$(uname -m)
QW_DIR="quickwit-v${QW_VERSION}"
wget --continue --progress=dot:giga \
"https://github.com/quickwit-oss/quickwit/releases/download/v${QW_VERSION}/${QW_DIR}-${ARCH}-unknown-linux-gnu.tar.gz"
tar xzf "${QW_DIR}-${ARCH}-unknown-linux-gnu.tar.gz"

# Start the server in the background. Quickwit defaults: REST on 7280, gRPC on 7281.
pushd "$QW_DIR" >/dev/null
nohup ./quickwit run > ../quickwit.log 2>&1 &
QW_PID=$!
popd >/dev/null
echo "Quickwit started (PID $QW_PID)"

# Wait for the server to come up; abort if it never does.
for i in $(seq 1 60); do
    if curl -sS -f http://localhost:7280/api/v1/version >/dev/null 2>&1; then
        break
    fi
    sleep 1
done
curl -sS -f http://localhost:7280/api/v1/version >/dev/null 2>&1 \
    || { echo "Quickwit did not start; see quickwit.log" >&2; exit 1; }
echo "Quickwit is ready"

# Create the index from the YAML config.
curl -sS -X POST http://localhost:7280/api/v1/indexes \
-H 'Content-Type: application/yaml' \
--data-binary @index_config.yaml

# Download the data
wget --continue --progress=dot:giga 'https://datasets.clickhouse.com/hits_compatible/hits.json.gz'

START=$(date +%s)

# Stream JSON directly into Quickwit via the Elasticsearch-compatible bulk API.
python3 load.py

# Wait for in-flight splits to commit and become searchable. The commit
# timeout in index_config.yaml is 30s, so wait a bit longer than that.
sleep 60

# Show stats.
curl -sS "http://localhost:7280/api/v1/indexes/hits/describe" | tee stats.json
echo

END=$(date +%s)
echo "Load time: $((END - START))"

# Data size on disk (single-node uses qwdata/ inside the install dir).
echo -n "Data size: "
du -sb "$QW_DIR/qwdata" 2>/dev/null | awk '{print $1}'

# Run queries
chmod +x run.sh
./run.sh

# Stop Quickwit
kill "$QW_PID" 2>/dev/null || true
149 changes: 149 additions & 0 deletions quickwit/index_config.yaml
@@ -0,0 +1,149 @@
version: 0.8

index_id: hits

doc_mapping:
mode: strict
timestamp_field: EventTime
field_mappings:
- {name: WatchID, type: i64, indexed: true, fast: true}
- {name: JavaEnable, type: i64, indexed: true, fast: true}
- {name: Title, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: GoodEvent, type: i64, indexed: true, fast: true}
- name: EventTime
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- name: EventDate
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- {name: CounterID, type: i64, indexed: true, fast: true}
- {name: ClientIP, type: i64, indexed: true, fast: true}
- {name: RegionID, type: i64, indexed: true, fast: true}
- {name: UserID, type: i64, indexed: true, fast: true}
- {name: CounterClass, type: i64, indexed: true, fast: true}
- {name: OS, type: i64, indexed: true, fast: true}
- {name: UserAgent, type: i64, indexed: true, fast: true}
- {name: URL, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: Referer, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: IsRefresh, type: i64, indexed: true, fast: true}
- {name: RefererCategoryID, type: i64, indexed: true, fast: true}
- {name: RefererRegionID, type: i64, indexed: true, fast: true}
- {name: URLCategoryID, type: i64, indexed: true, fast: true}
- {name: URLRegionID, type: i64, indexed: true, fast: true}
- {name: ResolutionWidth, type: i64, indexed: true, fast: true}
- {name: ResolutionHeight, type: i64, indexed: true, fast: true}
- {name: ResolutionDepth, type: i64, indexed: true, fast: true}
- {name: FlashMajor, type: i64, indexed: true, fast: true}
- {name: FlashMinor, type: i64, indexed: true, fast: true}
- {name: FlashMinor2, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: NetMajor, type: i64, indexed: true, fast: true}
- {name: NetMinor, type: i64, indexed: true, fast: true}
- {name: UserAgentMajor, type: i64, indexed: true, fast: true}
- {name: UserAgentMinor, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: CookieEnable, type: i64, indexed: true, fast: true}
- {name: JavascriptEnable, type: i64, indexed: true, fast: true}
- {name: IsMobile, type: i64, indexed: true, fast: true}
- {name: MobilePhone, type: i64, indexed: true, fast: true}
- {name: MobilePhoneModel, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: Params, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: IPNetworkID, type: i64, indexed: true, fast: true}
- {name: TraficSourceID, type: i64, indexed: true, fast: true}
- {name: SearchEngineID, type: i64, indexed: true, fast: true}
- {name: SearchPhrase, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: AdvEngineID, type: i64, indexed: true, fast: true}
- {name: IsArtifical, type: i64, indexed: true, fast: true}
- {name: WindowClientWidth, type: i64, indexed: true, fast: true}
- {name: WindowClientHeight, type: i64, indexed: true, fast: true}
- {name: ClientTimeZone, type: i64, indexed: true, fast: true}
- name: ClientEventTime
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- {name: SilverlightVersion1, type: i64, indexed: true, fast: true}
- {name: SilverlightVersion2, type: i64, indexed: true, fast: true}
- {name: SilverlightVersion3, type: i64, indexed: true, fast: true}
- {name: SilverlightVersion4, type: i64, indexed: true, fast: true}
- {name: PageCharset, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: CodeVersion, type: i64, indexed: true, fast: true}
- {name: IsLink, type: i64, indexed: true, fast: true}
- {name: IsDownload, type: i64, indexed: true, fast: true}
- {name: IsNotBounce, type: i64, indexed: true, fast: true}
- {name: FUniqID, type: i64, indexed: true, fast: true}
- {name: OriginalURL, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: HID, type: i64, indexed: true, fast: true}
- {name: IsOldCounter, type: i64, indexed: true, fast: true}
- {name: IsEvent, type: i64, indexed: true, fast: true}
- {name: IsParameter, type: i64, indexed: true, fast: true}
- {name: DontCountHits, type: i64, indexed: true, fast: true}
- {name: WithHash, type: i64, indexed: true, fast: true}
- {name: HitColor, type: text, tokenizer: raw, fast: {normalizer: raw}}
- name: LocalEventTime
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- {name: Age, type: i64, indexed: true, fast: true}
- {name: Sex, type: i64, indexed: true, fast: true}
- {name: Income, type: i64, indexed: true, fast: true}
- {name: Interests, type: i64, indexed: true, fast: true}
- {name: Robotness, type: i64, indexed: true, fast: true}
- {name: RemoteIP, type: i64, indexed: true, fast: true}
- {name: WindowName, type: i64, indexed: true, fast: true}
- {name: OpenerName, type: i64, indexed: true, fast: true}
- {name: HistoryLength, type: i64, indexed: true, fast: true}
- {name: BrowserLanguage, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: BrowserCountry, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: SocialNetwork, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: SocialAction, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: HTTPError, type: i64, indexed: true, fast: true}
- {name: SendTiming, type: i64, indexed: true, fast: true}
- {name: DNSTiming, type: i64, indexed: true, fast: true}
- {name: ConnectTiming, type: i64, indexed: true, fast: true}
- {name: ResponseStartTiming, type: i64, indexed: true, fast: true}
- {name: ResponseEndTiming, type: i64, indexed: true, fast: true}
- {name: FetchTiming, type: i64, indexed: true, fast: true}
- {name: SocialSourceNetworkID, type: i64, indexed: true, fast: true}
- {name: SocialSourcePage, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: ParamPrice, type: i64, indexed: true, fast: true}
- {name: ParamOrderID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: ParamCurrency, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: ParamCurrencyID, type: i64, indexed: true, fast: true}
- {name: OpenstatServiceName, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: OpenstatCampaignID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: OpenstatAdID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: OpenstatSourceID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMSource, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMMedium, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMCampaign, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMContent, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMTerm, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: FromTag, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: HasGCLID, type: i64, indexed: true, fast: true}
- {name: RefererHash, type: i64, indexed: true, fast: true}
- {name: URLHash, type: i64, indexed: true, fast: true}
- {name: CLID, type: i64, indexed: true, fast: true}

store_source: false

indexing_settings:
commit_timeout_secs: 30
merge_policy:
type: stable_log
merge_factor: 10
max_merge_factor: 12

search_settings:
default_search_fields: []
68 changes: 68 additions & 0 deletions quickwit/load.py
@@ -0,0 +1,68 @@
import gzip
import json
from itertools import islice

import requests

# Quickwit's _bulk endpoint accepts at most 10MB per request; keep batches
# small enough to stay under the limit comfortably.
BULK_SIZE = 2000
QW_URL = "http://localhost:7280/api/v1/_elastic/hits/_bulk"
TOTAL_RECORDS = 99997497

# Quickwit only supports the "create" action of the Elasticsearch bulk API.
ACTION_META_BYTES = (json.dumps({"create": {"_index": "hits"}}) + "\n").encode("utf-8")
REQUEST_TIMEOUT = 120


def build_body(docs):
    parts = []
    for doc in docs:
        parts.append(ACTION_META_BYTES)
        data = doc.encode("utf-8") if isinstance(doc, str) else doc
        # Bulk bodies must be newline-terminated NDJSON; the last line of the
        # source file may be missing its trailing newline.
        if not data.endswith(b"\n"):
            data += b"\n"
        parts.append(data)
    return b"".join(parts)


def send_bulk(session, docs, batch_num):
# Quickwit's bulk endpoint requires a Content-Length header, so we have to
# buffer the body rather than streaming it.
resp = session.post(QW_URL, data=build_body(docs), timeout=REQUEST_TIMEOUT)
if resp.status_code >= 300:
print(
f"\nSent batch {batch_num} ({len(docs)} docs) - Warning: HTTP {resp.status_code}: {resp.text[:300]}"
)
return 0

body = resp.json()
if body.get("errors"):
items = body.get("items", [])
err = sum(1 for i in items if "error" in i.get("create", {}))
if err:
print(f"\nBatch {batch_num}: {err} item errors")

return len(docs)


def main():
total_docs = 0
batch_num = 0

with requests.Session() as session:
session.headers.update({"Content-Type": "application/x-ndjson"})

with gzip.open("hits.json.gz", mode="rt", encoding="utf-8") as f:
print("Reading from hits.json.gz")
while True:
docs = list(islice(f, BULK_SIZE))
if not docs:
break
batch_num += 1
total_docs += send_bulk(session, docs, batch_num)
                pct = (total_docs / TOTAL_RECORDS) * 100 if TOTAL_RECORDS else 0
                # Overwrite the progress line in place; the warnings above
                # start with "\n" so they drop to a fresh line first.
                print(f"\r {pct:.2f}% ({total_docs}/{TOTAL_RECORDS})", end="", flush=True)

print(f"\nTotal docs sent: {total_docs}")


if __name__ == "__main__":
main()