2 changes: 1 addition & 1 deletion README.md
@@ -311,7 +311,7 @@ Please help us add more systems and run the benchmarks on more types of VMs:
- [ ] MS SQL Server with Column Store Index (without publishing)
- [ ] OceanBase
- [ ] Planetscale (without publishing)
- [ ] Quickwit
- [x] Quickwit
- [ ] Redshift Spectrum
- [ ] Seafowl
- [ ] ShitholeDB
58 changes: 58 additions & 0 deletions quickwit/README.md
@@ -0,0 +1,58 @@
# Quickwit

[Quickwit](https://quickwit.io) is a Rust-based search engine for log analytics, built on top of [Tantivy](https://github.com/quickwit-oss/tantivy). It exposes an Elasticsearch-compatible REST API for ingestion and search, but does not implement an SQL endpoint, so this benchmark uses the native Elasticsearch query DSL directly.

## Methodology

Infrastructure:
- Single-node Quickwit 0.8.2 on an AWS EC2 c6a.4xlarge instance

Index configuration (`index_config.yaml`):
- All scalar fields are declared with `fast: true` so they can participate in aggregations and sorts (Quickwit aggregations require fast fields).
- Keyword-like text fields use the `raw` tokenizer with the `raw` fast-field normalizer to mimic Elasticsearch's `keyword` mapping.
- `EventTime` is set as the index's timestamp field, providing time-based pruning.

Ingestion (`load.py`):
- Reads `hits.json.gz` and streams NDJSON to the Elasticsearch-compatible bulk endpoint at `/api/v1/_elastic/hits/_bulk`.
- Quickwit's bulk endpoint honors only the `create` action and rejects payloads larger than 10 MB, so batches are smaller than those of the Elasticsearch loader (a sketch of the payload shape follows below).
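
For illustration, a minimal sketch of the bulk payload shape (the endpoint and index name come from `load.py`; the two inline documents are made-up placeholders, not real ClickBench rows):

```python
import json

# Quickwit honors only the `create` bulk action; `index`, `update`, and
# `delete` are not supported.
action = json.dumps({"create": {"_index": "hits"}})

docs = [
    {"WatchID": 1, "URL": "http://example.com/a"},  # placeholder documents
    {"WatchID": 2, "URL": "http://example.com/b"},
]

# NDJSON: an action line before each document line, every line newline-terminated.
body = "".join(f"{action}\n{json.dumps(doc)}\n" for doc in docs)
print(body)
```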

Queries (`queries.json`):
- Each query in `queries.sql` is hand-translated to the Elasticsearch DSL on the corresponding line of `queries.json` and submitted to `/api/v1/_elastic/hits/_search` (an example translation follows below).
- Timing is taken from the `took` field returned by Quickwit (milliseconds, engine-internal).
- Queries that are not expressible in Quickwit's DSL are recorded as `null`.
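
For example, a filter such as `SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0` maps to a `bool`/`must_not` query with `size: 0`. A minimal sketch, assuming the endpoint above; `track_total_hits` is included for Elasticsearch parity and assumed to be accepted, and the exact DSL in `queries.json` may differ:

```python
import requests

QW_SEARCH = "http://localhost:7280/api/v1/_elastic/hits/_search"

# SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0:
# exclude AdvEngineID = 0, return no documents, count every match.
query = {
    "size": 0,
    "track_total_hits": True,
    "query": {"bool": {"must_not": [{"term": {"AdvEngineID": {"value": 0}}}]}},
}

resp = requests.post(QW_SEARCH, json=query, timeout=120)
body = resp.json()
print(body["hits"]["total"]["value"], "rows in", body["took"], "ms")
```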

## Unsupported queries

Quickwit's aggregation and query model is narrower than Elasticsearch's. The following ClickBench queries cannot currently be expressed and are reported as `null`:

| Q | Reason |
|----|-----------------------------------------------------------------------|
| 5 | `COUNT(DISTINCT)` — Quickwit has no `cardinality` aggregation |
| 6 | `COUNT(DISTINCT)` |
| 9 | `COUNT(DISTINCT)` |
| 10 | `COUNT(DISTINCT)` |
| 11 | `COUNT(DISTINCT)` |
| 12 | `COUNT(DISTINCT)` |
| 14 | `COUNT(DISTINCT)` |
| 19 | `extract(minute FROM …)` — no scripted/runtime fields |
| 21 | `LIKE '%…%'` — leading wildcards rejected, no `wildcard`/`regexp` |
| 22 | `LIKE '%…%'` |
| 23 | `COUNT(DISTINCT)` |
| 24 | `LIKE '%…%'` |
| 26 | `ORDER BY` on text field — not supported by the search backend |
| 27 | `ORDER BY` on text field |
| 28 | `AVG(length(URL))` — no scripted/runtime fields |
| 29 | `REGEXP_REPLACE` — not supported |
| 30 | `SUM(col + N)` — no scripted aggregations |
| 36 | `ClientIP - N` — no scripted aggregations |
| 40 | `CASE WHEN …` — no scripted/runtime fields |

All other queries run through the native Elasticsearch DSL.
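
For the `GROUP BY` queries that do run, the translation relies on a `terms` aggregation over a fast field. A hedged sketch for a query of the shape `SELECT RegionID, COUNT(*) FROM hits GROUP BY RegionID ORDER BY COUNT(*) DESC LIMIT 10` (again, the exact DSL in `queries.json` may differ):

```python
import requests

QW_SEARCH = "http://localhost:7280/api/v1/_elastic/hits/_search"

# GROUP BY RegionID ORDER BY COUNT(*) DESC LIMIT 10 becomes a `terms`
# aggregation; this works only because RegionID is a fast field.
query = {
    "size": 0,
    "aggs": {
        "by_region": {
            "terms": {"field": "RegionID", "size": 10, "order": {"_count": "desc"}}
        }
    },
}

resp = requests.post(QW_SEARCH, json=query, timeout=120)
for bucket in resp.json()["aggregations"]["by_region"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```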

## Running

```bash
bash benchmark.sh
```

This installs Quickwit, creates the index, downloads `hits.json.gz`, ingests the data via the ES bulk API, and then runs `run.sh` to time each query three times with caches dropped between runs.
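
`run.sh` itself is not part of this diff; a minimal sketch of what such a harness could look like, assuming `queries.json` holds one DSL query (or `null`) per line as described above, and omitting the cache-dropping step, which requires root:

```python
import json

import requests

QW_SEARCH = "http://localhost:7280/api/v1/_elastic/hits/_search"
TRIES = 3

with open("queries.json") as f:
    for qnum, line in enumerate(f, start=1):
        query = json.loads(line)
        if query is None:
            # Recorded as null: not expressible in Quickwit's DSL.
            print(f"Q{qnum}: null")
            continue
        # Engine-internal timing from the `took` field, in milliseconds.
        times = [
            requests.post(QW_SEARCH, json=query, timeout=600).json()["took"]
            for _ in range(TRIES)
        ]
        print(f"Q{qnum}: {times}")
```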
65 changes: 65 additions & 0 deletions quickwit/benchmark.sh
@@ -0,0 +1,65 @@
#!/bin/bash
set -e

# Install prerequisites
sudo apt-get update -y
sudo apt-get install -y wget curl jq bc python3 python3-requests

# Download Quickwit
QW_VERSION="0.8.2"
ARCH=$(uname -m)
QW_DIR="quickwit-v${QW_VERSION}"
wget --continue --progress=dot:giga \
"https://github.com/quickwit-oss/quickwit/releases/download/v${QW_VERSION}/${QW_DIR}-${ARCH}-unknown-linux-gnu.tar.gz"
tar xzf "${QW_DIR}-${ARCH}-unknown-linux-gnu.tar.gz"

# Start the server in the background. Quickwit defaults: REST on 7280, gRPC on 7281.
pushd "$QW_DIR" >/dev/null
nohup ./quickwit run > ../quickwit.log 2>&1 &
QW_PID=$!
popd >/dev/null
echo "Quickwit started (PID $QW_PID)"

# Wait for the server to come up; abort if it never does.
for i in $(seq 1 60); do
    if curl -sS -f http://localhost:7280/api/v1/version >/dev/null 2>&1; then
        break
    fi
    sleep 1
done
curl -sS -f http://localhost:7280/api/v1/version >/dev/null 2>&1 \
    || { echo "Quickwit did not start; see quickwit.log" >&2; exit 1; }
echo "Quickwit is ready"

# Create the index from the YAML config.
curl -sS -X POST http://localhost:7280/api/v1/indexes \
-H 'Content-Type: application/yaml' \
--data-binary @index_config.yaml

# Download the data
wget --continue --progress=dot:giga 'https://datasets.clickhouse.com/hits_compatible/hits.json.gz'

START=$(date +%s)

# Stream JSON directly into Quickwit via the Elasticsearch-compatible bulk API.
python3 load.py

# Wait for in-flight splits to commit and become searchable. The commit
# timeout in index_config.yaml is 30s, so wait a bit longer than that.
sleep 60

# Show stats.
curl -sS "http://localhost:7280/api/v1/indexes/hits/describe" | tee stats.json
echo

END=$(date +%s)
echo "Load time: $((END - START))"

# Data size on disk (single-node uses qwdata/ inside the install dir).
echo -n "Data size: "
du -sb "$QW_DIR/qwdata" 2>/dev/null | awk '{print $1}'

# Run queries
chmod +x run.sh
./run.sh

# Stop Quickwit
kill "$QW_PID" 2>/dev/null || true
149 changes: 149 additions & 0 deletions quickwit/index_config.yaml
@@ -0,0 +1,149 @@
version: 0.8

index_id: hits

doc_mapping:
mode: strict
timestamp_field: EventTime
field_mappings:
- {name: WatchID, type: i64, indexed: true, fast: true}
- {name: JavaEnable, type: i64, indexed: true, fast: true}
- {name: Title, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: GoodEvent, type: i64, indexed: true, fast: true}
- name: EventTime
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- name: EventDate
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- {name: CounterID, type: i64, indexed: true, fast: true}
- {name: ClientIP, type: i64, indexed: true, fast: true}
- {name: RegionID, type: i64, indexed: true, fast: true}
- {name: UserID, type: i64, indexed: true, fast: true}
- {name: CounterClass, type: i64, indexed: true, fast: true}
- {name: OS, type: i64, indexed: true, fast: true}
- {name: UserAgent, type: i64, indexed: true, fast: true}
- {name: URL, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: Referer, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: IsRefresh, type: i64, indexed: true, fast: true}
- {name: RefererCategoryID, type: i64, indexed: true, fast: true}
- {name: RefererRegionID, type: i64, indexed: true, fast: true}
- {name: URLCategoryID, type: i64, indexed: true, fast: true}
- {name: URLRegionID, type: i64, indexed: true, fast: true}
- {name: ResolutionWidth, type: i64, indexed: true, fast: true}
- {name: ResolutionHeight, type: i64, indexed: true, fast: true}
- {name: ResolutionDepth, type: i64, indexed: true, fast: true}
- {name: FlashMajor, type: i64, indexed: true, fast: true}
- {name: FlashMinor, type: i64, indexed: true, fast: true}
- {name: FlashMinor2, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: NetMajor, type: i64, indexed: true, fast: true}
- {name: NetMinor, type: i64, indexed: true, fast: true}
- {name: UserAgentMajor, type: i64, indexed: true, fast: true}
- {name: UserAgentMinor, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: CookieEnable, type: i64, indexed: true, fast: true}
- {name: JavascriptEnable, type: i64, indexed: true, fast: true}
- {name: IsMobile, type: i64, indexed: true, fast: true}
- {name: MobilePhone, type: i64, indexed: true, fast: true}
- {name: MobilePhoneModel, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: Params, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: IPNetworkID, type: i64, indexed: true, fast: true}
- {name: TraficSourceID, type: i64, indexed: true, fast: true}
- {name: SearchEngineID, type: i64, indexed: true, fast: true}
- {name: SearchPhrase, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: AdvEngineID, type: i64, indexed: true, fast: true}
- {name: IsArtifical, type: i64, indexed: true, fast: true}
- {name: WindowClientWidth, type: i64, indexed: true, fast: true}
- {name: WindowClientHeight, type: i64, indexed: true, fast: true}
- {name: ClientTimeZone, type: i64, indexed: true, fast: true}
- name: ClientEventTime
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- {name: SilverlightVersion1, type: i64, indexed: true, fast: true}
- {name: SilverlightVersion2, type: i64, indexed: true, fast: true}
- {name: SilverlightVersion3, type: i64, indexed: true, fast: true}
- {name: SilverlightVersion4, type: i64, indexed: true, fast: true}
- {name: PageCharset, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: CodeVersion, type: i64, indexed: true, fast: true}
- {name: IsLink, type: i64, indexed: true, fast: true}
- {name: IsDownload, type: i64, indexed: true, fast: true}
- {name: IsNotBounce, type: i64, indexed: true, fast: true}
- {name: FUniqID, type: i64, indexed: true, fast: true}
- {name: OriginalURL, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: HID, type: i64, indexed: true, fast: true}
- {name: IsOldCounter, type: i64, indexed: true, fast: true}
- {name: IsEvent, type: i64, indexed: true, fast: true}
- {name: IsParameter, type: i64, indexed: true, fast: true}
- {name: DontCountHits, type: i64, indexed: true, fast: true}
- {name: WithHash, type: i64, indexed: true, fast: true}
- {name: HitColor, type: text, tokenizer: raw, fast: {normalizer: raw}}
- name: LocalEventTime
type: datetime
input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
output_format: unix_timestamp_secs
indexed: true
fast: true
fast_precision: seconds
- {name: Age, type: i64, indexed: true, fast: true}
- {name: Sex, type: i64, indexed: true, fast: true}
- {name: Income, type: i64, indexed: true, fast: true}
- {name: Interests, type: i64, indexed: true, fast: true}
- {name: Robotness, type: i64, indexed: true, fast: true}
- {name: RemoteIP, type: i64, indexed: true, fast: true}
- {name: WindowName, type: i64, indexed: true, fast: true}
- {name: OpenerName, type: i64, indexed: true, fast: true}
- {name: HistoryLength, type: i64, indexed: true, fast: true}
- {name: BrowserLanguage, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: BrowserCountry, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: SocialNetwork, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: SocialAction, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: HTTPError, type: i64, indexed: true, fast: true}
- {name: SendTiming, type: i64, indexed: true, fast: true}
- {name: DNSTiming, type: i64, indexed: true, fast: true}
- {name: ConnectTiming, type: i64, indexed: true, fast: true}
- {name: ResponseStartTiming, type: i64, indexed: true, fast: true}
- {name: ResponseEndTiming, type: i64, indexed: true, fast: true}
- {name: FetchTiming, type: i64, indexed: true, fast: true}
- {name: SocialSourceNetworkID, type: i64, indexed: true, fast: true}
- {name: SocialSourcePage, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: ParamPrice, type: i64, indexed: true, fast: true}
- {name: ParamOrderID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: ParamCurrency, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: ParamCurrencyID, type: i64, indexed: true, fast: true}
- {name: OpenstatServiceName, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: OpenstatCampaignID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: OpenstatAdID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: OpenstatSourceID, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMSource, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMMedium, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMCampaign, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMContent, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: UTMTerm, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: FromTag, type: text, tokenizer: raw, fast: {normalizer: raw}}
- {name: HasGCLID, type: i64, indexed: true, fast: true}
- {name: RefererHash, type: i64, indexed: true, fast: true}
- {name: URLHash, type: i64, indexed: true, fast: true}
- {name: CLID, type: i64, indexed: true, fast: true}

store_source: false

indexing_settings:
commit_timeout_secs: 30
merge_policy:
type: stable_log
merge_factor: 10
max_merge_factor: 12

search_settings:
default_search_fields: []
68 changes: 68 additions & 0 deletions quickwit/load.py
@@ -0,0 +1,68 @@
import gzip
import json
from itertools import islice

import requests

# Quickwit's _bulk endpoint accepts at most 10MB per request; keep batches
# small enough to stay under the limit comfortably.
BULK_SIZE = 2000
QW_URL = "http://localhost:7280/api/v1/_elastic/hits/_bulk"
TOTAL_RECORDS = 99997497

# Quickwit only supports the "create" action of the Elasticsearch bulk API.
ACTION_META_BYTES = (json.dumps({"create": {"_index": "hits"}}) + "\n").encode("utf-8")
REQUEST_TIMEOUT = 120


def build_body(docs):
    parts = []
    for doc in docs:
        parts.append(ACTION_META_BYTES)
        data = doc.encode("utf-8") if isinstance(doc, str) else doc
        # Bulk bodies must be newline-terminated NDJSON; the last line of the
        # source file may be missing its trailing newline.
        if not data.endswith(b"\n"):
            data += b"\n"
        parts.append(data)
    return b"".join(parts)


def send_bulk(session, docs, batch_num):
# Quickwit's bulk endpoint requires a Content-Length header, so we have to
# buffer the body rather than streaming it.
resp = session.post(QW_URL, data=build_body(docs), timeout=REQUEST_TIMEOUT)
if resp.status_code >= 300:
print(
f"\nSent batch {batch_num} ({len(docs)} docs) - Warning: HTTP {resp.status_code}: {resp.text[:300]}"
)
return 0

body = resp.json()
if body.get("errors"):
items = body.get("items", [])
err = sum(1 for i in items if "error" in i.get("create", {}))
if err:
print(f"\nBatch {batch_num}: {err} item errors")

return len(docs)


def main():
total_docs = 0
batch_num = 0

with requests.Session() as session:
session.headers.update({"Content-Type": "application/x-ndjson"})

with gzip.open("hits.json.gz", mode="rt", encoding="utf-8") as f:
print("Reading from hits.json.gz")
while True:
docs = list(islice(f, BULK_SIZE))
if not docs:
break
batch_num += 1
total_docs += send_bulk(session, docs, batch_num)
                pct = (total_docs / TOTAL_RECORDS) * 100 if TOTAL_RECORDS else 0
                # Overwrite the progress line in place; the warnings above
                # start with "\n" so they drop to a fresh line first.
                print(f"\r {pct:.2f}% ({total_docs}/{TOTAL_RECORDS})", end="", flush=True)

print(f"\nTotal docs sent: {total_docs}")


if __name__ == "__main__":
main()