# Benchmarks

Performance comparison of Stratum vs DuckDB across standard analytical benchmarks. Primary results on an 8-core Intel Lunar Lake laptop (10M rows), with partial 100M-row results from a 20-core AMD EPYC server.

## Methodology

- **Warmup:** 5 iterations (10M) / 2 iterations (100M) before measurement
- **Measurement:** 10 iterations (10M) / 5 iterations (100M), median reported (a sketch of this loop follows the list)
- **GC:** `System.gc()` + 200ms sleep between benchmarks to prevent GC pause inflation
- **Threading:** 1T = single-threaded, NT = all cores
- **DuckDB:** v1.4.4, JDBC in-process, `SET threads TO 1` for single-threaded comparisons
- **Data:** Real CSV inputs (TPC-H lineitem, ClickBench hits, NYC taxi) and synthetic data (H2O db-benchmark, TPC-DS via DuckDB)
- **Input modes:** Arrays (raw `long[]`/`double[]`) and Indices (PersistentColumnIndex). Tables show the best of both modes.
- **Validation:** All queries cross-checked against DuckDB results
- **JIT warmup:** All Java hot paths pre-exercised with tiny data (1000 rows) before benchmarks to ensure stable JIT compilations
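
A minimal sketch of that measurement loop (illustrative only, not Stratum's actual harness): warmup runs, measured runs with the median reported, then `System.gc()` plus a 200ms sleep so GC pauses don't bleed into the next benchmark.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical benchmark harness sketch; names and structure are illustrative.
final class BenchHarness {
    static double medianMillis(Supplier<?> query, int warmup, int iterations)
            throws InterruptedException {
        for (int i = 0; i < warmup; i++) query.get();          // warmup iterations
        double[] millis = new double[iterations];
        for (int i = 0; i < iterations; i++) {                 // measured iterations
            long start = System.nanoTime();
            query.get();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        System.gc();                                           // settle before the next benchmark
        Thread.sleep(200);
        Arrays.sort(millis);
        return millis[iterations / 2];                         // median
    }
}
```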

Ratio > 1.0x = Stratum faster, < 1.0x = DuckDB faster.

## Results (10M rows)

Hardware: Intel Core Ultra 7 258V (8 cores, Lunar Lake), Linux, JVM 25.0.1, DuckDB v1.4.4.

### Tier 1: TPC-H / SSB

Standard decision-support queries on TPC-H lineitem data (6M rows from CSV).

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| B1 | TPC-H Q6: filter + SUM(price*discount) | 14.5ms | 12.5ms | 29.9ms | 8.3ms | 2.1x |
| B2 | TPC-H Q1: GROUP BY + 7 aggregates | 78.5ms | 31.5ms | 98.0ms | 21.5ms | 1.2x |
| B3 | SSB Q1.1: filter + SUM(price*discount) | 16.2ms | 7.2ms | 33.2ms | 7.4ms | 2.1x |
| B4 | COUNT(*) no filter | 0.1ms | - | 0.4ms | 0.3ms | 4.6x |
| B5 | Filtered COUNT (NEQ predicate) | 4.1ms | 2.1ms | 13.0ms | 7.3ms | 3.2x |
| B6 | Low-cardinality GROUP BY + COUNT | 22.7ms | 15.1ms | 30.5ms | 14.2ms | 1.3x |
| SSB-Q1.2 | Tighter filter + SUM(price*discount) | 21.5ms | 9.9ms | 47.0ms | 20.1ms | 2.2x |

Stratum's fused filter+aggregate execution evaluates predicates and accumulates results in a single SIMD pass, avoiding intermediate arrays.
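
As a rough illustration of what such a fused kernel can look like with the Java Vector API (the class, predicate, and column names here are hypothetical, not Stratum's internals): the predicate produces a lane mask and the masked product is added to the accumulator in the same loop, so no selection vector or filtered copy is ever built.

```java
// Requires --add-modules jdk.incubator.vector on current JDKs.
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch of a fused filter + SUM(price*discount) kernel, in the
// spirit of TPC-H Q6. All arrays are parallel columns of equal length.
public final class FusedFilterSum {
    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    static double sumPriceTimesDiscountWhereQtyLt(double[] price, double[] discount,
                                                  double[] quantity, double qtyLimit) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(price.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector q = DoubleVector.fromArray(SPECIES, quantity, i);
            VectorMask<Double> mask = q.compare(VectorOperators.LT, qtyLimit);
            DoubleVector p = DoubleVector.fromArray(SPECIES, price, i);
            DoubleVector d = DoubleVector.fromArray(SPECIES, discount, i);
            acc = acc.add(p.mul(d), mask);          // accumulate only where the predicate holds
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < price.length; i++) {             // scalar tail
            if (quantity[i] < qtyLimit) sum += price[i] * discount[i];
        }
        return sum;
    }
}
```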

### Tier 2: H2O.ai db-benchmark

Group-by queries from the H2O.ai db-benchmark, testing various group cardinalities and aggregation types.

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q1 | GROUP BY id1 (string, 100 groups), SUM | 30.8ms | 11.2ms | 62.1ms | 18.2ms | 2.0x |
| Q2 | GROUP BY id1,id2 (string, 10K groups), SUM | 41.7ms | 21.6ms | 129.6ms | 61.2ms | 3.1x |
| Q3 | GROUP BY id3 (string, 100K groups), SUM+AVG | 78.6ms | 80.9ms | 392.6ms | 223.8ms | 5.0x |
| Q4 | GROUP BY id4 (int, 100 groups), 3xAVG | 53.3ms | 14.5ms | 51.2ms | 13.1ms | 0.96x |
| Q5 | GROUP BY id6 (int, 100K groups), 3xSUM | 93.7ms | 92.4ms | 160.8ms | 146.3ms | 1.7x |
| Q6 | GROUP BY id4,id5 (10K groups), STDDEV | 33.3ms | 19.1ms | 91.3ms | 56.9ms | 2.7x |
| Q7 | GROUP BY id3 (string, 100K groups), MAX-MIN | 97.2ms | 107.3ms | 385.7ms | 228.0ms | 4.0x |
| Q8 | Top-2 per group (ROW_NUMBER window) | 1906.6ms | 1373.6ms | 1274.4ms | 471.5ms | DuckDB 1.5x |
| Q9 | GROUP BY id2,id4 (10K groups), CORR | 67.5ms | 29.3ms | 142.4ms | 74.2ms | 2.1x |
| Q10 | GROUP BY 6 columns (10M unique groups) | 1070.8ms | 903.5ms | 10269.8ms | 7018.6ms | 9.6x |

**Integer Pipeline Queries** (native `long[]` accumulation, no `longToDouble` conversion):

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| INT-Q1 | GROUP BY id4 (100 groups), SUM(v1)+SUM(v2)+COUNT | 55.6ms | 16.4ms | 41.7ms | 7.6ms | DuckDB 1.3x |
| INT-Q2 | GROUP BY id4 (100 groups), SUM+MIN+MAX+COUNT | 77.2ms | 25.9ms | 43.6ms | 12.8ms | DuckDB 1.8x |
| INT-Q3 | SUM(v1)+SUM(v2)+COUNT (global multi-sum) | 0.5ms | 0.6ms | 17.5ms | 4.0ms | 38.8x |

INT-Q3 demonstrates the all-long SIMD multi-sum path with LongVector accumulators: roughly 39x faster than DuckDB because all type-conversion overhead is avoided. INT-Q1/Q2 exercise the dense group-by long path; at 100 groups the Java compute is only ~20ms, so Clojure orchestration overhead becomes visible.
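
A hedged sketch of an all-long multi-sum kernel of this kind (names are illustrative, not Stratum's code): two LongVector accumulators advance over both columns in one pass, and COUNT falls out of the row count, so nothing is ever widened to double.

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical SUM(v1)+SUM(v2)+COUNT kernel over native long[] columns.
public final class LongMultiSum {
    private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

    record Result(long sumV1, long sumV2, long count) {}

    static Result multiSum(long[] v1, long[] v2) {
        LongVector acc1 = LongVector.zero(SPECIES);
        LongVector acc2 = LongVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(v1.length);
        for (; i < upper; i += SPECIES.length()) {
            acc1 = acc1.add(LongVector.fromArray(SPECIES, v1, i));
            acc2 = acc2.add(LongVector.fromArray(SPECIES, v2, i));
        }
        long s1 = acc1.reduceLanes(VectorOperators.ADD);
        long s2 = acc2.reduceLanes(VectorOperators.ADD);
        for (; i < v1.length; i++) { s1 += v1[i]; s2 += v2[i]; }  // scalar tail
        return new Result(s1, s2, v1.length);                      // COUNT is the row count
    }
}
```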

**H2O Join Queries:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| J1 | INNER JOIN small (100 rows), 2xSUM | 34.2ms | 9.2ms | 32.5ms | 6.0ms | DuckDB 1.1x |
| J2 | INNER JOIN medium (10K), 2-column key, SUM | 79.2ms | 39.5ms | 75.2ms | 62.8ms | DuckDB 1.1x |
| J3 | LEFT JOIN medium (10K), 2-column key, SUM | 82.3ms | 39.8ms | 82.0ms | 59.2ms | 1.0x |

**Bitmap Semi-Join Queries** (existence-only joins via `BitSet` probe):

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| SEMI-Q1 | SUM(v1) WHERE EXISTS customer(nation=5) | 93.1ms | 45.3ms | 272.3ms | 124.7ms | 2.9x |
| SEMI-Q2 | SUM(v1) GROUP BY id4 WHERE EXISTS | 66.2ms | 62.0ms | 293.5ms | 128.9ms | 4.4x |
| SEMI-Q3 | COUNT(*) WHERE EXISTS customer(nation=5) | 69.0ms | 42.8ms | 229.4ms | 122.0ms | 3.3x |

The bitmap semi-join fires automatically when a join only tests existence (no dimension columns appear in the output). It replaces the hash join with a `BitSet` probe -- no hash table build, no row materialization.
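
A minimal sketch of the idea, assuming dense integer join keys so a key maps directly to a bit position (the function and column names are hypothetical, not Stratum's API):

```java
import java.util.BitSet;

// Hypothetical bitmap semi-join: SUM(v1) over fact rows whose key exists in a
// filtered dimension table. No hash table build, no row materialization.
public final class BitmapSemiJoin {
    static double sumWhereExists(long[] factKey, double[] factV1,
                                 long[] dimKey, int[] dimNation, int wantedNation,
                                 int maxKey) {
        // Build phase: one bit per qualifying dimension key.
        BitSet exists = new BitSet(maxKey + 1);
        for (int i = 0; i < dimKey.length; i++) {
            if (dimNation[i] == wantedNation) exists.set((int) dimKey[i]);
        }
        // Probe phase: a single pass over the fact table testing the bit.
        double sum = 0.0;
        for (int i = 0; i < factKey.length; i++) {
            if (exists.get((int) factKey[i])) sum += factV1[i];
        }
        return sum;
    }
}
```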

### Tier 3: ClickBench

Queries from the ClickBench web analytics benchmark (6M rows from CSV). Organized by query type.

**Metadata / stats-only queries** (answered from pre-computed per-chunk statistics without scanning data):

| Query | Description | Stratum 1T | DuckDB 1T | 1T Ratio |
|---|---|---|---|---|
| Q2 | SUM + COUNT + AVG (3 aggregates) | 0.2ms | 9.9ms | ~48x |
| Q3 | AVG(UserID) | 0.1ms | 8.7ms | ~76x |
| Q6 | MIN + MAX(EventTime) | 0.1ms | 7.7ms | ~76x |
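
A minimal sketch of how such queries can be answered from chunk metadata alone (the ChunkStats layout is an assumption, not Stratum's actual metadata format):

```java
// Hypothetical stats-only aggregation: SUM/COUNT/AVG are answered by folding
// pre-computed per-chunk statistics, without reading any column data.
final class StatsOnlyAggregate {
    record ChunkStats(long rowCount, double sum, double min, double max) {}

    /** Returns {sum, count, avg} computed purely from chunk metadata. */
    static double[] sumCountAvg(ChunkStats[] chunks) {
        long count = 0;
        double sum = 0.0;
        for (ChunkStats c : chunks) {
            count += c.rowCount();
            sum += c.sum();
        }
        double avg = count == 0 ? Double.NaN : sum / count;
        return new double[] { sum, count, avg };
    }
}
```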

**Filter + aggregate:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q1 | COUNT WHERE AdvEngineID != 0 | 24.4ms | 13.1ms | 5.3ms | 1.1ms | DuckDB 4.6x |
| Q7 | COUNT WHERE 2 predicates | 51.3ms | 14.2ms | 11.4ms | 2.2ms | DuckDB 4.5x |

**Group-by:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| GRP-SE | GROUP BY SearchEngineID (65 groups) | 15.6ms | 7.9ms | 20.1ms | 3.8ms | 1.3x |
| GRP-REG | GROUP BY RegionID (3K groups), SUM+COUNT | 34.6ms | 24.9ms | 40.3ms | 23.9ms | 1.2x |
| Q15 | GROUP BY UserID (1.1M groups), COUNT | 127.3ms | 60.9ms | 88.4ms | 33.4ms | DuckDB 1.4x |
| Q19+ | GROUP BY EXTRACT(minute), COUNT | 12.0ms | 1.9ms | 1043.7ms | 190.6ms | 87x |
| Q43 | GROUP BY DATE_TRUNC(minute), COUNT | 73.3ms | 38.0ms | 59.2ms | 25.0ms | DuckDB 1.2x |
| Q12 | GROUP BY SearchPhrase (string), COUNT | 314.8ms | 281.1ms | 114.6ms | 26.8ms | DuckDB 2.7x |
| Q33 | GROUP BY URL (1.9M string groups), COUNT | 134.1ms | 78.3ms | 389.0ms | 114.5ms | 2.9x |

**COUNT DISTINCT:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q5 | COUNT(DISTINCT UserID) -- 1.1M distinct | 123.2ms | 142.6ms | 67.4ms | 27.0ms | DuckDB 1.8x |
| CD-GRP | COUNT(DISTINCT AdvEngineID) GROUP BY RegionID | 33.3ms | 16.9ms | 37.3ms | 23.4ms | 1.1x |
| Q8 | GROUP BY RegionID, COUNT(DISTINCT UserID), TOP 10 | 81.8ms | 93.7ms | 84.6ms | 33.3ms | 1.0x |
| Q9 | GROUP BY RegionID, SUM+COUNT+AVG+COUNT(DISTINCT) | 121.1ms | 126.5ms | 125.8ms | 58.0ms | 1.0x |

**LIKE pattern matching:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| LIKE1 | COUNT WHERE URL LIKE '%example.com/page%' | 26.4ms | 23.9ms | 280.6ms | 49.0ms | 10.6x |
| LIKE2 | COUNT WHERE URL LIKE '%search%' | 55.7ms | 27.0ms | 268.7ms | 45.2ms | 4.8x |
| LIKE3 | GROUP BY SearchEngineID WHERE URL LIKE '%shop%' | 31.5ms | 25.2ms | 271.6ms | 47.7ms | 8.6x |
| Q20 | COUNT WHERE URL LIKE '%google%' | 27.7ms | 34.7ms | 205.9ms | 33.8ms | 7.4x |

**String functions:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q27 | GROUP BY CounterID, AVG(LENGTH(URL)) HAVING COUNT > 100K | 45.8ms | 26.3ms | 200.9ms | 36.2ms | 4.4x |
| Q28 | AVG(LENGTH(URL)) | 42.0ms | 24.6ms | 197.2ms | 32.0ms | 4.7x |

+CB-Q19: DuckDB v1.4.4 regression -- EXTRACT uses full scan instead of direct aggregation. Fixed in DuckDB v1.5.0.

### Tier 4: NYC Taxi

Real-world trip data (~5.8M rows from CSV).

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q1 | AVG(fare) GROUP BY payment_type | 15.5ms | 4.4ms | 21.9ms | 5.4ms | 1.4x |
| Q2 | AVG(tip) GROUP BY passenger_count | 45.3ms | 31.9ms | 24.1ms | 4.4ms | DuckDB 1.9x |
| Q3 | COUNT GROUP BY hour, day-of-week | 16.7ms | 10.9ms | 17.6ms | 3.3ms | 1.1x |
| Q4 | SUM(total) WHERE fare > 10 GROUP BY month | 41.2ms | 8.5ms | 30.9ms | 6.5ms | DuckDB 1.3x |

### Tier 5: Hash Join

Fact table (10M rows) joined to dimension table (1K rows), followed by GROUP BY + SUM.

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| JOIN-Q1 | Fact JOIN Dim, GROUP BY category, SUM | 25.6ms | 5.3ms | 42.7ms | 7.6ms | 1.7x |
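
A hedged sketch of a join with fused aggregation (names and types are illustrative): the dimension table is turned into a small key-to-category map, and the probe over the fact table accumulates SUM per category directly, so no joined rows are ever materialized.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical fact-to-dimension join with fused GROUP BY + SUM.
final class JoinGroupBySum {
    static double[] sumByCategory(long[] factKey, double[] factValue,
                                  long[] dimKey, int[] dimCategory, int numCategories) {
        // Build: dimension key -> category index (dimension table is small).
        Map<Long, Integer> keyToCategory = new HashMap<>(dimKey.length * 2);
        for (int i = 0; i < dimKey.length; i++) {
            keyToCategory.put(dimKey[i], dimCategory[i]);
        }
        // Probe + accumulate in one pass over the fact table.
        double[] sums = new double[numCategories];
        for (int i = 0; i < factKey.length; i++) {
            Integer cat = keyToCategory.get(factKey[i]);
            if (cat != null) sums[cat] += factValue[i];   // inner-join semantics
        }
        return sums;
    }
}
```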

### Tier 6: Statistical Aggregates

Exact median/percentile (QuickSelect) and approximate quantiles (t-digest) on TPC-H price column (6M rows).

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| STAT-Q1 | MEDIAN(price) | 67.8ms | 67.8ms | 157.6ms | 134.0ms | 2.3x |
| STAT-Q2 | GROUP BY returnflag, MEDIAN(price) | 96.1ms | 96.8ms | 157.6ms | 131.8ms | 1.6x |
| STAT-Q3 | PERCENTILE_CONT(0.95, price) | 52.7ms | 37.4ms | 128.2ms | 114.2ms | 2.4x |
| STAT-Q4 | APPROX_QUANTILE(price, 0.95) | 243.3ms | 41.4ms | 278.1ms | 43.9ms | 1.1x |
| STAT-Q5 | P25, P50, P75 of price | 190.9ms | 191.1ms | 428.5ms | 408.8ms | 2.2x |

### Tier 7: Window Functions

Window operations on TPC-H lineitem (6M rows).

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| WIN-Q1 | ROW_NUMBER() OVER (PARTITION BY orderkey) | 316.3ms | 191.1ms | 425.6ms | 118.4ms | 1.3x |
| WIN-Q2 | LAG(price, 1) OVER (PARTITION BY orderkey) | 352.2ms | 215.1ms | 501.7ms | 137.0ms | 1.4x |
| WIN-Q3 | Running SUM(price) OVER (PARTITION BY orderkey) | 387.2ms | 263.8ms | 823.4ms | 252.5ms | 2.1x |

Window functions benefit from multi-threading when partition sizes exceed 8 rows. Single-threaded performance is 1.3-2.1x faster than DuckDB. DuckDB's parallel window implementation achieves stronger NT scaling (3-4x vs Stratum's 1.5-1.7x).
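
A simplified sketch of partition-parallel window evaluation (assuming rows are already clustered by partition, with `partitionStart` holding each partition's offset plus a final sentinel; this is illustrative, not Stratum's implementation):

```java
import java.util.stream.IntStream;

// Hypothetical running SUM(value) OVER (PARTITION BY key): partitions are
// independent, so each can be scanned by a different thread.
final class RunningSumWindow {
    static double[] runningSum(double[] value, int[] partitionStart) {
        double[] out = new double[value.length];
        IntStream.range(0, partitionStart.length - 1).parallel().forEach(p -> {
            double acc = 0.0;
            for (int i = partitionStart[p]; i < partitionStart[p + 1]; i++) {
                acc += value[i];        // prefix sum within the partition
                out[i] = acc;
            }
        });
        return out;
    }
}
```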

### Tier 8: TPC-DS (sf=1, ~2.9M rows)

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| DS-Q1 | GROUP BY store, SUM + COUNT | 22.7ms | 11.1ms | 12.6ms | 2.9ms | DuckDB 1.8x |
| DS-Q98 | ROW_NUMBER OVER (PARTITION BY store) | 299.2ms | 98.8ms | 376.1ms | 89.2ms | 1.3x |

### 10M Summary

Stratum wins 39 of 52 queries, DuckDB wins 13 (single-threaded comparison, queries > 0.1ms). Last verified 2026-04-04 with query planner enabled (no regressions).

## Isolation Forest

| Operation | 100K rows | 1M rows |
|---|---|---|
| Train (100 trees x 256) | 27ms | 6ms |
| Score (parallel) | 155ms | 419ms |
| Score (1-thread) | 152ms | 1555ms |

## Results (100M rows)

Hardware: AMD EPYC 7313 16-Core Processor (20 cores), Linux, JVM 25.0.1 (OpenJDK), DuckDB v1.4.4. Index mode only.

### Tier 1: TPC-H / SSB (100M rows)

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| B1 | TPC-H Q6: filter + SUM(price*discount) | 362.3ms | 65.5ms | 681.8ms | 57.6ms | 1.9x |
| B2 | TPC-H Q1: GROUP BY + 7 aggregates | 2525.7ms | 218.7ms | 2547.0ms | 197.9ms | 1.0x |
| B3 | SSB Q1.1: filter + SUM(price*discount) | 349.7ms | 57.1ms | 771.4ms | 64.6ms | 2.2x |
| B4 | COUNT(*) no filter | 0.3ms | - | 8.5ms | 1.4ms | 27x |
| B5 | Filtered COUNT (NEQ predicate) | 49.5ms | 15.6ms | 423.9ms | 37.3ms | 8.6x |
| B6 | Low-cardinality GROUP BY + COUNT | 957.6ms | 92.4ms | 781.0ms | 59.2ms | DuckDB 1.2x |
| SSB-Q1.2 | Tighter filter + SUM(price*discount) | 366.1ms | 50.6ms | 501.2ms | 43.7ms | 1.4x |

### Tier 2: H2O.ai Group-By (100M rows)

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q1 | GROUP BY id1 (string, 100 groups) | 398.8ms | 44.4ms | 1636.1ms | 146.1ms | 4.1x |
| Q2 | GROUP BY id1,id2 (string, 10K groups) | 482.3ms | 64.3ms | 3286.9ms | 296.7ms | 6.8x |
| Q3 | GROUP BY id3 (string, 1M groups) | 7356.0ms | 1997.7ms | 7821.1ms | 1300.1ms | 1.1x |
| Q4 | GROUP BY id4 (int, 100 groups), 3xAVG | 1092.0ms | 79.4ms | 705.8ms | 62.1ms | DuckDB 1.5x |
| Q5 | GROUP BY id6 (int, 1M groups), 3xSUM | 9102.3ms | 2324.4ms | 3623.5ms | 758.1ms | DuckDB 2.5x |
| Q7 | GROUP BY id3 (string, 1M groups), MAX-MIN | 9194.8ms | 2278.0ms | 7991.2ms | 1218.4ms | DuckDB 1.2x |
| Q10 | GROUP BY 6 columns (100M unique groups) | 137.9s | 135.3s | 175.4s | 144.1s | 1.3x |

**H2O Join Queries (100M rows):**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| J1 | INNER JOIN small (100 rows), 2xSUM | 527.6ms | 52.7ms | 414.0ms | 34.2ms | DuckDB 1.3x |
| J2 | INNER JOIN medium (10K), 2-column key, SUM | 5979.7ms | 668.3ms | 6645.9ms | 2243.4ms | 1.1x |
| J3 | LEFT JOIN medium (10K), 2-column key, SUM | 6361.8ms | 664.1ms | 6893.5ms | 2253.5ms | 1.1x |

At 100M rows, multi-column joins (J2, J3) show Stratum's NT advantage growing to 3.4x over DuckDB NT.

### Tier 3: ClickBench (100M rows)

~100M rows from the full ClickBench hits dataset.

**Metadata / stats-only queries** (answered from pre-computed per-chunk statistics without scanning data):

| Query | Description | Stratum 1T | DuckDB 1T | 1T Ratio |
|---|---|---|---|---|
| Q2 | SUM + COUNT + AVG (3 aggregates) | 1.9ms | 193.6ms | 102x |
| Q3 | AVG(UserID) | 0.4ms | 159.8ms | 400x |
| Q6 | MIN + MAX(EventTime) | 0.7ms | 200.5ms | 286x |

**Filter + aggregate:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q1 | COUNT WHERE AdvEngineID != 0 | 142.5ms | 59.8ms | 66.8ms | 8.1ms | DuckDB 2.1x |
| Q7 | COUNT WHERE 2 predicates | 879.8ms | 341.8ms | 160.6ms | 22.9ms | DuckDB 5.5x |

**Group-by:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| GRP-SE | GROUP BY SearchEngineID (96 groups) | 382.6ms | 33.9ms | 417.9ms | 28.8ms | 1.1x |
| GRP-REG | GROUP BY RegionID (9K groups), SUM+COUNT | 669.8ms | 191.2ms | 748.9ms | 99.0ms | 1.1x |
| Q15 | GROUP BY UserID (17.6M groups), COUNT | 4191.6ms | 1163.8ms | 3275.6ms | 327.4ms | DuckDB 1.3x |
| Q19+ | GROUP BY EXTRACT(minute), COUNT | 301.1ms | 138.0ms | 25144.8ms | 1804.6ms | regression+ |
| Q43 | GROUP BY DATE_TRUNC(minute), COUNT | 3511.0ms | 944.9ms | 1112.1ms | 153.2ms | DuckDB 3.2x |
| Q12 | GROUP BY SearchPhrase (string), COUNT | 7762.1ms | 7497.4ms | 3947.2ms | 241.1ms | DuckDB 2.0x |
| Q33 | GROUP BY URL (18.3M string groups), COUNT | 6551.8ms | 2460.0ms | 11608.9ms | 907.5ms | 1.8x |

**COUNT DISTINCT:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q5 | COUNT(DISTINCT UserID) -- 17.6M distinct | 3663.9ms | 4059.9ms | 2961.3ms | 198.5ms | DuckDB 1.2x |
| CD-GRP | COUNT(DISTINCT AdvEngineID) GROUP BY RegionID | 1204.8ms | 382.4ms | 766.3ms | 105.5ms | DuckDB 1.6x |
| Q8 | GROUP BY RegionID, COUNT(DISTINCT UserID), TOP 10 | 6963.6ms | 1974.7ms | 5969.9ms | 314.4ms | DuckDB 1.2x |
| Q9 | GROUP BY RegionID, SUM+COUNT+AVG+COUNT(DISTINCT) | 9114.1ms | 3811.1ms | 6962.1ms | 435.1ms | DuckDB 1.3x |

**String functions:**

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q28 | AVG(LENGTH(URL)) | 1151.2ms | 580.0ms | 4435.5ms | 325.3ms | 3.9x |

+CB-Q19: DuckDB v1.4.4 regression -- EXTRACT uses full scan instead of direct aggregation. Fixed in DuckDB v1.5.0.

### Tier 4: NYC Taxi (100M run, ~39.7M rows)

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| Q1 | AVG(fare) GROUP BY payment_type | 149.7ms | 17.5ms | 248.2ms | 17.5ms | 1.7x |
| Q2 | AVG(tip) GROUP BY passenger_count | 784.1ms | 548.1ms | 257.9ms | 19.7ms | DuckDB 3.0x |
| Q3 | COUNT GROUP BY hour, day-of-week | 183.9ms | 22.7ms | 183.8ms | 15.3ms | 1.0x |
| Q4 | SUM(total) WHERE fare > 10 GROUP BY month | 537.9ms | 49.2ms | 400.1ms | 32.1ms | DuckDB 1.3x |

### Tier 5: Hash Join (100M rows)

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| JOIN-Q1 | Fact JOIN Dim, GROUP BY category, SUM | 833.6ms | 54.8ms | 576.0ms | 48.5ms | DuckDB 1.4x |

### Tier 6: Statistical Aggregates (100M rows)

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| STAT-Q1 | MEDIAN(price) | 1486.0ms | 1650.4ms | 4315.5ms | 4198.5ms | 2.9x |
| STAT-Q2 | GROUP BY returnflag, MEDIAN(price) | 1844.1ms | 1830.5ms | 5091.3ms | 3617.1ms | 2.8x |
| STAT-Q3 | PERCENTILE_CONT(0.95, price) | 1300.6ms | 1474.6ms | 3874.9ms | 2684.0ms | 3.0x |
| STAT-Q4 | APPROX_QUANTILE(price, 0.95) | 6139.9ms | 657.9ms | 6182.4ms | 378.9ms | 1.0x |
| STAT-Q5 | P25, P50, P75 of price | 4189.6ms | 4149.4ms | 12291.1ms | 11955.1ms | 2.9x |

Exact median and percentile operations scale well -- Stratum maintains a ~3x advantage at 100M rows using O(N) QuickSelect.
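
For reference, a textbook QuickSelect sketch of the kind of O(N) average-case selection involved (this is not Stratum's code; PERCENTILE_CONT would additionally interpolate between the two neighbouring order statistics):

```java
import java.util.concurrent.ThreadLocalRandom;

// QuickSelect: k-th smallest element (0-based) in average O(N), no full sort.
// Partially reorders the array in place, so callers work on a scratch copy.
final class QuickSelect {
    static double select(double[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (k == p) return a[k];
            else if (k < p) hi = p - 1;
            else lo = p + 1;
        }
        return a[lo];
    }

    private static int partition(double[] a, int lo, int hi) {
        int pivotIdx = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, pivotIdx, hi);
        double pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) swap(a, i, store++);
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```

MEDIAN over n values is then `select(copy, n / 2)` on a scratch copy of the column.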

### Tier 7: TPC-DS (sf=1, ~2.9M rows, 100M run)

| Query | Description | Stratum 1T | Stratum NT | DuckDB 1T | DuckDB NT | 1T Ratio |
|---|---|---|---|---|---|---|
| DS-Q1 | GROUP BY store, SUM + COUNT | 36.5ms | 14.2ms | 19.5ms | 4.0ms | DuckDB 1.9x |

### 100M Summary

Stratum wins 15 of 34 queries, DuckDB wins 19 (single-threaded comparison, queries > 0.1ms).

Window functions, LIKE pattern matching, VARIANCE/CORR group-by, and string function group-by benchmarks are omitted from the 100M results -- these code paths were optimized after this hardware was last available and will be added when re-run.

DuckDB's advantage at 100M is concentrated in high-cardinality hash group-by (1M-100M unique groups) where hash tables become DRAM-bound. Stratum wins on filter+aggregate, statistical aggregates, and moderate-cardinality group-by.


## Key Performance Characteristics

- **Query planner:** Multi-pass optimizer builds physical plan trees with predicate pushdown, operator fusion, cost-based strategy selection (zone-map + sample selectivity estimation), and bitmap semi-join rewriting. Planning overhead is ~0.1% of query time (~5μs).
- **Fused execution:** Predicate evaluation and aggregation run in a single SIMD pass, avoiding intermediate array allocation
- **SIMD vectorization:** Java Vector API (DoubleVector/LongVector) processes 4 elements per cycle for all filter, aggregate, and group-by operations
- **Native `long[]` pipeline:** Integer columns stay as `long[]` through expressions, aggregation, and results -- no `longToDouble` conversion overhead (INT-Q3: 39x vs DuckDB)
- **Dense group-by:** Direct array-indexed accumulation for low/moderate-cardinality groups -- no hash function overhead
- **Bitmap semi-join:** Existence-only joins use a `BitSet` probe instead of hash tables (3-4x vs DuckDB on filtered joins)
- **Zone map pruning:** Per-chunk min/max statistics allow skipping entire chunks that cannot match predicates (see the sketch after this list)
- **Stats-only aggregation:** SUM/COUNT/MIN/MAX/AVG answered directly from chunk statistics without touching data (CB-Q2/Q6: 0.1-0.2ms at 10M)
- **LIKE fast-path:** Dictionary pre-filtering with bitmask pruning + contains/startsWith/endsWith (5-11x vs DuckDB on string pattern matching)
- **QuickSelect for median/percentile:** O(N) average-case exact quantile computation (2-3x vs DuckDB)
- **Hash join with fused aggregation:** Single-pass probe + accumulate eliminates intermediate materialization
- **JIT warmup on startup:** Pre-exercises all Java code paths to produce stable polymorphic JIT compilations before real queries
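
A minimal sketch of zone-map pruning combined with the stats-only path, under the assumption that each chunk stores min/max and a pre-computed sum (the Chunk layout is illustrative, not Stratum's storage format):

```java
// Hypothetical zone-map pruning for SUM(value) WHERE value > threshold:
// chunks that cannot match are skipped, fully-matching chunks are answered
// from their pre-computed sum, and only partially-matching chunks are scanned.
final class ZoneMapScan {
    record Chunk(double min, double max, double sum, double[] values) {}

    static double sumWhereGreaterThan(Chunk[] chunks, double threshold) {
        double total = 0.0;
        for (Chunk c : chunks) {
            if (c.max() <= threshold) continue;                        // no row can match: skip
            if (c.min() > threshold) { total += c.sum(); continue; }   // every row matches: stats only
            for (double v : c.values()) {                              // partial overlap: scan
                if (v > threshold) total += v;
            }
        }
        return total;
    }
}
```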

## Where DuckDB Wins

- **High-cardinality group-by at scale:** At 100M rows with 1M+ unique groups, DuckDB's parallel hash aggregate scales better
- **COUNT(DISTINCT) on high cardinality:** DuckDB's HyperLogLog is faster for 1M+ distinct values
- **Sparse-selectivity filters (CB-Q1, Q7):** When few rows match, DuckDB's column compression and zone maps can skip more data
- **Multi-threaded scaling:** DuckDB's NT advantage is generally larger -- better thread utilization at high parallelism, particularly for window functions and high-cardinality group-by
- **Dense group-by Clojure overhead:** For small group counts (100 groups), Clojure orchestration overhead (~20ms) is visible when Java compute is fast (INT-Q1/Q2)

## Known Multi-Threading Limitations

In some scenarios, Stratum's multi-threaded execution does not improve over single-threaded:

- **High-cardinality group-by (H2O-Q3/Q5/Q7, 60K-100K groups):** NT ~ 1T -- no speedup. Accumulator memory exceeds L2 cache per thread, causing DRAM bandwidth saturation. L3-adaptive thread capping mitigates the worst cases but doesn't eliminate the fundamental issue.
- **Mixed aggregation with COUNT DISTINCT (CB-Q8/Q9):** NT is slightly slower than 1T due to per-group hash table overhead scaling with thread count.
- **Statistical aggregates (STAT-Q1/Q2/Q5):** MEDIAN and multi-percentile use sequential QuickSelect, preventing parallelism.

Root causes and planned improvements:

  1. High-cardinality group-by needs pre-aggregation or partition-aware scheduling to keep working sets L2-resident
  2. COUNT DISTINCT could benefit from radix-partitioned per-group hash tables to reduce sequential merge cost
  3. String-heavy paths need better parallelization of dictionary encoding and hash table probing

## Reproducing

```bash
# All tiers, 10M rows (default)
clj -M:olap

# Custom scale
clj -M:olap 1000000
clj -M:olap 100000000

# Specific tiers
clj -M:olap t1          # TPC-H/SSB only
clj -M:olap h2o         # H2O.ai only
clj -M:olap cb          # ClickBench only
clj -M:olap taxi        # NYC Taxi only
clj -M:olap join        # Hash Join only
clj -M:olap stat        # Statistical only

# Isolation forest
clj -M:iforest

# With indices
clj -M:olap idx
```

## Insert Benchmark

Throughput for appending to and maintaining a sorted PersistentColumnIndex. Run with `clj -M:insert-bench`.

### Append-only (1M elements)

Append to the end of the index in batches. This is the common case for time-series data.

| Batch size | Time | Throughput |
|---|---|---|
| 1 | 8308ms | 120K ops/sec |
| 10 | 1622ms | 617K ops/sec |
| 100 | 967ms | 1.03M ops/sec |
| 1000 | 900ms | 1.11M ops/sec |

### Sorted insert (100K elements)

Insert into a sorted position. A binary search via `idx-get-long` finds the target position, then the modified chunk is rewritten copy-on-write.
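
A minimal sketch of that insert path on a single chunk (illustrative only; the real index is chunked and persistent, and `idx-get-long` is its own lookup API):

```java
import java.util.Arrays;

// Hypothetical sorted insert into one chunk: binary search locates the target
// position, then only the affected chunk is rewritten copy-on-write, leaving
// the original chunk shared with the previous version of the index.
final class SortedInsert {
    static long[] insertSorted(long[] chunk, long value) {
        int pos = Arrays.binarySearch(chunk, value);
        if (pos < 0) pos = -(pos + 1);             // insertion point for a missing key
        long[] copy = new long[chunk.length + 1];  // copy-on-write: old chunk is untouched
        System.arraycopy(chunk, 0, copy, 0, pos);
        copy[pos] = value;
        System.arraycopy(chunk, pos, copy, pos + 1, chunk.length - pos);
        return copy;
    }
}
```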

| Batch size | Time | Throughput |
|---|---|---|
| 1 | 1066ms | 94K ops/sec |
| 10 | 996ms | 100K ops/sec |
| 100 | 1065ms | 94K ops/sec |
| 1000 | 1060ms | 94K ops/sec |

### Mixed 80/20 (100K elements)

80% append, 20% sorted insert -- similar to a write-ahead index with mostly new entries and occasional backdated facts.

| Batch size | Time | Throughput |
|---|---|---|
| 1 | 1151ms | 87K ops/sec |
| 10 | 353ms | 283K ops/sec |
| 100 | 312ms | 321K ops/sec |
| 1000 | 318ms | 315K ops/sec |

## Related Documentation