Add spark-gluten-clickhouse entry (Spark + Gluten with the CH backend) #861
Open
alexey-milovidov wants to merge 1 commit into main
Adds a spark-gluten-clickhouse/ entry that runs the ClickBench query
suite against Apache Spark with Apache Gluten configured to use the
ClickHouse backend ('ch'), in which Gluten loads libch.so (a fork of
ClickHouse v23.1) into the Spark executor JVM and runs the columnar
plan natively through it.
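As a rough illustration of what selecting the 'ch' backend involves, the sketch below assembles Spark settings and renders them as spark-submit arguments. Only `spark.gluten.sql.columnar.backend.lib=ch` is taken from this PR's description. The other keys are assumptions based on Gluten's documentation, and the library path and off-heap size are hypothetical placeholders, so check them against the entry's actual scripts.

```python
# Sketch: Spark settings for Gluten's ClickHouse backend, rendered as
# spark-submit arguments. The path and memory size are hypothetical.

LIBCH_PATH = "/opt/gluten/libch.so"  # hypothetical location of the built library


def gluten_ch_conf(libch_path: str, offheap_size: str = "16g") -> dict[str, str]:
    """Return Spark conf entries that select the 'ch' backend."""
    return {
        # Assumed plugin class name (recent Apache Gluten releases).
        "spark.plugins": "org.apache.gluten.GlutenPlugin",
        # Taken from the PR description: selects the ClickHouse backend.
        "spark.gluten.sql.columnar.backend.lib": "ch",
        # Assumed key pointing Gluten at the native library to load.
        "spark.gluten.sql.columnar.libpath": libch_path,
        # Native engines allocate off-heap, so off-heap memory is enabled.
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": offheap_size,
    }


def to_submit_args(conf: dict[str, str]) -> list[str]:
    """Flatten a conf dict into ['--conf', 'k=v', ...] for spark-submit."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(gluten_ch_conf(LIBCH_PATH))))
```

Keeping the settings in one dict makes it easy to diff this entry's configuration against the Velox-based one.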
Compared with spark-gluten/ (which uses the Velox backend), this
exercises a meaningfully different execution path: Catalyst -> Substrait
-> ClickHouse engine, rather than Catalyst -> Substrait -> Velox.
No pre-built bundle is published for the CH backend (the Apache Gluten
release tarball ships only the Velox bundle), so benchmark.sh builds
both libch.so and the Gluten Spark plugin from source. The build is
memory-hungry; a 64 GB host (c6a.8xlarge or larger) is recommended.
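Since the build can run out of memory on smaller hosts, a pre-flight check before starting it may save time. The following is a minimal sketch, not part of benchmark.sh. It assumes a Linux host with /proc/meminfo, and the 60 GiB default threshold is an assumption chosen because the kernel reports slightly less than the nominal 64 GB.

```python
from pathlib import Path


def mem_total_gib(meminfo_text: str) -> float:
    """Extract MemTotal from /proc/meminfo content and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is reported in kB
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found in meminfo text")


def enough_memory(meminfo_text: str, min_gib: float = 60.0) -> bool:
    """True if the host reports at least min_gib of RAM."""
    return mem_total_gib(meminfo_text) >= min_gib


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        text = meminfo.read_text()
        status = "ok" if enough_memory(text) else "likely too small for the build"
        print(f"{mem_total_gib(text):.1f} GiB RAM: {status}")
```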
Queries use ClickHouse-style regex backreferences (\1) since the regex
evaluation runs inside libch.so, as anticipated in the spark-gluten/
README.
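To make the difference concrete: Spark's own regex engine writes a group reference in the replacement string as `$1`, while ClickHouse writes it as `\1`. The sketch below shows a hypothetical helper that converts single-digit `$N` references into the `\N` form. It is an illustration only, not code from this PR. Python's `re` module happens to use the `\1` convention too, so it is used here to show the effect on a made-up URL pattern.

```python
import re

# Matches $N (single digit) unless the dollar sign is escaped with a backslash.
_JAVA_GROUP_REF = re.compile(r"(?<!\\)\$(\d)")


def to_clickhouse_replacement(java_replacement: str) -> str:
    """Rewrite Java-style $N group references as \\N (hypothetical helper)."""
    # In the replacement template, '\\\\' emits one backslash and '\\1' emits
    # the captured digit, so "$1" becomes "\1".
    return _JAVA_GROUP_REF.sub(r"\\\1", java_replacement)


if __name__ == "__main__":
    # Hypothetical pattern that extracts the host from a URL.
    pattern = r"^https?://(?:www\.)?([^/]+)/.*$"
    print(to_clickhouse_replacement("$1"))
    print(re.sub(pattern, r"\1", "https://www.example.com/path"))
```

The helper deliberately handles only `$0` to `$9`. Named groups and multi-digit references are out of scope for this sketch.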
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds a spark-gluten-clickhouse/ entry that runs Apache Spark with Apache Gluten configured to use the ClickHouse backend (spark.gluten.sql.columnar.backend.lib=ch). Gluten loads libch.so (a fork of ClickHouse v23.1) into the Spark executor JVM and runs the columnar physical plan natively through it.
- Compared with spark-gluten/ (Velox backend) and the proposed spark-velox/ (Add spark-velox entry (Spark + Velox via Apache Gluten) #858), this entry exercises a meaningfully different execution path: Catalyst → Substrait → ClickHouse engine, rather than Catalyst → Substrait → Velox.
Build
No pre-built bundle is published for the CH backend (Apache Gluten v1.4.0 ships only the Velox bundle, and Maven Central has no CH-backend artifact).
benchmark.sh therefore builds two things from source:
- libch.so, built from Kyligence/ClickHouse at the branch pinned in gluten/cpp-ch/clickhouse.version (currently rebase_ch/20250326). Uses Clang 18 / cmake / ninja.
- The Gluten Spark plugin, built with -Pbackends-clickhouse,spark-3.5,scala-2.12 under JDK 8.
Limitations
- The libch.so compile is essentially a ClickHouse build and is RAM-hungry; Gluten's docs recommend ≥64 GB. On c6a.4xlarge (32 GB) it may OOM, so c6a.8xlarge or larger is recommended for a clean run, hence the default machine label in benchmark.sh.
Notes
- Queries use ClickHouse-style regex backreferences (\1) rather than Spark's $1, because regex evaluation runs inside libch.so. This was anticipated in the existing spark-gluten/README.md and Gluten issue #7545.
Test plan
- benchmark.sh clones gluten + Kyligence/ClickHouse, builds libch.so and the Spark plugin, runs all 43 queries, and writes results/<machine>.json.
🤖 Generated with Claude Code
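One way to sanity-check the output of the test plan is to confirm that the results file covers all 43 queries. The sketch below is a hypothetical checker, not part of this PR. It assumes the file holds a `result` list with one entry per query and three timings per entry, with `null` for a failed run. That layout is an assumption about the benchmark's results format and should be verified against an existing results file.

```python
import json

EXPECTED_QUERIES = 43  # from the test plan: "runs all 43 queries"


def check_results(doc: dict, runs_per_query: int = 3) -> list[str]:
    """Return a list of problems found in a results document (empty if none)."""
    timings = doc.get("result")  # assumed key name
    if not isinstance(timings, list):
        return ["missing 'result' list"]
    problems = []
    if len(timings) != EXPECTED_QUERIES:
        problems.append(f"expected {EXPECTED_QUERIES} queries, got {len(timings)}")
    for i, runs in enumerate(timings, start=1):
        if not isinstance(runs, list) or len(runs) != runs_per_query:
            problems.append(f"query {i}: expected {runs_per_query} timings")
            continue
        for t in runs:
            # None marks a failed run; otherwise require a non-negative number.
            is_number = isinstance(t, (int, float)) and not isinstance(t, bool)
            if t is not None and not (is_number and t >= 0):
                problems.append(f"query {i}: bad timing {t!r}")
    return problems


def load_and_check(path: str) -> list[str]:
    """Load a results JSON file and run check_results on it."""
    with open(path) as f:
        return check_results(json.load(f))
```

Calling `load_and_check` on the path of a generated results file returns an empty list when the file has the assumed shape.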