Add benchmarks for Parquet struct leaf-level projection pruning #21180

Merged
adriangb merged 2 commits into apache:main from pydantic:friendlymatthew/parquet-struct-bench
Mar 26, 2026
@friendlymatthew
Contributor

Rationale for this change

This PR adds benchmarks that measure the performance of projecting individual fields from struct columns in Parquet files. #20925 introduced leaf-level projection masking, so that `select s['small_int']` on a struct with large string fields reads only the small integer leaf and skips the expensive string decoding entirely.
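To make the idea of a leaf-level mask concrete, here is a minimal std-only sketch. It is an illustrative model, not DataFusion's or the `parquet` crate's API: the `Node` type, the `leaf_mask` and `narrow_schema` functions, and the field name `large_string` are all hypothetical. It assumes leaves are numbered in depth-first order and shows that a path to a single field selects only that field's leaf, while a path to the struct itself selects every leaf beneath it.

```rust
// Illustrative model only: these types are hypothetical and are not part of
// DataFusion or the parquet crate.

/// A schema node: either a primitive leaf or a struct with named children.
enum Node {
    Leaf,
    Struct(Vec<(&'static str, Node)>),
}

/// Number of leaf columns under a node.
fn leaf_count(node: &Node) -> usize {
    match node {
        Node::Leaf => 1,
        Node::Struct(children) => children.iter().map(|(_, c)| leaf_count(c)).sum(),
    }
}

/// Depth-first leaf indices selected by `path`, starting at `root`.
/// Returns None if the path names a field that does not exist.
/// A path that ends on a struct selects every leaf under it.
fn leaf_mask(root: &Node, path: &[&str]) -> Option<Vec<usize>> {
    let mut node = root;
    let mut offset = 0;
    for name in path {
        // Only structs have named children to descend into.
        let Node::Struct(children) = node else {
            return None;
        };
        let mut found = None;
        for (child_name, child) in children {
            if child_name == name {
                found = Some(child);
                break;
            }
            // Skip past the leaves of every sibling that precedes the match.
            offset += leaf_count(child);
        }
        node = found?;
    }
    Some((offset..offset + leaf_count(node)).collect())
}

/// Hypothetical narrow struct column `s { large_string, small_int }`.
fn narrow_schema() -> Node {
    Node::Struct(vec![(
        "s",
        Node::Struct(vec![("large_string", Node::Leaf), ("small_int", Node::Leaf)]),
    )])
}
```

In this model, `leaf_mask(&narrow_schema(), &["s", "small_int"])` selects a single leaf, whereas `leaf_mask(&narrow_schema(), &["s"])` selects both, which is the full-struct read that the benchmarks use as a baseline.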

Three dataset shapes are covered, each with ~262K rows of 8 KB string payloads: a narrow struct (2 leaves), a wide struct (5 leaves), and a nested struct. Each shape benchmarks full-struct reads against single-field projections.
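For a rough sense of scale, the sketch below estimates the decoded size of one string leaf. It rests on assumptions not stated in the PR: that "~262K" means 2^18 rows, that the payload is 8 KiB, and that each row carries exactly one payload per string leaf. The function name `decoded_leaf_bytes` is hypothetical, and the figure is an uncompressed estimate that says nothing about on-disk size.

```rust
/// Estimated decoded size of one string leaf, in bytes.
/// Assumes one fixed-size payload per row (an assumption, not a measured value).
fn decoded_leaf_bytes(rows: u64, payload_bytes: u64) -> u64 {
    rows * payload_bytes
}

/// Assumed row count: "~262K" read as 2^18.
const ASSUMED_ROWS: u64 = 1 << 18;

/// Assumed payload size: 8 KiB.
const ASSUMED_PAYLOAD_BYTES: u64 = 8 * 1024;
```

Under these assumptions, `decoded_leaf_bytes(ASSUMED_ROWS, ASSUMED_PAYLOAD_BYTES)` comes to 2^31 bytes, or about 2 GiB of string data per string leaf that a single-field projection on the integer leaf would not need to decode.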

@github-actions github-actions bot added the core Core DataFusion crate label Mar 26, 2026
Contributor

@adriangb adriangb left a comment

Oh how I wish we could write SQL benchmarks easily... adding to #21165

@adriangb adriangb added this pull request to the merge queue Mar 26, 2026
Merged via the queue into apache:main with commit 1416ed4 Mar 26, 2026
33 checks passed
@adriangb adriangb deleted the friendlymatthew/parquet-struct-bench branch March 26, 2026 19:19
github-merge-queue bot pushed a commit that referenced this pull request Mar 27, 2026
This PR reduces the data volume in the parquet struct projection
benchmark so it runs faster.
It amends the recently introduced benchmarks in
#21180.

---------

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 30, 2026
## Rationale for this change

This PR adds a benchmark comparing top-level column access against
struct field access for the same logical data.

#20925 introduced leaf-level projection masking so that projecting a
single struct field skips decoding its siblings. #21180 added benchmarks
measuring that improvement across different struct shapes. But neither
benchmark answers how struct field access compares to reading the same
column at the top level. Without that baseline, it's hard to know how
much overhead the struct access path itself adds.
Labels

core Core DataFusion crate

2 participants