Add benchmarks for Parquet struct leaf-level projection pruning by friendlymatthew · Pull Request #21180 · apache/datafusion

friendlymatthew · 2026-03-26T15:01:49Z

Rationale for this change

This PR adds benchmarks that measure the performance of projecting individual fields from struct columns in Parquet files. #20925 introduced leaf-level projection masking so that select s['small_int'] on a struct with large string fields reads only the small integer leaf, skipping the expensive string decoding entirely.

Three dataset shapes are covered, each with ~262K rows of 8 KB string payloads: a narrow struct (2 leaves), a wide struct (5 leaves), and a nested struct. Each shape benchmarks full-struct reads against single-field projections.
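To make the idea of leaf-level masking concrete, here is a minimal sketch of how a projected path such as s['small_int'] could map to the set of leaf columns that need decoding. It is a toy model only: the `Field` enum, `leaf_count`, `leaf_mask`, `example_schema`, and all field names except `small_int` are hypothetical and are not taken from DataFusion or the parquet crate. It assumes leaves are numbered depth-first, left to right, across the whole schema.

```rust
// Toy schema model (hypothetical, not DataFusion's types): a field is either
// a primitive leaf or a struct that owns child fields.
enum Field {
    Leaf(&'static str),
    Struct(&'static str, Vec<Field>),
}

impl Field {
    fn name(&self) -> &'static str {
        match self {
            Field::Leaf(n) | Field::Struct(n, _) => *n,
        }
    }
}

/// Number of leaf columns stored under `field`.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Leaf(_) => 1,
        Field::Struct(_, children) => children.iter().map(leaf_count).sum(),
    }
}

/// Leaf indices (depth-first numbering) needed to read `path`,
/// e.g. ["s", "small_int"]. Returns None if the path does not resolve.
fn leaf_mask(schema: &[Field], path: &[&str]) -> Option<Vec<usize>> {
    let mut next = 0usize; // index of the first leaf not yet skipped
    let mut fields = schema;
    let mut remaining = path;
    loop {
        let (head, rest) = remaining.split_first()?;
        let mut found = None;
        for f in fields {
            if f.name() == *head {
                found = Some(f);
                break;
            }
            next += leaf_count(f); // skip every leaf of a preceding sibling
        }
        let f = found?;
        if rest.is_empty() {
            // Selecting a struct selects all of its leaves; a leaf selects itself.
            return Some((next..next + leaf_count(f)).collect());
        }
        match f {
            Field::Struct(_, children) => {
                fields = children.as_slice();
                remaining = rest;
            }
            Field::Leaf(_) => return None, // cannot descend into a primitive
        }
    }
}

/// Hypothetical schema: one top-level leaf, a struct with a nested struct,
/// and a trailing top-level leaf.
fn example_schema() -> Vec<Field> {
    vec![
        Field::Leaf("id"),
        Field::Struct(
            "s",
            vec![
                Field::Leaf("large_string"),
                Field::Leaf("small_int"),
                Field::Struct("inner", vec![Field::Leaf("a"), Field::Leaf("b")]),
            ],
        ),
        Field::Leaf("tail"),
    ]
}

fn main() {
    let schema = example_schema();
    // Full-struct read versus single-field projections.
    println!("s              -> {:?}", leaf_mask(&schema, &["s"]));
    println!("s.small_int    -> {:?}", leaf_mask(&schema, &["s", "small_int"]));
    println!("s.inner.b      -> {:?}", leaf_mask(&schema, &["s", "inner", "b"]));
}
```

In this model a full read of `s` touches four leaves, while `s.small_int` touches one, which is the gap the benchmarks are meant to measure when the skipped leaves hold large strings.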

adriangb

Oh how I wish we could write SQL benchmarks easily... adding to #21165

This PR reduces the data volume in the parquet struct projection benchmark so it runs faster. It amends the recently introduced benchmarks in #21180.

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Rationale for this change

This PR adds a benchmark comparing top-level column access against struct field access for the same logical data. #20925 introduced leaf-level projection masking so that projecting a single struct field skips decoding its siblings. #21180 added benchmarks measuring that improvement across different struct shapes. But neither benchmark answers how struct field access compares to reading the same column at the top level. Without that baseline, it's hard to know how much overhead the struct access path itself adds.

commit benchmarks

f2830a9

github-actions bot added the core Core DataFusion crate label Mar 26, 2026

adriangb approved these changes Mar 26, 2026

try 128kb string

41a8238

adriangb mentioned this pull request Mar 26, 2026

[EPIC] Benchmark improvements #21165

Open

adriangb added this pull request to the merge queue Mar 26, 2026

Merged via the queue into apache:main with commit 1416ed4 Mar 26, 2026
33 checks passed

adriangb deleted the friendlymatthew/parquet-struct-bench branch March 26, 2026 19:19

adriangb mentioned this pull request Mar 27, 2026

Reduce parquet struct projection benchmark data volume #21187

Merged

friendlymatthew mentioned this pull request Mar 30, 2026

Add flat vs. struct field projection benchmarks #21257

Merged

adriangb merged 2 commits into apache:main from pydantic:friendlymatthew/parquet-struct-bench
