Add end-to-end Parquet tests for List and LargeList struct schema evolution#20840
kosiew wants to merge 25 commits into apache:main from
Conversation
@alamb any chance we can get a core maintainer to review this? It would drastically help https://telemetry.sh query times since now we basically have to rewrite the parquet files at query time.
Implement comprehensive tests for List<Struct> and LargeList<Struct> schema evolution. Validate SELECT *, nested-field access, and include negative cases for missing non-nullable fields and incompatible nested types.
Expand the negative-test matrix for List and LargeList to cover missing non-nullable fields and incompatible field types, resolving the review blocker. Apply cleanup by extracting field_by_name helper in nested_messages_batch to eliminate repeated scans and simplify fixture logic for easier extensions.
Streamline column handling by removing manual push logic and introducing precomputed vectors: ids_vec, names_vec, chain_vec, and ignored_vec. Construct columns declaratively using fields.iter().map(...), and utilize field names to determine data/array types. Handle chain management for DataType::Utf8 with StringArray and DataType::Struct with StructArray wrapping StringArray results. Remove the unused field_by_name helper.
Remove redundant list-type validation from assert_nested_list_struct_schema_evolution. Retain kind-based downcast logic as the sole source of truth for runtime validation paths.
Introduce a new helper function `incompatible_chain_type()` and replace inline builders in both tests with this helper.
Introduce a new helper function `nested_list_table_schema` to dynamically create schema references for nested lists. This change removes duplicated schema creation logic in the `assert_nested_list_struct_schema_evolution` and `assert_nested_list_struct_schema_evolution_errors` functions, improving code maintainability and readability.
Implement a tiny internal error-case helper with an optional default chain-type provider. Introduce two scenario-focused wrappers to streamline functionality. Ensure all four tokio tests remain separate while routing them through the new wrappers for improved organization and clarity.
Combine the previously separate passes for ids, names, chain, and ignored into a single-pass precompute using fold. This change enhances performance by building all four vectors together with preallocated capacity, reducing the number of iterations over messages.
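The single-pass fold described above can be sketched in plain Rust. This is a hypothetical, simplified version: `Row` and `precompute` stand in for the test's actual `NestedMessageRow` fixtures and Arrow builders.

```rust
// Simplified stand-in for the test's fixture row (the real type is
// NestedMessageRow and feeds Arrow builders; this is a hypothetical sketch).
#[derive(Debug)]
struct Row<'a> {
    id: i64,
    name: &'a str,
    chain: Option<&'a str>,
    ignored: Option<i32>,
}

/// Build all four column vectors in a single pass over `rows`,
/// preallocating capacity instead of scanning the slice four times.
fn precompute(rows: &[Row]) -> (Vec<i64>, Vec<String>, Vec<Option<String>>, Vec<Option<i32>>) {
    let cap = rows.len();
    rows.iter().fold(
        (
            Vec::with_capacity(cap),
            Vec::with_capacity(cap),
            Vec::with_capacity(cap),
            Vec::with_capacity(cap),
        ),
        |(mut ids, mut names, mut chains, mut ignored), r| {
            ids.push(r.id);
            names.push(r.name.to_string());
            chains.push(r.chain.map(str::to_string));
            ignored.push(r.ignored);
            (ids, names, chains, ignored)
        },
    )
}

fn main() {
    let rows = [
        Row { id: 1, name: "a", chain: Some("x"), ignored: None },
        Row { id: 2, name: "b", chain: None, ignored: Some(7) },
    ];
    let (ids, names, chains, ignored) = precompute(&rows);
    assert_eq!(ids, vec![1, 2]);
    assert_eq!(names, vec!["a", "b"]);
    assert_eq!(chains, vec![Some("x".to_string()), None]);
    assert_eq!(ignored, vec![None, Some(7)]);
    println!("ok");
}
```

The payoff is that `messages` is traversed once rather than four times, while each vector still avoids reallocation thanks to the preallocated capacity.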
Will do so shortly
Revert "…r improved performance". This reverts commit cf179840f52a6083e901c98a097ef7174890861c.
Refactor the nested_messages_batch function to improve performance by converting vectors to ArrayRef only once. This change reduces the need to clone data for each field mapping, allowing for cheaper Arc::clone() operations during processing. Overall, this enhances efficiency and memory usage.
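The cost model behind this refactor can be illustrated with `std::sync::Arc` alone. `FakeArrayRef` below is a stand-in for arrow's `ArrayRef` (`Arc<dyn Array>`), not the real type: once the backing data is converted to an `Arc`, each subsequent "clone" only bumps a reference count instead of copying the buffers.

```rust
use std::sync::Arc;

// Stand-in for arrow's ArrayRef (Arc<dyn Array>): converting the backing
// data into an Arc once means later clones only bump a reference count.
type FakeArrayRef = Arc<Vec<i64>>;

fn main() {
    let ids: Vec<i64> = vec![1, 2, 3];

    // Convert to the shared form exactly once...
    let ids_array: FakeArrayRef = Arc::new(ids);

    // ...then hand out cheap Arc::clone()s per field mapping.
    let for_projection = Arc::clone(&ids_array);
    let for_batch = Arc::clone(&ids_array);

    // All three handles point at the same allocation.
    assert!(Arc::ptr_eq(&for_projection, &for_batch));
    assert_eq!(Arc::strong_count(&ids_array), 3);
    println!("{}", Arc::strong_count(&ids_array));
}
```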
Create build_message_columns() to construct columns in canonical order based on schema's optional fields. Eliminate the name-based dispatcher by directly adding id and name arrays, simplifying field matching and avoiding panics. Update nested_messages_batch to utilize the new helper for cleaner code.
Create a dedicated helper function `target_message_fields` to construct the target message schema. Replace manual schema builds in two locations with calls to this new function, ensuring cleaner code and removing unnecessary clone calls.
Consolidate the common setup pattern into a dedicated async function, setup_nested_list_test(). Update tests assert_nested_list_struct_schema_evolution and assert_nested_list_struct_schema_evolution_errors to use this helper for cleaner code and improved maintainability.
Removed unnecessary type aliases and wrapper functions to simplify error handling for chain field schema evolution. Directly called assert_nested_list_struct_schema_evolution_errors with concrete parameters for better clarity and reduced complexity.
Create extract_nested_list_values() helper function to encapsulate duplicate extraction logic for both ListArray and LargeListArray. Replace the inline match block in the test code with a single function call, streamlining the extraction pattern for improved readability and maintainability.
Eliminate the unused Clone derive from MessageValue and simplify the struct definition to reduce noise. In nested_messages_batch(), optimize item_field ownership by computing message_data_type first and using it when constructing the schema, minimizing clones. Add a comment to clarify the optimization.
Create `test_struct_schema_evolution_pair!` macro to consolidate test definitions for void and Result return types. Expand to distinct test functions for various schema evolution scenarios, improving test organization and reducing redundancy.
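A hedged sketch of what such a paired-test macro might look like: `run_scenario` and the names here are illustrative stand-ins, and the real `test_struct_schema_evolution_pair!` emits `#[tokio::test]` functions that run against actual Parquet fixtures rather than plain functions.

```rust
// Hypothetical sketch of generating a success/error pair from one
// scenario name; the real macro emits #[tokio::test] functions.
macro_rules! schema_evolution_pair {
    ($ok:ident, $err:ident, $scenario:expr) => {
        fn $ok() {
            // Success path: the compatible scenario must adapt cleanly.
            assert!(run_scenario($scenario).is_ok());
        }
        fn $err() -> Result<(), String> {
            // Error path: the incompatible variant is expected to fail.
            let name = format!("{}_incompatible", $scenario);
            match run_scenario(&name) {
                Err(_) => Ok(()), // failure is the expected outcome
                Ok(_) => Err("expected schema evolution to fail".to_string()),
            }
        }
    };
}

// Toy stand-in for the real fixture runner.
fn run_scenario(name: &str) -> Result<(), String> {
    if name.ends_with("_incompatible") {
        Err(format!("cannot adapt {name}"))
    } else {
        Ok(())
    }
}

schema_evolution_pair!(list_ok, list_err, "list_struct");

fn main() {
    list_ok();
    assert!(list_err().is_ok());
    println!("paired tests passed");
}
```

The design point is that each scenario is declared once and the macro guarantees the `()`-returning success test and the `Result`-returning error test stay in sync.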
alamb
left a comment
Thanks @kosiew -- I would still say these tests are "unit test" level as they seem to be exercising the Rust API functions
Is it possible to implement "end user API" tests -- specifically, .slt tests?
It isn't clear to me why we can't create these cases using SQL (or DataFrame)
```rust
}

#[derive(Debug)]
struct MessageValue<'a> {
```
I am not sure what a MessageValue is 🤔
Renamed to NestedMessageRow.
It is a fixture row for one nested struct element inside the `messages` list column.
```rust
kind,
1,
&[
    MessageValue {
```
I don't understand the reason to create these MessageValue / nested batches in the test setup -- can't we just make the batches directly? If we need another abstraction, at least perhaps we can add comments explaining what is going on
The abstraction was meant to avoid repeating the same nested batch construction for both List and LargeList, and to make it easy to generate the “old schema” and “new schema” Parquet files that differ only by nested struct fields.
The goal here was to exercise the full Parquet scan + schema adaptation path end-to-end, not just a unit helper. That said, I agree the current helpers obscure the intent. I'll add comments that explain the old/new file shapes and why we need actual nested Parquet batches.
The main reason I wrote them this way is fixture creation: the cases need multiple Parquet files with intentionally different nested physical schemas (List<Struct<...>> / LargeList<Struct<...>>, additive nullable fields, extra fields, and incompatible variants). SQL/DataFrame is a good fit for the read/query side, but it doesn’t naturally create those mismatched Parquet fixtures in a self-contained way. In Rust I can generate the files inline and keep the test focused on the exact evolution shapes without checking in binary fixtures. That said, I agree there is value in an end-user API test. I can look at splitting this into:
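For reference, a SQL-only fixture along the lines the review suggests might look like the following sketch. The paths, column names, and exact statements are illustrative assumptions, not the PR's actual `.slt` contents.

```
# Hypothetical .slt sketch: write two Parquet files whose nested struct
# schemas differ, then query them through one external table.

# "Old" file: messages is a List<Struct<id, name>>
statement ok
COPY (SELECT make_array(named_struct('id', 1, 'name', 'a')) AS messages)
TO 'scratch/evolution/old.parquet' STORED AS PARQUET;

# "New" file: the nested struct gains a nullable chain field
statement ok
COPY (SELECT make_array(named_struct('id', 2, 'name', 'b', 'chain', 'x')) AS messages)
TO 'scratch/evolution/new.parquet' STORED AS PARQUET;

statement ok
CREATE EXTERNAL TABLE msgs STORED AS PARQUET
LOCATION 'scratch/evolution/';

# SELECT * FROM msgs then exercises nested schema adaptation across both files
```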
Remove projected nested-field SQL assertion from the success-path helper. Focus Rust tests on direct nested schema adaptation across Parquet files, while retaining List/LargeList fixtures, failure-path tests, and new .slt coverage for SELECT * output and nested-field projection.
Update all call sites and function parameters to use NestedMessageRow in the same file. Add a brief comment above the struct to clarify that it represents one nested row element in the messages list fixture.
Add comments to clarify that the old batch construction lacks a chain in the nested struct. Highlight that the new batch construction includes a nullable chain and extra ignored fields. Also, clarify that the logical table schema expects an evolved shape and ignores source-only ignored fields.
Which issue does this PR close?
- `List<Struct>` / nested container types in Parquet scans #20835

Rationale for this change
While the core fixes for nested struct schema evolution have landed in #20907, existing coverage is primarily at the unit/helper level. This PR adds end-to-end Parquet-based integration tests to validate that List and LargeList schema evolution behaves correctly through the full execution pipeline (planning, scanning, and projection).
This ensures that real-world query paths such as `SELECT *` and nested field projection behave consistently and that previous repro cases no longer fail.

What changes are included in this PR?
1. End-to-end Rust integration tests
Added comprehensive tests in:
`datafusion/core/tests/parquet/expr_adapter.rs`

These tests:
- Generate old/new Parquet files with differing nested struct schemas
- Cover both `List<Struct<...>>` and `LargeList<Struct<...>>`
- Validate `SELECT *` correctness and nested-field access via `get_field(...)`

2. Error-path coverage
Added failure tests for both `List` and `LargeList`.

This ensures parity across both list encodings and prevents partial regressions.
3. Test utilities and refactoring
Introduced reusable helpers to simplify nested test setup:
- `NestedListKind` abstraction for List vs LargeList
- `NestedMessageRow` test fixture struct
- `test_struct_schema_evolution_pair!` to generate paired tests

These reduce duplication and make it easier to extend the test matrix.
4. End-user API coverage via `.slt`

Added:
- `datafusion/sqllogictest/test_files/schema_evolution_nested.slt`

This validates behavior through SQL-only workflows:
- `COPY ... TO PARQUET` to generate test files
- `CREATE EXTERNAL TABLE` to query them

Covers:
- `List` and `LargeList`

Are these changes tested?
Yes.
This PR adds both:
- Rust integration tests for `List` and `LargeList`
- sqllogictest (`.slt`) tests

All tests pass locally, including:
- `test_list_struct_schema_evolution_end_to_end`
- `test_large_list_struct_schema_evolution_end_to_end`

Are there any user-facing changes?
No direct user-facing changes.
This PR improves correctness guarantees and test coverage for nested schema evolution, ensuring more predictable behavior for users working with evolving Parquet schemas.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.