
Add end-to-end Parquet tests for List and LargeList struct schema evolution #20840

Open
kosiew wants to merge 25 commits into apache:main from kosiew:schema-01-20835

Conversation

@kosiew
Contributor

@kosiew kosiew commented Mar 10, 2026

Which issue does this PR close?

Rationale for this change

While the core fixes for nested struct schema evolution have landed in #20907, existing coverage is primarily at the unit/helper level. This PR adds end-to-end Parquet-based integration tests to validate that List and LargeList schema evolution behaves correctly through the full execution pipeline (planning, scanning, and projection).

This ensures that real-world query paths such as SELECT * and nested field projection behave consistently and that previous repro cases are no longer failing.

What changes are included in this PR?

1. End-to-end Rust integration tests

Added comprehensive tests in:

  • datafusion/core/tests/parquet/expr_adapter.rs

These tests:

  • Generate old/new Parquet files with differing nested struct schemas

  • Cover both List<Struct<...>> and LargeList<Struct<...>>

  • Validate:

    • SELECT * correctness
    • Nested field projection via get_field(...)
    • NULL backfilling for missing nullable fields
    • Ignoring extra source-only fields
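
The row-level behavior these assertions check can be sketched as follows. This is an illustrative model of the expected adaptation semantics, not DataFusion's actual schema-adapter code:

```python
# Hypothetical sketch of the adaptation rules the tests assert:
# missing nullable fields are NULL-backfilled, source-only fields are dropped.
def adapt_struct(row, target_fields):
    """Project a source struct row onto a target schema.

    target_fields: list of (name, nullable) pairs describing the
    logical (evolved) schema.
    """
    adapted = {}
    for name, nullable in target_fields:
        if name in row:
            adapted[name] = row[name]
        elif nullable:
            adapted[name] = None  # NULL backfill for a missing nullable field
        else:
            raise ValueError(f"missing non-nullable field '{name}'")
    return adapted

# An "old" file row lacks 'chain' and carries an extra 'ignored' field.
old_row = {"id": 1, "name": "a", "ignored": "drop me"}
target = [("id", False), ("name", True), ("chain", True)]
print(adapt_struct(old_row, target))
# {'id': 1, 'name': 'a', 'chain': None}
```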

2. Error-path coverage

Added failure tests for both List and LargeList:

  • Non-nullable missing field → error
  • Incompatible nested field type → error

Ensures parity across both list encodings and prevents partial regressions.
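
The two failure modes can be sketched like this. The helper and its messages are illustrative only, not DataFusion internals:

```python
# Hypothetical per-field compatibility check mirroring the two error cases.
def check_field(target_name, target_type, target_nullable, source_fields):
    """source_fields: dict of field name -> type string for the Parquet file."""
    if target_name not in source_fields:
        if not target_nullable:
            raise ValueError(
                f"non-nullable field '{target_name}' missing from source"
            )
        return  # nullable and missing: would be NULL-backfilled
    if source_fields[target_name] != target_type:
        raise ValueError(
            f"incompatible type for '{target_name}': "
            f"{source_fields[target_name]} vs {target_type}"
        )

source = {"id": "Int64", "chain": "Int32"}

# Non-nullable missing field -> error
try:
    check_field("name", "Utf8", False, source)
except ValueError as e:
    print(e)  # non-nullable field 'name' missing from source

# Incompatible nested field type -> error
try:
    check_field("chain", "Utf8", True, source)
except ValueError as e:
    print(e)
```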

3. Test utilities and refactoring

Introduced reusable helpers to simplify nested test setup:

  • NestedListKind abstraction for List vs LargeList
  • NestedMessageRow test fixture struct
  • Batch builders and schema helpers
  • Macro test_struct_schema_evolution_pair! to generate paired tests

These reduce duplication and make it easier to extend the test matrix.

4. End-user API coverage via .slt

Added:

  • datafusion/sqllogictest/test_files/schema_evolution_nested.slt

This validates behavior through SQL-only workflows:

  • Uses COPY ... TO PARQUET to generate test files
  • Uses CREATE EXTERNAL TABLE to query them

Covers:

  • Mixed-schema reads
  • Nested projection queries
  • Both List and LargeList
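
The SQL-only flow might look roughly like this (file and table names are illustrative, not the actual .slt contents):

```sql
-- Write a fixture file with an "old" nested schema
COPY (SELECT make_array(named_struct('name', 'a')) AS messages)
TO 'old_schema.parquet' STORED AS PARQUET;

-- Query it back against the evolved logical schema
CREATE EXTERNAL TABLE t
STORED AS PARQUET
LOCATION 'old_schema.parquet';

SELECT * FROM t;
```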

Are these changes tested?

Yes.

This PR adds both:

  1. Rust integration tests

    • End-to-end Parquet scan behavior
    • Success and failure scenarios
    • Covers both List and LargeList
  2. sqllogictest (.slt) tests

    • Validates behavior through end-user SQL interface
    • Uses generated Parquet fixtures (no checked-in binaries)

All tests pass locally, including:

  • test_list_struct_schema_evolution_end_to_end
  • test_large_list_struct_schema_evolution_end_to_end
  • Error-path variants for both list encodings

Are there any user-facing changes?

No direct user-facing changes.

This PR improves correctness guarantees and test coverage for nested schema evolution, ensuring more predictable behavior for users working with evolving Parquet schemas.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate labels Mar 10, 2026
@kosiew kosiew marked this pull request as ready for review March 10, 2026 09:09
@TheBuilderJR
Contributor

@alamb any chance we can get a core maintainer to review this? It would drastically improve query times for https://telemetry.sh, since right now we basically have to rewrite the Parquet files at query time.

kosiew added 8 commits March 23, 2026 18:38
Implement comprehensive tests for List<Struct> and LargeList<Struct>
schema evolution. Validate SELECT *, nested-field access, and include
negative cases for missing non-nullable fields and incompatible
nested types.
Expand the negative-test matrix for List and LargeList to cover
missing non-nullable fields and incompatible field types,
resolving the review blocker.

Apply cleanup by extracting field_by_name helper
in nested_messages_batch to eliminate repeated scans and
simplify fixture logic for easier extensions.
Streamline column handling by removing manual push logic and
introducing precomputed vectors: ids_vec, names_vec,
chain_vec, and ignored_vec. Construct columns declaratively
using fields.iter().map(...), and utilize field names to
determine data/array types. Handle chain management for
DataType::Utf8 with StringArray and DataType::Struct with
StructArray wrapping StringArray results. Remove the
unused field_by_name helper.
Remove redundant list-type validation from
assert_nested_list_struct_schema_evolution. Retain
kind-based downcast logic as the sole source of truth
for runtime validation paths.
Introduce a new helper function `incompatible_chain_type()`
and replace inline builders in both tests with this helper.
Introduce a new helper function `nested_list_table_schema` to
dynamically create schema references for nested lists. This change
removes duplicated schema creation logic in the
`assert_nested_list_struct_schema_evolution` and
`assert_nested_list_struct_schema_evolution_errors` functions,
improving code maintainability and readability.
Implement a tiny internal error-case helper with an optional
default chain-type provider. Introduce two scenario-focused
wrappers to streamline functionality. Ensure all four
tokio tests remain separate while routing them through
the new wrappers for improved organization and clarity.
Combine the previously separate passes for ids, names, chain, and
ignored into a single-pass precompute using fold. This change
enhances performance by building all four vectors together with
preallocated capacity, reducing the number of iterations over
messages.
@alamb
Contributor

alamb commented Mar 23, 2026

Will do so shortly

kosiew added 11 commits March 23, 2026 21:20
…r improved performance"

This reverts commit cf179840f52a6083e901c98a097ef7174890861c.
Refactor the nested_messages_batch function to improve
performance by converting vectors to ArrayRef only once.
This change reduces the need to clone data for each field
mapping, allowing for cheaper Arc::clone() operations
during processing. Overall, this enhances efficiency and
memory usage.
Create build_message_columns() to construct columns in
canonical order based on schema's optional fields.
Eliminate the name-based dispatcher by directly adding
id and name arrays, simplifying field matching and
avoiding panics. Update nested_messages_batch to
utilize the new helper for cleaner code.
Create a dedicated helper function `target_message_fields` to
construct the target message schema. Replace manual schema
builds in two locations with calls to this new function,
ensuring cleaner code and removing unnecessary clone calls.
Consolidate common setup pattern in a dedicated async
function, setup_nested_list_test(). Update tests
assert_nested_list_struct_schema_evolution and
assert_nested_list_struct_schema_evolution_errors to
utilize this helper for cleaner code and improved
maintainability.
Removed unnecessary type aliases and wrapper functions to simplify
error handling for chain field schema evolution. Directly called
assert_nested_list_struct_schema_evolution_errors with
concrete parameters for better clarity and reduced complexity.
Create extract_nested_list_values() helper function
to encapsulate duplicate extraction logic for both
ListArray and LargeListArray. Replace the inline match
block in the test code with a single function call,
streamlining the extraction pattern for improved
readability and maintainability.
Eliminate unused Clone derive from MessageValue and simplify the
struct definition to reduce noise. In nested_messages_batch(),
optimize item_field ownership by computing message_data_type first,
and use it when constructing the schema, minimizing clones.
Added a comment to clarify the optimization.
Create `test_struct_schema_evolution_pair!` macro to
consolidate test definitions for void and Result return types.
Expand to distinct test functions for various schema evolution
scenarios, improving test organization and reducing redundancy.
@github-actions github-actions bot removed the common Related to common crate label Mar 23, 2026
@kosiew kosiew changed the title Support recursive schema compatibility validation for container types wrapping evolved structs Add end-to-end Parquet tests for List and LargeList struct schema evolution Mar 23, 2026
@kosiew
Contributor Author

kosiew commented Mar 23, 2026

@alamb
I did a large force push after #20907 (which overlaps with #20835) landed.

Contributor

@alamb alamb left a comment


Thanks @kosiew -- I would still say these tests are "unit test" level as they seem to be exercising the Rust API functions

Is it possible to implement "end user API" tests -- specifically, .slt tests?

It isn't clear to me why we can't create these cases using SQL (or DataFrame)

}

#[derive(Debug)]
struct MessageValue<'a> {
Contributor


I am not sure what a MessageValue is 🤔

Contributor Author


Renamed to NestedMessageRow

It is a fixture row for one nested struct element inside the messages list column.

kind,
1,
&[
MessageValue {
Contributor


I don't understand the reason to create these MessageValue / nested batches in the test setup -- can't we just make the batches directly? If we need another abstraction, at least perhaps we can add comments explaining what is going on

Contributor Author


The abstraction was meant to avoid repeating the same nested batch construction for both List and LargeList, and to make it easy to generate the “old schema” and “new schema” Parquet files that differ only by nested struct fields.

The goal here was to exercise the full Parquet scan + schema adaptation path end-to-end, not just a unit helper. That said, I agree the current helpers obscure the intent. I'll add comments that explain the old/new file shapes and why we need actual nested Parquet batches.

@kosiew
Contributor Author

kosiew commented Mar 26, 2026

@alamb

these tests are "unit test" level as they seem to be exercising the Rust API functions
Is it possible to implement "end user API" tests -- specifically, .slt tests?
It isn't clear to me why we can't create these cases using SQL (or DataFrame)

The main reason I wrote them this way is fixture creation: the cases need multiple Parquet files with intentionally different nested physical schemas (List<Struct<...>> / LargeList<Struct<...>>, additive nullable fields, extra fields, and incompatible variants).

SQL/DataFrame is a good fit for the read/query side, but it doesn’t naturally create those mismatched Parquet fixtures in a self-contained way. In Rust I can generate the files inline and keep the test focused on the exact evolution shapes without checking in binary fixtures.

That said, I agree there is value in an end-user API test. I can look at splitting this into:

  • .slt coverage for the happy-path query behavior (SELECT * and projected nested fields) using checked-in fixtures or generated fixtures if we have a pattern for that
  • Rust tests retained for the fixture-heavy/error-path cases

kosiew added 4 commits March 26, 2026 18:29
Remove projected nested-field SQL assertion from the success-path
helper. Focus Rust tests on direct nested schema adaptation
across Parquet files, while retaining List/LargeList fixtures,
failure-path tests, and new .slt coverage for SELECT *
output and nested-field projection.
Update all call sites and function parameters to use
NestedMessageRow in the same file. Add a brief comment
above the struct to clarify that it represents one
nested row element in the messages list fixture.
Add comments to clarify that the old batch construction lacks
a chain in the nested struct. Highlight that the new batch
construction includes a nullable chain and extra ignored fields.
Also, clarify that the logical table schema expects an evolved
shape and ignores source-only ignored fields.
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 26, 2026

Labels

core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)


3 participants