Add end-to-end Parquet tests for List and LargeList struct schema evolution#20840
kosiew wants to merge 25 commits into apache:main from
Conversation
@alamb any chance we can get a core maintainer to review this? It would drastically help https://telemetry.sh query times since now we basically have to rewrite the parquet files at query time.
Implement comprehensive tests for List<Struct> and LargeList<Struct> schema evolution. Validate SELECT *, nested-field access, and include negative cases for missing non-nullable fields and incompatible nested types.
Expand the negative-test matrix for List and LargeList to cover missing non-nullable fields and incompatible field types, resolving the review blocker. Apply cleanup by extracting field_by_name helper in nested_messages_batch to eliminate repeated scans and simplify fixture logic for easier extensions.
Streamline column handling by removing manual push logic and introducing precomputed vectors: ids_vec, names_vec, chain_vec, and ignored_vec. Construct columns declaratively using fields.iter().map(...), and utilize field names to determine data/array types. Handle chain management for DataType::Utf8 with StringArray and DataType::Struct with StructArray wrapping StringArray results. Remove the unused field_by_name helper.
Remove redundant list-type validation from assert_nested_list_struct_schema_evolution. Retain kind-based downcast logic as the sole source of truth for runtime validation paths.
Introduce a new helper function `incompatible_chain_type()` and replace inline builders in both tests with this helper.
Introduce a new helper function `nested_list_table_schema` to dynamically create schema references for nested lists. This change removes duplicated schema creation logic in the `assert_nested_list_struct_schema_evolution` and `assert_nested_list_struct_schema_evolution_errors` functions, improving code maintainability and readability.
Implement a tiny internal error-case helper with an optional default chain-type provider. Introduce two scenario-focused wrappers to streamline functionality. Ensure all four tokio tests remain separate while routing them through the new wrappers for improved organization and clarity.
Combine the previously separate passes for ids, names, chain, and ignored into a single-pass precompute using fold. This change enhances performance by building all four vectors together with preallocated capacity, reducing the number of iterations over messages.
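The single-pass fold described above can be sketched in plain Rust. This is a hypothetical, simplified version: `Row` and `precompute` stand in for the test's actual `NestedMessageRow` fixtures and Arrow builders.

```rust
// Simplified stand-in for the test's fixture row (the real type is
// NestedMessageRow and feeds Arrow builders; this is a hypothetical sketch).
#[derive(Debug)]
struct Row<'a> {
    id: i64,
    name: &'a str,
    chain: Option<&'a str>,
    ignored: Option<i32>,
}

/// Build all four column vectors in a single pass over `rows`,
/// preallocating capacity instead of scanning the slice four times.
fn precompute(rows: &[Row]) -> (Vec<i64>, Vec<String>, Vec<Option<String>>, Vec<Option<i32>>) {
    let cap = rows.len();
    rows.iter().fold(
        (
            Vec::with_capacity(cap),
            Vec::with_capacity(cap),
            Vec::with_capacity(cap),
            Vec::with_capacity(cap),
        ),
        |(mut ids, mut names, mut chains, mut ignored), r| {
            ids.push(r.id);
            names.push(r.name.to_string());
            chains.push(r.chain.map(str::to_string));
            ignored.push(r.ignored);
            (ids, names, chains, ignored)
        },
    )
}

fn main() {
    let rows = [
        Row { id: 1, name: "a", chain: Some("x"), ignored: None },
        Row { id: 2, name: "b", chain: None, ignored: Some(7) },
    ];
    let (ids, names, chains, ignored) = precompute(&rows);
    assert_eq!(ids, vec![1, 2]);
    assert_eq!(names, vec!["a", "b"]);
    assert_eq!(chains, vec![Some("x".to_string()), None]);
    assert_eq!(ignored, vec![None, Some(7)]);
    println!("ok");
}
```

The payoff is that `messages` is traversed once rather than four times, while each vector still avoids reallocation thanks to the preallocated capacity.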
Will do so shortly
Revert "…r improved performance". This reverts commit cf179840f52a6083e901c98a097ef7174890861c.
Refactor the nested_messages_batch function to improve performance by converting vectors to ArrayRef only once. This change reduces the need to clone data for each field mapping, allowing for cheaper Arc::clone() operations during processing. Overall, this enhances efficiency and memory usage.
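The cost model behind this refactor can be illustrated with `std::sync::Arc` alone. `FakeArrayRef` below is a stand-in for arrow's `ArrayRef` (`Arc<dyn Array>`), not the real type: once the backing data is converted to an `Arc`, each subsequent "clone" only bumps a reference count instead of copying the buffers.

```rust
use std::sync::Arc;

// Stand-in for arrow's ArrayRef (Arc<dyn Array>): converting the backing
// data into an Arc once means later clones only bump a reference count.
type FakeArrayRef = Arc<Vec<i64>>;

fn main() {
    let ids: Vec<i64> = vec![1, 2, 3];

    // Convert to the shared form exactly once...
    let ids_array: FakeArrayRef = Arc::new(ids);

    // ...then hand out cheap Arc::clone()s per field mapping.
    let for_projection = Arc::clone(&ids_array);
    let for_batch = Arc::clone(&ids_array);

    // All three handles point at the same allocation.
    assert!(Arc::ptr_eq(&for_projection, &for_batch));
    assert_eq!(Arc::strong_count(&ids_array), 3);
    println!("{}", Arc::strong_count(&ids_array));
}
```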
Create build_message_columns() to construct columns in canonical order based on schema's optional fields. Eliminate the name-based dispatcher by directly adding id and name arrays, simplifying field matching and avoiding panics. Update nested_messages_batch to utilize the new helper for cleaner code.
Create a dedicated helper function `target_message_fields` to construct the target message schema. Replace manual schema builds in two locations with calls to this new function, ensuring cleaner code and removing unnecessary clone calls.
Consolidate the common setup pattern into a dedicated async function, setup_nested_list_test(). Update tests assert_nested_list_struct_schema_evolution and assert_nested_list_struct_schema_evolution_errors to use this helper for cleaner code and improved maintainability.
Removed unnecessary type aliases and wrapper functions to simplify error handling for chain field schema evolution. Directly called assert_nested_list_struct_schema_evolution_errors with concrete parameters for better clarity and reduced complexity.
Create extract_nested_list_values() helper function to encapsulate duplicate extraction logic for both ListArray and LargeListArray. Replace the inline match block in the test code with a single function call, streamlining the extraction pattern for improved readability and maintainability.
Eliminate the unused Clone derive from MessageValue and simplify the struct definition to reduce noise. In nested_messages_batch(), optimize item_field ownership by computing message_data_type first and using it when constructing the schema, minimizing clones. Add a comment to clarify the optimization.
Create `test_struct_schema_evolution_pair!` macro to consolidate test definitions for void and Result return types. Expand to distinct test functions for various schema evolution scenarios, improving test organization and reducing redundancy.
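A hedged sketch of what such a paired-test macro might look like: `run_scenario` and the names here are illustrative stand-ins, and the real `test_struct_schema_evolution_pair!` emits `#[tokio::test]` functions that run against actual Parquet fixtures rather than plain functions.

```rust
// Hypothetical sketch of generating a success/error pair from one
// scenario name; the real macro emits #[tokio::test] functions.
macro_rules! schema_evolution_pair {
    ($ok:ident, $err:ident, $scenario:expr) => {
        fn $ok() {
            // Success path: the compatible scenario must adapt cleanly.
            assert!(run_scenario($scenario).is_ok());
        }
        fn $err() -> Result<(), String> {
            // Error path: the incompatible variant is expected to fail.
            let name = format!("{}_incompatible", $scenario);
            match run_scenario(&name) {
                Err(_) => Ok(()), // failure is the expected outcome
                Ok(_) => Err("expected schema evolution to fail".to_string()),
            }
        }
    };
}

// Toy stand-in for the real fixture runner.
fn run_scenario(name: &str) -> Result<(), String> {
    if name.ends_with("_incompatible") {
        Err(format!("cannot adapt {name}"))
    } else {
        Ok(())
    }
}

schema_evolution_pair!(list_ok, list_err, "list_struct");

fn main() {
    list_ok();
    assert!(list_err().is_ok());
    println!("paired tests passed");
}
```

The design point is that each scenario is declared once and the macro guarantees the `()`-returning success test and the `Result`-returning error test stay in sync.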
alamb
left a comment
Thanks @kosiew -- I would still say these tests are "unit test" level as they seem to be exercising the Rust API functions
Is it possible to implement "end user API" tests -- specifically, .slt tests?
It isn't clear to me why we can't create these cases using SQL (or DataFrame)
```rust
}

#[derive(Debug)]
struct MessageValue<'a> {
```
I am not sure what a MessageValue is 🤔
Renamed to NestedMessageRow.
It is a fixture row for one nested struct element inside the `messages` list column.
```rust
kind,
1,
&[
    MessageValue {
```
I don't understand the reason to create these MessageValue / nested batches in the test setup -- can't we just make the batches directly? If we need another abstraction, at least perhaps we can add comments explaining what is going on
The abstraction was meant to avoid repeating the same nested batch construction for both List and LargeList, and to make it easy to generate the “old schema” and “new schema” Parquet files that differ only by nested struct fields.
The goal here was to exercise the full Parquet scan + schema adaptation path end-to-end, not just a unit helper. That said, I agree the current helpers obscure the intent. I'll add comments that explain the old/new file shapes and why we need actual nested Parquet batches.
The main reason I wrote them this way is fixture creation: the cases need multiple Parquet files with intentionally different nested physical schemas (List<Struct<...>> / LargeList<Struct<...>>, additive nullable fields, extra fields, and incompatible variants). SQL/DataFrame is a good fit for the read/query side, but it doesn’t naturally create those mismatched Parquet fixtures in a self-contained way. In Rust I can generate the files inline and keep the test focused on the exact evolution shapes without checking in binary fixtures. That said, I agree there is value in an end-user API test. I can look at splitting this into:
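For reference, a SQL-only fixture along the lines the review suggests might look like the following sketch. The paths, column names, and exact statements are illustrative assumptions, not the PR's actual `.slt` contents.

```
# Hypothetical .slt sketch: write two Parquet files whose nested struct
# schemas differ, then query them through one external table.

# "Old" file: messages is a List<Struct<id, name>>
statement ok
COPY (SELECT make_array(named_struct('id', 1, 'name', 'a')) AS messages)
TO 'scratch/evolution/old.parquet' STORED AS PARQUET;

# "New" file: the nested struct gains a nullable chain field
statement ok
COPY (SELECT make_array(named_struct('id', 2, 'name', 'b', 'chain', 'x')) AS messages)
TO 'scratch/evolution/new.parquet' STORED AS PARQUET;

statement ok
CREATE EXTERNAL TABLE msgs STORED AS PARQUET
LOCATION 'scratch/evolution/';

# SELECT * FROM msgs then exercises nested schema adaptation across both files
```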
Remove projected nested-field SQL assertion from the success-path helper. Focus Rust tests on direct nested schema adaptation across Parquet files, while retaining List/LargeList fixtures, failure-path tests, and new .slt coverage for SELECT * output and nested-field projection.
Update all call sites and function parameters to use NestedMessageRow in the same file. Add a brief comment above the struct to clarify that it represents one nested row element in the messages list fixture.
Add comments to clarify that the old batch construction lacks a chain in the nested struct. Highlight that the new batch construction includes a nullable chain and extra ignored fields. Also, clarify that the logical table schema expects an evolved shape and ignores source-only ignored fields.
Which issue does this PR close?
- `List<Struct>` / nested container types in Parquet scans #20835

Rationale for this change
While the core fixes for nested struct schema evolution have landed in #20907, existing coverage is primarily at the unit/helper level. This PR adds end-to-end Parquet-based integration tests to validate that List and LargeList schema evolution behaves correctly through the full execution pipeline (planning, scanning, and projection).
This ensures that real-world query paths such as `SELECT *` and nested field projection behave consistently and that previous repro cases no longer fail.

What changes are included in this PR?
1. End-to-end Rust integration tests
Added comprehensive tests in:
`datafusion/core/tests/parquet/expr_adapter.rs`

These tests:
- Generate old/new Parquet files with differing nested struct schemas
- Cover both `List<Struct<...>>` and `LargeList<Struct<...>>`
- Validate `SELECT *` correctness and nested-field access via `get_field(...)`

2. Error-path coverage
Added failure tests for both `List` and `LargeList`.

This ensures parity across both list encodings and prevents partial regressions.
3. Test utilities and refactoring
Introduced reusable helpers to simplify nested test setup:
- `NestedListKind` abstraction for List vs LargeList
- `NestedMessageRow` test fixture struct
- `test_struct_schema_evolution_pair!` to generate paired tests

These reduce duplication and make it easier to extend the test matrix.
4. End-user API coverage via `.slt`

Added:
- `datafusion/sqllogictest/test_files/schema_evolution_nested.slt`

This validates behavior through SQL-only workflows:
- `COPY ... TO PARQUET` to generate test files
- `CREATE EXTERNAL TABLE` to query them

Covers:
- `List` and `LargeList`

Are these changes tested?
Yes.
This PR adds both:
- Rust integration tests for `List` and `LargeList`
- sqllogictest (`.slt`) tests

All tests pass locally, including:
- `test_list_struct_schema_evolution_end_to_end`
- `test_large_list_struct_schema_evolution_end_to_end`

Are there any user-facing changes?
No direct user-facing changes.
This PR improves correctness guarantees and test coverage for nested schema evolution, ensuring more predictable behavior for users working with evolving Parquet schemas.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.