Skip to content

fix: build_fallback_field_id_map produces incorrect column indices for schemas with nested types#2307

Open
mbutrovich wants to merge 6 commits intoapache:mainfrom
mbutrovich:field_id_nested_types
Open

fix: build_fallback_field_id_map produces incorrect column indices for schemas with nested types#2307
mbutrovich wants to merge 6 commits intoapache:mainfrom
mbutrovich:field_id_nested_types

Conversation

@mbutrovich
Copy link
Copy Markdown
Collaborator

@mbutrovich mbutrovich commented Mar 31, 2026

Which issue does this PR close?

What changes are included in this PR?

build_fallback_field_id_map iterated over Parquet leaf columns instead of top-level fields when building the field ID to column index mapping for migrated files (no embedded field IDs). When nested types (struct, list, map) precede a primitive column, they expand into multiple leaves, causing the mapping to diverge from add_fallback_field_ids_to_arrow_schema which correctly assigns ordinal IDs to top-level Arrow fields. This made predicates on columns after nested types resolve to a leaf inside the group, crashing with "Leaf column id in predicates isn't a root column in Parquet schema".

The fix iterates root_schema().get_fields() directly, assigning ordinal IDs only to top-level fields. For non-primitive fields (struct/list/map), it uses get_column_root_idx to advance past their leaf columns. This mirrors iceberg-java's ParquetSchemaUtil.addFallbackIds(), which iterates fileSchema.getFields() assigning ordinal IDs to top-level fields.

Also renames "Leave column" to "Leaf column" in error messages.

Are these changes tested?

@mbutrovich mbutrovich requested a review from blackmwk April 1, 2026 15:12
Copy link
Copy Markdown
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich for this fix! Generally LGTM!

@mbutrovich
Copy link
Copy Markdown
Collaborator Author

Most recent changes:

  • crates/iceberg/Cargo.toml — added serde_arrow = { version = "0.14", features = ["arrow-58"] } dev-dependency
  • crates/iceberg/src/arrow/reader.rs — replaced three separate tests (test_predicate_on_migrated_file_with_{struct,list,map}) and the shared helper with one test_predicate_on_migrated_file_with_nested_types that:
    • Uses serde_arrow with Serialize/Deserialize structs for readable data construction
    • Tests a schema with all three nested types (struct, list, map) before the id column

@mbutrovich mbutrovich requested a review from blackmwk April 2, 2026 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: build_fallback_field_id_map produces incorrect column indices for schemas with nested types

2 participants