
fix: FilterPushdown incorrectly remaps filters through ProjectionExec with duplicate column names #21247

Open
yashrb24 wants to merge 3 commits into apache:main from yashrb24:fix/filter-pushdown-duplicate-columns

@yashrb24 yashrb24 commented Mar 30, 2026

Which issue does this PR close?

Rationale for this change

ProjectionExec::gather_filters_for_pushdown silently rewrites filter predicates to the wrong source column when the output schema contains duplicate column names — a structure that arises above joins where both sides share a column name. Two functions use name-only schema lookups (column_with_name and index_of) that always return the first match, which is incorrect when duplicate names exist:

  1. collect_reverse_alias: a HashMap key collision causes the second duplicate to overwrite the first.
  2. FilterRemapper::try_remap: index_of silently rewrites column indices from non-first duplicates to the index of the first match.
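
The two failure modes can be reproduced without DataFusion. The sketch below is a simplified stand-in that uses only the standard library: `first_index_of` and `reverse_alias_by_name` are hypothetical helpers that imitate the first-match lookup semantics described above, and the source names (`left.id`, `right.id`) are made up for illustration. It is not the actual DataFusion code.

```rust
use std::collections::HashMap;

/// Name-only lookup: returns the position of the FIRST column with this name.
/// (Hypothetical helper imitating the first-match behaviour described above.)
fn first_index_of(names: &[&str], name: &str) -> Option<usize> {
    names.iter().position(|n| *n == name)
}

/// Builds a reverse-alias map keyed by (output name, output index), but
/// derives the index from a name-only lookup. Duplicate names therefore
/// produce the same key, and the later entry overwrites the earlier one.
fn reverse_alias_by_name(
    output_names: &[&str],
    sources: &[&str],
) -> HashMap<(String, usize), String> {
    let mut map = HashMap::new();
    for (name, source) in output_names.iter().zip(sources) {
        if let Some(idx) = first_index_of(output_names, name) {
            // For a duplicated name, idx is always the first occurrence.
            map.insert((name.to_string(), idx), source.to_string());
        }
    }
    map
}
```

With output names `["id", "name", "id"]` and sources `["left.id", "left.name", "right.id"]`, the map ends up with two entries instead of three: the key `("id", 0)` points at `right.id`, and there is no entry for `("id", 2)`.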

This code path is not exercised through normal SQL because the logical optimizer's PushDownFilter resolves qualified column references and pushes filters below projections before the physical plan is created. However, it affects any code that constructs physical plans directly.

What changes are included in this PR?

  1. collect_reverse_alias: Use enumerate() index instead of column_with_name(). Projection expressions are positionally aligned with the output schema, so idx is the correct output column index.

  2. gather_filters_for_pushdown: Replace FilterRemapper::try_remap (which uses index_of) with direct validation against the alias map's exact (name, index) keys. The PhysicalColumnRewriter already does an exact-key lookup, so try_remap was both redundant and wrong for this case.
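
Both changes can be sketched in the same standard-library-only setting. `reverse_alias_by_position` and `can_push_down` are hypothetical names chosen for this illustration, and the filter is reduced to a list of the `(name, index)` columns it references. The real implementation works on DataFusion's physical expressions, so treat this as an approximation of the idea rather than the patch itself.

```rust
use std::collections::HashMap;

/// Sketch of change 1: take the output index from enumerate(), relying on
/// the projection expressions being positionally aligned with the output
/// schema. Every output column gets its own (name, index) key.
fn reverse_alias_by_position(
    output_names: &[&str],
    sources: &[&str],
) -> HashMap<(String, usize), String> {
    output_names
        .iter()
        .zip(sources)
        .enumerate()
        .map(|(idx, (name, source))| ((name.to_string(), idx), source.to_string()))
        .collect()
}

/// Sketch of change 2: a filter is eligible for pushdown only if every
/// column it references has an exact (name, index) entry in the alias map.
/// No name-only fallback is attempted.
fn can_push_down(
    filter_columns: &[(&str, usize)],
    alias_map: &HashMap<(String, usize), String>,
) -> bool {
    filter_columns
        .iter()
        .all(|(name, idx)| alias_map.contains_key(&(name.to_string(), *idx)))
}
```

With the same inputs as before (`["id", "name", "id"]` mapped from `["left.id", "left.name", "right.id"]`), the map now has three entries, `("id", 2)` resolves to `right.id`, and a filter referencing a `(name, index)` pair that does not exist, such as `("id", 1)`, is rejected instead of being remapped.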

Are these changes tested?

Yes. A regression test is added that constructs the exact physical plan structure triggering the bug (FilterExec → ProjectionExec with duplicate column names → HashJoinExec), runs the FilterPushdown optimizer, and verifies the optimized plan returns correct results (3 rows instead of the previous 0).

Are there any user-facing changes?

No API changes. Fixes incorrect query results for physical plans with duplicate column names in projections.

… with duplicate column names

collect_reverse_alias used column_with_name() which returns the first
match for a given name. When a projection has duplicate column names
(e.g. two "id" columns from both sides of a join), both entries get
the same HashMap key and the second silently overwrites the first.

gather_filters_for_pushdown used FilterRemapper::try_remap which calls
index_of(), also returning the first match. A filter on id@2 (second
occurrence) gets silently rewritten to id@0.

Fix 1: Use enumerate index in collect_reverse_alias instead of
column_with_name. Projection expressions are positionally aligned
with the output schema.

Fix 2: Replace FilterRemapper::try_remap with direct validation
against the alias map's exact (name, index) keys via collect_columns.

github-actions bot added the core (Core DataFusion crate) and physical-plan (Changes to the physical-plan crate) labels on Mar 30, 2026

Development

Successfully merging this pull request may close these issues.

FilterPushdown physical optimizer incorrectly remaps filters through ProjectionExec with duplicate column names
