
Conversation


@kosiew kosiew commented Dec 29, 2025

Which issue does this PR close?

Rationale for this change

DataFusion’s Parquet row-level filter pushdown previously rejected all nested Arrow types (lists/structs), which prevented common and performance-sensitive filters on list columns (for example array_has, array_has_all, array_has_any) from being evaluated during Parquet decoding.

Enabling safe pushdown for a small, well-defined set of list-aware predicates allows Parquet decoding to apply these filters earlier, reducing materialization work and improving scan performance, while still keeping unsupported nested projections (notably structs) evaluated after batches are materialized.

What changes are included in this PR?

  • Treat a small registry of list-aware predicates as pushdown-compatible:

    • array_has, array_has_all, array_has_any
    • IS NULL / IS NOT NULL
  • Introduce supported_predicates module to detect whether an expression tree contains supported list predicates.

  • Update Parquet filter candidate selection to:

    • Accept list columns only when the predicate semantics are supported.
    • Continue rejecting struct columns (and other unsupported nested types).
  • Switch Parquet projection mask construction from root indices to leaf indices (ProjectionMask::leaves) so nested list filters project the correct leaf columns for decoding-time evaluation.

  • Expand root column indices to leaf indices for nested columns using the Parquet SchemaDescriptor.

  • Add unit tests verifying:

    • List columns are accepted for pushdown when used by supported predicates.
    • Struct columns (and mixed struct+primitive predicates) prevent pushdown.
    • array_has, array_has_all, array_has_any actually filter rows during decoding using a temp Parquet file.
  • Add sqllogictest coverage proving both correctness and plan behavior:

    • Queries return expected results.
    • EXPLAIN shows predicates pushed into DataSourceExec for Parquet.
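The acceptance check described above can be sketched as a small gate on the function name. This is an illustrative sketch only: the actual implementation walks a `PhysicalExpr` tree, and the function name and signature here are hypothetical, not DataFusion's API.

```rust
/// Illustrative sketch of the pushdown gate described above. The real
/// code inspects a PhysicalExpr tree; these names are hypothetical.
fn is_supported_list_predicate(func_name: &str) -> bool {
    // Only these array functions are considered safe to evaluate during
    // Parquet decoding (IS NULL / IS NOT NULL are handled as separate
    // expression variants, not by function name).
    matches!(func_name, "array_has" | "array_has_all" | "array_has_any")
}

fn main() {
    assert!(is_supported_list_predicate("array_has_all"));
    assert!(!is_supported_list_predicate("array_length"));
}
```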

Are these changes tested?

Yes.

  • Rust unit tests in datafusion/datasource-parquet/src/row_filter.rs:

    • Validate pushdown eligibility for list vs struct predicates.
    • Create a temp Parquet file and confirm list predicates prune/match the expected rows via Parquet decoding row filtering.
  • SQL logic tests in datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:

    • Add end-to-end coverage for array_has, array_has_all, array_has_any and combinations (OR / AND with other predicates).
    • Confirm pushdown appears in the physical plan (DataSourceExec ... predicate=...).

Are there any user-facing changes?

Yes.

  • Parquet filter pushdown now supports list columns for the following predicates:

    • array_has, array_has_all, array_has_any
    • IS NULL, IS NOT NULL

This can improve query performance for workloads that filter on array/list columns.

No breaking changes are introduced; unsupported nested types (for example structs) continue to be evaluated after decoding.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Document supported nested pushdown semantics and update
row-level predicate construction to utilize leaf-based
projection masks. Enable list-aware predicates like
array_has_all while maintaining unsupported nested
structures on the fallback path.

Expand filter candidate building for root and leaf
projections of nested columns, facilitating cost
estimation and mask creation aligned with Parquet leaf
layouts. Include struct/list pushdown checks and add a
new integration test to validate array_has_all
pushdown behavior against Parquet row filters.
Introduce dev dependencies for nested function helpers
and temporary file creation used in the new tests.
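The root-to-leaf expansion this message refers to can be illustrated with a toy function. It is a sketch under the assumption that each root column owns a contiguous run of Parquet leaf columns; `leaf_counts` is a hypothetical stand-in for what the real code derives from the Parquet `SchemaDescriptor`.

```rust
// Toy sketch of root-to-leaf index expansion. The real code reads leaf
// runs from the Parquet SchemaDescriptor; here `leaf_counts[i]` stands in
// for "number of Parquet leaf columns under root column i".
fn expand_roots_to_leaves(root_indices: &[usize], leaf_counts: &[usize]) -> Vec<usize> {
    // Precompute the first leaf index owned by each root column.
    let mut starts = Vec::with_capacity(leaf_counts.len());
    let mut next = 0;
    for &n in leaf_counts {
        starts.push(next);
        next += n;
    }
    // Each root expands to its contiguous run of leaf indices.
    root_indices
        .iter()
        .flat_map(|&r| starts[r]..starts[r] + leaf_counts[r])
        .collect()
}

fn main() {
    // Roots: [Int64 (1 leaf), List<Int64> (1 leaf), Struct{a, b} (2 leaves)]
    assert_eq!(expand_roots_to_leaves(&[2], &[1, 1, 2]), vec![2, 3]);
    assert_eq!(expand_roots_to_leaves(&[0, 1], &[1, 1, 2]), vec![0, 1]);
}
```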

Extract supports_list_predicates() into its own module and
create a SUPPORTED_ARRAY_FUNCTIONS constant registry for
improved management. Add is_supported_list_predicate()
helper function for easier extensibility, along with
comprehensive documentation and unit tests.

Refactor check_single_column() using intermediate variables
to clarify logic for handling structs and unsupported lists.
Introduce a new test case for mixed primitive and struct
predicates to ensure proper functionality and validation
of pushable predicates.

Extract common test logic into test_array_predicate_pushdown helper
function to reduce duplication and ensure parity across all three
supported array functions (array_has, array_has_all, array_has_any).

This makes it easier to maintain and extend test coverage for new
array functions in the future.

Benefits:
- Reduces code duplication from ~70 lines × 3 to ~10 lines × 3
- Ensures consistent test methodology across all array functions
- Clear documentation of expected behavior for each function
- Easier to add tests for new supported functions

Add detailed rustdoc examples to can_expr_be_pushed_down_with_schemas()
showing three key scenarios:

1. Primitive column filters (allowed) - e.g., age > 30
2. Struct column filters (blocked) - e.g., person IS NOT NULL
3. List column filters with supported predicates (allowed) -
   e.g., array_has_all(tags, ['rust'])

These examples help users understand when filter pushdown to the
Parquet decoder is available and guide them in writing efficient
queries.

Benefits:
- Clear documentation of supported and unsupported cases
- Helps users optimize query performance
- Provides copy-paste examples for common patterns
- Updated to reflect new list column support

- Replace 'while let Some(batch) = reader.next()' with idiomatic 'for batch in reader'
- Remove unnecessary mut from reader variable
- Addresses clippy::while_let_on_iterator warning
- Document function name detection assumption in supported_predicates
  - Note reliance on exact string matching
  - Suggest trait-based approach for future robustness
- Explain ProjectionMask::leaves() choice for nested columns
  - Clarify why leaf indices are needed for nested structures
  - Helps reviewers understand Parquet schema descriptor usage

These comments address Low Priority suggestions from code review,
improving maintainability and onboarding for future contributors.

Remove SUPPORTED_ARRAY_FUNCTIONS array. Introduce dedicated predicate
functions for NULL checks and scalar function support. Utilize
pattern matching with matches! macro instead of array lookups.
Enhance code clarity and idiomatic Rust usage with is_some_and()
for condition checks and simplify recursion using a single
expression.
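The `matches!` and `is_some_and()` idioms this message mentions look roughly like the following sketch (the function name and the optional-name representation are illustrative, not DataFusion's actual API):

```rust
// Sketch only: an optional scalar-function name checked with
// is_some_and() plus matches!, replacing the removed array lookup.
fn scalar_fn_supported(name: Option<&str>) -> bool {
    name.is_some_and(|n| matches!(n, "array_has" | "array_has_all" | "array_has_any"))
}

fn main() {
    assert!(scalar_fn_supported(Some("array_has_any")));
    assert!(!scalar_fn_supported(Some("array_length")));
    assert!(!scalar_fn_supported(None));
}
```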

Extract helper functions to reduce code duplication in
array pushdown and physical plan tests. Consolidate similar
assertions and checks, simplifying tests from ~50 to ~30
lines. Transform display tests into a single parameterized
test, maintaining coverage while eliminating repeated code.

…monstrations"

This reverts commit 94f1a99cee4e44e5176450156a684a2316af78e1.

Extract handle_nested_type() to encapsulate logic for
determining if a nested type prevents pushdown. Introduce
is_nested_type_supported() to isolate type checking for
List/LargeList/FixedSizeList and predicate support.
Simplify check_single_column() by reducing nesting depth
and delegating nested type logic to helper methods.
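The is_nested_type_supported() split described above might look like this simplified sketch, using a toy enum in place of Arrow's `DataType` (all names here are illustrative):

```rust
// Toy stand-in for Arrow's DataType; only the variants relevant to the
// pushdown decision are modeled.
enum ToyDataType {
    Int64,
    List,
    LargeList,
    FixedSizeList,
    Struct,
}

// Returns true if a column of this type may participate in pushdown,
// given whether the surrounding predicate is one of the supported
// list-aware predicates.
fn is_nested_type_supported(dt: &ToyDataType, predicate_is_list_aware: bool) -> bool {
    match dt {
        // List types are eligible only under a supported predicate.
        ToyDataType::List | ToyDataType::LargeList | ToyDataType::FixedSizeList => {
            predicate_is_list_aware
        }
        // Structs always fall back to post-decode evaluation.
        ToyDataType::Struct => false,
        // Primitive columns are always eligible.
        _ => true,
    }
}

fn main() {
    assert!(is_nested_type_supported(&ToyDataType::List, true));
    assert!(!is_nested_type_supported(&ToyDataType::List, false));
    assert!(!is_nested_type_supported(&ToyDataType::Struct, true));
    assert!(is_nested_type_supported(&ToyDataType::Int64, false));
}
```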
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels Dec 29, 2025
@kosiew kosiew marked this pull request as ready for review December 29, 2025 14:15
@kosiew kosiew requested a review from zhuqi-lucas December 29, 2025 14:53


Development

Successfully merging this pull request may close these issues.

Support nested datatype filter pushdown to parquet
