Parquet: Push down supported list predicates (array_has/any/all) during decoding #19545
Open
kosiew wants to merge 22 commits into apache:main from kosiew:nested-filter-18560
+657
−38
Document supported nested pushdown semantics and update row-level predicate construction to utilize leaf-based projection masks. Enable list-aware predicates like array_has_all while maintaining unsupported nested structures on the fallback path. Expand filter candidate building for root and leaf projections of nested columns, facilitating cost estimation and mask creation aligned with Parquet leaf layouts. Include struct/list pushdown checks and add a new integration test to validate array_has_all pushdown behavior against Parquet row filters. Introduce dev dependencies for nested function helpers and temporary file creation used in the new tests.
Extract supports_list_predicates() into its own module and create a SUPPORTED_ARRAY_FUNCTIONS constant registry for improved management. Add is_supported_list_predicate() helper function for easier extensibility, along with comprehensive documentation and unit tests. Refactor check_single_column() using intermediate variables to clarify logic for handling structs and unsupported lists. Introduce a new test case for mixed primitive and struct predicates to ensure proper functionality and validation of pushable predicates.
Extract common test logic into test_array_predicate_pushdown helper function to reduce duplication and ensure parity across all three supported array functions (array_has, array_has_all, array_has_any). This makes it easier to maintain and extend test coverage for new array functions in the future.
Benefits:
- Reduces code duplication from ~70 lines × 3 to ~10 lines × 3
- Ensures consistent test methodology across all array functions
- Clear documentation of expected behavior for each function
- Easier to add tests for new supported functions
Add detailed rustdoc examples to can_expr_be_pushed_down_with_schemas() showing three key scenarios:
1. Primitive column filters (allowed) - e.g., age > 30
2. Struct column filters (blocked) - e.g., person IS NOT NULL
3. List column filters with supported predicates (allowed) - e.g., array_has_all(tags, ['rust'])
These examples help users understand when filter pushdown to the Parquet decoder is available and guide them in writing efficient queries.
Benefits:
- Clear documentation of supported and unsupported cases
- Helps users optimize query performance
- Provides copy-paste examples for common patterns
- Updated to reflect new list column support
- Replace 'while let Some(batch) = reader.next()' with idiomatic 'for batch in reader'
- Remove unnecessary mut from reader variable
- Addresses clippy::while_let_on_iterator warning
- Document function name detection assumption in supported_predicates
  - Note reliance on exact string matching
  - Suggest trait-based approach for future robustness
- Explain ProjectionMask::leaves() choice for nested columns
  - Clarify why leaf indices are needed for nested structures
  - Helps reviewers understand Parquet schema descriptor usage
These comments address Low Priority suggestions from code review, improving maintainability and onboarding for future contributors.
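To illustrate why leaf indices matter for nested columns, here is a minimal sketch of expanding root field indices into leaf column indices. It does not use the parquet crate: ToySchema, its leaf_to_root field, and leaves_for_roots are hypothetical names invented for this example, and the sample schema is an assumption. The resulting leaf indices are the kind of input a leaf-based projection mask would take.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a Parquet schema descriptor: for each leaf
/// column (in leaf order), the index of the root field it belongs to.
struct ToySchema {
    leaf_to_root: Vec<usize>,
}

impl ToySchema {
    /// Expand root field indices into the leaf indices stored beneath them.
    /// The result is in ascending leaf order.
    fn leaves_for_roots(&self, roots: &[usize]) -> Vec<usize> {
        let wanted: BTreeSet<usize> = roots.iter().copied().collect();
        self.leaf_to_root
            .iter()
            .enumerate()
            .filter(|(_, root)| wanted.contains(*root))
            .map(|(leaf, _)| leaf)
            .collect()
    }
}

fn main() {
    // Assumed schema for illustration:
    //   root 0: id      (Int64)             -> leaf 0
    //   root 1: person  (Struct<name, age>) -> leaves 1 and 2
    //   root 2: tags    (List<Utf8>)        -> leaf 3
    let schema = ToySchema {
        leaf_to_root: vec![0, 1, 1, 2],
    };

    // The root index of `tags` (2) differs from its leaf index (3), so a
    // mask built from root indices would select the wrong leaf column.
    assert_eq!(schema.leaves_for_roots(&[2]), vec![3]);

    // A struct root expands to every leaf underneath it.
    assert_eq!(schema.leaves_for_roots(&[0, 1]), vec![0, 1, 2]);

    println!("{:?}", schema.leaves_for_roots(&[2]));
}
```

Once a struct precedes a list in the schema, root and leaf numbering diverge, which is the situation the comments on ProjectionMask::leaves() are meant to explain.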
Remove SUPPORTED_ARRAY_FUNCTIONS array. Introduce dedicated predicate functions for NULL checks and scalar function support. Utilize pattern matching with matches! macro instead of array lookups. Enhance code clarity and idiomatic Rust usage with is_some_and() for condition checks and simplify recursion using a single expression.
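The shape of this refactor can be sketched roughly as follows. This is not the code from the PR: the Expr enum is a toy expression tree rather than DataFusion's physical expression types, and is_supported_array_function and the traversal rules are assumptions made for illustration. Only the function names array_has, array_has_all, array_has_any and the name supports_list_predicates come from the commit messages.

```rust
/// Toy expression tree used only for this sketch (not DataFusion's types).
#[allow(dead_code)]
enum Expr {
    Column(String),
    Literal(String),
    ScalarFunction { name: String, args: Vec<Expr> },
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Name check via pattern matching instead of an array lookup.
/// Relies on exact string matching, as the commit notes point out.
fn is_supported_array_function(name: &str) -> bool {
    matches!(name, "array_has" | "array_has_all" | "array_has_any")
}

/// Returns true if the tree contains at least one supported list predicate.
fn supports_list_predicates(expr: &Expr) -> bool {
    match expr {
        Expr::IsNull(_) | Expr::IsNotNull(_) => true,
        Expr::ScalarFunction { name, args } => {
            is_supported_array_function(name) || args.iter().any(supports_list_predicates)
        }
        Expr::And(left, right) => {
            supports_list_predicates(left) || supports_list_predicates(right)
        }
        Expr::Column(_) | Expr::Literal(_) => false,
    }
}

fn main() {
    // array_has_all(tags, ['rust'])
    let supported = Expr::ScalarFunction {
        name: "array_has_all".to_string(),
        args: vec![
            Expr::Column("tags".to_string()),
            Expr::Literal("['rust']".to_string()),
        ],
    };
    assert!(supports_list_predicates(&supported));

    // A function name outside the supported set is not detected.
    let unsupported = Expr::ScalarFunction {
        name: "array_length".to_string(),
        args: vec![Expr::Column("tags".to_string())],
    };
    assert!(!supports_list_predicates(&unsupported));
}
```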
Extract helper functions to reduce code duplication in array pushdown and physical plan tests. Consolidate similar assertions and checks, simplifying tests from ~50 to ~30 lines. Transform display tests into a single parameterized test, maintaining coverage while eliminating repeated code.
…monstrations" This reverts commit 94f1a99cee4e44e5176450156a684a2316af78e1.
Extract handle_nested_type() to encapsulate logic for determining if a nested type prevents pushdown. Introduce is_nested_type_supported() to isolate type checking for List/LargeList/FixedSizeList and predicate support. Simplify check_single_column() by reducing nesting depth and delegating nested type logic to helper methods.
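A rough sketch of how such a split might look is shown below. The DataType enum is a simplified stand-in for Arrow's type, and the bodies of is_nested_type_supported and nested_type_prevents_pushdown (a hypothetical name standing in for handle_nested_type) are assumptions; the actual helpers in the PR may take different arguments and return different types.

```rust
/// Simplified stand-in for Arrow's DataType, for illustration only.
#[allow(dead_code)]
enum DataType {
    Int64,
    Utf8,
    List(Box<DataType>),
    LargeList(Box<DataType>),
    FixedSizeList(Box<DataType>, usize),
    Struct(Vec<(String, DataType)>),
}

/// True for the list-like types named in the commit message.
fn is_list(data_type: &DataType) -> bool {
    matches!(
        data_type,
        DataType::List(_) | DataType::LargeList(_) | DataType::FixedSizeList(_, _)
    )
}

/// A nested type is supported only if it is list-like AND the expression
/// uses a supported list predicate.
fn is_nested_type_supported(data_type: &DataType, has_supported_predicate: bool) -> bool {
    is_list(data_type) && has_supported_predicate
}

/// True when the column's type should keep the filter on the fallback path.
fn nested_type_prevents_pushdown(data_type: &DataType, has_supported_predicate: bool) -> bool {
    let is_nested = is_list(data_type) || matches!(data_type, DataType::Struct(_));
    is_nested && !is_nested_type_supported(data_type, has_supported_predicate)
}

fn main() {
    let tags = DataType::List(Box::new(DataType::Utf8));
    let person = DataType::Struct(vec![("name".to_string(), DataType::Utf8)]);

    // List column with a supported predicate: pushdown is allowed.
    assert!(!nested_type_prevents_pushdown(&tags, true));
    // List column without one: blocked.
    assert!(nested_type_prevents_pushdown(&tags, false));
    // Struct columns are always blocked in this sketch.
    assert!(nested_type_prevents_pushdown(&person, true));
    // Primitive columns are never blocked by the nested-type check.
    assert!(!nested_type_prevents_pushdown(&DataType::Int64, false));
}
```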
Which issue does this PR close?
Rationale for this change
DataFusion’s Parquet row-level filter pushdown previously rejected all nested Arrow types (lists/structs), which prevented common and performance-sensitive filters on list columns (for example array_has, array_has_all, array_has_any) from being evaluated during Parquet decoding.
Enabling safe pushdown for a small, well-defined set of list-aware predicates allows Parquet decoding to apply these filters earlier, reducing materialization work and improving scan performance, while still keeping unsupported nested projections (notably structs) evaluated after batches are materialized.
What changes are included in this PR?
- Allow a registry of list-aware predicates to be considered pushdown-compatible:
  - array_has, array_has_all, array_has_any
  - IS NULL / IS NOT NULL
- Introduce supported_predicates module to detect whether an expression tree contains supported list predicates.
- Update Parquet filter candidate selection to:
- Switch Parquet projection mask construction from root indices to leaf indices (ProjectionMask::leaves) so nested list filters project the correct leaf columns for decoding-time evaluation.
- Expand root column indices to leaf indices for nested columns using the Parquet SchemaDescriptor.
- Add unit tests verifying:
  - array_has, array_has_all, array_has_any actually filter rows during decoding using a temp Parquet file.
- Add sqllogictest coverage proving both correctness and plan behavior:
  - EXPLAIN shows predicates pushed into DataSourceExec for Parquet.
Are these changes tested?
Yes.
- Rust unit tests in datafusion/datasource-parquet/src/row_filter.rs:
- SQL logic tests in datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
  - array_has, array_has_all, array_has_any and combinations (OR / AND with other predicates).
  - DataSourceExec ... predicate=...).
Are there any user-facing changes?
Yes.
Parquet filter pushdown now supports list columns for the following predicates:
- array_has, array_has_all, array_has_any
- IS NULL, IS NOT NULL
This can improve query performance for workloads that filter on array/list columns.
No breaking changes are introduced; unsupported nested types (for example structs) continue to be evaluated after decoding.
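For readers unfamiliar with the three array predicates, the following sketch models what each one checks, using plain Rust slices instead of Arrow arrays. It is only an approximation of the semantics: NULL handling is deliberately ignored, and the rows and tag values are made up for the example.

```rust
/// True if `haystack` contains `needle`.
fn array_has<T: PartialEq>(haystack: &[T], needle: &T) -> bool {
    haystack.contains(needle)
}

/// True if every element of `needles` appears in `haystack`.
fn array_has_all<T: PartialEq>(haystack: &[T], needles: &[T]) -> bool {
    needles.iter().all(|n| haystack.contains(n))
}

/// True if at least one element of `needles` appears in `haystack`.
fn array_has_any<T: PartialEq>(haystack: &[T], needles: &[T]) -> bool {
    needles.iter().any(|n| haystack.contains(n))
}

fn main() {
    // Made-up rows of a `tags` list column.
    let rows: Vec<Vec<&str>> = vec![
        vec!["rust", "parquet"],
        vec!["arrow"],
        vec!["rust"],
    ];

    // Row indices that survive a filter like array_has_all(tags, ['rust']).
    let kept: Vec<usize> = rows
        .iter()
        .enumerate()
        .filter(|(_, tags)| array_has_all(tags.as_slice(), &["rust"]))
        .map(|(i, _)| i)
        .collect();
    assert_eq!(kept, vec![0, 2]);

    assert!(array_has(rows[1].as_slice(), &"arrow"));
    assert!(array_has_any(rows[0].as_slice(), &["arrow", "parquet"]));
    assert!(!array_has_all(rows[2].as_slice(), &["rust", "parquet"]));
}
```

With pushdown, a check of this kind runs while Parquet data is being decoded, so rows that fail it never need to be fully materialized.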
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.