SNOW-2895675: Skip aliases when source/destination column are identical #4037
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2895675
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
When an alias clause would emit an alias that does not change an input column's name, as in SELECT "A" AS "A", the alias is elided to just SELECT "A". This produces a substantial reduction in query size in some cases. For example, the SQL for the final query emitted by test_dataframe_join_suite.py::test_name_alias_on_multiple_join is reduced from 1044 B to 889 B, a reduction of about 15%. This fix was requested for SCOS; a sample query emulating a user workload similarly experiences a 6% reduction in query size.

Implementation Details
Though the original ask was specifically to implement this change for JOIN operations, this PR applies the optimization to all generated queries. It does so with changes in two locations:
- unary_expression_extractor in analyzer.py: Avoids emitting SQL for an alias where possible when traversing a query plan. This is the simplest location to make this change, as it avoids the need to track down all call sites that generate an Alias node.
- derive_column_states_from_subquery in select_statement.py: This method compares the analyzed query strings of expressions to see if their values have changed. Previously, aliases always fully resolved ("A" AS "A"), so aliased columns were always assigned the CHANGED_EXP state; now, since redundant aliases resolve to just the column name ("A"), this method assigns UNCHANGED_EXP instead. My understanding is that this behavior should be correct, but this produced bugs in nested joins where a top-level projection did not properly use an alias for an ambiguous column.

I could not track down the root cause of the aliasing issue, as it appeared even with simpler fixes like modifying _alias_if_needed in dataframe.py, and replacing analyzer calls with parse_local_name as suggested by comments within select_statement.py. Based on running through the codebase (and asking Cursor), it is likely that a subquery alias mapping is not being populated somewhere during analysis, but it is not clear what step of the analysis process would be responsible for this.

The fix I chose provides the largest benefit (removing redundant aliasing for all queries, not just joins) while adhering as closely as possible to previous behavior when analyzing column change states.
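
For reviewers who want the interaction in miniature, here is a rough sketch of the two pieces described above. It is not the code in this PR: quote, alias_sql, ColumnState, and column_state are hypothetical stand-ins, and only the state names CHANGED_EXP and UNCHANGED_EXP come from the description. It assumes columns are compared by their quoted SQL strings, and it only shows why eliding a redundant alias changes the result of a string-based state comparison; it does not show how the PR preserves the previous state assignment.

```python
from enum import Enum


def quote(name: str) -> str:
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def alias_sql(child_sql: str, alias: str) -> str:
    """Hypothetical helper: emit `child AS "alias"`, skipping no-op aliases."""
    quoted = quote(alias)
    if child_sql == quoted:
        # The alias would not rename the column, so emit the column alone.
        return child_sql
    return f"{child_sql} AS {quoted}"


class ColumnState(Enum):
    """Hypothetical two-state stand-in for the column change states."""
    UNCHANGED_EXP = "unchanged_exp"
    CHANGED_EXP = "changed_exp"


def column_state(projected_sql: str, source_column: str) -> ColumnState:
    """Hypothetical helper: compare analyzed SQL against the source column."""
    if projected_sql == quote(source_column):
        return ColumnState.UNCHANGED_EXP
    return ColumnState.CHANGED_EXP


if __name__ == "__main__":
    # A fully resolved alias differs textually from the bare column name.
    print(column_state('"A" AS "A"', "A"))
    # With the redundant alias elided, the strings match.
    print(column_state(alias_sql('"A"', "A"), "A"))
    # An alias that really renames the column is still emitted.
    print(alias_sql('"A"', "B"))
```

In this simplified model, the same projection moves from the changed state to the unchanged state purely because the emitted string changed, which is the behavior shift that surfaced in the nested join cases.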