@sfc-gh-joshi
Contributor

@sfc-gh-joshi sfc-gh-joshi commented Dec 16, 2025

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-2895675

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

When an alias clause would not change an input column's name, as in SELECT "A" AS "A", the alias is elided and the query emits just SELECT "A". This can substantially reduce query size. For example, the SQL for the final query emitted by test_dataframe_join_suite.py::test_name_alias_on_multiple_join shrinks from 1044 B to 889 B, a reduction of about 15%. This fix was requested for SCOS; a sample query emulating a user workload similarly sees a 6% reduction in query size.
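As a rough illustration of where the savings come from, the toy sketch below renders a projection with and without eliding no-op aliases. `render_column` and `render_select` are hypothetical helpers written for this example (they are not Snowpark functions), and identifier quoting is simplified.

```python
AS = " AS "


def render_column(origin: str, alias: str, elide: bool) -> str:
    """Render one projected column, optionally dropping an alias that renames nothing."""
    if elide and origin == alias:
        return origin
    return origin + AS + alias


def render_select(columns: list, elide: bool) -> str:
    """Join (origin, alias) pairs into a SELECT over a placeholder table `t`."""
    projection = ", ".join(render_column(o, a, elide) for o, a in columns)
    return f"SELECT {projection} FROM t"


# Two redundant aliases and one real rename (hypothetical column names).
columns = [('"A"', '"A"'), ('"B"', '"B"'), ('"C"', '"RENAMED"')]
before = render_select(columns, elide=False)
after = render_select(columns, elide=True)

print(before)
print(after)
print(len(before), "->", len(after))
```

Each elided alias saves the length of ` AS "<name>"`, so the benefit grows with the number of pass-through columns and with the nesting depth of the generated subqueries.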

Implementation Details

Though the original ask was specifically to implement this change for JOIN operations, this PR applies the optimization to all generated queries. It does so with changes in two locations:

  1. unary_expression_extractor in analyzer.py: Skips emitting the SQL for a redundant alias wherever possible while traversing a query plan. This is the simplest place to make the change, as it avoids the need to track down every call site that generates an Alias node.
  2. derive_column_states_from_subquery in select_statement.py: This method compares the analyzed query strings of expressions to see if their values have changed. Previously, aliases always fully resolved ("A" AS "A"), so aliased columns were always assigned the CHANGED_EXP state; now, since redundant aliases resolve to just the column name ("A"), this method assigns UNCHANGED_EXP instead. My understanding is that this behavior should be correct, but this produced bugs in nested joins where a top-level projection did not properly use an alias for an ambiguous column.
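The interaction described in the second item can be sketched with a toy model. Everything below is an assumption made for illustration: `analyze_alias` and `derive_state` are hypothetical stand-ins, the enum only borrows the two state names mentioned above, and the real derive_column_states does considerably more than a single string comparison.

```python
from enum import Enum

AS = " AS "


class ColumnChangeState(Enum):
    # Toy enum reusing only the two state names mentioned in this PR.
    UNCHANGED_EXP = "unchanged_exp"
    CHANGED_EXP = "changed_exp"


def analyze_alias(origin: str, alias: str, elide_redundant: bool) -> str:
    """Hypothetical stand-in for analyzing an Alias node into SQL text."""
    if elide_redundant and origin == alias:
        return origin
    return origin + AS + alias


def derive_state(analyzed_sql: str, subquery_columns: set) -> ColumnChangeState:
    """Toy rule: a column is unchanged if its SQL text is exactly a subquery column."""
    if analyzed_sql in subquery_columns:
        return ColumnChangeState.UNCHANGED_EXP
    return ColumnChangeState.CHANGED_EXP


subquery_columns = {'"A"', '"B"'}
old_sql = analyze_alias('"A"', '"A"', elide_redundant=False)
new_sql = analyze_alias('"A"', '"A"', elide_redundant=True)

print(old_sql, derive_state(old_sql, subquery_columns).name)
print(new_sql, derive_state(new_sql, subquery_columns).name)
```

Under this simplified rule, the same aliased column flips from CHANGED_EXP to UNCHANGED_EXP purely because its analyzed text changed, which is the side effect the second item has to account for.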

I could not track down the root cause of the aliasing issue, as it appeared even with simpler fixes such as modifying _alias_if_needed in dataframe.py or replacing analyzer calls with parse_local_name, as suggested by comments within select_statement.py. Based on reading through the codebase (and asking Cursor), it is likely that a subquery alias mapping is not being populated at some point during analysis, but it is not clear which step of the analysis process would be responsible for this.

The fix I chose provides the largest benefit (removing redundant aliasing for all queries, not just joins) while adhering as closely as possible to previous behavior when analyzing column change states.

@sfc-gh-joshi sfc-gh-joshi requested review from a team as code owners December 16, 2025 20:30
@github-actions github-actions bot added the local testing Local Testing issues/PRs label Dec 16, 2025
Comment on lines +770 to +771
isinstance(expr.child, (Attribute, UnresolvedAttribute))
and origin == quoted_name
Contributor

why not move this check here?

def alias_expression(origin: str, alias: str) -> str:
    return origin + AS + alias

Contributor Author

I felt it would be clearer to put this check outside the call to a direct SQL generation function. IMO it would be very unintuitive if a call to alias_expression had control logic that could emit SQL that did not represent an actual alias expression.
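To make the two placements concrete, here is a minimal sketch. Only the body of `alias_expression` comes from the thread; `alias_expression_with_check` and `analyze_alias` are hypothetical names, the value of `AS` is assumed, and the `child_is_attribute` flag stands in for the `isinstance(expr.child, (Attribute, UnresolvedAttribute))` test in the diff.

```python
AS = " AS "  # assumed value of the constant used in the snippet above


def alias_expression(origin: str, alias: str) -> str:
    """Pure SQL generation: always returns an alias expression."""
    return origin + AS + alias


def alias_expression_with_check(origin: str, alias: str) -> str:
    """Hypothetical version of the reviewer's suggestion: check inside the generator.

    The function name promises an alias expression, but it may return a bare column.
    """
    if origin == alias:
        return origin
    return origin + AS + alias


def analyze_alias(origin: str, alias: str, child_is_attribute: bool) -> str:
    """Hypothetical caller-side check, mirroring the placement chosen in the PR."""
    if child_is_attribute and origin == alias:
        return origin  # redundant alias: emit only the column name
    return alias_expression(origin, alias)


print(analyze_alias('"A"', '"A"', child_is_attribute=True))
print(analyze_alias('"A"', '"B"', child_is_attribute=True))
print(alias_expression('"A"', '"A"'))
```

With the check in the caller, `alias_expression` keeps a simple contract (its output always contains `AS`), and the decision to skip the alias stays next to the plan-node type check that justifies it.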
