From 3bab2d9de88cfff82e2b0f98fdf645c75e1ceecd Mon Sep 17 00:00:00 2001 From: lichi Date: Fri, 19 Dec 2025 14:45:21 +0800 Subject: [PATCH] [feature]support recusive cte --- docs/query-data/cte.md | 118 ++++++++++++++++- .../current/query-data/cte.md | 120 +++++++++++++++++- 2 files changed, 226 insertions(+), 12 deletions(-) diff --git a/docs/query-data/cte.md b/docs/query-data/cte.md index ad5be0b9141a6..2e9f679947614 100644 --- a/docs/query-data/cte.md +++ b/docs/query-data/cte.md @@ -39,12 +39,118 @@ SELECT b, d FROM cte1 JOIN cte2 WHERE cte1.a = cte2.c; ``` -### Recursive CTE (NOT Support) +### Recursive CTE +A recursive CTE (Common Table Expression with the `RECURSIVE` keyword) is used to express self-referential queries within a single SQL statement, and is commonly applied in scenarios such as tree/hierarchy traversal, graph traversal, and hierarchical aggregation. A recursive CTE consists of two parts: + +- **Anchor query**: The non-recursive part, executed once to generate the initial row set (seed). +- **Recursive query**: Can reference the CTE itself and continue generating new rows based on the new rows produced in the previous iteration. + +The anchor and recursive parts are typically connected by `UNION` or `UNION ALL`. Recursive execution continues until no new rows are generated or a system limit is reached. + +## Syntax + +```sql +WITH [RECURSIVE] cte_name [(col1, col2, ...)] AS ( + -- Non-recursive part (executed once) + UNION [ALL] + -- Recursive part that can reference cte_name +) +SELECT ... FROM cte_name; +``` + +Key Points: +- The `RECURSIVE` keyword allows the CTE definition to reference itself. +- The number of columns and their data types output by the anchor and recursive members must be strictly consistent. +- The `cte_name` can be referenced in the `recursive_query`, usually used in the form of a `JOIN`. + +## Execution Semantics (Iterative Model) + +Typical iterative execution flow: +1. 
Execute the `anchor_query`, write the results to the output set (Output), and use them as the work set (WorkSet) for the first iteration. +2. While the WorkSet is not empty: + - Execute the `recursive_query` with the WorkSet as its input to obtain `newRows`. + - If `UNION ALL` is used: Append `newRows` directly to the Output and use `newRows` as the WorkSet for the next iteration. + - If `UNION` (deduplication) is used: Remove from `newRows` any rows that already exist in the Output, then add only the remaining rows to the Output and to the next iteration's WorkSet. +3. Repeat step 2 until `newRows` is empty or the system limit is reached (the Doris session variable `cte_max_recursion_depth` limits the recursion depth, with a default value of 100; exceeding it raises an error). + +Recursion terminates normally when an iteration produces no new rows. If the maximum recursion depth is reached first, the query fails with an error. + +## UNION vs UNION ALL + +- `UNION ALL`: Retains duplicates and has low execution overhead (no deduplication required). Suitable when duplicates are acceptable or are handled by business logic. +- `UNION`: Implicitly performs deduplication, which adds sorting or hash-based deduplication overhead per iteration or globally. This cost is significant, especially with large data volumes. + +Recommendation: Prefer `UNION ALL` if the semantics allow it and duplicates can be post-processed at the application layer. 
+ +## Common Use Cases and SQL Examples + +### 1) Simple Hierarchy Traversal ```sql -WITH r_cte AS ( - SELECT 1 AS user_id, 2 as manager_id - UNION ALL - SELECT user_id, manager_id FROM r_cte INNER JOIN (SELECT 1 AS user_id, 2 as manager_id) t ON r_cte.manager_id = t.user_id +CREATE TABLE tree +( + id int, + parent_id int, + data varchar(100) +) DUPLICATE KEY (id) +DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ('replication_num' = '1'); + +INSERT INTO tree VALUES (0, NULL, 'ROOT'), (1, 0, 'Child_1'), (2, 0, 'Child_2'), (3, 1, 'Child_1_1'); + +WITH RECURSIVE search_tree AS ( + SELECT id, parent_id, data + FROM tree t + WHERE t.id = 0 +UNION ALL + SELECT t.id, t.parent_id, t.data + FROM tree t, search_tree st + WHERE t.parent_id = st.id +) +SELECT * FROM search_tree ORDER BY id; +``` + +### 2) Graph Traversal +```sql +CREATE TABLE graph +( + c_from int, + c_to int, + label varchar(100) +) DUPLICATE KEY (c_from) DISTRIBUTED BY HASH(c_from) BUCKETS 1 PROPERTIES ('replication_num' = '1'); + +INSERT INTO graph VALUES (1, 2, '1 -> 2'), (1, 3, '1 -> 3'), (2, 3, '2 -> 3'), (1, 4, '1 -> 4'), (4, 5, '4 -> 5'); + +WITH RECURSIVE search_graph AS ( + SELECT c_from, c_to, label FROM graph g +UNION ALL + SELECT g.c_from, g.c_to, g.label + FROM graph g, search_graph sg + WHERE g.c_from = sg.c_to ) -SELECT * FROM r_cte +SELECT DISTINCT * FROM search_graph ORDER BY c_from, c_to; +``` + +Note: Using `UNION` instead of `UNION ALL` here would perform deduplication in every iteration, which has high overhead. + +## Limitations of Recursive CTEs + +- The top-level operator of the internal query must be `UNION` or `UNION ALL`. +- Subqueries in the non-recursive part cannot reference the recursive CTE itself. +- Subqueries in the recursive part can only reference the recursive CTE once. +- If a subquery within the recursive part contains another nested subquery, the nested subquery cannot reference the recursive CTE. +- The data types of the output columns of a recursive CTE are determined by the output of the non-recursive subquery. 
An error is thrown if the data types of the recursive and non-recursive sides do not match; use an explicit `CAST` to make the two sides consistent. +- The session variable `cte_max_recursion_depth` limits the maximum recursion depth to prevent infinite loops (default value: 100). + +## Common Errors, Causes, and Solutions + +### 1. Error: Mismatched number of columns or data types between anchor and recursive members +- **Cause**: The number of columns or their data types in the `SELECT` clauses of the two parts are inconsistent. +- **Solution**: Ensure the number, order, and data types of columns on both sides are consistent. Use `CAST` or explicit column names if necessary. + +### 2. Error: Illegal self-reference in the anchor query +- **Cause**: The anchor query is not allowed to reference the CTE itself. +- **Solution**: Reference the CTE only in the recursive member, and check that the anchor query contains no reference to it. + +### 3. Error: Infinite recursion / Exceeded maximum recursion depth +- **Cause**: The recursion lacks a convergence condition, or the convergence condition is incorrect. +- **Solution**: Add a `WHERE` filter that makes the recursion converge, raise `cte_max_recursion_depth` if the query legitimately needs more iterations, or correct the query logic if it recurses infinitely. 
\ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/cte.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/cte.md index 1d01c3e87ff4a..bed85564dae24 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/cte.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/cte.md @@ -39,12 +39,120 @@ SELECT b, d FROM cte1 JOIN cte2 WHERE cte1.a = cte2.c; ``` -### 递归 CTE(Doris 不支持) +### 递归 CTE + +递归 CTE(Common Table Expression with `RECURSIVE`)用于在单条 SQL 内表达自引用的查询,常用于树/层级遍历、图遍历和分层聚合等场景。递归 CTE 由两部分组成: + +- 锚点(anchor)查询:非递归部分,执行一次生成初始行集(seed)。 +- 递归(recursive)查询:可以引用 CTE 本体,基于上轮产生的新行继续生成新行。 + +锚点与递归部分通常由 `UNION` 或 `UNION ALL` 连接。递归执行直到不再产生新行或达到系统限制。 + +## 语法 + +```sql +WITH [RECURSIVE] cte_name [(col1, col2, ...)] AS ( + -- 非递归部分(一次执行) + UNION [ALL] + -- 可引用 cte_name 的递归部分 +) +SELECT ... FROM cte_name; +``` + +要点: + +- `RECURSIVE` 关键字允许在 CTE 定义中引用自身。 +- 锚点和递归成员输出的列数和类型必须严格一致。 +- `recursive_query` 中能引用 `cte_name`,通常以 `JOIN`形式使用。 + +## 执行语义(迭代模型) + +典型迭代执行流程: + +1. 执行 `anchor_query`,将结果写入输出集合(Output)并作为首轮的工作集合(WorkSet)。 +2. 当 WorkSet 非空时: + - 用 WorkSet 作为 `recursive_query` 的输入,执行 `recursive_query`,得到 `newRows`。 + - 若使用 `UNION ALL`:直接将 `newRows` 追加到 Output,并把 `newRows` 作为下一轮的 WorkSet。 + - 若使用 `UNION`(去重):对 `newRows` 与已有 Output 做差集(去重),只将未出现的行加入 Output 与下一轮 WorkSet。 +3. 
重复步骤 2,直到 `newRows` 为空或触发系统预设的上限(Doris session变量限制递归深度`cte_max_recursion_depth` 默认值为100,超出会抛错)。 + +终止当当前轮没有新行被产生(或达到系统最大递归深度限制)。 + +## UNION vs UNION ALL + +- `UNION ALL`:保留重复,执行开销低(无需去重)。适用于允许重复或在后端由业务逻辑控制重复的场景。 +- `UNION`:隐含去重,会在每轮或全局增加排序/哈希去重开销,代价显著,尤其是在大数据量下。 + +建议:如果语义允许且能在应用层后处理重复,优先使用 `UNION ALL`。 + +## 常见用例与 SQL 示例 + +1) 简单层级遍历: ```sql -WITH r_cte AS ( - SELECT 1 AS user_id, 2 as manager_id - UNION ALL - SELECT user_id, manager_id FROM r_cte INNER JOIN (SELECT 1 AS user_id, 2 as manager_id) t ON r_cte.manager_id = t.user_id +CREATE TABLE tree +( + id int, + parent_id int, + data varchar(100) +) DUPLICATE KEY (id) +DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ('replication_num' = '1'); + +INSERT INTO tree VALUES (0, NULL, 'ROOT'), (1, 0, 'Child_1'), (2, 0, 'Child_2'), (3, 1, 'Child_1_1'); + +WITH RECURSIVE search_tree AS ( + SELECT id, parent_id, data + FROM tree t + WHERE t.id = 0 +UNION ALL + SELECT t.id, t.parent_id, t.data + FROM tree t, search_tree st + WHERE t.parent_id = st.id ) -SELECT * FROM r_cte +SELECT * FROM search_tree ORDER BY id; +``` + +2) 图遍历: + +```sql +CREATE TABLE graph +( + c_from int, + c_to int, + label varchar(100) +) DUPLICATE KEY (c_from) DISTRIBUTED BY HASH(c_from) BUCKETS 1 PROPERTIES ('replication_num' = '1'); + +INSERT INTO graph VALUES (1, 2, '1 -> 2'), (1, 3, '1 -> 3'), (2, 3, '2 -> 3'), (1, 4, '1 -> 4'), (4, 5, '4 -> 5'); + +WITH RECURSIVE search_graph AS ( + SELECT c_from, c_to, label FROM graph g +UNION ALL + SELECT g.c_from, g.c_to, g.label + FROM graph g, search_graph sg + WHERE g.c_from = sg.c_to +) +SELECT DISTINCT * FROM search_graph ORDER BY c_from, c_to; +``` + +注意:使用 `UNION` 会在每轮做去重,代价较高。 + +## 递归CTE的限制 + +- 内部查询顶层操作符必须是UNION(ALL)。 +- 非递归部分的子查询不能引用递归CTE自身。 +- 递归部分的子查询只能引用递归CTE一次。 +- 递归部分的子查询如果内部还有子查询,那内部的子查询不能引用递归CTE。 +- 递归CTE的输出列类型由非递归侧的子查询输出决定,如果递归侧和非递归侧类型不一致,会报错。需要人工cast保证两边数据类型一致。 +- session变量`cte_max_recursion_depth`,限制最大递归次数,防止死循环,默认值是100。 + +## 常见错误、原因与解决办法 + +1. 
错误:锚点与递归成员列数或类型不匹配 + - 原因:两部分 `SELECT` 列数或列类型不一致。 + - 解决:确保两侧列数、顺序与类型一致,必要时使用 `CAST` 或显式列名。 +2. 错误:锚点引用自身(非法) + - 原因:锚点不允许引用 CTE 本体。 + - 解决:只在递归成员中引用 CTE;检查语法/解析树。 +3. 错误:无限递归 / 超过最大递归深度 + - 原因:递归没有收敛条件或收敛条件设置不正确。 + - 解决:添加`WHERE` 过滤、或调整系统最大递归深度;若逻辑确实无限递归需修正查询逻辑。