Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
118 changes: 112 additions & 6 deletions docs/query-data/cte.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,12 +39,118 @@ SELECT b, d FROM cte1 JOIN cte2
WHERE cte1.a = cte2.c;
```

### Recursive CTE (NOT Support)
### Recursive CTE

A recursive CTE (Common Table Expression with the `RECURSIVE` keyword) is used to express self-referential queries within a single SQL statement, and is commonly applied in scenarios such as tree/hierarchy traversal, graph traversal, and hierarchical aggregation. A recursive CTE consists of two parts:

- **Anchor query**: The non-recursive part, executed once to generate the initial row set (seed).
- **Recursive query**: Can reference the CTE itself and continue generating new rows based on the new rows produced in the previous iteration.

The anchor and recursive parts are typically connected by `UNION` or `UNION ALL`. Recursive execution continues until no new rows are generated or a system limit is reached.

## Syntax

```sql
WITH [RECURSIVE] cte_name [(col1, col2, ...)] AS (
<anchor_query> -- Non-recursive part (executed once)
UNION [ALL]
<recursive_query> -- Recursive part that can reference cte_name
)
SELECT ... FROM cte_name;
```

Key Points:
- The `RECURSIVE` keyword allows the CTE definition to reference itself.
- The number of columns and their data types output by the anchor and recursive members must be strictly consistent.
- The `cte_name` can be referenced in the `recursive_query`, usually used in the form of a `JOIN`.

## Execution Semantics (Iterative Model)

Typical iterative execution flow:
1. Execute the `anchor_query`, write the results to the output set (Output), and use them as the work set (WorkSet) for the first iteration.
2. While the WorkSet is not empty:
- Use the WorkSet as input for the `recursive_query`, execute the `recursive_query`, and obtain `newRows`.
- If `UNION ALL` is used: Directly append `newRows` to the Output and set `newRows` as the WorkSet for the next iteration.
- If `UNION` (deduplication) is used: Compute the difference set between `newRows` and the existing Output (to remove duplicates), and only add the non-duplicate rows to the Output and the next iteration's WorkSet.
3. Repeat step 2 until `newRows` is empty or a preset system upper limit is triggered (the Doris session variable `cte_max_recursion_depth` limits the recursion depth, with a default value of 100; exceeding this will throw an error).

Termination occurs when no new rows are generated in the current iteration (or the system's maximum recursion depth limit is reached).

## UNION vs UNION ALL

- `UNION ALL`: Retains duplicates and has low execution overhead (no deduplication required). Suitable for scenarios where duplicates are allowed or controlled by business logic in the backend.
- `UNION`: Implicitly performs deduplication, which adds sorting/hash-based deduplication overhead per iteration or globally—this cost is significant, especially with large data volumes.

Recommendation: Prefer `UNION ALL` if the semantics allow it and duplicates can be post-processed at the application layer.

## Common Use Cases and SQL Examples

### 1) Simple Hierarchy Traversal
```sql
WITH r_cte AS (
SELECT 1 AS user_id, 2 as manager_id
UNION ALL
SELECT user_id, manager_id FROM r_cte INNER JOIN (SELECT 1 AS user_id, 2 as manager_id) t ON r_cte.manager_id = t.user_id
CREATE TABLE tree
(
id int,
parent_id int,
data varchar(100)
) DUPLICATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ('replication_num' = '1');

INSERT INTO tree VALUES (0, NULL, 'ROOT'), (1, 0, 'Child_1'), (2, 0, 'Child_2'), (3, 1, 'Child_1_1');

WITH RECURSIVE search_tree AS (
SELECT id, parent_id, data
FROM tree t
WHERE t.id = 0
UNION ALL
SELECT t.id, t.parent_id, t.data
FROM tree t, search_tree st
WHERE t.parent_id = st.id
)
SELECT * FROM search_tree ORDER BY id;
```

### 2) Graph Traversal
```sql
CREATE TABLE graph
(
c_from int,
c_to int,
label varchar(100)
) DUPLICATE KEY (c_from) DISTRIBUTED BY HASH(c_from) BUCKETS 1 PROPERTIES 'replication_num' = '1';

INSERT INTO graph VALUES (1, 2, '1 -> 2'), (1, 3, '1 -> 3'), (2, 3, '2 -> 3'), (1, 4, '1 -> 4'), (4, 5, '4 -> 5');

WITH RECURSIVE search_graph AS (
SELECT c_from, c_to, label FROM graph g
UNION ALL
SELECT g.c_from, g.c_to, g.label
FROM graph g, search_graph sg
WHERE g.c_from = sg.c_to
)
SELECT * FROM r_cte
SELECT DISTINCT * FROM search_graph ORDER BY c_from, c_to;
```

Note: Using `UNION` performs deduplication in each iteration, resulting in high overhead.

## Limitations of Recursive CTEs

- The top-level operator of the internal query must be UNION(ALL).
- Subqueries in the non-recursive part cannot reference the recursive CTE itself.
- Subqueries in the recursive part can only reference the recursive CTE once.
- If a subquery within the recursive part contains another nested subquery, the nested subquery cannot reference the recursive CTE.
- The data types of the output columns of a recursive CTE are determined by the output of the non-recursive subquery. An error will be thrown if the data types of the recursive and non-recursive sides do not match—manual casting is required to ensure consistency between the two sides.
- The session variable `cte_max_recursion_depth` limits the maximum number of recursions to prevent infinite loops (default value: 100).

## Common Errors, Causes, and Solutions

### 1. Error: Mismatched number of columns or data types between anchor and recursive members
- **Cause**: The number of columns or their data types in the `SELECT` clauses of the two parts are inconsistent.
- **Solution**: Ensure the number, order, and data types of columns on both sides are consistent. Use `CAST` or explicit column names if necessary.

### 2. Error: Illegal self-reference in the anchor query
- **Cause**: The anchor query is not allowed to reference the CTE itself.
- **Solution**: Reference the CTE only in the recursive member; check the syntax/parse tree.

### 3. Error: Infinite recursion / Exceeded maximum recursion depth
- **Cause**: The recursion lacks a convergence condition or the convergence condition is incorrectly configured.
- **Solution**: Add a `WHERE` filter, adjust the system's maximum recursion depth, or correct the query logic if infinite recursion is inherent to the logic.
120 changes: 114 additions & 6 deletions i18n/zh-CN/docusaurus-plugin-content-docs/current/query-data/cte.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,12 +39,120 @@ SELECT b, d FROM cte1 JOIN cte2
WHERE cte1.a = cte2.c;
```

### 递归 CTE(Doris 不支持)
### 递归 CTE

递归 CTE(Common Table Expression with `RECURSIVE`)用于在单条 SQL 内表达自引用的查询,常用于树/层级遍历、图遍历和分层聚合等场景。递归 CTE 由两部分组成:

- 锚点(anchor)查询:非递归部分,执行一次生成初始行集(seed)。
- 递归(recursive)查询:可以引用 CTE 本体,基于上轮产生的新行继续生成新行。

锚点与递归部分通常由 `UNION` 或 `UNION ALL` 连接。递归执行直到不再产生新行或达到系统限制。

## 语法

```sql
WITH [RECURSIVE] cte_name [(col1, col2, ...)] AS (
<anchor_query> -- 非递归部分(一次执行)
UNION [ALL]
<recursive_query> -- 可引用 cte_name 的递归部分
)
SELECT ... FROM cte_name;
```

要点:

- `RECURSIVE` 关键字允许在 CTE 定义中引用自身。
- 锚点和递归成员输出的列数和类型必须严格一致。
- `recursive_query` 中能引用 `cte_name`,通常以 `JOIN`形式使用。

## 执行语义(迭代模型)

典型迭代执行流程:

1. 执行 `anchor_query`,将结果写入输出集合(Output)并作为首轮的工作集合(WorkSet)。
2. 当 WorkSet 非空时:
- 用 WorkSet 作为 `recursive_query` 的输入,执行 `recursive_query`,得到 `newRows`。
- 若使用 `UNION ALL`:直接将 `newRows` 追加到 Output,并把 `newRows` 作为下一轮的 WorkSet。
- 若使用 `UNION`(去重):对 `newRows` 与已有 Output 做差集(去重),只将未出现的行加入 Output 与下一轮 WorkSet。
3. 重复步骤 2,直到 `newRows` 为空或触发系统预设的上限(Doris session变量限制递归深度`cte_max_recursion_depth` 默认值为100,超出会抛错)。

终止当当前轮没有新行被产生(或达到系统最大递归深度限制)。

## UNION vs UNION ALL

- `UNION ALL`:保留重复,执行开销低(无需去重)。适用于允许重复或在后端由业务逻辑控制重复的场景。
- `UNION`:隐含去重,会在每轮或全局增加排序/哈希去重开销,代价显著,尤其是在大数据量下。

建议:如果语义允许且能在应用层后处理重复,优先使用 `UNION ALL`。

## 常见用例与 SQL 示例

1) 简单层级遍历:

```sql
WITH r_cte AS (
SELECT 1 AS user_id, 2 as manager_id
UNION ALL
SELECT user_id, manager_id FROM r_cte INNER JOIN (SELECT 1 AS user_id, 2 as manager_id) t ON r_cte.manager_id = t.user_id
CREATE TABLE tree
(
id int,
parent_id int,
data varchar(100)
) DUPLICATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES ('replication_num' = '1');

INSERT INTO tree VALUES (0, NULL, 'ROOT'), (1, 0, 'Child_1'), (2, 0, 'Child_2'), (3, 1, 'Child_1_1');

WITH RECURSIVE search_tree AS (
SELECT id, parent_id, data
FROM tree t
WHERE t.id = 0
UNION ALL
SELECT t.id, t.parent_id, t.data
FROM tree t, search_tree st
WHERE t.parent_id = st.id
)
SELECT * FROM r_cte
SELECT * FROM search_tree order BY id;
```

2) 图遍历:

```sql
CREATE TABLE graph
(
c_from int,
c_to int,
label varchar(100)
) DUPLICATE KEY (c_from) DISTRIBUTED BY HASH(c_from) BUCKETS 1 PROPERTIES 'replication_num' = '1');

INSERT INTO graph VALUES (1, 2, '1 -> 2'), (1, 3, '1 -> 3'), (2, 3, '2 -> 3'), (1, 4, '1 -> 4'), (4, 5, '4 -> 5');

WITH RECURSIVE search_graph AS (
SELECT c_from, c_to, label FROM graph g
UNION ALL
SELECT g.c_from, g.c_to, g.label
FROM graph g, search_graph sg
WHERE g.c_from = sg.c_to
)
SELECT DISTINCT * FROM search_graph ORDER BY c_from, c_to;
```

注意:使用 `UNION` 会在每轮做去重,代价较高。

## 递归CTE的限制

- 内部查询顶层操作符必须是UNION(ALL)。
- 非递归部分的子查询不能引用递归CTE自身。
- 递归部分的子查询只能引用递归CTE一次。
- 递归部分的子查询如果内部还有子查询,那内部的子查询不能引用递归CTE。
- 递归CTE的输出列类型由非递归侧的子查询输出决定,如果递归侧和非递归侧类型不一致,会报错。需要人工cast保证两边数据类型一致。
- session变量`cte_max_recursion_depth`,限制最大递归次数,防止死循环,默认值是100。

## 常见错误、原因与解决办法

1. 错误:锚点与递归成员列数或类型不匹配
- 原因:两部分 `SELECT` 列数或列类型不一致。
- 解决:确保两侧列数、顺序与类型一致,必要时使用 `CAST` 或显式列名。
2. 错误:锚点引用自身(非法)
- 原因:锚点不允许引用 CTE 本体。
- 解决:只在递归成员中引用 CTE;检查语法/解析树。
3. 错误:无限递归 / 超过最大递归深度
- 原因:递归没有收敛条件或收敛条件设置不正确。
- 解决:添加`WHERE` 过滤、或调整系统最大递归深度;若逻辑确实无限递归需修正查询逻辑。