13 changes: 0 additions & 13 deletions docs/ai/vector-search/overview.md
@@ -31,19 +31,6 @@ To achieve this, we need a mechanism to measure semantic relatedness between a u

Vector retrieval in RAG is not limited to text; it naturally extends to multimodal scenarios. In a multimodal RAG system, images, audio, video, and other data types can also be encoded into vectors for retrieval and then supplied to the generative model as context. For example, if a user uploads an image, the system can first retrieve related descriptions or knowledge snippets, then generate explanatory content. In medical QA, RAG can retrieve patient records and literature to support more accurate diagnostic suggestions.

## Brute-Force Search

Starting from version 2.0, Apache Doris supports nearest-neighbor search based on vector distance. Performing vector search with SQL is natural and simple:

```sql
SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
FROM vector_table
ORDER BY distance
LIMIT 10;
```

When the dataset is small (under ~1 million rows), Doris’s exact K-Nearest Neighbor search performance is sufficient, providing 100% recall and precision. As the dataset grows, however, most users are willing to trade a small amount of recall/accuracy for significantly lower latency. The problem then becomes Approximate Nearest Neighbor (ANN) search.
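
As a rough illustration of what the exact search above computes, here is a minimal Python sketch, not Doris's implementation. The `rows` list is a hypothetical stand-in for `vector_table`, and `brute_force_knn` and `recall_at_k` are names invented for this example. It scores every row against the query, keeps the `k` smallest L2 distances, and shows how the recall of an approximate result could be measured against the exact one.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_knn(rows, query, k):
    """Exact top-k: score every row, keep the k smallest distances.

    `rows` is an iterable of (id, embedding) pairs (hypothetical stand-in
    for vector_table). Returns (id, distance) pairs, nearest first.
    """
    scored = ((l2_distance(emb, query), row_id) for row_id, emb in rows)
    return [(row_id, dist) for dist, row_id in heapq.nsmallest(k, scored)]


def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k ids that an approximate result also found."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0


# Toy data: four 2-dimensional embeddings.
rows = [(1, [0.0, 0.0]), (2, [3.0, 4.0]), (3, [1.0, 1.0]), (4, [6.0, 8.0])]
top2 = brute_force_knn(rows, [0.0, 0.0], k=2)
```

Because every row is scored, the cost grows linearly with the number of rows and the vector dimension. That linear cost is why an approximate index becomes attractive as the table grows.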

## Approximate Nearest Neighbor Search

From version 4.0, Apache Doris officially supports ANN search. No additional data type is introduced: vectors are stored as fixed-length arrays. For distance-based indexing, a new index type, ANN, is implemented on top of Faiss.
@@ -27,17 +27,6 @@ under the License.
-->

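In generative AI applications, relying solely on the parametric "memory" of a large model has clear limitations: on the one hand, model knowledge is time-bound and cannot cover the latest information; on the other hand, relying entirely on the model to "generate" answers directly tends to produce hallucinations. RAG (Retrieval-Augmented Generation) emerged in response. Its core goal is not to have the model construct answers out of thin air, but to retrieve from an external knowledge base the Top-K information snippets most relevant to the user's query and use them as the basis for generation. To achieve this, a mechanism is needed to measure semantic relatedness between the user query and the documents in the knowledge base. Vector representations are a common tool: once queries and documents are encoded into semantic vectors, relatedness can be measured by vector similarity. With the advance of pre-trained models, generating high-quality semantic vectors has become mainstream, and the retrieval stage of RAG has evolved into a standard vector similarity search problem: finding, in a large collection of vectors, the K vectors most similar to the query (the candidate knowledge snippets). Note that vector retrieval in RAG is not limited to text; it also extends to multimodal data: images, audio, video, and other data can likewise be encoded into vectors for the generative model to use. For example, after a user uploads an image, the system first retrieves related descriptions or knowledge snippets and then uses them to help generate explanatory content; in medical QA, it can retrieve case records and medical literature to generate more accurate diagnostic suggestions.
## Brute-Force Search
Since version 2.0, Apache Doris has supported nearest-neighbor search based on vector distance, and implementing vector search with SQL is natural and simple.

```sql
SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
FROM vector_table
ORDER BY distance
LIMIT 10;
```

When the dataset is small (fewer than 1 million rows), Apache Doris's exact K-Nearest Neighbor search performs well enough to meet requirements, delivering 100% recall and 100% precision. As the data grows further, however, users are usually willing to sacrifice a small amount of recall and precision in exchange for significantly faster queries, and the problem becomes Approximate Nearest Neighbor (ANN) search.

## Approximate Nearest Neighbor Search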

@@ -27,17 +27,6 @@ under the License.
-->

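In generative AI applications, relying solely on the parametric "memory" of a large model has clear limitations: on the one hand, model knowledge is time-bound and cannot cover the latest information; on the other hand, relying entirely on the model to "generate" answers directly tends to produce hallucinations. RAG (Retrieval-Augmented Generation) emerged in response. Its core goal is not to have the model construct answers out of thin air, but to retrieve from an external knowledge base the Top-K information snippets most relevant to the user's query and use them as the basis for generation. To achieve this, a mechanism is needed to measure semantic relatedness between the user query and the documents in the knowledge base. Vector representations are a common tool: once queries and documents are encoded into semantic vectors, relatedness can be measured by vector similarity. With the advance of pre-trained models, generating high-quality semantic vectors has become mainstream, and the retrieval stage of RAG has evolved into a standard vector similarity search problem: finding, in a large collection of vectors, the K vectors most similar to the query (the candidate knowledge snippets). Note that vector retrieval in RAG is not limited to text; it also extends to multimodal data: images, audio, video, and other data can likewise be encoded into vectors for the generative model to use. For example, after a user uploads an image, the system first retrieves related descriptions or knowledge snippets and then uses them to help generate explanatory content; in medical QA, it can retrieve case records and medical literature to generate more accurate diagnostic suggestions.
## Brute-Force Search
Since version 2.0, Apache Doris has supported nearest-neighbor search based on vector distance, and implementing vector search with SQL is natural and simple.

```sql
SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
FROM vector_table
ORDER BY distance
LIMIT 10;
```

When the dataset is small (fewer than 1 million rows), Apache Doris's exact K-Nearest Neighbor search performs well enough to meet requirements, delivering 100% recall and 100% precision. As the data grows further, however, users are usually willing to sacrifice a small amount of recall and precision in exchange for significantly faster queries, and the problem becomes Approximate Nearest Neighbor (ANN) search.

## Approximate Nearest Neighbor Search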

13 changes: 0 additions & 13 deletions versioned_docs/version-4.x/ai/vector-search/overview.md
@@ -31,19 +31,6 @@ To achieve this, we need a mechanism to measure semantic relatedness between a u

Vector retrieval in RAG is not limited to text; it naturally extends to multimodal scenarios. In a multimodal RAG system, images, audio, video, and other data types can also be encoded into vectors for retrieval and then supplied to the generative model as context. For example, if a user uploads an image, the system can first retrieve related descriptions or knowledge snippets, then generate explanatory content. In medical QA, RAG can retrieve patient records and literature to support more accurate diagnostic suggestions.

## Brute-Force Search

Starting from version 2.0, Apache Doris supports nearest-neighbor search based on vector distance. Performing vector search with SQL is natural and simple:

```sql
SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
FROM vector_table
ORDER BY distance
LIMIT 10;
```

When the dataset is small (under ~1 million rows), Doris’s exact K-Nearest Neighbor search performance is sufficient, providing 100% recall and precision. As the dataset grows, however, most users are willing to trade a small amount of recall/accuracy for significantly lower latency. The problem then becomes Approximate Nearest Neighbor (ANN) search.

## Approximate Nearest Neighbor Search

From version 4.0, Apache Doris officially supports ANN search. No additional data type is introduced: vectors are stored as fixed-length arrays. For distance-based indexing, a new index type, ANN, is implemented on top of Faiss.