diff --git a/docs/ai/vector-search/overview.md b/docs/ai/vector-search/overview.md
index 93018ce9c2ffe..557ae2528c770 100644
--- a/docs/ai/vector-search/overview.md
+++ b/docs/ai/vector-search/overview.md
@@ -31,19 +31,6 @@ To achieve this, we need a mechanism to measure semantic relatedness between a u
 
 Vector retrieval in RAG is not limited to text; it naturally extends to multimodal scenarios. In a multimodal RAG system, images, audio, video, and other data types can also be encoded into vectors for retrieval and then supplied to the generative model as context. For example, if a user uploads an image, the system can first retrieve related descriptions or knowledge snippets, then generate explanatory content. In medical QA, RAG can retrieve patient records and literature to support more accurate diagnostic suggestions.
 
-## Brute-Force Search
-
-Starting from version 2.0, Apache Doris supports nearest-neighbor search based on vector distance. Performing vector search with SQL is natural and simple:
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-When the dataset is small (under ~1 million rows), Doris’s exact K-Nearest Neighbor search performance is sufficient, providing 100% recall and precision. As the dataset grows, however, most users are willing to trade a small amount of recall/accuracy for significantly lower latency. The problem then becomes Approximate Nearest Neighbor (ANN) search.
-
 ## Approximate Nearest Neighbor Search
 
 From version 4.0, Apache Doris officially supports ANN search. No additional data type is introduced: vectors are stored as fixed-length arrays. For distance-based indexing a new index type, ANN, is implemented based on Faiss.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
index a37ae8a7b845a..33debeba8592a 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
@@ -27,17 +27,6 @@ under the License.
 -->
 
 在生成式 AI 的应用中,单纯依赖大模型自身的参数“记忆”存在明显局限:一方面,模型知识具有时效性,无法覆盖最新信息;另一方面,完全依赖模型直接“生成”容易产生幻觉(Hallucination)。因此,RAG(检索增强生成)应运而生。其核心目标不是让模型凭空构造答案,而是从外部知识库中检索出与用户查询最相关的 Top-K 信息片段,作为生成依据。为实现这一点,需要一种机制衡量“用户查询”与“知识库文档”之间的语义相关性。向量表示正是常用手段:将查询与文档统一编码为语义向量后,可通过向量相似度衡量相关程度。随着预训练模型的发展,生成高质量语义向量已成主流,RAG 的检索阶段也演化为一个标准的向量相似度搜索问题——从大规模向量集合中找出与查询最相似的 K 个向量(候选知识片段)。需要注意,RAG 的向量检索不限于文本,也可扩展到多模态:图片、语音、视频等数据同样可以编码为向量供生成模型使用。例如,用户上传图片后,系统先检索相关描述或知识片段,再辅助生成解释性内容;在医学问答中,可检索病例资料与医学文献,生成更准确的诊断建议。
-## 暴力搜索
-Apache Doris 自 2.0 版本起支持基于向量距离的最近邻搜索,通过 SQL 实现向量搜索是一个自然且简单的过程。
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-当数据量不大(小于 100 万行)时,Apache Doris 的精确最近邻(K-Nearest Neighbor)搜索性能足以满足需求,可获得 100% 召回与 100% 精确。但随着数据进一步增长,用户通常愿意牺牲少量召回与精度以换取显著的查询加速,此时问题就转化为向量近似最近邻搜索(Approximate Nearest Neighbor,ANN)。
 
 ## 近似最近邻搜索
 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
index b53f43ce802d0..e7d84f464e44c 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
@@ -27,17 +27,6 @@ under the License.
 -->
 
 在生成式 AI 的应用中,单纯依赖大模型自身的参数“记忆”存在明显局限:一方面,模型知识具有时效性,无法覆盖最新信息;另一方面,完全依赖模型直接“生成”容易产生幻觉(Hallucination)。因此,RAG(检索增强生成)应运而生。其核心目标不是让模型凭空构造答案,而是从外部知识库中检索出与用户查询最相关的 Top-K 信息片段,作为生成依据。为实现这一点,需要一种机制衡量“用户查询”与“知识库文档”之间的语义相关性。向量表示正是常用手段:将查询与文档统一编码为语义向量后,可通过向量相似度衡量相关程度。随着预训练模型的发展,生成高质量语义向量已成主流,RAG 的检索阶段也演化为一个标准的向量相似度搜索问题——从大规模向量集合中找出与查询最相似的 K 个向量(候选知识片段)。需要注意,RAG 的向量检索不限于文本,也可扩展到多模态:图片、语音、视频等数据同样可以编码为向量供生成模型使用。例如,用户上传图片后,系统先检索相关描述或知识片段,再辅助生成解释性内容;在医学问答中,可检索病例资料与医学文献,生成更准确的诊断建议。
-## 暴力搜索
-Apache Doris 自 2.0 版本起支持基于向量距离的最近邻搜索,通过 SQL 实现向量搜索是一个自然且简单的过程。
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-当数据量不大(小于 100 万行)时,Apache Doris 的精确最近邻(K-Nearest Neighbor)搜索性能足以满足需求,可获得 100% 召回与 100% 精确。但随着数据进一步增长,用户通常愿意牺牲少量召回与精度以换取显著的查询加速,此时问题就转化为向量近似最近邻搜索(Approximate Nearest Neighbor,ANN)。
 
 ## 近似最近邻搜索
 
diff --git a/versioned_docs/version-4.x/ai/vector-search/overview.md b/versioned_docs/version-4.x/ai/vector-search/overview.md
index 7bde766208894..22b79e08f9b2e 100644
--- a/versioned_docs/version-4.x/ai/vector-search/overview.md
+++ b/versioned_docs/version-4.x/ai/vector-search/overview.md
@@ -31,19 +31,6 @@ To achieve this, we need a mechanism to measure semantic relatedness between a u
 
 Vector retrieval in RAG is not limited to text; it naturally extends to multimodal scenarios. In a multimodal RAG system, images, audio, video, and other data types can also be encoded into vectors for retrieval and then supplied to the generative model as context. For example, if a user uploads an image, the system can first retrieve related descriptions or knowledge snippets, then generate explanatory content. In medical QA, RAG can retrieve patient records and literature to support more accurate diagnostic suggestions.
 
-## Brute-Force Search
-
-Starting from version 2.0, Apache Doris supports nearest-neighbor search based on vector distance. Performing vector search with SQL is natural and simple:
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-When the dataset is small (under ~1 million rows), Doris’s exact K-Nearest Neighbor search performance is sufficient, providing 100% recall and precision. As the dataset grows, however, most users are willing to trade a small amount of recall/accuracy for significantly lower latency. The problem then becomes Approximate Nearest Neighbor (ANN) search.
-
 ## Approximate Nearest Neighbor Search
 
 From version 4.0, Apache Doris officially supports ANN search. No additional data type is introduced: vectors are stored as fixed-length arrays. For distance-based indexing a new index type, ANN, is implemented based on Faiss.