115 commits
ade0600
[core] Modify all row lineage to row tracking in codes (#6262)
JingsongLi Sep 16, 2025
6ffe696
[core][flink] Remove withMemoryPool in TableWriteImpl (#6265)
tsreaper Sep 16, 2025
b572f81
[doc] Fix Python code syntax errors and typo in PVFS documentation (#…
plusplusjiajia Sep 17, 2025
21328c6
[hotfix] Fix comments in MergeFileSplitRead and style
JingsongLi Sep 17, 2025
95c540a
[append] Fix error of compaction append table with deletion vectors (…
yuzelin Sep 17, 2025
4f2c4f4
[core] Primary key types should not be changed (#6264)
tsreaper Sep 17, 2025
d050f44
[spark] Avoid unnecessary repeated resolve MergeIntoTable (#6275)
Zouxxyy Sep 17, 2025
d1767f5
[iceberg] Enhance Iceberg timestamp type compatibility with extended …
plusplusjiajia Sep 18, 2025
429827e
[core] Separate indexIncrement into dataIncrement and compactIncreme…
Zouxxyy Sep 19, 2025
3b1f759
[core] Fix the commit kind when performing row-level changes on non-p…
Zouxxyy Sep 19, 2025
f3955cd
[hotfix] Minor refactor for FileStoreCommitImpl
JingsongLi Sep 19, 2025
abe882a
[spark] Resolve function after all args have been resolved (#6292)
Zouxxyy Sep 22, 2025
db8610d
[rest] Add tableType parameter support for listing tables (#6295)
plusplusjiajia Sep 22, 2025
16f0cea
[cdc] Fix cannot be cast to float due to precision loss (#6291)
lizc9 Sep 22, 2025
cee4a54
[lance] Remove shade for paimon-lance (#6310)
JingsongLi Sep 23, 2025
7b49031
[core] nested-update supports limit the input (#6309)
yuzelin Sep 23, 2025
af25ba1
[core] Cross partition can work with fixed bucket and postpone bucket…
JingsongLi Sep 23, 2025
b7f49f8
[spark] System function max_pt should be used as 'sys.max_pt' (#6312)
JingsongLi Sep 23, 2025
909a165
[python] Support reading data by splitting according to rows (#6274)
discivigour Sep 23, 2025
0a5004c
[python] Support ignore_if_exists param for database and table (#6314)
discivigour Sep 24, 2025
501d1db
[arrow] ArrowBatchConverter support reset the reused VectorSchemaRoot…
yuzelin Sep 25, 2025
388fc68
[python] Modify package version to be compatible with python version …
discivigour Sep 25, 2025
f684400
[core] Apply 'file-operation-thread-num' to commit (#6339)
JingsongLi Sep 26, 2025
86de753
[spark] Support version-specific default configuration values (#6334)
kerwin-zk Sep 26, 2025
30efebe
[python] Fix OSSParam to access DLF (#6332)
discivigour Sep 26, 2025
2df6a06
[spark] Improve error msg for creating a function on an existing tmp …
Zouxxyy Sep 28, 2025
565b8aa
[core] Fix comment of paged api (#6348)
XiaoHongbo-Hope Sep 28, 2025
c4bf236
[core] Support incremental clustering for append unaware table (#6338)
LsomeYeah Sep 29, 2025
5caa6f6
[core] Add dv conflict detection during commit (#6303)
Zouxxyy Oct 9, 2025
0df7a71
[spark] Fix sort compact with partition filter (#6371)
Zouxxyy Oct 9, 2025
2e11817
[spark] Fix group by partial partition of a multi partition table (#6…
Zouxxyy Oct 10, 2025
673471f
[spark] Improve the exception msg for unsupported type (#6379)
Zouxxyy Oct 11, 2025
2591ce2
[hotfix] Fix typo in MergeIntoPaimonTable (#6381)
Zouxxyy Oct 11, 2025
867629e
[spark] Ensure compatibility of resolveFilter in lower version spark3…
Zouxxyy Oct 13, 2025
f325c44
[pvfs] Fix file status and input stream for PVFS (#6397)
timmyyao Oct 14, 2025
b1e7fd1
[flink] support performing incremental clustering by flink (#6395)
LsomeYeah Oct 14, 2025
0d1a61e
[parquet] Bump parquet version to 1.15.2 (#6363)
csringhofer Oct 14, 2025
ca48db6
[hotfix] Fix compile error in CompactAction
JingsongLi Oct 14, 2025
e89434d
[hotfix] Disable unstable test CloneActionITCase
JingsongLi Oct 14, 2025
d0c1abd
[spark] Fix the IOManager not work in spark reader (#6401)
Aitozi Oct 14, 2025
689b3ef
[docs] fix a doc error about incremental clustering (#6402)
LsomeYeah Oct 15, 2025
ef998c2
[hotfix][flink] fix the partition error for IncrementalClusterSplitSo…
LsomeYeah Oct 16, 2025
b1cc1e5
[core] Fix that FilesTable cannot output level0 files when dv enabled…
yuzelin Oct 16, 2025
018d61b
[oss] add fs.oss.sld.enabled to support oss private link (#6413)
shyjsarah Oct 16, 2025
05ad2a0
[core] Fix that cannot read binlog table with projection (#6417)
yuzelin Oct 16, 2025
a214e71
[python] support blob type and blob write and read (#6390)
jerry-024 Oct 14, 2025
0e05b7e
[Python] Enable field merge read in row-tracking table (#6399)
discivigour Oct 15, 2025
8bda95e
[Python] Introduce incremental-between read by timestamp (#6391)
discivigour Oct 17, 2025
fe68d21
[core]Python: fix blob write when blob_as_descriptor is true (#6404)
jerry-024 Oct 17, 2025
462c420
[python] Support blob read && write (#6420)
leaves12138 Oct 17, 2025
82fe60a
[python] Blob type more test for descriptor (#6422)
leaves12138 Oct 17, 2025
e036dec
[python] Make FileStoreWrite.max_seq_numbers lazied (#6418)
JingsongLi Oct 17, 2025
3d1b30f
[python] Filter manifest files by partition predicate in scan (#6419)
JingsongLi Oct 17, 2025
fa4671c
[python] Filter manifest entry by advance to reduce memory (#6428)
JingsongLi Oct 20, 2025
d45886b
[Python] optimize codes related to push_down_utils (#6430)
chenghuichen Oct 20, 2025
59c6ae0
[flink] disable clustering during writing if incremental clustering e…
LsomeYeah Oct 20, 2025
7112827
[python] Drop stats for manifest entries reading (#6429)
JingsongLi Oct 20, 2025
139f65c
[Python] clean code for pypaimon (#6433)
chenghuichen Oct 20, 2025
11360cf
[python] Refactor dicts to static fields to improve performance (#6436)
JingsongLi Oct 20, 2025
94644ac
[flink] Produce real random id in SourceSplitGenerator (#6441)
JingsongLi Oct 21, 2025
10a71b1
[Python] SimpleStats supports BinaryRow (#6444)
discivigour Oct 21, 2025
445b217
[Python] Refactor BinaryRow to reuse keys and key fields (#6445)
JingsongLi Oct 21, 2025
8ef5214
[Python] filter_manifest_entry should not evolution primary keys
JingsongLi Oct 21, 2025
80ac5e2
[Python] Remove useless TODO in SimpleStats
JingsongLi Oct 21, 2025
cf0515d
[core] Disable data evolution manifest filter for now (#6443)
leaves12138 Oct 21, 2025
a7500e4
[flink] Incremental Clustering support specify partitions (#6449)
LsomeYeah Oct 22, 2025
4c1408e
[Python] parallel read manifest entries (#6451)
chenghuichen Oct 22, 2025
08ff9ba
[Python] max_workers at least 8 for manifest_file_manager.read_entrie…
JingsongLi Oct 22, 2025
bfbefe8
[Python] Fix bug comparing string and int in row_key_extractor (#6448)
universe-hcy Oct 22, 2025
80c0876
[rest] Add more hint message when commit failed because of NoSuchReso…
qingwei727 Oct 22, 2025
cddedd0
[Python] Support schema evolution read for changing column position (…
discivigour Oct 23, 2025
53fbc2d
[python] Introduce schema cache in SchemaManager
JingsongLi Oct 23, 2025
d09c4f1
[python] Split read should discard predicate for other fields
JingsongLi Oct 23, 2025
f58f0f5
[python] Use _try_to_pad_batch_by_schema in TableRead
JingsongLi Oct 23, 2025
8152dda
[core] disable dv mode for incremental clustering table (#6461)
LsomeYeah Oct 23, 2025
df2c5ac
[Python] Add basic tests for schema evolution read (#6463)
discivigour Oct 23, 2025
1d3e5e5
[Python] Blob read supports with_shard (#6465)
discivigour Oct 24, 2025
0a657b3
[doc] Supplementary the document of python REST API (#6466)
discivigour Oct 26, 2025
25d2bdd
[doc] Refactor names in python-api
JingsongLi Oct 26, 2025
edda9f0
[arrow] Fix java.lang.IllegalArgumentException in ArrowFormatCWriter.…
jichen20210919 Oct 26, 2025
4f9bfd8
[Python] Move __version__ to setup.py (#6491)
discivigour Oct 29, 2025
3c213dc
[doc] fix MarkPartitionDoneProcedure doc. (#6473)
plusplusjiajia Oct 27, 2025
d3a7451
[doc] add view doc (#6469)
jerry-024 Oct 27, 2025
f170713
[doc] Fix typo in flink/sql-ddl.md (#6476)
LuciferYang Oct 27, 2025
b856f28
[typo] fix typo in distinct (#6475)
tclxjunjie2-zhao Oct 27, 2025
5c90f44
[doc] Update batch partition mark done doc (#6478)
Zouxxyy Oct 28, 2025
de1f9c2
[core] support automatically clustering historical partition (#6472)
LsomeYeah Oct 28, 2025
718f124
[core] Refactor HistoryPartitionCluster to load less history partitions
JingsongLi Oct 28, 2025
5625e47
[spark] Fix write non-pk dv table with external paths (#6487)
Zouxxyy Oct 29, 2025
d495257
[core] Fix endian spec for BloomFilter index reader (#6493)
liyubin117 Oct 29, 2025
3e2e8b1
[doc] Update config.toml
JingsongLi Nov 10, 2025
26dfce0
[doc] Fix error link in views page
JingsongLi Nov 10, 2025
75b4486
[core] Support non null column with write type (#6513)
leaves12138 Nov 3, 2025
3d09deb
[hotfix] Add more informat to check partition spec in InternalRowPart…
JingsongLi Nov 3, 2025
81e24ab
[hotfix] Print partition spec and type when error in InternalRowParti…
JingsongLi Nov 3, 2025
f97750d
[Python] Keep the variable names of Identifier consistent with Java (…
universe-hcy Nov 4, 2025
134e1b6
[Python] Suppport multi prepare commit in the same TableWrite (#6526)
discivigour Nov 5, 2025
47c521d
[Python] Rename to BATCH_COMMIT_IDENTIFIER in snapshot.py
JingsongLi Nov 5, 2025
e6ea490
[python] support custom source split target size and split open file …
XiaoHongbo-Hope Nov 5, 2025
e233be1
[core] Fix that cannot get partition info if all files are in level-0…
zhoulii Nov 5, 2025
048c0d4
[python] add test case for reading blob by to_iterator (#6536)
XiaoHongbo-Hope Nov 6, 2025
e9f4b2f
[doc] fix postpone.default-bucket-num default value (#6554)
wombatu-kun Nov 7, 2025
e4ecfac
[Python] Remove the use of reference types in DATA_FILE_META_SCHEMA (…
zjw1111 Nov 7, 2025
4b16ee2
[common] Remove token expire time (#6544)
leaves12138 Nov 7, 2025
787210f
[python] Fix AtomicType.to_dict() inconsistency with java (#6548)
universe-hcy Nov 10, 2025
ae57450
[python] Fix File Source type in data file meta (#6571)
discivigour Nov 10, 2025
ce4ebcd
[Python] License added in test (#6572)
discivigour Nov 10, 2025
a47a28e
[core] support incremental clustering in dv mode (#6559)
LsomeYeah Nov 10, 2025
5360c03
[core] Refactor deletion vectors support for incremental cluster
JingsongLi Nov 10, 2025
1390842
[Python] Add Mixed read and write test between Java and Python. (#6579)
discivigour Nov 11, 2025
2b9eb20
[spark] Fix write multiple cols with key dynamic table (#6585)
Zouxxyy Nov 11, 2025
fff9bfe
[core] Fix spillToBinary in KeyValueBuffer (#6586)
JingsongLi Nov 12, 2025
85fa68b
[core] Optimize IndexFileMetaSerializer#rowArrayDataToDvMetas (#6589)
tsreaper Nov 12, 2025
cafcf4e
[arrow] Fix that complex writers didn't reset inner writer state (#6591)
yuzelin Nov 12, 2025
c9a50e2
Add comprehensive test cases for LanceFileFormat
Dec 25, 2025
28 changes: 26 additions & 2 deletions .github/workflows/paimon-python-checks.yml
100644 → 100755
@@ -32,6 +32,9 @@ on:

env:
PYTHON_VERSIONS: "['3.6.15', '3.10']"
JDK_VERSION: 8
MAVEN_OPTS: -Dmaven.wagon.httpconnectionManager.ttlSeconds=30 -Dmaven.wagon.http.retryHandler.requestSentEnabled=true


concurrency:
group: ${{ github.workflow }}-${{ github.event_name }}-${{ github.event.number || github.run_id }}
@@ -49,6 +52,17 @@ jobs:
- name: Checkout code
uses: actions/checkout@v2

- name: Set up JDK ${{ env.JDK_VERSION }}
uses: actions/setup-java@v4
with:
java-version: ${{ env.JDK_VERSION }}
distribution: 'temurin'

- name: Set up Maven
uses: stCarolas/setup-maven@v4.5
with:
maven-version: 3.8.8

- name: Install system dependencies
shell: bash
run: |
@@ -58,19 +72,29 @@
curl \
&& rm -rf /var/lib/apt/lists/*

- name: Verify Java and Maven installation
run: |
java -version
mvn -version

- name: Verify Python version
run: python --version

- name: Build Java
run: |
echo "Start compiling modules"
mvn -T 2C -B clean install -DskipTests

- name: Install Python dependencies
shell: bash
run: |
if [[ "${{ matrix.python-version }}" == "3.6.15" ]]; then
python -m pip install --upgrade pip==21.3.1
python --version
python -m pip install -q readerwriterlock==1.0.9 'fsspec==2021.10.1' 'cachetools==4.2.4' 'ossfs==2021.8.0' pyarrow==6.0.1 pandas==1.1.5 'polars==0.9.12' 'fastavro==1.4.7' zstandard==0.19.0 dataclasses==0.8.0 flake8 pytest py4j==0.10.9.9 requests 2>&1 >/dev/null
python -m pip install -q readerwriterlock==1.0.9 'fsspec==2021.10.1' 'cachetools==4.2.4' 'ossfs==2021.8.0' pyarrow==6.0.1 pandas==1.1.5 'polars==0.9.12' 'fastavro==1.4.7' zstandard==0.19.0 dataclasses==0.8.0 flake8 pytest py4j==0.10.9.9 requests parameterized==0.8.1 2>&1 >/dev/null
else
python -m pip install --upgrade pip
python -m pip install -q readerwriterlock==1.0.9 fsspec==2024.3.1 cachetools==5.3.3 ossfs==2023.12.0 ray==2.48.0 fastavro==1.11.1 pyarrow==16.0.0 zstandard==0.24.0 polars==1.32.0 duckdb==1.3.2 numpy==1.24.3 pandas==2.0.3 flake8==4.0.1 pytest~=7.0 py4j==0.10.9.9 requests 2>&1 >/dev/null
python -m pip install -q readerwriterlock==1.0.9 fsspec==2024.3.1 cachetools==5.3.3 ossfs==2023.12.0 ray==2.48.0 fastavro==1.11.1 pyarrow==16.0.0 zstandard==0.24.0 polars==1.32.0 duckdb==1.3.2 numpy==1.24.3 pandas==2.0.3 flake8==4.0.1 pytest~=7.0 py4j==0.10.9.9 requests parameterized==0.9.0 2>&1 >/dev/null
fi
- name: Run lint-python.sh
shell: bash
16 changes: 8 additions & 8 deletions docs/config.toml
@@ -14,7 +14,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

baseURL = '//paimon.apache.org/docs/master'
baseURL = '//paimon.apache.org/docs/1.3'
languageCode = 'en-us'
title = 'Apache Paimon'
enableGitInfo = false
@@ -24,7 +24,7 @@ pygmentsUseClasses = true
[params]
# Flag whether this is a stable version or not.
# Used for the quickstart page.
IsStable = false
IsStable = true

# Flag to indicate whether an outdated warning should be shown.
ShowOutDatedWarning = false
@@ -34,14 +34,14 @@ pygmentsUseClasses = true
# we change the version for the complete docs when forking of a release branch
# etc.
# The full version string as referenced in Maven (e.g. 1.2.1)
Version = "1.3-SNAPSHOT"
Version = "1.3.0"

# For stable releases, leave the bugfix version out (e.g. 1.2). For snapshot
# release this should be the same as the regular version
VersionTitle = "1.3-SNAPSHOT"
VersionTitle = "1.3"

# The branch for this version of Apache Paimon
Branch = "master"
Branch = "1.3"

# The most recent supported Apache Flink version
FlinkVersion = "1.20"
@@ -67,14 +67,14 @@ pygmentsUseClasses = true
["JavaDocs", "//paimon.apache.org/docs/master/api/java/"],
]

StableDocs = "https://paimon.apache.org/docs/1.0"
StableDocs = "https://paimon.apache.org/docs/1.3"

PreviousDocs = [
["master", "https://paimon.apache.org/docs/master"],
["stable", "https://paimon.apache.org/docs/1.2"],
["stable", "https://paimon.apache.org/docs/1.3"],
["1.3", "https://paimon.apache.org/docs/1.3"],
["1.2", "https://paimon.apache.org/docs/1.2"],
["1.1", "https://paimon.apache.org/docs/1.1"],
["1.0", "https://paimon.apache.org/docs/1.0"],
]

BookSection = '/'
175 changes: 175 additions & 0 deletions docs/content/append-table/incremental-clustering.md
@@ -0,0 +1,175 @@
---
title: "Incremental Clustering"
weight: 4
type: docs
aliases:
- /append-table/incremental-clustering.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Incremental Clustering

Paimon currently supports ordering append tables using an SFC (Space-Filling Curve); see [sort compact]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}) for more info.
The resulting data layout typically delivers better performance for queries that target clustering keys.
However, with the current SortCompaction, even when neither the data nor the clustering keys have changed,
each run still rewrites the entire dataset, which is extremely costly.

To address this, Paimon introduced a more flexible, incremental clustering mechanism: Incremental Clustering.
On each run, it selects only a specific subset of files to cluster, avoiding a full rewrite. This enables low-cost,
sort-based optimization of the data layout and improves query performance. In addition, with Incremental Clustering,
you can adjust clustering keys without rewriting existing data: the layout evolves dynamically as clustering runs
and gradually converges to an optimal state, significantly reducing the decision-making complexity around data layout.


Incremental Clustering supports:
- Incremental clustering, minimizing write amplification as much as possible.
- Small-file compaction; rewrites respect the target file size.
- Changing clustering keys; newly ingested data is clustered according to the latest clustering keys.
- A full mode; when selected, the entire dataset is reclustered.

**Only unaware-bucket append tables support Incremental Clustering.**

## Enable Incremental Clustering

To enable Incremental Clustering, the following configuration needs to be set for the table:
<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 20%">Option</th>
<th class="text-left" style="width: 10%">Value</th>
<th class="text-left" style="width: 5%">Required</th>
<th class="text-left" style="width: 10%">Type</th>
<th class="text-left" style="width: 55%">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><h5>clustering.incremental</h5></td>
<td>true</td>
<td style="word-wrap: break-word;">Yes</td>
<td>Boolean</td>
<td>Must be set to true to enable incremental clustering. Default is false.</td>
</tr>
<tr>
<td><h5>clustering.columns</h5></td>
<td>'clustering-columns'</td>
<td style="word-wrap: break-word;">Yes</td>
<td>String</td>
<td>The clustering columns, in the format 'columnName1,columnName2'. It is not recommended to use partition keys as clustering keys.</td>
</tr>
<tr>
<td><h5>clustering.strategy</h5></td>
<td>'zorder' or 'hilbert' or 'order'</td>
<td style="word-wrap: break-word;">No</td>
<td>String</td>
<td>The ordering algorithm used for clustering. If not set, it will be decided based on the number of clustering columns: 'order' is used for 1 column, 'zorder' for fewer than 5 columns, and 'hilbert' for 5 or more columns.</td>
</tr>
</tbody>

</table>
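
For example, here is a minimal Spark SQL sketch of enabling Incremental Clustering when creating a table (the table name and columns are placeholders, not taken from an existing example):

```sql
-- Illustration only: enable incremental clustering and cluster by c1,c2.
CREATE TABLE t (
    c1 INT,
    c2 STRING,
    c3 STRING
) TBLPROPERTIES (
    'clustering.incremental' = 'true',
    'clustering.columns' = 'c1,c2'
);
```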

Once Incremental Clustering is enabled for a table, you can run it periodically in batch mode
to continuously optimize the table's data layout and deliver better query performance.

**Note**: Since common compaction also rewrites files, it may disrupt the ordered data layout built by Incremental Clustering.
Therefore, when Incremental Clustering is enabled, the table no longer supports write-time compaction or dedicated compaction;
clustering and small-file merging must be performed exclusively via Incremental Clustering runs.

## Run Incremental Clustering
{{< hint info >}}

Incremental Clustering can only be run in batch mode.

{{< /hint >}}

To run an Incremental Clustering job, follow these instructions.

You don't need to specify any clustering-related parameters when running Incremental Clustering;
these options are already defined as table options. If you need to change clustering settings, update the corresponding table options, for example as shown in the sketch below.
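
A hedged Spark SQL sketch of changing the clustering keys between runs (the table name `T` and the column names are placeholders; newly ingested data will then be clustered by the new keys):

```sql
-- Illustration only: switch the clustering columns of table T.
-- Existing data is not rewritten; the layout converges over subsequent clustering runs.
ALTER TABLE T SET TBLPROPERTIES (
    'clustering.columns' = 'c2,c3'
);
```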

{{< tabs "incremental-clustering" >}}

{{< tab "Spark SQL" >}}

Run the following SQL:

```sql
-- Set the write parallelism; if it is too large, a large number of small files may be generated.
SET spark.sql.shuffle.partitions=10;

-- run incremental clustering
CALL sys.compact(table => 'T')

-- run incremental clustering in full mode; this will recluster all data
CALL sys.compact(table => 'T', compact_strategy => 'full')
```
{{< /tab >}}

{{< tab "Flink Action" >}}

Run the following command to submit an incremental clustering job for the table.

```bash
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-{{< version >}}.jar \
compact \
--warehouse <warehouse-path> \
--database <database-name> \
--table <table-name> \
[--compact_strategy <minor / full>] \
[--table_conf <table_conf>] \
[--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]]
```

Example: run incremental clustering

```bash
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-{{< version >}}.jar \
compact \
--warehouse s3:///path/to/warehouse \
--database test_db \
--table test_table \
--table_conf sink.parallelism=2 \
--compact_strategy minor \
--catalog_conf s3.endpoint=https://****.com \
--catalog_conf s3.access-key=***** \
--catalog_conf s3.secret-key=*****
```
* `--compact_strategy` determines how to pick the files to be clustered; the default is `minor`.
  * `full`: all files will be selected for clustering.
  * `minor`: pick the set of files that need to be clustered based on specified conditions.

Note: write parallelism is set by `sink.parallelism`; if it is too large, a large number of small files may be generated.

You can use `-D execution.runtime-mode=batch` or `-yD execution.runtime-mode=batch` (for the ON-YARN scenario) to use batch mode.
{{< /tab >}}

{{< /tabs >}}

## Implementation

To balance write amplification and sorting effectiveness, Paimon leverages the LSM Tree notion of levels to stratify data files
and uses the Universal Compaction strategy to select files for clustering.
- Newly written data lands in level-0; files in level-0 are unclustered.
- All files in level-i are produced by sorting within the same sorting set.
- By analogy with Universal Compaction: in level-0, each file is a sorted run; in level-i, all files together constitute a single sorted run. During clustering, the sorted run is the basic unit of work.

By introducing more levels, we can control the amount of data processed in each clustering run.
Data at higher levels is more stably clustered and less likely to be rewritten, thereby mitigating write amplification while maintaining good sorting effectiveness.
8 changes: 4 additions & 4 deletions docs/content/append-table/row-tracking.md
@@ -26,9 +26,9 @@ under the License.

# Row tracking

Row tracking allows Paimon to track row-level lineage in a Paimon append table. Once enabled on a Paimon table, two more hidden columns will be added to the table schema:
- `_ROW_ID`: BIGINT, this is a unique identifier for each row in the table. It is used to track the lineage of the row and can be used to identify the row in case of update, merge into or delete.
- `_SEQUENCE_NUMBER`: BIGINT, this is field indicates which `version` of this record is. It actually is the snapshot-id of the snapshot that this row belongs to. It is used to track the lineage of the row version.
Row tracking allows Paimon to track row-level changes in a Paimon append table. Once enabled on a Paimon table, two more hidden columns will be added to the table schema:
- `_ROW_ID`: BIGINT, a unique identifier for each row in the table. It is used to track updates to the row and can be used to identify the row in case of an update, merge into, or delete.
- `_SEQUENCE_NUMBER`: BIGINT, this field indicates which `version` of this record this is. It is actually the snapshot id of the snapshot that this row belongs to. It is used to track updates of the row version.

Hidden columns follow these rules:
- Whenever we read from one table with row tracking enabled, the `_ROW_ID` and `_SEQUENCE_NUMBER` will be `NOT NULL`.
@@ -57,7 +57,7 @@ CREATE TABLE t (id INT, data STRING) TBLPROPERTIES ('row-tracking.enabled' = 'tr
INSERT INTO t VALUES (11, 'a'), (22, 'b')
```

You can select the row lineage meta column with the following sql in spark:
You can select the row tracking meta columns with the following SQL in Spark:
```sql
SELECT id, data, _ROW_ID, _SEQUENCE_NUMBER FROM t;
```
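
As a follow-up sketch (an illustration only: it reuses the table `t` above and assumes `UPDATE` is issued from Spark), you can re-read the meta columns after a row-level change; per the description above, the `_ROW_ID` of the updated row should remain stable while `_SEQUENCE_NUMBER` reflects the new snapshot:

```sql
-- Illustration only: update one row, then inspect the meta columns again.
UPDATE t SET data = 'a2' WHERE id = 11;

SELECT id, data, _ROW_ID, _SEQUENCE_NUMBER FROM t;
```
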
2 changes: 1 addition & 1 deletion docs/content/concepts/functions.md
@@ -86,4 +86,4 @@ This statement deletes the existing `parse_str` function from the `mydb` databas

## Functions in Spark

see [SQL Functions]({{< ref "spark/sql-functions#user-defined-function" >}})
see [SQL Functions]({{< ref "spark/sql-functions#user-defined-function" >}})
10 changes: 9 additions & 1 deletion docs/content/concepts/rest/dlf.md
@@ -3,8 +3,9 @@ title: "DLF Token"
weight: 3
type: docs
aliases:
- /concepts/rest/dlf.html
- /concepts/rest/dlf.html
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -51,6 +52,13 @@ WITH (
);
```

- `uri`: The URI of the DLF REST Catalog server.
- `warehouse`: The DLF catalog name.
- `token.provider`: The token provider.
- `dlf.access-key-id`: The Access Key ID required to access the DLF service, usually the AccessKey of your RAM user.
- `dlf.access-key-secret`: The Access Key Secret required to access the DLF service.

You can grant specific permissions to a RAM user and use the RAM user's access key for long-term access to your DLF
resources. Compared to using the Alibaba Cloud account access key, accessing DLF resources with a RAM user access key
is more secure.
20 changes: 10 additions & 10 deletions docs/content/concepts/rest/pvfs.md
@@ -128,17 +128,17 @@ Example: execute hadoop shell to list the virtual path

## Python SDK

Python SDK provide fsspec style API, can be easily integrated to Python ecesystem.
The Python SDK provides an fsspec-style API and can be easily integrated into the Python ecosystem.

For example, Python code can do:

```python
import pypaimon

options = {
"uri": 'key',
'token.provider' = 'bear'
'token' = '<token>'
'uri': 'key',
'token.provider': 'bear',
'token': '<token>'
}
fs = pypaimon.PaimonVirtualFileSystem(options)
fs.ls("pvfs://catalog_name/database_name/table_name")
@@ -151,9 +151,9 @@ import pypaimon
import pyarrow.parquet as pq

options = {
"uri": 'key',
'token.provider' = 'bear'
'token' = '<token>'
'uri': 'key',
'token.provider': 'bear',
'token': '<token>'
}
fs = pypaimon.PaimonVirtualFileSystem(options)
path = 'pvfs://catalog_name/database_name/table_name/a.parquet'
@@ -169,9 +169,9 @@ import pypaimon
import ray

options = {
"uri": 'key',
'token.provider' = 'bear'
'token' = '<token>'
'uri': 'key',
'token.provider': 'bear',
'token': '<token>'
}
fs = pypaimon.PaimonVirtualFileSystem(options)
