
Conversation

@YannByron (Contributor)

Purpose

To support Spark batch write.

Linked issue: close #xxx

Brief change log

Tests

org.apache.fluss.spark.row.SparkAsFlussRowTest
org.apache.fluss.spark.row.SparkAsFlussArrayTest
org.apache.fluss.spark.SparkWriteTest

API and Format

Documentation

@wuchong (Member) left a comment:

Thanks @YannByron, I left some comments. Besides, I pushed a commit to improve the javadoc a bit.

* representation (see {@link TimestampNtz}).
*/
override def getTimestampNtz(pos: Int, precision: Int): TimestampNtz =
TimestampNtz.fromMillis(SparkDateTimeUtils.microsToMillis(arrayData.getLong(pos)))
Member:

I think we can introduce a method TimestampNtz.fromMicros, like TimestampLtz.fromEpochMicros, to support converting from microseconds to TimestampNtz. Converting from microseconds to milliseconds loses the nanosecond part.
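For illustration, the array accessor would then delegate to that factory directly; a minimal sketch, assuming a TimestampNtz.fromMicros(long) with the same semantics as TimestampLtz.fromEpochMicros is introduced:

// Sketch only: relies on the proposed TimestampNtz.fromMicros(long), which does not exist yet.
override def getTimestampNtz(pos: Int, precision: Int): TimestampNtz =
  TimestampNtz.fromMicros(arrayData.getLong(pos))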

* representation (see {@link TimestampNtz}).
*/
override def getTimestampNtz(pos: Int, precision: Int): TimestampNtz =
TimestampNtz.fromMillis(SparkDateTimeUtils.microsToMillis(row.getLong(pos)))
Member:

ditto

import org.apache.spark.sql.types.{ArrayType => SparkArrayType, DataType => SparkDataType, StructType}

/** Wraps a Spark [[SparkArrayData]] as a Fluss [[FlussInternalArray]]. */
class SparkAsFlussArray(arrayData: SparkArrayData, elementType: SparkDataType)
Member:

I see that SparkAsFlussRow extends the Serializable interface; do we need to make SparkAsFlussArray extend Serializable as well?

Comment on lines 95 to 100
GenericRowBuilder(4)
.setField(0, 600L)
.setField(1, 21L)
.setField(2, 601)
.setField(3, BinaryString.fromString("addr1"))
.builder(),
Member:

We can replace the builder with GenericRow.of(800L, 23L, 603, fromString("addr3")), which is more concise.
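For instance, the first quoted row above would then read (a sketch, assuming GenericRow.of accepts the field values as varargs as in the example):

// Equivalent of the quoted builder chain, using the suggested factory method.
GenericRow.of(600L, 21L, 601, BinaryString.fromString("addr1"))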

Comment on lines 216 to 222
"+U",
GenericRowBuilder(4)
.setField(0, 800L)
.setField(1, 230L)
.setField(2, 603)
.setField(3, BinaryString.fromString("addr3"))
.builder()),
Member:

nit: move this +U right after the corresponding -U message

.builder())
)
assertThat(flussRows2.length).isEqualTo(4)
assertThat(flussRows2).containsAll(expectRows2.toIterable.asJava)
Member:

Since we set the bucket number to 1, the changelog is globally ordered, so we can assert with .containsExactlyElementsOf here to also check the changelog order.
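Concretely, the suggested assertion would read (a sketch, assuming AssertJ as used elsewhere in this test):

// Also checks the changelog order, which is global here because the bucket number is 1.
assertThat(flussRows2).containsExactlyElementsOf(expectRows2.toIterable.asJava)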

Contributor (Author):

OK. But Spark starts two tasks to write multiple records, and there is no way to guarantee the order of the change data across records. So to achieve this, I insert only one record at a time and validate the order of -U and +U using containsExactlyElementsOf.

Member:

OK, then the original code makes sense to me.

Contributor (Author):

OK. Let me revert the latest change back to the original for this.

Comment on lines 81 to 86
writer.append(flussRow.replace(record)).whenComplete {
(_, exception) =>
{
if (exception != null) {
// logError("Exception occurs while append row to fluss.", exception);
throw new RuntimeException("Failed to append record", exception)
Member:

We must not throw exceptions directly in the completion callback, as they won’t propagate to the Spark writer and may be silently ignored.

Instead, we should capture any exception in a volatile field (e.g., asyncWriterException) within the Spark writer. Then, we can expose a checkAsyncException() method that throws the captured exception if it’s non-null.

This check should be invoked:

  • At the beginning of DataWriter#write, to catch failures from prior async operations before processing new records, and
  • After writer.flush() in DataWriter#commit, to ensure any failure during flush or finalization is surfaced during commit.

This pattern ensures async errors are properly reported through Spark’s writer lifecycle. You can take FlinkSinkWriter as an example.
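For illustration, here is a rough sketch of the pattern; the class name and the appendAsync/flush parameters are placeholders standing in for the actual Fluss writer calls in this PR:

import java.util.concurrent.CompletableFuture

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// Sketch only: appendAsync and flush stand in for the Fluss append/upsert writer calls.
class AsyncErrorAwareDataWriter(
    appendAsync: InternalRow => CompletableFuture[Void],
    flush: () => Unit)
  extends DataWriter[InternalRow] {

  // Holds the first failure reported by an async completion callback, if any.
  @volatile private var asyncWriterException: Throwable = _

  private def checkAsyncException(): Unit = {
    val t = asyncWriterException
    if (t != null) {
      throw new RuntimeException("Failed to write record to Fluss", t)
    }
  }

  override def write(record: InternalRow): Unit = {
    // Surface failures from earlier async operations before processing new records.
    checkAsyncException()
    appendAsync(record).whenComplete { (_, exception) =>
      // Only record the failure here; never throw inside the completion callback.
      if (exception != null && asyncWriterException == null) {
        asyncWriterException = exception
      }
    }
  }

  override def commit(): WriterCommitMessage = {
    flush()
    // Ensure any failure during flush or finalization is surfaced during commit.
    checkAsyncException()
    new WriterCommitMessage {}
  }

  override def abort(): Unit = ()

  override def close(): Unit = ()
}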

Comment on lines 107 to 108
logError("Exception occurs while upsert row to fluss.", exception);
throw new RuntimeException("Failed to upsert record", exception)
Member:

Ditto. And we can move the logging into the checkAsyncException() method.

* number is the number of microseconds before {@code 1970-01-01 00:00:00}
*/
public static TimestampNtz fromMicros(long microseconds) {
return new TimestampNtz(Math.floorDiv(microseconds, MICROS_PER_MILLIS), 0);
Member:

Why not convert the microsecond component to nanoseconds? From my perspective, using zero would actually lose precision. The method org.apache.fluss.row.TimestampLtz#fromEpochMicros intentionally preserves the full microsecond resolution; maybe we can use the same implementation here?
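For reference, a minimal sketch of the arithmetic that splits a microsecond value into milliseconds plus a nanosecond-of-millisecond remainder without dropping anything (the constants are written out literally here, not the actual Fluss fields, and the example value is arbitrary):

val micros = 1696000123456L
val epochMillis = Math.floorDiv(micros, 1000L)                 // 1696000123
val nanoOfMilli = (Math.floorMod(micros, 1000L) * 1000L).toInt // 456000 ns instead of 0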

Contributor (Author):

You're right. I will modify this.

@wuchong merged commit 4e49f2d into apache:main on Jan 4, 2026; 6 checks passed.
@wuchong linked an issue on Jan 4, 2026 that may be closed by this pull request.

Development

Successfully merging this pull request may close these issues.

[Feature] Support batch write of spark
