SYSTEMDS-3539 Implement delta encoding (Parts 1, 2, and 3) #2361
Conversation
Thanks for the good first PR @HanaHalitim, which also includes plenty of tests. There are a few things that need to be addressed; I will leave some comments in the code.
// }
public static AColGroup create(IColIndex colIndexes, IDictionary dict, AMapToData data, int[] cachedCounts) {
	if(data.getUnique() == 1)
		return ColGroupConst.create(colIndexes, dict);
	for(int j = 0; j < nCol; j++) {
		prevRow[j] = prevRowData[prevOff + _colIndexes.get(j)];
	}
}
	oldIdToNewId[dac.id] = i;
	idx += colIndexes.size();
}
IDictionary dict = new DeltaDictionary(dictValues, colIndexes.size());
package org.apache.sysds.test.component.compress.estim.encoding;

import static org.junit.Assert.assertEquals;
- Fixed incorrect decompression logic for rl > 0 (partial ranges).
- Removed unnecessary empty constructors.
- Overrode unsupported DDC methods in ColGroupDeltaDDC.
- Corrected ColGroupDeltaDDC.create for constant conversion.
- Fixed dictionary allocation size for extra flag in ColGroupFactory.
- Optimized CUMSUM/ROWCUMSUM to reinterpret DDC groups as DeltaDDC.
- Strengthened EncodeDeltaTest assertions and added combine() tests.
- Added new tests for partial range decompression and serialization.
- Removed unused imports.

- Implemented DeltaDDC conversion to DDC for unsupported scalar/unary ops (e.g., K-Means).
- Added comprehensive tests for relational and unary operations in ColGroupDeltaDDCTest.
Thank you for the improvements and the fix for the failing test case @HanaHalitim. I will leave a few comments in the code that I think should be addressed before merging.
double[] dictVals = null;
try {
	dictVals = _dict.getValues();
Why is this wrapped in a try ... catch? I don't think that there is a scenario where this would fail
		prevRow[j] = val;
	}
}
} else {
Can this case actually happen? Otherwise remove that null check
else if(ct == CompressionType.DeltaDDC) {
	return directCompressDeltaDDC(colIndexes, cg);
}
else if(ct == CompressionType.CONST && cs.preferDeltaEncoding) {
Why would you encode CONST as DeltaDDC?
}
final IntArrayList[] of = ubm.getOffsetList();
// old: if(of.length == 1 && of[0].size() == nRow) { // If this always constant
if(of.length == 1 && of[0].size() == nRow && ct != CompressionType.DeltaDDC) { // If this always constant
Why would you encode CONST as DeltaDDC?
final int r = m.getNumRows();
final int c = m.getNumColumns();

if(Builtin.isBuiltinCode(op.fn, BuiltinCode.CUMSUM, BuiltinCode.ROWCUMSUM)) {
Don't handle the ROWCUMSUM case, it would not be efficient (and DDC reinterpretation would be wrong)
if(allDDC && !groups.isEmpty()) {
	MatrixBlock uncompressed = m.getUncompressed("CUMSUM/ROWCUMSUM requires uncompression", op.getNumThreads());
	MatrixBlock opResult = uncompressed.unaryOperations(op, null);
Don't uncompress to do this redundant operation. In case of CUMSUM the reinterpretation should always be correct, so no need to verify.
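To illustrate why the CUMSUM reinterpretation needs no verification, here is a minimal sketch using plain arrays in place of the real dictionary and mapping classes (an assumption; the actual code operates on `AMapToData`/`IDictionary`). Delta decoding is the running-sum recurrence, so decoding a DDC column's original values as deltas is exactly the column-wise cumulative sum.

```java
import java.util.Arrays;

public class CumsumReinterpret {
	// Column-wise cumulative sum, as cumsum(X) would compute it.
	static double[] cumsum(double[] col) {
		double[] out = new double[col.length];
		double run = 0;
		for(int i = 0; i < col.length; i++) {
			run += col[i];
			out[i] = run;
		}
		return out;
	}

	// Delta decoding: decoded row i is the running sum of stored rows 0..i.
	// Same recurrence, so decoding the original values as deltas yields
	// cumsum(col) without touching the dictionary contents.
	static double[] decodeAsDeltas(double[] stored) {
		return cumsum(stored);
	}

	public static void main(String[] args) {
		double[] col = {3, 1, 4, 1, 5};
		System.out.println(Arrays.equals(cumsum(col), decodeAsDeltas(col))); // true
	}
}
```

Note that this identity only holds along the row dimension of a column, which matches the reviewer's point above: ROWCUMSUM runs across columns, so the same reinterpretation would be wrong there.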
MatrixBlock uncompressed = m.getUncompressed("CUMSUM/ROWCUMSUM requires uncompression", op.getNumThreads());
MatrixBlock opResult = uncompressed.unaryOperations(op, null);

CompressionSettingsBuilder csb = new CompressionSettingsBuilder();
csb.clearValidCompression();
csb.setPreferDeltaEncoding(true);
csb.addValidCompression(CompressionType.DeltaDDC);
csb.addValidCompression(CompressionType.UNCOMPRESSED);
csb.setTransposeInput("false");
Pair<MatrixBlock, CompressionStatistics> compressedPair = CompressedMatrixBlockFactory.compress(opResult, op.getNumThreads(), csb);
MatrixBlock compressedResult = compressedPair.getLeft();

if(compressedResult == null) {
	compressedResult = opResult;
}

CompressedMatrixBlock finalResult;
if(compressedResult instanceof CompressedMatrixBlock) {
	finalResult = (CompressedMatrixBlock) compressedResult;
}
else {
	finalResult = CompressedMatrixBlockFactory.genUncompressedCompressedMatrixBlock(compressedResult);
}

return finalResult;
This part is unnecessary. Let it just fall through and let the case LibMatrixAgg.isSupportedUnaryOperator(op) handle it.
	return finalResult;
}
It might make more sense to put the entire if branch below if (m.isEmpty())...
protected final DArrCounts create(DblArray key, int id) {
	// old: return new DArrCounts(key, id);
	return new DArrCounts(new DblArray(key), id);
You don't need to create a copy of key because new DArrCounts(...) already takes care of that. So you can safely revert that change
@Override
protected void decompressToSparseBlockDenseDictionary(SparseBlock ret, int rl, int ru, int offR, int offC,
	double[] values) {
	final int nCol = _colIndexes.size();
	final double[] prevRow = new double[nCol];

	if(rl > 0) {
		final int dictIdx0 = _data.getIndex(0);
		final int rowIndex0 = dictIdx0 * nCol;
		for(int j = 0; j < nCol; j++) {
			prevRow[j] = values[rowIndex0 + j];
		}
		for(int i = 1; i < rl; i++) {
			final int dictIdx = _data.getIndex(i);
			final int rowIndex = dictIdx * nCol;
			for(int j = 0; j < nCol; j++) {
				prevRow[j] += values[rowIndex + j];
			}
		}
	}

	for(int i = rl, offT = rl + offR; i < ru; i++, offT++) {
		final int dictIdx = _data.getIndex(i);
		final int rowIndex = dictIdx * nCol;

		if(i == 0 && rl == 0) {
			for(int j = 0; j < nCol; j++) {
				final double value = values[rowIndex + j];
				final int colIdx = _colIndexes.get(j);
				ret.append(offT, colIdx + offC, value);
				prevRow[j] = value;
			}
		}
		else {
			for(int j = 0; j < nCol; j++) {
				final double delta = values[rowIndex + j];
				final double newValue = prevRow[j] + delta;
				final int colIdx = _colIndexes.get(j);
				ret.append(offT, colIdx + offC, newValue);
				prevRow[j] = newValue;
			}
		}
	}
}
I don't think that this method is covered by any test case
@Override
public AColGroup scalarOperation(ScalarOperator op) {
	if(op.fn instanceof Multiply || op.fn instanceof Divide) {
		return super.scalarOperation(op);
Untested
		return super.scalarOperation(op);
	}
	else if(op.fn instanceof Plus || op.fn instanceof Minus) {
		return scalarOperationShift(op);
Untested
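For context on the shift path above, here is a hedged sketch of the idea behind a scalar-shift under delta encoding, with plain arrays standing in for the dictionary and mapping (an assumption; the real `scalarOperationShift` must additionally handle the case where the first row's dictionary entry is shared with later rows). Adding a constant to every row changes only the base value, because (x[i] + c) - (x[i-1] + c) == x[i] - x[i-1] leaves every delta untouched.

```java
import java.util.Arrays;

public class DeltaScalarShift {
	// Decode a delta-encoded column: entry 0 is the base value,
	// the remaining entries are row-to-row differences.
	static double[] decode(double[] enc) {
		double[] out = new double[enc.length];
		double run = 0;
		for(int i = 0; i < enc.length; i++) {
			run += enc[i];
			out[i] = run;
		}
		return out;
	}

	// Scalar plus: shift only the base; all deltas stay unchanged.
	static double[] plusScalar(double[] enc, double c) {
		double[] out = enc.clone();
		out[0] += c;
		return out;
	}

	public static void main(String[] args) {
		double[] enc = {10, 2, -1};               // decodes to {10, 12, 11}
		double[] res = decode(plusScalar(enc, 5));
		System.out.println(Arrays.toString(res)); // [15.0, 17.0, 16.0]
	}
}
```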
if(nCol == 1) {
	DoubleCountHashMap map = new DoubleCountHashMap(16);
	AMapToData mapData = MapToFactory.create(nRow, 256);
What if number of unique items > 256?
}
else {
	DblArrayCountHashMap map = new DblArrayCountHashMap(16);
	AMapToData mapData = MapToFactory.create(nRow, 256);
What if number of unique items > 256?
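One way to address the reviewer's concern is to grow the id capacity on demand instead of hard-coding 256. The sketch below is hypothetical (the class name and fields are illustrative, not the real `MapToFactory`/`AMapToData` API): it tracks the logical capacity of a row-to-dictionary-id mapping and doubles it whenever an id overflows, where the real code would re-allocate a wider map.

```java
public class GrowOnDemandMap {
	// Hypothetical stand-in for a resizable row -> dictionary-id mapping.
	private final int[] map;
	private int capacity; // max representable id + 1 (256 for a byte-wide map)

	GrowOnDemandMap(int nRow, int initialCapacity) {
		map = new int[nRow];
		capacity = initialCapacity;
	}

	void set(int row, int id) {
		while(id >= capacity)
			capacity *= 2; // real code would allocate a wider map here
		map[row] = id;
	}

	int getCapacity() {
		return capacity;
	}

	public static void main(String[] args) {
		GrowOnDemandMap m = new GrowOnDemandMap(1000, 256);
		m.set(0, 300); // id 300 no longer fits a byte-wide map
		System.out.println(m.getCapacity()); // 512
	}
}
```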
@Override
public AColGroup sliceRows(int rl, int ru) {
	AMapToData slicedData = _data.slice(rl, ru);
	final int nCol = _colIndexes.size();
	double[] firstRowValues = new double[nCol];
	double[] dictVals = ((DeltaDictionary) _dict).getValues();

	for(int i = 0; i <= rl; i++) {
		int dictIdx = _data.getIndex(i);
		int dictOffset = dictIdx * nCol;
		if(i == 0) {
			for(int j = 0; j < nCol; j++) firstRowValues[j] = dictVals[dictOffset + j];
		} else {
			for(int j = 0; j < nCol; j++) firstRowValues[j] += dictVals[dictOffset + j];
		}
	}

	int nEntries = dictVals.length / nCol;
	int newId = -1;
	for(int k = 0; k < nEntries; k++) {
		boolean match = true;
		for(int j = 0; j < nCol; j++) {
			if(dictVals[k * nCol + j] != firstRowValues[j]) {
				match = false;
				break;
			}
		}
		if(match) {
			newId = k;
			break;
		}
	}

	IDictionary newDict = _dict;
	if(newId == -1) {
		double[] newDictVals = Arrays.copyOf(dictVals, dictVals.length + nCol);
		System.arraycopy(firstRowValues, 0, newDictVals, dictVals.length, nCol);
		newDict = new DeltaDictionary(newDictVals, nCol);
		newId = nEntries;

		if(newId >= slicedData.getUpperBoundValue()) {
			slicedData = slicedData.resize(newId + 1);
		}
	}

	slicedData.set(0, newId);
	return ColGroupDeltaDDC.create(_colIndexes, newDict, slicedData, null);
}
Is this method covered by tests?
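The core of the sliceRows logic above can be sketched with plain arrays (an assumption; the real method works on the dictionary and `AMapToData`): the new base of the slice is the prefix sum of the stored entries up to and including rl, mirroring the i <= rl loop, while the remaining deltas carry over unchanged.

```java
import java.util.Arrays;

public class DeltaSliceSketch {
	// Decode a delta-encoded column via running sum.
	static double[] decode(double[] enc) {
		double[] out = new double[enc.length];
		double run = 0;
		for(int i = 0; i < enc.length; i++) {
			run += enc[i];
			out[i] = run;
		}
		return out;
	}

	// Slice rows [rl, ru): new base = prefix sum up to rl, deltas copied.
	static double[] slice(double[] enc, int rl, int ru) {
		double base = 0;
		for(int i = 0; i <= rl; i++)
			base += enc[i];
		double[] out = new double[ru - rl];
		out[0] = base;
		for(int i = 1; i < out.length; i++)
			out[i] = enc[rl + i];
		return out;
	}

	public static void main(String[] args) {
		double[] enc = {10, 2, -1, 3};                       // decodes to {10, 12, 11, 14}
		double[] sliced = slice(enc, 1, 4);                  // {12, -1, 3}
		System.out.println(Arrays.toString(decode(sliced))); // [12.0, 11.0, 14.0]
	}
}
```

In the real column group the freshly computed base row may not exist in the dictionary yet, which is why the code above searches for a matching entry and appends one (resizing the map) when none is found.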
- ColGroupDDC: Reverted defensive try-catch around getValues() to match project convention.
- ColGroupFactory: Removed redundant check preventing CONST groups when DeltaDDC is requested.
- CLALibUnary: Removed flawed CUMSUM optimization and ROWCUMSUM support; rely on robust recompression fallback.
- ColGroupDeltaDDC: Implemented dynamic resizing for map construction to handle unknown unique counts (>256).
- ColGroupDeltaDDC: Fixed and verified scalar shift logic with map handling.
- DblArrayCountHashMap: Removed redundant object creation.
- Tests: Added comprehensive tests for scalar ops in ColGroupDeltaDDCTest; adjusted CLALibUnaryDeltaTest to reflect removed ROWCUMSUM support.

- Corrected scalar Multiply and Divide for DeltaDDC by scaling the dictionary values instead of falling back to default DDC logic (which was incorrect for deltas).
- Added unit tests for scalar operations (Plus, Minus, Multiply, Divide) in ColGroupDeltaDDCTest.
- Implemented and tested sliceRows support in ColGroupDeltaDDCTest, verifying that slicing DeltaDDC column groups preserves the delta encoding structure.
- Refined CLALibUnary structure by moving CUMSUM optimization check after isEmpty() check.
Thank you so much for the review, Jannik! I have tried to make the adjustments as requested.
Codecov Report
❌ Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
## main #2361 +/- ##
============================================
+ Coverage 72.33% 72.38% +0.04%
- Complexity 46911 47175 +264
============================================
Files 1513 1516 +3
Lines 178198 179309 +1111
Branches 34984 35212 +228
============================================
+ Hits 128897 129790 +893
- Misses 39556 39705 +149
- Partials 9745 9814 +69

View full report in Codecov by Sentry.
Thank you @HanaHalitim for the recent changes. Please have a look at the missing test coverage from the codecov report to cover the remaining relevant code parts (excluding
Hi @janniklinde |
Implemented delta encoding compression for SystemDS. Added ColGroupDeltaDDC compression type that stores row differences instead of absolute values, improving compression for data with predictable patterns.
Created delta readers that compute row differences on-the-fly during compression, avoiding delta matrix materialization. Wired CUMSUM and ROWCUMSUM operations to automatically use delta encoding for their results.
Extended compression estimation with preferDeltaEncoding flag to evaluate delta encoding as a compression option. Fixed dictionary remapping bug where extractValues() reordered entries, breaking row-to-dictionary mappings.
All tests pass including ColGroupDeltaDDCTest, CLALibUnaryDeltaTest, ReadersDeltaTest, and related compression tests.
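The core idea described above can be illustrated with a minimal, self-contained sketch (plain arrays standing in for the real dictionary and column-group classes): delta encoding keeps the first value and stores row-to-row differences, so data with a predictable trend collapses to a handful of distinct deltas and a tiny dictionary.

```java
import java.util.Arrays;

public class DeltaEncodingDemo {
	// Delta-encode one column: keep the first value, store differences.
	static double[] encode(double[] col) {
		double[] out = new double[col.length];
		out[0] = col[0];
		for(int i = 1; i < col.length; i++)
			out[i] = col[i] - col[i - 1];
		return out;
	}

	// Decode via running sum; this is what DeltaDDC decompression computes.
	static double[] decode(double[] enc) {
		double[] out = new double[enc.length];
		double run = 0;
		for(int i = 0; i < enc.length; i++) {
			run += enc[i];
			out[i] = run;
		}
		return out;
	}

	public static void main(String[] args) {
		double[] col = {100, 101, 102, 103, 104};
		double[] enc = encode(col); // {100, 1, 1, 1, 1}
		// Four identical deltas -> two distinct dictionary entries instead
		// of five, which is where DeltaDDC wins over plain DDC on such data.
		System.out.println(Arrays.toString(enc));
		System.out.println(Arrays.equals(decode(enc), col)); // round trip: true
	}
}
```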