
Conversation


@Biranavan-Parameswaran commented Dec 8, 2025

This PR implements a multithreaded roll operation as requested in SYSTEMDS-3730.

This patch introduces multi-threading support for the roll operation to
improve performance. RollTest.java has been updated to cover
both single-threaded and multithreaded execution modes.

Furthermore, this update adds comprehensive consistency checks to ensure
mathematical correctness. New tests were created to validate both dense
and sparse matrix inputs. Additionally, cross-verification tests were
added to confirm that sparse and dense rolling for single and
multithreaded executions produce identical results.

RollOperation: Single vs Multithreaded Performance Summary

Test Matrix Sizes

Two ranges of matrix sizes were evaluated:

  1. Small/Medium Matrices:
    Rows = 2017–2516, Cols = 1001–1500
    (Defined as: MIN_ROWS = 2017, MIN_COLS = 1001, plus a random offset of up to 500)
  2. Large Matrices:
    Rows ≈ 8000–8500, Cols ≈ 4000–4500
    (Same generation logic but higher baseline)

These two ranges allow observing how workload size impacts single-threaded (ST) vs multithreaded (MT) performance.
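
For clarity, a minimal sketch of the size generation described above (the unseeded java.util.Random and the exact expressions are assumptions, not necessarily the benchmark code):

```java
// Illustrative only: generate a random matrix size in the stated ranges.
java.util.Random rand = new java.util.Random();
int rows = 2017 + rand.nextInt(500); // MIN_ROWS + [0, 499] -> 2017–2516
int cols = 1001 + rand.nextInt(500); // MIN_COLS + [0, 499] -> 1001–1500
```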


Dense Matrices

  • Multithreaded (MT, 10 cores) execution consistently improves performance versus single-threaded (ST, 1 core).
  • Small/Medium matrices: typically 2×–3.5× speedup.
  • Large matrices: typically 1.3×–4.7× speedup, depending on layout and shift.
  • Occasional slowdowns are visible, almost certainly due to JVM warm-up effects during early measurements.

Sparse Matrices (1% sparsity)

  • Small/Medium sparse matrices: MT is often slower than ST because the workload is too small; overhead dominates.
    • Example: 0.5 ms (ST) → 2.0 ms (MT) → 0.25× speedup, i.e., 4× slower
  • Large sparse matrices: MT becomes consistently beneficial, commonly achieving 3×–4.5× speedups.
    • Example: 7.0 ms (ST) → 1.8 ms (MT) → 3.9× speedup

Key Point

  • Dense: MT provides strong speedups for all matrix sizes. Occasional outliers are due to JVM warm-up.
  • Sparse: MT helps only when the matrix contains enough actual stored values. Highly sparse matrices should remain single-threaded (ST) to avoid overhead.

This shows that very sparse matrices should stay single-threaded, while denser workloads benefit from multithreading.

@janniklinde will review this PR.

Biranavan Parameswaran added 2 commits December 7, 2025 23:58
This patch introduces multi-threading support for the roll operation to
improve performance. The RollTest.java has been updated to cover
both single and multithreaded execution modes.

Furthermore, this update adds comprehensive consistency checks to ensure
mathematical correctness. New tests were created to validate both dense
and sparse matrix inputs. Additionally, cross-verification tests were
added to confirm that sparse and dense rolling for single and
multithreaded executions produce identical results.
This commit adds logs to measure the speedup between the single-threaded and multithreaded roll operation (on dense and sparse matrices).
@janniklinde
Contributor

Thank you for the very good PR @Biranavan-Parameswaran and thanks for addressing the normalization issue for negative shifts in the roll operation. I particularly like that you added new correctness tests to increase test coverage and that you provide benchmarks to verify speedups for larger / dense matrices.

I suggest that RollOperationThreadSafetyTest be moved to test.java.org.apache.sysds.performance.matrix (and renamed to MatrixRollPerf). As you correctly mentioned, warm-ups can distort experimental results. Similarly, the overhead of JUnit is also not ideal for performance tests. I would consider doing a performance test in the style of MatrixAggregate to get more accurate results.

It would be interesting to see the performance of larger dense matrices with a small number of columns (including column vectors). My guess is that performance drops in that scenario because you then do many array copies on small ranges. If the underlying row-major DenseBlock is DenseBlockFP64, I think we could benefit from replacing the row-by-row copy with one (or two in the overflow case) large arraycopy calls (similar to the clen == 1 case in copyDenseMtx(...)).


codecov bot commented Dec 11, 2025

Codecov Report

❌ Patch coverage is 94.52055% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.33%. Comparing base (79122eb) to head (2978d00).
⚠️ Report is 6 commits behind head on main.

Files with missing lines                              | Patch % | Lines
...ache/sysds/runtime/matrix/data/LibMatrixReorg.java | 95.52%  | 2 Missing and 1 partial ⚠️
...ds/runtime/instructions/cp/ReorgCPInstruction.java | 66.66%  | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2376      +/-   ##
============================================
+ Coverage     72.29%   72.33%   +0.03%     
- Complexity    46937    47025      +88     
============================================
  Files          1513     1513              
  Lines        178421   178806     +385     
  Branches      35036    35092      +56     
============================================
+ Hits         128993   129341     +348     
- Misses        39665    39686      +21     
- Partials       9763     9779      +16     


Introduced a new test class to properly measure the performance of single-threaded and multithreaded rolling and removed an unnecessary if condition.
@Biranavan-Parameswaran
Author

This patch introduces multi-threading support for the roll operation to improve performance across a wide range of matrix shapes and sizes. The new MatrixRollPerf benchmark was added to evaluate performance characteristics under different workloads, while the existing RollOperationThreadSafetyTest was intentionally kept to ensure correctness and equivalence between the single-threaded and multithreaded execution paths.


RollOperation: Single vs Multithreaded Performance Summary

Test Coverage and Motivation

Performance was evaluated across multiple matrix configurations to understand how matrix shape, density, and size affect multithreaded scaling. The goal was to validate expected speedups and identify edge cases where multithreading may not be beneficial.


General Performance Observations

Dense Matrices with Moderate Aspect Ratios

For dense matrices with moderate to large column counts, multithreaded execution consistently improves performance.

  • Typical speedups range from ~1.3× to ~4×, depending on matrix size and shape.
  • Larger matrices benefit more from multithreading since overhead becomes less significant.

These results confirm that the multithreaded roll implementation behaves as expected for typical dense workloads.


Sparse Matrices

Sparse matrices show more nuanced behavior.

  • For small or moderately sized sparse matrices, multithreading can be slower due to scheduling and synchronization overhead.
  • As matrix sizes grow, multithreading becomes beneficial, with observed speedups up to ~4×.

This aligns with my expectations. Sparse workloads require sufficient actual work per thread to justify parallel execution.


Matrices with Few Columns and Large Row Counts

Additional experiments were conducted for very tall matrices with few columns, including the extreme case of clen == 1 (column vectors). These shapes were specifically tested following @janniklinde's feedback.

Initial Behavior

For column vectors, the original multithreaded implementation performed significantly worse than the single-threaded version. This was due to many small copy operations and excessive threading overhead.

Attempted Optimization

An optimized multithreaded implementation was prototyped for the case where the underlying block has clen == 1 (column vectors). This version replaced the row-by-row copies with one or two large contiguous System.arraycopy calls, similar to the existing optimized path in copyDenseMtx.

This optimization significantly improved the multithreaded behavior compared to the previous implementation.
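
For reference, a minimal sketch of the one-or-two-arraycopy approach described above, assuming the column vector is backed by a single contiguous double[] and the shift is already normalized to [0, rlen) (method name and signature are illustrative, not the actual prototype code):

```java
// Illustrative sketch only: roll a contiguous column vector (clen == 1)
// down by 'shift' rows using at most two arraycopy calls.
static void rollColumnVector(double[] src, double[] dst, int rlen, int shift) {
	// rows [0, rlen-shift) move to positions [shift, rlen)
	System.arraycopy(src, 0, dst, shift, rlen - shift);
	// overflow: rows [rlen-shift, rlen) wrap around to positions [0, shift)
	if (shift > 0)
		System.arraycopy(src, rlen - shift, dst, 0, shift);
}
```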

Final Results

Despite the improvement, benchmarking shows that even with this optimization:

  • For very large column vectors (100M–150M rows), single-threaded execution remains faster.
  • Example results:
    • 100M × 1 dense: ~50 ms (ST) vs ~54 ms (MT)
    • 150M × 1 dense: ~68 ms (ST) vs ~89 ms (MT)

This indicates that the operation is memory bandwidth bound and already near optimal when using one or two large arraycopy calls. The overhead of multithreading outweighs its benefits, even at large sizes.

Because the optimized multithreaded path for clen == 1 provides no net performance benefit, the change was not committed; it would increase complexity without improving performance.


Conclusion and Proposed Direction

  • Multithreading provides strong and consistent speedups for dense and sufficiently large sparse matrices.
  • Small or moderately sized sparse matrices and column vectors can suffer from threading overhead.
  • Column vectors (clen == 1) are best handled by a single-threaded bulk copy, as this already saturates memory bandwidth.

Based on these findings, a reasonable follow-up improvement would be to explicitly force a single-threaded roll path for clen == 1, while retaining multithreaded execution for wider matrices where it demonstrably improves performance.
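
As an illustration of what that specialization could look like (the method name and the nnz cutoff are purely illustrative placeholders, not measured or tuned values):

```java
// Illustrative sketch only: choose the degree of parallelism for roll.
// The clen == 1 rule follows the column-vector benchmarks above; the nnz
// cutoff for small/sparse inputs is a placeholder, not a tuned constant.
static int chooseRollParallelism(int clen, long nnz, int maxThreads) {
	if (clen == 1)
		return 1; // column vectors: a bulk arraycopy already saturates memory bandwidth
	if (nnz < 1_000_000)
		return 1; // small or very sparse inputs: threading overhead dominates
	return maxThreads;
}
```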

If there is agreement on this approach, I can follow up with a targeted change implementing this specialization.

@janniklinde
Contributor

Thank you for the changes and the detailed performance analysis @Biranavan-Parameswaran. There is still a missing license that should be added.

@Biranavan-Parameswaran
Author

Thank you for the reminder @janniklinde. I have added the missing license.

@janniklinde
Contributor

LGTM @mboehm7 - one possible concern I have with both the single-threaded and multithreaded implementations is that the clen == 1 path has to guarantee that in.getDenseBlockValues() does not throw an exception or return only the first row. Because the single-threaded case already had that implementation, I am not sure whether it is intentional or an issue.

Thank you again for your contribution, bugfixes, and detailed tests and benchmarks @Biranavan-Parameswaran.

It appears that System.arraycopy already saturates memory bandwidth (shown by your single-column benchmarks). An optional improvement I could see for DenseBlockFP64 is to always perform two large arraycopies instead of many small ones (currently this is limited to the clen == 1 case). Due to the contiguous row-major layout of some dense block implementations, performing two large arraycopies should be possible.
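
A rough sketch of that generalization, assuming the block is backed by a single contiguous row-major double[] (i.e., not the large multi-array case) and the shift is already normalized to [0, rlen) (method name is illustrative):

```java
// Illustrative sketch only: roll a fully contiguous row-major block by
// 'shift' rows with at most two large arraycopies, independent of clen.
static void rollContiguousDense(double[] src, double[] dst, int rlen, int clen, int shift) {
	int off = shift * clen; // element offset corresponding to the row shift
	int len = rlen * clen;  // total number of elements
	// rows [0, rlen-shift) move to positions [shift, rlen)
	System.arraycopy(src, 0, dst, off, len - off);
	// wrapped rows [rlen-shift, rlen) go to positions [0, shift)
	if (off > 0)
		System.arraycopy(src, len - off, dst, 0, off);
}
```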
