
Conversation


@Biranavan-Parameswaran commented Dec 8, 2025

This PR implements a multithreaded roll operation as requested in SYSTEMDS-3730.

This patch introduces multi-threading support for the roll operation to
improve performance. RollTest.java has been updated to cover
both single-threaded and multithreaded execution modes.

Furthermore, this update adds comprehensive consistency checks to ensure
mathematical correctness. New tests were created to validate both dense
and sparse matrix inputs. Additionally, cross-verification tests were
added to confirm that sparse and dense rolling for single and
multithreaded executions produce identical results.

RollOperation: Single vs Multithreaded Performance Summary

Test Matrix Sizes

Two ranges of matrix sizes were evaluated:

  1. Small/Medium Matrices:
    Rows = 2017–2516, Cols = 1001–1500
    (Defined as: MIN_ROWS = 2017, MIN_COLS = 1001, plus a random offset of up to 500)
  2. Large Matrices:
    Rows ≈ 8000–8500, Cols ≈ 4000–4500
    (Same generation logic but higher baseline)

These two ranges allow observing how workload size impacts single-threaded (ST) vs multithreaded (MT) performance.
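
For clarity, a minimal sketch of the size generation described above (the unseeded java.util.Random and the exact expressions are assumptions, not necessarily the benchmark code):

```java
// Illustrative only: generate a random matrix size in the stated ranges.
java.util.Random rand = new java.util.Random();
int rows = 2017 + rand.nextInt(500); // MIN_ROWS + [0, 499] -> 2017–2516
int cols = 1001 + rand.nextInt(500); // MIN_COLS + [0, 499] -> 1001–1500
```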


Dense Matrices

  • Multithreaded (MT, 10 cores) execution consistently improves performance versus single-threaded (ST, 1 core).
  • Small/Medium matrices: typically 2×–3.5× speedup.
  • Large matrices: typically 1.3×–4.7× speedup, depending on layout and shift.
  • Occasional slowdowns are visible, almost certainly due to JVM warm-up effects during early measurements.

Sparse Matrices (1% sparsity)

  • Small/Medium sparse matrices: MT is often slower than ST because the workload is too small; overhead dominates.
    • Example: 0.5 ms (ST) → 2.0 ms (MT) → 0.25× speedup, i.e., 4× slower
  • Large sparse matrices: MT becomes consistently beneficial, commonly achieving 3×–4.5× speedups.
    • Example: 7.0 ms (ST) → 1.8 ms (MT) → 3.9× speedup

Key Point

  • Dense: MT provides strong speedups for all matrix sizes. Occasional outliers are due to JVM warm-up.
  • Sparse: MT helps only when the matrix contains enough actual stored values. Highly sparse matrices should remain single-threaded (ST) to avoid overhead.

This shows that very sparse matrices should stay single-threaded, while denser workloads benefit from multithreading.

@janniklinde will review this PR.

Biranavan Parameswaran added 2 commits December 7, 2025 23:58
This patch introduces multi-threading support for the roll operation to
improve performance. The RollTest.java has been updated to cover
both single and multithreaded execution modes.

Furthermore, this update adds comprehensive consistency checks to ensure
mathematical correctness. New tests were created to validate both dense
and sparse matrix inputs. Additionally, cross-verification tests were
added to confirm that sparse and dense rolling for single and
multithreaded executions produce identical results.
This commit adds logs to measure the speedup between the single-threaded and multithreaded roll operation (on dense and sparse matrices).
@janniklinde
Contributor

Thank you for the very good PR @Biranavan-Parameswaran and thanks for addressing the normalization issue for negative shifts in the roll operation. I particularly like that you added new correctness tests to increase test coverage and that you provide benchmarks to verify speedups for larger / dense matrices.

I suggest that RollOperationThreadSafetyTest be moved to test.java.org.apache.sysds.performance.matrix (and renamed to MatrixRollPerf). As you correctly mentioned, warm-ups can distort experimental results. Similarly, the overhead of JUnit is also not ideal for performance tests. I would consider doing a performance test in the style of MatrixAggregate to get more accurate results.

It would be interesting to see the performance of larger dense matrices with a small number of columns (including column vectors). My guess is that performance drops in that scenario because you then do many array copies on small ranges. If the underlying row-major DenseBlock is DenseBlockFP64, I think we could benefit from replacing the row-by-row copy with one (or two in the overflow case) large arraycopy calls (similar to the clen == 1 case in copyDenseMtx(...)).


codecov bot commented Dec 11, 2025

Codecov Report

❌ Patch coverage is 94.52055% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.33%. Comparing base (79122eb) to head (2978d00).
⚠️ Report is 6 commits behind head on main.

Files with missing lines                              | Patch % | Lines
...ache/sysds/runtime/matrix/data/LibMatrixReorg.java | 95.52%  | 2 Missing and 1 partial ⚠️
...ds/runtime/instructions/cp/ReorgCPInstruction.java | 66.66%  | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2376      +/-   ##
============================================
+ Coverage     72.29%   72.33%   +0.03%     
- Complexity    46937    47025      +88     
============================================
  Files          1513     1513              
  Lines        178421   178806     +385     
  Branches      35036    35092      +56     
============================================
+ Hits         128993   129341     +348     
- Misses        39665    39686      +21     
- Partials       9763     9779      +16     


Introduced a new test class to properly measure the performance of single-threaded and multithreaded rolling and removed an unnecessary if condition.
@Biranavan-Parameswaran
Author

This patch introduces multi-threading support for the roll operation to improve performance across a wide range of matrix shapes and sizes. The new MatrixRollPerf benchmark was added to evaluate performance characteristics under different workloads, while the existing RollOperationThreadSafetyTest was intentionally kept to ensure correctness and equivalence between the single-threaded and multithreaded execution paths.


RollOperation: Single vs Multithreaded Performance Summary

Test Coverage and Motivation

Performance was evaluated across multiple matrix configurations to understand how matrix shape, density, and size affect multithreaded scaling. The goal was to validate expected speedups and identify edge cases where multithreading may not be beneficial.


General Performance Observations

Dense Matrices with Moderate Aspect Ratios

For dense matrices with moderate to large column counts, multithreaded execution consistently improves performance.

  • Typical speedups range from ~1.3× to ~4×, depending on matrix size and shape.
  • Larger matrices benefit more from multithreading since overhead becomes less significant.

These results confirm that the multithreaded roll implementation behaves as expected for typical dense workloads.


Sparse Matrices

Sparse matrices show more nuanced behavior.

  • For small or moderately sized sparse matrices, multithreading can be slower due to scheduling and synchronization overhead.
  • As matrix sizes grow, multithreading becomes beneficial, with observed speedups up to ~4×.

This aligns with my expectations. Sparse workloads require sufficient actual work per thread to justify parallel execution.


Matrices with Few Columns and Large Row Counts

Additional experiments were conducted for very tall matrices with few columns, including the extreme case of clen == 1 (column vectors). These shapes were specifically tested following @janniklinde's feedback.

Initial Behavior

For column vectors, the original multithreaded implementation performed significantly worse than the single-threaded version. This was due to many small copy operations and excessive threading overhead.

Attempted Optimization

An optimized multithreaded implementation was prototyped for the case where the underlying block has clen == 1 (column vectors). This version replaced the row-by-row copies with one or two large contiguous System.arraycopy calls, similar to the existing optimized path in copyDenseMtx.

This optimization significantly improved the multithreaded behavior compared to the previous implementation.
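
For reference, a minimal sketch of the one-or-two-arraycopy approach described above, assuming the column vector is backed by a single contiguous double[] and the shift is already normalized to [0, rlen) (method name and signature are illustrative, not the actual prototype code):

```java
// Illustrative sketch only: roll a contiguous column vector (clen == 1)
// down by 'shift' rows using at most two arraycopy calls.
static void rollColumnVector(double[] src, double[] dst, int rlen, int shift) {
	// rows [0, rlen-shift) move to positions [shift, rlen)
	System.arraycopy(src, 0, dst, shift, rlen - shift);
	// overflow: rows [rlen-shift, rlen) wrap around to positions [0, shift)
	if (shift > 0)
		System.arraycopy(src, rlen - shift, dst, 0, shift);
}
```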

Final Results

Despite the improvement, benchmarking shows that even with this optimization:

  • For very large column vectors (100M–150M rows), single-threaded execution remains faster.
  • Example results:
    • 100M × 1 dense: ~50 ms (ST) vs ~54 ms (MT)
    • 150M × 1 dense: ~68 ms (ST) vs ~89 ms (MT)

This indicates that the operation is memory bandwidth bound and already near optimal when using one or two large arraycopy calls. The overhead of multithreading outweighs its benefits, even at large sizes.

Because the optimized multithreaded path for clen == 1 provides no net performance benefit, the change was not committed; it would increase complexity without improving performance.


Conclusion and Proposed Direction

  • Multithreading provides strong and consistent speedups for dense and sufficiently large sparse matrices.
  • Small or moderately sized sparse matrices and column vectors can suffer from threading overhead.
  • Column vectors (clen == 1) are best handled by a single-threaded bulk copy, as this already saturates memory bandwidth.

Based on these findings, a reasonable follow-up improvement would be to explicitly force a single-threaded roll path for clen == 1, while retaining multithreaded execution for wider matrices where it demonstrably improves performance.
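
As an illustration of what that specialization could look like (the method name and the nnz cutoff are purely illustrative placeholders, not measured or tuned values):

```java
// Illustrative sketch only: choose the degree of parallelism for roll.
// The clen == 1 rule follows the column-vector benchmarks above; the nnz
// cutoff for small/sparse inputs is a placeholder, not a tuned constant.
static int chooseRollParallelism(int clen, long nnz, int maxThreads) {
	if (clen == 1)
		return 1; // column vectors: a bulk arraycopy already saturates memory bandwidth
	if (nnz < 1_000_000)
		return 1; // small or very sparse inputs: threading overhead dominates
	return maxThreads;
}
```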

If there is agreement on this approach, I can follow up with a targeted change implementing this specialization.

@janniklinde
Contributor

Thank you for the changes and the detailed performance analysis @Biranavan-Parameswaran. There is still a missing license that should be added.

@Biranavan-Parameswaran
Author

Thank you for the reminder @janniklinde. I have added the missing license.

@janniklinde
Contributor

LGTM @mboehm7 - one possible concern I have with both the single-threaded and multithreaded implementations is that the clen == 1 path has to guarantee that in.getDenseBlockValues() does not throw an exception or return only the first row. Because the single-threaded case already had that implementation, I am not sure whether it is intentional or an issue.

Thank you again for your contribution, bugfixes, and detailed tests and benchmarks @Biranavan-Parameswaran.

It appears that System.arraycopy already saturates memory bandwidth (shown by your single-column benchmarks). An optional improvement I could see for DenseBlockFP64 is to always perform two large arraycopies instead of many small ones (currently this is limited to the clen == 1 case). Due to the contiguous row-major layout of some dense block implementations, performing two large arraycopies should be possible.
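
A rough sketch of that generalization, assuming the block is backed by a single contiguous row-major double[] (i.e., not the large multi-array case) and the shift is already normalized to [0, rlen) (method name is illustrative):

```java
// Illustrative sketch only: roll a fully contiguous row-major block by
// 'shift' rows with at most two large arraycopies, independent of clen.
static void rollContiguousDense(double[] src, double[] dst, int rlen, int clen, int shift) {
	int off = shift * clen; // element offset corresponding to the row shift
	int len = rlen * clen;  // total number of elements
	// rows [0, rlen-shift) move to positions [shift, rlen)
	System.arraycopy(src, 0, dst, off, len - off);
	// wrapped rows [rlen-shift, rlen) go to positions [0, shift)
	if (off > 0)
		System.arraycopy(src, len - off, dst, 0, off);
}
```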
