[SYSTEMDS-3891] [Major] Refactor Cache Manager / Performance Improvements / OOC Statistics #2387
+4,135
−940
This PR depends on #2368.
This patch provides a major rework of the `OOCEvictionManager` by separating cache scheduling logic from I/O handling. The rework is needed (NullPointerExceptions can occur under memory pressure, especially when multiple blocks must be resident in memory simultaneously).

Further, we introduce detailed out-of-core statistics and a fine-grained event log that can be exported to CSV using the CLI options `-oocStats [topNHeavyHitters]` and `-oocLogEvents [savedir]`. The event log can be visualized to identify bottlenecks (see image below; performance of PCA on a 1Mx1000 input matrix). Detailed information on the experiment can be found on … The bottom graph shows compute tasks and idle times of the fixed-size `ThreadPool`. The y-axis of the bottom three graphs shows the thread ID of the worker performing the read/write/compute tasks.

Currently, it is still possible to exceed hard limits of the cache because of uncontrolled producers that are not yet unified with the cache system (e.g., `ReblockOOCInstruction`).
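
For readers unfamiliar with the design idea, the separation of cache scheduling from I/O handling can be illustrated with a minimal sketch. This is not the code in this PR: `OocCacheSketch`, `SpillHandler`, `CacheScheduler`, and all method names are hypothetical, and the LRU-with-pin-counts policy is an assumption chosen only to show why pinned blocks stay resident when several must be in memory at once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the SystemDS implementation.
public class OocCacheSketch {

    /** I/O side (hypothetical): only knows how to spill a block, makes no scheduling decisions. */
    interface SpillHandler {
        void spill(String key, double[] data);
    }

    /** Scheduling side (hypothetical): tracks residency and pins, and picks eviction victims. */
    static class CacheScheduler {
        private static class Block {
            final double[] data;
            int pins; // number of consumers that currently need this block resident
            Block(double[] data) { this.data = data; }
        }

        private final long limitBytes;
        private final SpillHandler io;
        // Access-ordered map: iteration starts at the least recently used block.
        private final LinkedHashMap<String, Block> resident =
            new LinkedHashMap<>(16, 0.75f, true);
        private long usedBytes = 0;

        CacheScheduler(long limitBytes, SpillHandler io) {
            this.limitBytes = limitBytes;
            this.io = io;
        }

        synchronized void put(String key, double[] data) {
            Block old = resident.put(key, new Block(data));
            if (old != null)
                usedBytes -= 8L * old.data.length;
            usedBytes += 8L * data.length;
            evictIfNeeded();
        }

        /** Returns the block and protects it from eviction, or null if it is not resident. */
        synchronized double[] pin(String key) {
            Block b = resident.get(key);
            if (b == null)
                return null;
            b.pins++;
            return b.data;
        }

        synchronized void unpin(String key) {
            Block b = resident.get(key);
            if (b != null && b.pins > 0)
                b.pins--;
            evictIfNeeded();
        }

        synchronized boolean isResident(String key) {
            return resident.containsKey(key);
        }

        synchronized long getUsedBytes() {
            return usedBytes;
        }

        // Spill unpinned blocks in LRU order until the limit is met.
        // If every block is pinned, the limit is exceeded rather than evicting a block in use.
        private void evictIfNeeded() {
            Iterator<Map.Entry<String, Block>> it = resident.entrySet().iterator();
            while (usedBytes > limitBytes && it.hasNext()) {
                Map.Entry<String, Block> e = it.next();
                if (e.getValue().pins > 0)
                    continue; // never evict a block that a consumer holds
                io.spill(e.getKey(), e.getValue().data);
                usedBytes -= 8L * e.getValue().data.length;
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<String> spilled = new ArrayList<>();
        // Limit of 1600 bytes holds two blocks of 100 doubles each.
        CacheScheduler cache = new CacheScheduler(1600, (key, data) -> spilled.add(key));
        cache.put("a", new double[100]);
        cache.put("b", new double[100]);
        cache.pin("a");
        cache.pin("b");
        cache.put("c", new double[100]); // a and b are pinned, so c is the only candidate
        System.out.println("spilled: " + spilled);
    }
}
```

In this sketch the scheduler never touches storage itself; it only calls the `SpillHandler`, so the eviction policy and the I/O path can be changed or tested independently. Running `main` spills only block `c`, because `a` and `b` are pinned at the time the limit is exceeded.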